# A 960-Mb/s/pin Interface for Skew-Tolerant Bus Using Low Jitter PLL

Sungjoon Kim, Student Member, IEEE, Kyeongho Lee, Student Member, IEEE, Yongsam Moon, Student Member, IEEE, Deog-Kyoon Jeong, Member, IEEE, Yunho Choi, and Hyung Kyu Lim, Member, IEEE

Abstract—This paper describes an I/O scheme for use in a high-speed bus which eliminates setup and hold time requirements between clock and data by using an oversampling method. The I/O circuit uses a low jitter phase-locked loop (PLL) which suppresses the effect of supply noise. Measured results show peak-to-peak jitter of 150 ps and rms jitter of 15.7 ps on the clock line. Two experimental chips with a 4-pin interface have been fabricated in a 0.6-$\mu$m CMOS technology and exhibit a bandwidth of 960 Mb/s per pin.

*Index Terms*— Skew-tolerant, high speed bus, oversampling, phase locked loop, jitter, CMOS, phase frequency detector, voltage controlled oscillator.

## I. INTRODUCTION

As the speed of high-speed digital systems tends to be limited by the bandwidth of pins, new I/O architectures are gaining momentum over conventional ones. The advent of 64 Mb and 256 Mb DRAM's and faster logic chips also drives the need for a high-speed I/O interface that reduces the number of pins and hence the system cost. Synchronous DRAM's increased chip bandwidth up to 220 Mb/s/pin [1]. A revolutionary architecture using delay-locked loops (DLL's) or phase-locked loops (PLL's) was also successful in providing over 500 Mb/s/pin bandwidth [2], [3]. Such a narrow, high-speed bus provides large bandwidth in a small, low pin-count package, but such high-speed bus architectures inevitably require strict phase relationships between clock and data. A phase-tolerant I/O scheme was also developed previously for a point-to-point link [4]. This paper describes an I/O scheme for use in a high-speed bus which eliminates setup and hold time margins by using blind $3\times$ oversampling and data recovery. In the new scheme, the clock line delivers only frequency information. The data receiving circuits extract phase information from the data itself. An 8-b data bus employing this skew-insensitive scheme can deliver over 960 MB/s. Two experimental chips with a 4-pin interface were fabricated.

In Section II, the chip architecture and the skew-tolerant I/O scheme are presented. The circuit design techniques for the low jitter PLL and other circuits are discussed in Section III. The chip layout and experimental results are presented in Section IV, followed by a conclusion in the final section.

Publisher Item Identifier S 0018-9200(97)02850-3.

## **II. SYSTEM ARCHITECTURE**

Two chips, a bus master and a bus slave, were designed. Bus masters in a system bus initiate bus transactions, and slaves respond to the tenured master. For example, a memory controller works as the master chip and a memory with a high-speed interface works as the slave chip. A simplified block diagram of the two chips is shown in Fig. 1. The bus signals are composed of 4-b wide data lines, a clock line, and a reference line. A charge pump PLL multiplies the external clock by two and generates two sets of multiphase clocks for bit serialization and data oversampling. The relationship between the internal 12-phase clocks and the external clock is shown in Fig. 2. The first set consists of 12 clocks with $30^{\circ}$ of phase separation, shown in Fig. 2(a) as PCK[0] to PCK[11]. Fig. 2(b) shows the multiphase clock distribution. Ground lines were inserted between adjacent multiphase clocks, and the clocks were laid out so that when one clock is switching, its neighbors are guaranteed to be in a stable state. This configuration minimizes coupling between clocks. The second set consists of four clocks with $90^{\circ}$ of phase separation, TCK[0] to TCK[3], which are in phase with PCK[0], PCK[3], PCK[6], and PCK[9], respectively. Two separate sets of clocks are generated to equalize loading conditions.
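The clock relationships described above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original design files: the function name `clock_plan` and its parameters are hypothetical, and the 240-MHz VCO frequency is inferred from the 960-Mb/s pin rate and the four serializing clocks TCK[0] to TCK[3] rather than stated explicitly in the text.

```python
def clock_plan(bit_rate_bps=960e6, bits_per_vco_cycle=4, oversampling=3):
    """Derive clock frequencies and phase spacing from the per-pin bit rate."""
    vco_hz = bit_rate_bps / bits_per_vco_cycle    # assumed: TCK[0..3] serialize 4 b per VCO cycle
    ext_hz = vco_hz / 2                           # the PLL multiplies the external clock by two
    n_phases = bits_per_vco_cycle * oversampling  # 3 samples per bit -> 12 PCK phases
    return {
        "vco_hz": vco_hz,
        "ext_hz": ext_hz,
        "n_phases": n_phases,
        "phase_step_deg": 360 / n_phases,
        "phase_step_s": 1 / (vco_hz * n_phases),
        "bit_time_s": 1 / bit_rate_bps,
        # PCK indices that coincide with TCK[0..3]
        "tck_aligned_pck": [k * oversampling for k in range(bits_per_vco_cycle)],
    }


plan = clock_plan()
for name, value in plan.items():
    print(name, value)
```

Under these assumptions the sampling phases are spaced by roughly 347 ps, one third of the 1.04-ns bit time, and the TCK clocks line up with PCK[0], PCK[3], PCK[6], and PCK[9] as stated in the text.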

An 8-b parallel data stream is first converted to a 4-b data stream by an internal clock and then serialized with a serialization circuit. The serializer is the same type of circuit reported in [4]. The only difference is that four phase clocks are used instead of the ten phase clocks of the previous design, thereby reducing area and parasitic capacitance at high-speed nodes. The serial stream is driven by a current-controlled open-drain output driver. The second set of multiphase clocks, TCK[0] to TCK[3], is used by the transmitter to serialize 4 b of data. Each pin connected to the high-speed bus has 12 oversamplers and an output driver. In [6], 32 clock phases are generated to oversample the incoming data. The decision on the degree of oversampling is a tradeoff between input data phase jitter tolerance, power, and area. If too many clock phases are used per bit period, power consumption and chip area increase, but a low oversampling ratio may reduce the tolerance to phase jitter on the incoming data. If the phase jitter on the incoming data is low and the PLL has low jitter characteristics, the oversampling ratio can be as low as three [7].

Manuscript received August 20, 1996; revised December 3, 1996.

S. Kim, K. Lee, Y. Moon, and D.-K. Jeong are with the Inter-University Semiconductor Research Center, Seoul National University, Seoul 151-742, Korea.

Y. Choi and H. Lim are with Samsung Electronics Co., Yongin-City, Kyungki-Do, Korea.

Fig. 1. Simplified block diagram of master and slave chip.

Fig. 2. (a) External clock and 12 multiphase clocks relationship. (b) Multiphase clock layout.

Fig. 3. Skew-insensitive bus operation.

Fig. 4. Byte skew handling operation.

Fig. 5. Functional block diagram of charge pump PLL.

Fig. 6. Conventional phase frequency detector.

Fig. 7. Implemented phase frequency detector.

The oversampler samples the bus data three times per bit using the 12 phase clocks provided by the PLL. To extract correct phase information from the data stream, a high-to-low transition is inserted at the head of each packet on each pin. The slaves of the bus keep oversampling the bus signals to catch the start of a bus transfer. This process is illustrated in Fig. 3. The serial input data is sampled at the rising edge of each multiphase clock. The receiver samples the serial data blindly, without any constraint on setup and hold time margins. The sampled data is then amplified again regeneratively to reduce possible metastability. Fig. 3 shows two high-speed bus signals, bus signal 0 and bus signal 1, with skew between them. When the receiver detects the first 1-to-0 transition, it selects the next oversampled bit as the first valid data. The third bit after the first valid bit is also selected as valid, and so on for every third sample. It is assumed that the next oversampled bit after the first 1-to-0 transition was sampled near the center of the data eye. Each pin of the data bus tracks the start phase of a data transfer separately. After each pin catches the start of a data transfer, the demultiplexed data of each pin is retimed into a single internal clock domain. Since this process can be done in one clock cycle, the masters can respond quickly as the distance from the signal source changes.
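The start detection and decimation described above can be sketched in software. This is an idealized behavioral model under stated assumptions, not the circuit itself: the names `oversample` and `recover_bits` are hypothetical, the line is assumed to idle high, jitter and metastability are ignored, and the leading 0 of the packet is interpreted as the start bit that creates the 1-to-0 edge.

```python
def oversample(bits, idle_samples, ratio=3):
    """Model an idle-high line followed by `bits`, each seen `ratio` times."""
    stream = [1] * idle_samples
    for b in bits:
        stream.extend([b] * ratio)
    return stream


def recover_bits(samples, ratio=3):
    """Pick every `ratio`-th sample, starting one sample after the first 1-to-0 edge."""
    for i in range(1, len(samples)):
        if samples[i - 1] == 1 and samples[i] == 0:  # first 1-to-0 transition
            first = i + 1                            # next sample, assumed near the eye center
            return samples[first::ratio]
    return []                                        # no start transition seen


packet = [0, 1, 0, 1, 1, 0]  # leading 0 is the assumed start bit
# Different idle lengths stand in for different skews between pins.
recovered = [recover_bits(oversample(packet, idle)) for idle in (1, 2, 3, 4, 5)]
print(recovered)
```

Because each pin locates its own transition, the recovered bits are the same for every idle length in this model, which is the sense in which the scheme tolerates skew between clock and data.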

Since this scheme allows skew not only in the clock line but also among the data lines, some of the demultiplexed parallel data may be one internal clock cycle earlier or later than the other demultiplexed data after retiming.



Fig. 8. (a) PFD dead zone and (b) PLL jitter.



Fig. 9. Voltage controlled oscillator circuit diagram.

The skew handler examines the parallel output of each pin and checks whether every pin is aligned properly. If some of the parallel outputs are not aligned, the skew handler delays the parallel outputs which arrived earlier. Fig. 4 illustrates the operation of the interpin skew handler.
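The alignment rule above can be illustrated with a small sketch. The function `align_words` and the cycle numbers are hypothetical. The model only captures the stated behavior, namely that pins whose words arrived earlier are delayed until they line up with the latest pin.

```python
def align_words(arrival_cycle):
    """Return the extra delay, in internal clock cycles, applied to each pin."""
    latest = max(arrival_cycle)
    # Pins that arrived earlier are delayed until they match the latest pin.
    return [latest - c for c in arrival_cycle]


# Four pins; pin 2 delivers its first word one internal cycle after the others.
delays = align_words([7, 7, 8, 7])
print(delays)
```

In this example the three early pins are each delayed by one cycle and the late pin is passed through unchanged.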

## **III. CIRCUIT DESCRIPTION**

### A. Low Jitter PLL

The performance of the PLL or DLL is one of the limiting factors of a high-speed interface or serial communication link. The jitter characteristics become especially important for applications that require integration of the PLL or DLL with noisy digital circuits, since integration with digital circuits induces noise on the supply rails or on the substrate. Because the charge pump PLL used in this design generates multiple phase clocks to divide one external clock period into many equally spaced intervals, the accuracy and the jitter characteristics become even more important.

Fig. 5 shows the functional block diagram of the charge pump PLL clock generator. It consists of a phase frequency detector, a charge pump, a loop filter, a clock divider, and a voltage-controlled oscillator (VCO). With a six-stage differential VCO, 12 clock phases are available to oversample the incoming data and to serialize parallel data into a serial bit stream.

Fig. 10. Simulated UP/DOWN pulse width difference as a function of input phase difference.

One of the critical building blocks of the PLL is the phase frequency detector (PFD). A low precision PFD has a wide dead zone (undetectable phase difference range), which results in increased jitter. The jitter caused by a large dead zone can be reduced by increasing the precision of the PFD. Fig. 6 shows a conventional implementation of a static PFD [8]. This conventional PFD is an asynchronous state machine. The delay time to reset all internal nodes determines the circuit speed. The critical path of the conventional PFD is shown in bold lines in Fig. 6. The critical path forms a feedback path with six gate delays. The dead zone occurs when the loop is in lock and the output of the charge pump does not change for small changes in the input signals at the PFD. Any width of the dead zone directly translates to jitter in the PLL and must be avoided.

Fig. 11. Voltage controlled oscillator circuit diagram.

To overcome the speed limitation and to reduce the dead zone, a new dynamic logic style PFD was designed. A similar dynamic comparator was reported before [9], but our implementation requires fewer transistors. Fig. 7 shows the circuit diagram of the PFD. Conventional static logic circuitry was replaced by dynamic logic gates. As a result, the number of transistors in the PFD core is reduced from 44 to 16. The critical path of this PFD, also shown in Fig. 7, is a three-gate feedback path. The shortened feedback path delay and dynamic operation allow high precision in high-frequency operation.

Fig. 8 shows the relation between the dead zone of the PFD and the phase error of the PLL. If the phase difference between the EXT clock and the VCO clock is smaller than the dead zone, the PFD cannot detect the phase difference. The phase error signal of the PFD therefore remains zero, resulting in an unavoidable phase error between the EXT clock and the VCO clock. The minimum peak-to-peak phase error caused by this dead zone is

$$\text{Minimum Peak-to-Peak Phase Error} = 2\pi \times \frac{T_{\text{deadzone}}}{T_{\text{period}}}. \tag{1}$$
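As a numeric illustration of (1), the sketch below evaluates the expression for one set of values. Both numbers are assumptions chosen only for illustration: the paper does not report a dead-zone width, so 50 ps is hypothetical, and the 120-MHz period is inferred from the 120-MHz parallel data rate and the clock doubling in the PLL. The function name `min_pp_phase_error` is likewise hypothetical.

```python
import math


def min_pp_phase_error(t_deadzone, t_period):
    """Evaluate (1): minimum peak-to-peak phase error in radians."""
    return 2 * math.pi * t_deadzone / t_period


# Hypothetical 50-ps dead zone at an assumed 120-MHz comparison frequency.
error_rad = min_pp_phase_error(50e-12, 1 / 120e6)
print(error_rad, math.degrees(error_rad))
```

With these assumed values the dead zone alone would account for about 2.16 degrees of peak-to-peak phase error, which shows why shrinking the dead zone matters when the clock period is divided into 12 equally spaced phases.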

In order to avoid the dead zone, the PFD asserts both UP and DOWN outputs as shown in Fig. 9. For in-phase inputs of EXT\_CLK and VCO\_CK, the charge pump sees both UP and DOWN pulses for the same short period of time. If there is a phase difference between EXT\_CLK and VCO\_CK, the difference between the UP and DOWN pulse widths is proportional to the phase difference of the inputs. Fig. 10 shows the SPICE simulation result of the UP/DOWN pulse width difference as a function of the input phase difference. The dead zone of the PFD is significantly smaller than the measured maximum PLL jitter.

Several critical parameters of the PLL, such as speed, timing jitter, spectral purity, and power dissipation, strongly depend on the performance of the VCO, so the noise insensitivity of the VCO is very important. The VCO implemented in this design has a simple bias circuit to reject supply step noise. The processor or bus can have intervals of heavy circuit activity, in which large amounts of capacitance are switched, and intervals of very little circuit activity. This shows up as steps or impulses on the power supply of the PLL [8]. The actual peak-to-peak jitter in this case becomes dominated by the peaks in the impulse transient noise response.

Fig. 12. VCO operation for step supply noise.

Fig. 13. Phase and byte sync block diagram.

Fig. 14. Sampler circuit diagram.

The VCO used in the design is a six-stage differential-type ring oscillator with limited voltage swing and is shown in Fig. 11. Each stage is made up of a differential NMOS pair with variable resistance loads made of PMOS devices operating in the triode region. The bias voltage for the PMOS is generated by a replica bias circuit. The operation of this bias circuit is shown in Fig. 12. The $V_{ref}$ voltage dynamically tracks the supply variations. The replica bias circuit, which consists of a replica delay cell and an op-amp, sets the minimum voltage level of the internal VCO swing to $V_{ref}$. The $V_{ref}$ signal is generated by two resistors and one capacitor. When the supply rail is quiet, the voltage swing of the internal VCO is $V_{dd} - V_{ref}$. Let us assume that there is a supply voltage step variation of $\delta V_{dd}$ at some point. After the step change at the supply, the $V_{ref}$ level settles to

$$\frac{R_2}{R_1 + R_2} (V_{dd} + \delta V_{dd}) \tag{2}$$

with a time constant of

$$\frac{R_1 \cdot R_2 \cdot C}{R_1 + R_2}. \tag{3}$$

At the instant of the supply step change, the voltage difference between $V_{dd}$ and $V_{ref}$ remains the same due to the capacitor at the $V_{ref}$ generator. Since $V_{dd} - V_{ref}$ remains the same temporarily, the delay cells run a little faster for a short period of time due to the increased supply voltage, instead of keeping an exactly constant delay. The voltage swing at the VCO then increases with a time constant determined by $R_1$, $R_2$, $C$, and the op-amp bandwidth, and approaches

$$\frac{R_1}{R_1 + R_2} (V_{dd} + \delta V_{dd}) \tag{4}$$

which results in an increase of the delay of one stage. This gives an averaging effect on the VCO delay after the supply step change, minimizing the delay change caused by the step. If we select $R_1$, $R_2$, and $C$ values for a minimum average delay change, the effect of the supply step change can be nullified. The values we chose for this particular process are $R_1 = 4\ \mathrm{k}\Omega$, $R_2 = 10\ \mathrm{k}\Omega$, and $C = 1.5\ \mathrm{pF}$.

PLL circuits can be sensitive to noise pickup from the supplies and substrate, so the PLL circuit has dedicated power and ground pads. Bypass capacitors are included in the layout to stabilize VDD and GND of the PLL. Guard rings are used to isolate the PLL from the other digital parts. The placement of the multiphase clocks was carefully chosen to remove possible coupling between clocks.
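The $V_{ref}$ expressions (2) to (4) can be evaluated for the quoted component values. The sketch below is a first-order model under stated assumptions: the divider topology ($R_1$ from the supply to $V_{ref}$ with $C$ in parallel, $R_2$ to ground) is inferred from (2), (3), and the statement that the capacitor holds $V_{dd} - V_{ref}$ at the instant of the step. The 3.3-V supply is taken from Table I, the 0.1-V step is hypothetical, and the op-amp bandwidth is ignored. All function names are illustrative.

```python
import math

R1, R2, C = 4e3, 10e3, 1.5e-12  # component values quoted in the text
VDD = 3.3                       # supply voltage from Table I


def vref_steady(vdd):
    """Settled V_ref level, as in (2)."""
    return R2 / (R1 + R2) * vdd


def swing_steady(vdd):
    """Settled VCO swing V_dd - V_ref, as in (4)."""
    return R1 / (R1 + R2) * vdd


TAU = R1 * R2 * C / (R1 + R2)   # time constant, as in (3)


def vref_after_step(t, dvdd, vdd=VDD):
    """V_ref at time t after a supply step of dvdd applied at t = 0."""
    start = vref_steady(vdd) + dvdd  # capacitor holds V_dd - V_ref at the step
    final = vref_steady(vdd + dvdd)
    return final + (start - final) * math.exp(-t / TAU)


print("tau [ns]:", TAU * 1e9)
print("V_ref [V]:", vref_steady(VDD), "swing [V]:", swing_steady(VDD))
for t in (0.0, TAU, 5 * TAU):
    print(t, vref_after_step(t, 0.1))  # hypothetical 0.1-V supply step
```

With these values the time constant is about 4.3 ns, the quiet-supply $V_{ref}$ is about 2.36 V, and the swing is about 0.94 V.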

### B. Phase and Byte Sync

The phase and byte sync block of Fig. 1 is shown in Fig. 13. It consists of a 3-to-1 mux array, a metastability resolver, a start bit finder, a phase memory, a word memory, a shifter, and D flip-flops (DFF's). This circuit finds the start bit, decimates the oversampled 12 b, and aligns the byte boundary. The oversampled 12 b are sent from the sampler to the metastability resolver. Since the oversampled 12 b are not all sampled at the center of the eye, there is a possibility that some of the bits are still in a metastable state. The metastability is practically removed by one more stage of synchronizers in the metastability resolver. The start bit finder receives information from the metastability resolver, selects one of the three phases as the correct phase, and extracts byte align information. The phase and byte align information is stored in the phase and word memories. The 3-to-1 mux array decimates 12 b into 4 b. The shifter at the final stage aligns the byte boundary according to the value of the word memory.

### C. Oversampler

The oversampler used in the data receiver is shown in Fig. 14. Each oversampler is a cascaded sense amplifier and uses four clocks for correct, timely sampling. It is very important to reduce the probability of metastability by careful design and layout. The same size is used for both PMOS and NMOS in the core synchronizing amplifier to maximize the loop bandwidth.

## **IV. EXPERIMENTAL RESULTS**

Two prototype chips, master and slave, have been fabricated in a 0.6-$\mu$m double-metal CMOS process. Fig. 15 shows the microphotograph of the fabricated master chip. This chip occupies 4100 $\mu$m $\times$ 4300 $\mu$m including the pad area. The master chip incorporates a common skew-insensitive I/O macro block, a bus protocol handler, and a self-test circuit for chip and system diagnostics. The common skew-insensitive I/O macro block includes a charge-pump PLL for multiphase generation, oversamplers, I/O buffers, parallel-to-serial converters, and a bias generator for internal use. The core area of the skew-insensitive I/O macro block is 3600 $\mu$m $\times$ 700 $\mu$m for the 4-pin interface. The microphotograph of the fabricated slave chip is shown in Fig. 16. It has the same die size as the master chip, and many blocks are shared with the master chip. The skew-insensitive I/O macro block and the charge pump PLL are the same as those of the master. The slave chip includes a small internal fast SRAM to verify correct read/write operations.

The measured charge pump PLL jitter histogram of the master and the slave chips is shown in Fig. 17. Since the two chips use the same PLL, they show similar jitter performance.



Fig. 15. Microphotograph of master chip.



Fig. 16. Microphotograph of slave chip.

The rms jitter is 15.7 ps when the tested chip is active. The peak-to-peak jitter was measured to be less than 150 ps. This PLL jitter characteristic is especially important for multiphase operation.

Fig. 18 shows an output data waveform at 960 Mb/s. The master chip is sending data to the bus according to the predetermined bus protocol. The jitter at the output data is larger than the jitter at the charge pump PLL clock due to the extra modulation effect of supply voltage fluctuation on the data output. Several factors limited the speed; the limited driving capability of CMOS and the signal degradation through the chip package and printed circuit board (PCB) were among the main ones.

Fig. 17. PLL jitter histogram.

Fig. 18. Output data waveform.

The skew-insensitive receiving operation was also observed. There are four high-speed pins in the prototype chip. We made a PCB with four high-speed impedance-controlled bus lines. The length of the normal lines is 12 cm. One of the high-speed signal paths was made intentionally longer than the others by 10 cm. The 960 Mb/s high-speed serial data was sent into the receiver, which recovers the serial data into 8-b 120-MHz parallel data. Fig. 19 shows the 120 MHz recovered parallel data. The upper waveform is from the pin with the longer trace. The lower waveform is from the normal-length pin. Although the two pins have different trace lengths, the chips could receive data without errors. The power dissipation at 960 Mb/s was 0.7 W for the master chip. The chip characteristics are summarized in Table I.

TABLE I. Main Features of the Chip

| Feature        | Value                          |
|----------------|--------------------------------|
| Core Area      | 3.6 mm × 0.7 mm                |
| Technology     | 0.6-$\mu$m double-metal CMOS   |
| Supply Voltage | 3.3 V                          |
| Data Rate      | 960 Mb/s                       |
| PLL Jitter     | 15.7 ps rms @ 960 Mb/s         |
| Power          | 0.7 W fully active             |
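To give a sense of scale for the 10-cm length mismatch used in this test, the sketch below converts the extra trace length into a delay. The effective dielectric constant of 4 is an assumption typical of common PCB laminates and is not reported in the paper, so the result is only a rough estimate. The function name `trace_skew` is hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def trace_skew(extra_length_m, bit_rate_bps, eps_eff):
    """Return (delay in seconds, delay in bit times) for an extra trace length."""
    delay_s = extra_length_m * math.sqrt(eps_eff) / C0
    return delay_s, delay_s * bit_rate_bps


# 10 cm of extra trace at 960 Mb/s, with an assumed effective permittivity of 4.
delay_s, delay_ui = trace_skew(0.10, 960e6, eps_eff=4.0)
print(delay_s, delay_ui)
```

Under this assumption the longer trace adds roughly 0.67 ns, or about 0.64 bit times at 960 Mb/s, which is far more skew than a conventional setup and hold window at this data rate could absorb.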



Fig. 19. Skew-insensitive I/O operation.

## V. CONCLUSION

A new high-speed skew-insensitive I/O scheme has been described in this paper. Two chips that incorporate the new I/O scheme using the low jitter PLL technique have been fabricated in a 0.6-$\mu$m double-metal CMOS process. A $3\times$ oversampling technique relaxed the strict setup and hold margin requirements of high-speed chip-to-chip interfaces. A newly designed fast phase frequency detector and a VCO circuit with high noise immunity improved the jitter performance of the PLL. The measured PLL rms jitter was 15.7 ps. Accurate multiphase clock generation for oversampling the bus signal was made possible by the low jitter PLL. Using these techniques, skew-insensitive data transfer was demonstrated. This skew-insensitive I/O scheme is useful for high-speed ASIC-to-memory and ASIC-to-ASIC interfaces, and it will become more important as chip-to-chip data transfer speeds go up.

## REFERENCES

- [1] M. Horiguchi *et al.*, "An experimental 220 MHz 1 Gb DRAM," in *ISSCC 1995 Dig. Tech. Papers*, pp. 252–253.
- [2] M. Horowitz *et al.*, "PLL design for a 500 MB/s interface," in *ISSCC 1993 Dig. Tech. Papers*, pp. 160–161.
- [3] T. H. Lee *et al.*, "A 2.5 V CMOS delay-locked loop for an 18 Mbit, 500 Megabytes/s DRAM," *IEEE J. Solid-State Circuits*, vol. 29, pp. 1491–1496, Dec. 1994.
- [4] E. Reese *et al.*, "A phase tolerant 3.8 GB/s data-communication router for a multiprocessor supercomputer backplane," in *ISSCC 1994 Dig. Tech. Papers*, Feb. 1994, pp. 296–297.
- [5] S. Kim *et al.*, "A pseudo-synchronous skew-insensitive I/O scheme for high bandwidth memories," in *Proc. Symp. VLSI Circuits*, June 1994, pp. 41–42.
- [6] M. Bazes and R. Ashuri, "A novel CMOS digital clock and data decoder," *IEEE J. Solid-State Circuits*, vol. 27, pp. 1934–1940, Dec. 1992.
- [7] S. Kim *et al.*, "An 800 Mbps multi-channel CMOS serial link with 3× oversampling," in *Proc. IEEE Custom Integrated Circuit Conf.*, 1995, pp. 451–454.
- [8] I. Young *et al.*, "A PLL clock generator with 5 to 110 MHz lock range for microprocessors," *IEEE J. Solid-State Circuits*, vol. 27, pp. 1599–1607, Nov. 1992.
- [9] H. Notani *et al.*, "A 622-MHz CMOS phase-locked loop with precharge-type phase frequency detector," in *Proc. Symp. VLSI Circuits*, June 1994, pp. 129–130.

**Sungjoon Kim** (S'91) was born in Pusan, Korea, on June 2, 1970. He received the B.S. and M.S. degrees in electronics engineering from Seoul National University in 1992 and 1994, respectively. Since 1994 he has been working toward the Ph.D. degree at the same university.

He spent the summer of 1995 working on the limiting factors of CMOS Gb/s transmission at SUN Microsystems, CA. His research interests include clock and data recovery for high-speed communication and high-speed I/O interface circuits.



**Kyeongho Lee** (S'92) was born in Seoul, Korea, on August 5, 1969. He received the B.S. and M.S. degrees in electronics engineering from Seoul National University in 1993 and 1995, respectively. He is currently working toward the Ph.D. degree in electronics engineering at the same university.

He is working on various CMOS high-speed circuits for data communication. His research interests include high-speed CMOS interface circuits, high-speed video display systems, and PLL systems for Gigabit communication.



**Yongsam Moon** (S'97) was born in Incheon, Korea, on March 1, 1971. He received the B.S. and M.S. degrees in electronics engineering from Seoul National University in 1994 and 1996, respectively, where he is currently working toward the Ph.D. degree.

He has been working on architectures and CMOS circuits for microprocessors. His current research interests are in clock and data recovery circuits for high-speed data communication.



**Deog-Kyoon Jeong** (S'87–M'89) received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1981 and 1984, respectively, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley, in 1989.

From 1989 to 1991, he was with Texas Instruments, Dallas, TX, where he was a member of the technical staff working on the single chip implementation of the SPARC architecture. Since 1991, he has been on the faculty of the School of Electrical Engineering and the Inter-University Semiconductor Research Center, Seoul National University. His main research interests include high-speed circuits, VLSI systems design, microprocessor architectures, and memory systems.



**Yunho Choi** was born in Incheon, Korea, on March 29, 1960. He received the B.S. degree in electrical engineering from Seoul National University, Seoul, Korea, in 1983.

He joined Samsung Semiconductor Inc., Santa Clara, CA, in 1983, where he was engaged in the design of the 256K DRAM. Since 1986, he has been working on the design of high-density dynamic memory including synchronous DRAM at the Semiconductor Research Center, Samsung Electronic Company, Ltd., Kiheung, Korea. Currently he is in charge of specialty memory design such as graphics memory and merged DRAM and logic product development.



**Hyung Kyu Lim** (S'82–M'84) was born February 4, 1953, in Kyung-Nam, Korea. He received the B.S. degree from the Seoul National University, Seoul, Korea, the M.S. degree from the Korea Advanced Institute of Science and Technology, and the Ph.D. degree from the University of Florida, Gainesville, all in electrical engineering, in 1976, 1978, and 1984, respectively.

Since 1976, he has been with the Semiconductor Research and Development Center, Samsung Electronics Co., Kiheung, Korea. From 1978 to 1981, he was engaged in the development of bipolar linear integrated circuits and CMOS watch chips. After finishing his Ph.D. study, he worked mainly in the area of high-density MOS memory development. Starting from a 64 Kb EEPROM design in 1984, he led various memory device research and development projects that include 256 Kb EEPROM, 16 Mb mask ROM, 1 Mb high-speed static RAM, and 1/3 inch CCD image sensor. He is currently responsible for design engineering of all MOS memory research and development projects in which dynamic RAM and specialty memories are added. He has authored or coauthored over 20 technical journal and conference papers and holds 23 patents.

Dr. Lim is a member of the IEEE Electron Device Society.