# Design of a 40 Gb/s Wireline Transceiver in 65nm CMOS

Xuqiang Zheng 01 Dec 2017

School of Computer Science, University of Lincoln, Lincoln, UK

Institute of Microelectronics, Tsinghua University, China





### Outline

- Motivation
- Transmitter Design
- Receiver Design
- Adaptive Equalization
- Conclusion

### Motivation

- Global network traffic will grow 3-fold from 2013 to 2018.
- Global data center traffic will also triple from 2012 to 2017.



Reference: [Cisco VNI Global Forecast, 2014]

### High-speed SerDes applications



New standards require data rates above 40 Gb/s.





### Transmitter Design

### **Transmitter-FFE Implementations**





(1) Narrow operation range (2) Susceptible to PVT variations

(1) Wide operation range (2) Relaxed timing margin (3) Robust operation

Operation speed is limited by the N:1 MUX

### **Transmitter-Timing Constraints**



## **Transmitter-Final 4:1 MUX I**



#### **Drawbacks**

(1) Additional Area Occupation and Power Consumption

(2) Robustness Degradation

## **Transmitter-Final 4:1 MUX II**



#### **Advantages**

(1) Removes the 2:1 critical path

(2) Increased selection margin

(3) Half the clock speed

#### **Disadvantage**

(1) Doubled self-loading drain capacitances limit the maximum operation speed

### **Transmitter-Conceptual 4:1 MUX**



Splitting the ANDing and sampling operations is preferred for a multiple-MUX-based FFE, because the ANDing stage can be shared among the MUXs to save area and power.
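The split between an ANDing stage and a sampling stage can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the clock-phase assignment (lane data gated by the preceding quarter-rate phase, then sampled by the lane's own phase) and the names `clock_phase` and `mux4to1` are illustrative and are not taken from the slides.

```python
def clock_phase(offset_ui, n_ui):
    """Quarter-rate clock with 50% duty: high for 2 UI out of every 4 (idealized)."""
    return [1 if (t - offset_ui) % 4 < 2 else 0 for t in range(n_ui)]


def mux4to1(lanes):
    """Serialize four quarter-rate lanes into one full-rate bit stream."""
    n_ui = 4 * len(lanes[0])
    ck = [clock_phase(i, n_ui) for i in range(4)]  # 0, 90, 180, 270 degree phases
    out = []
    for t in range(n_ui):
        bit = 0
        for i in range(4):
            # ANDing stage (assumed): gate lane data with the preceding clock phase.
            gated = lanes[i][t // 4] & ck[(i + 3) % 4][t]
            # Sampling stage (assumed): pass the gated data while phase i is high.
            # The two phases overlap for exactly 1 UI, so one lane drives each UI.
            bit |= gated & ck[i][t]
        out.append(bit)
    return out


# Four lanes, two quarter-rate bits each, give eight full-rate bits.
print(mux4to1([[1, 0], [0, 1], [1, 1], [0, 0]]))
```

In a multi-MUX FFE, the `gated` signals would be the part that several MUXs could share, which is the area and power saving the slide refers to.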

### **Transmitter-Unit Cells**



## **Transmitter-Proposed Unit Cell**





**Data-up Structure** 



Charge Sharing

**Clock-up Structure** 



Introduce PM1/PM2 to form inverters to always pre-charge or pre-discharge nodes X/Y.

**Proposed Unit Cell** 

## **Transmitter-Improved 4:1 MUX-behavior**



### **Bandwidth enhancement**

- Glitch elimination improves the noise margin to allow a lower swing.
- Pre-charging the capacitances at nodes X/Y allows large NM1/NM2, which in turn allows the clock transistors NM3/NM4 to be made smaller.
- The added PM transistors help pull up the output, speeding up edge transitions.

### **Pseudo-NAND2 Implementation**



- By removing PM2, output capacitance can be reduced.
- Similar realization of pseudo-NAND and inverter mitigates the delay mismatch.
- An abutment layout approach is employed to reduce parasitic capacitance at node X.

## **Transmitter Architecture**



- Quarter-rate architecture (3-UI timing margin, quarter-rate clock)
  - Problem: limited bandwidth of the 4:1 MUX. Solution: improve the 4:1 MUX.
- Multi-MUX-based 4-tap FFE (accurate 1-UI delay, small area, wide range)
  - Problem: generation of 16 data streams. Solution: interleaved-retiming latch array.
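As a rough illustration of what a 4-tap FFE computes, the sketch below models it as a symbol-spaced FIR filter with tap indices l = -1, 0, 1, 2 (one pre-cursor, the main cursor, two post-cursors). The tap weights in `EXAMPLE_TAPS` are hypothetical placeholders, not measured values, and the function name `ffe_output` is illustrative.

```python
# Hypothetical tap weights, chosen only so that the magnitudes sum to 1.
EXAMPLE_TAPS = {-1: -0.125, 0: 0.5625, 1: -0.25, 2: -0.0625}


def ffe_output(symbols, taps):
    """Return y[n] = sum over l of taps[l] * symbols[n - l], one sample per UI."""
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for l, weight in taps.items():
            idx = n - l  # l = -1 reads the next symbol (pre-cursor tap)
            if 0 <= idx < len(symbols):
                acc += weight * symbols[idx]
        out.append(acc)
    return out


# A single unit symbol shows the tap weights appearing on successive UIs.
print(ffe_output([0, 1, 0, 0, 0], EXAMPLE_TAPS))
```

In the multi-MUX approach, each of the four terms would come from its own 4:1 MUX fed with a 1-UI-shifted copy of the data, so the four delayed streams per quarter-rate lane account for the 16 data streams mentioned above.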

## **Die Micrograph**



#### TSMC 65nm CMOS process

## **Channel profile**



#### Channel profile including bonding wire, PCB trace, SMA connector and connection cable.

## **Measured eye-diagrams**



### **Power breakdown**



### Total Power = 156 mW

The MUXs account for only 14.1% of the total power.

## **Performance Summary and Comparison**

| Reference                              | This work                   | ISSCC'14 [1]     | ISSCC'15 [2]                 | JSSC'15 [3]      |
|----------------------------------------|-----------------------------|------------------|------------------------------|------------------|
| Technology (nm)                        | 65                          | 65               | 14                           | 65               |
| Supply (V)                             | 1.2                         | 1.2              | N/A                          | 1.2              |
| Data Rate (Gb/s)                       | 5-50                        | 60               | 16-40                        | 50-64            |
| Chip Area (mm<sup>2</sup>)             | $1.2 \times 0.5$            | $2.1 \times 1.0$ | $0.215 \times 0.13$          | $1.2 \times 1.0$ |
| FFE                                    | 4-tap                       | N/A              | 4-tap                        | 4-tap            |
| 1UI-delay Gen.                         | Multi-MUX                   | N/A              | Multi-MUX                    | LC-delay         |
| MUX Type                               | 4:1                         | 2:1              | 4:1                          | 4:1              |
| Data Jitter<br>RJ (ps<sub>rms</sub>)   | 0.23@40Gb/s<br>0.18@50Gb/s  | 1.08@30Gb/s      | 0.33@28Gb/s<br>0.51@40Gb/s   | N/A              |
| Data Jitter<br>TJ (ps, BER=1e-12)      | 9.90@40Gb/s<br>10.58@50Gb/s | N/A              | 10.72@28Gb/s<br>12.89@40Gb/s | N/A              |
| Power (mW)                             | 156                         | 450              | 518                          | 199              |
| Energy Efficiency<br>(pJ/bit)          | 3.1                         | 7.5              | 12.9                         | 3.1              |

### Outstanding jitter performance and power efficiency
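The energy-efficiency row of the comparison table can be cross-checked from the power and data-rate rows, since mW divided by Gb/s gives pJ/bit directly. The sketch below assumes that each design's maximum listed data rate is the one used for the figure; the helper name `energy_per_bit_pj` is illustrative.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy per bit in pJ: (1e-3 W) / (1e9 bit/s) = 1e-12 J/bit."""
    return power_mw / rate_gbps


# (power in mW, assumed peak data rate in Gb/s) taken from the table rows.
designs = {
    "This work": (156, 50),
    "ISSCC'14 [1]": (450, 60),
    "ISSCC'15 [2]": (518, 40),
    "JSSC'15 [3]": (199, 64),
}

for name, (power_mw, rate_gbps) in designs.items():
    print(f"{name}: {energy_per_bit_pj(power_mw, rate_gbps):.2f} pJ/bit")
```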



### Receiver Design

## **Conventional CDR Architecture**



(2) INLs caused by data sampling clock drift

[Figure: conventional CDR block diagram, with blocks labeled Voter, K, and Phase Integ.]



## **Working Principle of BBPD**



### Effects of Introduced LPF on NTF











### **Phase Transfer Characteristics**



## **Phase-Compensating PI Implementation**



## **Phase-Compensating PI**



## **Die Micrograph and Power Breakdown**



#### Fabricated in TSMC 65nm CMOS process

### **Measured Recovered Clock Jitter**



#### **Measured JTRAN and JTOL**



## **Performance Summary and Comparison**


|                              | JSSC'15 [3]    | JSSC'14 [10]        | This work        |
|------------------------------|----------------|---------------------|------------------|
| Technology (nm)              | 28             | 22                  | 65               |
| Supply (V)                   | 1.1/0.85       | 1.07                | 1.2              |
| Data Rate (Gb/s)             | 40             | 4-32                | 40               |
| Multi-phase Clock Gen.       | DLL+PIs        | MCDLL+PIs           | DIV2+PIs         |
| Jitter Suppression           | Split-path CDR | N/A                 | Adaptive-BW LPFs |
| JTOL Amplitude (UI)          | 0.2@80MHz      | 0.2@40MHz           | 0.41@100MHz      |
| JTOL Bandwidth (MHz)         | 10             | 20*                 | 20               |
| Chip Area (mm<sup>2</sup>)   | 0.81/lane**    | 0.079/lane          | 0.15             |
| Power (mW)                   | 630*†          | 79.64 <sup>††</sup> | 159              |

\*Estimated from jitter tolerance results. \*\*Area of whole transceiver. \*<sup>†</sup>Including FFE+DFE equalization. <sup>††</sup>Including RX-FFE.

#### Outstanding jitter tolerance at high frequency



#### Adaptive Equalization



## **Equalization Scheme**



DFE is ruled out here, mainly because of its operation speed limitation, complicated implementation, and significant power consumption.

## **Previous Adaptation Algorithm**



- LMS requires additional samplers to detect the signed errors between the equalized and expected eye heights.
- Traditional ZF needs an extra ADC to convert the equalized output voltages into digital codes.
- MEO requires an even more complicated eye monitor, involving threshold-adjusting samplers, phase-adjusting PIs, a micro-controller, and measurement software.

## Developed Edge-Data Correlation Based S-ZF



| D(n-l) ⊕ E(n) | D(n) ⊕ D(n+1) | ResCor<sub>l</sub>1(n) | ResCor<sub>l</sub>0(n) |
|-------------|---------------|--------------------------|--------------------------|
| 0           | 0             | 0                        | 0                        |
| 1           | 0             | 0                        | 0                        |
| 0           | 1             | 0                        | 1                        |
| 1           | 1             | 1                        | 1                        |

Note: The signed ResCor<sub>l</sub>(n) is represented by two bits: ResCor<sub>l</sub>1(n) and ResCor<sub>l</sub>0(n).
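The truth table above reduces to two gates: ResCor<sub>l</sub>0(n) is the data-transition flag D(n) ⊕ D(n+1), and ResCor<sub>l</sub>1(n) is that flag ANDed with D(n-l) ⊕ E(n). A minimal sketch follows. Reading the two bits as a 2-bit two's-complement number (00 = 0, 01 = +1, 11 = -1) is an assumption about the encoding, and the function names are illustrative.

```python
def rescor_bits(d_nl, e_n, d_n, d_n1):
    """Return (ResCor_l1, ResCor_l0) for single-bit inputs, following the truth table."""
    mismatch = d_nl ^ e_n      # D(n-l) xor E(n)
    transition = d_n ^ d_n1    # D(n) xor D(n+1): the edge sample is only valid on a transition
    return mismatch & transition, transition


def rescor_signed(bit1, bit0):
    """Interpret the two bits as 2-bit two's complement (assumed encoding)."""
    return bit0 - 2 * bit1


for mismatch in (0, 1):
    for transition in (0, 1):
        bits = rescor_bits(mismatch, 0, transition, 0)
        print(mismatch, transition, bits, rescor_signed(*bits))
```

Without a data transition the result is zero, so those samples leave the correlation estimate unchanged.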

### Developed Edge-Data Correlation Based S-ZF



$$\alpha_{l}(k+1) = \alpha_{l}(k) - \lambda \cdot \operatorname{sign}[e(k)] \cdot D(k-l), \quad l = -1, 0, 1, 2$$

#### Developed Edge-Data Correlation Based S-ZF




**Derivation of the Edge-Data Correlation Based S-ZF**

The correlation coefficient should be zero:

$$\hat{\rho}_{e,d} = C\alpha = 0$$

where

$$\hat{\rho}_{e,d} = (\hat{\rho}_{e,d}(-1), \hat{\rho}_{e,d}(0), \hat{\rho}_{e,d}(1), \hat{\rho}_{e,d}(2))^T$$

$$\alpha = (\alpha_{-1}, \alpha_0, \alpha_1, \alpha_2)^T$$

$$C = \begin{pmatrix} c_{0.5} & c_{-0.5} & c_{-1.5} \\ c_{1.5} & c_{0.5} & c_{-0.5} \\ c_{2.5} & c_{1.5} & c_{0.5} \end{pmatrix}$$

Recursive equation:

$$\begin{aligned} \alpha(k+1) &= \alpha(k) - \lambda C \alpha(k) = \alpha(k) - \lambda \hat{\rho}_{e,d}(k) \\ \alpha_l(k+1) &= \alpha_l(k) - \lambda \cdot \operatorname{sign}[e(k)] \cdot D(k-l), \quad l = -1, 0, 1, 2 \end{aligned}$$
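The per-tap recursion can be sketched in a few lines. This is an illustrative software model rather than the on-chip implementation: data bits are assumed to be mapped to ±1, the step size `lam` is a hypothetical value, and `szf_step` is an illustrative name.

```python
TAP_INDICES = (-1, 0, 1, 2)


def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def szf_step(alpha, e_k, data, k, lam):
    """One update: alpha_l(k+1) = alpha_l(k) - lam * sign(e(k)) * D(k - l).

    alpha: dict mapping tap index l to its weight
    data:  sequence of +1/-1 decisions (assumed mapping); needs 2 <= k <= len(data) - 2
    """
    s = sign(e_k)
    return {l: alpha[l] - lam * s * data[k - l] for l in TAP_INDICES}


alpha = {l: 0.0 for l in TAP_INDICES}
data = [1, -1, 1, 1, -1]
alpha = szf_step(alpha, e_k=0.3, data=data, k=2, lam=0.5)
print(alpha)
```

Only the sign of the error and one data bit per tap are needed, so each update amounts to an up/down decision per tap weight.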

#### **Measurement Setup**




#### **Measurement Results**



Adaptively adjusted bias voltages of the TX-FFE under different RX-CTLE control voltages.

Measured bathtub curves under different bias conditions

## **Transmitter Far-end Eye Diagrams**





## **Conclusion and Main Techniques**

- This paper implements a 40-Gb/s TX and RX chipset over a >16-dB loss PCB channel using a 65-nm CMOS process.
- The TX utilizes a bandwidth-enhanced 4:1 MUX and an interleaved-retiming latch array to obtain wide operation range, high power efficiency, and small area occupation.
- By introducing LPFs with adaptively adjusted bandwidth into the data-sampling clock path, the CDR performs well in both low-frequency jitter tracking and high-frequency jitter suppression. A TA-based compensating PI is designed to optimize the PI linearity.
- A combined TX-FFE and RX-CTLE is employed to compensate for the channel loss, where a low-cost edge-data correlation-based S-ZF adaptation algorithm is proposed to automatically adjust the TX-FFE's tap weights.

# Thanks for your attention.

Email: zhengxuqiang@mail.tsinghua.edu.cn