Volume: 03 Issue: 07 | July-2016 www.irjet.net

#### e-ISSN: 2395 -0056 p-ISSN: 2395-0072

# **Design of Fixed One-Bit Latency Serdes Transceiver**

Sangeetha N<sup>1</sup>, Mr.K.V.Ramana Reddy<sup>2</sup>, Dr Siva.S.Yellampalli <sup>3</sup>

- <sup>1</sup> Student, Mtech13, UTL VTU Extension center, Bengaluru, Karnataka, India.
- <sup>2</sup> Assistant Professor, UTL VTU Extension center, Bengaluru, Karnataka, India.
- <sup>3</sup> Professor, UTL VTU Extension center, Bengaluru, Karnataka, India.

Abstract- Serial interconnects now dominate the communication field, and serializer/deserializer (SerDes) devices are available across a wide range of cost and performance. However, they fail to maintain a constant communication latency after each reset or power-up. This paper proposes a fixed one-bit-latency SerDes transceiver built using delay tuning and clock phase shifting. It overcomes the shortcomings caused by buffering and by the delays that the clocks generate. A specific implementation based on a Xilinx Spartan 6 FPGA is presented. The results indicate that the device achieves a constant latency, with improvements in buffering, each time after reset.

Key words: - Changeable Delay Tuning, Dynamic clock phase shifting, fixed one bit latency, FPGA, Serializer/deserializer.

#### I. INTRODUCTION

Serializer/deserializer (SerDes) devices embedded in GTP transceivers are advantageous in communication because of their high-speed transfer capability, but they run into problems when they are reset, relocked or powered up [1]. An external dedicated circuit was introduced as a solution, although such a design was not actually needed for telecom and datacom communications. Driven by the large growth in bandwidth, today's SerDes devices replace parallel connectors with high-speed multi-gigabit links. SerDes devices are valued not only for their speed but also for improvements in parameters such as information formatting, device topology and protocol overhead. They are also responsible for clocking, timing, latency, buffering and logic, which raises the cost and affects the performance of data acquisition and manipulation [2]. The problem with SerDes chips is that they do not maintain the same latency after operations such as reset, power-up and relock [1], [2]. Extra circuitry therefore has to be added to the GTP transceivers in present use to reduce the communication overhead [3]. In the present work this extra circuitry, named the clock and data slider block, is designed and implemented. The principal goal is a fixed one-bit-latency serial transceiver based on a changeable delay tuning ("roulette") approach that handles all the clock phase offsets produced in the serializing and parallelizing conversions and thereby removes the reset-relock problem [7]. This paper emphasizes a robust and fast serial transceiver designed on the principles of changeable delay tuning and dynamic clock phase shifting.

By handling the phase offset between the transmitter and receiver clocks, the design overcomes the problems that buffering and clocking mechanisms pose for transmission, and it also removes the reset-relock problem [1]. Related work is summarized as follows. One paper discusses high-speed, constant-latency serial links in distributed data acquisition and control systems, including the timing, trigger and control (TTC) system for high-energy physics experiments [2]. Another describes how to use the internal alignment circuit of the SerDes transceiver to remove the clock phase offset, which otherwise causes a variation in link latency [3]. A further paper addresses source and data acquisition for data transfer; it implements a high-speed transceiver in FPGAs with a clocking scheme and two configurations, and it emphasizes pipelining the serial link to improve latency and performance [4]. Another stresses developments in radiation tolerance, limited hardware area and synchronization mechanisms; the CBM network protocol uses a bidirectional, single-ended fiber connection to achieve good latency and synchronization, which helps to build a new network topology [5]. Another reports on the read-out system of the future KM3NeT undersea array of numerous synchronized optical detecting nodes [6]. The last discusses the Compressed Baryonic Matter (CBM) experiment, which was used to identify prototypes that improve the performance of serial links; its DAQ software handles various data inputs, including optical connections read out through USB and Ethernet [7].

and phase transfer of clock technology. As compared with the

#### II. SerDes devices undergoing latency variations in the existing technique

A Xilinx FPGA with an embedded GTX transceiver is used to explain the latency overheads. The simplified structure of this design has a Physical Medium Attachment (PMA) sublayer and a Physical Coding Sublayer (PCS). The PMA parallelizes the serial data and serializes the parallel data. The PCS takes care of control operations such as processing before serialization and after parallelization. The parallel clock XCLK, a reference clock for the clock and data recovery (CDR) circuit, and an input serial clock for the parallel-in/serial-out block are generated by the PLL. The serial-in/parallel-out block needs RXRECCLK and gets it from the CDR circuit. The phase-adjust FIFO removes variations in the transmitter clock, and the elastic buffer removes variations in the receiver clock.

Delay problems may arise in the PMA and PCS blocks of the GTX transceiver. The delays in the PMA are caused by the division and multiplication of frequencies in the serial link. CLK\_IN is multiplied in frequency by a factor of four to yield the serial clock. The parallel clock XCLK is the same as CLK\_IN and is used to obtain RXRECCLK. The delay is measured in unit intervals (UI), the length of one serial symbol. On the serial link each 16-bit word is exchanged as 20 bits of information. The PLL in the transmitter sets the serial bit rate. The barrel shifter at the receiver end aligns the parallel data, and a synchronization mechanism ensures the complete alignment of the data in the barrel shifter. Two observations apply to measuring the delay:

- 1) The time difference between any two points can be identified by sending a marking signal back and forth.
- 2) A time interval starts when the master point generates a "start" pulse and ends when the receiver in the master point decodes the marking signal, which generates the "stop" pulse.

#### III. Proposed design structure

The constant-latency transceiver described below uses changeable delay tuning and dynamic clock phase shifting to build a clock and data slider block that reduces the data and clock latencies.

#### IV. Complete design architecture

The design consists of two GTX transceivers, one RX phase-align block, one TX phase-align block, one payload generator and one clock and data slider (CDS). TXUSRCLK/TXUSRCLK2 and XCLK have no fixed phase relationship, so REFCLKOUT is connected to TXUSRCLK/TXUSRCLK2. XCLK is then aligned with REFCLKOUT with the help of the TX phase-align block, which allows the phase-adjust FIFO at the transmitter end to be bypassed. Similarly, the RXUSRCLK/RXUSRCLK2 connection allows the elastic buffer in the receiver to be bypassed. The delays created by the buffers are thus reduced.



Fig. 1 Serial transceiver with constant latency.

#### V. Clock and data Slider

The CDS consists of one dynamic clock phase shifting (DCPS) block, one comma detector and data alignment (CDDA) block and one changeable delay tuning (CDT) block. The bit-shift value 'n' of RXDATA and the phase offset $\Delta P$ between RXRECCLK and the transmitted clock always satisfy the following equation:

$$\Delta P = n \times \frac{360^{\circ}}{N} \qquad (1)$$

where N is the internal data-path width and 'n' ranges from 0 to N-1. To determine the bit-shift value 'n', the CDDA block identifies the K28.5 symbol (one of the 8b/10b control characters) within the parallel received data RXDATA.
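As a worked illustration of the phase-offset relation above, the following minimal sketch evaluates it for every bit-shift value. The function name and the default width of 20 (the data-path width used later in the paper) are choices made for this example only.

```python
def phase_offset_deg(n: int, width: int = 20) -> float:
    """Phase offset in degrees for bit-shift n on a width-bit data path."""
    if not 0 <= n < width:
        raise ValueError("bit shift n must lie in 0..N-1")
    return n * 360.0 / width


# One offset per possible bit shift: 0, 18, 36, ... degrees for N = 20.
offsets = [phase_offset_deg(n) for n in range(20)]
print(offsets[0], offsets[1], offsets[19])
```

With N = 20 the offsets are spaced 18° apart, which matches the values listed for the delay tuning block later in the paper.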

#### VI. Comma Detector and data Alignment block (CDDA)

In its simplified structure, the CDDA block works on part of the bits of RXDATA and part of the bits of RXDATA's delay register, RXDATA\_DLY. The bit-shift value 'n' is extracted from RXDATA and decides how the two data sources are combined. Once the CDDA block has obtained the bit-shift value 'n', it selects {RXDATA\_DLY[n-1:0], RXDATA[N-1:n]} as its result.
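The selection {RXDATA\_DLY[n-1:0], RXDATA[N-1:n]} can be modeled in software as a sliding window over two consecutive words. The sketch below is an illustrative model rather than the authors' Verilog: the function names, the mapping of the K28.5 code bits onto the upper ten bits of the word, the payload value and the `misalign` test helper are all assumptions made for this example.

```python
N = 20                      # internal data-path width (from the text)
MASK = (1 << N) - 1
# K28.5 in both running disparities; placing bit 'a' at the MSB is an assumption.
K28_5 = (0b0011111010, 0b1100000101)


def realign(rxdata_dly: int, rxdata: int, n: int) -> int:
    """Return {RXDATA_DLY[n-1:0], RXDATA[N-1:n]} as an N-bit integer."""
    window = (rxdata_dly << N) | rxdata
    return (window >> n) & MASK


def find_bit_shift(rxdata_dly: int, rxdata: int):
    """Return the smallest n whose realigned word starts with K28.5, else None."""
    for n in range(N):
        if (realign(rxdata_dly, rxdata, n) >> (N - 10)) in K28_5:
            return n
    return None


def misalign(words, n):
    """Hypothetical test helper: emulate a receiver that slices the stream n bits late."""
    return [((cur << n) & MASK) | (nxt >> (N - n))
            for cur, nxt in zip(words, words[1:])]


idle = (K28_5[0] << 10) | 0b1001000101   # K28.5 plus an arbitrary illustrative payload
rx = misalign([idle, idle, idle], 7)
n = find_bit_shift(rx[0], rx[1])
print(n, realign(rx[0], rx[1], n) == idle)
```

Here `rx[0]` plays the role of RXDATA\_DLY and `rx[1]` the role of RXDATA. The search returns the bit shift that was injected, and the realigned word equals the transmitted one.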

#### VII. Dynamic Clock phase shifting (DCPS) block

The DCPS block is built from a digital clock manager (DCM), a phase-locked loop (PLL) and a phase shift control unit (PSCU). Both coarse- and fine-grained phase shifts are obtained from the DCM. The DCM operates in the "VARIABLE\_POSITIVE" shift mode, one of its five possible phase-shift modes. The phase-shift value 'P' is given by:

$$P = p \times \frac{t_{clk}}{256} \qquad (2)$$

where p is an integer parameter whose range is given by the DCM timing parameters, and $t_{clk}$ is the period of the input clock.
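To make the DCM phase-shift relation concrete, the sketch below evaluates it numerically, taking t_clk as the input clock period (P is a time, so the period rather than the frequency is the quantity that scales it). The function name is hypothetical, and the 50 MHz value is the input clock frequency quoted in the conclusion. The allowed range of p depends on DCM timing parameters that the paper does not list, so no range check is attempted.

```python
def dcm_phase_shift_ns(p: int, clk_freq_hz: float) -> float:
    """Phase shift P in nanoseconds for integer setting p: P = p * t_clk / 256."""
    t_clk_ns = 1e9 / clk_freq_hz      # input clock period in ns
    return p * t_clk_ns / 256


step_ns = dcm_phase_shift_ns(1, 50e6)       # one fine-grained step
quarter_ns = dcm_phase_shift_ns(64, 50e6)   # 64/256 of a period
print(step_ns, quarter_ns)
```

For a 50 MHz input clock the period is 20 ns, so one step of p corresponds to 20/256 ns, roughly 78 ps.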

The CLK180 output of the DCM extends the phase-shift range while fine-grained phase shifting is in progress, since all the clock outputs of the DCM are adjusted together. Coarse-grained phase shifting is achieved by the PLL. The PLL outputs CLK0, CLK90, CLK180 and CLK270, which are phase-shifted by a quarter of the input clock period relative to one another. Using the PSEN, PSINCDEC, PSCLK and PSDONE ports, the phase shift control unit makes the DCM execute the phase shift. The operation takes two steps:

- 1) It first resets the DCM and waits for it to relock. The initial value of p is zero, so the phase-shift value P is 0 ns.
- 2) If the phase offset is less than 180°, clk0 is selected as the input clock of the PLL; if it is greater than 180°, clk180 is selected as the input clock of the PLL. The CLK\_SEL signal driven by the phase shift control unit selects between these PLL input clocks.
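The clock selection in step 2 can be sketched as a small decision function. This is only an illustrative model: the function name is hypothetical, the text leaves the case of exactly 180° open (it is grouped with the larger offsets here as an assumption), and returning the remaining offset is an assumption about what would be left for the finer phase-shifting stages.

```python
def select_pll_input(phase_offset_deg: float):
    """Pick the PLL input clock and return (clock name, remaining offset in degrees)."""
    if not 0.0 <= phase_offset_deg < 360.0:
        raise ValueError("phase offset must lie in [0, 360)")
    if phase_offset_deg < 180.0:
        return "CLK0", phase_offset_deg
    # Assumption: exactly 180 degrees is handled like the larger offsets.
    return "CLK180", phase_offset_deg - 180.0


print(select_pll_input(90.0))
print(select_pll_input(342.0))
```

Selecting CLK180 accounts for half a period in one step, so the remaining offset always stays below 180°.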

#### VIII. Asynchronous FIFO

An asynchronous FIFO is a mechanism that passes data safely from one clock domain to another. The designer selects the width and depth of the FIFO as required. One clock signal writes data into the FIFO, whereas another clock signal reads the data out of it.
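The paper does not say how the FIFO pointers cross between the two clock domains. A common technique, assumed here and not confirmed by the source, is to Gray-code the read and write pointers so that only one bit changes per increment and a pointer sampled in the other domain is never off by more than one position. The helper names below are hypothetical.

```python
def bin_to_gray(value: int) -> int:
    """Convert a binary pointer value to its Gray-code representation."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Convert a Gray-coded pointer back to binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


DEPTH = 16  # illustrative FIFO depth (a power of two)
for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    changed_bits = bin(bin_to_gray(ptr) ^ bin_to_gray(nxt)).count("1")
    print(ptr, format(bin_to_gray(ptr), "04b"), changed_bits)
```

Every increment, including the wrap from 15 back to 0, changes exactly one bit of the Gray-coded pointer.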

#### IX. Changeable delay tuning Block

The CDT block includes one asynchronous FIFO, one comma comparator and one delay tuning unit. The comma comparator checks whether the parallel input data realigned by the CDDA block contains the predefined comma values. If it finds the comma values in the parallel data, the comparator output goes high. This high signal serves as the write-enable signal of the asynchronous FIFO and as the input to the first flip-flop of every delay unit. While write enable is high, the queued parallel data is written into the asynchronous FIFO. The multiplexer then selects one of the delay units based on the phase offset and the bit-shift value. Once the multiplexer has made its selection, a flip-flop stores which delay unit is active, and the parallel data stored in the FIFO is read by the device. At this time the comma-out signal is high, indicating that the comma word has been detected.

#### X. Detailed Operation of the delay tuning Block

The changeable delay tuning block gets the phase offset $\Delta P$ from the parallel received data stream. According to the phase offset $\Delta P$, the DCPS block generates four multi-phase clocks: Rec\_clk, Rec\_clk90, Rec\_clk180 and Rec\_clk270. Here the data-path width N of the GTX transceiver is set to 20, so there are 20 distinct values of $\Delta P$.

- 1) If $\Delta P$ is 342°, 324°, 306°, 288° or 270°, the first delay block is active and its output is selected as the result of the changeable delay tuning circuit.
- 2) If $\Delta P$ is 252°, 234°, 216°, 198° or 180°, the second delay block is active and its output is selected as the result of the changeable delay tuning circuit.
- 3) If $\Delta P$ is 162°, 144°, 126°, 108° or 90°, the third delay block is active and its output is selected as the result of the changeable delay tuning circuit.
- 4) If $\Delta P$ is 72°, 54°, 36°, 18° or 0°, the fourth delay block is active and its output is selected as the result of the changeable delay tuning circuit.
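The four cases above amount to choosing a delay block by the 90° quadrant in which the phase offset falls. A minimal sketch of that mapping, with a hypothetical function name and integer arithmetic in place of the hardware multiplexer, is:

```python
def delay_block_for_shift(n: int, width: int = 20) -> int:
    """Return the active delay block (1..4) for bit-shift n on a width-bit path."""
    if not 0 <= n < width:
        raise ValueError("bit shift n must lie in 0..N-1")
    quadrant = (n * 4) // width   # 0 for 0-72 deg, 1 for 90-162 deg, and so on
    return 4 - quadrant


# Bit shifts 19..15 map to block 1, 14..10 to block 2, 9..5 to block 3, 4..0 to block 4.
selection = {n * 18: delay_block_for_shift(n) for n in range(20)}
print(selection[342], selection[180], selection[90], selection[0])
```

For N = 20 each block serves five consecutive bit-shift values, which reproduces the grouping of phase offsets in the list.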

#### XI. Implementation and results

#### 1. DCPS block Implementations

The PSCU block takes Bit\_shift\_val from the CDDA block, clk, and PSDONE\_OUT and LOCKED\_OUT from the DCM block as inputs, and it drives PSEN\_IN and PSINCDEC\_IN as outputs to the DCM block. The result shows that the PSEN\_EN signal must be high for the PSCU to give its output to the DCM block.

#### 1.1. Simulation Results of DCM block

The DCM IP block has various inputs and outputs, as shown in its schematic. For this design the signals of interest are the PSCU outputs that arrive as inputs to the DCM and the two clock signals, clk0 and clk180, that the DCM passes on to the PLL block. The simulation results show that when the clock is running, reset is low and the PSDONE\_OUT signal makes a transition, the outputs CLK0\_OUT and CLK180\_OUT, which are 180° out of phase with each other, emerge from the DCM block and arrive as inputs to the PLL block.

The simulation results of the PLL block show the two input signals CLKIN1\_IN and CLKIN2\_IN, which carry CLK0 and CLK180 respectively. The block has four multi-phase clock outputs separated by quarter-period phase shifts: CLKOUT0\_OUT represents CLK0 (rec\_clk), CLKOUT1\_OUT represents CLK90 (rec\_clk90), CLKOUT2\_OUT represents CLK180 (rec\_clk180) and CLKOUT3\_OUT represents CLK270 (rec\_clk270). These multi-phase clocks go as inputs to the CDT block.

#### 2. Simulation results of CDDA block

The inputs of the CDDA block are the receiver recovered clock RXRECCLK and the 20-bit input data stream RXDATA. Its outputs are the bit-shift value 'n' and the 20-bit realigned data. The realignment depends on the bit-shift value.

The simulation results show RXRECCLK and RXDATA as inputs, and the 5-bit Bit\_shift\_val and the 20-bit out\_align data as outputs. The 5-bit result 'n' gives the position of the value 1 within the 20-bit data stream, and the delta value gives the phase offset. For example, when the 20th bit of the 20-bit data stream is 1, the phase offset calculated from equation 1 is 342°. Similarly, when the 19th bit is 1, the phase offset calculated from equation 1 is 324°. This repeats across the 20-bit data stream, so a total of 20 different phase-offset values are generated for the 20 bit positions of the input data. The DCPS block uses this phase offset to generate the four multi-phase clocks that activate the CDT block.

#### 3. Simulation Results of CDT block

The simulation results show the detailed outcome of the entire design. With the clock running and reset low, the 20-bit parallel input data Rx\_in\_data is given to the CDDA block, which produces the 20-bit output out\_align. This block is also responsible for generating the 5-bit bit-shift value, which is used as the select line of the multiplexer. From the result it can be inferred that the bit-shift value changes for every five phase-offset values. The result also shows that every change in the position of the '1' in the input parallel data stream generates a different phase offset. For example, if the 20th bit of the parallel input data stream is high the phase offset is 342°, if the 19th bit is high the phase offset is 324°, and so on. The first CDT simulation result (Fig. 2) shows the phase offsets down to 180°, that is, until the 11<sup>th</sup> bit is high. Depending on the multi-phase clocks obtained from the DCPS block and a high level on Rd\_en, the output data is obtained from the asynchronous FIFO. The second CDT simulation result shows the phase offsets from 162° down to 0°, that is, from the 10<sup>th</sup> bit to the last bit going high. At this point the output shows the bit-shift variation for each phase-offset value.

#### 3.2. Device utilization summary of the implementation

This summary lists the resources used by the implementation, the resources used by the previous design, the resources available on the device and the resulting utilization percentage. It shows that the proposed design uses fewer resources than the previous design.

Table 1. Device utilization summary of the implementation (estimated values)

| Logic Utilization         | Used | Previous Design | Available | Utilization |
|---------------------------|------|-----------------|-----------|-------------|
| Number of Slice Registers | 271  | 351             | 28800     | 0%          |
| Number of Slice LUTs      | 269  | 493             | 28800     | 0%          |
| Number of BUFG/BUFGCTRLs  | 6    | 11              | 32        | 18%         |
| Number of DCM_ADVs        | 1    | 1               | 12        | 8%          |
| Number of PLL_ADVs        | 1    | 1               | 6         | 16%         |
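The saving relative to the previous design can be read off the table with simple arithmetic. The sketch below uses only the "Used" and "Previous Design" columns; the function and dictionary names are hypothetical, and the percentages are derived here rather than reported by the authors.

```python
# (used, previous design) pairs copied from the device utilization table
UTILIZATION = {
    "Slice registers": (271, 351),
    "Slice LUTs": (269, 493),
    "BUFG/BUFGCTRLs": (6, 11),
}


def reduction_percent(used: int, previous: int) -> float:
    """Percentage of the previous design's resources that the new design saves."""
    return 100.0 * (previous - used) / previous


for name, (used, previous) in UTILIZATION.items():
    print(f"{name}: {reduction_percent(used, previous):.1f}% fewer than the previous design")
```

The DCM\_ADV and PLL\_ADV counts are identical in both designs, so they are left out of the comparison.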

#### XII. CONCLUSION AND FUTURE SCOPE

In this work, the clock and data slider block and its related blocks have been designed and implemented using Verilog HDL. Functional verification, synthesis, post-synthesis simulation and static timing analysis were carried out using the Xilinx ISE Design Suite 14.1, and the results were checked with the ChipScope Pro tool for an input clock frequency of 50 MHz. The performance is compared with that of the previously existing methods.




This work mainly focuses on minimizing the latency caused by buffering and clocking. The resource utilization indicates that the area, power and timing costs are lower than those of the existing architectures for the following modules.

- 1. Comma detector and data alignment block.
- 2. Dynamic clock phase shifting block.
- 3. Changeable delay tuning block.
- 4. Complete integrated block.

Almost all existing GTX transceivers have overcome the problems related to phase synchronization and clock timing. However, problems such as temperature variation, dispersion in the physical medium and clock jitter remain unsolved. Designing a rugged module that overcomes these physical effects and the latency changes associated with them is the future work.



Fig.2 Simulation results of the CDT block for phase offsets from 342° down to 180°.






Fig.13 Simulation results of the CDT block for phase offsets from 162° down to 0°.

#### XIII. REFERENCES

- [1] Xue Liu, Qing-Xu Deng, and Ze-Ke Wang, "Design and FPGA implementation of high-speed, fixed-latency serial transceivers," *IEEE Trans. Nucl. Sci.*, vol. 61, no. 1, pp. 561-567, Feb. 2014.
- [2] A. Aloisio, F. Cevenini, R. Giordano, and V. Izzo, "High-speed, fixed-latency serial links with FPGAs for synchronous transfers," *IEEE Trans. Nucl. Sci.*, vol. 56, no. 5, pp. 2864-2873, Oct. 2009.
- [3] R. Giordano and A. Aloisio, "Fixed-latency, multi-gigabit serial links with Xilinx FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 1, pp. 194-201, Feb. 2011.
- [4] A. Aloisio, F. Ameli, V. Bocci, M. Della Pietra, R. Giordano, and V. Izzo, "Design, implementation and test of the timing trigger and control receiver for the LHC," *J. Inst.*, vol. 8, no. 2, p. T02003, Feb. 2013.
- [5] J. Adamczewski-Musch, N. Kurz, S. Linev, and P. Zumbruch, "Data acquisition and online monitoring software for CBM test beams," *Journal of Physics: Conference Series*, vol. 396, part 1, IOP Publishing Ltd.
- [6] P. P. M. Jansweijer and H. Z. Peek, Measure Propagation Delay Over a 1.25 Gbps Bidirectional Data Link, Tech. Rep. ETR 2010-01, 2010 [Online]. Available: http://www.nikhef.nl/pub/services/biblio/technicalreports/ETR2010-01.pdf
- [7] F. Lemke, D. Slogsnat, N. Burkhardt, and U. Bruening, "A unified DAQ interconnection network with precise time synchronization," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 2, pp. 412-418, Apr. 2010.
- [8] Jinhong Wang, Xueye Hu, Thomas Schwarz, Junjie Zhu, J. W. Chapman, Tiesheng Dai, and Bing Zhou, "FPGA implementation of a fixed latency scheme in a signal packet router for the upgrade of ATLAS forward muon trigger electronics," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 2, pp. 467-471, Apr. 2010.
- [9] SCAN25100 2457.6, 1228.8, and 614.4 Mbps CPRI SerDes with Auto RE Sync and Precision Delay Calibration Measurement, Texas Instruments, 2013 [Online]. Available: http://www.ti.com.cn/product/cn/scan25100
- [10] TLK2711A 1.6 to 2.7 Gbps Transceiver, Texas Instruments, 2012 [Online]. Available: http://www.ti.com.cn/product/cn/tlk2711a