

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

www.irjet.net

National Conference on Recent Advancements in Communication, Electronics and Signal Processing-RACES'20 Organised by: Department of ECE, Velammal Engineering College, Chennai-66

# NOVEL STRUCTURE FOR AREA-EFFICIENT IMPLEMENTATIONOF FIR FILTERS

## AKASH P<sup>1</sup>, CHANDRA MOHAN REDDY M<sup>2</sup>, PETCHI RAJA N<sup>3</sup>, SIDARTH S<sup>4</sup>, MAGESH V<sup>5</sup>

<sup>1-4</sup>Department of Electronics and Communication Engineeering, Velammal Engineering College, Ambattur – Redhills Road, Chennai-66.

<sup>5</sup>Assistant Professor, Department of Electronics and Communication Engineeering, Velammal Engineering college, Ambatur – Redhills road, Chennai-66.

ABSTRACT: It is observed that in multiplierless implementation of transposed direct form (TDF) finite impulse response (FIR) filters, the adders in the product accumulation block, named as structural adders (SAs), contribute the major part of the overall logic complexity. A novel FIR filter structure is therefore proposed to reduce the hardware complexity of the product accumulation block. In the proposed structure, half of the long wordlength SAs are replaced by adders, named as prestructural adders (PSAs), which have relatively shorter word-length. The filter coefficients are carefully grouped to take advantage of the symmetric impulse response of linear phase FIR filters. Analysis and experimental results show that the overall area complexity and power consumption can be reduced at the expense of negligible delay overhead. The average area and power reduction over existing techniques can be as much as 23.8% and 25.4%. The overall area-delay performance and powerdelay performance of the proposed implementation is superior to existing techniques.

*Index Terms*—Finite impulse response (FIR), product accumulation block, structural adder (SA)

## INTRODUCTION

Finite impulse response (FIR) filter is a fundamental building block in digital signal processing (DSP) systems. Generally, an FIR filter can be implemented in direct form or transposed direct form (TDF). For very large scale integration (VLSI) implementation, the TDF is preferred over the direct form due to its inherent pipelined accumulation section.

A TDF FIR filter consists of two parts: (i) a multiple constant multiplication (MCM) block and (ii) a product accumulation block. The products generated by the MCM block are delayed and accumulated in the product accumulation block to produce the filter output. The adders in the product accumulation block are referred to as structural adders (SAs) in the rest of the paper. To reduce the complexity of FIR filters, a lot of efforts have been put on the efficient implementation of MCM blocks and a lot of design techniques have been proposed. The

\*\*\*
hat in multiplierless
rect form (TDF) finite
adders in the product
\*\*\*
SAs, on the other hand, are often ignored by most of the
research. Faust *et al* proposed to optimize the SAs by
splitting the intermediate accumulation

This work was supported by the Shenzhen Science and Technology Development Funds under Grant JCYJ20160308094919279 and JCYJ20150626090521275, by the National Natural Science Foundation of China under Grant 61601301 and 61504086.

x.Lou and W. B. Ye are with the School of ElectronicScienceandTechnology, ShenzhenUniversity, China518060(Correspondingauthoremail:yewenbin@szu.edu.cn).

Y. J. Yu is with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China 518055.

P. K. Meher is with the School of Computer Engineering,

NanyangTechnological University, Singapore 639798.

Results, at the cost of more registers. In a recent publication, the authors proposed to split the SAs into small adder blocks to reduce the delay of SAs. Round-off can be performed on the intermediate accumulation result to reduce the complexity of SAs [9], but the precision of the filter output may be sacrificed.

## MOTIVATION

The number of SAs and registers needed to implement the accumulation block increase linearly with the order of the filter. The accumulation block of an N th order (N+ 1 taps) TDF FIR filter consists of N SAs and Nregisters(delay elements). Moreover, as the wordlengths of intermediate accumulation results increase, the word-lengths of the SAs and registers increase accordingly. For a fixed coefficient filter, the word-length of the *i*th SA (corresponding to the *i*th filter coefficient) can be estimated as

$$W_i = W_X + \lceil \log_2 \sum |h_k| \rceil - l_i, \quad \text{for } 0 \le i < N(1)$$

$$k=i$$

where,  $W_X$  is the word-length of the filter input,  $h_k$  is the

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 08 | Aug 2020 www.irjet.net p-ISSN: 2395-0072

National Conference on Recent Advancements in Communication, Electronics and Signal Processing-RACES'20 Organised by: Department of ECE, Velammal Engineering College, Chennai-66

*k*th filter coefficient and  $l_i$  is the number of left shift to generate  $h_i$  from the corresponding synthesized fundamental using MCM technique. Therefore, the number of full adders (FAs) needed for the product accumulation block is given by

$$M = \sum_{i=0}^{N-1} W_i = (N-1)W_X + \sum_{i=0}^{N-1} (\lceil \log_2 \sum_{k=i}^N |h_k| \rceil - l_i).$$
(2)

Similarly, the number of bit registers needed for the product accumulation block can be estimated in the same manner. Therefore, the complexity of the accumulation block increases with the filter order *N*.

The complexity of the MCM block, on the other hand, is related to the fundamental set that is implemented using shift add operations. The number of unique fundamentals depends on the values of the filter coefficient set, and does not increase monotonically with the filter order. For example, if all the filter coefficients are power-of-two terms, the number of unique fundamentals is zero and no addition is needed for the MCM block.

Fig. 1(a) shows the relationship of filter order and number of unique fundamentals for fifteen commonly referenced benchmark filters [11]. It is observed that there is no clear relationship between the number of unique fundamentals and the order of filters, and the number of unique fundamentals of a coefficient set can be low even for large order filters. Fig. 1(b) shows the average proportion of the MCM block and theaccumulation block in terms of FA cost for these benchmark filters.



Fig. 1. (a) Relationship of filter order and number of unique fundamentals for 15 commonly referenced benchmark filters [11]. (b) The average proportion of the MCM block and the accumulation block in terms of FA consumption.

As can be seen, the accumulation section contribute the major part of FA cost in a filter. The MCM only consumes about 10% of the total FAs while the accumulation block consumes about 90%. Therefore, the optimization of SAs, which contributes the major part of overall complexity, deserves more attention for low-complexity filter implementation.

## THE PROPOSED FILTER STRUCTURE

For the implementation of FIR filters, ripple-carry adders (RCAs) have been widely used due to their area efficiency

and structural suitability [6]. In this section, we derive a filter structure for low-complexity implementation and discuss the area and time complexity of the proposed structure based on RCAs.

#### A. Derivation of the Proposed Structure

The input-output relationship of an *N*th order FIR filter can be expressed as

$$Y(z) = H(z) \cdot X(z), (3)$$

where H(z) is the transfer function of the filter, given by

$$H(z) = \sum_{k=0}^{N} h_k z^{-k} .(4)$$

The direct implementation of (4) results in the direct form structure. Alternatively, using the Horner's rule [12], (4) can also be expressed as

$$H(z) = h_0 + z^{-1} \Big( h_1 + z^{-1} \big( h_2 + z^{-1} \big( h_3 + \dots z^{-1} \big( h_{N-1} + z^{-1} h_N \big) \dots \big) \Big),$$
(5)

which results in the TDF structure. The adders in (5) are structural adders.

(4)Starting from can be paired up in two directions:  $(ih_N-1z^{-(N-1)} \text{ and } h_0z^{-})^0$ , the terms of  $(h_0 + h_1z^{-}_{1)}, z^{-2(N}(h^{-24})+(hh_N3z^{-4}+1))$  et al. and  $(ii)h_N-3z^{-1})$  et al. Therefore, for odd  $z^{-(N-2)}(hN-2+hN-N1z, (4)-1), z^{-}$  can be expressed as

Fig. 2. The direct implementation of (9). Where

$$T_{i} = \begin{cases} h_{2i} + h_{2i+1}z^{-1} & 0 \le i \le \lceil \frac{N-1}{4} \rceil - 1 \\ h_{2i+1} + h_{2i+2}z^{-1} & \lceil \frac{N-1}{4} \rceil \le i \le \frac{N-3}{2} \end{cases}$$
(7)

2*i* for 
$$0 \le i \le \lceil \frac{N-1}{4} \rceil - 1$$
  
2*i*+1 for  $\lceil \frac{N-1}{4} \rceil \le i \le \frac{N-3}{2}$ 

Here  $T_i$  is referred to as compound coefficients (CCs). The adders in the CCs are referred to as pre-structural adder (PSA) in the rest of the paper. By applying the Horner's rule again, we have

#### International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 IRJET Volume: 07 Issue: 08 | Aug 2020

www.irjet.net

p-ISSN: 2395-0072

National Conference on Recent Advancements in Communication, Electronics and Signal Processing-RACES'20 Organised by: Department of ECE, Velammal Engineering College, Chennai-66

$$H(z) = T_0 + z^{-2} \Big( T_1 + z^{-2} \big( T_2 + \dots + z^{-2} \big( T_{\lceil \frac{N-1}{4} \rceil - 1} \\ + z^{-2} \big( h_{2 \cdot \lceil \frac{N-1}{4} \rceil} + z^{-1} \big( T_{\lceil \frac{N-1}{4} \rceil} + z^{-2} \big( T_{\lceil \frac{N-1}{4} \rceil + 1} + z^{-2} \big) \\ + \dots + z^{-2} \big( T_{\frac{N-3}{2}} + z^{-2} h_N \big) \dots \Big) \Big)$$
(9)

Fig. 2 shows the direct implementation of (9). Note that the first term for all the CCs are fed with the same input samples and the second term for all the compound coefficients are fed with the same delayed input samples as well. Therefore, all the multiplications with the first terms of compound coefficients can be performed by an MCM block and the multiplications with the second terms can be performed by another MCM block. Fig. 3 shows the proposed filter structure for odd order N. The constants in MCM-1 and MCM-2 are

$$\begin{cases} C1_{\text{odd}} = \{h_{N-1}, h_{N-3} \dots h_3, h_1\} \\ C2_{\text{odd}} = \{h_N, h_{N-2} \dots h_{2 \cdot \lceil \frac{N-1}{4} \rceil} \dots h_2, h_0\} \end{cases}$$
(10)

Similarly, for filters with even order  $N_{1}$  (4) can be rewritten as

$$\begin{array}{c} H(z) = (h0 + h1z - 1) + z - 2(h2 + h3z - 1) + z - 4(h4 + h5z - 1) \\ + \ldots + z - {}^{(N-2)}(h_N - 2 + h_N - 1z - 1) + z - {}^N h_N \\ N/2 - 1 \end{array}$$

 $= \sum z - 2iTi + z - NhN, i = 0 \quad (11)$ where  $T_i = h_{2i} + h_{2i+1}z^{-1}$  for  $0 \le i \le N/2$ . By applying the Horner's rule, we have

$$H(z) = T_0 + z^{-2} \Big( T_1 + z^{-2} \big( T_2 + \dots + z^{-2} T_{N/2 - 1} + z^{-2} h_N \big) \dots \Big) \Big).$$
(12)

Fig. 3. The proposed structure for odd *N*. The constants in MCM-1 and MCM-2 are

$$\begin{cases} C1_{\text{even}} = \{h_{N-1}, h_{N-3} \dots h_3, h_1\} \\ C2_{\text{even}} = \{h_N, h_{N-2} \dots h_2, h_0\} \end{cases}$$
(13)

A similar structure as Fig. 3 (without the tap  $h_2 \cdot [\underline{N4-1}]$  in the middle) can be obtained for even order *N*.

## B. Area Complexity of the Proposed Structure

The focus of this paper is the proposal of a novel filter structure for efficient implementation of a given filter with fixed filter coefficients. The design of these given filter coefficients is out the scope of this paper. The readers can refer to [1], [2] for details about the filter coefficients design problem.

For linear phase FIR filter, the impulse response is symmetric, i.e.,  $|h_k| = |h_N - k|$ . Therefore, the distinct [N/2]. In the proposed structure, the multiplications of fixed coefficients with the input variable are performed by two separate MCM blocks. Since there is no sharing of partial products across the two MCM blocks, overhead is introduced by splitting an MCM block into two. Let TS<sub>1</sub> and  $TS_2$  be the target sets reduced from C1 and C2, respectively, and  $AS_1$  and  $AS_2$  be the corresponding additional fundamental sets needed to implement  $TS_1$  and  $TS_2$ , respectively, the intersection of constants is given by

$$\begin{cases} O = (TS_1 \cup AS_1) \cap (TS_2 \cup AS_2), (14) \\ C1_{\text{odd}} = \{h_{\lceil \frac{N}{2} \rceil - 1}, h_{\lceil \frac{N}{2} \rceil - 2} \dots h_1, h_0\} \\ C2_{\text{odd}} = \{h_N, h_{N-1} \dots h_{\lceil \frac{N}{2} \rceil + 1}, h_{\lceil \frac{N}{2} \rceil}\} \end{cases}$$
 which

need to be implemented for both MCM-1 and MCM-2, is the overhead introduced by splitting of the MCM block. The size of O depends highly on the partition of the coefficients and can be as large as  $O = (TS_1 \cup AS_1)$  if the coefficient set is partitioned as(15)

In the proposed structure, the two symmetric filter coefficients are grouped into the same MCM block as shown in (10) and (13). Therefore, the distinct coefficients need to be implemented for C1 and C2 are two exclusive half of  $\{h_i | 0 \le i \le \lfloor N/2 \rfloor\}$  and there is no intersection of distinct coefficients from the perspective of symmetrical impulse response. As a result, the area overhead of the MCM block is minimized by the proposed coefficient partition. Since the MCM contributes only the minor part of the FA cost of a filter, the overhead could be made negligible compared to the overall area complexity.

In the accumulation block, the number of structural adders is reduced to [N/2] at the expense of |N/2| PSAs increment, i.e., |N/2| structural adders in the TDF are replaced with |N/2|PSAs. As discussed in Section II, the word-length of the structural adders increase due to the expansion of the accumulation result and can be relatively large for taps close to the filter output. The PSAs, on the other hand, only sums up two products from the two MCM blocks, which results in smaller word-length. The wordlength of the PSA can be estimated as

$$W_{\text{PSA}_i} = W_X + \lceil \log_2(|h_j| + |h_{j+1}|) \rceil - \max(l_j, l_{j+1}),$$
(16)

where  $h_j$  and  $h_{j+1}$  are the two coefficients in the *i*th CC. The word-length of the replaced SA in TDF structure is given by

$$W_{\mathrm{SA}_i} = W_X + \lceil \log_2 \sum_{k=j+1}^N |h_k| \rceil - l_j$$
.(17)

The area saving, in terms of FA cost, is  $\delta i = WSA_i - WPSA_i$ 

$$\sum_{k=j+1}^{N} \frac{\sum_{k=j+1}^{N} ||h_{k}|| - \lceil \log_{2}(|h_{j}| + |h_{j+1}|) \rceil - l_{j} + \max(l_{j}, l_{j+1})}{\sum_{k=j+1}^{N} \sum_{k=j+1}^{N} ||h_{k}|| \rceil - \lceil \log_{2}(|h_{j}| + |h_{j+1}|) \rceil}$$

$$k=j+1 \quad (18)$$

Since  $\lceil \log_2 \sum_{k=j+1}^N |h_k| \rceil$  increases with decreasing *j*,  $\delta_i$ may be large for the taps close to the filter output. For a 120th-order filter from [13],  $W_{PSA0}$  (the PSA in  $T_0$ ) is 13 while the wordlength of the corresponding SA in TDF structure is 25. The FA cost is reduced by as much as 48% for this structural adder. Fig. 4 shows the FA cost of the 60 PSAs and the replaced SAs in TDF for this filter. As can be seen, the saving of FA cost is significant,

International Research Journal of Engineering and Technology (IRJET)e-ISSN: 2395-0056Volume: 07 Issue: 08 | Aug 2020www.irjet.netp-ISSN: 2395-0072

National Conference on Recent Advancements in Communication, Electronics and Signal Processing-RACES'20 Organised by: Department of ECE, Velammal Engineering College, Chennai-66

especially for the first 30 PSAs. This is because the first 30 PSAs are close to the filter output, where the word-length of the corresponding SAs are large due to the expansion of the accumulation result while the word-length of PSAs are kept low in the proposed structure.



Fig. 4. FA cost of PSAs and the replaced SAs for the filter from [13].

TABLE IFA COST (#FA) OF TDF STRUCTURE AND PROPOSEDIMPLEMENTATION

| Filter | Order | TDF (#FA) |      | Propose<br>(#FA) | ed Str | #FA  | %     |        |
|--------|-------|-----------|------|------------------|--------|------|-------|--------|
|        |       | MCM       | Acc  | MCM-1            | MCM-2  | Acc  | Saved | Saving |
| А      | 120   | 576       | 2546 | 319              | 335    | 2249 | 219   | 8.6%   |
| В      | 221   | 435       | 4281 | 257              | 270    | 3657 | 532   | 12.4%  |
| С      | 278   | 270       | 5194 | 170              | 182    | 4352 | 760   | 14.6%  |
| D      | 417   | 271       | 7458 | 163              | 195    | 6185 | 1186  | 15.9%  |
| Е      | 440   | 901       | 9249 | 603              | 534    | 7827 | 1186  | 12.8%  |
| F      | 515   | 244       | 8896 | 167              | 178    | 7438 | 1357  | 15.3%  |

Acc: accumulation block.

Table I lists the FA cost of the TDF structure and the proposed structure of six commonly referenced benchmark filters from [11]. It is observed that the FA cost of the accumulation section is significantly reduced at the expense of slight increase in the MCM part. The "#FA Saved" and "% Saving" presented in the last two columns are the overall FA cost (the total number of FAs of the MCM part and the accumulation section) reduction and the percentage of overall FA cost reduction, respectively. As can be seen, since the accumulation section contributes the major part of the FA cost in an FIR filter, the overhead of the MCM part can be compensated and the overall area complexity is significantly reduced due to the word-length reduction of PSAs.

## C. Timing Complexity of the Proposed Structure

According to the delay model proposed in [10], the delay of the *i*th tap in a TDF FIR filter can be estimated as

$$D_{i} = T_{h} \qquad \begin{array}{c} \langle V \\ & \cdot W_{X} + \lceil \log_{2} \sum |h_{k}| \rceil \\ & k_{=i} \\ + \\ & max(T_{v} \cdot a_{i,j} - T_{h} \cdot (s_{i,j} + l_{i})), \\ & j \in [1,M] \end{array}$$
(19)

where  $T_v$  and  $T_h$  are the vertical propagation delay and horizontal propagation delay of an FA, respectively,  $s_{i,j}$  is the number of left shift of the *j*th partial product of the *i*th coefficient in the shift-add network and  $a_{i,j}$  is the number of adder stages the partial product passes through.

In the proposed structure, the products generated by the MCM blocks go through two additions: one PSA and one SA. The number of adder stages of the proposed structure can be estimated by  $a'_{i,j} = a_{i,j} + 1$  if the shift-add operations to generate the product is assumed to be the same. Therefore, the delay of a tap in the proposed structure is given by

$$D_i = (T_h \cdot W_X + \lceil \log_2 \sum |h_k| \rceil)$$

$$k_{=i} \qquad (20)$$

 $+ \max(T_v \cdot a_{i,j} - T_h \cdot (s_{i,j} + l_i)) + T_v, j \in [1, M]$ 

which is only slightly larger than that of the TDF structure by  $T_{\nu}$ . Moreover, due to the splitting of the MCM block, the sharing of partial products is reduced. The average fanout of the nodes in the shift-add network is thus reduced, which is beneficial for the minimizing the critical path of the filter. Moreover, routing of the circuit can be simpler for small MCM blocks. Therefore, the overall timing performance of the proposed structure is expected to be similar to its TDF counter part.

## D. Generalized Structure

In the proposed structure in Section III-A, a CC consists of only two coefficients to take the advantage of the symmetric filter coefficients. The proposed structure can be generalized to group L coefficients into one CC. Therefore, the original MCM block is split into L sub-MCM blocks, where each subMCM block contains N/L (zero padding is needed if *N* is not dividable by *L*) coefficients. The L products generated by the L sub-MCM blocks are summed up by an adder tree using (L - 1)PSAs. The results of adder trees are accumulated to generate the filter output. Fig. 5 shows the proposed generalized structure. In the generalized structure, instead of 1/2, a ratio of (L - 1)/L of SAs in the TDF structure are replaced with PSAs. More replacement of SA with PSA means more area saving since the wordlength of PSAs are generally smaller than their corresponding SAs. Note that the word-length of the PSAs increases with *L* due to the extension of the sizes of the intermediate results. The increment of delay of a tap due to the increased adder stages in the adder tree is  $T_v$ .  $(\log_2 L)$ . The adders in adder trees along with the SA can be implemented using Wallace tree adder (WTA) to reduce the critical path delay. The generalized structure is a direct extension of the proposed structure and we do not give example here due to the length limitation. The value of *L* can be used to trade-off the critical path delay against area complexity.

Organised by: Department of ECE, Velammal Engineering College, Chennai-66

## EXPERIMENTAL RESULTS

In the first experiment, we compare the proposed implementation with [7], which focuses on the SA optimization. The same benchmark filters as in [7] is used. Fig. 6 shows the SA reduction (in # FAs) of the proposed structure and [7]. Note that it is the FA cost reduction in the accumulation section only. It can be seen that the proposed structure is more efficient in reducing the complexity of SAs. The reduction is almost doubled by the proposed structure compared to that of [7].

In the second experiment, we compare the proposed design with existing low-complexity FIR filter implementation techniques of [4] and [8]. All the design examples are implemented



Fig. 5. The proposed generalized structure. AT: adder tree. WTA: Wallace tree adder.



Fig. 6. SA reduction (in #FAs) of the proposed structure and [7].

in Verilog-HDL and synthesized using 65 nm CMOS technology library. Place and route are performed using Cadence SOC encounter. The power consumption is estimated at a frequency of 180MHz. Note that in all the designs, adders in built at bitlevel instead of behavioral description to guide the tool to use RCAs. We have used relaxed constraint during synthesis to avoid the impact by optimization of synthesis tools. Table II presents the post layout results of six filters. As can be seen, the proposed design outperforms other existing designs in terms of area and power consumptions, while the overall delay

performance of the proposed architecture is nearly maintained at the same level as that of [8]. The average area reduction of the proposed implementation over [4] and [8] are 34.0% and 18.8%, respectively, and the average power reduction are 41.4% and 23.5%, respectively. The overall area-delay performance and power-delay performance of the proposed design are therefore superior to the existing techniques.

TABLE II POST LAYOUT RESULT OF SIX BENCHMARK FILTERS

| Filter | Method   | Area<br>(µm²) | CPD<br>(ns) | Power<br>(mw) | ADP<br>(µm²*ns) | PDP<br>(mw*ns) |
|--------|----------|---------------|-------------|---------------|-----------------|----------------|
|        | [8]      | 73001         | 3.88        | 37.2          | 283243.9        | 144.3          |
|        | [4]      | 60102         | 5.12        | 28.3          | 307722.2        | 144.9          |
| А      | Proposed | 54691         | 3.71        | 23.4          | 202903.6        | 86.8           |
|        | [8]      | 119880        | 4.37        | 79.1          | 523875.6        | 345.7          |
|        | [4]      | 98820         | 5.27        | 61.1          | 520781.4        | 322.0          |
| В      | Proposed | 82763         | 4.93        | 50.4          | 408021.6        | 248.5          |
|        | [8]      | 173029        | 4.95        | 102.3         | 856493.6        | 506.4          |
|        | [4]      | 116659        | 4.80        | 79.6          | 559496.6        | 381.8          |
| С      | Proposed | 94232         | 4.97        | 48.7          | 468333.0        | 242.0          |
|        | [8]      | 204399        | 3.87        | 110.9         | 791024.1        | 429.2          |
|        | [4]      | 173099        | 4.26        | 85.9          | 737401.7        | 365.9          |
| D      | Proposed | 127682        | 4.02        | 64.7          | 513281.6        | 260.1          |
|        | [8]      | 234281        | 4.09        | 120.5         | 958209.3        | 492.8          |
|        | [4]      | 195892        | 5.28        | 86.4          | 1034309.8       | 456.2          |
| Е      | Proposed | 170018        | 4.56        | 69.6          | 775282.1        | 317.4          |
|        | [8]      | 240581        | 3.97        | 119.2         | 955106.6        | 473.2          |
|        | [4]      | 209876        | 4.87        | 95.6          | 1022096.1       | 465.6          |
| F      | Proposed | 150023        | 4.56        | 73.2          | 684104.9        | 333.8          |

CPD: critical path delay. ADP: area delay product. PDP: power delay product.

## CONCLUSION

A novel FIR filter implementation is proposed to reduce the area complexity of the product accumulation block by replacing parts of long word-length SAs with shorter wordlength PSAs. The filter coefficients are carefully grouped to take advantage of the symmetry of the impulse response of linear phase FIR filters. Analysis and experimental results show that the overall area complexity can be reduced at the expense of negligible delay overhead. The average area reduction of the proposed implementation over [4] and [8] are 34.0% and 18.8%, respectively, and the average power reduction are 41.4% and 23.5%, respectively. The overall area-delay and power-delay performance of the proposed implementation is superior to existing techniques.



#### REFERENCES

- [1] C. Y. Yao, W. C. Hsia, and Y. H. Ho, "Designing hardware-efficient fixed-point FIR filters in an expanding subexpression space," IEEE Trans. Circuits Syst. I, vol. 61, no. 1, pp. 202–212, Jan 2014.
- [2] W. B. Ye and Y. J. Yu, "Bit-level multiplierless FIR filter optimization incorporating sparse filter technique," IEEE Trans. Circuits Syst. I, vol. 61, no. 11, pp. 3206– 3215, Nov. 2014.
- [3] M. M. Peiro, E. I. Boemo, and L. Wanhammar, "Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm," IEEE Trans. Circuits Syst. II, vol. 49, no. 3, pp. 196–203, Mar. 2002.
- K. Johansson, O. Gustafsson, and L. Wanhammar, "Bitlevel optimization of shift-and-add based FIR filters," in Proc. IEEE Int. Conf. Electronics Circuits Syst., vol. 3, Marrakech, Morocco, Dec. 2007, pp. 713–716.
- [5] A. G. Dempster and M. D. Macleod, "Use of minimumadder multiplier blocks in FIR digital filters," IEEE Trans. Circuits Syst. II, vol. 42, no. 9, pp. 569–577, Sep. 1995.
- [6] X. Lou, Y. J. Yu, and P. K. Meher, "Fine-grained critical path analysis and optimization for area-time efficient realization of multiple constant multiplications," IEEE Trans. Circuits Syst. I, vol. 62, no. 3, pp. 863– 872, Mar. 2015.
- [7] M. Faust and C. H. Chang, "Optimization of structural adders in fixed coefficient transposed direct form FIR filters," in Proc. IEEE Int. Symp. on Circuits Syst., Taipei, May 2009, pp. 2185–2188.
- [8] M. Faust, M. Kumm, C. H. Chang, and P. Zipf, "Efficient structural adder pipelining in transposed form FIR filters," in Proc. IEEE Int. Conf. Digital Signal Processing, Singapore, Jul. 2015, pp. 740–743.
- [9] M. Faust and C. H. Chang, "Low error bit width reduction for structural adders of FIR filters," in Circuit Theory and Design (ECCTD), 2011 20th European Conference on, Aug 2011, pp. 713–716.
- [10] X. Lou, Y. J. Yu, and P. K. Meher, "Lower bound analysis and perturbation of critical path for areatime efficient multiple constant multiplications," IEEE Trans. Comput.-Aided Design Integr. Circuit Syst., accepted, 2016.
- [11] FIRsuite. (2011) Suite of constant coefcient FIR filters.[Online].Available: http://www.firsuite.net/J. F. Epperson, An Introduction to Numerical Methods and Analysis. John Wiley and Sons, Inc., 2002.