Hardware Efficient Reciprocal Using Second Order Harmonized Parabolic Synthesis and Squaring Shrunk Method

1 Abstract — In applications as in wireless communication, computer graphics and digital signal processing, a massive of complex matrix operations is often performed. Reciprocal is computed in large quantities in these matrix operations. To obtain high performance, efficient algorithm and hardware architecture are important in terms of low cost, low computation time and high precision. Second order first sub-function and squaring shrunk method have been proposed to build efficient hardware architecture for reciprocal using field programmable gate array. Second order first sub-function in harmonized parabolic synthesis is presented to improve the approximating precision and decrease the memory usage at the cost of additional multipliers. To further reduce the complexity, squaring shrunk method is proposed to decrease the expensive cost of multipliers. The combination of these techniques yields good performance trade-off. Precision simulation and hardware implementation result has shown that hardware reciprocal of high precision, low memory and low multiplier usage has been obtained compared to traditional first order first sub-function harmonized parabolic synthesis method.


I. INTRODUCTION
Arithmetic element functions are playing very important roles in wireless communication, computer graphics and digital signal processing, reciprocal is one of these functions which are frequently computed in matrix operations [1].Because of the characteristic of high throughput and low latency, hardware implementation has become a main approach in computing acceleration.To approximate reciprocal, a lot of algorithms have been developed so far.Look up table [2], [3], Newton-Rapshon (NR) method [4], [5], Coordinate Rotation Digital Computer (CORDIC) [6], series expansion [7], and et al. are of these algorithms developed to compute reciprocal.The key problem stated among these algorithms is how to find the best trade-off among precision, convergence speed, throughput, latency and cost.Look up table is easy to implement, but the memory size will be huge when a high precision is required.Newton-Raphson method is a multiplicative method, it has the property of quadratic convergence, but it consumes additional multipliers.CORDIC is a subtractive technique, which is hardware Manuscript received 19 November, 2017; accepted 8 March, 2018.
friendly, but it occupies a long latency especially at a high precision.Series expansion is also multiplicative based, which is a polynomial approximation method.It has the advantages of fast convergence, but the mass multipliers used in hardware are expensive.
In order to design efficient hardware architecture for reciprocal, parabolic synthesis method has been presented [8].It uses the product of series of sub-functions to approximate the original reciprocal function, and it indicates performance improvement over CORDIC and Newton-Raphson method [9].Parabolic synthesis method has shown a wide application in arbitrary function, e.g., sine and cosine function [10], [11], logarithmic and exponential function [12], and roots, inverse and inverse roots function [13].To further release the hardware burden and improve the computing precision of reciprocal, harmonized parabolic synthesis (HPS) [14] and non-linear interpolation [15] have been proposed recently.For example, a simple first order first sub-function with non-linear interpolation second order second sub-function in harmonized parabolic synthesis for reciprocal has been analysed in [14] and [16], it shows performance improvement over Newton-Raphson method.
To construct a hardware efficient implementation of reciprocal with high precision and low memory and multiplier usage, second order first sub-function and squaring shrunk method have been proposed in this paper.The proposed method inherits the harmonized parabolic synthesis method with non-linear interpolation in second sub-function as shown in [14] and [16], and constructs second order first sub-function to obtain a high precision.The proposed second order first sub-function reduces the memory usage by employing the symmetric property of the first help function.To overcome the additional consumed expensive multipliers in second order first sub-function, squaring shrunk method is employed in both the first and the second sub-function.It reduces the hardware complexity by decreasing the number of used multipliers as well as shrinking the multipliers to squares.The proposed harmonized parabolic synthesis with second order first sub-function and non-linear interpolation in second sub-function combined with squaring shrunk method has obtained performance improvement for hardware architecture of reciprocal in terms of precision and complexity (memory bits and multipliers).It shows a competitive solution for reciprocal in application of field programmable gate array (FPGA) and acceleration of complex algorithms.

II. RECIPROCAL APPROXIMATION METHOD A. Newton-Raphson Method
Newton-Raphson method has a long history [4], it is proposed to solve the non-linear problem by using of linear equations.NR method for computing reciprocal ( 1 / z v  ) can be expressed by (1), where t represents the number of iterations .
In order to obtain a fast convergence, look-up table (LUT) and NR method are combined [17].LUT is used to acquire a relatively precise initial value, and NR method is employed to get a high precision of reciprocal approximating.
The hardware architecture of NR method is illustrated in Fig. 1, it uses table size of 10 2 10  bits, and consumes two additions and two multipliers.The input data binary is composed of 1 bit of integer and 15 bits of fraction.

B. Harmonized Parabolic Synthesis Method
The normally form of reciprocal computation can be shown in (5), it is the mantissa of the floating-point number [14].In the harmonized parabolic synthesis method, the input and output variable should be normalized to the range of [0, 1].Thus, a pre-processing step is needed, as shown in (6), which normalizes the input variable ( ) x .As to normalizes the output variable ( ) y , the target computing function can be expressed in (7).Finally, the original reciprocal function (5) can be computed through (8), which is an after-processing step.As the pre-and after-processing step is easy to be implemented in hardware by addition and shifting, the target function ( 7) is mainly focused, computed and analysed in this paper where 0 1, 0 1.
Parabolic synthesis is a newly developed method in approximating unary functions, which is first presented by Erik in [8].It uses the product of a series of second order sub-functions, defined as where 1 1, 1 1. x f     In parabolic synthesis, the number of sub-functions affects the approximating accuracy.To construct hardware friendly architecture, harmonized parabolic synthesis (HPS) method has been developed [18], which uses two sub-functions, as defined in (10).HPS simplifies the approximating process, which leads to hardware complexity reduction.To avoid precision loss, it often combines with non-linear interpolation method, which will be introduced in section C where 1 1, 1 1.
In parabolic synthesis, the first help function, 1 ( ) f x , is defined as the quotient of the original function and the first sub-function, as shown in (11).The first help function is used in HPS to estimate the second sub-function

C. First Order First Sub-Function Based Harmonized Parabolic Synthesis Method (FOFS-HPS)
In the HPS method, in order to implement reciprocal with high accuracy and low complexity, constructing sufficient sub-functions has become the key.To simplify the hardware implementation, simple first order first sub-function has been presented in [14] and [16] for approximating reciprocal.However, it has the drawback of low convergence and low accuracy.To overcome this issue, non-linear interpolation has been utilized in the second sub-function in HPS, which can compensate the precision loss.The first order first sub-function based HPS method (FOFS-HPS) is shown in (12), where i represents the th i interval (total number of intervals is and w x is defined in (13) which can be implemented by shifting in hardware: ( ) , (2 ), where 2,i l , 2,i j and 2,i c are constant coefficients, and they are defined by ( 14), ( 15) and ( 16), respectively.
, start i x and , end i x denotes the starting and ending point of the th i interval, separately.2,i k is the slope of the th i interval, as shown in (17).The employment of non-linear interpolation can approximate the second sub-function more precisely through second order multiple piecewise approximation: , ), 2 ).
The hardware architecture of FOFS-HPS method is illustrated in Fig. 2, in which 16 intervals ( 4) n  is used as an example.Input and output variables are 15 bits and 16 bits, respectively.It costs three multipliers, one square and three additions.Among which, only an addition is consumed in the first sub-function, leading to a low complexity implementation.

D.Proposed Method Using Second Order First Sub-Function and Squaring Shrunk in Harmonized Parabolic Synthesis
The limitation of traditional FOFS-HPS method is that the low accuracy in the first sub-function may result in more intervals in the second sub-function so as to obtain a relatively high precision.To overcome this issue, HPS with second order first sub-function (SOFS-HPS) has been proposed in this paper.As shown in (18), a second order approximation in the first sub-function has been constructed, leading to a precise approximation of 1 ( ) s x .In (18), i represents the th i interval divided in the first help function, and , j and 2,i c are the same as in ( 14), ( 15), ( 16) and ( 17), respectively: .
Because of the using of SOFS, the first help function of HPS can exploit the symmetric property which can be utilized to save the memory bits, as illustrated in Fig. 3.It can be found that the first help function using SOFS has the property of symmetric, which can lead to nearly half of memory bits saving in hardware compared to the FOFS method.Hardware architecture of proposed SOFS-HPS is shown in Fig. 4, where 4 n  and the input and output width is the same as in Fig. 2. It can be seen from ( 17) and ( 18) that the main difference of approximating reciprocal relies on the first sub-function.High accuracy is obtained by using of second order approximation.Hardware architectures (from Fig. 2 and Fig. 4) indicate that the proposed SOFS-HPS method can reduce nearly half of the memory bits at the cost of an additional multiplier and addition.
To further release the hardware complexity burden of the proposed SOFS-HPS method, squaring shrunk method has been employed.The fundamental of the proposed HPS with squaring shrunk method relies on the fact that squarer has lower complexity than multiplier.The complexity analysis is based on the following assumption and hypothesis., , In the assumption, T represents the latency of the macro, and C denotes the complexity of the macro.Obviously, the multiplier is the most complexity and slowest macro among three.
Hypothesis 1 The squaring function in (22) needs one addition, one squarer followed by another multiplication and addition.However, the target second polynomial function in (21) requires one squarer, two multipliers and two adders.The complexity and latency comparison is shown in Table I.It can be seen that the advantage of the function in ( 22) is obvious in terms of hardware cost where two multipliers are shrunk to a squarer while the latency is the same as in (21).
The proposed SOFS-HPS using squaring shrunk method in approximating reciprocal is shown in (23), where  As multipliers are shrunk to squares, the hardware complexity of the proposed SOFS-HPS method is reduced a lot.Number of resources usage is compared in Table II.Compared to FOFS-HPS and proposed SOFS-HPS method, the proposed SOFS-HPS with squaring shrunk method consumes the least multipliers at the cost of one additional squarer without latency and precision loss.It also saves the memory bits comparing to FOFS-HPS method.The hardware architecture of the proposed SOFS-HPS with squaring shrunk method is shown in Fig. 5, where 4 n  is illustrated.The proposed method has the same approximating accuracy as SOFS-HPS, which devotes a high precision.By utilizing second order first sub-function, non-linear interpolation in the second sub-function, and squaring shrunk method, hardware implementation of the proposed SOFS-HPS with squaring shrunk method for reciprocal yields good trade-off between precision and complexity.
The proposed SOFS-HPS with squaring shrunk method has advantages in considering the precision and complexity of implementing reciprocal in hardware.On one side, the proposed method improves the approximating precision by using of second order first sub-function at the cost of additional resources.It reduces the memory bits usage as well by exploiting the symmetric property.On the other side, squaring shrunk method is proposed in this paper to reduce the multiplier consumption.Thus, a good trade-off between approximating precision and hardware complexity has been obtained which leads to hardware efficient reciprocal implementation.

III. ANALYSIS, RESULT AND COMPARISON A. Precision Analysis
To evaluate the precision of different approximating method, error   FOFS-HPS method and the proposed SOFS-HPS with squaring shrunk method.The non-linear interpolation intervals are 16 (n = 4).To be clarified, the proposed SOFS-HPS with squaring shrunk method has the same error distribution as SOFS-HPS method.Because they are using the same approximating equations in software even though their hardware architectures are different.It can be seen that the proposed method has a relatively higher precision than FOFS-HPS method, and it shows comparable precision as NR method with single iteration and table size of 10 2 10  bits.It is also shown that NR method with two iterations and table size of 7  2 7  bits has the highest precision among them.
Figure 7 displays the average error of different approximating methods.It can be found that the proposed method yields an average precision promotion by 98 % compared to FOFS-HPS method considering different interpolation intervals.More intervals will lead to lower average error.The proposed SOFS-HPS with 64 non-linear interpolation intervals in the second sub-function method can reach an average error magnitude of 10 -8 .It can also be seen that NR method has a faster convergence with the increasing of iteration.Nevertheless, HPS method has a lower convergence with the increasing of intervals.It should be pointed out that the error analysis in Fig. 6 and Fig. 7 is based on software simulation result.In hardware architecture, there will be precision loss considering the size of the memory and the width of the input, output and internal signal.This will cause additional truncated error which is often related to specific application.

B. Hardware Precision Analysis Based on Table Size
In hardware architecture, the truncated error caused by the truncation of different coefficients ( In the FOFS-HPS and proposed SOFS-HPS method, the second sub-function can be expressed as .
It can be seen from ( 28) that increasing the size of the table (enlarge H  will decrease ,   directly decrease the truncated error.
In the proposed SOFS-HPS with squaring shrunk method, the second sub-function can be expressed as .
High order of truncated error has been in (29), where only first order of error has been concerned to simplify the analysis.It can be found in (29) that the table size of H k has a linear proportion to decrease the truncated error while k  has the minimum influence on the precision.The other two variables have a conversed relationship on the truncated error, where p  has a more obvious influence on the precision than m  .It can be seen that the proposed method in (29) can effectively reduce the multipliers in hardware, but more memory bits will be consumed to store the more accurate coefficients so as to obtain the same precision as in (28).Average error based on different table size of the proposed hardware architecture has been shown in Fig. 8, where there are 16

C. Implementation Result of the Proposed SOFS-HPS Method
HPS with 16, 32 and 64 intervals on the second sub-function is implemented.FOFS-HPS and SOFS-HPS method are compared using FPGA.The implemented FOFS-HPS architecture is the same as in [16].The target device is Cyclone V 5CSEMA5F31C6, and the verification platform is DE1-SoC.Quartus Prime 16.0 and ModelSim 6.2b is chosen as the synthesis and simulation tool.The width of input is 15 bits and the output is 16 bits, implementation of multiplier uses intellectual property (IP) provided by Quartus.
Implementation results are shown in Table III.All implementations are using full pipelined structures, which lead to computed data refreshed every clock (the value of throughput equals maximum speed).It can be seen that the proposed method reduces the memory bits by 53 %, and improves the precision by 32 % on the average.The improved precision and memory bit has a direct proportion to the intervals.Although extra adaptive logic modules (ALMs) are consumed in the proposed architecture because of the additional usage of addition and multiplexer, it can be found that the proposed SOFS-HPS method is more effective when more intervals are used.Above all, the proposed SOFS-HPS method is a hardware-efficient architecture by taking advantage of symmetric property in the first help function.It brings performance improvement in terms of precision and memory bits at the cost of two adders and one multiplier over FOFS-HPS method.

D.Implementation Result of the Proposed SOFS-HPS with Squaring Shrunk Method
In order to validate the hardware efficiency of the proposed SOFS-HPS with squaring shrunk method, implementation of NR method with single iteration and table size of 7  2 7  bits (M1), NR method with single iteration and table size of   IV, where 16 intervals are considered.It can be seen that the proposed SOFS-HPS with squaring shrunk method consumes the least number of DSP blocks and moderate memory bits (Mem.) at the cost of extra resources compared to FOFS-HPS and SOFS-HPS method.The extra ALMs/ALUTs which are considered less expensive than DSP is resulted from the squaring shrunk approach.Although NR method reaches a low complexity in terms of logic element and DSP blocks, it is due to the mass usage of memory bits.Nevertheless, the proposed SOFS-HPS with squaring shrunk method can lead to relatively low memory consumption.
Error of the implemented hardware for different method has been shown in Fig. 9, where the standard deviation of M1, M2, M3, M4 and M5 is 1.07 × 10 -5 , 2.82 × 10 -6 , 1.22 × 10 -5 , 1.33 × 10 -5 and 6.58 × 10 -6 , respectively.It can be seen from Table IV and Fig. 9 that HPS method has outperformed NR method in terms of speed and memory usage at the cost of extra logic elements and DSP blocks.It also shows that the proposed M5 method has the most robust error distribution among them.From mathematical analysis, precision simulation and hardware implementation, it can be found that the proposed SOFS-HPS with squaring shrunk method devotes a hardware-efficient architecture by exploiting second order first sub-function, symmetric property and squaring shrunk technology.It obtains a more robust error distribution compared to existing FOFS-HPS.It also eliminates the expensive multipliers without precision loss.Although more logic elements are consumed to obtain the same precision as FOFS-HPS, it can be inferred that (from Fig. 8) the influence of table width for different methods will be negligible when the accuracy is improved.The proposed method can be easily implemented in low cost FPGAs for acceleration purpose or be integrated into complex circuit design where memory and multiplier is difficult or expensive to implement.

IV. CONCLUSIONS
Second order first sub-function of harmonized parabolic synthesis with squaring shrunk method has been proposed in this paper to approximate reciprocal.The use of high order first sub-function can benefit the accuracy of reciprocal, and reduce the memory usage by exploiting symmetric coefficients.Squaring shrunk method can be utilized to release the burden of multipliers.These techniques have been combined to yield an efficient hardware reciprocal, which has been shown performance improvement over traditional harmonized parabolic synthesis in terms of precision, memory and multiplier.Mathematical derivation, simulation and hardware implementation have been presented to demonstrate the efficiency of the proposed method, which shows a promising solution suitable for hardware acceleration and FPGA design for wireless communication, matrix inversion, image processing, and et al.

Fig. 1 .
Fig. 1.Hardware architecture of NR method with single iteration and initial value approximation based on LUT.

Fig. 3 .
Fig. 3.The first help function using conventional first order first sub-function (FOFS) and proposed second order first sub-function (SOFS). sqr

Figure 6
Figure 6 illustrates the error of LUT based NR method,

10 2 10 
bits (M2), traditional FOFS-HPS method (M3)[14],proposed SOFS-HPS method (M4) and SOFS-HPS with squaring shrunk method (M5) based on 16 intervals   4 n  have been implemented and compared using FPGA.The verification platform and software tools are the same as in section C. Stratix IV EP4SGX230FF35C4 FPGA is chosen as a comparison target device as well, where adaptive look-up table (ALUT) is the basic element of resources.The implemented hardware architecture is shown in Fig. 1, Fig. 2, Fig. 4 and Fig. 5, respectively.Full pipelined structures are implemented either.
2, , k are defined by (24).It can be seen that two adders and one squarer are needed in 1 ( ) s x and the coefficients symmetric property can be inherited in the second sub-function 2 ( ) : s x

TABLE I .
COMPLEXITY AND LATENCY COMPARISON.

TABLE II .
COMPLEXITY COMPARISON OF DIFFERENT METHODS.
2, ,  and k  denotes the truncated error caused by the limited table width of 2, , i p 2,i m and 2,i k ) can influence the precision of hardware.H l , H j , H c , i k respectively.Then, the following function is met: intervals and the table width of coefficient 2,i m is 11 bits.It can be seen that the table width of 2,i p plays more important role in affecting the average error than table width of 2, .It is consistent with the analysis of the truncated error in (29).
i k

TABLE III .
IMPLEMENTATION COMPARISON OF DIFFERENT HPS METHOD.

TABLE IV .
IMPLEMENTATION RESULT OF DIFFERENT APPROXIMATING METHODS.