Optimizing FDTD Memory Bandwidth by Using Block Float-Point Arithmetic

1 Abstract —Finite-difference


I. INTRODUCTION
Finite-difference time-domain (FDTD) is a powerful algorithm for the modelling of the electromagnetic field.It provides a direct time-domain solution of Maxwell's curl equations in differential form by discretizing both the physical region and time interval by using a uniform grid called the Yee grid, after its inventor [1].The steps in the algorithm are the following: first all fields are initialized to 0. Afterwards, the sources of electromagnetic fields are introduced to the system.In the main loop of the algorithm the electric field is calculated based on the magnetic field.When the whole grid is updated, the magnetic field is calculated based on the electric field which has changed previously.Any objects inside the grid reflect or absorb the waves in some quantity.This process goes through time until we reach the desired iteration number.A special absorbing layer is added on the edge of the grid.This layer is called the absorbing boundary conditions (ABC) and is used to truncate the outgoing waves which would otherwise bounce off the edges of the grid, to obtain more realistic results [2], [3].Some benefits of the FDTD algorithm are that it is good for large scale simulations because it scales almost linearly, works well for broadband and transient simulations and naturally handles nonlinear behaviour.Some drawbacks of the method are the not so efficient representation of curved surfaces due to grid representation and inefficient representation of highly resonant devices which require much longer simulations [4].The areas of application of the FDTD include antenna design [5]- [7], optoelectronics [8] mine detection [9], microwave tomography [10], electromagnetic compatibility [11] and more.
Even though the method was introduced in the year 1966 [1], it was not widely used until the last decade due to limited computing resources which could not provide enough computation power to solve all the equations in time [12].As a memory intensive problem, it has sparked an interest to many authors in the past in the field of optimization.To improve the performance, the algorithm has been successfully implemented on multi-core architectures [13], with additional improvements being made by using graphics processing units (GPUs) as accelerators [14], [15] and field-programmable gate arrays (FPGAs) [16], [17].Most recently, the trends among authors is using Open Computing Language (OpenCL) in order to achieve a more flexible and portable solution based on the FPGA [18], [19].
One of the issues some of these solutions face is that they are based on the PC architecture which is compute orientedit can do multiple computing operations in the time frame of one memory access, resulting in a lot of idle time for the computing resources when memory intensive problems are involved.The FDTD algorithm's memory access to number of operations ratio makes PC architectures suboptimal.To cope with these issues, the authors proposed a solution based on multiple FPGAs working in conjunction to achieve maximal possible bandwidth [20].The main feature of the solution was its lower cost over-time due to low power consumption and cheaper components when compared to existing solutions.However, the performance of existing solutions could not be matched due to the far superior bandwidth of the double data rate type five synchronous graphics random-access memory (GDDR5 SDRAM) used in GPUs when compared to the double data rate type three random access memory (DDR3 SDRAM) modules the authors did the analysis for.This is mostly due to the fact that the GPUs have become a commodity because of their widespread usage in PCs, providing a better price to performance ratio.The idea that followed was to artificially improve the memory bandwidth without using more powerful FPGAs which would increase the cost and go against the main philosophy of the solution, which is a cheaper and more efficient solution.Therefore, the approach the authors took was to optimize the memory usage which came down to data compression.The goal was to use fewer bits to represent the data while maintaining precision within a specified threshold.This is the area where block-floating point arithmetic comes into play.
Block-floating point (BFP) arithmetic is a way of emulating floating-point arithmetic by using fixed-point.Unlike the standard floating-point, where every element has its own exponent and fraction, in BFP an entire block of numbers has a mutual exponent.The elements within the block are represented by their respective mantissas which are scaled according to the mutual exponent which is calculated based on the highest value.
When measuring the performance of a particular number representation, several characteristics are important.One of them is the dynamic range which is the ratio between the smallest and largest number that can be represented.The second characteristic is precision which defines the resolution or the measure of detail.When numbers are concerned, precision defines how close two consequent numbers are -the smaller this gap is, the better the precision.For the same word length, floating-point numbers have a better dynamic range whereas fixed-point representation has greater precision.The third characteristic worth mentioning is complexity.Fixed-point arithmetic is much simpler when compared to floating-point because it does not involve normalization.Adding two fixed-point numbers is no different from adding two integers.Floating-point arithmetic on the other hand requires shifting of the mantissa based on the exponent before adding two values.This results in floating-point arithmetic being more complex, requiring more transistors to do the computations which further increases the cost.So choosing the best representation for the data is an optimisation problem on its own -how to find the optimal ratio of dynamic range and precision while keeping the costs at the lowest value.
The benefit of using BFP is that it allows more aggressive scaling with a single block exponent while at the same time retaining a greater dynamic range in the output compared to typical fixed-point.Also, when compared to standard single precision floating-point, it requires fewer normalization operations because a single exponent is mutual for more elements.The important factor here is the size of the problem and the size of each block.The bigger the block and the problem, the more memory bandwidth is "saved" but with a certain lack in precision.
BFP provides a good trade off between complexity and dynamic range, making it an efficient number representation format in some cases [21].It is most commonly used in algorithms which are suited for fixed-point processors such as DSPs but require or benefit from additional dynamic range in certain areas.One such application is the Fast Fourier Transform (FFT) and other algorithms based on it [22]- [24].Other areas include neural networks [25], matrix multiplication [26], digital filters [27], [28], communication systems [29] and more.

II. SOLUTION
In order to use the BFP several adjustments to the existing algorithm had to be made.The first step was the profiling of the algorithm where all the variables and matrices were monitored to pinpoint the most extreme values.Afterwards scaling factors have been introduced so that all the values would fit into the range [-1, 1].This was required in order to avoid any overflows when switching from single precision 32bit floating-point to fixed-point.The results showed that by making this change, no significant errors were introduced.Despite the satisfying results, the overall performance increase was too small to make a change so the next step was to introduce the data compression.
Experiments showed that the minimal number of bits which could be used to represent the data elements was 16.However, there were two main issues with this type of solution.The first is the significant loss of precision due to insufficient dynamic range.The second problem was that even though the results could fit into 16 bits, the intermediate values used during the computation could not which caused overflows.A more extreme scaling factor was introduced but it deteriorated the precision even further.In the end a 24bit fixed point solution was chosen as a starting point.This reduced memory usage by 25 % with a decrease in precision of around 5 % which was acceptable.
The next step was to see if a custom data representation could yield better results and so block floating-point was introduced.Due to the nature of travelling electromagnetic waves (especially the large spatial sampling rate), the Yee grid could be divided into smaller blocks called tiles with similar values within the tile as shown in Fig. 1.This made it possible to determine the mutual exponent for the tile which then allowed a more aggressive scaling.The next task was to figure out how big should the tiles be without losing on precision by scaling them all in regard to the element with the most extreme value.If the division of the grid is too fine (having many tiles), the number of operations needed to find an exponent for each of the tiles will make the algorithm last longer and use up more memory.The most extreme case is when the tiles have the dimension of 1 × 1 which creates the equivalent of a grid with all floating-points.On the other hand, if the tiles are too large, some elements will not be represented precisely and the loss of precision will accumulate over time giving worse results as the simulation runs further.The optimal solution lies somewhere in between the extremes.
The most challenging aspect of switching to BFP was tile management.A new square tile is formed from elements from two matrices -one that holds the 16 bit fixed-point values which represent the mantissas and the other smaller matrix that contains the exponents as 8bit integers for each tile.If we examine the case where the tile size is 4 × 4, the fixed-point matrix has the dimension of 64 × 64, whereas the exponent matrix has 16 × 16 cells.First, the 4 × 4 tile of the fixed-point matrix is read and all values are expanded to 24 bit fixed-point and scaled by the exponent which corresponds to that tile.This process was named unpacking.Once the tile is unpacked, it is used in all the calculations and the a reverse process happens called packing.During this process a new exponent is based on the elements the tile and the values are written down in the final form which again is 16 bit.A logarithm with the base 2 is applied to the maximal value within the tile and then a ceiling function determines the value of the exponent.Afterwards, this value is stored in the matrix which holds the exponents.The fixed-point matrix is then scaled with the newly calculated exponent and written to the fixed-point matrix.A simplified algorithm is given bellow.

... for tj in 1:num_of_tiles_y_axis for ti in 1: num_of_tiles_x_axis # unpacking for jj in 1:tile_size for ii in 1:tile_size j = ((tj -1) << tile_shift_x) + jj i = ((ti -1) << tile_shift _y) + ii X_tile[ii,jj] = from_block(X_bf[i,j], block_exp) end end do_the_calculations() #calculations #packing for jj in 1: tile_size for ii in 1: tile_size j = ((tj -1) << tile_shift _x) + jj i = ((ti -1) << tile_shift _y) + ii X_bf[i,j] = to_block(X_tile[ii,jj], block_exp) end end end end …
The ti, tj, ii, and jj are counters used to manoeuvre around the grid.The values in the outer for loops (ti and tj) move through all the tiles, whereas the values in the inner loops (jj and ii) move within a single tile.The i and j variables are used to index the entire fixed-point grid.The X_tile is the temporary matrix used for calculations, and the X_bf is the matrix which is correspondent to the tile which is currently being worked on.Block_exp is the exponent which corresponds to the current tile.Some equations used in the algorithm require values which are located in the tiles not updated yet.To cope with this issue, the algorithm was divided into two phases and special data structures had to be introduced.During the first phase all tiles which would be needed for the second phase are calculated and written back.Special boundary cases which occur on the edge of each tile are handled by the so called bands which represent only a one-dimensional array that has the same length as the tile side as shown in Fig. 2.
These additional structures use up extra bandwidth and produce some overhead and so does the additional phase.However, they are necessary in order to reduce the internal memory needed for the proper functioning of the algorithm.Both are included in the final performance overview.For the tiles on the border of the grid there is another special boundary case.Because their neighbouring tiles are in fact outside of the grid, this is modelled by assuming that they have all zeros, and these zeros are used when updating the values of these elements.Once the entire grid has been updated, the last step is to measure the values in certain parts of the grid to be used for visualisation.The principle of locality also comes into play.The next tile to be processed is always close in the memory so spatial locality is satisfied.This kind of predictable behaviour enables performance optimizations for techniques such as pre-fetching, pipelining and burst memory reading.Temporal locality condition is not achieved because the data will not be reused in a time frame short enough so the applicability of caching is very limited and was thus discarded.

III. RESULTS
The Yee grid was modelled by a 64 × 64 matrix with the simulation running for 150 iterations.To find the optimal solution, the tile size varied from 2 × 2 to 32 × 32.The exponent was always an 8bit integer while the number of bits used to represent the fraction ranged from 10 to 14, with the remaining 6 to 2 bits being used for the integer part of the number.In the end, 14 bits were used to represent the fraction.So a total of 24 bits were used represent the data to have the best comparison with the 24 bit purely fixed-point solution.In order to measure the memory bandwidth, the number of bits per matrix was used as defined in (1) 16 8.

Sum Size Num
where Sum is the total number of bits, Size is the size of the tile (for example 4 × 4) and it is multiplied by 16 because those are fixed-point values.Num is the number of tiles which are needed to cover the entire Yee grid and it is multiplied by 8 because each exponent is represented by a single 8 bit integer.The differences between the sizes of the matrices are indicative of the difference between the actual bandwidth needed.
The metric to measure the performance along with the memory usage was precision.It was calculated by using the mean absolute relative percent difference of the Ez field (the value of the electric field oriented towards the z axis).First the error itself is defined as follows , .
x y x y D x y Otherwise x y The value is 0 if both x and y are 0 to avoid division by 0, otherwise it is defined as shown above.The precision is calculated by going through an array of matrices each containing the value of the Ez field for every iteration.In the end, this sum is divided by the number of iterations (iters in the formula) and the size of the matrices (Nx and Ny being the dimensions) to form the median absolute value of the error which was explained in (2).L1 and L2 are the values from the initial floating-point and final block floating-point version of the solution This type of relative error (3) takes into account how big the difference in the values measured is when compared to the absolute value.When multiplied by 100, it becomes a relative percentage error.For instance, if the difference between the observed values is 0.1 where one value is 0.5 and the other 0.6, that is considered as a bigger error than if the difference in the value is 10 and the observed values are 550 and 560.The reasoning behind this choice it that the authors believe it provides a better perspective of the performance of the solution.The performance expressed in precision and memory bandwidth for several tile sizes is given in Table I.
The initial 32bit floating-point solution is considered to be the benchmark.Its relative error is 0 because it does not differ from itself and it requires the biggest amount of memory regardless of the size of the grid and number of iterations, so for illustrating purposes it needs 100 % of the memory bandwidth.The 1st improved solution which used 24 bit fixed-point arithmetic uses up 25 % less bandwidth because each element in the Yee grid is represented by 24 bits instead of 32.In other words 75 % of the benchmark bandwidth is required for the solution to work.Using the equation for bandwidth (1) we have calculated which amount of the initial bandwidth used for the benchmark is needed for the block floating-point solutions.The table shows only around 50 % of the memory is needed when compared to the initial solution.This is almost the same result as when using 16 bit fixed-point solution but with an error 2-6 times smaller, depending on the size of the tile.This shows that the BFP can be used in situations where memory usage needs to be as small as possible but without sacrificing the precision too much.These calculations show a promising result but this kind of solution is not the best option for the standard PC architecture.This is due to the granularity of the data CPUs and GPUs use which is usually in chunks of 8, 16, 32 or larger.So there is no way of really benefiting from saving a few bits for every data element, unlike when working with an FPGA.
Another interesting characteristic of the algorithm is the ratio of reads and writes and also the ratio of total memory accesses when compared to the number of operations (additions and multiplications).When using the 8 × 8 tile, the number of writes per iteration is 7224, while the number of reads is 20152, so almost 3 times as more.This difference is a good indicator of the large amount of data needed for computation.An even better perspective is given once all the operations (additions and multiplications) are included.Together, the additions and multiplications amount to the ratio of 0.58 operations per single byte of data, for the 8 × 8 tile size.If we approximate the GPUs FLOPS (floating-point operations per second) with the bandwidth we see just how little compute potential of the GPU is used.Here are the specs of the now average, graphics card such as an AMD Radeon R7 265.According to the manufacturer, this GPU achieves 1843.2GFLOPS with the maximal theoretical bandwidth of 179.2 GB/s.If we divide the two we get the ratio of 10.28 FLOPS per byte which is 17 times more than that the algorithm needs.If we move further up the market price and see the specs of a high end GPU such as the GeForce GTX 1070, we see that the result is even worse in terms of effective computation power.The GTX 1070 achieves 7816.4GFLOPS with the theoretical bandwidth reaching 256 GB/s, making the ratio 30.52 FLOPS per byte which is over 50 times of that which is needed for the algorithm.The operation per byte ratio of the algorithm is just far below of that the GPU can achieve which makes them not the optimal choice when it comes to power

TABLE I .
TILE SIZE, NUMBER OF TILES IN THE GRIDAND PERFORMANCE.