Performance Analysis of Two 8-Bit Floating-Point-based Piecewise Uniform Quantizers for a Laplacian Data Source
DOI:
https://doi.org/10.5755/j02.eie.37430
Keywords:
Accuracy, Floating-point arithmetic, Neural network compression, Quantization
Abstract
In this paper, we employ the analogy between the representation of the floating-point (FP) format and the distribution of representation levels of the piecewise uniform quantizer (PWUQ) to assess the performance of FP-based solutions more thoroughly. We present theoretical derivations for the performance of the FP format, and of the PWUQ determined by this format, for input data from a Laplacian source, and we compare the performance of two selected 8-bit FP-based PWUQs. Beyond the typical evaluation of an applied FP format through the accuracy degradation caused by using both FP8 solutions in neural network compression, we also use objective quantization measures. This approach offers insight into the robustness of these 8-bit FP-based solutions to changes in input variance, which matters in applications where the input statistics are not fixed. The results demonstrate that the allocation of bits between the exponent and the mantissa in the FP8 format is important, as it can significantly impact overall performance.
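The analogy between an FP format and a PWUQ can be made concrete with a short numerical sketch. The code below is illustrative only: it assumes the two compared formats are E4M3 and E5M2 (a common FP8 pair; the abstract does not name the selected formats), builds their representation levels in an IEEE-like way (reserving the top exponent code for Inf/NaN, which real FP8 variants relax), and estimates SQNR by Monte Carlo nearest-level quantization of Laplacian samples at several input variances, rather than by the theoretical derivations the paper presents.

```python
import numpy as np

def fp_levels(exp_bits, man_bits, bias=None):
    """Enumerate the representable values of a small FP format.
    Subnormals included; top exponent reserved IEEE-style for Inf/NaN
    (actual FP8 variants such as OCP E4M3 handle this differently)."""
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1
    pos = []
    # Subnormals: exponent field 0, no implicit leading 1.
    for m in range(1, 2 ** man_bits):
        pos.append(m * 2.0 ** (1 - bias - man_bits))
    # Normals: implicit leading 1, exponent field 1 .. 2^E - 2.
    for e in range(1, 2 ** exp_bits - 1):
        for m in range(2 ** man_bits):
            pos.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    pos = np.array(pos)
    return np.concatenate([-pos[::-1], [0.0], pos])

def sqnr_db(x, levels):
    """Quantize to the nearest representation level; return SQNR in dB."""
    xq = levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]
    return 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

rng = np.random.default_rng(0)
for name, (E, M) in {"E4M3": (4, 3), "E5M2": (5, 2)}.items():
    lv = fp_levels(E, M)
    for var in (0.01, 1.0, 100.0):
        # Laplacian with variance 2b^2, so scale b = sqrt(var / 2).
        x = rng.laplace(scale=np.sqrt(var / 2), size=20000)
        print(f"{name} var={var:g}: SQNR = {sqnr_db(x, lv):.2f} dB")
```

Sweeping the variance in this way mimics the robustness comparison described above: the format that allocates more bits to the exponent has a wider dynamic range and typically degrades more gracefully as the input variance drifts, while the format with more mantissa bits resolves each segment more finely near the design point.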
License
The copyright for the paper in this journal is retained by the author(s), with the first publication right granted to the journal. The authors agree to the Creative Commons Attribution 4.0 (CC BY 4.0) agreement under which the paper in the journal is licensed.
By virtue of their appearance in this open access journal, papers may be used freely, with proper attribution, in educational and other non-commercial settings, with an acknowledgement of their initial publication in the journal.
Funding data
- Ministarstvo Prosvete, Nauke i Tehnološkog Razvoja (Ministry of Education, Science and Technological Development), grant number 451-03-65/2024-03/200102