Performance Analysis of Two 8-Bit Floating-Point-based Piecewise Uniform Quantizers for a Laplacian Data Source

DOI:

https://doi.org/10.5755/j02.eie.37430

Keywords:

Accuracy, Floating-point arithmetic, Neural network compression, Quantization

Abstract

In this paper, we exploit the analogy between the floating-point (FP) representation and the representation-level distribution of the piecewise uniform quantizer (PWUQ) to assess the performance of FP-based solutions more thoroughly. We present theoretical derivations for the performance of the FP format, and of the PWUQ it determines, for input data from a Laplacian source, and we compare two selected 8-bit FP-based PWUQs. Beyond the typical evaluation of an FP format through the accuracy degradation it causes when applied to neural network compression, we also employ objective quantization measures. This approach offers insight into the robustness of these 8-bit FP-based solutions with respect to changes in the input variance, which matters in applications where the variance is not fixed. The results demonstrate that the allocation of bits between the exponent and the mantissa in the FP8 format can significantly impact overall performance.
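The analogy the abstract describes can be illustrated with a small numerical sketch: the nonnegative representable values of a generic FP format form the representation levels of a piecewise uniform quantizer, with uniform step size within each binade. The snippet below enumerates those levels for two candidate exponent/mantissa splits of an 8-bit format and estimates the signal-to-quantization-noise ratio (SQNR) for a unit-variance Laplacian source by Monte Carlo simulation. This is an illustrative sketch only: the level enumeration ignores special encodings (infinities, NaN), and the two splits shown are common FP8 variants assumed for illustration, not necessarily the exact formats analysed in the paper.

```python
import numpy as np

def fp_levels(e_bits, m_bits):
    """Nonnegative representable values of a generic FP format with
    e_bits exponent and m_bits mantissa bits (bias = 2**(e_bits-1) - 1).
    Special encodings (inf/NaN) are ignored for simplicity."""
    bias = 2 ** (e_bits - 1) - 1
    levels = [0.0]
    # Subnormals (exponent field 0): uniform steps in the lowest segment.
    for f in range(1, 2 ** m_bits):
        levels.append(f / 2 ** m_bits * 2.0 ** (1 - bias))
    # Normals: each exponent value E spans one uniform segment (binade).
    for E in range(1, 2 ** e_bits):
        for f in range(2 ** m_bits):
            levels.append((1 + f / 2 ** m_bits) * 2.0 ** (E - bias))
    return np.array(levels)

def quantize(x, levels):
    """Round each sample to the nearest level, extended symmetrically
    to negative inputs; overload maps to the largest level."""
    mag = np.abs(x)
    idx = np.searchsorted(levels, mag).clip(1, len(levels) - 1)
    lo, hi = levels[idx - 1], levels[idx]
    return np.sign(x) * np.where(mag - lo < hi - mag, lo, hi)

rng = np.random.default_rng(0)
x = rng.laplace(scale=1 / np.sqrt(2), size=200_000)  # unit-variance Laplacian
for e, m in [(4, 3), (5, 2)]:  # two assumed FP8 exponent/mantissa splits
    y = quantize(x, fp_levels(e, m))
    sqnr = 10 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2))
    print(f"E{e}M{m}: SQNR = {sqnr:.1f} dB")
```

Under this setup, the split with more mantissa bits yields finer steps within each segment, while the split with more exponent bits widens the dynamic range, which is the bit-allocation trade-off the abstract highlights.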

Published

2025-02-24

How to Cite

Nikolic, J. R., Peric, Z. H., Jovanovic, A. Z., Tomic, S. S., & Peric, S. Z. (2025). Performance Analysis of Two 8-Bit Floating-Point-based Piecewise Uniform Quantizers for a Laplacian Data Source. Elektronika Ir Elektrotechnika, 31(1), 56-61. https://doi.org/10.5755/j02.eie.37430

Section

TELECOMMUNICATIONS ENGINEERING
