Abstract—The aim of this paper and research was to analyse the efficiency of the compiler-generated code for the graphics library and to present results obtained by optimization for the MIPS (Million Instructions Per Second) architecture. Libpng is the official Portable Network Graphics reference library for use in applications that read, create, and manipulate PNG (Portable Network Graphics) raster image files. Given the data structure in the PNG files, as well as the capabilities of the MIPS instruction set, it was expected that significant improvements could be made. Graphic library libpng is optimized by using MIPS instruction set extension and tested on MIPS Malta 74K platform. Test results show that by using MIPS optimization test, execution times are substantially improved. Our libpng optimization has achieved performance increase of 10 %–78 % depending on optimized routine.

Index Terms—DSP; Embedded software; Image processing; MIPS; Optimization; PNG; SIMD.

I. INTRODUCTION

The field of multimedia communications is evolving at a high speed and is conditioned by the introduction of new video and audio standards in new multimedia electronic devices. An important requirement for the end user is that multimedia electronic devices operate at a satisfactory speed in real-time. As new technologies require the increasing processing power of computer architectures, which is to some extent limited, it is necessary to significantly optimize software support on a given computer architecture to enable real-time operation.

Embedded system capabilities are a function of both the hardware platform and the software implementation strategy [1], [2]. When there are plenty of hardware resources with higher processing speeds and power like desktop computers, achieving real-time performance is not a major issue. Today, there has been an exponential growth in the use of mobile embedded systems like cell phones, tablets, video game console, etc. in the consumer electronics market, which have a relatively small power consumption and form factor. Multiple applications like video streaming, smart cameras, context-aware computing, and virtual reality are resident on these devices to meet consumer needs.

The aim of this paper and research is to analyse the efficiency of the compiler-generated code for the PNG graphics library and to present results obtained by optimization for the MIPS (Million Instructions Per Second) architecture for real-time operation.

Portable Network Graphics (PNG) is an undistorted and bitmapped image format that employs lossless data compression. PNG was created to improve upon and replace GIF (Graphics Interchange Format) as an image-file format not requiring a patent license [3].

The official graphic library that is used in applications for PNG image processing is libpng. This library has been designed to handle multiple sessions at one time, to be easily modifiable, to be portable to the vast majority of machines (16-, 32-, and 64-bit) available, and to be easy to use.

This graphic library supports almost all of PNGs features:

- indexed-color,
- grayscale images,
- truecolor images,
- optional alpha channel,
- sample depths range from 1 bit to 16 bits.

This graphic library is dependent on zlib library for data compression and decompression routines [4].

By analysing the assembler code obtained with the compiler, we found that it is possible to improve the performance by writing the assembler manually. Libpng has significant potential for SIMD (Single Instruction Multiple Data) acceleration due to the type of data and operations performed during image processing.

MIPS 74K core is a MIPS high-performance processor core and includes features that can support applications with significant signal processing requirements. These processors can be found in set-top boxes, VoIP SoCs (Voice over IP System on a Chip), networking equipment, digital TVs, DVD players and recorders, and in applications that use signal processing. MIPS 74K core operates at up to 1.11 GHz in a 65 nm process and supports MIPS DSP (Digital Signal Processing) oriented instruction set called DSP ASE (Application Specific Extension) Revision 2 that uses SIMD extensions to obtaining vectorised execution. [5].

Manuscript received 13 November, 2019; accepted 20 February, 2020.

This work has been partially funded by the Ministry of Education, Science and Technological Development of the Republic of Serbia under the grant (No. TR32014).
The purpose of DSP ASE instruction is to speed up algorithms in areas such as audio and video compression, image processing, communications, etc.

SIMD extensions to the embedded processors have come to support the increasing requirements by providing data parallelism. This extension improves the performance for multimedia applications, audio and video codecs, image processing, etc.

The paper is organized as follows. MIPS architecture overview is presented in Section II. The overview of MIPS32 instruction set is presented in Section III. Libpng optimization is described in Section IV. Finally, implementation results are given in Section V, and conclusions in Section VI. After these sections, we listed all references that we used in this paper. To the best of our knowledge, there are no works on the libpng topic and there are no works or results of libpng optimization. In the absence of literature and papers related to the libpng graphics library, we used references related to embedded systems, PNG optimizations, MIPS specifications, and manuals for DSPs and instruction sets.

II. MIPS ARCHITECTURE

The MIPS32 74K core is the first family of synthesizable 32-bit RISC (Reduced Instruction Set Computing) CPU (Central Processing Unit) cores and offers the highest performance yet from a synthesizable core. All 74K family cores implement the MIPS32 Release 2 instruction set architecture and supports the DSP ASE Revision 2 instruction set extensions. This platform has the following options:

- 1 and D-Caches: 4-way set associative: 0 Kbytes, 8 Kbytes, 16 Kbytes, 32 Kbytes or 64 Kbytes in size;
- L2 (secondary) cache: 128 Kbytes–1 Mbyte in size;
- Fast multiplier: 1-per-clock repeat rate for 32×32 multiply and multiply/accumulate;
- DSP ASE Revision 2: this instruction set extension adds many new computational instructions with a fixed-point math unit crafted to speed up popular signal-processing algorithms. Some of these functions do two math operations at once on two 16-bit values held in one 32-bit register;
- Floating Point Unit (FPU): if fitted, this is a 64-bit unit, which most often runs at half or two-thirds the clock rate of the integer unit;
- The “CorExtend” instruction set extension: defines a hardware interface, which makes it relatively straightforward to add logic to implement new computational instructions in your CPU using predefined instruction encodings.

The MIPS 74K core contains a load/store unit and separate data path, and can execute up to two instructions in parallel or up to four instructions - two floating point and two integers. These units share thirty-two 32-bit GPR and four 64-bit accumulators. The data path contains a 32-bit ALU (Arithmetic Logic) and MDU (Multiply/Divide) unit [6].

Figure 1 shows a block diagram of the MIPS 74K core.

III. OVERVIEW OF MIPS32 INSTRUCTION SET

The MIPS 74K core supports the baseline MIPS32 instruction set and two instruction set extensions:

- DSP ASE Revision 2;
- MIPS 16e for 16-bit compressed instruction set.

In this paper, we will focus on DSP ASE Revision 2. This instruction set to the MIPS32 architecture was introduced by MIPS Technologies to optimize the performance of signal processing and multimedia applications running on MIPS core processors. The DSP ASE is a set of new instructions and new architectural state with computational support for fractional data types, SIMD, saturation, and other operations commonly used in DSP applications [7].

The MIPS DSP ASE Revision 2 provides support for a number of powerful data processing operations. There are instructions for fractional arithmetic (Q15/Q31) and for saturating arithmetic. Additionally, for smaller data sizes, SIMD operations are supported allowing 2 × 16 bit or 4 × 8 bit operations to occur simultaneously. Many of the DSP ASE instructions work in SIMD mode and perform certain operations simultaneously on several data packed into 32-bit vectors [5]. Instructions operating on vectors can be recognized because the name includes .ph (paired-half, usually signed, and often fractional) or .qb (quad-byte, always unsigned, only occasionally fractional) [8]. These instructions automatically perform additional operations such as rounding and saturation, which without them requires a significant number of processor cycles for implementation.

Another feature of the ASE is the inclusion of additional HI/LO accumulator registers to improve the parallelization
of independent accumulation routines. All 32-bit operand arithmetic DSP instructions (except multiply) are executed in the ALU pipe, while the 64-bit operand arithmetic and multiply class DSP instructions are executed in the MDU pipe.

New instructions in DSP ASE Revision 2 help in minimizing the data packing, unpacking, shuffling, etc. That is required for SIMD arithmetic. The MIPS32 DSP ASE instructions can be classified depending on the implemented function [3]:

- **Arithmetic Class** - perform basic arithmetic operations, such as addition and subtraction (ADDQ.PH/QB and SUB.PH/QB), as well as some more specialized operations like data packing/unpacking, absolute value (ABSQ), horizontal sum of the source register data elements (RADDU), circular buffers (MODSUB), etc.;
- **Shift Class** - implement logical and arithmetic shift of the individual SIMD data elements that are packed into 32-bit registers (SHL/R.QB/PH). The shift amount can be variable (specified by a register) or fixed. In addition, there is saturation option in the left shift instruction (SHL.S.PH/QB);
- **Multiply Class** - include instructions for integer and fractional element-wise multiplication (MULEU.S.PH and MULQ_RS.PH), dot product accumulate and subtract (DPAU, DPSU, DPAQ_S, DPSQ_S), and compute the real and imaginary parts of a complex multiply (MULSAQ.S.W.PH and DPAQ.S.W.PH) accumulating the results to a specified accumulator;
- **Bit Manipulation Class** - include instruction that reverses the order of the 16 right-most bits in the source register (BITREV), as well as the data replication instructions REPL that copy a given scalar value to all the SIMD elements of the destination register. There is also the INSV instruction that support variable position and size values;
- **Compare and Pick Class** - include instruction that performs SIMD (element-wise) comparison of the data in the source registers (CMP.L.T.PH), as well as instruction that selects elements from two registers based on the result of the comparison (PICK.QB and PICK.PH);
- **Accumulator and DSPControl Register Access Class** - contains the EXTR instruction that extract values from the accumulator after shifting it to the right, as well as the EXTP instruction extracts a number of bits from a specified position. Instructions WRDSP and RDDSP transfer data between the general-purpose registers and the DSPControl Register;
- **Indexed Load Class** - add a new addressing mode for loading integer and fixed-point data in the form of base + index. The three variants, LBUX, LHX, and LWX, load unsigned bytes, half-words, and 32-bit words, respectively.

IV. **LIBPNG OPTIMIZATION**

One of the problems of embedded system design is how to make software efficiently run on the used processor. Generally, software optimization has to be done for optimal system performance by matching the program code with the processor [9], [10]. During the process of code optimization, we have to take the architecture of the target processor, pipeline, compiler features, and features of instruction set into account for effective execution on target processor.

First phase in our work was to find the bottlenecks of the PNG graphic library. This involves analysing of the existing code, data structure, profiling analysis, and executing all existing tests before optimizing the code. We used following steps to do graphic library optimization.

A. **Software Analysis**

Firstly, we did software module partition and performance estimation in C. This step can help to understand the program structure and to find optimization goals in higher-level language. In this step, we examine the underlying data types that the program operates. How many bits are there in a single data element? Can multiple data elements be packed into a 32-bit register for SIMD data types?

B. **Profiling**

Performance analysis, better known as profiling, is an examination of program behavior using information gathered during program execution itself. This analysis identifies parts of the program that are suitable for optimization both of execution speed and memory usage. It is necessary to know, which parts of the software solution are most often executed, because there is no point to do optimization of routines that are rarely executed. Performance analysis shows the number of processor cycles required for each function, thus calculating its occupancy.

We used Opfile tool for profiling analysis and to find functions that take the most time across the whole system. Opfile is an open source project that includes a statistical profiler for Linux systems capable of profiling all running code at low overhead. Profile data can be produced on the function-level or instruction-level detail. Source trees annotated with profile information can be created. That enables collection of various low-level data and association with particular sections of code [11].

Profiling was done for original source code and for all test cases developed by the authors of the libpng graphics library. Some of results are presented in Table I, where a list of routines suitable for the optimization is given. The number of cycles and the percentage of function share in test case for original source code of the graphic library follow the function name. Table I shows functions relevant only to libpng graphic library.

Profiling results showed that there were three computationally intensive tasks when reading PNG files: decoding row filters, expanding interlacing, and combining interlaced or transparent row data with previous row data.

In addition, we analysed all of routines to determine if optimization could be done for each one. The results show that it is not possible to optimize all of libpng routines, so we opted for partial optimization of problematic routines.

We have selected routines that we will optimize based on the operations they perform and the types of data. Each routine, where it was possible to vectorize operations and pack more data into a single 32-bit register, was chosen to be optimized.
TABLE I. PROFILING RESULTS FOR ORIGINAL SOURCE CODE - LIST OF ROUTINES THAT COULD BE OPTIMIZED WITH NUMBER OF PROCESSOR CYCLES AND CPU OCCUPANCY.

<table>
<thead>
<tr>
<th>Routine</th>
<th>Samples</th>
<th>%</th>
</tr>
</thead>
<tbody>
<tr>
<td>png_write_find_filter</td>
<td>179509</td>
<td>85.39</td>
</tr>
<tr>
<td>png_read_filter_row_paeth_multibyte_pixel</td>
<td>150527</td>
<td>74.01</td>
</tr>
<tr>
<td>png_do_read_interlace</td>
<td>32169</td>
<td>25.07</td>
</tr>
<tr>
<td>png_combine_row</td>
<td>28563</td>
<td>19.82</td>
</tr>
<tr>
<td>png_do_chop</td>
<td>2369</td>
<td>9.53</td>
</tr>
<tr>
<td>png_read_filter_row_sub</td>
<td>2203</td>
<td>9.11</td>
</tr>
<tr>
<td>png_read_filter_row_avg</td>
<td>1756</td>
<td>6.97</td>
</tr>
<tr>
<td>png_read_filter_row_up</td>
<td>1637</td>
<td>6.09</td>
</tr>
<tr>
<td>png_do_scale_16_to_8</td>
<td>1223</td>
<td>5.66</td>
</tr>
<tr>
<td>png_do_gamma</td>
<td>657</td>
<td>2.84</td>
</tr>
</tbody>
</table>

C. Compiler Analysis

A new compiler that enables code translation for that architecture accompanies the new architecture on the market. Initially, first compiler versions were not perfect in the sense that it failed to produce the most optimal code, so assistance by programmers and hand-coding assembler writing are required in some parts of the routine. Over time, the compiler evolves and improves, and with each successive version, the code that translates is better and more optimized.

In this work, we used GCC (GNU Compiler Collection) compiler 4.5.1. [12], on Linux operating system to translate the libpng graphic library and run on MIPS architecture. This compiler and GNU Assembler provide options –mdsp and –mdspr2 that enable the availability of Revision 1 or Revision 2 MIPS DSP ASE instructions, respectively, for code generation.

There are few methods [13] of utilizing the MIPS DSP ASE that are supported by the GCC GNU Assembler (GAS) [14] for MIPS cores:

- Hand-coding in assembly language;
- Hand-coding in inline assembly;
- ASM-macros;
- Intrinsics;
- Fixed-point data types and operators in C;
- Auto-vectorization.

Hand-coding in assembly language is the most time-consuming method, but can produce code with the highest performance.

Firstly, we tried to use auto-vectorization. It is necessary to activate automatic vectorization using a specific compilation flags (-mdspr2 -O3 -EL). The advantage of auto-vectorization is that the compiler can recognize scalar variables in order to utilize SIMD instructions automatically. Unfortunately, this method failed to get good results for all routines, so we decided for handwriting inline assembly.

D. Inline Assembly

Using the inline assembly can reduce the number of instructions required to be executed by the processor. Also, it is important because of its ability to operate and make its output visible on C/C++ variables. With inline assembly, it is possible to do partial optimization for speed-critical sections of code.

Before writing the inline assembler, it is necessary to analyse all original routines that will be optimized. It is generally necessary to make modifications to the program structure, data structure, as well as to do loop unrolling [9] in order to prepare the original code for vectorization and optimizations. For example, we had to carry out in parallel on operands packed into a single 32-bit general-purpose register for SIMD type operations that could be 2x16-bit or 4x8-bit operations.

E. Assembly Tips & Rules

In order to do better optimizations of routines, a few specific and important tips and rules need to be followed:

Instruction’s latencies. It is necessary to know the basic latencies of instructions as executed by the 74K core, i.e. how many cycles later can be issued as a dependent instruction. For these purposes, there are four classes of instructions [9]:

- Simple ALU instructions - run in 1 cycle. This includes bitwise logical instructions, mov (an alias for addu with $0), shifts up to 8 positions down or up, test-and-set instructions, and sign-extend instructions;
- Simple DSP ASE operations – 2 cycle latency. The same as most regular MIPS32 arithmetic without multiply and saturation;
- DSP instructions, which feature saturation or rounding - three-cycle latency;
- Special DSP multiply - 6–9 cycle latency.

Out-of-order execution. The 74K core supports out-of-order instruction execution, which can automatically reorder instructions to more efficiently pair up instructions for parallel execution and help hide instruction and cache latencies [10].

Loads and load-to-use delay - MIPS CPUs cannot deliver loaded data to the immediately following instruction without a delay. Typically, data is delivered one clock later, so we can try to put some useful and non-dependent instruction between the load and its first use.

Branch delay slot instruction. The MIPS architecture defines that the instruction following a branch (the “branch delay slot” instruction) is always executed. Also, there are special forms of branches called “branch likely”, which execute the branch delay slot instruction only when the branch is taken.

All of these tips and rules were used to reduce program and data space and to rearrange data space by manually modification of the assembly code.

F. Optimization

Our goal was to reduce the number of memory accesses and to do as many operations as possible in parallel to get the best results. A reduced number of accesses to memory is derived by taking 32-bit value from memory instead of taking 8-bit or 16-bit values, which are then computed within a 32-bit value. Compared to the original software solution, they avoid unnecessary memory access up to eight times within a single loop pass.

We analysed the original source code with an Objdump
tool [14] that generates assembler code from the source C code. Objdump display information from object files. In this step, we identified where the delays occur in executing the instructions. These delays are most often caused by the interdependence between two consecutive instructions. Namely, if the execution time of an instruction is more than one cycle (for example, multiplication operations) and the result is used in the next instruction, the flow structure is stopped until the result of the first instruction becomes available. Such problems are solved by simply rearranging the order of execution of instructions, i.e. by inserting some independent instructions into the flow structure.

In addition, instructions with the greatest delay were placed at the beginning of the assembler block, and after these instructions, we set other instructions that could be executed in parallel.

The loading of coefficients and constants is done outside the loop avoiding additional memory access, which occurs in the original software solution.

The processing of most routines is based on computing with 8-bit data. Due to out of range, it is not possible to use DSP ASE instructions over 8-bit data sets. That is why 8-bit data are placed within each 16-bit data of one 32-bit registry.

DSP ASE instructions were used over this data, which made parallelization possible. In particular, the PRECEU.PH.QBR instruction was used for these situations.

In digital signal processing, processing is generally performed on larger data sets cyclically. There are a number of iterations of the same operations, where only the input data is changed. In this kind of processing, it is possible to parallelize the processing of several iterations simultaneously. This method is called “loop unrolling”, and thus avoids the delay caused by memory access. The unrolled loop assumes that the number of iterations of the original loop is multiple of the unroll factor. For example, if a loop is unrolled four times (4x), the number of iterations in the original loop has to be a multiple of four. This is usually not worrisome and the code can be written to work correctly for any number of iterations. Sometimes, it is easiest to just produce a few extra don’t care values at the tail-end of the unrolled loop; just to ensure that the data buffers are large enough to hold all the output values from the unrolled loop [15].

In the following example, in Fig. 2 and Fig. 3, we present loop unrolling and some optimization details for routine png_read_filter_row_paeth_multibyte_pixel.

```
1  int bpp = (row_info->pixel_depth + 7) >> 3;
2  png_bytep rp_end = row + bpp;
3  while (row < rp_end)
4    {
5      int a = *row + *prev_row++;
6      *row++ = (png_byte)a;
7      rp_end += row_info->rowbytes - bpp;
8    }
9  while (row < rp_end)
10    {
11      int a, b, c, pa, pb, pc, p;
12      c = *(prev_row - bpp);
13      a = *(row - bpp);
14      b = *prev_row++;
15      p = b - c;
16      pc = a - c;
17      pa = p < 0 ? -p : p;
18      pb = pc < 0 ? -pc : pc;
19      pc = (p + pc) < 0 ? -(p + pc) : p + pc;
20      if (pb < pa) pa = pb, a = b;
21      if (pc < pa) a = c;
22      c = b;
23      a += *row;
24      *row++ = (png_byte)a;
25    }
```

Fig. 2. Original C code for png_read_filter_row_paeth_multibyte_pixel routine.

Figure 2 shows original C code for png_read_filter_row_paeth_multibyte_pixel routine. There are two while loops in original code: first for reading each row data and second for data processing. Reading of each row data was done by taking 8-bit values from memory (Fig. 2 - Line 6). So, there are too many accesses to memory. We reduced number of accesses to memory by taking 32-bit value from memory (Fig. 3 - Line 9).

Original code uses bpp variable to calculate the end of row. This value is the number of bytes per pixel and could have only the value of? 3 or 4. We certainly do 4 pixel processing, so we avoided this calculation by using parallelization.

Also, the data processing in the second while loop worked with 8-bit values in original code. In our optimization, we used SIMD instructions to process 4 bytes in parallel, whenever it was possible. All MIPS DSP ASE instructions with .qb in names (Figure 3 - Lines 6, 14–17, etc.) use 4 bytes processing in parallel. Due to out of range for some processing, we had to place 8-bit data within each 16-bit data of one 32-bit registry. As we already mentioned, we used instruction PRECEU.PH.QBR for this purpose.

After this data packing, we had to use instructions that process 2 bytes in parallel. These instructions have .ph in
names (Figure 3 - Lines 23–44, etc.). Also, whenever it was possible, we set other instructions after the instruction with greatest delay.

After optimization, the number of instructions in the original and optimized code was compared. Almost all routines are optimized to handle 4 pixels in one pass, so the number of instructions of optimized code compared to the original code is greatly reduced.

```c
unsigned int rowbytes = row_info->rowbytes;
__asm__ volatile {
"lw" $t0, 0($row)
"lw" $t1, 0($prev_row)
"addu.qb" $t2, $t0, $t1
"addiu" $t1, %[rowbytes], -4
"addu" $t0, %[row], $t1 /* t0 = end address */
"sw" $t2, 0($row)
"li" $t1, 1
"lw" $t1, 0($prev_row) /* t1 = c */
"lw" $t2, 4($prev_row) /* t2 = b */
"lw" $t3, 0($row) /* t3 = a */
"subu.s.qb" $t4, $t2, $t1 /* p = b - c */
"subu.s.qb" $t5, $t1, $t2 /* t5 = c - b */
"subu.s.qb" $t7, $t3, $t1 /* pc = a - c */
"subu.s.qb" $t6, $t1, $t3 /* t6 = c - a */
"or" $t4, $t4, $t5 /* t4 = abs(p) [pa] */
"or" $t5, $t5, $t6 /* t5 = abs[pc] [pb] */
"cmpeq.li.qb" $t5, $t4 /* if (pb < pa) */
"pick.qb" $t4, $t5, $t4 /* t4 = pa */
"pick.qb" $t8, $t2, $t3 /* a ? b : a */
"preceu.ph.qbr" $t5, $t3 /* */
"preceu.ph.qbr" $t6, $t2 /* */
"preceu.ph.qbr" $t9, $t1 /* */
"addq.ph" $t6, $t5, $t6 /* t6 = lo(a+b) */
"shll.ph" $t5, $t9, 1 /* t5 = lo(2*c) */
"preceu.ph.qbr" $t7, $t4 /* */
"subq.ph" $t5, $t6, $t5 /* */
"preceu.ph.qbl" $t6, $t3 /* */
"absq.s.ph" $t5, $t5 /* */
"cmpeq.li.ph" $t5, $t5 /* t5 = abs(lo(pc)) */
"preceu.ph.qbr" $t3, $t8 /* */
"preceu.ph.qbl" $t7, $t2 /* */
"pick.ph" $t5, $t9, $t3 /* */
"preceu.ph.qbl" $t9, $t1 /* */
"addq.ph" $t7, $t6, $t7 /* */
"shll.ph" $t6, $t9, 1 /* */
"preceu.ph.qbl" $t3, $t4 /* */
"subq.ph" $t6, $t7, $t6 /* */
"addiu" $t6, $t6, $t6 /* */
"absq.s.ph" $t5, $t5 /* */
"cmpeq.li.ph" $t5, $t7 /* t7 = lo(a) */
"preceu.ph.qbr" $t7, $t8 /* */
"preceu.ph.qbl" $t7, $t2 /* */
"pick.ph" $t5, $t9, $t3 /* */
"preceu.ph.qbl" $t9, $t1 /* */
"addq.ph" $t7, $t6, $t7 /* */
"shll.ph" $t6, $t9, 1 /* */
"preceu.ph.qbl" $t3, $t4 /* */
"subq.ph" $t6, $t7, $t6 /* */
"addiq" $t6, $t6, $t6 /* t6 = abs(hi(pc)) */
"cmpeq.li.qb" $t5, $t7 /* t6 = abs(hi) */
"lw" $t3, 4($row) /* */
"pick.ph" $t6, $t9, $t7 /* */
"precq.ph.qb" $t8, $t6, $t5 /* */
"addiu" $t5, $t6, $t5 /* */
"addu.qb" $t8, $t8, $t3 /* */
"bne" $t0, %[row], 1b /* */
"sw" $t8, 0($row) /* */
"3: /* */
};
```

Fig. 3. MIPS DSP ASE optimization for png_read_filter_row_paeth_multibyte_pixel routine.

V. EXPERIMENTAL RESULTS

The results of the optimizations were obtained by measuring the performance of the graphics library using test programs developed by the authors of the libpng graphics library. The testing was performed on the MIPS 74K Malta development platform with an integrated Linux operating system.

Tests generally do conversions between PNG and other image formats. To run existing tests, it is advisable to use images required by the authors of the libpng graphics library, which can be found on the libpng official website. These existing tests check output’s bit-exactness after images processing and have time execution measurement. Time measurement is done in terms of second (s).

Before using existing libpng tests, we created our standalone tests to verify bit-exactness and acceleration for all routines. These tests have shown that optimized routines give bit-exactness outputs for same inputs and acceleration of about from 35% to 80% depending on the optimized routine. It was expected that these standalone tests would perform better results than the existing tests, because these
Tests measure time and number of cycles for each one optimized routine separately. These tests use the same input data from the existing tests. Each optimized routine has a separate test, in which we first check bit-exactness of the output relative to the reference code, and then the processing time is measured in milliseconds (ms). By comparing results, we can see a significant improvement for each optimized routine as shown in Table II.

**TABLE II. STANDALONE TEST RESULTS.**

<table>
<thead>
<tr>
<th>Optimized routine</th>
<th>ref [ms]</th>
<th>opt [ms]</th>
<th>gain [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>png_read_filter_row_sub</td>
<td>1604</td>
<td>645</td>
<td>59.79</td>
</tr>
<tr>
<td>png_read_filter_row_avg</td>
<td>2177</td>
<td>783</td>
<td>64.03</td>
</tr>
<tr>
<td>png_read_filter_row_up</td>
<td>1587</td>
<td>782</td>
<td>50.72</td>
</tr>
<tr>
<td>png_read_filter_row_paeth_multibyte_pixel</td>
<td>2135</td>
<td>893</td>
<td>58.17</td>
</tr>
<tr>
<td>png_do_gamma</td>
<td>1785</td>
<td>645</td>
<td>63.87</td>
</tr>
<tr>
<td>png_do_scale_16_to_8</td>
<td>1169</td>
<td>756</td>
<td>35.33</td>
</tr>
<tr>
<td>png_do_read_interlace</td>
<td>2258</td>
<td>786</td>
<td>65.19</td>
</tr>
<tr>
<td>png_combine_row</td>
<td>1763</td>
<td>961</td>
<td>45.49</td>
</tr>
<tr>
<td>png_write_find_filter</td>
<td>2581</td>
<td>511</td>
<td>80.20</td>
</tr>
<tr>
<td></td>
<td>1884</td>
<td>1024</td>
<td>45.65</td>
</tr>
</tbody>
</table>

Names of optimized routines are given in the first column followed by the results for original source code (ref) and for optimized routines (opt) that were measured in milliseconds. The last column presents gain in percent.

After these standalone measurements, we have done profiling measurements with **OProfile** tool for all **libpng** tests, with and without optimized routines. Table III shows results of reducing the cycle consumption of each function separately for both original and optimized routines. First column presents names of optimized routines followed by the test that is executed. After that, we presented results for original source code (ref) and for optimized routines (opt). Finally, the last column shows gain in percent. All optimized routines have achieved performance increase as we expected Table III.

Results in Table III show that the percentage of functions in the graphics library has changed significantly.

Routines **png_read_filter_row_paeth_multibyte_pixel** and **png_write_find_filter** had the highest CPU time and their acceleration increased by 25%–50% and about 15%, respectively, depending on the test case.

Also, routines **png_do_chop** (acceleration about 75%) and **png_do_read_interlace** (acceleration about 65%) have the greatest acceleration, probably because of the large number of constants, additions and shifts used in their original processing that we have been able to improve using all of the above techniques.

Last step was to measure execution time for all tests with and without optimized routines. These tests showed satisfactory results also Table IV.

Results in Table IV show the total gain of all graphics library optimizations. Results depend on the input images and the processing performed in the tests. The best results are shown by the **pngtopng** test (acceleration about 25%), because this test uses almost all optimized routines.

The overall gain is lower than the gain of individual functions and profiling results due to compression and decompression routines from the **zlib** library that consume most of the CPU time and CPU cycles in the existing tests.
VI. CONCLUSIONS

In this paper, we have proposed embedded software code optimization of the graphic PNG library libpng on MIPS32 platform.

The libpng library optimizations for the MIPS architecture described in this paper show satisfactory results. Further enhancement of the execution speed of the PNG image processing algorithm on the MIPS platform can be achieved by introducing support for the MSA (MIPS SIMD Architecture) extension of the MIPS instruction set. MIPS MSA implements 128-bit wide vector registers that significantly increases the possibility of parallelization. In the meantime, we worked on optimization for the zlib library, but did not get satisfactory results for the MIPS DSP ASE instruction set. It is possible that the MSA instruction set would produce much better optimization results for the zlib library, and therefore the libpng graphic library that relies directly on zlib.

CONFLICTS OF INTEREST

The authors declare that they have no conflicts of interest.

REFERENCES

[9] Programming the MIPS32® 74K™ Core, Revision 02.13, June 04.2010, MIPS Technologies, Inc. 955 East Arques Avenue, Sunnyvale, CA 94085-4521.