Power and Energy Consumption Models for Embedded Applications

Abstract: This paper describes a study of the power and energy consumption estimation models that have been defined to facilitate the development of ultra-low power embedded applications. During the study, various measurements were carried out at the instruction and application levels to challenge the models against empirical data. The study was performed on a multicore heterogeneous hardware platform developed for ultra-low power Digital Signal Processor (DSP) applications. The final goal was to develop a tool that can provide insight into power dissipation during the execution of embedded applications, so that one can refactor the source code in an energy-efficient manner, or ideally to develop an energy-aware C compiler. As a side effect, the research offers interesting insight into how the custom hardware architecture influences power dissipation. The platform was chosen because it represents state-of-the-art R&D ultra-low power hardware used in hearing aids. The presented solution has been developed and tested in an Eclipse environment using the Java programming language.


I. INTRODUCTION
Energy consumption represents one of the key properties of embedded devices, especially for gadgets that run on batteries. Optimal power dissipation leads to greater autonomy, making a device more competitive on the market, in addition to making the product more "green". The problem addressed by this research is how to develop energy-efficient embedded software solutions for digital signal processor (DSP) hardware platforms with ultra-low power consumption. This paper is a follow-up to the studies published by the authors in [1]-[4]; readers who would like more background on the research are encouraged to read these studies first.
To create an optimal software solution, energy- and performance-wise, one needs to have a clear insight into the execution flow and its influence on overall power dissipation. This statement was taken as a starting point for the research. More generally, software tools like compilers, assemblers, profilers, etc., or Integrated Development Environments that comprise all the above, facilitate development of optimal solutions and provide developers with an abstraction of the system. In most cases, this helps, but when it comes to fine-tuning of the system, one needs to have a direct connection to the hardware. During this research, such a connection has been made. The entire vertical has been observed, from the highest level of abstraction, such as a C or assembler instruction, down to the complementary metal-oxide-semiconductor (CMOS) transistors that are engaged in its execution and the associated energy footprint. The key challenges of the research were how to establish such a connection, which methodology to use, and how to measure accuracy against empirical data. This is the essence of this research. Different approaches have been used in related research, as described in the following section.
The practical goal of this research was to develop an energy estimation tool that can provide software engineers with information about the energy dissipation related to instruction set selection and source code organization, as well as core utilization. This information is used during the entire project lifecycle:
• In the early stage of the project (rapid prototyping), where engineers learn how the choice of core and clock cycle, instruction set selection, and source code structure can impact energy consumption;
• In the late project phase, when final tweaks and system optimization occur.
To reach this goal, it was first necessary to identify all key contributors and establish an appropriate methodology to measure energy consumption at the instruction level, the inter-instruction effect, static and dynamic power dissipation [2], [3], as well as the influence of different peripherals and cores on the overall energy balance. Section III provides a brief description of the target hardware platform on which these experiments have been performed.
Section IV of the paper describes dissipation components, such as static and dynamic. Section V describes the measurement methodology used.
The estimation models of overall energy consumption and average power dissipation are presented in Section VI. These models are derived from numerous experiments and are used as the core of the estimation tool. The models are fed with obtained data, and useful information is generated.
In Section VII, the models are put under scrutiny. The general idea was to take a typical DSP application, such as a finite impulse response (FIR) filter, and to compare the power dissipation measured on two different cores against the estimated values. Since the two cores have different architectures and instruction sets, it was interesting to perform a comparative analysis of the two implementations. To reduce differences in implementation quality between the two versions, the same experienced engineer developed both applications.
Section VIII draws some conclusions and provides some thoughts regarding future research.

II. RELATED PAPERS
The diversity and exponentially increasing number of low-power embedded systems that operate autonomously under a small battery inspired this research. A variety of different hardware solutions in most cases also implies different tooling, instruction set, pin layout, hardware resources, etc. The solution presented in this paper aims to provide universal methodology and estimation models regardless of the diversities of the systems mentioned above. In this section, similar solutions have been described and compared against ours.
Estimation models proposed by the authors in [5] introduce Hamming distance and weight of the instructions, instead of coping directly with inter-instruction effect to optimize measurements. Such a model would be highly inaccurate if applied to the hardware platform used in this research, since inter-instruction effect, in some cases, has a similar contribution as the base cost (single instruction energy footprint).
Some research, such as [6]-[8], approximated dynamic power dissipation as a uniform distribution over the entire instruction set. This approach can probably provide a good enough estimation for the overall power dissipation, but at the cycle level its accuracy depends heavily on the underlying architecture and on how much the base costs vary across the instruction set. The hardware platform used in this research has quite diverse instruction energy footprints; therefore, such an approach would not provide an accurate estimation.
The power and energy estimation models presented in [9] have been used during the hardware design process to optimize the system at the architectural level. This research, on the other hand, is more focused on the application level and optimizations that can be performed on the given hardware.
In [10], the prerequisite for the estimation model is hardware virtualization. Such an approach is simply inapplicable to the target platform used in this research, since a digital twin of the hardware is not available. Also, it has been shown in [10] and [11] that estimations based on real hardware measurements are more accurate than those from a simulated environment.
The research presented in [12] compares twenty-seven well-known programming languages to determine which one offers the best ratio between performance and energy. It was no surprise that the C language took the win where energy and time were the main objectives. It is worthwhile to emphasize that this kind of measurement highly depends on the implementation and the compiler used. C is also the language of choice in this research. One of the future objectives of the research presented in this paper is to feed a C compiler with measured values (base costs and inter-instruction effects) and to use this information during compilation (instruction selection and scheduling), thus making it an energy-aware compiler. Furthermore, in [12], the different influence of static and dynamic components on energy consumption is not explicitly considered, so no clear conclusions are drawn on how energy and time relate; in this paper, the two components are clearly separated as different contributors in the estimation models.
As mentioned above, this paper represents a continuation of the research published by the authors in [1]-[4]. In the first paper in the series [1], basic models and a general idea regarding energy and power estimations were presented. In this paper, the basic model from [1] has been extended and parameterized with effective capacity as the quantitative measure of dynamic dissipation, clock frequency, and power supply voltage. In [2]-[4], the focus was on dissipation components and measurement methodologies as essential ingredients of this study. This paper briefly recaps these in Sections IV and V, respectively, since they are important for the overall context. The main contribution of this paper is presented in Section VI, where the models of power and energy estimation are derived using the empirical data and methodologies described by the authors in [2] and [3]. Finally, this research validates not only the conclusions and methodologies presented by the authors in [1]-[4], but also the models derived in this paper, using a classic embedded algorithm, the Finite Impulse Response (FIR) filter.

III. DESCRIPTION OF TARGET PLATFORM
The block structure of the target DSP platform that has been used during the research is presented in Fig. 1.
The presented ultra-low power hardware platform uses a small battery as a power supply; therefore, any optimization of power dissipation influences device autonomy, one of the key properties.
The most interesting segments of the platform are five heterogeneous DSPs. Two DSPs are designed for accelerated numerical processing (naDSP), while the remaining three cores play the role of general-purpose DSPs (gpDSP). One of the three gpDSPs takes the role of a microcontroller (uC), which synchronizes and controls the entire system. All these DSPs, as well as the whole system in general, are designed to operate in a very low power consumption mode. Also, it is important to emphasize that the DSP pipeline structure has three consecutive phases: 1. Fetching instruction; 2. Decoding instruction; 3. Executing instruction. During each cycle, the current instruction is executed whilst the next one is fetched and decoded. This implies that only two adjacent instructions are involved in each cycle, so the pipeline effectively behaves as a two-stage pipeline.
In addition, the DSP platform hosts six different categories of peripherals: (ABE); 2. System - Clock and reset distribution block; 3. Input/Output (I/O) - I2C, Universal Asynchronous Receiver/Transmitter (UART), General-purpose input/output (GPIO), Touch switch, etc.; 4. Local Processing Unit (LPU) - Responsible for Direct Memory Access (DMA) transfer and for setting interrupt handlers, timers, watchdogs, and external address context; 5. Utility - Sine generator, traffic lights, mailboxes, and decompression blocks; 6. Wireless Data Module (WDM) system block. All peripherals can be turned on and off independently, so that power dissipation can be optimized depending on the use. The energy consumption of the peripherals has been included in the models described in Section VI.

IV. DISSIPATION COMPONENTS
There are two major dissipation components present in CMOS integrated circuits [13]: 1. Static dissipation; 2. Dynamic dissipation. Figure 2 depicts the relation between the two components and how power dissipation, energy consumption, and time relate. In Fig. 2, energy is represented as the area of a rectangle (light and dark areas). Static energy consumption (dark rectangle area) increases linearly over time, whilst dynamic energy (light rectangle area) remains constant. For power dissipation, the situation is the opposite: static power dissipation is constant over time, whereas dynamic power dissipation decreases as the execution time grows, because the same dynamic energy is spread over a longer period. These are important properties that were used during measurements and calculations.

A. Static Dissipation
Static power dissipation, also known as leakage dissipation, emerges as the sum of all leakage currents ($I_{leak}$) multiplied by the supply voltage ($V_{DD}$):
$$P_{stat} = V_{DD} \cdot \sum I_{leak}$$
Static energy consumption is obtained by multiplying the static power dissipation by the time $t$ during which energy was consumed:
$$E_{stat} = P_{stat} \cdot t$$

B. Dynamic Dissipation
Dynamic power dissipation arises during the transition from one logic state to another. The key property of dynamic power dissipation is the effective capacity ($C_{eff}$), which represents the capacity that is switched during the logic state change. Effective capacity is important since it can be treated as a constant instruction property and used to estimate instruction energy consumption at various clock frequencies ($f$):
$$P_{dyn} = C_{eff} \cdot V_{DD}^2 \cdot f$$
The dynamic energy consumed during one clock cycle then does not depend on the clock frequency:
$$E_{dyn} = \frac{P_{dyn}}{f} = C_{eff} \cdot V_{DD}^2$$
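The role of effective capacity as a frequency-independent instruction property can be illustrated numerically. The following is a minimal sketch, assuming the standard CMOS switching relation $P_{dyn} = C_{eff} \cdot V_{DD}^2 \cdot f$; the function names and all numeric values are hypothetical and are not measurements from the target platform.

```python
def dynamic_power(c_eff, v_dd, f):
    """Dynamic power in watts, assuming P_dyn = C_eff * V_DD^2 * f."""
    return c_eff * v_dd ** 2 * f


def dynamic_energy_per_cycle(c_eff, v_dd):
    """Dynamic energy in joules per clock cycle, assuming E_dyn = C_eff * V_DD^2."""
    return c_eff * v_dd ** 2


# Illustrative values only (assumptions, not data from the paper).
C_EFF = 20e-12  # effective capacity in farads
V_DD = 1.0      # supply voltage in volts

for f in (1e6, 2e6, 4e6):  # clock frequencies in hertz
    p = dynamic_power(C_EFF, V_DD, f)
    e = dynamic_energy_per_cycle(C_EFF, V_DD)
    print(f"f = {f:.0e} Hz: P_dyn = {p:.3e} W, E_dyn per cycle = {e:.3e} J")
```

Under these assumptions, the printed power grows with frequency while the per-cycle energy stays the same, which is why a single effective capacity per instruction is enough to estimate its cost at any clock setting.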

V. MEASUREMENT METHODOLOGY
It is important to emphasize that three distinct dissipation contributors were identified, each requiring a different measurement methodology: static dissipation, base instruction cost, and inter-instruction effect.

A. Static Dissipation
The methodology for empirical measurement of the static contribution is based on the previously explained property that static power dissipation ($P_{stat}$) does not depend on clock frequency, whereas the dynamic component ($P_{dyn}$) scales linearly with it (Fig. 3).
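One plausible way to exploit this property is to measure total power at several clock frequencies, fit a straight line, and read the static component off the intercept at zero frequency. The sketch below assumes that procedure; it is not necessarily the exact one used by the authors (see Fig. 3 and [2], [3]), and the sample values are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: clock frequency (Hz) and total power (W).
frequencies = [1e6, 2e6, 4e6, 8e6]
total_power = [2.5e-5, 4.5e-5, 8.5e-5, 1.65e-4]

slope, p_stat = fit_line(frequencies, total_power)

# The intercept approximates P_stat; the slope equals C_eff * V_DD^2.
print(f"estimated P_stat = {p_stat:.3e} W")
print(f"estimated C_eff * V_DD^2 = {slope:.3e} J per cycle")
```

The slope is a useful by-product: divided by $V_{DD}^2$ it gives the effective capacity of whatever code was running during the sweep.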

B. Base Instruction Costs
The term "Base Instruction Cost" is coined to express the isolated energy cost induced by a single instruction running on the chip. Figure 4 represents the source code built and deployed on the target hardware platform to measure the base cost of the instruction "SUB x1 b0 b0". The methodology is quite simple: the current is measured before ($I_{stat}$) and after ($I_M$) deployment and execution of the test code (Fig. 4). The calculated value ($I_B$) represents the base current, but the more important dynamic property is the base effective capacity ($C_{BEC}$), which follows from the dynamic dissipation relation defined in the previous section:
$$I_B = I_M - I_{stat}$$
$$C_{BEC} = \frac{I_B}{V_{DD} \cdot f}$$
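The base cost calculation can be sketched in a few lines. This example assumes that the base current is the measured current minus the leakage current, and that the effective capacity follows from $P = V_{DD} \cdot I = C \cdot V_{DD}^2 \cdot f$; the function names and the sample currents are hypothetical.

```python
def base_current(i_measured, i_stat):
    """Base current I_B: measured current minus leakage current (amperes)."""
    return i_measured - i_stat


def base_effective_capacity(i_base, v_dd, f):
    """Base effective capacity in farads.

    Assumes V_DD * I_B = C_BEC * V_DD^2 * f, hence C_BEC = I_B / (V_DD * f).
    """
    return i_base / (v_dd * f)


# Hypothetical readings for a loop that repeats a single instruction.
I_STAT = 5e-6   # current before the test code runs (A)
I_M = 30e-6     # current while the test code runs (A)
V_DD = 1.0      # supply voltage (V)
F_CLK = 1e6     # clock frequency (Hz)

i_b = base_current(I_M, I_STAT)
c_bec = base_effective_capacity(i_b, V_DD, F_CLK)
print(f"I_B = {i_b:.3e} A, C_BEC = {c_bec:.3e} F")
```

Storing $C_{BEC}$ rather than $I_B$ keeps the profile of each instruction valid across clock frequencies and supply voltages.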

C. Inter-instruction Effect
The inter-instruction effect represents additional energy that is consumed when two adjacent instructions with different operation (OP) codes are executed. The measurement methodology for this effect is the following. First, the base costs for both instructions must be determined ($I_{B1}$, $I_{B2}$), as well as the leakage current ($I_{stat}$). Then, the code from Fig. 5 is executed, and the measured value ($I_M$) consists of three main components: the inter-instruction effect ($I_{IIE}$), the leakage current, and the mean base cost current:
$$I_M = I_{stat} + \frac{I_{B1} + I_{B2}}{2} + I_{IIE}$$
Again, since this effect belongs to the scope of dynamic power dissipation, the main property that needs to be calculated is the inter-instruction effective capacity $C_{IIEC}$, based on the base cost capacities ($C_{BEC1}$, $C_{BEC2}$) of the two adjacent instructions whose effect is being measured, and the other elements already mentioned:
$$C_{IIEC} = \frac{I_M - I_{stat}}{V_{DD} \cdot f} - \frac{C_{BEC1} + C_{BEC2}}{2}$$
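The decomposition of the measured current into leakage, mean base cost, and inter-instruction effect can be expressed as a short calculation. This is a sketch under the assumption that the test loop alternates the two instructions evenly, so that the mean of their base capacities is subtracted; names and values are hypothetical.

```python
def inter_instruction_capacity(i_measured, i_stat, c_bec1, c_bec2, v_dd, f):
    """Inter-instruction effective capacity C_IIEC in farads.

    Assumes the total dynamic capacity is (I_M - I_stat) / (V_DD * f) and
    that the two instructions alternate, contributing their mean base capacity.
    """
    c_total = (i_measured - i_stat) / (v_dd * f)
    return c_total - (c_bec1 + c_bec2) / 2


# Hypothetical values for a loop alternating two different instructions.
I_STAT = 5e-6     # leakage current (A)
I_M = 40e-6       # current measured while the loop runs (A)
C_BEC1 = 25e-12   # base effective capacity of instruction 1 (F)
C_BEC2 = 35e-12   # base effective capacity of instruction 2 (F)
V_DD = 1.0        # supply voltage (V)
F_CLK = 1e6       # clock frequency (Hz)

c_iiec = inter_instruction_capacity(I_M, I_STAT, C_BEC1, C_BEC2, V_DD, F_CLK)
print(f"C_IIEC = {c_iiec:.3e} F")
```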

VI. POWER AND ENERGY ESTIMATION MODELS
Estimation models represent the essence and epilogue of this research. In this section, two estimation models will be derived: Estimation model for mean power dissipation and Estimation model for overall energy consumption.

A. Estimation Model for Mean Power Dissipation
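The mean power dissipation $P_n$ can be defined as the arithmetic mean of the power dissipation $P_{c(k)}$ caused by each individual clock cycle $k$ of the $n$ cycles executed during the observed period:
$$P_n = \frac{1}{n} \sum_{k=1}^{n} P_{c(k)}$$
If there are $N$ peripherals present in the system, then the overall power dissipation caused by peripherals, $P_{Peripherals}$, can be calculated as the sum of all individual dissipations $P_{p(i)}$, where $C_{p(i)}$ represents the effective capacity of the $i$-th peripheral:
$$P_{Peripherals} = \sum_{i=1}^{N} P_{p(i)} = V_{DD}^2 \cdot f \cdot \sum_{i=1}^{N} C_{p(i)}$$
Similarly, the power dissipation induced by core execution, $P_{MCore}$, can be defined as the sum of the power dissipations $P_{DSP(j)}$ caused by all $K$ active cores, where each core contributes through the base effective capacity $C_{B(j)}$ and the inter-instruction effective capacity $C_{I(j)}$ of the instruction it executes:
$$P_{MCore} = \sum_{j=1}^{K} P_{DSP(j)}$$
$$P_{DSP(j)} = \left( C_{B(j)} + C_{I(j)} \right) \cdot V_{DD}^2 \cdot f$$
Combining (20) and (21), dynamic power dissipation $P_{MCore}$ is derived in (22):
$$P_{MCore} = V_{DD}^2 \cdot f \cdot \sum_{j=1}^{K} \left( C_{B(j)} + C_{I(j)} \right) \qquad (22)$$
Now, all individual contributors are defined: static dissipation in (2), dynamic dissipation in (16), dissipation caused by peripherals (19), and dissipation induced by core execution. Using these contributions, the overall power dissipation during one clock cycle is:
$$P_c = P_{stat} + P_{Peripherals} + P_{MCore}$$
The power dissipation of a multicore embedded application, $P_n$, is defined as the arithmetic mean of all $P_c$ over the $n$ clock cycles through which the application is executed. Using (14) and (23), and substituting $f = 1/T$, one can derive the application power dissipation $P_n$ as:
$$P_n = P_{stat} + \frac{V_{DD}^2}{n \cdot T} \sum_{k=1}^{n} \left( \sum_{i=1}^{N} C_{p(i)} + \sum_{j=1}^{K} \left( C_{B(j)} + C_{I(j)} \right) \right) \qquad (26)$$
where the capacities inside the outer sum are those of the peripherals and cores active in clock cycle $k$.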
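The mean power model lends itself to a direct implementation. The sketch below assumes that each clock cycle is described by the effective capacities of the active peripherals and, for each active core, a pair of base and inter-instruction effective capacities; this data layout, the function name, and the numbers are assumptions made for illustration, not the structure of the authors' tool.

```python
def mean_power(p_stat, v_dd, period, cycles):
    """Mean power dissipation in watts over a sequence of clock cycles.

    cycles: list of (peripheral_caps, core_caps) tuples, one per clock cycle,
            where peripheral_caps is a list of capacities in farads and
            core_caps is a list of (c_base, c_inter) pairs, one per active core.
    Assumes per-cycle dynamic power = V_DD^2 * (sum of capacities) / T.
    """
    total_capacity = 0.0
    for peripheral_caps, core_caps in cycles:
        total_capacity += sum(peripheral_caps)
        total_capacity += sum(c_base + c_inter for c_base, c_inter in core_caps)
    return p_stat + v_dd ** 2 * total_capacity / (len(cycles) * period)


# Hypothetical two-cycle trace: one peripheral and one active core.
trace = [
    ([10e-12], [(25e-12, 5e-12)]),
    ([10e-12], [(35e-12, 5e-12)]),
]

p_n = mean_power(p_stat=5e-6, v_dd=1.0, period=1e-6, cycles=trace)
print(f"P_n = {p_n:.3e} W")
```

Keeping the trace as plain per-cycle lists makes it easy to feed the model from an instruction profile, at the cost of memory for long traces; an instruction histogram would be a more compact alternative.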

B. Estimation Model for Overall Energy Consumption
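The overall energy consumption $E_n$ represents the sum of the contributions $E_{c(k)}$ consumed during each individual cycle:
$$E_n = \sum_{k=1}^{n} E_{c(k)}$$
The energy consumed by all peripherals can be calculated as in (31), where $N$ represents the number of peripherals and $E_{p(i)}$ the energy footprint of the $i$-th peripheral:
$$E_{Peripherals} = \sum_{i=1}^{N} E_{p(i)} \qquad (31)$$
Since the main property of dynamic energy consumption is the effective capacity $C_p$, the energy consumed by a single peripheral can be defined as in (32):
$$E_{p(i)} = C_{p(i)} \cdot V_{DD}^2 \qquad (32)$$
Similarly, the dynamic energy consumed by all cores can be defined as in (34), where $K$ represents the number of DSP cores active in the current clock cycle:
$$E_{MCore} = \sum_{j=1}^{K} E_{DSP(j)} \qquad (34)$$
During one clock cycle, the energy that is spent on one DSP core depends strictly on the instruction being executed at the moment and on its properties, the base effective capacity $C_B$ and the inter-instruction effective capacity $C_I$ (35):
$$E_{DSP(j)} = \left( C_{B(j)} + C_{I(j)} \right) \cdot V_{DD}^2 \qquad (35)$$
Combining (34) and (35) gives:
$$E_{MCore} = V_{DD}^2 \cdot \sum_{j=1}^{K} \left( C_{B(j)} + C_{I(j)} \right)$$
The energy consumed during one clock cycle $E_c$ (38) is derived from (29) and (37), with the static contribution over one clock period $T$ equal to $P_{stat} \cdot T$:
$$E_c = P_{stat} \cdot T + V_{DD}^2 \cdot \left( \sum_{i=1}^{N} C_{p(i)} + \sum_{j=1}^{K} \left( C_{B(j)} + C_{I(j)} \right) \right) \qquad (38)$$
Finally, the overall energy consumption $E_n$ (39) can be derived from (28):
$$E_n = n \cdot T \cdot P_{stat} + V_{DD}^2 \cdot \sum_{k=1}^{n} \left( \sum_{i=1}^{N} C_{p(i)} + \sum_{j=1}^{K} \left( C_{B(j)} + C_{I(j)} \right) \right) \qquad (39)$$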
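The overall energy model can be sketched in the same spirit. The example assumes that static energy accumulates as $P_{stat} \cdot T$ per cycle and that dynamic energy per cycle equals $V_{DD}^2$ times the sum of the active effective capacities; the trace layout, function name, and numbers are hypothetical.

```python
def total_energy(p_stat, v_dd, period, cycles):
    """Overall energy in joules over a sequence of clock cycles.

    cycles: list of (peripheral_caps, core_caps) tuples, one per clock cycle,
            where core_caps holds (c_base, c_inter) pairs for the active cores.
    Assumes E = n * T * P_stat + V_DD^2 * (sum of all switched capacities).
    """
    static_energy = len(cycles) * period * p_stat
    switched = sum(
        sum(peripheral_caps) + sum(c_b + c_i for c_b, c_i in core_caps)
        for peripheral_caps, core_caps in cycles
    )
    return static_energy + v_dd ** 2 * switched


# Hypothetical two-cycle trace: one peripheral and one active core.
trace = [
    ([10e-12], [(25e-12, 5e-12)]),
    ([10e-12], [(35e-12, 5e-12)]),
]

for period in (1e-6, 2e-6):  # clock periods in seconds
    e_n = total_energy(p_stat=5e-6, v_dd=1.0, period=period, cycles=trace)
    print(f"T = {period:.0e} s: E_n = {e_n:.3e} J")
```

In this sketch, slowing the clock changes only the static term, so the difference between the two printed values is purely leakage energy.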

C. Discussion
The equations derived for mean power dissipation (26) and for overall energy consumption (39) are parametrized with the following parameters:
• Supply voltage ($V_{DD}$);
• Clock period ($T$);
• Effective capacities;
• Number of cores;
• Number of clock cycles.
This parametrization is important because it provides flexibility in estimating power and energy for different supply voltages, clock periods, instructions, numbers of cores, and numbers of clock cycles, in order to refine the energy cost on ultra-low power target platforms.
Also, it is interesting to note that in the expression for the mean power dissipation (26) the static component is independent of the operating clock frequency, while in the expression for the overall energy consumption (39) the dynamic component is not a function of the operating clock frequency. Figure 2 illustrates these conclusions.

VII. EXPERIMENTAL RESULTS AND VALIDATION
To prove methodology and estimation models described and derived in this paper, it was necessary to run estimation models against real-world embedded applications. For this purpose, the Finite Impulse Response (FIR) filter has been selected, as one of the most common applications found in the DSP domain.
The target platform contains two different flavors of DSP cores, one dedicated mostly to data transfer -gpDSP, and the other designed for number crunching -naDSP. It was interesting to do a comparative analysis using this hardware and software diversity. To mitigate the implementation quality gap, the same developer created both applications, for the gpDSP and for the naDSP.
The instruction sets used in both cases have been profiled using the measurement methodologies described in Section V, and then the estimation models from the previous section have been applied. In parallel, both applications were deployed and executed, and measurements were taken on real hardware at four different clock frequencies. When completed, estimated and measured values were compared.
A. Finite Impulse Response (FIR) Filter - Case Study
1. Implementation for the gpDSP
The instruction histogram (Fig. 6) reveals an unequal distribution, since the gpDSP is not designed for processing such as FIR filtering. Table I lists the estimated power dissipation values $P_E$ and the measured values $P_M$. Estimations have been made based on the model (26), and measurements were taken on the board whilst the FIR filter application was running. Both tables (I and II) contain the column Acc, abbreviated from "accuracy", which was calculated for all values. The values obtained on the gpDSP verify that the model and measurement methodologies presented in this paper provide a high level of accuracy.

2. Implementation for the naDSP
The histogram in Fig. 7 represents the number of instructions used for the implementation of the FIR filter on the naDSP. The distribution of used instructions is much more even, since the core is designed for such processing. Table III contains the measured values $P_M$ and the estimated values $P_E$ for power dissipation while the FIR filter application was running on the naDSP. As with the gpDSP, the accuracy of the naDSP estimation is quite high (above 97 %), implying that the presented measurement methodology and the derived model (39) provide reliable information about the energy footprint of the embedded application.
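The source does not spell out how the Acc column is computed. One common definition, assumed here purely for illustration, is one minus the relative error of the estimate with respect to the measurement, expressed as a percentage; the function name and the sample values below are hypothetical and are not taken from the tables.

```python
def accuracy_percent(estimated, measured):
    """Accuracy in percent, ASSUMED to be 100 * (1 - |P_E - P_M| / P_M)."""
    return 100.0 * (1.0 - abs(estimated - measured) / measured)


# Hypothetical (estimated, measured) power pairs in microwatts.
samples = [(98.0, 100.0), (203.0, 200.0), (395.0, 400.0)]

for p_e, p_m in samples:
    acc = accuracy_percent(p_e, p_m)
    print(f"P_E = {p_e} uW, P_M = {p_m} uW, Acc = {acc:.2f} %")
```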

B. Discussion
The activity diagram (Fig. 8) presents the time necessary for one FIR filter processing loop on the two different cores, gpDSP and naDSP. Figure 9 depicts the mean power dissipation on the gpDSP and on the naDSP. It is interesting to note that the mean power dissipation on the naDSP is slightly higher than on the gpDSP.
However, the trend presented in Fig. 10 provides another insight into the overall energy consumption of the application: the overall energy consumption on the naDSP is around 4.5 times lower than on the gpDSP. This comparative analysis leads to the conclusion that hardware design can have a huge impact on energy consumption.
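A core can draw more power and still consume less energy if it finishes the work sooner, because energy is the product of mean power and execution time. The sketch below illustrates this with made-up, normalized numbers chosen only to reproduce a ratio of about 4.5; it assumes the naDSP completes the processing loop in less time, and none of the values come from Figs. 8-10.

```python
def energy(mean_power, execution_time):
    """Energy as the product of mean power and execution time."""
    return mean_power * execution_time


# Normalized, invented figures (arbitrary units), not measured data.
gp_power, gp_time = 1.0, 9.9   # general-purpose core: lower power, longer loop
na_power, na_time = 1.1, 2.0   # numeric core: slightly higher power, shorter loop

gp_energy = energy(gp_power, gp_time)
na_energy = energy(na_power, na_time)
ratio = gp_energy / na_energy

print(f"gpDSP energy = {gp_energy:.2f}, naDSP energy = {na_energy:.2f}")
print(f"gpDSP / naDSP energy ratio = {ratio:.2f}")
```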

VIII. CONCLUSIONS AND FUTURE WORK
This paper presents measurement methodologies and estimation models for the power and energy that are consumed during the execution of embedded applications on a heterogeneous multicore platform. The described approach has been validated and verified using a real-world application (FIR filter) and proved to be quite accurate (above 97 %). With that taken into account, there is a diversity of potential applications and future research. This study has been inspired by ultra-low power embedded application development; therefore, two main tools can be developed to facilitate it:
• An energy estimation tool that can provide insight into the system energy footprint;
• An energy-aware compiler that will be fed with instruction set measurement data and will perform instruction selection and scheduling accordingly.
The estimation tool would provide passive assistance during development, but it would still teach users of the system about its properties.
An energy-aware compiler would provide active support by making decisions about instruction selection and scheduling based on empirical data.

CONFLICTS OF INTEREST
The author declares that he has no conflicts of interest.