Efficient Feature Set Developed for Acoustic Gunshot Detection in Open Space

Abstract: This paper presents an efficient approach to automatic gunshot detection based on a combination of two feature sets: adapted standard sound features and hand-crafted novel features. The standard features are mel-frequency cepstral coefficients adapted for gunshot recognition by using uniform gamma-tone filters linearly spaced over the whole frequency range from 0 kHz to 16 kHz. The first 18 coefficients calculated from the 41 filters represent the best set of the optimized cepstral coefficients. The novel features were derived in the time domain from individual significant points of the raw waveform after amplitude normalization. Experiments were performed using single and ensemble neural networks to verify the effectiveness of the novel features for supplementing the standard features. The novelty of the work is the proposed feature combination, which achieves very effective detection of gunshots from hunting weapons using 23 features and a simple neural network. In binary classification, the developed approach achieved an accuracy of 95.02 % in gunshot detection and 98.16 % in disregarding other sounds (i.e., non-gunshots).


I. INTRODUCTION
Automatic gunshot detection in audio streams can help protect property or increase security. Although a gunfire sound can today be detected with a smartphone by both civilians [1] and police [2], it still makes sense to develop reliable detection methods for specific situations. Our research into gunshot detection has been initiated by the Save Elephants society to protect elephants against poachers in Central Africa. Some wild elephants today wear collars that are equipped with a GPS module to track the movement of elephant herds. The collar equipment, called "smart collar", will be complemented by a gunshot detection module that sends an alarm signal along with location information. In such a case, game rangers can act very promptly.
The widespread features in acoustic pattern recognition, such as mel-frequency cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs), have their origins in speech recognition, but their application has been extended to various acoustic scenes, including gunshot recognition [3], [4]. An introduction to these features may be found in [5]. In particular, MFCCs are successfully used in many acoustic event recognition tasks (e.g., see [6], [7]). The authors of [4] used three MFCC variants as a neural network input for detecting two sounds: gunshot and glass breaking. The 3-set combination of MFCCs, i.e., static coefficients, delta coefficients, and delta-delta coefficients, has become the standard technique for speech recognition. Some researchers tested gunshot recognition using methods that were primarily developed for image recognition. For instance, the study in [8] describes the successful use of two-dimensional sound visualizations based on a spectrogram, MFCC, and a self-similarity matrix showing signal correlation. An overview of successful approaches developed by academics can be found in the proceedings resulting from the competition "Detection and Classification of Acoustic Scenes and Events (DCASE)" [9], a challenge that invited the authors to compare their sound detection algorithms where gunshots were the target sounds. A comprehensive review of gunshot detection technologies in urban environments, including a history of gunshot detection, can be found in [10].
Most of the developed methods and autonomous systems are intended for military, urban, or in-building applications. A surveillance system for automatic detection of gunshots in an indoor environment is proposed in [11]. In recent years, many commercial products for gunshot detection have been developed; for example, "Shooter Detection" [12] built into smartphones for various personal applications, "Boomerang" [13] installed on military patrol vehicles, and "ShotSpotter" [14] designed for urban use. Some practical aspects of implementing ShotSpotter in an urban environment are discussed in [15]. The study in [16] describes the experience of using automatic gunshot detection in US cities.
There are very few studies dealing with gunshot detection developed to protect endangered animals, such as elephants or rhinos in the wild. The paper in [17] presents a prototype of an anti-poaching system built on elephants' collars, which is the same application as the goal of our work. The system is based on 10 features calculated from the shockwave. The authors did not disclose the overall accuracy of the detection. All the features used in [17] differ from our features defined in this paper.
The rest of the paper is organized as follows. The next section provides basic information about gunshot sounds. Section III introduces a combination of new efficient features developed to detect gunshot events in open space. The experiments carried out, including the evaluation of the results obtained, are presented in Section IV. Finally, Section V concludes the paper and outlines the continuation of work for the intended application.

II. GUNSHOT SOUND CHARACTERISTICS
The sound of a gunshot depends on the generating mechanisms and differs in detail according to the type of firearm, especially according to its caliber and barrel length. The sounds are naturally impulsive signals characterized by very high intensity and short duration (a few milliseconds). A typical gunshot wave consists of two parts: an initial high-intensity signal, which usually has an N-shaped waveform, and a subsequent ending phase with falling intensity. It therefore makes sense to analyse both parts separately in addition to the whole signal. Due to the psychoacoustic properties of the human hearing organs [18], [19], our subjective perception of gunshots is much longer than the millisecond duration of a purely physical sound signal. It should also be noted that the intensity of the gunshot, like that of other sounds, drops non-linearly with increasing distance from the source due to absorption in the air and spherical propagation. For rifles, the typical sound pressure level (SPL) is around 160 dB (at a distance of 1 m from the barrel), the main frequency components cover the range of 250 Hz-450 Hz, and the velocity of the projectile is between 800 m/s and 900 m/s. A typical waveform of the gunshot and its corresponding spectrum are depicted in Fig. 1. In this case, the initial N-wave has a symmetrical oscillation. There are three clear local peaks in the spectrum in the range of 0 kHz-2 kHz with a dominant frequency below 500 Hz. Generally, the height and frequency of individual peaks differ for different types of firearms, and in some details they can be influenced by the current technical condition of the firearm. A similar spectral phenomenon is well known, e.g., in speech signal processing, where local peaks (so-called "formants") in vowel spectra serve primarily to distinguish individual vowels [20], but their small changes may reflect the speaker's state [21].
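The distance dependence mentioned in this section can be illustrated with a short calculation. The following is a minimal sketch, assuming an ideal point source in a free field, where spherical propagation lowers the level by 20·log10(r) dB relative to the 1 m reference. The function name and the optional absorption coefficient (a constant loss per meter) are illustrative assumptions, not values taken from this study.

```python
import math

def spl_at_distance(spl_ref_db, r_m, r_ref_m=1.0, absorption_db_per_m=0.0):
    """Estimate SPL at distance r_m from a point source in a free field.

    Assumptions (not from the paper): ideal spherical spreading and an
    optional constant air-absorption loss per meter.
    """
    spreading_loss_db = 20.0 * math.log10(r_m / r_ref_m)
    absorption_loss_db = absorption_db_per_m * (r_m - r_ref_m)
    return spl_ref_db - spreading_loss_db - absorption_loss_db

if __name__ == "__main__":
    # Rifle reference level from the text: about 160 dB at 1 m.
    for r in (1.0, 10.0, 100.0, 1000.0):
        print(f"{r:7.0f} m: {spl_at_distance(160.0, r):6.1f} dB")
```

Under these idealized assumptions the level falls by 20 dB per tenfold increase in distance; real propagation in open terrain will deviate from this because of absorption, wind, and ground effects.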

III. EFFECTIVE COMBINATION OF TWO FEATURE SETS
A key step in the classification of audio signals is the extraction of appropriate signal features that represent and distinguish the audio event of interest. To reliably detect gunshots from hunting weapons, we have created and combined two feature sets from different domains. The first one is based on standard sound features resulting in an optimized cepstral coefficient set. The second feature set includes new individual shooting-specific features derived directly from the shot waveform.

A. Optimized Cepstral Coefficients
First, the performance of standard sound features, such as MFCCs, LPCs, autocorrelation coefficients, and energy in frequency bands, was tested, and the feature sets were then tuned for gunshot recognition. The best results were achieved with optimized MFCCs, which differ from standard MFCCs in two filtering parameters: the original triangular filters spaced on the mel-frequency scale were replaced by uniform gamma-tone filters spaced linearly on the frequency axis [22]. The arrangement of the entire filter bank is shown in Fig. 2 (alternating red and blue colors are used for adjacent filters to increase readability). These features will be called "linear frequency cepstral coefficients" (LFCCs) to emphasize the linear distribution of filter boundaries. The gamma-tone filter response simulates the impulse response of nerve cells in the auditory fiber. A gamma-tone filter is described in the time domain by the impulse response given by the product of the gamma envelope and the sinusoidal tone
$$g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f_0 t + \varphi),$$
where a, n, b, f_0, and φ are the peak value, filter order, bandwidth, characteristic (i.e., center) frequency, and initial phase, respectively [23]. In the frequency domain, the filter is approximately symmetric around f_0. The optimum LFCC structure found for gunshot detection consists of 41 filters covering the frequency range of 0 kHz-16 kHz with an equidistant center frequency spacing of 390 Hz. The set of the first 18 coefficients proved to be the most efficient, and these coefficients were used in our experiments.
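The filter bank layout and the gamma-tone impulse response can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes that the 41 filters tile the 0 kHz-16 kHz band uniformly with centers at the band midpoints (which gives a spacing of about 390 Hz), and it uses a filter order of 4 and a freely chosen bandwidth, both of which are assumptions.

```python
import math

def center_frequencies(n_filters=41, f_max_hz=16000.0):
    """Linearly spaced center frequencies (assumed: midpoints of equal bands)."""
    spacing = f_max_hz / n_filters  # about 390.2 Hz for 41 filters
    return [(k + 0.5) * spacing for k in range(n_filters)]

def gammatone_ir(f0_hz, b_hz, fs_hz=44100.0, duration_s=0.010,
                 order=4, a=1.0, phase=0.0):
    """Sampled impulse response: gamma envelope times a sinusoidal tone.

    g(t) = a * t**(order-1) * exp(-2*pi*b*t) * cos(2*pi*f0*t + phase)
    order=4 and the bandwidth b_hz are illustrative assumptions.
    """
    n_samples = int(round(duration_s * fs_hz))
    ir = []
    for i in range(n_samples):
        t = i / fs_hz
        envelope = a * t ** (order - 1) * math.exp(-2.0 * math.pi * b_hz * t)
        ir.append(envelope * math.cos(2.0 * math.pi * f0_hz * t + phase))
    return ir

if __name__ == "__main__":
    fc = center_frequencies()
    print(len(fc), round(fc[1] - fc[0], 1))
    ir = gammatone_ir(f0_hz=fc[0], b_hz=fc[1] - fc[0])
    print(len(ir))
```

In practice the filters would be applied to the magnitude spectrum of each frame; the time-domain response is shown here only to make the roles of the parameters a, n, b, f_0, and φ concrete.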
After filtering, the calculation of the coefficients continues according to the rules of standard MFCC as follows:
$$c_m = \sum_{n=1}^{N} \log(S_n) \cos\left[\frac{\pi m}{N}\,(n - 0.5)\right],$$
where N = 41 is the number of filters used and S_n stands for the output power of the n-th filter of the filter bank. The signal spectrum was calculated by the Fourier transform. In our calculation, the index m ranges from 1 to 18. The steps of the complete LFCC algorithm are graphically summarized in Fig. 3. First, the audio signal is segmented into short frames. Then, the fast Fourier transform (FFT) is applied to the signal in each frame to get a short-term spectrum. The magnitude spectrum is filtered by the linear gamma-tone filter bank. The outputs of the bandpass filters are converted into logarithmic energy values. Finally, the LFCCs are calculated using the discrete cosine transform (DCT). The gamma-tone LFCCs seem to be more powerful than MFCCs in recognizing impulsive signals, such as gunshots, as the filter bank simulates a slightly different property of the auditory organ. The mel-frequency scale reflects overall auditory perception, and standard MFCCs based on this scale were proposed for automatic speech and speaker recognition. A speech signal contains many small details in pronunciation. Gunshots, however, are very short (< 10 ms), and their acoustic signals are mechanically generated sounds.
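The last step of the LFCC computation, from filter output powers to cepstral coefficients, can be sketched in a few lines. This is a minimal sketch that assumes the standard MFCC-style cosine transform, c_m = Σ log(S_n)·cos[π·m·(n − 0.5)/N] for n = 1..N; the function name and the small floor applied before the logarithm are illustrative assumptions.

```python
import math

def lfcc_from_filter_powers(powers, n_coeffs=18, floor=1e-12):
    """Cepstral coefficients c_1..c_n_coeffs from filter bank output powers.

    powers: output power S_n of each filter (N = len(powers), e.g. 41).
    floor:  assumed guard value that avoids log(0) for silent bands.
    """
    n_filters = len(powers)
    log_powers = [math.log(max(p, floor)) for p in powers]
    coeffs = []
    for m in range(1, n_coeffs + 1):
        c_m = sum(
            log_powers[n - 1] * math.cos(math.pi * m * (n - 0.5) / n_filters)
            for n in range(1, n_filters + 1)
        )
        coeffs.append(c_m)
    return coeffs

if __name__ == "__main__":
    # A flat spectrum has no cepstral structure: all coefficients are ~0.
    flat = [2.0] * 41
    print(max(abs(c) for c in lfcc_from_filter_powers(flat)))
```

Because the index m starts at 1, the coefficient that would only capture the overall level (m = 0) is not part of the 18-element set.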

B. Shooting-Specific Features
We were looking for new specific features derived directly from the gunshot acoustic waveform, which is usually clearly N-shaped at the beginning of the curve. The time-domain based feature set will be called "TDF".
The significant TDFs applied herein are the time interval T between the dominant peaks (positive and negative), measured as the interval between the located points of maximum and minimum in the gunshot waveform, and the area P defined by the peak-to-peak curve and the horizontal time axis, depicted as the filled area in Fig. 4. Both features T and P characterize the N-shape at the onset of the wave, which distinguishes gunshots well from other sounds that may occur. The other features are related to the overall decreasing part of the curve. Because of the damped irregular oscillations, two areas were calculated separately for positive and negative amplitudes. The positive area A (above the horizontal axis) begins at the dominant positive peak and ends after 6 ms. The negative area B (below the horizontal axis) begins at the dominant negative peak and also ends after 6 ms. This means that both areas correspond to a time period of 6 ms, but are offset from each other, as can be seen in Fig. 5. In addition, the ratio R of the areas A and B was considered. The five features defined above (T, P, A, B, and R) form the TDF set. An overview of the TDF extraction is shown in Fig. 6.
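The five time-domain features can be sketched as follows. This is only a minimal illustration under stated assumptions: areas are approximated by a rectangle-rule sum of sample values, P is taken as the absolute area between the two dominant peaks, windows that run past the end of the signal are simply truncated, and the function and key names are hypothetical.

```python
def tdf_features(x, fs_hz=44100.0, window_s=0.006):
    """Compute T, P, A, B, and R from an amplitude-normalized waveform x."""
    dt = 1.0 / fs_hz
    i_max = max(range(len(x)), key=lambda i: x[i])  # dominant positive peak
    i_min = min(range(len(x)), key=lambda i: x[i])  # dominant negative peak

    # T: time interval between the dominant positive and negative peaks.
    t_interval = abs(i_min - i_max) * dt

    # P: area between the peak-to-peak curve and the time axis (assumed absolute).
    lo, hi = sorted((i_max, i_min))
    p_area = sum(abs(v) for v in x[lo:hi + 1]) * dt

    # A and B: positive and negative areas over 6 ms, each starting at its peak.
    w = int(round(window_s * fs_hz))
    a_area = sum(v for v in x[i_max:i_max + w] if v > 0.0) * dt
    b_area = sum(-v for v in x[i_min:i_min + w] if v < 0.0) * dt

    # R: ratio of the two areas (guarded against division by zero).
    ratio = a_area / b_area if b_area > 0.0 else float("inf")
    return {"T": t_interval, "P": p_area, "A": a_area, "B": b_area, "R": ratio}

if __name__ == "__main__":
    toy = [0.0, 1.0, 0.0, -1.0, 0.0]  # crude N-shaped pulse
    print(tdf_features(toy, fs_hz=1000.0))
```

A trapezoidal rule or interpolation between samples would give smoother area estimates; the rectangle rule is used here only to keep the sketch short.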

C. Overall Feature Extraction
The overall process of feature extraction is shown in Fig. 7. The first step in sound processing is the preprocessing of digital audio data. Preprocessing is used to prepare the input signal for reliable extraction of acoustic features. In our case, preprocessing involves two procedures: segmenting the signal into frames and normalizing the amplitude, so that all signal points are scaled in the range from −1 to +1. Regarding TDF, it is also important to remove the DC component (if an offset occurs). A rectangular window was used for framing all signals (gunshots and non-gunshots). The created frames have a length of 10 ms with an overlap of 50 %. This means that new features are calculated every 5 ms.
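The preprocessing steps can be sketched as below. This is a minimal sketch with some assumptions: DC removal and peak normalization are applied to the whole recording before framing (the text does not state the order), and the hop size is rounded down to a whole number of samples, since 10 ms at 44.1 kHz is 441 samples and half of that is not an integer.

```python
def remove_dc(signal):
    """Subtract the mean so the waveform is centered on zero."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def normalize(signal):
    """Scale so that the largest absolute amplitude equals 1."""
    peak = max(abs(s) for s in signal)
    if peak == 0.0:
        return list(signal)
    return [s / peak for s in signal]

def frame_signal(signal, fs_hz=44100.0, frame_s=0.010, overlap=0.5):
    """Split into rectangular-window frames (10 ms, 50 % overlap by default)."""
    frame_len = int(round(frame_s * fs_hz))
    hop = max(1, int(frame_len * (1.0 - overlap)))  # rounded down (assumption)
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

if __name__ == "__main__":
    raw = [0.5 + 0.1 * ((i % 7) - 3) for i in range(2000)]
    frames = frame_signal(normalize(remove_dc(raw)))
    print(len(frames), len(frames[0]))
```

With these defaults each frame holds 441 samples and successive frames start 220 samples apart, i.e., roughly every 5 ms.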
The feature sets LFCC and TDF are independent of each other, so they can be calculated in parallel. The obtained values are then combined into one feature vector. In the subsequent gunshot detection, each signal frame is represented by a feature vector of 23 elements (18 LFCCs and 5 TDFs). Neither LFCC nor TDF alone outperforms other feature sets used in audio event recognition. However, our experiments show that their combination creates a very powerful feature vector for gunshot detection.

IV. EXPERIMENTS AND RESULTS
All experiments are based on real acoustic signals representing both gunshots and non-gunshots. The described calculations and approaches were implemented using the MATLAB programming environment on a desktop computer with a straightforward configuration. Due to the short duration of the gunshot (< 10 ms), the signal analysis is carried out on a one-frame basis, i.e., only one feature vector is extracted from each gunshot.

A. Used Data
Most of the sound recordings used in this study were from the GUDEON corpus [24], created specifically for research into gunshot detection in open nature. In addition, some signals were taken from two sources: Still North Media [25] for gunshots and Urban Sound Datasets [26] described in [27] for non-gunshots. The gunshot category includes 1500 gunshots from various hunting weapons such as AK-47, AR-15, Carl Gustaf m/45, Tikka T3, etc. Note that the AK-47 (7.62 mm × 39 mm) is not a typical hunting weapon, but it is often used by poachers in the territory of the intended application of gunshot detection in Central Africa. The non-gunshot category is divided into 6 different classes as follows: barking dog, sounds from elephants, sound of rain and storm, car horn, engine sounds, and human sounds (including short shouts). Each class contains 4000 individual sound frames, i.e., the non-gunshot category covers a total of 24000 sound frames. The sound classes were chosen because they occur frequently around elephants. Some recordings in external databases originally had different data formats, i.e., different sampling frequencies, quantization levels, and numbers of channels (one or two). We converted these recordings so that all signals used in our experiments were single-channel sounds, sampled at 44.1 kHz and quantized with 16 bits. All recordings were in WAV format.

B. Evaluation Criteria
To evaluate the performance of the features, we used a fully connected neural network (NN) with two hidden layers of 20 neurons each. This architecture was chosen after a simple grid search over 1 to 3 hidden layers with 10, 20, or 30 neurons. More complex networks do not need to be considered due to the relatively low number of input features. The total dataset was divided into training, validation, and testing subsets. The training subset contains 600 gunshots and 6 × 600 non-gunshots (i.e., 600 sounds from each non-gunshot class) randomly selected, the validation subset contains 200 gunshots and 6 × 200 non-gunshots, and the testing subset contains 700 gunshots and 6 × 3200 non-gunshots. To limit the effect of a specific training subset, each subset was generated 5 times with a predefined random seed, and the results presented were calculated as the average of the 5 different training/testing phases.
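The repeated, seeded data split can be sketched as follows. This is a minimal sketch that assumes the subsets are drawn by shuffling item indices per class; the function name and the seed values 0 to 4 are hypothetical, since the actual seeds are not given.

```python
import random

def split_indices(n_items, n_train, n_val, seed):
    """Shuffle indices with a fixed seed and cut train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

if __name__ == "__main__":
    for seed in range(5):  # five repetitions; seed values are assumptions
        # Gunshots: 1500 items -> 600 / 200 / 700.
        g_train, g_val, g_test = split_indices(1500, 600, 200, seed)
        # Each of the 6 non-gunshot classes: 4000 items -> 600 / 200 / 3200.
        n_train, n_val, n_test = split_indices(4000, 600, 200, seed)
        print(seed, len(g_train), len(g_val), len(g_test),
              len(n_train), len(n_val), len(n_test))
```

The reported figures would then be the average of the metrics obtained over the five train/test repetitions.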
Two appropriate event-based metrics were applied to the evaluation of gunshot detection performance using the developed features and neural networks, namely, the true positive rate (TPR) and the true negative rate (TNR), defined as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP},$$
where TP is the number of true positives (i.e., gunshots identified as gunshots) and FN is the number of false negatives (i.e., missed gunshots). Analogously, TN stands for the number of true negatives (i.e., non-gunshots identified as such) and FP stands for the number of false positives, also known as false alarms (i.e., non-gunshots identified as gunshots). In the overall evaluation, TNR was preferred for reliability measurement due to the prevailing non-gunshot sounds in continuous acoustic scene monitoring. Furthermore, frequent false alarms would dull the administrator's attention.
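Both metrics follow directly from the four counts, with TPR = TP / (TP + FN) and TNR = TN / (TN + FP). The sketch below is a minimal illustration; the label convention (1 for gunshot, 0 for non-gunshot) and the function name are assumptions.

```python
def tpr_tnr(y_true, y_pred):
    """Return (TPR, TNR) for binary labels: 1 = gunshot, 0 = non-gunshot."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # gunshots detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # gunshots missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-gunshots rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)

if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    tpr, tnr = tpr_tnr(y_true, y_pred)
    print(f"TPR = {tpr:.2%}, TNR = {tnr:.2%}")
```

The function expects at least one gunshot and one non-gunshot among the true labels; otherwise one of the denominators is zero.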

C. Achieved Results
First of all, the power of LFCC and MFCC to detect gunshots from hunting weapons was tested. Each feature set was optimized separately for this purpose. The LFCC parameters are described in Section III. The search for MFCC parameters resulted in the following values: frequency range of 0 kHz-22 kHz, linear scale from 0 Hz to 1000 Hz and mel scale for higher frequencies, triangular filter shapes, 28 filters, and 12 coefficients. Table I shows a comparison of the results obtained using the optimized settings of the LFCC and MFCC algorithms. As can be seen, LFCC gives better results according to both TPR and TNR criteria.
Under the conditions described in the preceding section, various setups of 18 LFCCs and 5 TDFs were investigated and evaluated. The basis was the separate performance of the LFCC set and TDF set. Subsequently, the performance of the merged feature set LFCC + TDF (i.e., 23 features in total) was evaluated. Since TDFs are extracted from the time domain and their character is quite different from the LFCCs, low mutual information between the two feature sets is expected. Table II shows the results obtained in these tests with a single neural network. As can be seen, compared to LFCCs only, the combined set of LFCC + TDF performs significantly better in terms of TPR, but with a slight decrease in TNR.
In another test, the combination of LFCCs with TDFs was tested in an ensemble approach with a separate network trained for LFCC features and another one for TDF features. In this case, the recognition algorithms run in parallel in two branches, each resulting in a probability of the binary classification gunshot/non-gunshot. The final decision is then given by the sum of the probability scores from both separate networks. The results of this approach are shown in the last row of Table II. Overall, the ensemble network provides the best results in terms of both TPR and TNR.
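The score-level fusion of the two branches can be sketched as follows. This is a minimal sketch that assumes each network outputs a gunshot probability between 0 and 1 and that the class with the larger summed score wins, which in the binary case is the same as comparing the sum of the gunshot probabilities with 1. The function name and the tie handling (ties count as non-gunshot) are assumptions.

```python
def ensemble_decision(p_gunshot_lfcc, p_gunshot_tdf):
    """Fuse two branch outputs by summing their class probability scores.

    Returns True for 'gunshot' and False for 'non-gunshot'.
    """
    score_gunshot = p_gunshot_lfcc + p_gunshot_tdf
    score_non_gunshot = (1.0 - p_gunshot_lfcc) + (1.0 - p_gunshot_tdf)
    return score_gunshot > score_non_gunshot

if __name__ == "__main__":
    # A confident LFCC branch can outvote an uncertain TDF branch.
    print(ensemble_decision(0.9, 0.3))
    # Two weak scores do not add up to a detection.
    print(ensemble_decision(0.4, 0.5))
```

Summing probabilities rather than taking a majority vote lets a confident branch compensate for an uncertain one, which is one plausible reason for the improvement seen with the ensemble.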
The individual features of the TDF set have different potential for gunshot detection. In the investigation, each feature was added separately to the complete LFCC set and the change of performance was observed. Based on the performance improvement, the TDFs were sorted in descending order as follows: T, A, R, B, and P. Thereafter, the features from the TDF set were gradually added to the LFCCs in this order. Table III summarizes the effect of increasing the number of added TDFs on the detection performance in terms of TPR for both single and ensemble neural networks. As can be seen, the inclusion of T and A in the ensemble network increases TPR by approximately 10 %. On the other hand, feature B has no significant benefit. Thus, in practical applications where the amount of data to be computed plays a role, the benefit of increased performance and the associated increase in computational costs should be taken into account.
Table IV shows a brief overview of the accuracy achieved by methods of mono-channel gunshot detection that have been published in recent years. In comparison to other methods, we achieved a high accuracy of 95.02 % with a low number of 23 features. For example, the study in [28] reported the best accuracy of 96.10 % when applying 338 features. Moreover, we use a relatively simple neural network. In Table IV, SVM stands for Support Vector Machine [29], CNN stands for Convolutional Neural Network [30], and kNN for k-Nearest Neighbors [28].

V. CONCLUSIONS
This paper presents a set of new features that have been developed for reliable acoustic detection of gunshots from hunting weapons in the wild. Our contribution lies both in finding new types of features and in optimally combining individual features into an efficient feature vector. Based on real acoustic signals, a gunshot detection rate of 95.02 % was achieved. Also very useful in practice is the ability of the proposed method to ignore almost all non-gunshots, i.e., 98.16 % of other environmental sounds.
In the intended application for the protection of elephants, an important aspect is not only the high reliability of gunshot detection, but also the low energy consumption, because the implemented system will be powered by a self-sustainable energy source. In our next work, we will optimize the feature calculations and the classification approach to reduce the power consumption of the processor [32]. For this purpose, the study in [33] compares computational costs versus classification performance for different approaches to sound recognition.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.