A Novel Fitting Model for Practical AIS Abnormal Data Repair in Inland River

Affected by the inland waterway environment, an Automatic Identification System (AIS) collects a large amount of abnormal data, which significantly reduces the performance of inland river navigation applications that rely on AIS data. To this end, this paper aims to restore the AIS data by repairing the erroneous and lost data points. By analysing a large volume of abnormal AIS data, the abnormal data were first divided into three types, i.e., erroneous data, short-time lost data, and long-time lost data. Then, a cubic spline interpolation method was employed to deal with the erroneous data and short-time lost data, while a least squares support vector machine (LSSVM) method was utilized to repair the long-time lost data. Finally, field experiments were carried out to validate the applicability of the proposed method, and the results show that the fitting model can repair the AIS data with an accuracy of more than 90 %.


I. INTRODUCTION
Intelligent shipping is the goal of inland river navigation, and acquiring inland river vessel traffic flow information is an important support for realizing it. How to obtain complete information on the ship traffic flow has attracted increasing attention. The Automatic Identification System (AIS) is the main way to obtain real-time ship dynamic information in inland rivers. Compared with the coastal AIS network, the inland AIS network has its own characteristics. On the one hand, inland river navigation generally takes place in dense water network areas and on the trunk lines of rivers, where the channel is curved and the terrain is complex. The dense water network areas are mainly plains with many hydraulic structures such as bridges, so signal shadowing occurs. At the same time, some inland waterway trunk lines are located in mountainous areas, so it is difficult for AIS base stations to cover the whole river. On the other hand, because the AIS does not have a complete information verification mechanism, there are many abnormal data in the actual application of AIS. Therefore, according to the characteristics of abnormal AIS data and the specific repair requirements, the repair process of abnormal AIS data in the inland river environment is investigated, and a novel repair model is developed. Experimental evaluation demonstrates that the proposed repair model is able to improve the completeness of AIS data for intelligent shipping.

II. RELATED WORK
The automatic identification system is composed of shore-based and shipborne equipment. It is a modern digital navigation aid system. AIS adopts a series of information technologies, including data communication, information display, computer, and network technologies. Its main application fields include ship collision avoidance, maritime management, and enhanced Vessel Traffic Services (VTS).
In the aspect of AIS data transmission, the characteristics of the inland river AIS communication link and the packet error rate of the receiver are mainly studied. In [1], the relationship between AIS message length and packet error rate is analysed, and the corresponding packet error rate prediction model is given. On this basis, the limit capacity of an AIS base station to send short messages to ships in a crowded waterway is proposed, and consistent results are obtained in the actual verification. Chu, Liu, Ma, Liu, and Zhong [2] optimized the parameters of the Okumura-Hata model by linear regression and used the model to study the signal field distribution characteristics of the AIS communication system in mountainous channels. Hu, Cao, Gao, Xu, and Song [3], aiming at the problem of AIS false alarm rate, studied a false alarm rate measurement system based on very high-frequency (VHF) fault diagnosis.
In terms of AIS message parsing, many scholars have conducted research on improving the real-time performance and efficiency of AIS message parsing and on making full use of AIS messages. Goudossis and Katsikas [4] introduced Identity-Based Public Cryptography and Symmetric Cryptography to enhance the security properties of the AIS. Li and Yang [5] described the static and dynamic information of the AIS in detail while studying AIS message information; they studied the parsing method of the AIS message and verified it on actual data. In order to reduce trajectory data storage space, Sun, Chen, Piao, and Zhang [6] added a sliding window to the classical SPM (scan-pick-move) algorithm to improve the compression efficiency for vessel trajectory data. Wang and Fang [7] used a large number of original AIS data to analyse the characteristics of AIS messages with secret code and proposed an AIS information analysis method based on the secret code.
In the aspect of AIS data analysis, relevant scholars mainly studied the trajectory information in AIS packets. Gaglione et al. [8] proposed a Bayesian method to integrate AIS data and oceanographic high-frequency surface-wave (HFSW) radars for multi-target tracking. Sang, Wall, Mao, Yan, and Wang [9] used the cubic spline interpolation algorithm to repair the ship's trajectory; verification against the actual trajectory in a bridge section showed a good repair effect. Zhang [10] analysed the characteristics of longitude and latitude variation of ship track points and pointed out that cubic exponential smoothing is more suitable for ship track prediction. Yan et al. [11] proposed a method based on the ship trip semantic object (STSO) to extract ship traffic routes at sea from historical AIS data. Iphar, Napoli, and Ray [12] introduced the quality dimensions of data that shall be used in a quality assessment of AIS messages. Wu et al. [13] summarized several types of AIS track anomalies based on an in-depth analysis of a large number of AIS data, and automatically detected ship track anomalies according to the characteristics of each type. Chu et al. [14]-[16] addressed the loss or error of ship AIS data. A preliminary repair or prediction of AIS data was realized using piecewise cubic Hermite interpolation, and a back-propagation (BP) neural network training set and test set were established to carry out single-point and continuous multi-point AIS data prediction. Zhong, Jiang, Chu, and Liu [17] used a recurrent neural network to restore the AIS data of inland river ships and repair missing AIS data.
In conclusion, because of the complexity of the AIS system, related scholars have usually carried out thorough research only on certain aspects, such as AIS data transmission, AIS message parsing, short-term abnormal AIS data restoration algorithms, and the application of AIS data. Domestic and foreign scholars have put forward methods and models that can solve the corresponding problems well [18]. However, the systematic repair of all kinds of abnormal AIS data caused by the limitations of the environment, the equipment, and the AIS system has not been studied in depth. Therefore, this paper will first define abnormal AIS data in the inland river environment, then propose restoration methods and evaluation models for the various kinds of abnormal AIS data, and finally carry out empirical research.

III. DEFINITION OF ABNORMAL AIS DATA
In general, abnormal data refer to data affected by system failure, data loss, or destruction of data integrity. In terms of inland river traffic flow, the AIS system is an important way to obtain real-time ship traffic flow data. However, inland waterways are dominated by natural waterways that wind through mountainous areas, basins, and other regions with complex topography, and the number and tonnage of inland river vessels keep increasing. As a result, the performance of the inland river AIS system declines considerably compared with that of the open sea, which is mainly reflected in the following aspects:
1. Reduction of the effective coverage of the AIS system. The AIS signal is transmitted by VHF radio waves, which propagate along the line of sight, so it is prone to interference in the transmission process and its effective coverage is limited. Inland waterways contain many obstacles, such as riverside mountains, bridges, and urban high-rise buildings, which easily cause attenuation of AIS signals and a reduction of the effective coverage range, and eventually lead to errors or even data loss in the original AIS data.
2. Insufficient AIS channel capacity. The AIS design capacity cannot meet the communication needs of increasingly crowded inland rivers.
3. Error information caused by abnormal AIS equipment. Because the shipborne AIS units in the inland river come from different manufacturers with different technical levels, the positioning equipment performs poorly in inland river applications, and a large number of wrong location data appear.
Therefore, the abnormal data of AIS are defined as follows: due to the influence of the abnormalities of the AIS equipment itself, the transmission characteristics of VHF, and environmental factors, the collected AIS data are lost or wrong, which affects the completeness of the AIS data. According to the performance characteristics of abnormal data, the abnormal data of AIS are divided into three categories: AIS error data, AIS data loss in a short period of time, and AIS data loss in a long period of time.

A. AIS Error Data
The longitude, latitude, speed, and course of AIS dynamic data have a certain value range under inland river conditions. For example, in the main channel of the Yangtze River, longitude lies between 90° and 122° E, latitude between 24° and 35° N, and heading between 0° and 360°, while the ship's longitude and latitude change at a rate determined by its sailing speed. Figs. 1-3 show the ships' velocity profile at the Wuqiao Bridge on the Yangtze River, where most of the velocity values exceed the normal range. As a result, these AIS data can be regarded as error data.

B. AIS Data Loss in Short Time and Data Loss in Long Time
According to Table I and Fig. 1, there should be at least 5 messages in 1 min for ships with type A shipborne AIS and at least 2 messages in 1 min for ships with type B shipborne AIS. Statistics of 6,829 original AIS messages (including 3,975 for type A shipborne AIS and 2,854 for type B shipborne AIS) show that the number of messages received in 1 min for type A shipborne AIS is mainly 1 ~ 5, with an average of 3.3 messages/min, while that for type B shipborne AIS is mainly 1 ~ 2, with an average of 1.6 messages/min. The distributions of the number of messages per minute for type A and type B shipborne AIS are shown in Fig. 2 and Fig. 3. The message loss rule is shown in Table II. The statistics also show that the duration of data loss for type A shipborne AIS is mainly concentrated within 3 minutes, accounting for 75 %, and that for type B shipborne AIS is mainly concentrated within 5 minutes, accounting for 84 %. Therefore, considering the accuracy of data repair and the time spent, the short-time data loss of type A shipborne AIS is defined as the case where fewer than 3 messages are received within 1 min or no data are received for up to 3 min, and the long-time data loss is defined as no data for more than 3 min. Similarly, the short-time data loss of type B shipborne AIS is defined as the case where fewer than 2 messages are received within 1 min or no data are received for up to 5 min, and the long-time data loss is defined as no data for more than 5 min.
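The gap thresholds defined above can be sketched as a small classifier. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, timestamps are assumed to be sorted epoch seconds, and a silent interval is assumed to count as a loss only when it exceeds 1 min.

```python
# Longest silent interval (seconds) still treated as short-time loss,
# taken from the definitions in the text: 3 min for type A, 5 min for type B.
SHORT_GAP_LIMIT_S = {"A": 3 * 60, "B": 5 * 60}


def classify_gap(ais_type, gap_seconds):
    """Label the silent interval between two consecutive messages."""
    if gap_seconds <= 60:  # assumption: gaps up to 1 min are not a loss
        return "none"
    if gap_seconds <= SHORT_GAP_LIMIT_S[ais_type]:
        return "short"
    return "long"


def classify_track(ais_type, timestamps):
    """Return (start, end, label) for every lossy gap in a sorted track."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        label = classify_gap(ais_type, curr - prev)
        if label != "none":
            gaps.append((prev, curr, label))
    return gaps
```

With these thresholds, a 200 s silence is a long-time loss for type A but still a short-time loss for type B.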

A. Abnormal AIS Data Repair Model Framework in Inland River Environment
The purpose of abnormal AIS data repair is to obtain relatively complete inland river AIS data, and the specific result is the correct value of the target ship's heading, speed, longitude, and latitude at the corresponding time point. According to the definitions of error data, short-time lost data, and long-time lost data, the cubic spline interpolation method is used to repair the error data and short-time lost data, while the least squares support vector machine (LSSVM) is used to repair the long-time lost data. The overall framework for abnormal AIS data repair is shown in Fig. 4. It is necessary to determine the repair time of abnormal data before repairing AIS data. Figure 5 shows the velocity distributions of cargo ships and passenger ships in the Wuqiao Bridge section of the Yangtze River Waterway. It can be seen that the velocity of both cargo ships and passenger ships is basically 0-14 knots. According to the information update frequency of class A and class B AIS, when the ship speed is 0-14 knots, the message transmission interval of class A AIS is 10 s, and that of class B AIS is 30 s.
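One way to determine the repair times is to place them at the nominal reporting interval inside a gap. The text does not spell out how the repair times are placed, so the uniform spacing and the helper name below are assumptions; only the 10 s and 30 s intervals come from the paragraph above.

```python
# Nominal reporting intervals (seconds) at 0-14 knots, as stated in the text.
REPORT_INTERVAL_S = {"A": 10, "B": 30}


def repair_times(last_seen, next_seen, interval_s):
    """Timestamps to repair, strictly between two received messages.

    Assumption: repair points are spaced uniformly at the nominal
    reporting interval, starting from the last received message.
    """
    times = []
    t = last_seen + interval_s
    while t < next_seen:
        times.append(t)
        t += interval_s
    return times
```

For a class A ship last heard at t = 0 s and next heard at t = 35 s, this yields repair points at 10, 20, and 30 s.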

B. Cubic Spline Interpolation Model for AIS Data Repair
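For AIS error data and data loss in a short time, this paper adopts the cubic spline interpolation method for quick restoration. Its calculation model is as follows.
Given a set of AIS data points $(x_i, y_i)$, $i = 0, 1, \ldots, n$, with $x_0 < x_1 < \cdots < x_n$ within the longitude range, the cubic spline interpolation function on each segment $[x_i, x_{i+1}]$ is a cubic polynomial

$$S_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3, \quad i = 0, 1, \ldots, n-1. \tag{1}$$

With four coefficients in the cubic curve of each segment, there are $4n$ unknowns in total. The spline passes through every data point, and its first and second derivatives are continuous at the interior nodes and at the two end points of the curve track. If the curve is connected to a straight track at the end points, the second derivative of the straight track is 0, which gives the boundary conditions $S''(x_0) = S''(x_n) = 0$. If the curve is connected with a circular track at the end points, the boundary conditions are determined by the circular track instead. Substituting conditions (3), (4), (5), and (6) into (1), the curve function can be solved.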
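As an illustration of the straight-track case, the following is a minimal pure-Python sketch of a natural cubic spline, i.e., one whose second derivative is zero at both end points. It uses the standard tridiagonal elimination; the function names are hypothetical, and the code is a textbook construction rather than the authors' implementation.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Coefficients (a, b, c, d) of a natural cubic spline through (xs, ys).

    On segment i the spline is
    a[i] + b[i]*dx + c[i]*dx**2 + d[i]*dx**3, with dx = x - xs[i].
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the c coefficients.
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        rhs[i] = 3.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward elimination; c[0] = c[n] = 0 encodes the zero end curvature.
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag
        z[i] = (rhs[i] - h[i - 1] * z[i - 1]) / diag
    # Back substitution.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])
    return list(ys), b, c, d


def spline_eval(xs, coeffs, x):
    """Evaluate the spline at x (clamped to the first or last segment)."""
    a, b, c, d = coeffs
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    dx = x - xs[i]
    return a[i] + b[i] * dx + c[i] * dx ** 2 + d[i] * dx ** 3
```

To repair a removed or missing point, the spline would be fitted to the neighbouring valid points and evaluated at the position of the point to be repaired.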

C. LSSVM Regression Algorithm for AIS Data Repair
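The LSSVM is a variant of the support vector machine (SVM) proposed by J. A. K. Suykens. The LSSVM uses a least squares loss function and replaces the quadratic programming adopted by the traditional support vector machine with the solution of a linear system, which simplifies the calculation and improves the operation speed. Therefore, this paper adopts the LSSVM to repair the long-time lost AIS data. The calculation process is as follows.
Given training samples $(x_i, y_i)$, $i = 1, \ldots, N$, the LSSVM regression solves

$$\min_{w, b, e} \; J(w, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2} \quad \text{s.t.} \quad y_i = w^{T} \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N,$$

where $\varphi(\cdot)$ maps the input into a feature space, $e_i$ is the fitting error, and $\gamma$ is the regularization parameter. The corresponding Lagrangian is

$$L(w, b, e, \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^{T} \varphi(x_i) + b + e_i - y_i \right),$$

where the Lagrange multipliers $\alpha_i$ are obtained from the optimality conditions $\partial L / \partial w = 0$, $\partial L / \partial b = 0$, $\partial L / \partial e_i = 0$, and $\partial L / \partial \alpha_i = 0$. This transforms the optimization problem into a linear problem, i.e., solving

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{10}$$

where $\mathbf{1}$ is the column vector with 1 for each element, $I$ is the identity matrix, and $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$, with $K$ a kernel function that meets the Mercer criterion. The solution to equation (10) gives the following regression estimation function:

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b.$$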
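The linear-system view of LSSVM regression can be sketched in a few lines of pure Python. This is a minimal illustration of the standard formulation, not the authors' code: the RBF kernel, the parameter names `gamma` and `sigma`, and the small Gaussian-elimination solver are all assumptions made to keep the example self-contained.

```python
import math


def rbf_kernel(u, v, sigma):
    """Gaussian (RBF) kernel, one common choice of Mercer kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lssvm_fit(X, y, gamma, sigma):
    """Build and solve the (N+1)x(N+1) LSSVM system; returns (b, alpha)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]  # first row: [0, 1^T]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # kernel matrix plus I / gamma
        A.append(row)
    sol = solve_linear(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(X_train, b, alpha, sigma, x):
    """Regression estimate: sum_i alpha_i K(x, x_i) + b."""
    return b + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, X_train))
```

In a repair setting, `X` would hold features of similar historical AIS records (for example time or position) and `y` the quantity to be restored; that mapping is left open here because the text does not fix it.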

D. AIS Data Repair Evaluation Model
The evaluation index is an important part of the model. It is used to assess the quality of the AIS data repair results, i.e., whether the repair results of the model are within the acceptable range. In this paper, the square of the correlation coefficient (R2) and the mean square error (MSE) are used as evaluation indexes of the AIS abnormal data repair model. MSE is the expected value of the squared difference between the repaired value and the true value of the data, and the correlation coefficient is a statistical indicator that reflects the degree of correlation between variables. The calculation formulas of MSE and R2 are as follows.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}, \qquad R^{2} = \frac{\left( n \sum_{i=1}^{n} \hat{y}_i y_i - \sum_{i=1}^{n} \hat{y}_i \sum_{i=1}^{n} y_i \right)^{2}}{\left( n \sum_{i=1}^{n} \hat{y}_i^{2} - \left( \sum_{i=1}^{n} \hat{y}_i \right)^{2} \right) \left( n \sum_{i=1}^{n} y_i^{2} - \left( \sum_{i=1}^{n} y_i \right)^{2} \right)},$$

where $\hat{y}_i$ is the repaired value, $y_i$ is the true value, and $n$ is the number of repaired data points.
In the evaluation index, the closer the repaired value is to the true value, the smaller the MSE will be, which indicates that the repair effect of the model is better and its generalization ability is stronger. On the contrary, if the MSE is larger, the repair effect and generalization ability of the model are worse. As for R2, it indicates the degree of correlation between the repaired values and the true values. Usually, R2 is within 0 to 1, and R2 = 1 means a perfect fit.
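The two evaluation indexes can be computed as follows. This is a minimal sketch using the usual textbook definitions of the mean square error and the squared Pearson correlation coefficient; the function names are illustrative.

```python
def mse(repaired, true):
    """Mean of the squared differences between repaired and true values."""
    return sum((r - t) ** 2 for r, t in zip(repaired, true)) / len(true)


def r_squared(repaired, true):
    """Square of the Pearson correlation coefficient of the two series."""
    n = len(true)
    sr, st = sum(repaired), sum(true)
    srt = sum(r * t for r, t in zip(repaired, true))
    srr = sum(r * r for r in repaired)
    stt = sum(t * t for t in true)
    num = (n * srt - sr * st) ** 2
    den = (n * srr - sr ** 2) * (n * stt - st ** 2)
    return num / den
```

Note that the squared correlation coefficient measures linear association only: a repaired series that is a scaled copy of the true series still gives a value of 1, which is why it is read together with the MSE.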

A. Empirical Data Collection
The Wuhan Bridge section is located in the middle reaches of the Yangtze River channel and is a typical bridge section. The original AIS data in this paper mainly come from the collection points at Baishazhou Bridge, Wuhan Yangtze River Bridge, and Tianxingzhou Bridge in this section (Fig. 6). Generally, ships sail all year round on fixed routes. Therefore, the times at which they pass these three collection points are relatively regular, and similar historical data can be obtained easily. To make similar historical data easy to find, this article used the AIS data from the Changhang freight 0316 ship. AIS data of this test ship were collected for each month, including type A and type B. The acquisition receiving point of empirical data is shown in Figs. 3-5. Before AIS data can be repaired, the original AIS data must be pre-processed. The specific process is as follows:
1. Judge the type of shipborne AIS of the ship.
2. If a data point in the AIS data sample has longitude LON > 180, latitude LAT > 90, or COURSE > 359.9, it is deleted and marked as an AIS error data point that needs to be repaired.
3. If the number of data per 1 min in the AIS data sample is less than the standard number of data (3 for type A shipborne AIS, 2 for type B shipborne AIS), it is marked as a data point of AIS loss in a short time and needs to be repaired.
4. If there is no data in the type A shipborne AIS data sample for 1 to 3 minutes continuously, or in the type B shipborne AIS data sample for 1 to 5 minutes continuously, it is marked as AIS data loss in a short time and needs to be repaired.
5. If there is no data in the type A shipborne AIS data sample for more than 3 min, or in the type B shipborne AIS data sample for more than 5 min, it is marked as AIS data loss for a long time and needs to be repaired.
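Steps 2 and 3 of the pre-processing can be sketched as below. This is a hedged illustration: the function names are hypothetical, timestamps are assumed to be epoch seconds, and minutes are assumed to be counted in fixed 60 s bins, which the text does not specify.

```python
from collections import Counter

# Standard number of messages per minute from step 3 of the text.
MIN_MSGS_PER_MIN = {"A": 3, "B": 2}


def is_error_point(lon, lat, course):
    """Step 2: a point outside the valid value range is an error point."""
    return lon > 180 or lat > 90 or course > 359.9


def sparse_minutes(ais_type, timestamps):
    """Step 3: minute bins holding fewer messages than the standard number.

    Only bins with at least one message are reported; completely silent
    minutes are gaps and fall under steps 4 and 5 instead.
    """
    counts = Counter(int(t // 60) for t in timestamps)
    return sorted(m for m, c in counts.items() if c < MIN_MSGS_PER_MIN[ais_type])
```

For example, a type A track with only two messages in its second minute would have that minute flagged for short-time repair, while the same track would pass the type B check.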

B. AIS Error Data Repair
The idea of AIS error data repair is to remove the error data and then use the interpolation method to compute replacement values. In order to facilitate the comparison between the repaired value and the corresponding real value, this paper artificially sets some error data as the original data for data repair (see Table III). Entry 11 in the table is a typical error data point.
Specific repair process:
1. Data pre-processing: entry 11 in Table III, except for its time, was removed and marked as an error data point.
Analysis of results: for the above repair, the comparison of real and repaired values is shown in Table IV. Verification on a large number of data shows that the repair accuracy is more than 93 %. After the error data are repaired, the original data can be replaced with the repaired data.

C. Repair of AIS Data Loss in a Short Time
The idea of AIS short-time lost data repair is also to use the interpolation method. Different from error data, short-time lost data often require several consecutive interpolation points. Therefore, in order to obtain higher repair accuracy, the input sample is larger than that used for error data repair. Table V shows the raw data in the case of a short period of data loss. At least 4 data were lost between 08:04:11 and 08:06:52 in Table V.
Specific repair process:
1. Data pre-processing: mark the interval from 08:04:11 to 08:06:52 in Table V.
Analysis of results: it can be seen from Table VI that the repaired values of the data lost in a short time are close to the real values. After calculation, the accuracy of the repaired values is above 90 %, and the time points of the repaired data differ little from the real time points. After the repaired values are obtained, they can be inserted at the repair markers in the raw data.

D. AIS Loss Data Repair for a Long Time
The idea of repairing AIS data lost for a long time is to use the LSSVM regression fitting method and similar historical data to obtain the repaired value. The core idea of using historical data to repair lost data is to assume a functional relationship between the lost data and the historical data. So, the key to repairing lost data is to find a function that can describe this relationship. Therefore, the problem of using historical data to repair lost data is essentially a function fitting problem. The lost part of the AIS data is selected for repair. According to the time series variation rule of ship traffic flow, the AIS data of a ship are inevitably related to the AIS data in a similar navigation environment, because ships adopt a similar sailing mode in a similar environment.
The key to using the LSSVM model to repair long-time lost AIS data is to adopt an appropriate parameter optimization method and to use the historical data closest to the data to be repaired for training and testing. In Section IV, similar historical data are defined and common parameter optimization methods are compared. In this paper, the parameter optimization method of the LSSVM model is the particle swarm optimization (PSO) algorithm. In order to verify the result of data repair by comparing the repaired value with the real value, the following data to be repaired are selected artificially.
Data to be repaired: the lost data on May 16, 2018. In the corresponding figures, the blue circle represents the true value and the red star represents the predicted value. In order to improve the accuracy of the prediction results, the selected training samples all cover a larger latitude and longitude range than the data to be repaired, so the repaired data are only a part of the predicted values. Figures 19 to 22 show the repair results of the lost data on September 24, 2017, where the repaired values are again only a part of the predicted values. R2 is the squared correlation coefficient and MSE is the mean square error; both are evaluation indexes, and the closer R2 is to 1 and the closer MSE is to 0, the better the repair.
Recently, intelligent algorithms [19]-[22], including the Fuzzy model, the recurrent neural network (RNN), and Random Forest (RF), have been widely used for complex data mining. These methods exhibit a powerful ability for short-time data mining and uncertainty estimation; however, for long-time data, they may not be superior to the method proposed in this work. In order to evaluate the long-time data repair, Table VIII and Table IX compare the repair results, where the Fuzzy, RNN, and RF methods are applied directly to the long-time data without pre-processing, because a suitable pre-processing method is not always available for each algorithm. As can be seen in Tables VIII and IX, the proposed method provides better accuracy than the other three intelligent algorithms for long-time data repair. The reason is probably that the long-time data are pre-processed in a way tailored to the LSSVM. As a result, the proposed method is more suitable for long-time data repair.
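A basic PSO loop of the kind that could tune the LSSVM parameters is sketched below. Everything specific in it is an assumption: the inertia weight and acceleration constants are common textbook values, the function name is hypothetical, and the objective is left generic. In a repair setting the objective would typically be a validation error of the LSSVM as a function of its two parameters.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(position) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus pulls toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

The fixed seed makes runs repeatable, which is convenient when comparing parameter settings; the bounds would be chosen to cover plausible ranges of the regularization and kernel parameters.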