Research on Big Data Mining Technology of Electric Vehicle Charging Behaviour

Abstract: Thousands of electric vehicles (EVs), which are large in number and flexible in their use of electricity, will be connected to the power system in the near future, bringing more uncertainty to the power system. Therefore, it is necessary to study the general characteristics of EV charging behaviour. During charging, big data on the charging behaviour of EVs are generated. This paper proposes a big data mining technique based on Random Forest and Principal Component Analysis to identify and analyse clusters with different charging characteristics in these data. Experiments are conducted on the January 2018 EV charging data of Dundee, yielding the charging behaviour clusters for the workdays, weekends, and holidays of that month. Compared with the Euclidean distance method, the Random Forest algorithm proves better suited to the EV clustering problem: the clusters it produces have clearer characteristics, including the user's charging method and travel behaviour. The results show that the charging behaviour of EVs has a certain regularity and that the charging load has an obvious peak-to-valley difference that needs to be regulated.


I. INTRODUCTION
As environmental pressures and energy shortages become more serious, electric vehicles attract more and more attention because of their high energy efficiency and low emissions of pollutant gases [1]-[3]. Analysing the spatial and temporal distribution of the EV charging load is the basis for studying the impact of large-scale EV development on the power grid, the capability of EVs to participate in grid interaction, and charge/discharge control strategies [4]. However, the charging behaviours of EVs are normally random and diverse, which makes them appear complicated and hard to analyse [5].
At present, the modelling of EV charging behaviour is mainly based on the fitting of statistical data and on probability distribution functions [6]. Reference [7] proposed a calculation method that comprehensively considered the charging time distribution of different types of EVs, especially private cars, buses, and taxis. Reference [8] introduced the gravity model in traffic demand forecasting and analysed the spatial distribution of EV charging loads. However, the data used in [7] and [8] were relatively vague. Reference [9] analysed the differences in EV charging behaviour between residential and commercial areas based on demographic and traffic information. In [10], multi-agent simulation was used to establish a spatiotemporal distribution model of EV charging loads. However, the probability distribution functions used in [9] and [10] were mainly based on simulation and hardly considered the large amount of actual charging data.
Manuscript received 24 April, 2019; accepted 5 October, 2019. This paper was completed with the support of the Electric Power Research Institute, State Grid Tianjin Electric Power Company (KJ17-1-02). Dundee City Council provided data support.
Because the results of the two above-mentioned methods may deviate considerably from actual behaviour, this paper develops a third method of modelling EV charging behaviour, which is based on measured charging information and big data mining technology. The current trend shows that the functions of charging facilities are constantly updated, and more types of charging data will be recorded to form Big Data on the charging behaviour of EVs in a specific region [11]. In the absence of human intervention, the collected EV charging data samples usually contain information such as the charging start time, end time, charging energy, and charging location, but are not marked with a clear category. Thus, the EV charging data samples are Unlabelled Data. However, studies of the regularity and clusters of EVs from the perspective of measured charging information are quite lacking. Currently, the classification of electric vehicles is mainly based on the types of vehicles. Reference [12] classified the charging modes of hybrid EVs by considering the differences in their structures. It did not cover all types of EVs and did not consider the differences in the behaviour patterns of EV users. However, if the actual measured charging information, covering all kinds of EVs in a certain region, can be fully used, the accuracy of the charging behaviour description will be greatly improved, which is beneficial for managing the charging behaviour of EVs.
The big data mining technology used in this paper is mainly based on Principal Component Analysis (PCA) [13] and Random Forest (RF). The Random Forest algorithm, designed by L. Breiman in the early 21st century, is one of the most successful methods currently available for processing Big Data [14]. The RF algorithm is a machine learning method that uses multiple decision trees to train on and predict samples [15]. It is capable of highly parallelized processing and can meet the clustering requirements of high-dimensional and large-sample data sets in the era of Big Data. At the same time, the RF algorithm is relatively robust to over-fitting and can also assess the importance of variables [15], [16]. Besides, it has been applied successfully in a variety of practical areas. Owing to these characteristics and advantages, Random Forest is very suitable for cluster analysis of EV behaviour based on Big Data. Since the EV charging data samples are Unlabelled Data, no category labels are available for training. Thus, Random Forest can only be trained from the original sample set, and the problem belongs to Unsupervised Learning.
The purpose of this paper is to establish a big data mining technology for charging behaviour of electric vehicles. PCA will be used to reduce the dimensions of the data. Cluster analysis will be based on the Random Forest algorithm and on the EV charging data of Dundee in the UK. The application results of the big data mining model will be given.

A. Model Framework
As shown in Fig. 1, the big data mining model can be divided into four parts. Firstly, the original data are cleaned and standardized. Secondly, PCA is applied to reduce the dimensionality of the big data mining problem without losing much EV charging information. Thirdly, the data are clustered according to the characteristics of EV charging behaviour, with Random Forest performing the clustering. Finally, the clusters of EV charging behaviour are obtained after the correlation analysis of the data samples has been carried out by RF.

B. Principal Component Analysis Algorithm Flow
Principal Component Analysis is an unsupervised dimensionality-reduction method rather than a Deep Learning method. It extracts the important sample features so as to reduce the number of features while still retaining most of the information of the original samples.
In order to obtain the principal components of the EV charging data set, the following calculation steps are required:
Step (1): Standardize the raw EV charging data and delete the erroneous data.
Step (2): Establish the correlation coefficient matrix between variables, $R = (r_{ij})_{m \times m}$, where $r_{ij}$ is an element of $R$ and $m$ is the number of features of each sample:
$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}},$$
where $x_{ki}$ is the value of the $i$th charging characteristic in the $k$th charging event, $\bar{x}_i$ is the average value of the $i$th charging characteristic, and $n$ is the number of charging events.
Step (3): The cumulative contribution rate of the first $N$ principal components, $C_N$, is calculated as
$$C_N = \frac{\sum_{i=1}^{N} \lambda_i}{\sum_{i=1}^{m} \lambda_i},$$
where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ are the eigenvalues of $R$. It is commonly considered that the first $N$ principal components represent most of the information of the original features when their cumulative contribution rate reaches 85 % to 95 %.
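The PCA steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the use of Jacobi rotations to obtain the eigenvalues are assumptions made only to keep the example self-contained.

```python
import math

def correlation_matrix(samples):
    """Correlation matrix R (m x m) of n samples with m features each.
    Assumes that no feature is constant (otherwise a norm would be zero)."""
    n, m = len(samples), len(samples[0])
    means = [sum(row[i] for row in samples) / n for i in range(m)]
    centred = [[row[i] - means[i] for i in range(m)] for row in samples]
    norms = [math.sqrt(sum(row[i] ** 2 for row in centred)) for i in range(m)]
    return [[sum(row[i] * row[j] for row in centred) / (norms[i] * norms[j])
             for j in range(m)] for i in range(m)]

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # column update: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # row update: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def cumulative_contribution(eigenvalues):
    """C_N for N = 1..m, given eigenvalues sorted in descending order."""
    total = sum(eigenvalues)
    running, rates = 0.0, []
    for value in eigenvalues:
        running += value
        rates.append(running / total)
    return rates

def components_needed(rates, threshold=0.85):
    """Smallest N whose cumulative contribution rate reaches the threshold."""
    return next(k + 1 for k, rate in enumerate(rates) if rate >= threshold)

# Toy charging events (hypothetical values): start hour, duration [h], energy [kWh]
events = [[8.0, 1.0, 10.0], [9.0, 2.0, 19.0], [18.0, 0.5, 6.0],
          [20.0, 3.0, 31.0], [7.5, 1.5, 14.0]]
R = correlation_matrix(events)
eigenvalues = symmetric_eigenvalues(R)
rates = cumulative_contribution(eigenvalues)
print(rates, components_needed(rates, threshold=0.85))
```

Because the diagonal of $R$ is all ones, the eigenvalues sum to $m$, which gives a quick sanity check on any implementation.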

C. Random Forest Algorithm Flow
Random Forest is an ensemble algorithm based on the Bagging algorithm and the Classification and Regression Tree (CART) algorithm [17]. Bagging is short for Bootstrap Aggregating and is based on Bootstrap sampling [18]. The core idea is to use the results of Bootstrap sampling to construct a number of independent classifiers.
The implementation process of Random Forest is as follows.
Step (1): Assume that the EV charging training data set $S$ has $M$ features. It is repeatedly sampled with replacement by Bootstrap sampling to obtain the randomly generated training data sets $S_1, S_2, \ldots, S_n$. These training data sets have $m$ ($m \le M$) features. The probability that a given sample is not contained in a particular new data set is about 36.8 %, because a sample is missed by all $K$ draws with probability $(1 - 1/K)^K \approx e^{-1}$ when the number of samples $K$ is large. These left-out samples are called Out of Bag (OOB) data.
Step (2): Each decision tree is constructed by the binary recursive partitioning method of the CART algorithm, based on the corresponding training data set $S_1, S_2, \ldots, S_n$. The best split (e.g. the one giving the largest decrease in Gini impurity) is chosen to split each node. The splitting process is repeated until a stopping condition is met. The trees are not pruned, and the classifiers $C_1, C_2, \ldots, C_n$ are obtained, one for each training data set.
Step (3): Pass the original EV data set $X$ that has been separated out down the unpruned decision trees obtained in Step (2) for discrimination. Then, obtain the distribution of the samples over the terminal nodes of each decision tree.
Step (4): Count the samples at the terminal nodes of each decision tree. If two samples appear at the same terminal node, the correlation between the two increases.
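The Bootstrap sampling in Step (1) and the figure of about 36.8 % can be checked with a short Python sketch. The function names and the fixed random seed are illustrative assumptions, and the sample size is simply the number of valid charging events reported for the Dundee data set.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap training set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def out_of_bag(n_samples, drawn):
    """Indices that never appear in the bootstrap draw (the OOB samples)."""
    seen = set(drawn)
    return [i for i in range(n_samples) if i not in seen]

def expected_oob_fraction(n_samples):
    """Probability that one given sample is missed by all n_samples draws."""
    return (1.0 - 1.0 / n_samples) ** n_samples

rng = random.Random(0)   # fixed seed, used only for reproducibility
n = 5654                 # number of valid charging events in the data set
drawn = bootstrap_indices(n, rng)
oob_fraction = len(out_of_bag(n, drawn)) / n
print(expected_oob_fraction(n), math.exp(-1), oob_fraction)
```

The analytic value approaches $e^{-1}$ as the sample size grows, and a single simulated draw should land close to it.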

D. Correlation of Random Forest Data Samples
The correlation between any two samples in the EV charging big data is defined as the fraction of decision trees in which the two samples fall into the same terminal node. Assuming that the total number of EV charging data samples is $n$, a similarity matrix of $n \times n$ dimensions can be constructed. Each element of this matrix lies in [0, 1] and represents the similarity of the two corresponding data samples over the trees of the Random Forest. Similar data samples are more likely to gather at the same terminal node of a decision tree than dissimilar ones [16]. The correlation of RF data samples can therefore be used for cluster analysis.
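As a minimal sketch of this definition, the similarity matrix can be built from the terminal node reached by each sample in each tree. The input layout `leaf_ids[t][i]` and the toy values are hypothetical; in practice they would come from a trained forest.

```python
def similarity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node reached by sample i in tree t.
    Returns the n x n matrix of the fraction of trees in which two samples
    share a terminal node."""
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for i in range(n):
            for j in range(i, n):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
                    if i != j:
                        counts[j][i] += 1
    return [[counts[i][j] / n_trees for j in range(n)] for i in range(n)]

# Three toy trees and four samples (hypothetical terminal node labels)
leaf_ids = [[0, 0, 1, 1],
            [0, 1, 1, 1],
            [2, 2, 3, 3]]
similarity = similarity_matrix(leaf_ids)
for row in similarity:
    print(row)
```

Every entry lies in [0, 1], the diagonal equals 1, and the matrix is symmetric, which matches the properties stated above.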

1) Operation Principles of RF Clustering
The similarity of data samples can be used as the input of a traditional clustering operation in the clustering problem of EV charging data sets. However, not every multi-structured data set exhibits a cluster structure [19]. The RF algorithm can examine EV charging data sets through unsupervised machine learning and does not require prior assumptions about the clustering characteristics of the data set [20].
The main steps of unsupervised machine learning are as follows. Firstly, the initial EV charging big data are taken as Data Set 1. Secondly, the values of each parameter are independently resampled across the data samples to generate Data Set 2. Independent resampling can be done in several ways. In this paper, each parameter is sampled at random according to its empirical marginal distribution. Data Set 1 and Data Set 2 together form a mixed data set. The parameters of Data Set 2 are independently distributed, but each parameter in Data Set 2 has the same univariate distribution as the corresponding parameter in Data Set 1. Thus, Data Set 2 destroys the dependence structure of Data Set 1. When the trees are trained, Random Forest has to distinguish Data Set 1 from Data Set 2 within the mixed data set. This two-class problem can be handled by the RF algorithm. The biggest benefit of describing the task as a two-class problem is that it makes clustering feasible.
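The construction of the mixed data set can be sketched in Python as follows. This is only an illustration under the assumption that each parameter of Data Set 2 is drawn independently from the observed values of that parameter in Data Set 1; the function names, class labels, and toy data are hypothetical.

```python
import random

def make_synthetic(data, rng):
    """Data Set 2: sample every feature independently from its empirical
    marginal distribution, which breaks the dependence between features."""
    n = len(data)
    columns = list(zip(*data))
    return [[rng.choice(column) for column in columns] for _ in range(n)]

def mixed_data_set(data, rng):
    """Stack Data Set 1 (label 1) and Data Set 2 (label 2) into one labelled set."""
    synthetic = make_synthetic(data, rng)
    samples = [list(row) for row in data] + synthetic
    labels = [1] * len(data) + [2] * len(synthetic)
    return samples, labels

rng = random.Random(0)  # fixed seed, used only for reproducibility
# Toy charging events (hypothetical values): start hour, duration [h], energy [kWh]
data = [[8.0, 1.0, 10.0], [9.0, 2.0, 19.0], [18.0, 0.5, 6.0], [20.0, 3.0, 31.0]]
samples, labels = mixed_data_set(data, rng)
print(len(samples), labels)
```

A classifier trained on `samples` and `labels` then has to separate the observed events from the synthetic ones, and the terminal nodes shared by observed events supply the similarity used for clustering.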

2) Multidimensional-Scale-based Clustering Results Display
Multidimensional scaling analysis is one of the ways to help Random Forest analyse the characteristics of a data set. This method can be used to represent the degree of correlation between EV charging data samples. Through appropriate dimensionality reduction, the EV charging data set can be represented by coordinates in a low-dimensional coordinate system. The distance between any two points in the coordinate system reflects the correlation between the two corresponding EV charging samples, which helps to explore the factors that affect the correlation of the EV charging samples.
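The dimensionality reduction described here can be sketched with classical multidimensional scaling in plain Python. The sketch is not the routine used in the paper: it assumes a symmetric dissimilarity matrix whose dominant eigenvalues are positive, and it uses simple power iteration with deflation, which may converge slowly when eigenvalues are close. How the RF similarity is converted into a dissimilarity (for example 1 minus the similarity, or its square root) is likewise an assumption left to the caller.

```python
import math
import random

def classical_mds(dissimilarity, dims=2, iterations=500, seed=0):
    """Embed n points in `dims` dimensions from an n x n dissimilarity matrix."""
    n = len(dissimilarity)
    squared = [[dissimilarity[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J = I - (1/n) * ones
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iterations):  # power iteration for the dominant eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        # Rayleigh quotient gives the eigenvalue that belongs to v
        eigenvalue = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))  # negative eigenvalues are clamped
        for i in range(n):
            coords[i][d] = scale * v[i]
        for i in range(n):  # deflation: remove the component just found
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Toy dissimilarities of three samples (a 3-4-5 right triangle)
dissimilarity = [[0.0, 3.0, 4.0],
                 [3.0, 0.0, 5.0],
                 [4.0, 5.0, 0.0]]
coords = classical_mds(dissimilarity)
print(coords)
```

For dissimilarities that are exactly Euclidean, as in the toy triangle, the distances between the returned points reproduce the input matrix; for RF-based dissimilarities the two-dimensional map is only an approximation.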
The RF algorithm does not need to specify the distribution characteristics of the parameters before the operation and can estimate the ability of each parameter to influence the correlation prediction results. The cross-validation within the RF algorithm can be used to evaluate the error rate of the correlation prediction. This kind of evaluation has a high accuracy.

A. Data Source
The electric vehicle charging data used herein were collected from 29 charging stations or charging piles in Dundee. The charging locations in Dundee are shown in Fig. 2. The EV charging data are provided by Dundee City Council (https://data.dundeecity.gov.uk). The data collection period runs from January 1st to January 31st, 2018. The observed variables include the charging start time, the charging end time, the charging duration, the charging energy, and the charging location. After removing the invalid charging records with incomplete parameters or obvious parameter errors, 5654 valid charging events remain, including 4220 on workdays, 1180 on weekends, and 254 on holidays.
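The cleaning and the split into workdays, weekends, and holidays can be sketched as follows. The record field names, the validity rules, the use of the start date to assign the day type, and in particular the set of holiday dates are assumptions made for illustration; the text does not state which dates were treated as holidays.

```python
from datetime import date, datetime

# Assumption: 1 and 2 January are used here purely as example holiday dates.
ASSUMED_HOLIDAYS = {date(2018, 1, 1), date(2018, 1, 2)}

def day_type(day):
    """Classify a calendar day as 'holiday', 'weekend', or 'workday'."""
    if day in ASSUMED_HOLIDAYS:
        return "holiday"
    return "weekend" if day.weekday() >= 5 else "workday"

def clean_events(records):
    """Keep only records with complete, plausible parameters and add the
    charging duration (hours) and the day type of the start date."""
    cleaned = []
    for record in records:
        start = record.get("start")
        end = record.get("end")
        energy = record.get("energy_kwh")
        if start is None or end is None or energy is None:
            continue  # incomplete parameters
        duration_h = (end - start).total_seconds() / 3600.0
        if duration_h <= 0 or energy <= 0:
            continue  # obvious parameter errors
        cleaned.append({**record, "duration_h": duration_h,
                        "day_type": day_type(start.date())})
    return cleaned

# Hypothetical records; the field names are not taken from the Dundee data set.
records = [
    {"start": datetime(2018, 1, 1, 8, 0), "end": datetime(2018, 1, 1, 9, 30), "energy_kwh": 12.0},
    {"start": datetime(2018, 1, 6, 10, 0), "end": datetime(2018, 1, 6, 10, 0), "energy_kwh": 5.0},
    {"start": datetime(2018, 1, 10, 18, 0), "end": None, "energy_kwh": 7.0},
    {"start": datetime(2018, 1, 13, 22, 0), "end": datetime(2018, 1, 14, 6, 0), "energy_kwh": 20.0},
]
cleaned = clean_events(records)
print([(event["day_type"], event["duration_h"]) for event in cleaned])
```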

B. Results of Principal Component Analysis
Based on PCA, the original five observed variables are converted into five new principal components. Their contributions to expressing the information of the entire charging data set are shown in Fig. 3. The first four principal components represent 93 % of the information and characteristics of the original data. Therefore, the fifth component is not needed in the clustering analysis. In practice, this reduction lowers the computation time of the RF algorithm and the computer memory it occupies.

C. Clustering Implementation Process
In this paper, the R programming language is used to implement the clustering function of the RF algorithm. The coordinates of the classical multidimensional scale (CMS) are used to characterize the relationship between the charging behaviours. The specific process is as follows.
Step (1): Set the number of forests and the number of decision trees in each forest. In this paper, they are set to 3 and 500, respectively.
Step (2): Set the number of parameters tried at each node split of a decision tree. This integer is close to one-half of the total number of parameters and is set to 2 in this paper.
Step (3): Enter the charging behaviour data into the random forest algorithm, which is implemented using the RFdist function in RStudio software. The importance of each parameter (Gini index) is also obtained.
Step (4): Use the cmdscale function in RStudio software to generate the two-dimensional coordinates of each charging behaviour in the classical multidimensional scales.
Step (5): Draw the classical multidimensional scales of all charging behaviours based on the two-dimensional coordinates obtained in Step (4). The distance between two points represents the dissimilarity of the two charging behaviours. Each point is represented by $(x, y)$.
Step (6): According to the shape feature of the classical multidimensional scale obtained in Step (5), the image is divided, and the dense points are classified into the same class. Thereby, the clustering results of the charging behaviour of the EV users are obtained.
In this paper, the charging behaviours of workdays, weekends, and holidays are clustered separately and the corresponding classical multidimensional scales are obtained, respectively, as shown in Fig. 4.

D. Clustering Results
According to the shape features of the CMSs of the three periods, clustering is performed separately for each period, so that the clustering results of the three periods can be obtained. Figure 5 shows the distribution characteristics of the parameters of the different clusters. The abscissas of Fig. 5 represent time or charging energy, and the ordinates represent the number of times the corresponding abscissa value appears in the big data. The clustering results are summarized in Table I.

E. The Superiority of Random Forest Clustering
In this section, the effectiveness and superiority of the Random Forest algorithm for clustering EV charging behaviour are analysed. The Euclidean distance method is used to express the correlation between all charging behaviours on the workdays of January 2018. The result is shown in Fig. 6, in which more than 10 clusters can be seen. The clustering results obtained by the Euclidean distance method are therefore too complicated. Moreover, according to the evaluation of the importance of the feature parameters in R, the scatter distribution obtained by the Euclidean distance method depends on only one feature parameter and is independent of the remaining ones. In addition, as can be seen from the values on the axes, the distances between points are much larger than those obtained by Random Forest. Therefore, the clusters obtained by the Euclidean distance method are diverse, but not suitable for reflecting the general characteristics of charging behaviour. By contrast, the scatter plot obtained by RF clustering is regular and dense. The points form strip-shaped bands, which makes it easy to segment the image and identify the categories. Besides, the abscissa and ordinate values of the points span a small range. This is consistent with the characteristics of the RF algorithm, i.e., the correlation values of the data samples generated by the Random Forest algorithm lie between 0 and 1. This also makes the CMS of RF more versatile. In summary, the Random Forest algorithm has obvious advantages in the cluster analysis of EV charging behaviour.

F. Analysis of EV Cluster General Characteristics
According to the results of Table I, the information of EV charging data is further mined in this section. Table II summarizes the links between EV clusters and social behaviours.
As can be seen from Table II, the characteristics of each cluster are clear and closely related to different social behaviours. This also shows that the Random Forest algorithm effectively clusters the charging behaviour. However, there are fewer clusters for holidays and weekends, and their characteristics are not as obvious as those of the workday clusters. Because there are fewer data samples for holidays and weekends, fewer clusters objectively exist. Also, the behaviour patterns of EV users on holidays and weekends differ from those on working days. In addition, owing to the small amount of data on holidays, the images are scattered and irregular, which is not conducive to clustering based on CMS. However, the number of holidays is much smaller than the number of workdays over the whole year. So, even if the holiday clustering is inaccurate, it will not have a large impact on the overall results.

Table II. Links between EV clusters and social behaviours.
Period | Cluster | Characteristics
Workdays | 1 | These users often work at night, charging mainly at midnight and dawn. Their charging duration is very short, generally no more than an hour, but the charging energy is relatively large.
Workdays | 2 | These users are basically active during daytime, mainly charging after dawn. They charge longer, but charge less.
Workdays | 3 | This type of user mainly starts charging in the afternoon. Their charging time is very short, but the charging energy is relatively large.
Workdays | 4 | This type of user mainly starts charging in the evening. Charging ends the next morning. Therefore, the charging time is longer. However, the amount of charge is relatively small.
Workdays | 5 | This type of user mainly starts charging in the morning. However, the charging end time is irregular. The charging duration is very long and can even span a few days, but the charging energy is relatively small.
Weekends | 1 | This type of user charges mainly at midnight and dawn. Their charging duration is very short, but the charging energy is relatively large.
Weekends | 2 | This type of user starts charging at any time of the day and takes a long time to charge. However, the amount of charge is relatively small.
Weekends | 3 | These users start charging mainly after dawn and their charging duration is very short. However, the amount of charge is relatively large.
Holidays | 1 | This type of user mainly starts charging in the morning. They charge for a long time, even for a few days. However, the amount of charge is relatively small.
Holidays | 2 | The distribution of charging start times for such users is more dispersed. The charging duration is very short, but the charge is relatively large.

G. Analysis of EV Charging Loads
It can be seen from Table I that Workday Cluster 2, whose main charging period is the morning and afternoon, has the highest proportion and consumes the most electricity. This creates the first charging peak of the day. The charging behaviour of Workday Cluster 3 and Cluster 4 is concentrated in the middle of the night and is mainly based on fast charging. These two clusters create the second charging peak of the day. In addition, the trends of charging behaviour on weekends and holidays are roughly similar to those on workdays. However, the number of charging events and the charging energy per day on weekends and holidays are lower than on workdays, which indicates that EVs are charged more frequently on workdays. Therefore, it is necessary to distinguish the charging patterns of the three periods.
Furthermore, the charging load of electric vehicles has a distinct peak-to-valley difference. If the charging load of the EVs can be controlled to effectively reduce the peak load, the investment in power generation and the grid can be reduced, and the operating cost of the grid may also be cut.
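The peak-to-valley difference can be made concrete with a small Python sketch that accumulates charging sessions into an hourly load profile. The sketch assumes that each session draws constant power over its duration and that the profile wraps around midnight; the session tuples are hypothetical and are not taken from the Dundee data.

```python
import math

def hourly_energy(sessions):
    """Energy [kWh] delivered in each of the 24 clock hours.
    Each session is (start_hour, duration_h, energy_kwh); constant charging
    power over the session is an assumption of this sketch."""
    load = [0.0] * 24
    for start, duration, energy in sessions:
        if duration <= 0:
            continue
        power = energy / duration
        t, end = start, start + duration
        while t < end - 1e-12:
            next_edge = min(math.floor(t) + 1, end)  # next full hour or session end
            load[int(math.floor(t)) % 24] += power * (next_edge - t)
            t = next_edge
    return load

def peak_to_valley(load):
    """Difference between the highest and the lowest hourly value."""
    return max(load) - min(load)

# Hypothetical sessions: a two-hour morning charge and a late-night charge
sessions = [(8.0, 2.0, 10.0), (23.5, 1.0, 4.0)]
load = hourly_energy(sessions)
print(load, peak_to_valley(load))
```

Shifting sessions from peak hours into valley hours lowers `peak_to_valley(load)` without changing the total energy, which is the effect that controlled charging aims for.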

V. CONCLUSIONS
This paper proposed a big data mining technique based on Random Forest and Principal Component Analysis for electric vehicle charging behaviour to identify and analyse different types of charging behaviour characteristics. Experiments were carried out using Dundee's January 2018 charging data to obtain the charging behaviour clusters for the workdays, weekends, and holidays. It was found that each cluster has relatively clear characteristics and that the users' charging methods can be inferred by analysing the clustering results. The conclusion is that the charging behaviour of EVs has a certain regularity. It is necessary to control EV charging behaviour because doing so can help to reduce the reconstruction and operating costs of the power grid.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.