Vehicle Make Detection Using the Transfer Learning Approach

1 Abstract —Vehicle detection and classification is an important part of an intelligent transportation surveillance system. Although car detection is a trivial task for deep learning models, studies have shown that when vehicles are visible from different angles, more research is relevant for brand classification. Furthermore, each year, more than 30 new car models are released to the United States market alone, implying that the model needs to be updated with new classes, and the task becomes more complex over time. As a result, a transfer learning approach has been investigated that allows the retraining of a model with a small amount of data. This study proposes an efficient solution to develop an updatable local vehicle brand monitoring system. The proposed framework includes the dataset preparation, object detection, and a view-independent make classification model that has been tested using two efficient deep learning architectures, EfficientNetV2 and MobileNetV2. The model was trained on the dominant car brands in Lithuania and achieved 81.39 % accuracy in classifying 19 classes, using 400 to 500 images per class.


I. INTRODUCTION
The ability to recognize a vehicle from a video can be very helpful, from traffic monitoring to following the fleeing driver from a crime scene. So far, Automatic License Plate Recognition systems have been mostly utilized to identify the vehicle under investigation. However, if the criminal is driving fast and the recorded license plates are of poor quality, the main tool for locating the suspect is monitoring a car with certain attributes via city cameras and manual searches by officers. Nonetheless, suspects can swap license plates, or witnesses can write down only part of the license plate or recall only a few facts about the car, such as color, make, or model. As a result, an automatic vehicle characteristics recognition system becomes critical in assisting officers and improving the intelligent transportation system [1].
The potential of classifying key characteristics of a vehicle is constantly being examined due to the increasing use and research of convolutional neural networks (CNNs) and the development of the graphics processing unit. CNN has already been used to detect and track cars or to count the number of cars passing by on the road [2], [3]. However, due to the most severe CNN constraint of requiring a large dataset, the transfer learning approach [4] was designed to tackle limited data resources. Deep learning technology is already helping individuals to reduce eye tracking on multiple screens, so it can also be used to determine the attributes of a car, thus helping automate toll collection, intelligent parking systems [5] or law enforcement [6]. Filtering video segments by specific car attributes can be simplified using a vehicle classification model. The task of identifying vehicle characteristics has unique challenges and problems. The first is the impact of weather, time of day, and varied lighting exposures, all of which can make the model difficult to perform, as the classes are very similar [1]. The lack of open-source datasets that include a range of commonly used vehicles, viewing angles, diverse image quality, and data with more than several images per class is another barrier [7]- [9]. And the third challenge is that newer vehicle models are released on a regular basis, requiring the deep learning model to be retrained with even small amounts of data. According to statistics, 42 new models were created in the United States alone in 2020, 34 in 2021, and 62 models are expected in 2025 [10], demonstrating an upward trend in car manufacturing. To address the last two challenges, a country-specific dataset is collected and vehicle brand classification is explored using a transfer learning technique that does not require a large dataset for training, making it appropriate for constantly upgrading systems.
The aim of this research is to create a retrainable vehicle detection and classification framework that can accurately classify vehicle images from various viewpoints to simulate the effect of city cameras. To present the most popular cars on the roads, the dataset of vehicle brands is taken from the Lithuanian car marketplace website. This research uses architectures designed for lower computing power, MobileNetV2 and EfficientNetV2, to ensure that the solution can be implemented in real time and retrained with new vehicle models. The pre-trained feature vectors of defined architectures are fine-tuned, and the study examines multiple combinations of dense classification layers to suggest a novel architectural design. Finally, the generalization abilities of the model are evaluated by comparing the classification performance of various distribution datasets, and examples of the final proposed system are presented. At the end of the study, the findings are discussed, and conclusions and recommendations for further research are presented.

II. COMPUTER VISION LITERATURE REVIEW
Convolutional Neural Networks are widely used in computer vision tasks to classify objects such as cars, pedestrians, and more [11]. To learn how to replicate human intellect, this supervised machine learning technique requires labeled data [12]. The basic CNN consists of convolutional, polling, and fully connected layers that classify the final classes, where the convolutional layer extracts low-and high-level visual features using a weightsharing mechanism [11].
A state-of-the-art CNN model named "You Only Look Once" (YOLO) demonstrated exceptional performance in real-time image processing due to its high performance and accuracy [13]. The third version of the model is the latest and includes 53 convolutional layers and 23 residual layers. The YOLOv3 model has already been pre-trained on the COCO dataset, which covers 80 object categories, including the means of transportation such as bicycles, cars, buses, trains, trucks, and so on. Furthermore, it has been used successfully in vehicle detection applications and has improved recognition capabilities for tiny objects [13].

A. Transfer Learning
CNNs, on the other hand, have the disadvantage of being strongly reliant on large amounts of data to avoid overfitting [14] and hence requiring massive computer resources [15]. As a result, a machine learning technique called "transfer learning" [4] was established, in which a previously trained model with stored weights is used to train a new model with a smaller similar dataset. Because pre-trained models have already learned to recognize edges, colors, and patterns, the weights of the feature extraction network can be frozen to convey this information [16]. Finally, the last layers can be used to obtain additional detailed information about the classified classes [15]. Another transfer learning method is fine-tuning, which involves retraining the layers with initialized pre-trained model weights rather than freezing them [17]. In terms of time and computational resources, freezing layers is the most effective strategy, whereas training from scratch is the most expensive [18].

B. Classification of Vehicle Attributes
Vehicle classification is one of the most interesting tasks in an intelligent transportation system. It has been studied using a logo-based technique, in which the vehicle logo is first localized and then classification is performed [12]. Classification of car brands and models based on front images reaches accuracies of 98.22 % (49 classes) [5], 98.7 % (107 classes) [19], 97.89 % (35 classes) [1], and 96.33 % (766 classes) [20]. However, the region of interest is usually strongly cropped to remove the background and requires a certain angle image; this strategy, for example, is suited for parking lot cameras. There have been successful experiments to determine the type of vehicle, such as a car, van, truck, or another, with an accuracy of up to 99.68 % using a modified pre-trained ResNet-152 model [3] or 76.28 % using ResNet34 [4]. Another example shows that the AlexNet architecture can be used to produce a perfect classifier to assess the position of a vehicle [16]. This type of study can assist in the development of multiple models for different viewpoints to improve the overall performance of model recognition. The classification of vehicle brands using the logo has been of interest for many years, but the challenge of detecting cars that do not rely on this information only has remained largely unsolved [21].
Another type of vehicle classification is the appearancebased classification, which uses elements such as lights, windows, and the vehicle body to classify the vehicle [12]. To date, the most used car data from different viewpoints are the Stanford car dataset, which consists of 196 model classes with 24 to 68 images in the training dataset and the same amount of test data as it has been split 50/50 ratio [6], [15], [16], [18]. The publications that use these data do not take into consideration the impact of unbalanced data, although the sample size of the test data is almost three times smaller for certain classes. However, using this dataset, the fully trained GoogLeNet architecture [9] and the MobileNet transfer learning model [22] had the highest accuracies of the classification of the make and model, at 80 % and 78.27 %, respectively. In 2020, a unique study was carried out to actualize the classification of vehicle models using real surveillance cameras, and the model achieved an accuracy of 62.09 % in real environmental settings [23]. This, together with the increasing number of vehicles and city cameras [7], shows that vehicle classification is relevant and more research is needed to make it adaptive to real-world scenarios.

C. EfficientNet and MobileNet Architectures
This study selected two of the most efficient classification architectures. The first architecture is the lightweight MobileNetV2 [24], which, according to the reviewed articles, achieves one of the highest accuracies. It employs a slow downsampling strategy and more layers have large feature maps, resulting in detail preservation [25]. Another selected architecture is EfficientNetV2, a newer architecture network that uses a compound scaling approach to uniformly scale and balance the depth, width, and resolution of the network to improve accuracy and efficiency [26]. The MobileNetV2 architecture is pre-trained using the ImageNet dataset, which contains 1000 classes covering high-level categories such as animals and vehicles [27]. Meanwhile, EfficientNetV2 is pre-trained on ImageNet-21k, which has 21,841 classes, and is simply a larger version of ImageNet [27]. The pre-training on ImageNet-21k is claimed to provide better performance than the use of ImageNet1k [28].

III. PROPOSED VEHICLE MAKE DETECTION SYSTEM
The suggested system architecture is illustrated in Fig. 1. Here, the preparation phase involves the preparation of data and the training of the classifier, while the usage phase involves the adaptability of the model to the use of external applications containing traffic camera data. The final approach can locate the car in the frame, classify the make for each bounding box, and be leveraged for rendering or other purposes. Each of the three phases is described in further detail:  Data Preparation. The source material contains images of vehicles for sale posted on the Internet. These images contain a lot of noise, such as the gearbox, steering wheel, engine, car interior, etc. The pre-trained YOLOv3 state-of-the-art detection model was utilized as an auto labeler to filter out unnecessary data and crop the region of interest. To ensure that no noisy data was left out, the photographs were reviewed using the designed interface to exclude unrepresentable images. There is also a problem with data-class imbalance; thus, only classes with at least 400 images are taken, and those with more than 500 images are truncated. Image pre-processing includes resizing, converting to grayscale, and normalizing such that the image input matches the pretrained data. Subsequently, data augmentation was used to create a more diversified representation of the vehicle's sides and dimensions. The last step includes data separation training and testing.  Classification Model. The classification task is the central objective of the study. The most efficient architectures of the pre-trained models, MobileNetV2 and EfficientNetV2, were investigated by fine-tuning their feature vectors and using different dense classifiers. With a limited dataset, these models could easily be retrained. See the subsection "Image Classification Using Various Classifiers" for further information.  Car Detection and Make Classification. The YOLOv3 model, which has previously been used for vehicle image filtering, is combined with the best classifier. This can be used for vehicle monitoring via video cameras. Image Classification Using Various Classifiers. The suggested classification architecture has two primary components: reusing a previously trained CNN model and adding a new classifier. For initial feature extraction, selected model architectures with pre-trained weights are used. The MobileNetV2 architecture was trained using a considerably larger and more general ImageNet dataset to obtain the vector of image features. With a total of 2257984 parameters, the vector output size is 1280. Meanwhile, the feature vector of the EfficientNetV2 neural network is trained on the ImageNet-21k dataset, resulting in a vector output size of 1536 and nearly 6 times more parameters: 12930622. Although the learning time increases, the feature vector is fine-tuned with new classifier layers to obtain greater accuracy. The classifier receives the resulting dense 1-D tensor feature vector output and learns to categorize the input image according to the brands of the vehicle. Figure 2 illustrates the classification architecture by reusing pretrained feature vectors and adding new classifier layers. When the image feature vector is extracted using a grid with pre-trained weights rather than starting from a random weight initialization, then the generated classifier is based on a fully connected grid. In this way, the knowledge already acquired can be re-used instead of training from scratch. The classifier can be as simple as a dense layer with a Softmax activation function to obtain the probability of each class. This classification is called the "baseline" or "version 1". Different classifier versions are tested for greater accuracy. The 2 nd version adds weight adjustment L1 and L2 to reduce the number of features. A batch normalization layer is added in the 3 rd version to allow the network to train faster. Dropout at a rate of 0.5 is utilized in the 4 th version to avoid overfitting and learn more robust features. And different combinations of dense layers are then introduced with the rectified linear block activation function. All classifier architectures are presented in Fig. 3.

IV. IMPLEMENTATION RESULTS
This section describes the steps in building a vehicle monitoring system. The best classification architecture, which outperforms prior studies, is proposed. A detailed assessment of the classifier is provided, examining each class performance and providing some visual explanations using gradient-weighted class activation mapping. Finally, the generalizability of the model is investigated.

B. Auto Labeler
A pre-trained YOLOv3 model on the COCO dataset with 80 classes and an input image size of 416 was used to detect image files that do not show an exterior of the car. This model was designed to search for classes such as car, bus, or truck with a preset intersection over a union threshold of 0.5 and a score threshold of 0.89. If the area of the bounding box was greater than 33 % of the image, it was considered a vehicle and cropped. A sample result is shown in Fig. 5.

C. Validation Interface
Vehicle search and image cropping operations using the YOLOv3 model did not yield perfect results. The data contained images of the car with the doors open or very close-up shots that had to be manually removed. To speed up manual work, a graphical user interface has been developed using the Tkinter Python library. Using the interface, it is possible to view images and delete poorly selected photos at the touch of a button. Examples of acceptable and unacceptable images are shown in Fig. 6. Fig. 6. Simplified validation interface. Here, the red "Delete" button deletes the image, the green "Next" button proceeds to the next image, and the white "Undo" button returns the user back to the previous screen.

D. Image Pre-Processing
The images have been resized to 224×224 dimensions to represent the lower quality of the real-life image and to speed up model training. Each image is converted from RGB to grayscale to remove color information that is not a decisive factor in classifying a car brand. However, the pretrained model uses 3 color channels as input, so the gray channel is repeated 3 times. Finally, the data are normalized from 0 to 1. The final dense 4-D-shaped tensor is of the following shape [batch size, height, width, color channels]. Examples of pre-processed images are shown in Fig. 7. According to the study in [29], 200 or more images per class are preferred to achieve the best accuracy using the transfer learning strategy. Therefore, only vehicle brands with more than 400 images have been included. And excess data with more than 500 photos per class were not included to avoid difficulties with imbalanced classes. The class labels were one hot encoded. Finally, a total of 9451 images were divided by 80/20 ratio for training and testing, respectively. There were a total of 7560 train and 1891 test photos collected and 19 classes remained: Audi, BMW, Citroen, Ford, Honda, Hyundai, Kia, Lexus, Mazda, Mercedes-Benz, Mitsubishi, Nissan, Opel, Peugeot, Renault, Skoda, Subaru, Volkswagen, Volvo.

E. Data Augmentation
The data augmentation technique artificially creates modified images and is used to reduce overfitting [29]. The more data available, the better models can be developed [14]. Two different augmentation techniques were used to depict reality. The first flips the horizontal axis to make both sides of the car visible. The second method uses a random 30 % scaling to make the model more robust to small changes in object size. Figure 8 illustrates both augmentations. Data augmentation is used for the training dataset only, so the size of the test data remains the same (1891), but the training data increases from 7560 to 11560, about 53 % of the original data were augmented. As a result, 65.4 % of the training data are real, while 34.6 % are artificially augmented. Figure 9 shows the number of images by car brand.

F. Training Environment
The model is built using cloud-based Google Colab environment. The code is written in the Python programming language using the TensorFlow GPU library. The models are trained on Ubuntu 18.04.5 LTS using NVIDIA-SMI 495.44 with CUDA version 11.2. More information about the experimental setup can be found in Table I. Furthermore, the initially established hyperparameters used to regulate learning processes are presented in Table II. The maximum number of training iterations, called "epochs" is 200. Since too many or too few can lead to overfitting or underfitting, a monitoring system was introduced with an early stop feature such that if the validation loss did not improve in the last 15 epochs, the training would be terminated. The categorical cross-entropy loss function was used for model training as an objective function to be minimized. The test loss measures the distance between the predicted probability distribution of belonging to each class and the target values. To avoid memory overflow, the batch size is set to 16 to feed the model with only a fraction of the training data at a time.
Adam, an optimizer, with a learning rate of 0.0001 was selected.

G. Testing Classifiers Versions Using Pre-Trained Models
Performing classification using the pre-trained models MobileNetV2 and EfficientNet, eight classifiers were tested. All the classifier structures are defined in Section III. Model architectures trained with random weights were also evaluated. However, due to small training data, the model with random weights did not learn the influential features and achieved the highest accuracy of 8.83 % (see Fig. 10). This demonstrates the benefits of employing pre-trained weights, which have a 12.5 times greater accuracy on average than randomly generated weights. The accuracy of the test dataset, which estimates the number of times the highest probability prediction matches the ground truth labels, is shown in Fig. 10 and the model training duration is shown in Fig. 11.  The EfficientNet-B3 feature vector and the 7 th classifier had the best accuracy results of 81.39 %, although the 3 rd classifier was only 0.11 % less accurate. The 7 th version took 50 minutes to train, while the 3 rd took 44 minutes. Although both versions are usable, the 7 th has been selected. The summary of the model with the 7 th classifier is shown in Fig. 12.
The layer names, type, output shape, and number of weight parameters are all listed in the model summary (see Fig. 12). At the bottom of the table is the total number of trainable and non-trainable model parameters. The "None" output value refers to the batch size. The Keras layer in this case is an invoked object for loading a stored model EfficientNet-B3 with a trainable parameter equal to "True".

H. Training Performance of the Best Classifier
The training performance of the EfficientNet-B3 using the 7 th classifier version is shown in Fig. 13.
Here, the blue line indicates the training, and the orange indicates the validation curve. The accuracy plot shows that the model gains high accuracy in the early epochs because the pre-trained weights are reused. To avoid overfitting and generalization errors, the model was trained for less than 40 epochs until the loss of validation stopped improving. However, because there is a gap between the training and validation curves, it indicates a slight overfitting, so increasing the dataset may improve the results. Additionally, the shape of the training loss curve suggests that a good learning pace was chosen. At predicted class probability thresholds ranging from 0 to 1, the receiver operating characteristic curve (ROC) displays the true positive rate on the y-axis versus the false positive rate on the x-axis. The true positive rate indicates the probability that a defined car label will be predicted correctly, and the false positive rate shows how often the prediction of a defined car is incorrect. It is usually plotted for binary classifier; therefore, the one-vs-all approach is employed for multi-class analysis. The ROC curves and the area under the curve are given in Fig. 14. Looking at the ROC curves, the model is close to a perfect classifier, as the curve is above the diagonal line. A contrast between real and predicted values is shown in the confusion matrix (see Fig. 15). Using the confusion matrix, the most confused classes could be identified and their interrelationships explored. However, the model predictions are scattered, and the diagonal reveals the strongest link between the predictable and true classes.

I. Classifier Visual Explanations Using Grad-CAM
Deep learning models are known for their excellent accuracy, yet their complexity makes them hard to interpret. Gradient-weighted class activation mapping (Grad-CAM) is an approach to better understanding how the CNN model works. Grad-CAM descends to the final convolutional layer and uses gradient information to create a rough localization map that identifies key discriminating areas in the image for class prediction. No fully connected layers are used. The class activation mapping is then upscaled to the size of the input image so that a heat map could be presented [30].
The most important properties that determine the classifier decision are marked in red on the heat map (see Fig. 16). In the images, where the car is visible from the front, the model focuses on the brand symbol, the front grille, and a little bit on the headlights. Meanwhile, looking at the cars from the side, the model distinguishes the most discriminative regions in the middle and upper parts of the car (body, hood, and windows). To classify the car from the rear view, the model searches for the brand symbol around the license plate number. The angle of view is very important for good performance of the model. If the car is seen from above, the focus is on the body. And when the car is at a 45-degree angle, the discriminating region includes a side lamp or other car shapes.

J. Generalization Testing
It is critical to check how the model works with realworld data [31]. Even if the accuracy of a test dataset is high, this does not guarantee that the model will work well in every situation. There may be a generalization gap if the model receives previously unseen data, such as different backgrounds or different weather conditions. Three different scenarios are used to verify the reliability of the model.
To start, 168 photos from the test dataset are randomly selected so that the vehicles are only seen from the side when the car brand symbol is not visible. This results in 17 classes with a 76.19 % accuracy (see Table III). As a reminder, the accuracy of the test on all data is 81.39 %, which means that the side of the car is about 5 % less predictable.
The second scenario combines both training and testing sets from the Stanford cars dataset. Because the data contained brands that our model cannot recognize, it was filtered out. That left 10 classes, the majority of which were formed of Audi and BMW vehicles. A total of 4119 pictures were acquired and an accuracy of 66.86 % was obtained.
In the third scenario, 12 photos are taken from a video recorded in the evening with artificial lighting. Six vehicle Generalization testing reveals that additional datasets have a mean accuracy of 72.7 %, resulting in 8.7 % poorer estimation than the initial test dataset (81.39 %). In general, the model can be generalized, but the accuracy will not be as good as for the same training distribution. As a result, integrating diverse data distributions in the training phase can help the model adapt to a broader range of scenarios.

V. DISCUSSION
In this paper, the collection of country-based datasets and a strategy for vehicle localization and brand classification are proposed. The data represent the most frequent cars on the Lithuanian market and were balanced to avoid biases. The data preparation stage was automated using the YOLOv3 car detection model; however, the manual image validation stage could be eliminated by optimizing the model parameters and making it more conservative. Analyzing the classifiers, it was discovered that EfficientNetV2 is 7.14 % more accurate than MobileNetV2. The 7 th version of the classifier, consisting of three dense layers, batch normalization, dropout, and L1 and L2 regularization, improved the most in accuracy. Compared to the baseline classifier, the EfficientNetV2 7 th classifier improved its accuracy by 9 % (from 72.4 % to 81.4 %). This was accomplished by classifying 19 automobile manufacturer classes using a batch size of 16 and 38 epochs. Furthermore, the architecture created outperforms the maximum accuracy of the fully trained GoogLeNet architecture of 80 % [10], although the performance cannot be directly compared due to different datasets and specific configurations.
After generalization tests using the EfficientNetV2 7 th classifier, the average accuracy of three different datasets was 72.7 %. Compared to the accuracy of the original dataset testing sample, the accuracy dropped by 8.7 %. The reduction in accuracy is not significant, and some fluctuation is expected. As observed from the results of the Grad-CAM method, the model is trained to recognize the brand of the car by the location of the logo, and if it is not found, the car body, hood, or windows become an essential predictor. On the other hand, mixing various data into a training dataset can help strengthen model generalization abilities.
The following are suggestions for future improvements:  More innovative augmentation techniques might be proposed to increase generalization performance in diverse weather and lighting settings. For example, artificially manipulating photos to make them appear as though they were taken at night, with a darker background and light reflections on the car. In addition, it would be useful to add more data from different sources, representing different nature of images, diverse quality, viewpoints, and so on. As background clutter reduces model performance [12], removing it improves vehicle classification results [1]. By changing the eliminated background with various representations of road pictures, this technique might be utilized in augmentation.  As one study was able to classify the position of a car with 100 % accuracy [16], it could be re-used effectively to classify cars from different angles. The suggestion would be to first classify the visible position of the car and then, depending on the results, classify the features of the car using several trained models. As a result, the learnt parameters of the model would be unique for each given viewpoint, although it requires labeled data on the viewpoint of the vehicle.  The task could also be expanded to include other vehicle attributes, such as vehicle type, make, model, or color, so that the vehicle can be precisely identified. It would also be beneficial to test the model using video data.

VI. CONCLUSIONS
In this paper, a vehicle detection and classification system was suggested and implemented, which was found to be able to classify common local vehicle brands regardless of the viewable angle. This is especially relevant as smart city systems and the use of security and traffic monitoring cameras become more widespread. To represent the country's most popular car manufacturers, static images from the Lithuanian car sales website were used. However, this step of data collection has also become a barrier in the system, as it requires manual validation, which should be replaced by an automated and more efficient data labeling solution. Effective deep learning architectures have been investigated to make the model easier to use in real time and to incorporate new car classes as production grows. The proposed EfficientNetV2 architecture adjustment improves the performance of the original classifier by 9 % and achieves an acceptable classification score of 81.4 %. However, to adjust the model to real-world conditions, data from the environment in which it will be used should be incorporated into the training. The findings imply that the trained model can be used for urban vehicle monitoring, and various improvements are proposed for future research.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.