Improving Intrusion Detection with Adaptive Support Vector Machines

This paper addresses intrusion detection in critical network infrastructures, where a misclassification of normal activity can be easily corrected but no intrusion should remain undetected. The intrusion detection system presented here is based on support vector machines that classify unknown data instances according to both the feature values and weight factors representing the importance of each feature to the classification. The major contribution of the proposed model is a significantly decreased false negative rate, even for the minor categories that have very few instances in the training set, which makes the model suitable for such environments. DOI: http://dx.doi.org/10.5755/j01.eee.20.7.8025


I. INTRODUCTION
A machine learning based intrusion detection system (IDS) learns to classify events using knowledge obtained from a training set. The training set for a network IDS is a set of network connection records formed from raw network data. Each record is described by a set of features and is labelled as a member of the appropriate class. After training, the system is able to classify previously unseen network traffic as normal or malicious. Previous research has shown that IDS systems based on some machine learning algorithms can be computationally expensive when trained with a set that has a large number of features. As a solution, many authors have proposed feature reduction methods that select the features important for classification and train the classifiers only with those features. Although such systems operate at much higher speed, they are a compromise between accuracy and speed and are not suitable for critical environments where no attack should pass undetected. Another drawback is that all the features in the remaining set are treated equally, as if they contributed the same amount of knowledge to the classification.
Support vector machines (SVM) solve the first problem. The model proposed in this paper solves the second one.
In this paper, we propose an SVM IDS model that classifies unknown data instances according to both the feature values and the contribution of each feature to the classification. Feature weights are calculated by scaling the accuracy change of a classifier from which one feature is removed. Compared to the unmodified SVM classifier and to classifiers trained with reduced feature sets, the proposed model significantly reduces the false negative rate and increases the detection rate for all attack categories, including the minor ones.
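The core idea of the model can be sketched as follows. This is a minimal illustration of the weighting step only; the record values and weights are made up, not taken from the paper's experiments:

```python
# Sketch: scale each feature value by a weight in [0, 1] that reflects
# the feature's contribution to classification, then train the SVM on
# the scaled records. All values below are illustrative only.

def scale_features(records, weights):
    """Multiply every feature value by its corresponding weight factor."""
    return [[x * w for x, w in zip(record, weights)] for record in records]

# Two toy connection records with three features each.
records = [[2.0, 4.0, 8.0], [1.0, 2.0, 3.0]]
weights = [1.0, 0.5, 0.0]   # the third feature contributes no knowledge

scaled = scale_features(records, weights)
# An unimportant feature is merely suppressed, rather than hard-cut
# as in feature reduction, so no information is discarded outright.
```
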

II. RELATED WORK
Although some authors have worked on a similar research problem, their efforts ended with an approximation via feature reduction. For example, Yao et al. presented a feature weight calculation method based on rough sets in [1] and approximated the modified kernel function by cutting off features with low weights. Although the test results of their classifier indicate high detection rates and low false negative rates, its ability to detect specific categories such as User to Root (U2R) and Remote to Local (R2L) remains unknown. This area is explored in the approximation presented in [2]; the authors report high detection rates (over 99 %) for all five categories of classifiers trained and tested with the smaller subsets. However, comparing results reported in the literature is sometimes impossible due to a lack of information in the papers and to methodological factors - for example, how the training and testing subsets are created. Although this does not affect the detection of normal traffic or probing attacks, it is a fundamental issue for the minor categories - U2R and R2L.

III. IDS EVALUATION
The Knowledge Discovery and Data Mining (KDD) Cup '99 IDS evaluation data set [3] is derived from data gathered at MIT Lincoln Laboratory under DARPA sponsorship for the purpose of evaluating IDS. The data is collected from a network that simulates a typical U.S. Air Force local area network. The KDD Cup '99 network traffic data is connection based. Each data record, described with 7 categorical and 34 numerical attributes, corresponds to a connection between two IP addresses. In addition, a label is provided, indicating whether the record is normal or belongs to one of the four attack categories [5]. Categorical features that have two possible values (e.g., logged-in or land) are represented by a binary entry with the value 0 or 1. During the preprocessing phase, categorical features with more than two possible values (e.g., protocol or service) are transformed into a set of binary features. For example, a feature with the possible values tcp, udp and icmp is mapped into three features: (0, 0, 1), (0, 1, 0) and (1, 0, 0).
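The categorical-to-binary transformation described above can be sketched like this (the helper name `one_hot` and the category ordering are ours, not from the paper):

```python
# Map a categorical value onto a set of binary features, one per possible
# category, as done in the preprocessing phase for attributes such as
# protocol or service.

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

protocols = ["tcp", "udp", "icmp"]
print(one_hot("udp", protocols))   # exactly one entry is set to 1
```
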

IV. SUPPORT VECTOR MACHINES
Support vector machines [6], [7] are supervised learning algorithms that learn very effectively from high dimensional data [8], which eliminates the need for feature reduction. The basic idea is to find a hyper-plane which perfectly separates the d-dimensional data into two classes. When the data is not linearly separable, SVMs introduce a kernel induced feature space which casts the data into a higher dimensional space where it is separable.
The data for a two class learning problem consists of n objects xi (i = 1, …, n), each labelled with one of two labels yi corresponding to the two classes: +1 (positive class) or -1 (negative class). Let x denote a vector with components xi, w the weight vector and b the bias, which translates the hyper-plane from the origin. A linear classifier is based on a linear discriminant function f(x) = ⟨w, x⟩ + b. The hyper-plane divides the space in two, and the sign of f(x) denotes the side of the hyper-plane an object falls on (as shown in Fig. 1).
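A minimal sketch of such a linear discriminant (the weight vector and bias below are toy values chosen only for illustration, not a trained model):

```python
# The side of the hyper-plane is given by the sign of f(x) = <w, x> + b.

def classify(w, b, x):
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1   # +1: positive class, -1: negative class

w, b = [1.0, -1.0], 0.0
print(classify(w, b, [2.0, 1.0]))  # f = 1.0  -> positive side of the plane
print(classify(w, b, [1.0, 3.0]))  # f = -2.0 -> negative side of the plane
```
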
Let k(x, x') denote the kernel function and Φ the mapping it applies to an object. In the feature space, the discriminant function takes the form f(x) = Σi αi yi k(xi, x) + b, where k(x, x') = ⟨Φ(x), Φ(x')⟩. As the feature space may be high dimensional, the kernel function must be computed efficiently. The maximum margin classifier is the discriminant function that maximizes the geometric margin 1/‖w‖, where ‖w‖ is the norm of the weight vector. This leads to the constrained optimization problem: minimize ½‖w‖², subject to yi(⟨w, xi⟩ + b) ≥ 1, where i = 1, …, n. The constraints in this formulation ensure that the maximum margin classifier classifies each example correctly, which is possible if the data is linearly separable. In practice, data is often not linearly separable; and even if it is, a greater margin can be achieved by allowing the classifier to misclassify some points. To allow misclassification, the constraints are modified with the slack variables ξi: yi(⟨w, xi⟩ + b) ≥ 1 - ξi, with ξi ≥ 0. Slack variables allow examples to be in the margin error (0 ≤ ξi ≤ 1) or to be misclassified (ξi > 1), and Σi ξi, where i = 1, …, n, bounds the number of misclassified examples. The constant C > 0 sets the relative importance of maximizing the margin and minimizing the amount of slack. This formulation is called the soft-margin SVM, and was introduced in [7]. The optimization problem for the soft-margin classifier becomes minimizing ½‖w‖² + C Σi ξi subject to the slack constraints above. Using Lagrange multipliers (the dual formulation), the optimization problem becomes maximizing Σi αi - ½ Σi,j αi αj yi yj k(xi, xj), subject to 0 ≤ αi ≤ C and Σi αi yi = 0.
This formulation leads to an expansion of the weight vector in terms of the examples: w = Σi αi yi Φ(xi).
The examples xi for which αi > 0 are points that are on or within the margin: these points are called support vectors.The expansion in terms of the support vectors is often sparse, and the level of sparsity (fraction of the data serving as support vectors) is an upper bound on the error rate of the classifier [9].
The dual formulation of the SVM optimization problem depends on the data only through dot products.The dot product can therefore be replaced with a non-linear kernel function, thereby performing large margin separation in the feature-space of the kernel.The SVM optimization problem was traditionally solved in the dual formulation, and only recently it was shown that the primal formulation can lead to efficient kernel-based learning [10].
Compared to the polynomial kernel, the Radial Basis Function (RBF, also mentioned as Gaussian in the literature) kernel has fewer numerical difficulties. One key point is that its kernel values lie between 0 and 1, in contrast to polynomial kernels, whose values may go to infinity or zero when the degree is large. The performance of intrusion detection systems that use support vector machines with different kernels is compared in [11]. Their experiment showed that an SVM using the RBF kernel gives the best performance for an SVM based IDS.
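The boundedness of the RBF kernel is easy to verify numerically; a short sketch with a hand-picked γ (the value 0.5 is arbitrary, chosen only for the demonstration):

```python
import math

# RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2). Because the exponent
# is always <= 0, the kernel value always lies in (0, 1], with
# k(x, x) = 1 for identical points.

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # distant points -> close to 0
```
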

V. CONSTRUCTING THE ADAPTIVE SVM MODEL
Systems based on feature weight calculation have so far been approximated by a simple feature reduction that cuts off features with low weights [1], [12]; the experimentally tested reduced classifiers provide high detection rates. The drawback of this approximation is the small set of features retained for the U2R and R2L categories, which contain the most dangerous attacks and have the fewest instances in the training set.
There are various methods to calculate weight factors. One method, based on rough set theory, is described in [1]. A method that obtains the weights directly from the support vector decision function is presented in [2]. Although obtaining this information is possible only if a trained L2-loss linear model is used [13], the authors do not provide sufficient information for further discussion.
The proposed algorithm for feature weight calculation is derived from the feature reduction algorithm presented in [2]. Feature weights are calculated according to the accuracy change of a classifier trained with a set from which one feature was removed. Let a denote the accuracy of the classifier trained with all features, and let ai denote the accuracy of a classifier trained with all features except feature i. The accuracy change for that classifier is Δai = a - ai.
The smallest and the largest accuracy changes, Δamin = min Δai and Δamax = max Δai, are taken over i = 1, …, 41. The weight wi of feature i is then wi = (Δai - Δamin) / (Δamax - Δamin), which scales it to the range [0, 1].
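Under the reading that the weights come from a min-max scaling of the accuracy changes (our assumption, since the original formula is not reproduced here), the calculation can be sketched as follows, with toy accuracy values:

```python
# Feature weights from classifier accuracy changes (assumed min-max form):
# delta_i = a - a_i for each removed feature i, then scale all deltas
# linearly into [0, 1].

def feature_weights(base_acc, acc_without_feature):
    deltas = [base_acc - a_i for a_i in acc_without_feature]
    d_min, d_max = min(deltas), max(deltas)
    return [(d - d_min) / (d_max - d_min) for d in deltas]

# Toy accuracies: removing the third feature hurts the classifier the
# most, so that feature receives the largest weight.
weights = feature_weights(0.75, [0.75, 0.5, 0.25])
print(weights)   # [0.0, 0.5, 1.0]
```
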

VI. EXPERIMENTS
Experiments were conducted with LibSVM 3.16 using the generic RBF kernel. After preprocessing the data set (linear scaling of numerical attributes and conversion of categorical attributes to binary) and converting it to the LibSVM format, the experiment was conducted as follows:

VII. COMPARISON TO FEATURE REDUCTION-BASED SVM
To demonstrate the benefits of the model presented in this paper, its performance is compared to that of classifiers trained with reduced feature sets. The reduced sets are generated with the empirical method presented in [2], the F-score ranking method presented in [14] and the rough set feature reduction algorithm presented in [1].

VIII. CONCLUSIONS
A new SVM based intrusion detection system that classifies unknown data instances according to both feature values and feature weights has been presented in this paper. The model's performance has been compared to the original, unmodified SVM classifiers and to classifiers based on training sets formed by different feature reduction methods. The system is capable of detecting even the minor attack categories with high accuracy, and the false negative rate is significantly decreased.
Although the gain in detection accuracy is not a major improvement, the significantly reduced false negative rate provides an IDS with high sensitivity, capable of detecting R2L and U2R attacks, which represent the most dangerous attacks in the training set.
The system is capable of determining the optimal hyper-parameters of the classifier on its own and operates at high speed (one dot product per classification).
In further research we will analyse multi-class SVMs [15] expanded with feature weights and form a system capable of readjusting the feature weights and the optimal model hyper-parameters according to changes in the environment where the system is deployed.
The simulated network is a typical U.S. Air Force Local Area Network (LAN) attacked with various types of intrusions. There are three partitions of the KDD Cup '99 data available: a full training set (4,898,431 instances), a 10 % version of the training set, and a test set (311,029 instances) which includes 17 new attacks (attacks that are not included in the training sets). All intrusions are grouped into four categories, according to the taxonomy of Kendall [4]:
 Probing - scanning a network of computers to gather information or find known vulnerabilities;
 Denial of Service - causing the unavailability of resources;
 User to Root (U2R) - exploiting vulnerabilities to gain root access to the system;
 Remote to Local (R2L) - obtaining access to a remote system without having a user account.
Proportions of attack instances in the KDD Cup '99 dataset are given in Table I.

Fig. 1. Maximum margin hyper-plane division of the feature space for a two-class problem.

Suppose the weight vector can be expressed as a linear combination of the training examples, i.e. w = Σi αi yi xi. This is known as the dual representation of the decision boundary. The discriminant function then takes the form f(x) = Σi αi yi ⟨xi, x⟩ + b.
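A minimal sketch of evaluating this dual-form discriminant (the support vectors, labels and α values below are made up for illustration, not obtained by training):

```python
# Dual-form discriminant: f(x) = sum_i alpha_i * y_i * <x_i, x> + b,
# evaluated over the support vectors only.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_decision(support_vecs, labels, alphas, b, x):
    return sum(a * y * dot(sv, x)
               for sv, y, a in zip(support_vecs, labels, alphas)) + b

svs    = [[1.0, 1.0], [2.0, 0.0]]   # toy support vectors
labels = [1, -1]
alphas = [0.5, 0.5]                 # illustrative multipliers

print(dual_decision(svs, labels, alphas, 0.0, [1.0, 2.0]))  # 0.5
```
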

TABLE I. PROPORTIONS OF INSTANCES IN KDD CUP '99 DATASET.

Category   Training set          10 % train set       Test set
Normal     972,780 (19.86 %)     97,278 (19.69 %)     60,593 (19.48 %)
Probing    41,102 (0.84 %)       4,107 (0.83 %)       4,166 (1.34 %)
DoS        3,883,370 (79.30 %)   391,458 (79.24 %)    229,853 (73.90 %)
U2R        52 (~0.00 %)          52 (0.01 %)          70 (0.02 %)
R2L        1,126 (0.02 %)        1,126 (0.23 %)       16,347 (5.26 %)

 Scale the training and test set with the feature weights;
 Find the optimal hyper-parameters of the new model;
 Train the SVM classifier with the scaled training set;
 Test the new model with three randomly generated test sets (50,000 instances each).
Feature weight calculation based on classifier accuracy change required 41 additional experiments in which the features were removed one at a time. Detection rates and false negative rates of the proposed model were measured and compared to the original model. The performance of the original model (SVM) is given in Table II, the performance of the proposed model (AC+SVM) is given in Table III, and the comparison of the classifiers is given in Table IV.
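The linear scaling of numerical attributes used in the preprocessing phase can be sketched as follows; scaling into [0, 1] is our assumption, as the target range is not stated here:

```python
# Linearly scale one numerical attribute column into [0, 1] using its
# observed minimum and maximum, as commonly done before SVM training.

def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant attribute: map to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

durations = [0.0, 5.0, 20.0]           # toy values of one attribute
print(min_max_scale(durations))        # [0.0, 0.25, 1.0]
```

Scaling prevents attributes with large numeric ranges from dominating the kernel distance computation.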

TABLE II. PERFORMANCE OF THE ORIGINAL SVM CLASSIFIER.

TABLE IV. IDS PERFORMANCE COMPARISON. VALUES IN THE TABLE ARE AVERAGES FOR ALL THREE TESTING SETS.

TABLE V. AVERAGE PERFORMANCE OF THE FEATURE REDUCTION BASED CLASSIFIERS.

TABLE VI. IDS PERFORMANCE COMPARISON. VALUES IN THE TABLE ARE AVERAGES FOR ALL THREE TESTING SETS.

As with the proposed model, the reduced set classifiers have been tested on three test sets of 50,000 randomly selected instances each. The average results of the feature reduction based classifiers are given in Table V. The comparison of the proposed model and the feature reduced classifiers is given in Table VI.