Intrusion Detection by Scikit Decision Tree

Using Scikit decision tree classification algorithm to detect network intrusions


I did this project as a part of the Data Mining course. It is an implementation in Python 2 to detect intrusions of the connections, which was completed by Zhaoqing Peng, Junyi Dai, Dastan, and Lingbin Ni. In this blog post I have described the pipeline to achieve this task. The complete code for the project is available here.

Overview of network intrusion detection


The purpose of network intrusion detection is to prevent computer networks from avoiding non -authorized users' infringement. Using machine learning methods, network invasion detectors can build prediction models through learning, which can distinguish harmful connections (invasion or attack) and normal connections.

Generally speaking, invasion is divided into four main types:

1) DOS (Denial-OF-Service): Request rejection, such as flood attacks.

2) R2L: Unauthorized access to remote machines, such as guessing passwords.

3) U2R: Unauthorized access to local super user privileges, such as overflow attacks of multiple buffers.

4) Probing: monitoring or other monitoring, such as port scanning.

Some characteristics can be used to distinguish ordinary connections and invasion, such as:

Same Host: Check the connection within two seconds of the previous connection as the current connection has the same destination host, and calculate and protocol -related behaviors, services, and so on.

Same Service: Just check that the connection within two seconds has the same service as the current connection.

Same Host and Same Service are also known as Time-based Traffic Features of the Connection Records.

However, some probing attacks use more than two seconds time interval scanning hosts or ports, such as once a minute. Therefore, the connection record is also sorted by the destination host, and the feature is constructed by the window of 100 connected to the same host instead of using a real -time window. This produces a data set called Host-based Traffic Features.

Different from the invasion of most DOS and Probing, there is no frequent sequence mode in R2L and U2R invasion. This is because DOS and Probing invasion contains many connections with the host in a short period of time. However, R2L and U2R invasion are embedded in data ports and packets, and they generally include only one connection.

KDD99 dataset


KDD CUP is a game of data mining analysis sponsored by SIGKDD. It is held once a year. KDD99 is a data set of invasion detection in the 1999 competition, but now the academic world has abandoned this data set. Using this datasets, we only first glimpse two classification algorithms.

The purpose of the task is to detect network connections to distinguish normal and non-normal connections. There are about four categories of non-normal connection: DOS, R2L, U2R, and Probing, and there are several sub categories in each category.

It is worth noting that the probability distribution of the data in the training set and the test set is different. The test set has 14 more attack subcategories than the training set. The official explanation is that this can simulate the real situation well, and the new attack category is generally a variation of the known attack category. Therefore, the algorithm model needs to be able to capture the features of each attack category well, and the dataset can be obtained on the official website.

Data Preprocessing


1. Original Data Preprocessing

The classification results of the original training set and test set are both specific classes of attacks, with a total of 24+14=38 types. Here, we do not consider such a detailed classification first, but only classify all connections into four attack categories. Therefore, based on the mapping relationship provided on the official website, we reduce the type attribute to 5 categories and replace it with the numbers 0, 1, 2, 3, and 4, representing the main categories and normal situations of the four attacks.

2. Data Storage

In order to facilitate subsequent operations such as reading and sorting, and reduce runtime, we read and write the file data in the data directory to the MongoDB database. During the reading and writing process, pay attention to converting some variables into ints and floats before storing them, and add an index to the type attribute of the training dataset.

Scikit Decision Tree Classification


1. Loading data

We extract all the training and testing sets from MongoDB. Since the decision tree can accelerate training with ordered inputs, we use the sort command to sort the type attributes in the training set. Then, all the inputs in the training and testing sets are stored in the dataset, and all the output targets are stored in the datatarget. To distinguish between the training and testing sets, we use T_ Len records the size of the training set.

2. Category variable recoding

For categorical variables, such as male and female, high and low, these string variables cannot be directly input into the algorithm model and need to be recoded as numbers 1,2,3,4 or binary bitmaps.

However, for scalars with ordered sizes such as 1,2,3,4, some algorithms understand that 2 is greater than 1, meaning that men are greater than women. In fact, men and women only represent men or women without any relationship between order and size. Therefore, binary encoding is generally used: men are represented by 01, and women are represented by 10. And this increases the number of variables, from one-dimensional variable sex to two-dimensional variables sex=male and sex=female. Therefore, if there are many types of categorical variables, the final encoded variable dimension will greatly increase, which is not conducive to calculation and is called a dimension disaster.

For the decision tree algorithm, it utilizes information purity for classification, so encoding as 1,2,3 will not have any impact. So we will recode the three category variables (protocol_type, service, flag) in the dataset into a set of numbers.

3. Feature selection

The following situations may occur between variables in a general dataset:

1) Variables with strong correlation and severe interdependence can lead to redundant variables

2) The data in variables is extremely sparse, with a large portion being either zero or meaningless

3) The variable dimension is too high, and after recoding, the variable dimension has reached hundreds

The above three situations are all very unfavorable for the training of the model, and the second situation can filter some sparse variables by setting a threshold; Scenario 1 and 3 can extract the main factors for dimensionality reduction through principal component analysis (PCA). In addition, there are also attribute subset selection methods: reducing the amount of data by deleting irrelevant or redundant attributes. There are three options: gradual forward selection, gradual backward selection, and decision tree induction.

After the recoding of the previous category variables, dataset variables have 42 dimensions in total. The training of high-dimensional datasets in the decision tree is prone to overfitting, so we need to filter variables, that is, feature selection. A good set of features can directly affect the accuracy of the algorithm. Scikit has provided a tree based attribute subset selection method, We directly input the data into the ExtraTreesClassifier for training here, then use SelectFromModel to extract the trained model, and finally use the transform method to filter out important features.

Due to the fact that variables in the test set also need to be filtered, but the filtering principle should be consistent with the training set, otherwise the input of the training set and the test set will be different. Therefore, we will store the index of this set of features in fea_index array.

4. Cross validation 1

Before training the model, we use 1/10 of the training set for cross validation and the remaining data for training. Therefore, we divide the training set into two parts:

1) training_set、training_target: used to train the model

2) test_set、test_ target: used for cross validation

5. Training decision tree model

Decision tree is a non parametric supervised learning algorithm, commonly used are ID3, C4.5 and CART. ID3 and C4.5 are based on information entropy, CART is based on gini impurity. Scikit provides an optimized version of CART algorithm, which is encapsulated in DecisionTree Classifier, and has some main parameters:

• criterion: The criterion for node splitting, gini or entropy

• max_depth: the maximum depth of the tree. If the value is too large, the algorithm will overfitting the training set, while if the value is too small, the algorithm will be hindered from learning the data. The initial value is recommended to be 5, and then slowly increase it.

• min_samples_split: The minimum number of samples that need to be split into child nodes. When the number of samples is less than this value, it is directly marked as a leaf node without continuing to generate child nodes. The larger the value, the fewer branches the tree has.

• class_weight: When the number of categories in the input sample set varies greatly, the final shape of the tree tends to be generated in the direction of big data samples, resulting in imbalanced trees. Therefore, the weights of each category can be set. Classes with less data have greater weights, while those with more data have smaller weights, which can effectively balance the tree.

Through repeated attempts, we found a good set of training parameters: using information entropy to split each node; min_ samples_ split=30; Ensure that all training samples are fully balanced, that is, the number of input tuples is equal.

6. Cross validation 2

We predict 1/10 of the training set (test_set), and compare the obtained predict value (trained_target) and actual value (test_target), and output confusion matrix and classification results statistics to verify the effectiveness of the trained model for data sets with the same probability distribution.

7. Testing

Finally, we test whether the trained model can predict test data well. We use T_len to retrieve the test_data_test and test_ data_target from the dataset, then use fea_index to extract sub attributes (features) from the test input, and finally compares the predicted value (test_ trained_ target) and actual value (test_ data_ target).

Repository


https://github.com/LingbinNi/Intrusion-Detection