The primary goal of this project is to analyze network traffic and detect potential cyber threats by identifying malicious connections. Using Python and advanced machine learning models, we preprocess the data, perform feature selection, and train models to distinguish between malicious and non-malicious connections with high accuracy.
The dataset used is derived from the Snort Intrusion Detection Log provided by the National Security Agency (NSA). It contains 34 features, such as:
- Source/Destination IP Address
- Protocol
- TCP Length
- Flags
- Priority Levels (1 to 4)
Some of the features that did not provide useful information for solving the problem of classifying the data have been removed. Further, data preprocessing steps such as missing value imputation and data normalization have been performed to prepare the dataset for model training.
The first step is to convert the log file from '.txt' to '.csv' format, keeping only the features required for the analysis. The '.csv' file is then read and several preprocessing steps are applied, such as dropping irrelevant columns that contain NaN or null values and dropping rows where no flag is set.
As the dataset included 34 columns, I selected the most important features for the analysis, since such a high-dimensional dataset is difficult to analyze.
- I removed the individual flag columns (Flag 1, 2, U, A, P, R, S, F) and used a single "All Flags" column instead to capture the type of connection established.
- I also dropped the Protocol, ID, and IP Length columns, as they provided little information for distinguishing malicious from non-malicious connections.
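The column reduction described above can be sketched with pandas. This is a minimal illustration on a toy frame: the exact column labels (for example `"Flag U"` or `"TCP Length"`) and the sample values are assumptions, not the names or data in the actual CSV.

```python
import pandas as pd

# Toy frame standing in for the parsed log; labels and values are assumed.
df = pd.DataFrame({
    "Flag U": [0, 0],
    "Flag A": [1, 1],
    "Flag S": [0, 1],
    "All Flags": ["A", "AS"],
    "Protocol": ["TCP", "TCP"],
    "ID": [101, 102],
    "IP Length": [40, 52],
    "TCP Length": [20, 32],
})

# Individual flag columns are redundant once "All Flags" is kept.
flag_cols = [c for c in df.columns if c.startswith("Flag ")]

# Protocol, ID, and IP Length are dropped as uninformative for the labels.
redundant = flag_cols + ["Protocol", "ID", "IP Length"]
reduced = df.drop(columns=redundant)

print(list(reduced.columns))
```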
As some of the columns, such as Connection Classification, All Flags, NOP NOP TS, TCP Options, MSS, NOP WS, and SackOS TS, included NA values, I followed appropriate methods to handle them.
- I imputed the samples having NA in the Connection Classification column with `0`, as `0` simply means that there is no classification category information for these samples.
- Further, I used only those rows where the "All Flags" column had some value, as only those rows have an established connection. For the remaining dataset, I imputed the NA values in the NOP NOP TS, TCP Options, MSS, NOP WS, and SackOS TS columns with `0`.
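As a rough sketch of the NA handling described above, the three steps might look like this in pandas. The toy values are invented for illustration, only two of the option columns are shown, and the column labels are assumed to match the CSV headers.

```python
import pandas as pd

# Toy data; None marks a missing value. Values are illustrative only.
df = pd.DataFrame({
    # object dtype so that a numeric 0 can sit next to text labels
    "ConnectionClassification": pd.Series(
        ["Attempted Information Leak", None, None], dtype=object
    ),
    "All Flags": ["AP", None, "S"],
    "MSS": [None, 1460.0, 1460.0],
    "TCP Options": [3.0, None, None],
})

# 1. A missing classification means "no category recorded": fill with 0.
df["ConnectionClassification"] = df["ConnectionClassification"].fillna(0)

# 2. Keep only rows where "All Flags" is present (established connections).
clean = df.dropna(subset=["All Flags"]).copy()

# 3. Fill the remaining NA values in the TCP option columns with 0.
option_cols = ["MSS", "TCP Options"]
clean[option_cols] = clean[option_cols].fillna(0)

print(clean)
```

The row filter runs before the option columns are filled, so rows without an established connection are discarded rather than padded with zeros.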
Finally, I obtained a dataset with `0` NA values and applied a MinMax Scaler transformation to scale the features to the [0, 1] range. This transformation was applied to prevent features with larger ranges from biasing the classification models.
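Min-max scaling maps each column to [0, 1] via `(x - min) / (max - min)`. The sketch below computes this by hand with pandas to make the formula explicit. scikit-learn's `MinMaxScaler` with its default `feature_range` applies the same per-column transformation. The feature values are made up for illustration.

```python
import pandas as pd

# Toy numeric features with very different ranges (values are illustrative).
features = pd.DataFrame({
    "TCP Length": [20.0, 32.0, 1460.0],
    "DestinationPort": [80.0, 443.0, 60384.0],
})

col_min = features.min()
col_range = features.max() - col_min

# Per-column min-max scaling: the smallest value maps to 0, the largest to 1.
# Note: a constant column has range 0 and would produce NaN here.
scaled = (features - col_min) / col_range

print(scaled)
```

After scaling, both columns span the same interval, so neither dominates purely because of its units.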
Preprocessing was applied to the 'ConnectionClassification' column to turn the labels into a binary classification problem.
Initially, the column contains text naming the type of attack recorded, and is empty if the connection is not malicious. All text values are converted to 1 (Malicious) and empty values are replaced with 0 (Non-Malicious).
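The label conversion can be sketched as follows. The helper `to_binary` is a hypothetical name, and the sample classification strings are illustrative. It is also an assumption that the `0` placeholder from the earlier imputation step should be treated as "empty".

```python
import pandas as pd

def to_binary(value):
    """Return 1 for a non-empty classification text, otherwise 0."""
    if pd.isna(value):
        return 0
    text = str(value).strip()
    # Empty strings and the imputed 0 placeholder both mean "no attack type".
    return 0 if text in ("", "0") else 1

# Toy labels covering text, missing, empty, and imputed-0 cases.
labels = pd.Series(
    ["Attempted Information Leak", None, "", 0, "Misc activity"],
    dtype=object,
    name="ConnectionClassification",
)

binary = labels.map(to_binary)
print(binary.tolist())
```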
The following supervised learning models were employed:
Using feature importance analysis, the following features were identified as the most important for classifying between malicious and non-malicious connections:
- Priority 3 has the maximum feature importance because all data points with 'Priority' = 3 are classified as 0 (Non-Malicious), whereas data points with any other 'Priority' are classified as 1 (Malicious).
- Feature 'ActionPerformed' with value 'b' (http_inspect) "IIS UNICODE CODEPOINT ENCODING" is always classified as 0 (Non-Malicious). It can be considered a safe 'ActionPerformed'.
- Feature 'DestinationPort' with values (60384, 60024, 60061) is always classified as 0 (Non-Malicious). These could be considered safe 'DestinationPort' values, but since their importance is negligible, they should not be treated as 'safe' ports. The small number of data records could be one of the reasons for such low importance.
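Observations like the ones above, where a feature value always maps to one class, can be checked directly with a group-by on the label. The sketch below uses a small invented frame, not the project's data, and `Malicious` is an assumed name for the binary label column. Reporting the group size alongside the mean shows how many records support each pattern, which matters for the low-support ports noted above.

```python
import pandas as pd

# Toy data only; it mimics the reported pattern for illustration.
toy = pd.DataFrame({
    "Priority":  [1, 2, 3, 3, 2, 3],
    "Malicious": [1, 1, 0, 0, 1, 0],
})

# mean == 0.0 -> the value is always non-malicious
# mean == 1.0 -> the value is always malicious
# size        -> number of records behind that pattern
summary = toy.groupby("Priority")["Malicious"].agg(["mean", "size"])

print(summary)
```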