Threat Prediction

Real-Time Threat Prediction, Identification and Mitigation for Critical Infrastructure Protection Using Semantics, Event Processing and Sequential Analysis

Seamless and faultless operational conditions of multi-stakeholder Critical Infrastructures (CIs) are of high importance for today’s societies on a global scale. Because of their population impact, attacks against their interconnected components can create serious damage and performance degradation, which can eventually result in a societal crisis. It is therefore crucial to protect these high-performance, critical systems effectively and in a timely manner against any type of malicious cyber-physical intrusion. This can be realized by protecting CIs against threat consequences, by blocking threats at an early stage and preventing further escalation, or by predicting threat occurrences and being able to react rapidly by eliminating their roots. In this paper a novel architecture is proposed in which these three ways of confronting cyber-physical threats are combined using a novel semantics-based risk methodology that relies on real-time behavioral analysis. The final prototype provides the CI operator with a decision support tool (DST) that embodies the proposed approach and is capable of alerting on new, unknown threats, generating suggestions for the required counter-actions, and alerting on probable threat existence. The implemented architecture has been tested and validated in a proof-of-concept scenario of an airport CI with simulated monitoring data.

1.1 Crime is a significant threat to humankind. Many crimes happen at regular intervals of time, and crime is increasing and spreading at a fast and vast rate, from small villages and towns to big cities. Crimes are of different types: robbery, murder, rape, assault, battery, false imprisonment, kidnapping, and homicide. Since crime is increasing, there is a need to solve cases much faster. Crime activity has increased at a rapid rate, and it is the responsibility of the police department to control and reduce it. Crime prediction and criminal identification are major problems for the police department because of the tremendous amount of crime data that exists. There is a need for technology through which case solving can be made faster.

1.2 The above problem motivated research into how solving a crime case could be made easier. A survey of documentation and prior cases showed that machine learning and data science can make this work easier and faster.

1.3 The aim of this project is to predict crime using the features present in the dataset. The dataset is extracted from official sites. With the help of machine learning algorithms, using Python as the core language, we can predict the type of crime that will occur in a particular area.

1.4 The objective is to train a model for prediction. Training is done on the training dataset, and the model is validated on the test dataset. The model is built with the algorithm that yields the better accuracy. K-Nearest Neighbors (KNN) classification and other algorithms are used for crime prediction. The dataset is visualized to analyze the crimes that have occurred in the country. This work helps law enforcement agencies predict and detect crimes in Chicago with improved accuracy and thus reduce the crime rate.

2. CONCEPTS OF THE PROPOSED SYSTEM

2.1 Predictive Modeling

Predictive modeling is the process of building a model that is capable of making predictions. The process includes a machine learning algorithm that learns certain properties from a training dataset in order to make those predictions. Predictive modeling can be divided into two areas: regression and pattern classification. Regression models are based on the analysis of relationships between variables and trends in order to make predictions about continuous variables.

In contrast to regression models, the task of pattern classification is to assign discrete class labels to particular data values as the output of a prediction. An example of a classification task in weather forecasting is the prediction of a sunny, rainy, or snowy day. Pattern classification tasks can be divided into two categories: supervised and unsupervised learning. In supervised learning, the class labels in the dataset used to build the classification model are known: we know the output for each training example, so the model can be trained to make predictions for unseen data.
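The supervised classification idea above can be illustrated with a minimal sketch, using scikit-learn's KNN classifier on hypothetical toy weather data (the feature values and labels here are invented for illustration only):

```python
# Minimal sketch of supervised pattern classification on toy data.
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: [temperature degC, humidity %] -> weather label
X_train = [[30, 20], [28, 30], [10, 90], [12, 85], [-2, 70], [-5, 75]]
y_train = ["sunny", "sunny", "rainy", "rainy", "snowy", "snowy"]

# The class labels are known in advance, so this is supervised learning
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Predict the class label for an unseen observation
print(clf.predict([[11, 88]])[0])  # prints "rainy"
```

The classifier assigns one of the discrete labels (sunny, rainy, snowy) to the unseen input, in contrast to a regression model, which would output a continuous value.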

Types of Predictive Model Algorithms

Classification and Decision Trees – A decision tree is an algorithm that uses a tree-shaped graph or model of decisions, including chance event outcomes, costs, and utility. It is one way to display an algorithm.

Naive Bayes – In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with independence assumptions between the features. The technique constructs classifier models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.

Linear Regression – Regression analysis is a statistical process for estimating the relationships among variables. Linear regression is an approach for modelling the relationship between a scalar dependent variable Y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple linear regression; with more than one explanatory variable it is called multiple linear regression.
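A minimal sketch of simple linear regression with scikit-learn, on hypothetical data (one explanatory variable X and a continuous dependent variable y; the numbers are invented for illustration):

```python
# Simple linear regression: fit y = intercept + slope * x on toy data.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]    # one explanatory variable (simple linear regression)
y = [110, 118, 131, 139, 152]    # continuous dependent variable

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)  # estimated slope and intercept
print(reg.predict([[6]]))            # prediction for an unseen X value
```

With more than one column in X, the same `fit` call performs multiple linear regression.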

Logistic Regression – In statistics, logistic regression is a regression model in which the dependent variable is categorical or binary.

Data Preprocessing

This process includes methods to remove any null or infinite values that may affect the accuracy of the system. The main steps are formatting, cleaning, and sampling. Cleaning removes or fixes missing data, since some records may be incomplete. Sampling selects an appropriate subset of the data, which can reduce the running time of the algorithm. The preprocessing is done using Python.
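The cleaning and sampling steps can be sketched with pandas as follows (the column names and values are hypothetical stand-ins, not the project's actual dataset):

```python
# Preprocessing sketch: remove null/infinite values, then sample.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Block": ["001XX N STATE ST", "042XX W MADISON ST", None, "008XX S HALSTED ST"],
    "District": [1.0, 11.0, np.inf, 12.0],
})

# Cleaning: treat infinite values as missing, then drop incomplete rows
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Sampling: work on a random subset to cut the algorithm's running time
sample = df.sample(frac=0.5, random_state=0)
print(len(df), len(sample))
```

Dropping rows is the simplest cleaning strategy; imputing missing values is an alternative when discarding records would lose too much data.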

2.2 Functional Diagram of Proposed Work

The proposed work can be divided into four parts:

1. Descriptive analysis on the Data

2. Data treatment (Missing value and outlier fixing)

3. Data Modelling

4. Estimation of performance

Prepare Data

1. Prepare the data in the right format for analysis

2. Data cleaning

Analyze and Transform Variables

We may need to transform the variables using one of the following approaches:

1. Normalization or standardization

2. Missing value treatment

Random Sampling (Train and Test)

• Training Sample: The model is developed on this sample; 70% or 80% of the data goes here.

• Test Sample: Model performance is validated on this sample; 30% or 20% of the data goes here.
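The random 70/30 split described above can be sketched with scikit-learn's `train_test_split` (the feature rows and labels here are hypothetical placeholders):

```python
# Random sampling into training (70%) and test (30%) samples.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]  # hypothetical feature rows
y = [0, 1] * 5                # hypothetical class labels

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70% train, 30% test
print(len(x_train), len(x_test))  # prints: 7 3
```

Setting `test_size=0.2` instead gives the alternative 80/20 split; `random_state` fixes the shuffle so the split is reproducible.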

Model Selection

Based on the defined goal(s) (supervised or unsupervised), we have to select one modeling technique or a combination of techniques, such as:

• KNN Classification

• Logistic Regression

• Decision Trees

• Random Forest

• Support Vector Machine (SVM)

• Bayesian methods

Build/Develop/Train Models

 Validate the assumptions of the chosen algorithm

 Develop/train the model on the training sample, which represents the available data (population)

 Check model performance: error, accuracy

Validate/Test Model

 Score and predict using the test sample

 Check model performance: accuracy etc.
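The train-then-validate loop above can be sketched for two of the candidate techniques (KNN and logistic regression) on synthetic toy data; this is an illustrative sketch, not the project's actual experiment:

```python
# Train each candidate model on the training sample and validate on the test sample.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the available data (population)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for model in (KNeighborsClassifier(n_neighbors=5),
              LogisticRegression(max_iter=1000)):
    model.fit(x_train, y_train)                          # train on training sample
    acc = accuracy_score(y_test, model.predict(x_test))  # validate on test sample
    print(type(model).__name__, round(acc, 3))
```

Comparing held-out accuracy in this way is what drives the "better algorithm depending upon the accuracy" choice described in the objective.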

3. IMPLEMENTATION

The dataset used in this project is taken from Kaggle.com; it is maintained and updated by the Chicago Police Department. The implementation of this project is divided into the following steps:

3.1. Data Collection The crime dataset from Kaggle is used in CSV format.

3.2. Data Preprocessing The dataset contains 10k entries. The null values are removed using df = df.dropna(), where df is the data frame. The categorical attributes (Location, Block, Crime Type, Community Area) are converted to numeric form using LabelEncoder. The date attribute is split into new attributes such as month and hour, which can be used as features for the model.
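A sketch of this preprocessing step, using a tiny in-memory frame with hypothetical column names in place of the real Kaggle CSV:

```python
# Sketch of step 3.2: drop nulls, encode categoricals, split the date attribute.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Kaggle CSV (real data has many more columns/rows)
df = pd.DataFrame({
    "Date": ["01/15/2017 08:30:00 PM", "03/02/2017 11:00:00 AM", None],
    "Block": ["001XX N STATE ST", "042XX W MADISON ST", "008XX S HALSTED ST"],
    "Primary Type": ["THEFT", "BATTERY", "THEFT"],
})

df = df.dropna()  # remove null values

# Convert categorical attributes to numeric codes
for col in ("Block", "Primary Type"):
    df[col] = LabelEncoder().fit_transform(df[col])

# Split the date attribute into month and hour features
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")
df["Month"] = df["Date"].dt.month
df["Hour"] = df["Date"].dt.hour
print(df[["Block", "Primary Type", "Month", "Hour"]])
```

Note that `LabelEncoder` assigns arbitrary integer codes; for models sensitive to ordering, one-hot encoding is a common alternative.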

3.3. Feature Selection Feature selection chooses the attributes used to build the model. The attributes used are Block, Location, District, Community Area, X Coordinate, Y Coordinate, Latitude, Longitude, Hour, and Month.

3.4. Building and Training the Model After feature selection, the location and month attributes are used for training. The dataset is divided into the pairs xtrain, ytrain and xtest, ytest. The algorithm's model is imported from sklearn, and the model is built using model.fit(xtrain, ytrain).

3.5. Prediction After the model is built using the above process, prediction is done using model.predict(xtest). The accuracy is calculated using accuracy_score, imported from sklearn.metrics: metrics.accuracy_score(ytest, predicted).
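Steps 3.4 and 3.5 together look like the following sketch; a bundled toy dataset stands in for the crime features, since the real Kaggle CSV is not reproduced here:

```python
# Sketch of steps 3.4-3.5: split, fit a KNN classifier, predict, and score.
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the selected crime features and the crime-type labels
X, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(xtrain, ytrain)            # build the model on the training pair
predicted = model.predict(xtest)     # predict on the test features
print(metrics.accuracy_score(ytest, predicted))
```

Swapping `KNeighborsClassifier` for any other sklearn classifier from the model-selection list leaves the rest of the pipeline unchanged.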

3.6. Visualization Using the matplotlib library, analysis of the crime dataset is done by plotting various graphs.
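One such graph can be sketched as a bar chart of incident counts per crime type; the counts below are hypothetical, not taken from the dataset:

```python
# Sketch of step 3.6: bar chart of crime counts per type, saved to a file.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical per-type counts standing in for df["Primary Type"].value_counts()
crime_counts = {"THEFT": 2150, "BATTERY": 1830, "ASSAULT": 640, "ROBBERY": 410}

plt.figure(figsize=(6, 4))
plt.bar(crime_counts.keys(), crime_counts.values())
plt.xlabel("Crime type")
plt.ylabel("Number of incidents")
plt.title("Crimes by type (hypothetical data)")
plt.tight_layout()
plt.savefig("crimes_by_type.png")
```

On the real data frame, `df["Primary Type"].value_counts().plot(kind="bar")` produces the same kind of chart directly from pandas.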
