Campus Safety Data Analysis

In my graduate-level Data Preparation and Analysis course, we were given a semester-long project. We had to choose a topic and use data science to answer a question or solve a problem within it. I was part of a team of four that decided to focus on public safety on campus. Illinois Tech is not in the best area of Chicago, and students are sometimes robbed of their possessions. We wanted to analyze campus incident data to locate high-risk areas and to predict where the next crime is most likely to occur on campus.
Illinois Tech's public safety department does not offer a convenient data source, but it does have a blog where it posts the incidents and related information for each day. We scraped the blog to obtain the incident data. The blog's formatting was inconsistent, so it took a lot of work to clean the data before we could use it. We also looked at other data sources, such as the Chicago police portal, to capture crimes that were not reported in the blog or that occurred near campus. The last main dataset was weather data. It seems reasonable that the weather influences whether a crime is committed, so we obtained weather data from an API and merged it with the crime data based on the date and time of each crime.
Since we were trying to predict whether a crime would be committed, we had to classify the incidents as serious or non-serious based on their label (robbery, car theft, underage drinking, etc.). We quickly realized that we had many more serious crimes than non-serious ones, and this class imbalance can bias a model toward the majority class. We split the dataset and trained many different kinds of models to determine which performed best: logistic regression, decision trees, ridge regression, lasso regression, and random forests.
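The labeling and weather-merge steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code. The set of labels treated as serious, the field names (`label`, `time`, `temp_f`), the hourly weather lookup, and the sample records are all assumptions made up for the example.

```python
from datetime import datetime

# Assumption: which labels count as "serious" is hypothetical here;
# the write-up does not list the actual mapping.
SERIOUS_LABELS = {"robbery", "car theft"}


def is_serious(label: str) -> bool:
    """Return True when an incident label falls in the assumed serious set."""
    return label.strip().lower() in SERIOUS_LABELS


def hour_key(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour so it can be matched to hourly weather."""
    return ts.replace(minute=0, second=0, microsecond=0)


def merge_weather(incidents, weather_by_hour):
    """Attach a serious flag and the matching hourly temperature to each incident.

    Incidents with no weather record for their hour get temp_f = None.
    """
    merged = []
    for incident in incidents:
        row = dict(incident)
        row["serious"] = is_serious(incident["label"])
        weather = weather_by_hour.get(hour_key(incident["time"]))
        row["temp_f"] = weather["temp_f"] if weather else None
        merged.append(row)
    return merged


# Made-up records, for illustration only.
incidents = [
    {"label": "Robbery", "time": datetime(2018, 3, 2, 21, 40)},
    {"label": "Underage Drinking", "time": datetime(2018, 3, 3, 1, 15)},
]
weather_by_hour = {
    datetime(2018, 3, 2, 21): {"temp_f": 34.0},
}

rows = merge_weather(incidents, weather_by_hour)
```

Truncating to the hour is only one possible join rule. Matching each incident to the nearest weather observation would be a reasonable alternative if the weather API reports at irregular times.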
We also built additional models with an R package called ROSE, which generates synthetic samples to even out the number of serious and non-serious crimes in the training data. In the end, we found that the logistic regression model trained on the ROSE data was the best to use. Its accuracy was not as high as that of some of the other models, but its precision and recall were much better. The project code can be found on my GitHub page, and the final report can be found at the link below.
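To illustrate why accuracy alone can mislead on imbalanced classes, and what rebalancing does to the class counts, here is a small Python sketch. It uses plain random oversampling (duplicating minority rows), which is a simpler stand-in and not the synthetic sampling that ROSE performs. The function names and the 90/10 toy labels are assumptions invented for the example, not project results.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate rows of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, plus precision and recall for the chosen positive class.

    Precision and recall are reported as 0.0 when their denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up toy labels: 90 serious incidents, 10 non-serious ones.
y_true = ["serious"] * 90 + ["non-serious"] * 10
# A trivial model that always predicts the majority class.
y_pred = ["serious"] * 100

# High accuracy, but the minority class is never detected.
baseline = classification_metrics(y_true, y_pred, positive="non-serious")

# Rebalance the toy data so both classes have the same number of rows.
rows = [{"serious": label == "serious"} for label in y_true]
balanced = oversample_minority(rows, "serious")
counts = Counter(row["serious"] for row in balanced)
```

In this toy case the always-serious baseline scores 90% accuracy while its recall on non-serious incidents is zero, which is the kind of gap that precision and recall expose. Any rebalancing of this sort should be applied only to the training split, so that the test set keeps the real class proportions.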