Project 02: Car Crash Exploratory Data Analysis

This project uses exploratory data analysis to visualize underlying car crash data for Southeast Michigan. It uses Python, pandas, Matplotlib, and Seaborn to highlight trends that exist within the data.

import pandas as pd
# low_memory=False avoids the mixed-dtype warning pandas otherwise raises for columns 28, 48, and 49
df = pd.read_csv('combined_dataset.csv', low_memory=False)
#df = pd.read_csv('crash_funding_data.csv')
df.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... UNITS WEATHER WEEKDAY WORK_ZONE X XCORD Y YCORD YEAR YOUNG
0 0 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 2 5 2 NaN -83.771023 -83.77102 42.287878 42.28787 2017 0
1 1 0 0 0 0 NaN 0 Oakland NaN Brandon Twp ... 1 5 2 NaN -83.421944 -83.42194 42.810271 42.81026 2017 1
2 2 0 0 0 0 NaN 0 Oakland NaN Bloomfield Twp ... 1 5 2 NaN -83.234323 -83.23432 42.615461 42.61546 2017 1
3 3 0 0 0 0 NaN 0 Oakland NaN Orchard Lake Village ... 2 2 3 NaN -83.362812 -83.36281 42.578827 42.57882 2017 0
4 4 0 0 0 0 NaN 0 Livingston NaN Hartland Twp ... 2 2 5 NaN -83.753034 -83.75303 42.656270 42.65626 2017 0

5 rows × 61 columns

What does our combined dataset look like?

len(df)
709900

We have 709,900 crash cases in this dataset.

What about the distribution among years?

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
sns.set(color_codes=True)
x = df['YEAR']
yr_2014 = len(df.loc[df['YEAR'] == 2014])
yr_2015 = len(df.loc[df['YEAR'] == 2015])
yr_2016 = len(df.loc[df['YEAR'] == 2016])
yr_2017 = len(df.loc[df['YEAR'] == 2017])
yr_2018 = len(df.loc[df['YEAR'] == 2018])
yr_2017
145362
x_axis = ['2014','2015','2016','2017','2018']
y_axis = [yr_2014,yr_2015,yr_2016,yr_2017,yr_2018]
sns.barplot(x=x_axis, y=y_axis)

We see that each year has a very similar total number of car crashes.
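The five per-year `len()` filters above can be collapsed into a single `value_counts` call. A minimal sketch with a toy `YEAR` column standing in for the real data:

```python
import pandas as pd

# Toy YEAR column; value_counts collapses the five per-year len() filters
# above into one call, and sort_index puts the years in order.
years = pd.Series([2014, 2014, 2015, 2016, 2017, 2017, 2018])
year_counts = years.value_counts().sort_index()
print(year_counts.to_dict())
```

The resulting Series can be passed straight to `sns.barplot` without building the x and y lists by hand.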

When are these car crashes occurring?

Let’s look at the aggregate day counts.

dates_result = df.groupby('DATE_FULL').size().sort_values(ascending=True)
dates_result
DATE_FULL
12/25/2014     136
01/01/2015     155
01/03/2016     155
12/25/2015     156
03/23/2014     160
07/06/2014     162
03/30/2014     163
12/25/2016     169
04/13/2014     172
07/05/2015     174
04/27/2014     176
04/05/2015     181
04/20/2014     183
04/06/2014     185
01/12/2014     185
09/09/2018     186
03/04/2018     186
02/25/2018     188
03/11/2018     188
12/07/2014     188
03/16/2014     189
03/29/2015     190
12/28/2014     191
02/19/2017     191
12/27/2015     192
11/26/2015     192
12/25/2018     193
07/16/2017     193
12/14/2014     193
04/22/2018     194
              ...
10/31/2014     768
11/20/2018     778
12/24/2017     780
12/11/2016     794
01/10/2017     795
12/18/2015     796
02/09/2018     797
01/16/2014     797
01/31/2017     818
01/02/2014     832
02/09/2016     845
04/17/2018     848
12/13/2017     870
03/01/2016     872
02/05/2014     874
01/09/2015     879
01/03/2014     913
11/21/2015     917
03/13/2014     941
12/09/2017     965
01/08/2014     982
01/09/2018     992
01/29/2018     993
12/08/2016    1007
12/11/2017    1018
03/12/2014    1021
02/24/2016    1126
11/19/2014    1287
01/07/2014    1313
03/13/2017    1369
Length: 1826, dtype: int64

We see that the most car crashes in this dataset occurred on March 13, 2017. The top of this ascending list also makes sense: Christmas Day emerges as one of the days with the fewest car crashes. The days with the most crashes mainly fall in the winter months, and a little investigation shows that these days typically had wintry conditions, with snow, ice, or very low temperatures.
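Rather than scanning the sorted list for the extremes, the busiest and quietest days can be pulled out directly with `idxmax` and `idxmin`. A toy sketch (hypothetical date counts):

```python
import pandas as pd

# Toy DATE_FULL column; idxmax/idxmin pull the busiest and quietest days
# without scanning the whole sorted list of daily counts.
dates = pd.Series(["03/13/2017"] * 3 + ["01/07/2014"] * 2 + ["12/25/2014"])
daily = dates.value_counts()
print(daily.idxmax(), daily.max())  # busiest day and its count
print(daily.idxmin(), daily.min())  # quietest day and its count
```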

fig, ax = plt.subplots(figsize=(15,7))
result_dates = df.groupby('DATE_FULL').size().plot(ax=ax)

Let’s look at the hourly distribution of crashes

fig, ax = plt.subplots(figsize=(15,7))
df.groupby('HOUR').size().plot(ax=ax)

We can see in this figure that there is a bimodal distribution of crashes, with a spike during the traditional morning rush hour of 6-9 a.m. and a larger spike around 5-7 p.m. during the evening rush hour. Intuitively, this makes sense.

Which city has the most car crashes and the least?

df.groupby('COMMUNITY').size().sort_values(ascending =True)
COMMUNITY
Novi Twp                              1
Southfield Twp                        1
Fenton (Livingston)                   1
Estral Beach                          5
Grosse Pointe Shores (Macomb)        11
Lake Angelus                         14
Leonard                              17
Memphis (St. Clair)                  17
Barton Hills                         21
Petersburg                           22
Maybee                               29
Emmett                               39
Richmond (St. Clair)                 39
Memphis (Macomb)                     44
Armada                               61
Luna Pier                            76
Carleton                             89
Yale                                 98
Algonac                             124
Capac                               128
Freedom Twp                         140
Milan (Monroe)                      143
East China Twp                      153
Manchester                          154
Ortonville                          170
Grosse Pointe Shores (Wayne)        176
Pinckney                            199
South Rockwood                      199
River Rouge                         218
Grant Twp                           218
                                  ...  
Redford Twp                        6079
Dearborn Heights                   6163
Ypsilanti Twp                      6341
Pittsfield Twp                     6852
St. Clair Shores                   7470
Madison Heights                    7511
West Bloomfield Twp                7684
Bloomfield Twp                     7948
Westland                           8590
Waterford Twp                      9357
Shelby Twp                         9407
Macomb Twp                         9426
Roseville                          9626
Novi                              10128
Pontiac                           10160
Auburn Hills                      10435
Royal Oak                         10582
Canton Twp                        10609
Taylor                            11847
Clinton Twp                       11957
Rochester Hills                   11959
Farmington Hills                  15760
Troy                              17085
Dearborn                          17505
Southfield                        18002
Ann Arbor                         18250
Livonia                           18341
Warren                            23681
Sterling Heights                  23796
Detroit                          115539
Length: 237, dtype: int64

237 communities in the region appear in the crash data.

We see that Detroit emerges as the city with the most crashes. This is plausible because it is clearly the largest city in Southeast Michigan, with a population close to 675,000; it is also the economic center of the region and holds many of the area's jobs. Cities at the bottom of the list generally have smaller populations. For instance, Barton Hills, located just west of Ann Arbor, had 21 crashes and has a population of about 320. These results raise the question of how to normalize this data, whether by total population or by number of jobs in the region. We rely on the raw counts because we have no clear indication of which normalization is appropriate.
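To illustrate the normalization question, here is a sketch of a per-capita rate using the two cities mentioned above (crash counts from the list; the populations are approximate and used only for illustration):

```python
import pandas as pd

# Crash counts from the sorted list above; populations are approximate
# stand-ins used only to illustrate per-capita normalization.
crashes = pd.Series({"Detroit": 115539, "Barton Hills": 21})
population = pd.Series({"Detroit": 675000, "Barton Hills": 320})

rate_per_1k = crashes / population * 1000  # crashes per 1,000 residents
print(rate_per_1k.to_dict())
```

Even under this normalization the gap between the two cities narrows considerably, which is why the choice of denominator matters.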

County

df.groupby('CNTNAME').size().sort_values(ascending =False)
CNTNAME
Wayne         260891
Oakland       204977
Macomb        125741
Washtenaw      56091
Livingston     24111
Monroe         19052
St. Clair      19037
dtype: int64
pop = [1.75,1.25,.871,.368,.189,.149,.159]  # approximate county populations, in millions
county = ['Wayne','Oakland','Macomb','Washtenaw','Livingston','Monroe','St. Clair']
sns.barplot(x=pop, y=county)

Who is involved in these car crashes?

who_df = df[['PEDESTRIAN', 'BICYCLE', 'MOTORCYCLE',
       'TRAIN', 'DISTRACTED','ELDERLY','YOUNG']]
who_df.head()
PEDESTRIAN BICYCLE MOTORCYCLE TRAIN DISTRACTED ELDERLY YOUNG
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1
2 0 0 0 0 1 0 1
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
ped_lst = df.loc[who_df['PEDESTRIAN']==1]
ped = len(ped_lst)
ped / len(df)
0.008888575855754331
bike_lst = df.loc[who_df['BICYCLE']==1]
bike = len(bike_lst)
moto_lst = df.loc[df['MOTORCYCLE']==1]
moto = len(moto_lst)
train_lst = df.loc[df['TRAIN']==1]
train = len(train_lst)
distracted_lst = df.loc[df['DISTRACTED']==1]
distracted = len(distracted_lst)
young_lst = df.loc[df['YOUNG']==1]
young = len(young_lst)
young / len(df)
0.3326031835469784
old_lst = df.loc[df['ELDERLY']==1]
old = len(old_lst)
x_axis = ['ped','bike','moto','train']
y_axis = [ped, bike,moto,train]
sns.barplot(x=x_axis, y=y_axis)

We can see here that, among these four categories, pedestrians are involved in the most crash instances, while trains appear only rarely in the crash data.

x_axis = ['old','young','distracted']
y_axis = [old, young,distracted]
sns.barplot(x=x_axis, y=y_axis)

old + young + distracted
372701
709900 - 372701
337199

Using these demographic flags still leaves 337,199 crash records unaccounted for, and because a single crash can carry several flags at once, the true number of unflagged crashes is even higher. So while we know a little about who is involved in these crashes, we lack a full picture across the board.
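Because the ELDERLY, YOUNG, and DISTRACTED flags can overlap on a single crash, summing the three totals overcounts flagged crashes. `any(axis=1)` counts each flagged crash exactly once; a toy sketch:

```python
import pandas as pd

# Toy flag columns; row 0 carries two flags and row 1 carries two flags,
# so a naive sum of per-flag totals double-counts those crashes.
df = pd.DataFrame({
    "ELDERLY":    [1, 0, 0, 0],
    "YOUNG":      [1, 1, 0, 0],
    "DISTRACTED": [0, 1, 0, 0],
})

flagged = df[["ELDERLY", "YOUNG", "DISTRACTED"]].any(axis=1).sum()
naive = df[["ELDERLY", "YOUNG", "DISTRACTED"]].sum().sum()
print(flagged, naive)  # 2 distinct flagged crashes vs a naive total of 4
```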

Weather

How does weather impact this car crash data?

df.groupby('WEATHER').size().sort_values(ascending =False)
WEATHER
1     436696
2     142019
4      63503
5      47649
98     11278
3       3289
7       2354
8       2000
6       1019
9         42
10        27
0         24
dtype: int64

1 = Clear

2 = Cloudy

3 = Fog

4 = Rain

5 = Snow

6 = Severe crosswinds

7 = Sleet / hail

8 = Blowing snow

9 = Blowing sand, soil, or dirt

10 = Smoke

98 = Unknown

0 = Not entered

Y = [436696,142019,63503,47649,11278,3289,2354,2000,1019,42,27,24]
#X = [1,2,4,5,98,3,7,8,6,9,10,0]
X = ['CLEAR','CLOUDY','FOG','RAIN','SNOW','SEVERE CROSSWINDS','SLEET / HAIL','BLOWING SNOW','BLOWING SAND SOIL DIRT','SMOKE','UNKNOWN','NOT ENTERED']
sum(Y)
709900

Every case in the data has a weather code assigned to it.

sns.barplot(x=X, y=Y)
plt.xticks(rotation='vertical')

The vast majority of crashes occur under clear or cloudy circumstances.
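The hand-typed X and Y lists above can drift out of sync with the data; mapping the codes through one dictionary keeps labels and counts paired. A minimal sketch with a toy WEATHER column (the real column would come from `df`):

```python
import pandas as pd

# Mapping numeric WEATHER codes through a single dictionary keeps labels
# and counts paired, instead of maintaining two parallel hand-typed lists.
weather_labels = {1: "CLEAR", 2: "CLOUDY", 3: "FOG", 4: "RAIN", 5: "SNOW",
                  98: "UNKNOWN"}  # subset of the full code table above

codes = pd.Series([1, 1, 2, 5, 98])  # toy WEATHER column
counts = codes.map(weather_labels).value_counts()
print(counts.to_dict())
```

The labeled counts can then feed `sns.barplot` directly via `counts.index` and `counts.values`.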

Road Conditions

Weather doesn’t seem to be a major factor, but could road factors be impacting the data?

df.columns
Index(['Unnamed: 0', 'ACOUNT', 'ALCOHOL', 'BCOUNT', 'BICYCLE', 'CARTODB_ID',
       'CCOUNT', 'CNTNAME', 'COMMERCIAL', 'COMMUNITY', 'CRASHID', 'CRSHTYPEO',
       'DATE_FULL', 'DAY', 'DEER', 'DISTRACTED', 'DIS_CTRL_I', 'DRUG',
       'ELDERLY', 'EMERGENCY', 'FALINKID', 'FID', 'HIGH_SEVER', 'HITNRUN',
       'HOUR', 'HWY_CLSS_C', 'INTERROAD', 'INTR_INVL_', 'JURIS', 'KCOUNT',
       'LANEDEPART', 'LIGHTING', 'MAINROAD', 'MONTH', 'MOTORCYCLE', 'MP',
       'NFC', 'OBJECTID', 'OCCUPANTS', 'PEDESTRIAN', 'PR', 'PROPDAMG',
       'REDLIGHTRU', 'ROADCONDIT', 'ROADLANES', 'SCHOOLBUS', 'SEMMCD',
       'SPEEDLIMIT', 'SURFACE', 'TIME_FULL', 'TRAIN', 'UNITS', 'WEATHER',
       'WEEKDAY', 'WORK_ZONE', 'X', 'XCORD', 'Y', 'YCORD', 'YEAR', 'YOUNG'],
      dtype='object')
df.groupby('WORK_ZONE').size().sort_values(ascending =False)
WORK_ZONE
0.0    141201
1.0      2743
dtype: int64
2743 + 141201
143944

We can’t conclude anything about work zones because it appears that we don’t have work-zone data for many of the years in question.
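The gap between the 143,944 flagged records and the 709,900 total crashes can be quantified directly with `isna()`. A toy sketch of the check:

```python
import numpy as np
import pandas as pd

# WORK_ZONE is missing (NaN) for most records, which is why the 0/1 counts
# above cover only 143,944 of the 709,900 crashes. A toy column shows the check.
df = pd.DataFrame({"WORK_ZONE": [0.0, 1.0, np.nan, np.nan, 0.0]})
print(df["WORK_ZONE"].notna().sum(), "records have a work-zone flag;",
      df["WORK_ZONE"].isna().sum(), "do not")
```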

df.groupby('ROADCONDIT').size().sort_values(ascending =False)
ROADCONDIT
1     500506
2     114178
4      41835
3      34048
6       6193
98      4955
97      3390
8       2250
5       2185
7        254
10        47
0         38
9         21
dtype: int64

1 = Dry

2 = Wet

3 = Ice

4 = Snow

5 = Mud, dirt, or gravel

6 = Slush

7 = Debris

8 = Water (standing or moving)

9 = Sand

10 = Oily

97 = Other

98 = Unknown

0 = Null / not entered

Y = [500506,114178,41835,34048,6193,4955,3390,2250,2185,254,47,38,21]
X = ['dry','wet','ice','snow','Mud, Dirt, Gravel', 'Slush','debris','Water (Standing/Moving)','Sand','Oily','other','unknown','null / not entered']
sns.barplot(x=X, y=Y)
plt.xticks(rotation='vertical')

df.groupby('HWY_CLSS_C').size().sort_values(ascending =False)
HWY_CLSS_C
9    420484
3    129568
1    103871
2     41057
4      9031
5      4076
7      1079
0       734
dtype: int64
df.groupby('ROADLANES').size().sort_values(ascending =False)
ROADLANES
 2    252048
 5    134424
 3    132193
 4    125930
 1     24694
 6     23295
 7     12496
 8      4144
 9       632
 0        40
-2         2
-4         2
dtype: int64
df.groupby('SPEEDLIMIT').size().sort_values(ascending =False)
SPEEDLIMIT
45    160483
25    114022
35     95293
70     86459
40     82383
55     73438
50     49487
30     33187
0       8746
15      2520
60      1406
20       923
65       781
10       355
5        216
75       195
46         2
26         1
23         1
8          1
80         1
dtype: int64
df.groupby('SURFACE').size().sort_values(ascending =False)
SURFACE
Asphalt     91780
Concrete    39784
            10760
Gravel       1609
Brick          11
dtype: int64
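The unlabeled SURFACE category above (10,760 records) looks like blank or whitespace-only text rather than a real surface type; converting blanks to NaN makes the missingness explicit. A sketch on a toy column:

```python
import numpy as np
import pandas as pd

# The unlabeled SURFACE category above appears to be blank text; replacing
# empty or whitespace-only strings with NaN makes the gap explicit.
surface = pd.Series(["Asphalt", "Concrete", "  ", "", "Gravel"])
cleaned = surface.replace(r"^\s*$", np.nan, regex=True)
print(cleaned.isna().sum(), "blank records treated as missing")
```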

Road Funding

We are interested in merging road funding into this data and conducting some analysis on it.

df_funding = pd.read_csv('community.csv')
df_funding.head()
County Community Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6
0 Livingston Brighton $465,346 29.44 $15,807 NaN NaN
1 Livingston Fowlerville $207,751 13.84 $15,011 NaN NaN
2 Livingston Howell $576,420 36.64 $15,732 NaN NaN
3 Livingston Pinckney $150,719 11.37 $13,256 NaN NaN
4 Macomb Armada $106,664 7.21 $14,794 NaN NaN
df_funding = df_funding.rename(columns={'Community': 'COMMUNITY'})
df_funding.loc[df_funding['COMMUNITY']=='Brandon Twp']
County COMMUNITY Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6

Not all the cities from our original crash data are represented in our funding data.

len(df_funding)
120
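A left merge with `indicator=True` makes it easy to audit exactly which communities fail to match the funding table. A sketch with toy frames (the funding values are copied from the tables above):

```python
import pandas as pd

# Toy stand-ins for the crash and funding frames; the funding values are
# copied from the tables shown above.
df = pd.DataFrame({"COMMUNITY": ["Ann Arbor", "Brandon Twp", "Detroit"]})
df_funding = pd.DataFrame({"COMMUNITY": ["Ann Arbor", "Detroit"],
                           "Average Per Mile": ["$25,398", "$21,086"]})

# indicator=True adds a _merge column marking rows with no funding match.
merged = pd.merge(df, df_funding, how="left", on="COMMUNITY", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "COMMUNITY"].tolist()
print(unmatched)
```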
result = pd.merge(df,df_funding,how='left', on='COMMUNITY')
result.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... Y YCORD YEAR YOUNG County Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6
0 0 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 42.287878 42.28787 2017 0 Washtenaw $7,535,530 296.70 $25,398 NaN NaN
1 1 0 0 0 0 NaN 0 Oakland NaN Brandon Twp ... 42.810271 42.81026 2017 1 NaN NaN NaN NaN NaN NaN
2 2 0 0 0 0 NaN 0 Oakland NaN Bloomfield Twp ... 42.615461 42.61546 2017 1 NaN NaN NaN NaN NaN NaN
3 3 0 0 0 0 NaN 0 Oakland NaN Orchard Lake Village ... 42.578827 42.57882 2017 0 NaN NaN NaN NaN NaN NaN
4 4 0 0 0 0 NaN 0 Livingston NaN Hartland Twp ... 42.656270 42.65626 2017 0 NaN NaN NaN NaN NaN NaN

5 rows × 67 columns

len(result)
709900
result.groupby('Average Per Mile').size().sort_values(ascending =False)
Average Per Mile
$21,086    115539
$21,519     23796
$20,750     23681
$17,127     18341
$25,398     18250
$20,736     18002
$24,417     17505
$16,196     17085
$17,714     15760
$17,795     11959
$20,205     11847
$18,210     10582
$19,761     10435
$18,627     10160
$19,849     10128
$22,508      9626
$22,851      8590
$18,385      7511
$18,710      7470
$18,235      6163
$14,372      5958
$20,253      5766
$16,871      4704
$18,874      4606
$20,474      4391
$32,256      4312
$20,791      4300
$14,715      3923
$16,188      3745
$19,519      3343
            ...  
$17,430       530
$15,920       529
$14,503       494
$14,983       474
$17,953       463
$18,172       459
$13,952       401
$17,990       359
$12,525       343
$16,313       298
$13,131       290
$13,965       266
$11,944       224
$15,011       219
$16,727       218
$13,256       199
$11,354       199
$13,864       170
$10,463       154
$13,948       128
$13,452       124
$12,711        98
$14,614        89
$9,113         76
$14,794        61
$7,884         39
$10,603        29
$11,498        22
$10,925        17
$8,188          5
Length: 103, dtype: int64
fig, ax = plt.subplots(figsize=(15,7))
result.groupby('Average Per Mile').size().plot(ax=ax)

We see here both that we don’t have funding information for all of the cities in our crash data and that the most crashes occur in municipalities that fund their roads at an average of $21,086 per mile. The real issue is that the values are so specific that each one effectively identifies a single city, so we need to bin them.

result['Currency'] = result['Average Per Mile'].replace(r'[\$,]', '', regex=True).astype(float)
result.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... YCORD YEAR YOUNG County Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6 Currency
0 0 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 42.28787 2017 0 Washtenaw $7,535,530 296.70 $25,398 NaN NaN 25398.0
1 1 0 0 0 0 NaN 0 Oakland NaN Brandon Twp ... 42.81026 2017 1 NaN NaN NaN NaN NaN NaN NaN
2 2 0 0 0 0 NaN 0 Oakland NaN Bloomfield Twp ... 42.61546 2017 1 NaN NaN NaN NaN NaN NaN NaN
3 3 0 0 0 0 NaN 0 Oakland NaN Orchard Lake Village ... 42.57882 2017 0 NaN NaN NaN NaN NaN NaN NaN
4 4 0 0 0 0 NaN 0 Livingston NaN Hartland Twp ... 42.65626 2017 0 NaN NaN NaN NaN NaN NaN NaN

5 rows × 68 columns

result['bin'] = pd.cut(result['Currency'], [0, 5000, 10000,15000,20000,25000,30000])
result.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... YEAR YOUNG County Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6 Currency bin
0 0 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 2017 0 Washtenaw $7,535,530 296.70 $25,398 NaN NaN 25398.0 (25000.0, 30000.0]
1 1 0 0 0 0 NaN 0 Oakland NaN Brandon Twp ... 2017 1 NaN NaN NaN NaN NaN NaN NaN NaN
2 2 0 0 0 0 NaN 0 Oakland NaN Bloomfield Twp ... 2017 1 NaN NaN NaN NaN NaN NaN NaN NaN
3 3 0 0 0 0 NaN 0 Oakland NaN Orchard Lake Village ... 2017 0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 0 0 0 0 NaN 0 Livingston NaN Hartland Twp ... 2017 0 NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 69 columns

result.groupby('bin').size().sort_values(ascending =False)
bin
(20000, 25000]    244713
(15000, 20000]    194008
(10000, 15000]     18789
(25000, 30000]     18250
(5000, 10000]       2130
(0, 5000]              0
dtype: int64
fig, ax = plt.subplots(figsize=(15,7))
result.groupby('bin').size().plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x1199966a0>
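The raw crash count per funding bin is confounded by how many communities fall into each bin; dividing by the number of distinct communities gives a per-community rate instead. A toy sketch with hypothetical communities A, B, and C:

```python
import pandas as pd

# Toy merged frame: one row per crash, with a hypothetical community and bin.
result = pd.DataFrame({
    "COMMUNITY": ["A", "A", "A", "A", "B", "C", "C", "C"],
    "bin":       ["(20000, 25000]"] * 5 + ["(15000, 20000]"] * 3,
})

# Crashes per bin divided by distinct communities per bin gives a rate that
# is not inflated simply because a bin contains more cities.
crashes_per_bin = result.groupby("bin").size()
communities_per_bin = result.groupby("bin")["COMMUNITY"].nunique()
print((crashes_per_bin / communities_per_bin).to_dict())
```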

Ann Arbor

Let’s look at a specific city in our data and see if the overarching trends are represented.

df_aa = df.loc[df['COMMUNITY']=='Ann Arbor']
df_aa.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... UNITS WEATHER WEEKDAY WORK_ZONE X XCORD Y YCORD YEAR YOUNG
0 0 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 2 5 2 NaN -83.771023 -83.77102 42.287878 42.28787 2017 0
21 21 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 2 1 4 NaN -83.707336 -83.70734 42.302537 42.30253 2017 0
52 52 0 0 1 0 NaN 1 Washtenaw NaN Ann Arbor ... 2 1 3 NaN -83.749675 -83.74967 42.252243 42.25224 2017 0
121 121 0 0 1 0 NaN 0 Washtenaw NaN Ann Arbor ... 2 1 6 NaN -83.739154 -83.73915 42.240915 42.24091 2017 1
131 131 0 0 0 0 NaN 0 Washtenaw NaN Ann Arbor ... 2 1 1 NaN -83.735003 -83.73500 42.285408 42.28540 2017 0

5 rows × 61 columns

len(df_aa)
18250
709900 / 237
2995.3586497890295
2995.3586497890295 / 709900
0.0042194092827004225
.42 * 237
99.53999999999999

If all were equal, each city would see about 2,995 crashes within its distinct city limits, but as mentioned above, all is not equal.

237 * 2995
709815

We can see that 18,250 car crashes occurred within Ann Arbor city limits over the course of the five years in question.

18250 / 709900
0.025707846175517678

Each municipality would account for about 0.4 percent of the crashes if all were equal, but Ann Arbor accounts for over 2.5 percent. This matches where Ann Arbor fell on our crash-total list.

round(709900 * 0.025707846175517678)
18250
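The share calculation above generalizes to every community at once with `value_counts(normalize=True)`. A toy sketch:

```python
import pandas as pd

# Toy COMMUNITY column; value_counts(normalize=True) returns each
# community's share of all crashes in one call.
df = pd.DataFrame({"COMMUNITY": ["Detroit"] * 5 + ["Ann Arbor"] * 3 + ["Barton Hills"] * 2})
shares = df["COMMUNITY"].value_counts(normalize=True)
print(shares.to_dict())
```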
aa_dates_result = df_aa.groupby('DATE_FULL').size().sort_values(ascending=False)
aa_dates_result
DATE_FULL
11/19/2014    48
01/01/2014    44
12/11/2017    43
03/07/2018    42
12/18/2014    39
01/27/2014    37
01/12/2018    37
03/13/2017    37
11/21/2015    36
04/17/2018    34
01/05/2015    33
02/24/2016    33
01/09/2014    32
01/28/2014    30
02/05/2014    29
02/09/2018    29
03/12/2014    28
01/29/2018    28
11/02/2017    28
01/09/2018    27
03/01/2016    26
01/16/2014    26
01/23/2014    26
11/21/2017    26
11/04/2016    25
01/29/2017    25
01/09/2015    25
11/18/2015    25
01/22/2014    25
10/31/2014    24
              ..
08/02/2015     2
07/26/2015     2
07/15/2018     2
07/13/2014     2
07/09/2016     2
07/04/2018     2
04/12/2014     2
07/03/2016     2
07/03/2015     2
02/17/2018     2
12/25/2016     1
04/07/2014     1
02/16/2015     1
03/16/2014     1
03/04/2018     1
01/02/2015     1
12/31/2016     1
08/28/2016     1
08/17/2014     1
11/27/2014     1
06/04/2017     1
11/26/2015     1
11/23/2017     1
11/22/2018     1
06/10/2018     1
06/21/2015     1
12/31/2017     1
07/04/2015     1
07/08/2018     1
05/24/2015     1
Length: 1823, dtype: int64

The day with the most crashes in Ann Arbor places third overall on the crash list for the entire region.

fig, ax = plt.subplots(figsize=(15,7))
aa_result_dates = df_aa.groupby('DATE_FULL').size().plot(ax=ax)

This figure follows the trends that appeared in the overall dates figure: we see spikes in car crashes on select days throughout the winter months.

fig, ax = plt.subplots(figsize=(15,7))
df_aa.groupby('HOUR').size().plot(ax=ax)

The hourly distribution of crashes within the city follows the rush hour trend we identified in the total data.

Ultimately, this subsample of the data seems to be representative of the full dataset.

High severity crashes

df.columns
Index(['Unnamed: 0', 'ACOUNT', 'ALCOHOL', 'BCOUNT', 'BICYCLE', 'CARTODB_ID',
       'CCOUNT', 'CNTNAME', 'COMMERCIAL', 'COMMUNITY', 'CRASHID', 'CRSHTYPEO',
       'DATE_FULL', 'DAY', 'DEER', 'DISTRACTED', 'DIS_CTRL_I', 'DRUG',
       'ELDERLY', 'EMERGENCY', 'FALINKID', 'FID', 'HIGH_SEVER', 'HITNRUN',
       'HOUR', 'HWY_CLSS_C', 'INTERROAD', 'INTR_INVL_', 'JURIS', 'KCOUNT',
       'LANEDEPART', 'LIGHTING', 'MAINROAD', 'MONTH', 'MOTORCYCLE', 'MP',
       'NFC', 'OBJECTID', 'OCCUPANTS', 'PEDESTRIAN', 'PR', 'PROPDAMG',
       'REDLIGHTRU', 'ROADCONDIT', 'ROADLANES', 'SCHOOLBUS', 'SEMMCD',
       'SPEEDLIMIT', 'SURFACE', 'TIME_FULL', 'TRAIN', 'UNITS', 'WEATHER',
       'WEEKDAY', 'WORK_ZONE', 'X', 'XCORD', 'Y', 'YCORD', 'YEAR', 'YOUNG'],
      dtype='object')
sev = df.loc[df['HIGH_SEVER']==1]
len(sev)
1782

Only 1,782 of the 709,900 crashes, roughly a quarter of one percent, are flagged as high severity.

Detroit vs Livonia

We conduct a brief evaluation of how well trends generalize across groups of cities.

det = result.loc[result['COMMUNITY']=='Detroit']
liv = result.loc[result['COMMUNITY']=='Livonia']
det.head()
Unnamed: 0 ACOUNT ALCOHOL BCOUNT BICYCLE CARTODB_ID CCOUNT CNTNAME COMMERCIAL COMMUNITY ... YEAR YOUNG County Annual Amount Road Miles Average Per Mile Unnamed: 5 Unnamed: 6 Currency bin
16 16 0 0 0 0 NaN 0 Wayne NaN Detroit ... 2017 0 Wayne $54,202,186 2,570.50 $21,086 NaN NaN 21086.0 (20000, 25000]
40 40 0 0 0 0 NaN 0 Wayne NaN Detroit ... 2017 1 Wayne $54,202,186 2,570.50 $21,086 NaN NaN 21086.0 (20000, 25000]
41 41 0 0 0 0 NaN 0 Wayne NaN Detroit ... 2017 0 Wayne $54,202,186 2,570.50 $21,086 NaN NaN 21086.0 (20000, 25000]
43 43 0 0 0 0 NaN 0 Wayne NaN Detroit ... 2017 0 Wayne $54,202,186 2,570.50 $21,086 NaN NaN 21086.0 (20000, 25000]
60 60 0 0 0 0 NaN 2 Wayne NaN Detroit ... 2017 0 Wayne $54,202,186 2,570.50 $21,086 NaN NaN 21086.0 (20000, 25000]

5 rows × 69 columns

len(det)
115539
len(liv)
18341
det.columns
Index(['Unnamed: 0', 'ACOUNT', 'ALCOHOL', 'BCOUNT', 'BICYCLE', 'CARTODB_ID',
       'CCOUNT', 'CNTNAME', 'COMMERCIAL', 'COMMUNITY', 'CRASHID', 'CRSHTYPEO',
       'DATE_FULL', 'DAY', 'DEER', 'DISTRACTED', 'DIS_CTRL_I', 'DRUG',
       'ELDERLY', 'EMERGENCY', 'FALINKID', 'FID', 'HIGH_SEVER', 'HITNRUN',
       'HOUR', 'HWY_CLSS_C', 'INTERROAD', 'INTR_INVL_', 'JURIS', 'KCOUNT',
       'LANEDEPART', 'LIGHTING', 'MAINROAD', 'MONTH', 'MOTORCYCLE', 'MP',
       'NFC', 'OBJECTID', 'OCCUPANTS', 'PEDESTRIAN', 'PR', 'PROPDAMG',
       'REDLIGHTRU', 'ROADCONDIT', 'ROADLANES', 'SCHOOLBUS', 'SEMMCD',
       'SPEEDLIMIT', 'SURFACE', 'TIME_FULL', 'TRAIN', 'UNITS', 'WEATHER',
       'WEEKDAY', 'WORK_ZONE', 'X', 'XCORD', 'Y', 'YCORD', 'YEAR', 'YOUNG',
       'County', 'Annual Amount', 'Road Miles', 'Average Per Mile',
       'Unnamed: 5', 'Unnamed: 6', 'Currency', 'bin'],
      dtype='object')

Let’s compare and contrast the rates at which key statistics occur.

det_bike = det.loc[det['BICYCLE']==1]
len(det_bike)
848
848 / 115539
0.007339513064852561
liv_bike = liv.loc[liv['BICYCLE']==1]
len(liv_bike)
117
117 / 18341
0.006379150537048144
det_bus = det.loc[det['SCHOOLBUS']==1]
len(det_bus)
320
320 / 115539
0.0027696275716424757
liv_bus = liv.loc[liv['SCHOOLBUS']==1]
len(liv_bus)
79
79 / 18341
0.004307289678861567
det_moto = det.loc[det['MOTORCYCLE']==1]
len(det_moto)
1068
1068 / 115539
0.009243632020356763
liv_moto = liv.loc[liv['MOTORCYCLE']==1]
len(liv_moto)
126
126 / 18341
0.006869854424513385
det_old = det.loc[det['ELDERLY']==1]
len(det_old)
13205
13205 / 115539
0.11429041276105904
liv_old = liv.loc[liv['ELDERLY']==1]
len(liv_old)
3671
3671 / 18341
0.20015266343165586
det_young = det.loc[det['YOUNG']==1]
len(det_young)
29981
29981 / 115539
0.25948813820441585
liv_young = liv.loc[liv['YOUNG']==1]
len(liv_young)
6301
6301 / 18341
0.34354724387983204
det_alc = det.loc[det['ALCOHOL']==1]
len(det_alc)
2323
2323 / 115539
0.020105765152892096
liv_alc = liv.loc[liv['ALCOHOL']==1]
len(liv_alc)
470
470 / 18341
0.025625647456518182
det_drug = det.loc[det['DRUG']==1]
len(det_drug)
626
626 / 115539
0.005418083937025593
liv_drug = liv.loc[liv['DRUG']==1]
len(liv_drug)
149
149 / 18341
0.008123875470257893
det_sev = det.loc[det['CRSHTYPEO']==1]
len(det_sev)
16170
16170 / 115539
0.13995274322955886
liv_sev = liv.loc[liv['CRSHTYPEO']==1]
len(liv_sev)
1782
1782 / 18341
0.09715936971811788
det_dis = det.loc[det['DISTRACTED']==1]
len(det_dis)
3678
3678 / 115539
0.031833406901565706
liv_dis = liv.loc[liv['DISTRACTED']==1]
len(liv_dis)
577
577 / 18341
0.03145957145193828
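The repeated filter-count-divide pattern above can be factored into a small helper that computes flag rates for any city slice. A sketch with toy subsets standing in for the Detroit and Livonia frames:

```python
import pandas as pd

def flag_rates(frame, flags):
    """Share of crashes in `frame` carrying each 0/1 flag column."""
    return {flag: frame[flag].mean() for flag in flags}

# Toy city slices standing in for the Detroit and Livonia subsets above.
det = pd.DataFrame({"BICYCLE": [1, 0, 0, 0], "ALCOHOL": [0, 0, 1, 1]})
liv = pd.DataFrame({"BICYCLE": [0, 0], "ALCOHOL": [1, 0]})

print(flag_rates(det, ["BICYCLE", "ALCOHOL"]))
print(flag_rates(liv, ["BICYCLE", "ALCOHOL"]))
```

Because the mean of a 0/1 column is exactly the proportion of flagged rows, one call replaces each loc-filter, len, and division triple.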
det['bin']
16        (20000, 25000]
40        (20000, 25000]
41        (20000, 25000]
43        (20000, 25000]
60        (20000, 25000]
68        (20000, 25000]
76        (20000, 25000]
94        (20000, 25000]
108       (20000, 25000]
109       (20000, 25000]
128       (20000, 25000]
156       (20000, 25000]
175       (20000, 25000]
179       (20000, 25000]
180       (20000, 25000]
182       (20000, 25000]
190       (20000, 25000]
192       (20000, 25000]
198       (20000, 25000]
209       (20000, 25000]
214       (20000, 25000]
218       (20000, 25000]
224       (20000, 25000]
235       (20000, 25000]
245       (20000, 25000]
247       (20000, 25000]
250       (20000, 25000]
251       (20000, 25000]
254       (20000, 25000]
258       (20000, 25000]
               ...      
709804    (20000, 25000]
709817    (20000, 25000]
709818    (20000, 25000]
709824    (20000, 25000]
709825    (20000, 25000]
709827    (20000, 25000]
709828    (20000, 25000]
709830    (20000, 25000]
709831    (20000, 25000]
709833    (20000, 25000]
709834    (20000, 25000]
709840    (20000, 25000]
709842    (20000, 25000]
709851    (20000, 25000]
709852    (20000, 25000]
709855    (20000, 25000]
709857    (20000, 25000]
709858    (20000, 25000]
709863    (20000, 25000]
709865    (20000, 25000]
709866    (20000, 25000]
709870    (20000, 25000]
709875    (20000, 25000]
709878    (20000, 25000]
709879    (20000, 25000]
709884    (20000, 25000]
709889    (20000, 25000]
709895    (20000, 25000]
709897    (20000, 25000]
709898    (20000, 25000]
Name: bin, Length: 115539, dtype: category
Categories (6, interval[int64]): [(0, 5000] < (5000, 10000] < (10000, 15000] < (15000, 20000] < (20000, 25000] < (25000, 30000]]
liv['bin']
15        (15000, 20000]
27        (15000, 20000]
72        (15000, 20000]
98        (15000, 20000]
107       (15000, 20000]
133       (15000, 20000]
162       (15000, 20000]
178       (15000, 20000]
197       (15000, 20000]
213       (15000, 20000]
276       (15000, 20000]
285       (15000, 20000]
309       (15000, 20000]
340       (15000, 20000]
357       (15000, 20000]
371       (15000, 20000]
396       (15000, 20000]
429       (15000, 20000]
450       (15000, 20000]
451       (15000, 20000]
461       (15000, 20000]
464       (15000, 20000]
470       (15000, 20000]
476       (15000, 20000]
485       (15000, 20000]
487       (15000, 20000]
489       (15000, 20000]
495       (15000, 20000]
508       (15000, 20000]
510       (15000, 20000]
               ...      
708660    (15000, 20000]
708677    (15000, 20000]
708718    (15000, 20000]
708720    (15000, 20000]
708787    (15000, 20000]
708864    (15000, 20000]
708902    (15000, 20000]
708924    (15000, 20000]
709028    (15000, 20000]
709042    (15000, 20000]
709079    (15000, 20000]
709081    (15000, 20000]
709092    (15000, 20000]
709093    (15000, 20000]
709149    (15000, 20000]
709162    (15000, 20000]
709220    (15000, 20000]
709269    (15000, 20000]
709323    (15000, 20000]
709337    (15000, 20000]
709427    (15000, 20000]
709441    (15000, 20000]
709443    (15000, 20000]
709590    (15000, 20000]
709618    (15000, 20000]
709794    (15000, 20000]
709822    (15000, 20000]
709850    (15000, 20000]
709853    (15000, 20000]
709881    (15000, 20000]
Name: bin, Length: 18341, dtype: category
Categories (6, interval[int64]): [(0, 5000] < (5000, 10000] < (10000, 15000] < (15000, 20000] < (20000, 25000] < (25000, 30000]]

Every Detroit crash falls in the $20,000-$25,000 per-mile funding bin, while every Livonia crash falls in the $15,000-$20,000 bin, since each city carries a single funding value.

