Crime type is the primary feature of interest for this analysis and the stacked area plot below shows the absolute counts of each type of crime for each month in the dataset.
Note that each month's data is plotted on the last day of that month, so each point is the total for the month just ended.
The plot shows a notably rapid increase in reported crime for such a short period. Given that the data begins in April of 2020, these trends will have been impacted by COVID-19 and the corresponding national lockdown.
To examine the link between these factors, the plot below shows the data alongside the dates of key lockdown changes.
The timeline on this plot shows that the number of crimes in the dataset was low during the peak of the lockdown and began to rise following the easing of restrictions, before falling again as measures were reintroduced during the second wave.
For a better sense of the changes in each specific crime, the two plots below show the month-on-month changes in crime counts in both absolute terms and relative to the total count the previous month. These figures are interactive and each crime type can be shown or hidden by clicking its name in the legend.
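As a rough illustration of how such month-on-month figures could be derived, the sketch below uses toy counts rather than the real dataset. It assumes that "relative to the total count the previous month" means dividing each change by the previous month's total across all crime types; the variable names are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy monthly counts per crime type (illustrative values, not from the dataset)
counts = pd.DataFrame(
    {
        "Month": ["2020-04", "2020-04", "2020-05", "2020-05", "2020-06", "2020-06"],
        "Crime type": ["Burglary", "Drugs"] * 3,
        "Count": [100, 50, 120, 30, 90, 60],
    }
)

# One row per month, one column per crime type
wide = counts.pivot(index="Month", columns="Crime type", values="Count").sort_index()

# Absolute change: this month's count minus last month's count
absolute_change = wide.diff()

# Relative change (assumed definition): the absolute change divided by
# the previous month's total across all crime types
previous_total = wide.sum(axis=1).shift(1)
relative_change = absolute_change.div(previous_total, axis=0)

print(absolute_change)
print(relative_change)
```

The first month has no predecessor, so both frames hold missing values in their first row.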
Crime type is the primary feature of interest for this analysis and contains some unusual features and missing data. The police.uk FAQ (https://www.police.uk/pu/about-police.uk-crime-data) describes 14 crime types, each of which is present in the dataset. However, the data also contains an additional type which is not documented in the FAQ: "Exclusive". There are 119 entries with this crime type, occurring roughly 20 times each month, at a range of locations, with various LSOA codes and names, and differing outcomes. The meaning of this crime type is unclear and, beyond occurring almost the same number of times each month, the entries do not exhibit a clear pattern. Given the lack of documentation regarding the meaning of this categorisation and my lack of familiarity with policing, these data were excluded from the analysis.
There are also 2000 entries with null values for the crime type. These null entries all come from the May and September files, with each having exactly 1000. The exact counts of these null entries indicate a possible systematic error which caused them to be added. These entries also have null values in all fields (except crime ID), meaning it is not possible to infer or impute the missing crime types, and as such these entries were also excluded from the analysis.
This section details the chronological process of loading, understanding, plotting and evaluating the data with accompanying code and outputs. The full source code is available in the Jupyter notebook here.
First load a single csv to investigate the structure of the data.
import pandas as pd

# Read data into frame and display the columns
df = pd.read_csv("data/2020-04/2020-04-west-yorkshire-street.csv", index_col=0)
df.columns
Index(['Crime ID', 'Month', 'Falls within', 'Longitude', 'Latitude', 'Location', 'LSOA code', 'LSOA name', 'Crime type', 'Last outcome category', 'Context'], dtype='object')
# Examine the frame head
df.head()
| | Crime ID | Month | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 2020-04 | West Yorkshire Police | -1.550626 | 53.597400 | On or near Swithen Hill | E01007359 | Barnsley 005C | Anti-social behaviour | NaN | NaN |
| 1 | 5ea1997471c9de64fcfcf1145cadfff71ba37f21668d25... | 2020-04 | West Yorkshire Police | -1.670108 | 53.553629 | On or near Huddersfield Road | E01007426 | Barnsley 027D | Burglary | Investigation complete; no suspect identified | NaN |
| 2 | NaN | 2020-04 | West Yorkshire Police | -1.862742 | 53.940068 | On or near Smithy Greaves | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 3 | 0d8ee70dbd3096b4d07059d7f7c310fbf5de9cb7d44c31... | 2020-04 | West Yorkshire Police | -1.879031 | 53.943807 | On or near Cross End Fold | E01010646 | Bradford 001A | Shoplifting | Investigation complete; no suspect identified | NaN |
| 4 | cb4709a03d98dc63ba4c1771171bc7a9353097f5851d80... | 2020-04 | West Yorkshire Police | -1.882481 | 53.924936 | On or near Moorside Lane | E01010646 | Bradford 001A | Violence and sexual offences | Unable to prosecute suspect | NaN |
Next, add a general function to load in all csvs from the data folder.
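import os

# Get a list of .csv files from the data directory (assumes one csv per folder).
# The `files and` guard skips folders that contain no files, where files[0] would raise an IndexError.
csvs = [f"{root}/{files[0]}" for root, dirs, files in os.walk("data/") if files and "west-yorkshire-street.csv" in files[0]]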
# Combine all csv data into a single dataframe.
data: pd.DataFrame = None
for file in csvs:
    newData = pd.read_csv(file, index_col=0)
    if data is None:
        data = newData
    else:
        data = pd.concat((data, newData))
# Confirm that all the data is present
data["Month"].unique()
array(['2020-06', '2020-08', '2020-09', nan, '2020-07', '2020-05', '2020-04'], dtype=object)
This works when each directory contains only a single csv - as is the case in the provided data. However, to make the process more general it would be useful to support multiple csvs per folder.
csvs = []
for root, dirs, files in os.walk("data/"):
    for file in files:
        if ".csv" in file:
            csvs.append(f"{root}/{file}")
csvs
['data/2020-06/2020-06-west-yorkshire-street.csv', 'data/2020-08/2020-08-west-yorkshire-street.csv', 'data/2020-09/2020-09-west-yorkshire-street.csv', 'data/2020-07/2020-07-west-yorkshire-street.csv', 'data/2020-05/2020-05-west-yorkshire-street.csv', 'data/2020-04/2020-04-west-yorkshire-street.csv']
This works, but can be written more concisely as a list comprehension.
csvs = [f"{root}/{file}"
        for root, dirs, files in os.walk("data/")
        for file in files
        if ".csv" in file]
csvs
['data/2020-06/2020-06-west-yorkshire-street.csv',
'data/2020-08/2020-08-west-yorkshire-street.csv',
'data/2020-09/2020-09-west-yorkshire-street.csv',
'data/2020-07/2020-07-west-yorkshire-street.csv',
'data/2020-05/2020-05-west-yorkshire-street.csv',
'data/2020-04/2020-04-west-yorkshire-street.csv']
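As an aside, the same search can be sketched with the standard library's pathlib. This is an alternative and not the approach used in the notebook; the function name `find_csvs` is hypothetical. One difference is that the pattern `*.csv` only matches names that end in `.csv`, whereas the substring test `".csv" in file` would also accept a name such as `notes.csv.bak`.

```python
from pathlib import Path


def find_csvs(directory: str) -> list[str]:
    """Return the sorted paths of every file ending in .csv under `directory`."""
    # rglob searches the directory and all of its subdirectories
    return sorted(
        str(path) for path in Path(directory).rglob("*.csv") if path.is_file()
    )
```

Sorting the result also makes the load order independent of the order in which the operating system lists the folders.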
Create functions for loading data.
def findCSVsInDir(dir: str) -> list:
    '''
    Takes in a directory as an input string and returns a list of paths to each csv in the folder or
    any subfolders.
    '''
    return [f"{root}/{file}"
            for root, dirs, files in os.walk(dir)
            for file in files
            if ".csv" in file]

def readAllCSVsInDir(dir: str) -> pd.DataFrame:
    '''
    Returns a single dataframe containing a concatenation of all csvs within a particular folder and any
    of its subfolders.
    '''
    data = None
    for file in findCSVsInDir(dir):
        newData = pd.read_csv(file, index_col=0)
        if data is None:
            data = newData
        else:
            data = pd.concat((data, newData))
    return data
Next, check that it works.
crimesDF = readAllCSVsInDir("data")
print(crimesDF["Month"].unique())
crimesDF
['2020-06' '2020-08' '2020-09' nan '2020-07' '2020-05' '2020-04']
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9cfc0ed854bc20e2402d91de03c01bb0eec53ca7d1e52f... | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.764583 | 53.534617 | On or near Park/Open Space | E01007426 | Barnsley 027D | Burglary | Status update unavailable | NaN |
| 1 | e8ef06134d7cbd661b44b14b0090f533d767b1c56702fc... | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.764583 | 53.534617 | On or near Park/Open Space | E01007426 | Barnsley 027D | Burglary | Investigation complete; no suspect identified | NaN |
| 2 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.873004 | 53.941724 | On or near Cornerstones Close | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 3 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.882481 | 53.924936 | On or near Moorside Lane | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 4 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.873004 | 53.941724 | On or near Cornerstones Close | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21780 | d7659bf05dccb87e33d38dea8167489103fe6247a522de... | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Other crime | Investigation complete; no suspect identified | NaN |
| 21781 | d0f26c21c2c0aac15d667afa2a0406a2cdf1069be2b075... | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Other crime | Unable to prosecute suspect | NaN |
| 21782 | 413e818c0b01d0614e55398654d6cb7a64d83bd12e5833... | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Other crime | Unable to prosecute suspect | NaN |
| 21783 | e124b1d4f0c201248c69971b862cec6425343b0342f73a... | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Other crime | Unable to prosecute suspect | NaN |
| 21784 | be1051a23575910d1b81883cf4f2cee81473bdf4ae069b... | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Other crime | Unable to prosecute suspect | NaN |
158898 rows × 12 columns
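Note that the final rows above are labelled 21780 to 21784 even though the frame holds 158,898 rows: each csv keeps its own index, so labels repeat across files. The sketch below uses toy frames (not the real data) to show why this matters when rows are later removed by index label, and how a boolean mask or `ignore_index=True` avoids the problem.

```python
import pandas as pd

# Two toy "monthly files", each with its own 0-based index (illustrative data)
april = pd.DataFrame({"Crime type": ["Burglary", "Exclusive", "Drugs"]})
may = pd.DataFrame({"Crime type": ["Robbery", "Drugs", "Exclusive"]})

# Plain concatenation keeps each file's labels, so 0, 1 and 2 each appear twice
stacked = pd.concat([april, may])

# Dropping by label removes every row that shares a label with an "Exclusive" row
labels = stacked.index[stacked["Crime type"] == "Exclusive"]
dropped_by_label = stacked.drop(labels)

# A boolean mask removes only the matching rows, whatever the index looks like
filtered = stacked[stacked["Crime type"] != "Exclusive"]

# ignore_index=True gives every row its own label at concatenation time
renumbered = pd.concat([april, may], ignore_index=True)

print(len(stacked), len(dropped_by_label), len(filtered))
```

In this toy case only two of the six rows are "Exclusive", yet dropping by label removes four.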
Now that all data is loaded it can be investigated.
The meaning of each column is explained in the following table, taken from https://data.police.uk/about/#columns
| Field | Meaning |
|---|---|
| Reported by | The force that provided the data about the crime. |
| Falls within | At present, also the force that provided the data about the crime. This is currently being looked into and is likely to change in the near future. |
| Longitude and Latitude | The anonymised coordinates of the crime. See Location Anonymisation for more information. |
| LSOA code and LSOA name | References to the Lower Layer Super Output Area that the anonymised point falls into, according to the LSOA boundaries provided by the Office for National Statistics. |
| Crime type | One of the crime types listed in the Police.UK FAQ. |
| Last outcome category | A reference to whichever of the outcomes associated with the crime occurred most recently. For example, this crime's 'Last outcome category' would be 'Formal action is not in the public interest'. |
| Context | A field provided for forces to provide additional human-readable data about individual crimes. Currently, for newly added CSVs, this is always empty. |
The details of each crime type are as follows:
| Crime type | Description |
|---|---|
| All crime | Total for all categories. |
| Anti-social behaviour | Includes personal, environmental and nuisance anti-social behaviour. |
| Bicycle theft | Includes the taking without consent or theft of a pedal cycle. |
| Burglary | Includes offences where a person enters a house or other building with the intention of stealing. |
| Criminal damage and arson | Includes damage to buildings and vehicles and deliberate damage by fire. |
| Drugs | Includes offences related to possession, supply and production. |
| Other crime | Includes forgery, perjury and other miscellaneous crime. |
| Other theft | Includes theft by an employee, blackmail and making off without payment. |
| Possession of weapons | Includes possession of a weapon, such as a firearm or knife. |
| Public order | Includes offences which cause fear, alarm or distress. |
| Robbery | Includes offences where a person uses force or threat of force to steal. |
| Shoplifting | Includes theft from shops or stalls. |
| Theft from the person | Includes crimes that involve theft directly from the victim (including handbag, wallet, cash, mobile phones) but without the use or threat of physical force. |
| Vehicle crime | Includes theft from or of a vehicle or interference with a vehicle. |
| Violence and sexual offences | Includes offences against the person such as common assaults, Grievous Bodily Harm and sexual offences. |
# Check the number of records
len(crimesDF)
158898
# Examine columns and their types
print(crimesDF.convert_dtypes().dtypes)
Crime ID string[python]
Month string[python]
Reported by string[python]
Falls within string[python]
Longitude Float64
Latitude Float64
Location string[python]
LSOA code string[python]
LSOA name string[python]
Crime type string[python]
Last outcome category string[python]
Context Int64
dtype: object
# Examine unique values for each column
for col in crimesDF.columns:
    numUnique = len(crimesDF[col].unique())
    print(f"{col}: has {numUnique} unique entries")
Crime ID: has 129208 unique entries
Month: has 7 unique entries
Reported by: has 2 unique entries
Falls within: has 2 unique entries
Longitude: has 25818 unique entries
Latitude: has 25131 unique entries
Location: has 18712 unique entries
LSOA code: has 1419 unique entries
LSOA name: has 1419 unique entries
Crime type: has 16 unique entries
Last outcome category: has 14 unique entries
Context: has 1 unique entries
# List unique crime types present in the dataset
sorted(crimesDF["Crime type"].dropna().unique()) + ["Null"]
['Anti-social behaviour', 'Bicycle theft', 'Burglary', 'Criminal damage and arson', 'Drugs', 'Exclusive', 'Other crime', 'Other theft', 'Possession of weapons', 'Public order', 'Robbery', 'Shoplifting', 'Theft from the person', 'Vehicle crime', 'Violence and sexual offences', 'Null']
These are similar to the crime categories listed above; however, the "All crime" category is understandably not present and there is an extra category called "Exclusive".
The meaning of this "exclusive" crime type is not mentioned in the FAQ, nor in the list of Home Office Offence Codes provided here: https://www.police.uk/SysSiteAssets/police-uk/media/downloads/crime-categories/police-uk-category-mappings.csv
# Show crimes which have an "exclusive" type
crimesDF.loc[crimesDF["Crime type"] == "Exclusive"]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 223 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.734433 | 53.891805 | On or near Parking Area | E01010773 | Bradford 005C | Exclusive | NaN | NaN |
| 2982 | 27d686161e51ad5806bc54eeb20eccf69141eb2f49cca7... | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.732338 | 53.807628 | On or near Beldon Place | E01010828 | Bradford 035A | Exclusive | Investigation complete; no suspect identified | NaN |
| 3118 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.732609 | 53.800735 | On or near Wingfield Mount | E01010832 | Bradford 035E | Exclusive | NaN | NaN |
| 3532 | abbaa995c0069f4ac0a7a9d411c6004858a60d78fe811c... | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.734881 | 53.799194 | On or near Alcester Garth | E01010607 | Bradford 039B | Exclusive | Investigation complete; no suspect identified | NaN |
| 3849 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.744363 | 53.792968 | On or near A6181 | E01033693 | Bradford 039J | Exclusive | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16266 | 013a511d6d775aabf07ce9c08ed0b23a4d4c6f5056eba1... | 2020-04 | NaN | West Yorkshire Police | -1.549470 | 53.775533 | On or near Bude Road | E01011372 | Leeds 086C | Exclusive | Investigation complete; no suspect identified | NaN |
| 18231 | ccb10c712da0ebf2162bf1ae6635c2406668bb15aa73c0... | 2020-04 | NaN | West Yorkshire Police | -1.501665 | 53.765960 | On or near Park/Open Space | E01011470 | Leeds 112C | Exclusive | Investigation complete; no suspect identified | NaN |
| 18725 | NaN | 2020-04 | NaN | West Yorkshire Police | -1.340439 | 53.722974 | On or near Morton Crescent | E01011751 | Wakefield 005C | Exclusive | NaN | NaN |
| 19995 | 40c974c6a4274c85a103f7fc63725715b8816c37443a12... | 2020-04 | NaN | West Yorkshire Police | -1.310717 | 53.682332 | On or near Pease Close | E01011846 | Wakefield 023B | Exclusive | Action to be taken by another organisation | NaN |
| 20322 | NaN | 2020-04 | NaN | West Yorkshire Police | -1.362931 | 53.670685 | On or near Verner Street | E01011780 | Wakefield 027B | Exclusive | NaN | NaN |
119 rows × 12 columns
# List the counts of each "last outcome" for these "exclusive" crimes
exclusiveDF = crimesDF.loc[crimesDF["Crime type"] == "Exclusive"]
exclusiveDF["Last outcome category"].value_counts()
Last outcome category
Unable to prosecute suspect 45
Investigation complete; no suspect identified 42
Court result unavailable 4
Status update unavailable 3
Offender given a caution 1
Further investigation is not in the public interest 1
Formal action is not in the public interest 1
Local resolution 1
Action to be taken by another organisation 1
Name: count, dtype: int64
# Most of the crimes have the outcomes of "Unable to prosecute suspect"
# and "Investigation complete; no suspect identified" which is
# roughly consistent with the distribution throughout the data.
crimesDF["Last outcome category"].value_counts(ascending=False)
Last outcome category
Unable to prosecute suspect 63395
Investigation complete; no suspect identified 43585
Court result unavailable 10575
Status update unavailable 3605
Local resolution 2915
Offender given a caution 1294
Further investigation is not in the public interest 1121
Action to be taken by another organisation 501
Formal action is not in the public interest 279
Further action is not in the public interest 140
Awaiting court outcome 131
Suspect charged as part of another case 22
Offender given penalty notice 1
Name: count, dtype: int64
# List the counts of each "LSOA code" for these "exclusive" crimes
exclusiveDF["LSOA code"].value_counts()
LSOA code
E01011811 3
E01010782 2
E01011677 2
E01033693 2
E01010627 2
..
E01011757 1
E01011752 1
E01011493 1
E01011433 1
E01011780 1
Name: count, Length: 108, dtype: int64
# List the counts of each "month" for these "exclusive" crimes
crimesDF.loc[crimesDF["Crime type"] == "Exclusive"]["Month"].value_counts()
Month
2020-06 20
2020-08 20
2020-09 20
2020-07 20
2020-04 20
2020-05 19
Name: count, dtype: int64
There are 119 entries of this type, occurring 20 times in every month except May (19 times), with a range of outcomes.
Unfortunately, I cannot identify any pattern in the data recorded for this crime type.
Given my lack of familiarity with this area and the lack of documentation regarding the meaning of this categorisation, I will exclude these data from the analysis.
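The outcome comparison made above was done by eye on raw counts. A minimal sketch of one way to put the two distributions side by side as proportions is given below; it runs on toy data, and the column names `exclusive_share` and `overall_share` are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for crimesDF (illustrative values only)
toy = pd.DataFrame(
    {
        "Crime type": ["Exclusive", "Exclusive", "Burglary", "Burglary", "Drugs", "Drugs"],
        "Last outcome category": [
            "Unable to prosecute suspect",
            "Investigation complete; no suspect identified",
            "Unable to prosecute suspect",
            "Unable to prosecute suspect",
            "Investigation complete; no suspect identified",
            "Court result unavailable",
        ],
    }
)

subset = toy.loc[toy["Crime type"] == "Exclusive", "Last outcome category"]

# normalize=True returns each outcome's share of the non-null rows instead of a raw count
comparison = pd.DataFrame(
    {
        "exclusive_share": subset.value_counts(normalize=True),
        "overall_share": toy["Last outcome category"].value_counts(normalize=True),
    }
).fillna(0.0)  # outcomes absent from the subset get a share of 0

print(comparison)
```

Shares are easier to compare than counts when the two groups differ greatly in size, as they do here (119 entries against the whole dataset).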
Additionally, there are null values present for "crime type", and likely other columns, which should be examined.
# Count null values in each column
null_counts = pd.DataFrame({
'column': crimesDF.columns,
'null_count': [crimesDF[col].isnull().sum() for col in crimesDF.columns]
})
null_counts.sort_values(by="null_count", ascending=False)
| | column | null_count |
|---|---|---|
| 11 | Context | 158898 |
| 10 | Last outcome category | 31334 |
| 3 | Falls within | 31278 |
| 0 | Crime ID | 29689 |
| 2 | Reported by | 23785 |
| 7 | LSOA code | 5488 |
| 8 | LSOA name | 5488 |
| 4 | Longitude | 5487 |
| 5 | Latitude | 5487 |
| 1 | Month | 2000 |
| 6 | Location | 2000 |
| 9 | Crime type | 2000 |
The column documentation notes of the "Context" field that "Currently, for newly added CSVs, this is always empty", which explains why every value in that column is null.
For the other columns, values may be missing because the data was not available, was not entered, or was lost through an error.
2000 entries have no crime type.
# View crimes with missing types
crimesDF.loc[crimesDF["Crime type"].isna()]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 013d7f93d61de0036327474674b7d5767ffc9f0b2787a8... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | e0ec19a6355822b23d0d2a8be119d730449fa107e34173... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | fff65e2c833cff7fed3e9f4e9e15ca33008f1b55584743... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | e1aa4834bce7d8b7a0c8d761eddcc944514809e61f6c27... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 74 | 056dab0d794c00ff3ed10dffb56d7bf4702c2adbf60700... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24494 | 301c7de516f283f7796cb0a41c287849c4a3169b17880e... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24496 | f1c8b597d0830e2bed561230f8be7f3c52c9243b24b15f... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24500 | 3430117d1065ff30530009ceaec4a9c75467244cee35d6... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24572 | 960848317f66498ee2566de2a638046f8af6421c04f1f4... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24577 | 66fa8bf450b0ea292e06165eaae98b8194e47bdff03194... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2000 rows × 12 columns
# Check how many entries are totally blank
crimesDF.loc[crimesDF.drop(columns="Crime ID").isna().all(1)]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 013d7f93d61de0036327474674b7d5767ffc9f0b2787a8... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | e0ec19a6355822b23d0d2a8be119d730449fa107e34173... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | fff65e2c833cff7fed3e9f4e9e15ca33008f1b55584743... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | e1aa4834bce7d8b7a0c8d761eddcc944514809e61f6c27... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 74 | 056dab0d794c00ff3ed10dffb56d7bf4702c2adbf60700... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24494 | 301c7de516f283f7796cb0a41c287849c4a3169b17880e... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24496 | f1c8b597d0830e2bed561230f8be7f3c52c9243b24b15f... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24500 | 3430117d1065ff30530009ceaec4a9c75467244cee35d6... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24572 | 960848317f66498ee2566de2a638046f8af6421c04f1f4... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24577 | 66fa8bf450b0ea292e06165eaae98b8194e47bdff03194... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2000 rows × 12 columns
These 2000 entries have no data present for any field except a crime ID. A round number of exactly 2000 missing values may indicate a systematic error, although 2000 would not divide equally across the 6 csvs. Given this, it is worth examining how many erroneous entries are present in each csv.
# Count missing entries in each csv
for month in ["04", "05", "06", "07", "08", "09"]:
    csvPath = f"data/2020-{month}/2020-{month}-west-yorkshire-street.csv"
    monthDF = pd.read_csv(csvPath)
    print(month, len(monthDF.loc[monthDF["Crime type"].isna()]))
04 0
05 1000
06 0
07 0
08 0
09 1000
May and September each have exactly 1000 crimes with null types, while the other months have none.
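Because the blank rows also have a null "Month", they cannot be traced back to a file by grouping the combined frame on that column, which is why each csv was re-read above. A possible alternative, sketched here on toy frames, is to tag each row with its source when loading; the "Source" column is a hypothetical addition and not part of the original data.

```python
import pandas as pd

# Toy frames standing in for three monthly csvs (illustrative data only)
frames = {
    "2020-04": pd.DataFrame({"Crime type": ["Burglary", "Drugs"]}),
    "2020-05": pd.DataFrame({"Crime type": [None, "Robbery", None]}),
    "2020-06": pd.DataFrame({"Crime type": ["Drugs"]}),
}

# Tag every row with the file it came from ("Source" is a hypothetical column name)
tagged = pd.concat(
    [frame.assign(Source=name) for name, frame in frames.items()],
    ignore_index=True,
)

# Count null crime types per source; sources without nulls show as 0
null_counts = tagged["Crime type"].isna().groupby(tagged["Source"]).sum()
print(null_counts)
```

With such a tag in place, the per-file check becomes a single grouped count on the combined frame.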
# Show entries with null crime types from May csv
mayDF = pd.read_csv(f"data/2020-05/2020-05-west-yorkshire-street.csv", index_col=0)
mayDF.loc[mayDF["Crime type"].isna()]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 472dce151845a2dda9743bb01df3023f43e65137eb8409... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 53 | 612459be5a572b07fd7231477945210d86d0ebaa866199... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 61 | 7e94a8ca2bdd2e9d274e91b19305e98d014aa3902978b1... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 74 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82 | bce93b2672320dc226a1218a9ee5b55b1b482ef124cea8... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24494 | 301c7de516f283f7796cb0a41c287849c4a3169b17880e... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24496 | f1c8b597d0830e2bed561230f8be7f3c52c9243b24b15f... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24500 | 3430117d1065ff30530009ceaec4a9c75467244cee35d6... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24572 | 960848317f66498ee2566de2a638046f8af6421c04f1f4... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24577 | 66fa8bf450b0ea292e06165eaae98b8194e47bdff03194... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1000 rows × 12 columns
# Show entries with null crime types from September csv
sepDF = pd.read_csv(f"data/2020-09/2020-09-west-yorkshire-street.csv", index_col=0)
sepDF.loc[sepDF["Crime type"].isna()]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 013d7f93d61de0036327474674b7d5767ffc9f0b2787a8... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | e0ec19a6355822b23d0d2a8be119d730449fa107e34173... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | fff65e2c833cff7fed3e9f4e9e15ca33008f1b55584743... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | e1aa4834bce7d8b7a0c8d761eddcc944514809e61f6c27... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 74 | 056dab0d794c00ff3ed10dffb56d7bf4702c2adbf60700... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26950 | be87c653d51bc7d5712d603efbb0758059e643f720e890... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26962 | dbb921c1ba8c326ae976ef64624fa075c289228c7e391a... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26976 | 31141c7965263b6edb33ca7c98d5dae70b96fb90d0f60f... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26990 | 6ab052be140ec8ec56f40c6e1e2c95bfff832c2bf58da1... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26992 | f763e9f3fbd4120a465a58e8f3191c5d2a66098079d5be... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1000 rows × 12 columns
I am unable to discern a pattern in which entries are missing, or infer/impute any data due to the records being entirely blank. As such these entries will be dropped for the analysis.
The dataset comprises six csv files of street-level crime data between April and September 2020 (inclusive) from the West Yorkshire police. Cumulatively the data consists of 158,898 records, with the following 12 columns: Crime ID, Month, Reported by, Falls within, Longitude, Latitude, Location, LSOA code, LSOA name, Crime type, Last outcome category and Context.
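# Drop entries with "exclusive" crime types - as discussed above.
# A boolean mask is used because the concatenated frame repeats index labels across files,
# so dropping by index label would also remove unrelated rows.
crimesDF = crimesDF.loc[crimesDF["Crime type"] != "Exclusive"]
# Drop entries with null crime type
crimesDF = crimesDF.dropna(subset=["Crime type"]).copy()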
# Convert dtypes
crimesDF = crimesDF.convert_dtypes()
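# Convert month to datetime for ease of plotting
crimesDF["Month"] = pd.to_datetime(crimesDF["Month"], format="%Y-%m")
# Assumed definition of the "Month End" column used below: the last day of each month
crimesDF["Month End"] = crimesDF["Month"] + pd.offsets.MonthEnd(0)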
First create a plot of the total number of crimes each month by type.
# Create a frame with the counts for each crime grouped by month and type
crimeTypeCountsDF = crimesDF.groupby(["Month End", "Crime type"]).size().reset_index()
crimeTypeCountsDF = crimeTypeCountsDF.rename(columns={0: "Count"})
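The grouping above relies on a "Month End" column, which matches the note that each month is plotted on its last day. The sketch below shows, on toy records, one way such a column and the grouped counts could be produced. The use of `pd.offsets.MonthEnd(0)` is an assumption about how the column was built, not something confirmed by the notebook.

```python
import pandas as pd

# Toy records using the "YYYY-MM" month strings found in the csvs (illustrative data)
toy = pd.DataFrame(
    {
        "Month": ["2020-04", "2020-04", "2020-05", "2020-05", "2020-05"],
        "Crime type": ["Burglary", "Drugs", "Burglary", "Burglary", "Drugs"],
    }
)

# Parsing with "%Y-%m" yields the first day of each month
toy["Month"] = pd.to_datetime(toy["Month"], format="%Y-%m")

# MonthEnd(0) rolls each date forward to the last day of its own month
toy["Month End"] = toy["Month"] + pd.offsets.MonthEnd(0)

# One row per (month end, crime type) pair holding the number of records
counts = toy.groupby(["Month End", "Crime type"]).size().reset_index(name="Count")
print(counts)
```

Passing `name="Count"` to `reset_index` names the count column directly, which avoids the separate rename step.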
# Selection of nice colours based on those used by OWiD, from: https://gist.github.com/Digma/b91db287f8f577fae41c406892d46b15#file-ourworldindata_ghg_pandas-py
COLOR_SCALE = [
"#6D3E91", "#C05917", "#58AC8C", "#286BBB", "#883039", "#BC8E5A", "#00295B", "#C15065",
"#18470F", "#9A5129", "#E56E5A", "#A2559C", "#38AABA", "#578145", "#970046", "#00847E",
"#B13507", "#4C6A9C", "#CF0A66", "#00875E", "#B16214", "#8C4569", "#3B8E1D", "#D73C50"
]
import datetime as dt
import plotly.graph_objects as go

# Function to add labels to the figure (a Plotly figure, since it relies on add_annotation)
def addLabel(figure: go.Figure, text: str, xPos: dt.datetime = dt.datetime(year=2020, month=9, day=30), yPos: float = 0.5, ax: float = 45, ay: float = 0) -> None:
    figure.add_annotation(x=xPos, y=yPos, text=text, showarrow=True, arrowhead=0, xanchor="left", ax=ax, ay=ay, arrowcolor="#4d4d4d", arrowwidth=1)
fig = px.area(crimeTypeCountsDF, x="Month End", y="Count",
line_group="Crime type",
color="Crime type",
category_orders={"Crime type": crimesDF["Crime type"].value_counts().index},
markers=True,
color_discrete_sequence=COLOR_SCALE,
template="seaborn",
title="Count of Each Crime Type Per Month")
# NB: These labels have hard-coded positions and should be removed/amended if using different data
yScale = 26000
addLabel(fig, "Violence and<br>sexual offences", yPos=0.200*yScale, ay=32)
addLabel(fig, "Anti-social<br>behaviour", yPos=0.440*yScale, ay=38)
addLabel(fig, "Public order", yPos=0.590*yScale, ay=32)
addLabel(fig, "Criminal damage<br> and arson", yPos=0.690*yScale, ay=32)
addLabel(fig, "Other theft", yPos=0.765*yScale, ay=32)
addLabel(fig, "Burglary", yPos=0.817*yScale, ay=28)
addLabel(fig, "Vehicle crime", yPos=0.864*yScale, ay=22)
addLabel(fig, "Shoplifting", yPos=0.905*yScale, ay=18)
addLabel(fig, "Drugs", yPos=0.932*yScale, ay=11)
addLabel(fig, "Other crime", yPos=0.959*yScale, ay= 7)
addLabel(fig, "Robbery", yPos=0.972*yScale, ay=-1)
addLabel(fig, "Possession of weapons", yPos=0.985*yScale, ay=-10)
addLabel(fig, "Bicycle theft", yPos=0.990*yScale, ay=-22)
addLabel(fig, "Theft from the person", yPos=1.000*yScale, ay=-33)
fig.update_layout(
height=500,
showlegend=False, # Hide the legend when using the hard-coded labels. Remove this for an auto generated legend.
margin=dict(l=0, r=10, t=40, b=0),
)
fig.update_xaxes(title="Date", range=[pd.Timestamp('2020-04-20'), pd.Timestamp('2020-11-10')], tickformat="%d %B<br>%Y", ticks="outside", showgrid=False)
fig.update_yaxes(ticks="outside", col=1)
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange = True
fig_html = to_html(fig, include_plotlyjs=False, full_html=False, div_id="countGraphCrimeTypesPlot")
# print(fig_html)
fig.show()
This plot shows an unusually rapid increase then decrease in the number of crimes over a short period. Given that the data covers April to September 2020, this striking change can likely be attributed to the national COVID-19 lockdown which began on 23 March 2020: crime counts were depressed at first, rose gradually as restrictions eased, and then fell again in September as measures were reintroduced during the second wave.
To examine the link between these factors, it will be useful to plot the data alongside the dates of key lockdown rule changes.
fig = px.area(crimeTypeCountsDF, x="Month End", y="Count",
line_group="Crime type",
color="Crime type",
category_orders={"Crime type": crimesDF["Crime type"].value_counts().index},
markers=True,
color_discrete_sequence=COLOR_SCALE,
template="seaborn",
title="Count of Each Crime Type Per Month")
# Hide every crime-type trace from the legend (the lockdown lines added below set showlegend=True)
for trace in fig.data:
    trace.showlegend = False
# NB: These labels have hard-coded positions and should be removed/amended if using different data
yScale = 26000
addLabel(fig, "Violence and<br>sexual offences", yPos=0.200*yScale, ay=32)
addLabel(fig, "Anti-social<br>behaviour", yPos=0.440*yScale, ay=38)
addLabel(fig, "Public order", yPos=0.590*yScale, ay=32)
addLabel(fig, "Criminal damage<br> and arson", yPos=0.690*yScale, ay=32)
addLabel(fig, "Other theft", yPos=0.765*yScale, ay=32)
addLabel(fig, "Burglary", yPos=0.817*yScale, ay=28)
addLabel(fig, "Vehicle crime", yPos=0.864*yScale, ay=22)
addLabel(fig, "Shoplifting", yPos=0.905*yScale, ay=18)
addLabel(fig, "Drugs", yPos=0.932*yScale, ay=11)
addLabel(fig, "Other crime", yPos=0.959*yScale, ay= 7)
addLabel(fig, "Robbery", yPos=0.972*yScale, ay=-1)
addLabel(fig, "Possession of weapons", yPos=0.985*yScale, ay=-10)
addLabel(fig, "Bicycle theft", yPos=0.990*yScale, ay=-22)
addLabel(fig, "Theft from the person", yPos=1.000*yScale, ay=-33)
# Add lines for covid rule dates
fig.add_vline(x=pd.Timestamp('2020-03-23'), line_color="#bd6f51", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="National<br>lockdown<br>begins",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="right"
))
fig.add_vline(x=pd.Timestamp('2020-04-30'), line_color="#B13507", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="PM says<br>“we are past<br>the peak”<br>of the<br>pandemic",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="right"
))
fig.add_vline(x=pd.Timestamp('2020-05-10'), line_color="#984976", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="PM calls<br>for return<br>to work",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="left"
))
fig.add_vline(x=pd.Timestamp('2020-06-01'), line_color="#5e6c8e", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="Phased<br>school<br>return",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="left"
))
fig.add_vline(x=pd.Timestamp('2020-06-15'), line_color="#764c9d", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="Retail<br>reopens",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="left"
))
fig.add_vline(x=pd.Timestamp('2020-09-14'), line_color="#00295B", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="Gatherings<br>above six<br>banned",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="right"
))
fig.add_vline(x=pd.Timestamp('2020-09-22'), line_color="#883039", line_dash="dash", line_width=3, opacity=1.0, showlegend=True, label=dict(
text="PM announces<br>return to working<br>from home",
textposition="end",
yanchor="top",
textangle=0,
padding=10,
xanchor="left"
))
fig.update_layout(
showlegend=False,
height=500,
margin=dict(l=0, r=0, t=40, b=0),
legend_title="Lockdown<br>Regulation Changes"
)
fig.update_xaxes(title="Date", range=[pd.Timestamp('2020-03-01'), pd.Timestamp('2020-11-20')], tickformat="%d %B<br>%Y", ticks="outside", showgrid=False)
fig.update_yaxes(range=[0, 35e3], ticks="outside", col=1)
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange = True
fig_html = to_html(fig, include_plotlyjs=False, full_html=False, div_id="countGraphWithCovidData")
# print(fig_html)
fig.show()
To get a more specific view of how each crime's prevalence is increasing or decreasing, it is helpful to plot the month-on-month changes on separate graphs.
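The month-on-month calculation can be illustrated on a tiny frame first. This is a minimal sketch with made-up counts (the `toy` frame is hypothetical and not taken from the dataset); it assumes the rows are sorted chronologically within each crime type, since `diff` and `pct_change` compare each row with the one before it in its group.

```python
import pandas as pd

# Made-up counts for two crime types over three months (illustrative only).
toy = pd.DataFrame({
    "Crime type": ["Burglary"] * 3 + ["Drugs"] * 3,
    "Month End": pd.to_datetime(["2020-04-30", "2020-05-31", "2020-06-30"] * 2),
    "Count": [100, 120, 90, 50, 50, 75],
})

# Sort so that each group's rows are in chronological order before differencing.
toy = toy.sort_values(["Crime type", "Month End"]).reset_index(drop=True)

grouped = toy.groupby("Crime type")["Count"]
toy["Change"] = grouped.diff()                   # absolute change vs. previous month
toy["Change Proportion"] = grouped.pct_change()  # change as a fraction of previous month

print(toy)
```

The first month of each crime type has no predecessor, so both new columns are NaN there until they are filled.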
# Group crime types and compute the month-on-month change and percentage change
crimeTypeCountsDF["Change"] = crimeTypeCountsDF.groupby("Crime type")["Count"].diff()
crimeTypeCountsDF["Change Proportion"] = crimeTypeCountsDF.groupby("Crime type")["Count"].pct_change()
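# Replace the NaN values in each crime type's first month with 0 (only in the two new columns)
crimeTypeCountsDF[["Change", "Change Proportion"]] = crimeTypeCountsDF[["Change", "Change Proportion"]].fillna(0)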
# Plot the absolute month-on-month changes in crime counts
fig = px.line(crimeTypeCountsDF, x="Month End", y="Change",
    line_group="Crime type",
    color="Crime type",
    category_orders={"Crime type": crimesDF["Crime type"].value_counts().index},
    markers=True,
    color_discrete_sequence=COLOR_SCALE,
    template="seaborn",
    title="Absolute Change in Crime Counts Over Time; Grouped by Crime Type")
fig.update_layout(
    height=500,
    margin=dict(l=0, r=10, t=40, b=0),
    yaxis_title="Change in crime count<br>relative to previous month"
)
fig.add_hline(
    y=0,
    line_color="black",
    line_width=1,
    line_dash="solid",
    opacity=0.5
)
fig.update_xaxes(ticks="outside", showgrid=False)
# The y-axis shows raw counts here, so no percentage tick format is applied
fig.update_yaxes(ticks="outside", col=1)
fig.update_traces(opacity=1.0, selector=dict(type='scatter'))
# Set crime types to be shown initially
visible_traces = {"Violence and sexual offences", "Anti-social behaviour", "Public order", "Criminal damage and arson"}
for trace in fig.data:
    trace.visible = True if trace.name in visible_traces else "legendonly"
# When you click on a trace in the legend, show/hide it
fig.update_layout(legend_itemclick="toggle", legend_itemdoubleclick="toggleothers")
# Prevent cropping/moving figure when clicking on the plot
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange = True
fig_html = to_html(fig, include_plotlyjs=False, full_html=False, div_id="absoluteChangeCrimeGraph")
# print(fig_html)
fig.show()
# Plot the proportional month-on-month changes in crime counts
fig = px.line(crimeTypeCountsDF, x="Month End", y="Change Proportion",
line_group="Crime type",
color="Crime type",
category_orders={"Crime type": crimesDF["Crime type"].value_counts().index},
markers=True,
color_discrete_sequence=COLOR_SCALE,
template="seaborn",
title="Proportional Change in Crime Counts Over Time; Grouped by Crime Type")
fig.update_layout(
height=500,
margin=dict(l=0, r=10, t=40, b=0),
yaxis_title="Percentage change in crime count<br>relative to previous month"
)
fig.add_hline(
y=0,
line_color="black",
line_width=1,
line_dash="solid",
opacity=0.5
)
fig.update_xaxes(ticks="outside", showgrid=False)
fig.update_yaxes(tickformat=".0%", dtick=0.05, ticks="outside", col=1)
fig.update_traces(opacity=1.0, selector=dict(type='scatter'))
# Set crime types to be shown initially
visible_traces = {"Shoplifting", "Anti-social behaviour", "Public order", "Theft from the person"}
for trace in fig.data:
    trace.visible = True if trace.name in visible_traces else "legendonly"
# When you click on a trace in the legend, show/hide it
fig.update_layout(legend_itemclick="toggle", legend_itemdoubleclick="toggleothers")
# Prevent cropping/moving figure when clicking on the plot
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange = True
fig_html = to_html(fig, include_plotlyjs=False, full_html=False, div_id="proportionalChangeCrimeGraph")
# print(fig_html)
fig.show()
Over the 6-month period, the counts of each type of crime in the dataset increased during June and July as COVID-19 lockdown restrictions were eased, before falling again as measures came back into effect in September. Shoplifting and theft saw their largest increase (of roughly 40%) in July, after retail reopened on 15 June, while "anti-social behaviour" and "violence and sexual offences" had their largest increases in June and July respectively. The data also has some unusual features in the crime type column: two months feature an unexpected pattern of exactly 1000 blank entries, and an undocumented "exclusive" crime type occurs roughly 20 times each month.
It is also worth noting that this data only represents crimes which the police were aware of and documented. It is therefore useful to consider to what extent these trends reflect actual changes in the amount of crime committed, and to what extent they reflect fewer crimes being spotted, reported and documented by the police. The effect of unreported crimes over this period will vary by crime type. For instance, shoplifting can only occur in stores, and as such is expected to fall as shops close, whereas violent and sexual crimes can occur anywhere and will be harder to detect, and harder for victims to report, when they take place in the home. With more time, quantifying the effects of under-reporting by cross-referencing with other data sources would be a key area of focus, in addition to including, and comparing with, data from other years.
This section contains additional analysis which, while not pertinent to the task outlined, caught my interest and curiosity.
sns.violinplot(crimesDF["Latitude"], orient="h", width=0.9, gridsize=1000, linewidth=0.5);
The latitudes and longitudes are extremely skewed, with many values falling outside the bounds of the UK, let alone West Yorkshire.
UK Long range: -8.23 <-> 1.75 (NI to Norwich)
UK Lat range: 49.16 <-> 62.28 (Jersey to Faroe)
West Yorkshire Long range: -2.21 <-> -1.09
West Yorkshire Lat range: 53.5 <-> 54
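These ranges amount to two nested bounding boxes, so each coordinate pair can be sorted into one of a few buckets. The sketch below is a minimal, dependency-free illustration using the rounded bounds quoted above (the notebook's own filters use slightly different cut-offs); `BBox` and `classify` are hypothetical helpers introduced here, and a rectangular box is only a rough stand-in for the real boundaries.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned latitude/longitude bounding box."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat, lon):
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)


# Approximate bounds quoted in the text above.
WEST_YORKSHIRE = BBox(lat_min=53.5, lat_max=54.0, lon_min=-2.21, lon_max=-1.09)
UK = BBox(lat_min=49.16, lat_max=62.28, lon_min=-8.23, lon_max=1.75)


def classify(lat, lon):
    """Label a coordinate pair by the smallest box that contains it."""
    if lat is None or lon is None:
        return "missing"
    if WEST_YORKSHIRE.contains(lat, lon):
        return "West Yorkshire"
    if UK.contains(lat, lon):
        return "elsewhere in UK"
    return "outside UK"


# Coordinates taken from values that appear in this notebook's outputs.
for point in [(53.941724, -1.873004), (54.821672, -1.598217), (99.528496, 98.15016)]:
    print(point, "->", classify(*point))
```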
# Check the ranges of the entries outside the UK (locDF_non_UK is built in the filtering cell further down)
locDF_non_UK.max(), locDF_non_UK.min()
(Latitude     99.528496
 Longitude     98.15016
 dtype: Float64,
 Latitude    -96.415042
 Longitude   -99.342253
 dtype: Float64)
# Get a frame with all non-null latitude and longitude coordinates
locDF = crimesDF[["Latitude", "Longitude"]].dropna()
# Locate all entries which are within the expected bounds of West Yorkshire
locDF_Yorkshire = locDF.loc[(locDF["Latitude"] <= 54) & (locDF["Latitude"] >= 53) & (locDF["Longitude"] >= -2.21) & (locDF["Longitude"] <= -1.1)]
# Locate all entries which are roughly within the bounds of the UK
locDF_UK = locDF.loc[(locDF["Latitude"] <= 62) & (locDF["Latitude"] >= 49) & (locDF["Longitude"] >= -8.2) & (locDF["Longitude"] <= 1.75)]
# Locate all entries which are beyond the UK
locDF_non_UK = locDF.loc[~((locDF["Latitude"] <= 62) & (locDF["Latitude"] >= 49) & (locDF["Longitude"] >= -8.2) & (locDF["Longitude"] <= 1.75))]
# View the points outside the UK on a world map
fig = px.scatter_map(locDF_non_UK, lat="Latitude", lon="Longitude",
center=dict(lat=df["Latitude"].mean(), lon=df["Longitude"].mean()),
zoom=2,
opacity=1,
map_style="open-street-map",
)
# fig.show()
plot(fig, auto_open=True)
# Check for points which are within the UK but outside of West Yorkshire
df_all = locDF_UK.merge(locDF_Yorkshire.drop_duplicates(), on=['Latitude','Longitude'],
how='left', indicator=True)
ukNotYorkshireDF = df_all.loc[df_all['_merge'] == 'left_only']
print(ukNotYorkshireDF["Latitude"].max(), ukNotYorkshireDF["Latitude"].min())
print(ukNotYorkshireDF["Longitude"].max(), ukNotYorkshireDF["Longitude"].min())
ukNotYorkshireDF
54.821672 54.089975
-0.891841 -1.598217
| | Latitude | Longitude | _merge |
|---|---|---|---|
| 9018 | 54.821672 | -1.598217 | left_only |
| 9019 | 54.821672 | -1.598217 | left_only |
| 9020 | 54.821672 | -1.598217 | left_only |
| 9021 | 54.821672 | -1.598217 | left_only |
| 9022 | 54.821672 | -1.598217 | left_only |
| 9023 | 54.821672 | -1.598217 | left_only |
| 9024 | 54.821672 | -1.598217 | left_only |
| 9025 | 54.821672 | -1.598217 | left_only |
| 9026 | 54.821672 | -1.598217 | left_only |
| 35925 | 54.33734 | -1.42805 | left_only |
| 50451 | 54.089975 | -0.891841 | left_only |
| 63431 | 54.821672 | -1.598217 | left_only |
| 89414 | 54.821672 | -1.598217 | left_only |
| 89415 | 54.821672 | -1.598217 | left_only |
| 89416 | 54.821672 | -1.598217 | left_only |
| 89417 | 54.821672 | -1.598217 | left_only |
| 89418 | 54.821672 | -1.598217 | left_only |
| 89419 | 54.821672 | -1.598217 | left_only |
| 89420 | 54.821672 | -1.598217 | left_only |
| 89421 | 54.821672 | -1.598217 | left_only |
| 89422 | 54.821672 | -1.598217 | left_only |
| 89423 | 54.821672 | -1.598217 | left_only |
| 89424 | 54.821672 | -1.598217 | left_only |
| 89425 | 54.821672 | -1.598217 | left_only |
| 89426 | 54.821672 | -1.598217 | left_only |
| 89427 | 54.821672 | -1.598217 | left_only |
| 89428 | 54.821672 | -1.598217 | left_only |
| 89429 | 54.821672 | -1.598217 | left_only |
| 116400 | 54.821672 | -1.598217 | left_only |
| 116401 | 54.33734 | -1.42805 | left_only |
| 116402 | 54.181446 | -1.457476 | left_only |
| 138841 | 54.308725 | -1.566156 | left_only |
fig = px.density_map(latLongDF_UK, lat="Latitude", lon="Longitude", z=None,
radius=7,
center=dict(lat=df["Latitude"].mean(), lon=df["Longitude"].mean()),
zoom=10,
opacity=0.5,
range_color=[0,len(latLongDF)*9e-5],
map_style="open-street-map",
hover_data=None)
# fig.show()
plot(fig, auto_open=True)
While this plot is somewhat interesting to look at, crime only occurs where there are people to commit it, so the map mostly shows the distribution of the population.
# Show crimes with missing IDs
crimesDF.loc[crimesDF["Crime ID"].isna()]
| | Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.873004 | 53.941724 | On or near Cornerstones Close | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 3 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.882481 | 53.924936 | On or near Moorside Lane | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 4 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.873004 | 53.941724 | On or near Cornerstones Close | E01010646 | Bradford 001A | Anti-social behaviour | NaN | NaN |
| 11 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.890771 | 53.946029 | On or near Green Lane | E01010648 | Bradford 001C | Anti-social behaviour | NaN | NaN |
| 16 | NaN | 2020-06 | West Yorkshire Police | West Yorkshire Police | -1.828609 | 53.920224 | On or near Queen'S Gardens | E01010692 | Bradford 001D | Anti-social behaviour | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21420 | NaN | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Anti-social behaviour | NaN | NaN |
| 21421 | NaN | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Anti-social behaviour | NaN | NaN |
| 21422 | NaN | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Anti-social behaviour | NaN | NaN |
| 21423 | NaN | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Anti-social behaviour | NaN | NaN |
| 21424 | NaN | 2020-04 | NaN | West Yorkshire Police | NaN | NaN | No Location | NaN | NaN | Anti-social behaviour | NaN | NaN |
29689 rows × 12 columns
len(crimesDF.loc[crimesDF["Crime ID"].isna()]) / len(crimesDF)
0.18684313207214692
29689 entries have no crime ID, which is ~18.7% of the records. A cursory glance at some of the data indicates that many of these are anti-social behaviour entries.
# Check types of crimes with missing IDs
crimesDF.loc[crimesDF["Crime ID"].isna()]["Crime type"].unique()
<StringArray>
['Anti-social behaviour', <NA>]
Length: 2, dtype: string
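Beyond listing which crime types have missing IDs, it can help to see how many IDs are missing within each type and what share of that type they represent. The sketch below shows one way to do this on a small made-up frame; the `toy` data and the output column names are hypothetical, and only the `Crime ID` and `Crime type` column names come from the dataset.

```python
import pandas as pd

# Made-up records (illustrative only); None marks a missing value.
toy = pd.DataFrame({
    "Crime ID": ["a1", None, None, "b2", None, "c3"],
    "Crime type": ["Burglary", "Anti-social behaviour", "Anti-social behaviour",
                   "Drugs", None, "Burglary"],
})

missing_id = toy["Crime ID"].isna()

# dropna=False keeps rows whose crime type is itself missing as their own group.
summary = (
    toy.assign(**{"Missing ID": missing_id})
       .groupby("Crime type", dropna=False)["Missing ID"]
       .agg(["sum", "mean"])
       .rename(columns={"sum": "Missing IDs", "mean": "Share missing"})
)

print(summary)
print("Overall share missing:", missing_id.mean())
```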
ax = sns.barplot(crimesDF["Last outcome category"].value_counts(ascending=False), width=1)
# Rotate the tick labels and anchor them at their right-hand end so they line up with the bars
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor");