In this project, I'll explore and cluster neighborhoods in Toronto with the help of the Foursquare API. This small project is tackled in several steps, starting with the library imports below.
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner
import re # regex library
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Libraries imported.')
webpage_response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser") # Instantiate BeautifulSoup object
# print(soup.tbody) # Uncomment to analyze which element we're going to scrape
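As a quick sanity check before committing to a scraping strategy, we can peek at the tags inside the first table row. This is a minimal sketch; the exact output depends on the current state of the Wikipedia page.
# Inspect the tag names inside the first row of the first table (assumes the page layout described below)
first_row = soup.find('table').find('tr')
print([cell.name for cell in first_row.find_all(['th', 'td'])])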
We can see that most of the content on the webpage sits inside the children of the <tbody> tag, so let's do some more exploratory analysis before actually scraping the contents. The table headers, such as the labels Postal Code and Borough, are stored in <th> tags. The contents of the table, in contrast, are stored in <td> tags, which are themselves children of <tr> tags. We can collect this information in a list, which we'll convert into a pandas dataframe later.
# Extract and clean the table header labels
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
headers = headers[:3] # keep only the three column headers of the postal-code table
# Display end results
print(headers)
# Convert all entries into list
table = soup.find('table')
table_rows = table.find_all('tr')
entries = list()
for tr in table_rows:
    td = tr.find_all('td')
    entries.append([cell.text.strip() for cell in td]) # use a distinct name so the outer loop variable isn't shadowed
# Convert the list contents into a dataframe, with column name as the headers list
df = pd.DataFrame(entries, columns=headers)
df.head()
Next, we clean the DataFrame so it satisfies the desired conditions: drop rows with NaN values and remove rows whose borough is not assigned.
## Drop NaN values
df.dropna(axis=0, inplace=True)
## Process the cells that have an assigned borough
to_drop = df.index[df['Borough'] == 'Not assigned'].tolist() # Get the index of all rows that don't have an assigned borough
df.drop(to_drop, inplace=True) # Drop values based on index
## Display processed dataframe
print(df.shape)
df.reset_index(drop=True, inplace=True)
df.head()
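As an optional sanity check (not part of the original flow), we can confirm that no 'Not assigned' boroughs remain after the drop.
# Should print 0 if every remaining row has an assigned borough
print((df['Borough'] == 'Not assigned').sum())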
We're going to import the pre-processed latitude/longitude values from this csv file, and then append them to our existing dataframe.
longlat = pd.read_csv("https://cocl.us/Geospatial_data")
longlat.head()
merged_df = df.merge(longlat, how = 'inner', on = ['Postal Code']) # Similar to SQL inner join
merged_df.head()
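A quick check (a small addition on my part) confirms that the inner join kept the postal codes and that no coordinates are missing.
# Compare row counts before/after the merge and look for missing coordinates
print(df.shape, merged_df.shape)
print(merged_df[['Latitude', 'Longitude']].isnull().sum())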
from sklearn.cluster import KMeans # Import clustering library
A quick Google search shows that Toronto lies at latitude 43.653908, longitude -79.384293. We assign these values to two variables before rendering the map with folium.
latitude, longitude = 43.653908, -79.384293
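Alternatively, since geopy's Nominatim geocoder is already imported, the coordinates could be resolved programmatically. This is just a sketch; it needs network access and a user_agent string of our own choosing, so it is left commented out.
# geolocator = Nominatim(user_agent="toronto_neighborhood_clustering")  # user_agent is an arbitrary identifier
# location = geolocator.geocode("Toronto, Ontario")
# latitude, longitude = location.latitude, location.longitude  # would overwrite the hard-coded values above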
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighborhood in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto
Note that when I share this notebook, I would have to redact the credentials. I hope this is understandable :)
CLIENT_ID = 'REDACTED' # your Foursquare ID
CLIENT_SECRET = 'REDACTED' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
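Before crawling every neighborhood, it can help to fire a single test request to verify the credentials. This is a sketch that hits the live Foursquare API; the 500 m radius and limit of 5 are arbitrary choices of mine.
# Build a test request around downtown Toronto and check the status code in the response metadata
test_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, 500, 5)
# print(requests.get(test_url).json()['meta'])  # uncomment to verify the credentials return code 200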
We're going to use these functions to explore neighborhoods in Toronto.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
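This helper isn't called explicitly below, but here is a tiny illustration of what it expects; the sample row is hypothetical, shaped like a flattened Foursquare venue record.
# sample_row = {'categories': [{'name': 'Coffee Shop'}]}   # hypothetical input
# print(get_category_type(sample_row))                     # would print 'Coffee Shop'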
# function that returns venues in an area, given latitude and longitude
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name) # uncomment to debug. I commented it to preserve memory

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            100) # Get the results of the top 100 venues.

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return(nearby_venues)
toronto_venues = getNearbyVenues(names=merged_df['Neighborhood'],
latitudes=merged_df['Latitude'],
longitudes=merged_df['Longitude']
)
print(toronto_venues.shape)
The resulting dataframe contains each venue's properties along with its corresponding neighborhood.
toronto_venues.head()
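A useful follow-up (my own addition) is to count how many venues were returned per neighborhood, since sparsely covered neighborhoods will carry less signal in the clustering.
# Number of venues returned for each neighborhood
print(toronto_venues.groupby('Neighborhood').count()['Venue'].head())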
We can also find out how many unique categories there are across all the returned venues.
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
Next, we one-hot encode the venue categories; this way, we can analyze each neighborhood without having to scroll through all entries in the toronto_venues dataframe.
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
# move the neighborhood column to the first column
neighborhood_column = toronto_onehot['Neighborhood']
toronto_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True)
toronto_onehot.insert(0, 'Neighborhood', neighborhood_column)
print(toronto_onehot.shape)
toronto_onehot.head()
We can then group rows by neighborhood and take the mean of the frequency of occurrence of each category.
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()
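To get an intuition for what these mean frequencies look like, we can peek at the top categories of a single neighborhood; the first row is an arbitrary choice.
# Show the five most frequent venue categories for the first neighborhood in the grouped dataframe
sample = toronto_grouped.iloc[0]
print(sample['Neighborhood'])
print(sample.drop('Neighborhood').astype(float).sort_values(ascending=False).head())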
To display the most common venues in descending order, we write a helper function. From there, we can create a new dataframe and display the top 10 venues for each neighborhood.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
We use k-means clustering with 5 clusters.
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # pass axis explicitly; the positional form is deprecated in recent pandas
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check unique values of cluster labels, to see whether we got it right
print(set(kmeans.labels_))
# check cluster labels generated for the first ten rows in the dataframe
print(kmeans.labels_[0:10] )
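The choice of five clusters is taken as given here. If we wanted to sanity-check it, a rough elbow inspection of the k-means inertia is one option; this is only a sketch, left commented out, and k = 5 is kept.
# inertias = [KMeans(n_clusters=k, random_state=0).fit(toronto_grouped_clustering).inertia_ for k in range(2, 10)]
# print(inertias)  # look for the k where the drop in inertia starts to flatten out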
We can create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = merged_df
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# Fill any missing values with cluster 0
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0)
toronto_merged.head() # check the "Cluster Labels" column to see changes
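Before mapping, we can also inspect the neighborhoods that landed in a single cluster; cluster 0 is an arbitrary example, and any label from 0 to 4 would work.
# List a few neighborhoods assigned to cluster 0 together with their most common venue
print(toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, ['Neighborhood', '1st Most Common Venue']].head())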
Finally, let's visualize the resulting clusters.
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
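Finally, the interactive map only renders inside the notebook; it can be saved to a standalone HTML file if needed (the filename below is just an example).
# map_clusters.save('toronto_clusters.html')  # optional: export the folium map for sharing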