Applied Data Science Capstone: Week 3

Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

In this project, I'll explore and cluster neighborhoods in Toronto with the help of the Foursquare API. The project is tackled in four steps:

  1. Scraping a Wikipedia page with BeautifulSoup.
  2. Cleaning the dataframe, which contains the columns postal code, borough, and neighborhood name.
  3. Getting latitude and longitude coordinates for each neighborhood.
  4. Exploring and clustering the neighborhoods in Toronto.
In [1]:
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner
import re # regex library

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')
Libraries imported.

1. Scraping a Wikipedia page with BeautifulSoup

In [2]:
webpage_response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M") 
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser") # Instantiate BeautifulSoup object
# print(soup.tbody) # Uncomment to analyze which element we're going to scrape

We can see that most of the content in the webpage is stored in the child tags of tbody. So, let's explore the structure a little more before actually scraping the contents.

  • The table headers, such as the labels Postal Code and Borough, are stored in <th> tags.

  • The table contents, in contrast, are stored in <td> tags, which are children of <tr> tags.

We can store this information in a list, which we'll convert into a pandas dataframe later.
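
To make the <th>/<td> layout concrete, here is a minimal sketch that parses a made-up three-row table. It uses Python's standard-library HTMLParser rather than BeautifulSoup so that it runs without network access or extra packages; SAMPLE_HTML and the TableParser class are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the Wikipedia table (not the real page).
SAMPLE_HTML = """
<table><tbody>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
"""


class TableParser(HTMLParser):
    """Collect header cells (<th>) and data rows (<td> inside <tr>)."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._cell_tag = None  # 'th' or 'td' while we are inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # every <tr> starts a new row
        elif tag == "th":
            self._cell_tag = tag
            self.headers.append("")       # start an empty header cell
        elif tag == "td":
            self._cell_tag = tag
            self.rows[-1].append("")      # start an empty data cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell_tag = None

    def handle_data(self, data):
        if self._cell_tag == "th":
            self.headers[-1] += data
        elif self._cell_tag == "td":
            self.rows[-1][-1] += data


parser = TableParser()
parser.feed(SAMPLE_HTML)
headers = [h.strip() for h in parser.headers]
# The header row contains no <td> cells, so its row is empty and is skipped here.
rows = [[cell.strip() for cell in row] for row in parser.rows if row]
print(headers)
print(rows)
```

The header row contributes no <td> cells, which is why a dataframe built from every <tr> of the real page starts with a row of None values.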

In [3]:
# Extract the text of each header cell and keep the first three.
# (str.strip("<th>") removes any of the characters <, t, h, > rather than the literal tag, so get_text is safer.)
headers = [th.get_text(strip=True) for th in soup.find_all("th")][:3]
# Display end results
print(headers)
['Postal Code', 'Borough', 'Neighborhood']
In [4]:
# Convert all entries into list
table = soup.find('table')
table_rows = table.find_all('tr')
entries = list()
for tr in table_rows:
    cells = tr.find_all('td') # the header row has no <td> cells, so it yields an empty list
    entries.append([cell.text.strip() for cell in cells])

# Convert the list contents into a dataframe, with column name as the headers list
df = pd.DataFrame(entries, columns=headers)
df.head()
Out[4]:
   Postal Code  Borough       Neighborhood
0  None         None          None
1  M1A          Not assigned  Not assigned
2  M2A          Not assigned  Not assigned
3  M3A          North York    Parkwoods
4  M4A          North York    Victoria Village

2. Cleaning the DataFrame

Here are the conditions of the desired DataFrame state:

  • Only process the rows that have an assigned borough. Ignore rows whose borough is Not assigned.
  • If a row has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
  • Use the .shape attribute to print the number of rows of the processed dataframe.
In [5]:
## Drop rows with missing values (the empty row produced by the table header)
df.dropna(axis=0, inplace=True)

## Keep only the rows that have an assigned borough
to_drop = df.index[df['Borough'] == 'Not assigned'].tolist() # Get the index of all rows without an assigned borough
df.drop(to_drop, inplace=True) # Drop those rows by index

## Display processed dataframe
print(df.shape)
df.reset_index(drop=True, inplace=True)
df.head()
(103, 3)
Out[5]:
   Postal Code  Borough           Neighborhood
0  M3A          North York        Parkwoods
1  M4A          North York        Victoria Village
2  M5A          Downtown Toronto  Regent Park, Harbourfront
3  M6A          North York        Lawrence Manor, Lawrence Heights
4  M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government
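
The cell above drops the unassigned boroughs and prints the shape; the second rule (copying the borough into a Not assigned neighborhood) is not applied there, which only matters if such rows exist. As a hedged sketch of the three rules together, the snippet below works on a toy list of rows in plain Python rather than pandas. The M9Z row is a made-up example, and clean_rows is a hypothetical helper.

```python
# Toy rows in the same three-column layout; the M9Z row is invented for illustration.
rows = [
    {"Postal Code": "M1A", "Borough": "Not assigned", "Neighborhood": "Not assigned"},
    {"Postal Code": "M3A", "Borough": "North York", "Neighborhood": "Parkwoods"},
    {"Postal Code": "M9Z", "Borough": "Etobicoke", "Neighborhood": "Not assigned"},
]


def clean_rows(rows):
    """Apply the cleaning rules listed above to a list of row dicts."""
    cleaned = []
    for row in rows:
        # Rule 1: ignore rows without an assigned borough.
        if row["Borough"] == "Not assigned":
            continue
        row = dict(row)  # copy so the input list is left untouched
        # Rule 2: an unassigned neighborhood takes the borough's name.
        if row["Neighborhood"] == "Not assigned":
            row["Neighborhood"] = row["Borough"]
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(rows)
# Rule 3: report the number of rows that remain.
print(len(cleaned))
```

In pandas, the second rule can be written with a boolean mask and .loc, for example by assigning df.loc[mask, 'Borough'] to df.loc[mask, 'Neighborhood'], where mask selects the rows whose neighborhood is Not assigned.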

3. Get longitude and latitude values for each postal code

We're going to import pre-processed latitude/longitude values from this CSV file. Then, we're going to merge the imported values into our existing dataframe on the postal code.

In [6]:
longlat = pd.read_csv("https://cocl.us/Geospatial_data")
longlat.head()
Out[6]:
   Postal Code  Latitude   Longitude
0  M1B          43.806686  -79.194353
1  M1C          43.784535  -79.160497
2  M1E          43.763573  -79.188711
3  M1G          43.770992  -79.216917
4  M1H          43.773136  -79.239476
In [7]:
merged_df = df.merge(longlat, how = 'inner', on = ['Postal Code']) # Similar to SQL inner join
merged_df.head()
Out[7]:
   Postal Code  Borough           Neighborhood                                 Latitude   Longitude
0  M3A          North York        Parkwoods                                    43.753259  -79.329656
1  M4A          North York        Victoria Village                             43.725882  -79.315572
2  M5A          Downtown Toronto  Regent Park, Harbourfront                    43.654260  -79.360636
3  M6A          North York        Lawrence Manor, Lawrence Heights             43.718518  -79.464763
4  M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government  43.662301  -79.389494
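
To illustrate what the inner join on Postal Code does, here is a small plain-Python sketch (not the pandas implementation). The coordinates are copied from the outputs above; the M0X row is a made-up postal code with no coordinates, added only to show that unmatched rows are dropped.

```python
# Left side: (postal code, borough). M0X is hypothetical and has no coordinates.
boroughs = [
    ("M3A", "North York"),
    ("M4A", "North York"),
    ("M0X", "Nowhere"),
]

# Right side: postal code -> (latitude, longitude).
coords = {
    "M3A": (43.753259, -79.329656),
    "M4A": (43.725882, -79.315572),
    "M1B": (43.806686, -79.194353),
}

# Inner join: keep a row only when its key appears on both sides.
merged = [
    (code, borough, *coords[code])
    for code, borough in boroughs
    if code in coords
]

for row in merged:
    print(row)
```

M0X (left only) and M1B (right only) both disappear from the result, which is the behavior that how='inner' asks for.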

4. Explore and Cluster Neighborhoods in Toronto

In [8]:
from sklearn.cluster import KMeans # Import clustering library

Create a map of Toronto with neighborhoods superimposed on top.

A quick Google search shows that Toronto lies at latitude 43.653908, longitude -79.384293. We assign both values in a single statement before drawing the map with folium.

In [9]:
latitude, longitude = 43.653908, -79.384293

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
Out[9]:
[Interactive folium map: Toronto with a circle marker for each neighborhood]

Define Foursquare Credentials and Version

Note that I have redacted the credentials in the shared version of this notebook. I hope this is understandable :)

In [22]:
CLIENT_ID = 'REDACTED' # your Foursquare ID
CLIENT_SECRET = 'REDACTED' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
Your credentials:
CLIENT_ID: REDACTED
CLIENT_SECRET: REDACTED

Define a few functions to use for exploratory analysis.

We're going to use these functions to explore neighborhoods in Toronto.

In [11]:
# function that extracts the category of the venue
def get_category_type(row):
    # the categories may be stored under either key, depending on the shape of the response
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# function that returns venues in an area, given latitude and longitude
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name) # uncomment to debug; left commented out to keep the notebook output small
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100) # Get the results of the top 100 venues.
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
In [12]:
toronto_venues = getNearbyVenues(names=merged_df['Neighborhood'],
                                   latitudes=merged_df['Latitude'],
                                   longitudes=merged_df['Longitude']
                                  )

print(toronto_venues.shape)
(2131, 7)
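
As a side note on the request URL built inside getNearbyVenues, the sketch below assembles the same query string with the standard library's urlencode instead of str.format. No request is sent. The function name build_explore_url is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL without sending a request (illustrative helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude as one parameter
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes each value and joins the pairs with '&'
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url("REDACTED", "REDACTED", "20180605", 43.753259, -79.329656)
query = parse_qs(urlsplit(url).query)  # parse it back to check what was encoded
print(query["ll"], query["radius"], query["limit"])
```

Keeping the parameters in a dict makes it harder to misplace a positional argument, and the same dict could be passed to requests.get through its params argument.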

Display the resulting dataframe

The dataframe lists each venue's properties along with the neighborhood it belongs to.

In [13]:
toronto_venues.head()
Out[13]:
   Neighborhood      Neighborhood Latitude  Neighborhood Longitude  Venue                              Venue Latitude  Venue Longitude  Venue Category
0  Parkwoods         43.753259              -79.329656              Brookbanks Park                    43.751976       -79.332140       Park
1  Parkwoods         43.753259              -79.329656              Variety Store                      43.751974       -79.333114       Food & Drink Shop
2  Parkwoods         43.753259              -79.329656              Corrosion Service Company Limited  43.752432       -79.334661       Construction & Landscaping
3  Victoria Village  43.725882              -79.315572              Victoria Village Arena             43.723481       -79.315635       Hockey Arena
4  Victoria Village  43.725882              -79.315572              Portugril                          43.725819       -79.312785       Portuguese Restaurant

We can count how many unique categories appear among the returned venues.

In [14]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
There are 273 unique categories.

Get One-Hot dummy variables for each venue category

This way, we can analyze each neighborhood without having to scroll through all entries in the toronto_venues dataframe.

In [15]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move the last column to the front
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# make sure the neighborhood column ends up first
neighborhood_column = toronto_onehot['Neighborhood']
toronto_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True)
toronto_onehot.insert(0, 'Neighborhood', neighborhood_column)
In [16]:
print(toronto_onehot.shape)
toronto_onehot.head()
(2131, 273)
Out[16]:
Neighborhood Yoga Studio Accessories Store Afghan Restaurant Airport Airport Food Court Airport Gate Airport Lounge Airport Service Airport Terminal American Restaurant Antique Shop Aquarium Art Gallery Art Museum Arts & Crafts Store Asian Restaurant Athletics & Sports Auto Garage Auto Workshop BBQ Joint Baby Store Bagel Shop Bakery Bank Bar Baseball Field Baseball Stadium Basketball Court Basketball Stadium Beach Bed & Breakfast Beer Bar Beer Store Belgian Restaurant Bike Shop Bistro Boat or Ferry Bookstore Boutique Brazilian Restaurant Breakfast Spot Brewery Bridal Shop Bubble Tea Shop Building Burger Joint Burrito Place Bus Line Bus Station Bus Stop Business Service Butcher Café Cajun / Creole Restaurant Camera Store Candy Store Caribbean Restaurant Cheese Shop Chinese Restaurant Chocolate Shop Church Climbing Gym Clothing Store Cocktail Bar Coffee Shop College Arts Building College Auditorium College Gym College Rec Center College Stadium Colombian Restaurant Comfort Food Restaurant Comic Shop Concert Hall Construction & Landscaping Convenience Store Convention Center Cosmetics Shop Coworking Space Creperie Cuban Restaurant Cupcake Shop Curling Ice Dance Studio Deli / Bodega Department Store Dessert Shop Dim Sum Restaurant Diner Discount Store Distribution Center Dog Run Doner Restaurant Donut Shop Drugstore Eastern European Restaurant Electronics Store Ethiopian Restaurant Event Space Falafel Restaurant Farm Farmers Market Fast Food Restaurant Field Filipino Restaurant Financial or Legal Service Fish & Chips Shop Fish Market Flea Market Flower Shop Food & Drink Shop Food Court Food Service Food Truck Fountain French Restaurant Fried Chicken Joint Frozen Yogurt Shop Fruit & Vegetable Store Furniture / Home Store Gaming Cafe Garden Garden Center Gas Station Gastropub Gay Bar General Entertainment General Travel German Restaurant Gift Shop Gluten-free Restaurant Golf Course Gourmet Shop Greek Restaurant Grocery Store Gym Gym / Fitness Center Hakka Restaurant 
Harbor / Marina Hardware Store Health & Beauty Service Health Food Store Historic Site History Museum Hobby Shop Hockey Arena Home Service Hookah Bar Hospital Hot Dog Joint Hotel Hotel Bar IT Services Ice Cream Shop Indian Restaurant Indie Movie Theater Indonesian Restaurant Intersection Irish Pub Italian Restaurant Japanese Restaurant Jazz Club Jewelry Store Juice Bar Korean Restaurant Lake Latin American Restaurant Light Rail Station Lingerie Store Liquor Store Lounge Luggage Store Mac & Cheese Joint Malay Restaurant Market Martial Arts Dojo Massage Studio Medical Center Mediterranean Restaurant Men's Store Metro Station Mexican Restaurant Middle Eastern Restaurant Miscellaneous Shop Mobile Phone Shop Modern European Restaurant Molecular Gastronomy Restaurant Monument / Landmark Moroccan Restaurant Motel Movie Theater Moving Target Museum Music Venue New American Restaurant Nightclub Noodle House Office Opera House Optical Shop Organic Grocery Other Great Outdoors Park Performing Arts Venue Pet Store Pharmacy Pizza Place Playground Plaza Poke Place Pool Portuguese Restaurant Poutine Place Pub Ramen Restaurant Record Shop Recording Studio Rental Car Location Rental Service Restaurant River Roof Deck Sake Bar Salad Place Salon / Barbershop Sandwich Place Scenic Lookout Sculpture Garden Seafood Restaurant Shoe Store Shopping Mall Skate Park Skating Rink Smoke Shop Smoothie Shop Snack Place Soccer Field Social Club Soup Place Spa Speakeasy Sporting Goods Shop Sports Bar Stadium Stationery Store Steakhouse Strip Club Supermarket Supplement Shop Sushi Restaurant Swim School Taco Place Tailor Shop Taiwanese Restaurant Tanning Salon Tea Room Tennis Court Thai Restaurant Theater Theme Restaurant Toy / Game Store Trail Train Station Vegetarian / Vegan Restaurant Video Game Store Video Store Vietnamese Restaurant Warehouse Store Wine Bar Wine Shop Wings Joint Women's Store
0 Parkwoods 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 Parkwoods 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 Parkwoods 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 Victoria Village 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 Victoria Village 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We can then group the rows by neighborhood and take the mean of each column, which gives the frequency of occurrence of each category in that neighborhood.

In [17]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()
Out[17]:
Neighborhood Yoga Studio Accessories Store Afghan Restaurant Airport Airport Food Court Airport Gate Airport Lounge Airport Service Airport Terminal American Restaurant Antique Shop Aquarium Art Gallery Art Museum Arts & Crafts Store Asian Restaurant Athletics & Sports Auto Garage Auto Workshop BBQ Joint Baby Store Bagel Shop Bakery Bank Bar Baseball Field Baseball Stadium Basketball Court Basketball Stadium Beach Bed & Breakfast Beer Bar Beer Store Belgian Restaurant Bike Shop Bistro Boat or Ferry Bookstore Boutique Brazilian Restaurant Breakfast Spot Brewery Bridal Shop Bubble Tea Shop Building Burger Joint Burrito Place Bus Line Bus Station Bus Stop Business Service Butcher Café Cajun / Creole Restaurant Camera Store Candy Store Caribbean Restaurant Cheese Shop Chinese Restaurant Chocolate Shop Church Climbing Gym Clothing Store Cocktail Bar Coffee Shop College Arts Building College Auditorium College Gym College Rec Center College Stadium Colombian Restaurant Comfort Food Restaurant Comic Shop Concert Hall Construction & Landscaping Convenience Store Convention Center Cosmetics Shop Coworking Space Creperie Cuban Restaurant Cupcake Shop Curling Ice Dance Studio Deli / Bodega Department Store Dessert Shop Dim Sum Restaurant Diner Discount Store Distribution Center Dog Run Doner Restaurant Donut Shop Drugstore Eastern European Restaurant Electronics Store Ethiopian Restaurant Event Space Falafel Restaurant Farm Farmers Market Fast Food Restaurant Field Filipino Restaurant Financial or Legal Service Fish & Chips Shop Fish Market Flea Market Flower Shop Food & Drink Shop Food Court Food Service Food Truck Fountain French Restaurant Fried Chicken Joint Frozen Yogurt Shop Fruit & Vegetable Store Furniture / Home Store Gaming Cafe Garden Garden Center Gas Station Gastropub Gay Bar General Entertainment General Travel German Restaurant Gift Shop Gluten-free Restaurant Golf Course Gourmet Shop Greek Restaurant Grocery Store Gym Gym / Fitness Center Hakka Restaurant 
Harbor / Marina Hardware Store Health & Beauty Service Health Food Store Historic Site History Museum Hobby Shop Hockey Arena Home Service Hookah Bar Hospital Hot Dog Joint Hotel Hotel Bar IT Services Ice Cream Shop Indian Restaurant Indie Movie Theater Indonesian Restaurant Intersection Irish Pub Italian Restaurant Japanese Restaurant Jazz Club Jewelry Store Juice Bar Korean Restaurant Lake Latin American Restaurant Light Rail Station Lingerie Store Liquor Store Lounge Luggage Store Mac & Cheese Joint Malay Restaurant Market Martial Arts Dojo Massage Studio Medical Center Mediterranean Restaurant Men's Store Metro Station Mexican Restaurant Middle Eastern Restaurant Miscellaneous Shop Mobile Phone Shop Modern European Restaurant Molecular Gastronomy Restaurant Monument / Landmark Moroccan Restaurant Motel Movie Theater Moving Target Museum Music Venue New American Restaurant Nightclub Noodle House Office Opera House Optical Shop Organic Grocery Other Great Outdoors Park Performing Arts Venue Pet Store Pharmacy Pizza Place Playground Plaza Poke Place Pool Portuguese Restaurant Poutine Place Pub Ramen Restaurant Record Shop Recording Studio Rental Car Location Rental Service Restaurant River Roof Deck Sake Bar Salad Place Salon / Barbershop Sandwich Place Scenic Lookout Sculpture Garden Seafood Restaurant Shoe Store Shopping Mall Skate Park Skating Rink Smoke Shop Smoothie Shop Snack Place Soccer Field Social Club Soup Place Spa Speakeasy Sporting Goods Shop Sports Bar Stadium Stationery Store Steakhouse Strip Club Supermarket Supplement Shop Sushi Restaurant Swim School Taco Place Tailor Shop Taiwanese Restaurant Tanning Salon Tea Room Tennis Court Thai Restaurant Theater Theme Restaurant Toy / Game Store Trail Train Station Vegetarian / Vegan Restaurant Video Game Store Video Store Vietnamese Restaurant Warehouse Store Wine Bar Wine Shop Wings Joint Women's Store
0 Agincourt 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.200000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.2 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.00 0.0 0.0 0.000000 0.0 0.0 0.2 0.0 0.0 0.000000 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Alderwood, Long Branch 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.142857 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.142857 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.142857 0.285714 0.0 0.0 0.0 0.0 0.0 0.0 0.142857 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.142857 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Bathurst Manor, Wilson Heights, Downsview North 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.095238 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.095238 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.047619 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.000000 0.0 0.0 0.0 0.0 0.000000 0.00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.047619 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.047619 0.0 0.047619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Bayview Village 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.250000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.250000 0.0 0.0 0.0 0.0 0.0 0.25 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.000000 0.25 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Bedford Park, Lawrence Manor East 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.043478 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.086957 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.043478 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.043478 0.0 0.0 0.0 0.0 0.086957 0.00 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.043478 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.086957 0.0 0.0 0.0 0.0 0.0 0.086957 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.043478 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
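
The one-hot encoding and the grouped mean can be hard to read in a table this wide, so here is a plain-Python sketch of the same two steps on five venues. The rows are the first five shown in toronto_venues above; the real neighborhoods contain more venues, so the frequencies here are illustrative only.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs taken from the first rows of toronto_venues.
venues = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Parkwoods", "Construction & Landscaping"),
    ("Victoria Village", "Hockey Arena"),
    ("Victoria Village", "Portuguese Restaurant"),
]

categories = sorted({category for _, category in venues})

# One-hot step: one row per venue, with a 1 in the column of its category.
onehot = [
    (hood, [1 if category == c else 0 for c in categories])
    for hood, category in venues
]

# Group step: sum the one-hot rows per neighborhood, then divide by the row count.
totals = defaultdict(lambda: [0] * len(categories))
counts = Counter()
for hood, vector in onehot:
    counts[hood] += 1
    totals[hood] = [t + v for t, v in zip(totals[hood], vector)]

grouped = {hood: [t / counts[hood] for t in totals[hood]] for hood in totals}

for hood, freqs in grouped.items():
    print(hood, dict(zip(categories, freqs)))
```

Each neighborhood's frequencies sum to 1, so a value such as 0.5 for Hockey Arena means that half of the venues found in that neighborhood fall into that category.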

To list each neighborhood's most common venues in descending order of frequency, we need to write a function. From there, we can create a new dataframe and display the top 10 venues for each neighborhood.

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # only 1st, 2nd and 3rd have their own suffix; everything else gets 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
Out[18]:
Neighborhood 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 Agincourt Lounge Latin American Restaurant Skating Rink Clothing Store Breakfast Spot Doner Restaurant Dim Sum Restaurant Diner Discount Store Distribution Center
1 Alderwood, Long Branch Pizza Place Gym Sandwich Place Pharmacy Pub Coffee Shop Airport Gate Deli / Bodega Event Space Ethiopian Restaurant
2 Bathurst Manor, Wilson Heights, Downsview North Bank Coffee Shop Pharmacy Supermarket Deli / Bodega Sushi Restaurant Restaurant Middle Eastern Restaurant Mobile Phone Shop Diner
3 Bayview Village Japanese Restaurant Café Bank Chinese Restaurant Dessert Shop Diner Discount Store Distribution Center Dog Run Women's Store
4 Bedford Park, Lawrence Manor East Sandwich Place Coffee Shop Restaurant Italian Restaurant Liquor Store Indian Restaurant Café Pub Sushi Restaurant Breakfast Spot
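
The ranking and the column naming can be sketched without pandas as follows. The frequencies are the Alderwood, Long Branch values from toronto_grouped above; ordinal and top_categories are hypothetical helper names, and ordinal handles cases such as 11th and 21st that the notebook never reaches with ten columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th to 20th (and 110th to 120th, ...) always take 'th'
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(frequencies, k):
    """Return the k category names with the highest frequency."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Non-zero frequencies for "Alderwood, Long Branch" from the grouped table.
alderwood = {
    "Coffee Shop": 0.142857,
    "Gym": 0.142857,
    "Pharmacy": 0.142857,
    "Pizza Place": 0.285714,
    "Pub": 0.142857,
    "Sandwich Place": 0.142857,
}

columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns[1:4])
print(top_categories(alderwood, 3))
```

Ties are kept in insertion order in this sketch, so the order of equally frequent categories may differ from the order the notebook shows.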

Initialize neighborhood clustering

We use k-means clustering with 5 clusters.

In [19]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check unique values of cluster labels, to see whether we got it right
print(set(kmeans.labels_))

# check cluster labels generated for the first ten rows in the dataframe
print(kmeans.labels_[0:10] )
{0, 1, 2, 3, 4}
[1 1 1 1 1 1 1 1 1 1]
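
For readers who want to see what the fit step does, here is a minimal sketch of the basic k-means iteration (Lloyd's algorithm) on four toy 2-D points. It is not scikit-learn's implementation: it starts from the first k points, which is a simplification, whereas KMeans chooses its starting centroids differently. The function name and the points are assumptions for illustration.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: give each point the label of its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


# Two obvious groups: points near (0, 0) and points near (1, 1).
points = [(0.0, 0.1), (0.9, 1.0), (0.1, 0.0), (1.0, 0.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In the notebook, each point is a neighborhood's vector of category frequencies rather than a 2-D coordinate, but the assignment and update steps are the same idea.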

We can create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [20]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = merged_df

# join the top-venue table (with its cluster labels) onto merged_df, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# Assign any neighborhoods without a cluster label (no match in the join) to cluster 0
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0) 

toronto_merged.head() # check the "Cluster Labels" column to see changes
Out[20]:
Postal Code Borough Neighborhood Latitude Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue
0 M3A North York Parkwoods 43.753259 -79.329656 0.0 Park Food & Drink Shop Construction & Landscaping Electronics Store Eastern European Restaurant Drugstore Donut Shop Doner Restaurant Deli / Bodega Dog Run
1 M4A North York Victoria Village 43.725882 -79.315572 1.0 Portuguese Restaurant Hockey Arena Coffee Shop Intersection Financial or Legal Service Dim Sum Restaurant Diner Discount Store Distribution Center Dog Run
2 M5A Downtown Toronto Regent Park, Harbourfront 43.654260 -79.360636 1.0 Coffee Shop Pub Bakery Park Theater Breakfast Spot Restaurant Café Bank Hotel
3 M6A North York Lawrence Manor, Lawrence Heights 43.718518 -79.464763 1.0 Clothing Store Miscellaneous Shop Accessories Store Boutique Vietnamese Restaurant Coffee Shop Shoe Store Event Space Furniture / Home Store Distribution Center
4 M7A Downtown Toronto Queen's Park, Ontario Provincial Government 43.662301 -79.389494 1.0 Coffee Shop Sushi Restaurant Diner Gym Discount Store Sandwich Place Park Mexican Restaurant Italian Restaurant Hobby Shop
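
The join followed by fillna(0) can hide which neighborhoods had no match. The plain-Python sketch below shows the same idea with a dictionary lookup; the labels for Parkwoods and Victoria Village come from the output above, while "Hypothetical Heights" is a made-up neighborhood with no venues, added only for illustration.

```python
neighborhoods = ["Parkwoods", "Victoria Village", "Hypothetical Heights"]

# Cluster labels exist only for neighborhoods that returned at least one venue.
cluster_by_neighborhood = {"Parkwoods": 0, "Victoria Village": 1}

# Left join: every neighborhood is kept, missing labels become None.
joined = [(name, cluster_by_neighborhood.get(name)) for name in neighborhoods]

# Record the unmatched names before filling, so they can still be told apart.
unmatched = [name for name, label in joined if label is None]

# Fill step: missing labels are replaced with cluster 0.
filled = [(name, 0 if label is None else label) for name, label in joined]

print(unmatched)
print(filled)
```

After the fill, an unmatched neighborhood is indistinguishable from a genuine member of cluster 0, so it may be worth inspecting the unmatched names first.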

Finally, let's visualize the resulting clusters.

In [21]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
Out[21]:
[Interactive folium map: Toronto neighborhoods colored by cluster label]
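
As a closing sketch of how cluster labels map to marker colors, the snippet below builds a small palette with the standard library's colorsys module instead of matplotlib and indexes it with float labels like the ones in the Cluster Labels column. make_palette is a hypothetical helper and the palette is not the one the notebook draws.

```python
import colorsys


def make_palette(n):
    """Return n hex colors with evenly spaced hues (illustrative helper)."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = make_palette(5)

# Cluster labels are floats after the join, so they are cast to int before indexing.
labels = [0.0, 1.0, 4.0]
marker_colors = [palette[int(label)] for label in labels]
print(marker_colors)

# Python's negative indexing means an index of label - 1 does not fail for label 0:
# it silently wraps around to the last color.
print(palette[int(0.0) - 1] == palette[-1])
```

Either indexing scheme gives every cluster its own color; indexing by the label itself keeps the correspondence between label and palette position easier to read.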