In this project, Foursquare geospatial and venue data will be used to determine which area(s) are best suited for expanding a coffee shop chain, Kopi Kompleks, around Surabaya, East Java, Indonesia. I would like to express my most profound gratitude to:
the provider of the administrative boundary data in .shp format (check out her flickr!), and
the person who converted the .shp file into .geojson with the EPSG:4326 coordinate system (check out his art page!).
Without their help, this project wouldn't be completed as it is.
The franchise in question, Kopi Kompleks, has already opened 4 branches across Surabaya and is looking to open its fifth one. Since there are already a lot of coffee shop chains in the city, expanding into the wrong neighborhood could be unprofitable for the franchise.
This report will help the franchise determine which areas are the most promising and which are the least. It will also be presented to the stakeholders and higher-ups of the franchise, in the hope of aiding their decision-making process.
I will try to:
The franchise has already opened four branches, in the following areas:
Based on the definition of our problem, the factors that will influence our decision are:
The following data sources will be needed to extract/generate the required information:
I exported the .shp file through mapshaper into two formats: a .csv of neighborhood attributes and centroid coordinates (loaded below as Administrasi_Kelurahan_Surabaya.csv) and a .geojson used later for choropleth mapping.
The Foursquare data will first be segmented into several clusters, before eventually eliminating the clusters that are nearest to and/or belong to the existing four branches of the franchise. This way, we can get an unbiased representation of each neighborhood's preferences and ultimately determine which areas are the most promising for opening the next branch.
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner
import re # regex library
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
print("Libraries imported!")
This section covers obtaining and cleaning the latitude and longitude values of each neighborhood in Surabaya.
df = pd.read_csv("Administrasi_Kelurahan_Surabaya.csv")
df.head()
We can see that the original .csv file has some unnecessary columns: columns that only contain the value SURABAYA, which is obvious since we're working with a Surabaya dataset, and columns that only contain NaN values. We can drop these columns to achieve a clearer, more concise dataframe object.
df.drop(['OBJECTID', 'Id', 'KABKOT', 'PROPINSI', 'Z', 'FillTransp', 'OutlineTra', 'SHAPE_Leng', 'SHAPE_Area'], axis=1, inplace=True)
df = df.rename(columns={"DESA": "KELURAHAN", "X": "Longitude", "Y": "Latitude"}) # Rename the rest of the columns for clarity
df.head()
A quick Google search revealed that Surabaya lies at latitude -7.250445 and longitude 112.768845. We assign both values in a single statement before rendering the map with folium.
latitude, longitude = -7.265, 112.7107 # Altered slightly to better center the map display
# create map of Surabaya using latitude and longitude values
map_sby = folium.Map(location=[latitude, longitude], zoom_start=12)
# add markers to map
for lat, lng, kec, kel in zip(df['Latitude'], df['Longitude'], df['KECAMATAN'], df['KELURAHAN']):
    label = '{}, {}'.format(kel, kec)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sby)
map_sby
We now have a set of datapoints representing each neighborhood (kelurahan) in Surabaya.
CLIENT_ID = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare ID (redacted)
CLIENT_SECRET = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version
print('Using the following credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
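As a side note, a safer alternative to hard-coding credentials in the notebook is to read them from environment variables. A minimal sketch follows; the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are my own placeholders, not part of the original.
import os # standard-library module for reading environment variables
# Hypothetical environment variable names; fall back to the values defined above if they are not set
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', CLIENT_ID)
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', CLIENT_SECRET)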
We're going to use these functions to explore the neighborhoods in Surabaya. In the function getNearbyVenues, we define each neighborhood as a circular area with a radius of 500 meters, so two adjacent, non-overlapping neighborhood centers are about 1 km apart.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
# function that returns venues in an area, given latitude and longitude
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name) # uncomment to debug. I commented it to preserve memory

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            100) # Get the results of the top 100 venues.

        # make the GET request
        # print(requests.get(url).json()["response"]) # uncomment to debug in case we've overused our GET request limit.
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
#sby_venues = getNearbyVenues(names=df['KELURAHAN'], latitudes=df['Latitude'], longitudes=df['Longitude'])
#Uncomment to make function call the Foursquare API.
The resulting dataframe consists of venue properties along with each venue's corresponding neighborhood.
#sby_venues.to_csv("processed.csv") # Write csv file as a checkpoint in case we overuse the API call quota
sby_venues = pd.read_csv("processed.csv")
sby_venues.drop('Unnamed: 0', inplace=True,axis=1)
sby_venues.head()
We can find out how many unique categories can be curated from all the returned venues.
print('There are {} unique categories.'.format(len(sby_venues['Venue Category'].unique())))
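As a supplementary check (not in the original notebook, but using the same dataframe), we can also peek at the most frequent categories:
# Show the ten most common venue categories across all neighborhoods
print(sby_venues['Venue Category'].value_counts().head(10))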
Next, we one-hot encode the venue categories so that we can analyze each neighborhood without having to scroll through all entries in the sby_venues dataframe.
# one hot encoding
sby_onehot = pd.get_dummies(sby_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
sby_onehot['Neighborhood'] = sby_venues['Neighborhood']
# move the neighborhood column to the first column
neighborhood_column = sby_onehot['Neighborhood']
sby_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True)
sby_onehot.insert(0, 'Neighborhood', neighborhood_column)
sby_onehot.head()
We can then group the rows by neighborhood name, taking the sum of occurrences of each category.
sby_grouped = sby_onehot.groupby('Neighborhood').sum().reset_index()
sby_grouped.head()
Finally, we can trim the above dataframe into a more concise one for clarity. In particular, we're interested in the categories Café and Coffee Shop, so let's do just that:
# Make another dataframe
sby_cafes = sby_grouped.copy(deep=True)[['Neighborhood', 'Café', 'Coffee Shop']]
# Add another column containing the total cafe and coffee shop venues
sby_cafes['Total'] = sby_cafes[['Café', 'Coffee Shop']].sum(axis=1)
# Sort rows by the new column, 'Total'
sby_cafes.sort_values(by ='Total' , ascending=False, inplace=True)
sby_cafes.reset_index(drop=True, inplace=True)
print("The dimension of this dataframe is: ")
print(sby_cafes.shape)
sby_cafes.head()
We may want to visualize which neighborhoods have the most coffee shops and/or cafes. But first, we need to trim the neighborhoods that have zero totals so as not to clog up the chart(s).
sby_cafes_not_null = sby_cafes[sby_cafes.Total != 0]
print("The dimension of this dataframe is: ")
print(sby_cafes_not_null.shape)
print("We've trimmed",
sby_cafes.shape[0] - sby_cafes_not_null.shape[0],
"rows!"
)
sby_cafes_not_null.head()
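Below is a minimal sketch of such a chart using pandas' built-in plotting; pyplot is also imported in a later cell, but it is imported here so this snippet runs on its own.
import matplotlib.pyplot as plt # imported again later; repeated here so this snippet is self-contained
# Horizontal bar chart of the ten neighborhoods with the most cafes and coffee shops
sby_cafes_not_null.head(10).plot(kind='barh', x='Neighborhood', y='Total', figsize=(9, 6), legend=False)
plt.xlabel('Cafes + Coffee Shops')
plt.ylabel('Neighborhood')
plt.show()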
We know that our existing four branches are located in the neighborhoods KETABANG, NGAGEL, BABATAN, and KETINTANG. We remove these four neighborhoods from our new dataframe, since they're no longer relevant for further decision-making.
to_remove = ['KETABANG', 'NGAGEL', 'BABATAN', 'KETINTANG']
sby_cafes_to_remove = sby_cafes_not_null[sby_cafes_not_null.Neighborhood.isin(to_remove)]
sby_cafes_to_remove
We can see that BABATAN and KETINTANG don't have any coffee spots in them, hence they didn't appear in the above dataframe. We can also see that KETABANG and NGAGEL lie at indices 3 and 5, respectively. So we simply remove those two rows.
#sby_cafes_not_null.drop([3,5], inplace=True)
sby_cafes_not_null.head(10)
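A safer alternative (a sketch, equivalent in effect to the commented-out drop above) is to filter by neighborhood name instead of by positional index, so the cell can be re-run without worrying about row order:
# Drop the two branch neighborhoods by name rather than by row index (idempotent, safe to re-run)
sby_cafes_not_null = sby_cafes_not_null[~sby_cafes_not_null.Neighborhood.isin(['KETABANG', 'NGAGEL'])]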
This section covers the cleaning and choropleth mapping of Surabaya's population density, per neighborhood.
pop_dens = pd.read_csv("Jumlah_Penduduk_Kelurahan_Surabaya.csv")
pop_dens.head()
We can remove the last two columns, since they're irrelevant for choropleth mapping. We also rename the Penduduk column to COUNT for easier processing and clarity.
pop_dens.drop(['KK', 'Rata-rata Anggota Keluarga'], axis=1, inplace=True)
pop_dens.rename(columns={"Penduduk": "COUNT", "Neighborhood": "KELURAHAN"}, inplace=True)
pop_dens.head()
We then merge this population data into the original df dataframe from section 1.2.1:
merged_df = df.merge(pop_dens, how = 'left', on = ['KELURAHAN'])
# Count missing values per column after the merge
for column in merged_df.columns:
    print(column)
    print(merged_df.isnull()[column].value_counts())
    print("")
We substitute these NaN values with the first quartile of the population count across all neighborhoods. We use the first quartile rather than the mean because it is less sensitive to outliers and gives a conservative estimate for the neighborhoods with missing data.
quantile = merged_df.COUNT.quantile(0.25)
merged_df['COUNT'] = merged_df['COUNT'].fillna(quantile) # Substitute missing values with first quartile
print(merged_df.shape)
merged_df.head()
sby_cafes_to_remove = merged_df[merged_df.KELURAHAN.isin(to_remove)]
sby_cafes_to_remove
We need to remove the rows with indices 11, 56, 63, and 106, because those neighborhoods already host an existing branch.
merged_df.drop([11, 56, 63, 106], inplace=True)
from folium.plugins import HeatMap

m = folium.Map([latitude, longitude], zoom_start=12)

# Each heat point is [lat, lng, weight], where the weight is the population count
data = merged_df[['Latitude', 'Longitude', 'COUNT']].values

HeatMap(data,
        max_val=max(merged_df['COUNT'].values)).add_to(m)
m
sby_geojson = r'Administrasi_Kelurahan_Surabaya.geojson'
density_map = folium.Map(location=[latitude, longitude], zoom_start=12)
density_map.choropleth(
    geo_data=sby_geojson,
    data=merged_df,
    columns=['KELURAHAN', 'COUNT'],
    key_on='feature.properties.DESA',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Population Density in Surabaya'
)
density_map
To summarize, we now have three cleaned dataframes:
1. df, the neighborhood (kelurahan) coordinates;
2. sby_cafes_not_null, the cafe and coffee shop counts per neighborhood;
3. merged_df, the neighborhood coordinates joined with population counts.
The neighborhoods belonging to the existing branches have already been removed from numbers 2 and 3.
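As with processed.csv earlier, these dataframes could also be checkpointed to disk before moving on; the filenames below are only suggestions, not part of the original notebook.
# Optional checkpoints, mirroring the earlier processed.csv approach (filenames are illustrative)
# df.to_csv("neighborhoods_checkpoint.csv")
# sby_cafes_not_null.to_csv("cafes_checkpoint.csv")
# merged_df.to_csv("density_checkpoint.csv")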
This concludes the data gathering phase; we're now ready to use this data for analysis and to produce the report on optimal locations for a new coffee shop branch.
From here, we will cluster the above dataframes using K-means clustering. Then, we can inspect the characteristics of the resulting clusters and determine whether we clustered them the right way.
united_df = sby_cafes_not_null[['Neighborhood', 'Total']].copy(deep=True)
united_df.rename(columns={"Neighborhood": "KELURAHAN"}, inplace=True)
united_df = united_df.merge(merged_df[['KELURAHAN', 'COUNT']], how = 'inner', on = ['KELURAHAN'])
united_df.rename(columns={"COUNT": "Density"}, inplace=True)
united_df.head(10)
The magnitudes of the two numerical columns differ heavily from each other, so we apply normalization. Normalization is a statistical technique that lets distance-based algorithms treat features with different magnitudes and distributions equally. We normalize both columns with the MinMaxScaler() function before further processing.
from sklearn.preprocessing import MinMaxScaler, StandardScaler # Normalization library
from matplotlib import pyplot as plt # Data visualization standard library
from sklearn.cluster import KMeans # Clustering library
%matplotlib inline
col_names = ['Total', 'Density']
features = united_df[col_names] # Get a subset of the united_df dataframe
scaler = MinMaxScaler().fit(features.values) # Instantiate the scaler and fit it to the data (as a numpy array)
features = scaler.transform(features.values) # Scale each feature to the [0, 1] range
print("A sample of the first ten entries of the normalized features:")
print(features[:10])
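If we later need the original units back (for example, when labeling plot axes), the fitted scaler can undo the scaling; a small usage note, assuming the scaler and features objects defined above:
# Recover original-scale values from the normalized features (handy for readable axis labels)
original_scale = scaler.inverse_transform(features)
print(original_scale[:3])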
Before we run the clustering on our feature set, we need to determine the optimal value of k using the elbow method.
distortions = []
k_values = range(1, 10)

for k in k_values:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(features)
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16, 8))
plt.plot(k_values, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
The elbow method shows that the optimal value is k = 4. Thus, we can use this value for further processing.
plt.clf()
clusterNum = 4
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(features)
united_df['Cluster'] = k_means.labels_
united_df.head(10)
We can inspect each Cluster based on the mean of total coffee shops and population density:
united_df.groupby('Cluster').mean()
We can also count the members of each Cluster:
members = united_df['Cluster'].value_counts().rename_axis('Cluster No.').to_frame('Members of Cluster')
members.plot(kind='barh', figsize=(9,9))
Finally, we can list the neighborhoods that belong to each Cluster:
for i in set(united_df['Cluster']):
    neighborhoods_list = united_df[united_df['Cluster'] == i]
    print("For cluster", i)
    print(neighborhoods_list)
    print("---------------------------------------------------")
x_quartiles = [round(np.percentile(united_df['Density'].values, x)) for x in [(20 * x) for x in range(6)]] # Six evenly spaced percentiles (0th, 20th, ..., 100th) used as x-axis tick labels
# Plot the figure
plt.figure(figsize=(12,10))
plt.scatter(features[:, 1], features[:, 0], s = 200, c=k_means.labels_.astype(float), alpha=0.5)
plt.xlabel('Population Count', fontsize=16)
plt.xticks(ticks = np.arange(0.0, 1.1, 0.2), labels = x_quartiles)
plt.yticks(ticks = np.arange(0, 1.1, 0.2), labels = np.arange(0, max(united_df.Total.values) + 1, 2))
plt.ylabel('Total Coffee Shops', fontsize=16)
plt.title("Clusters of Coffee Shops vs Total Population in Surabaya", fontsize=16)
plt.show()
Looking at our scatter plot and at the population density and total coffee shops within each cluster, we can draw conclusions and label each of our clusters:
We can then replace the numerical cluster values with our verdict labels in the united_df dataframe.
united_df.Cluster.replace({3:'High', 0:'Mid-high', 1:'Moderate', 2:'Least'}, inplace=True)
united_df.head()
Displaying the top 5 areas for the High cluster:
united_df[united_df['Cluster'] == 'High'].head()
Displaying the top 5 areas for the Mid-high cluster:
united_df[united_df['Cluster'] == 'Mid-high'].head()
Displaying the top 5 areas for the Moderate cluster:
united_df[united_df['Cluster'] == 'Moderate'].head()
Displaying the top 5 areas for the Least cluster:
united_df[united_df['Cluster'] == 'Least'].head()
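As a possible follow-up, not part of the original notebook, the verdicts could be drawn back onto a folium map by joining united_df with the coordinates in merged_df; a rough sketch, with colors chosen arbitrarily:
# Hypothetical follow-up: color each remaining neighborhood by its verdict
verdict_colors = {'High': 'red', 'Mid-high': 'orange', 'Moderate': 'blue', 'Least': 'green'}
verdict_map = folium.Map(location=[latitude, longitude], zoom_start=12)
plotting_df = united_df.merge(merged_df[['KELURAHAN', 'Latitude', 'Longitude']], on='KELURAHAN')
for _, row in plotting_df.iterrows():
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=folium.Popup('{} ({})'.format(row['KELURAHAN'], row['Cluster']), parse_html=True),
        color=verdict_colors[row['Cluster']],
        fill=True,
        fill_opacity=0.7).add_to(verdict_map)
verdict_map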
Written, tested, and published by Charis Chrisna (portfolio).