In this project, Foursquare geospatial and venue data will be used to determine which area(s) are best suited for expanding a coffee shop chain, Kopi Kompleks, around Surabaya, East Java, Indonesia. I would like to express my most profound gratitude to:
the provider of the administrative boundary data in .shp format (check out her flickr!), and
the person who converted the .shp file into .geojson with the EPSG:4326 coordinate system (check out his art page!).
Without their help, this project wouldn't be completed as it is.
The franchise in question, Kopi Kompleks, has already opened 4 branches across Surabaya and is looking to open its fifth one. Since there are already a lot of coffee shop chains in the city, expanding into the wrong neighborhood could be unprofitable for the franchise.
This report will help the franchise determine which areas are the most promising and which are the least. It will also be presented to the stakeholders and higher-ups of the franchise, in the hope of aiding their decision-making process.
I will try to:
The franchise has already opened four branches, in the following areas:
Based on the definition of our problem, the factors that will influence our decision are:
The following data sources will be needed to extract/generate the required information:
I exported the .shp file through mapshaper into two formats: a .csv of neighborhood attributes and centroid coordinates (loaded below as Administrasi_Kelurahan_Surabaya.csv) and a .geojson used later for choropleth mapping.
The Foursquare data will first be segmented into several clusters, before eventually eliminating the clusters that are nearest to and/or belong to the existing four branches of the franchise. This way, we can get an unbiased representation of each neighborhood's preferences and ultimately determine which areas are the most promising for opening the next branch.
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner
import re # regex library
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
print("Libraries imported!")
This section covers obtaining and cleaning the latitude and longitude values of each neighborhood in Surabaya.
df = pd.read_csv("Administrasi_Kelurahan_Surabaya.csv")
df.head()
We can see that the original .csv file has some unnecessary columns: columns that only contain the value SURABAYA, which is obvious since we're working with a Surabaya dataset, and columns that only contain NaN values. We can drop these columns to achieve a clearer, more concise dataframe object.
df.drop(['OBJECTID', 'Id', 'KABKOT', 'PROPINSI', 'Z', 'FillTransp', 'OutlineTra', 'SHAPE_Leng', 'SHAPE_Area'], axis=1, inplace=True)
df = df.rename(columns={"DESA": "KELURAHAN", "X": "Longitude", "Y": "Latitude"}) # Rename the rest of the columns for clarity
df.head()
A quick Google search revealed that Surabaya lies at latitude -7.250445 and longitude 112.768845. We assign both values in a single statement before rendering the map with folium.
latitude, longitude = -7.265, 112.7107 # Altered slightly to better center the map display
# create map of Surabaya using latitude and longitude values
map_sby = folium.Map(location=[latitude, longitude], zoom_start=12)
# add markers to map
for lat, lng, kec, kel in zip(df['Latitude'], df['Longitude'], df['KECAMATAN'], df['KELURAHAN']):
    label = '{}, {}'.format(kel, kec)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sby)
map_sby
We now have a set of datapoints representing each neighborhood (kelurahan) in Surabaya.
CLIENT_ID = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare ID (redacted)
CLIENT_SECRET = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version
print('Using the following credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
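As a side note, a safer alternative to hard-coding credentials in the notebook is to read them from environment variables. A minimal sketch follows; the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are my own placeholders, not part of the original.
import os # standard-library module for reading environment variables
# Hypothetical environment variable names; fall back to the values defined above if they are not set
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', CLIENT_ID)
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', CLIENT_SECRET)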
We're going to use these functions to explore the neighborhoods in Surabaya. In the function getNearbyVenues, we define each neighborhood as a circular area with a radius of 500 meters, so two adjacent, non-overlapping neighborhood centers are about 1 km apart.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
# function that returns venues in an area, given latitude and longitude
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name) # uncomment to debug. I commented it to preserve memory

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            100) # Get the results of the top 100 venues.

        # make the GET request
        # print(requests.get(url).json()["response"]) # uncomment to debug in case we've overused our GET request limit.
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
#sby_venues = getNearbyVenues(names=df['KELURAHAN'], latitudes=df['Latitude'], longitudes=df['Longitude'])
#Uncomment to make function call the Foursquare API.
The resulting dataframe consists of venue properties along with each venue's corresponding neighborhood.
#sby_venues.to_csv("processed.csv") # Write csv file as a checkpoint in case we overuse the API call quota
sby_venues = pd.read_csv("processed.csv")
sby_venues.drop('Unnamed: 0', inplace=True,axis=1)
sby_venues.head()
We can find out how many unique categories can be curated from all the returned venues.
print('There are {} unique categories.'.format(len(sby_venues['Venue Category'].unique())))
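As a supplementary check (not in the original notebook, but using the same dataframe), we can also peek at the most frequent categories:
# Show the ten most common venue categories across all neighborhoods
print(sby_venues['Venue Category'].value_counts().head(10))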
Next, we one-hot encode the venue categories so that we can analyze each neighborhood without having to scroll through all entries in the sby_venues dataframe.
# one hot encoding
sby_onehot = pd.get_dummies(sby_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
sby_onehot['Neighborhood'] = sby_venues['Neighborhood']
# move the neighborhood column to the first column
neighborhood_column = sby_onehot['Neighborhood']
sby_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True)
sby_onehot.insert(0, 'Neighborhood', neighborhood_column)
sby_onehot.head()
We can then group the rows by neighborhood name, taking the sum of occurrences of each category.
sby_grouped = sby_onehot.groupby('Neighborhood').sum().reset_index()
sby_grouped.head()
Finally, we can trim the above dataframe into a more concise one for clarity. In particular, we're interested in the categories Café and Coffee Shop, so let's do just that:
# Make another dataframe
sby_cafes = sby_grouped.copy(deep=True)[['Neighborhood', 'Café', 'Coffee Shop']]
# Add another column containing the total cafe and coffee shop venues
sby_cafes['Total'] = sby_cafes[['Café', 'Coffee Shop']].sum(axis=1)
# Sort rows by the new column, 'Total'
sby_cafes.sort_values(by ='Total' , ascending=False, inplace=True)
sby_cafes.reset_index(drop=True, inplace=True)
print("The dimension of this dataframe is: ")
print(sby_cafes.shape)
sby_cafes.head()
We may want to visualize which neighborhoods have the most coffee shops and/or cafes. But first, we need to trim the neighborhoods that have zero totals so as not to clog up the chart(s).
sby_cafes_not_null = sby_cafes[sby_cafes.Total != 0]
print("The dimension of this dataframe is: ")
print(sby_cafes_not_null.shape)
print("We've trimmed",
sby_cafes.shape[0] - sby_cafes_not_null.shape[0],
"rows!"
)
sby_cafes_not_null.head()
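Below is a minimal sketch of such a chart using pandas' built-in plotting; pyplot is also imported in a later cell, but it is imported here so this snippet runs on its own.
import matplotlib.pyplot as plt # imported again later; repeated here so this snippet is self-contained
# Horizontal bar chart of the ten neighborhoods with the most cafes and coffee shops
sby_cafes_not_null.head(10).plot(kind='barh', x='Neighborhood', y='Total', figsize=(9, 6), legend=False)
plt.xlabel('Cafes + Coffee Shops')
plt.ylabel('Neighborhood')
plt.show()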
We know that our existing four branches are located in the neighborhoods KETABANG, NGAGEL, BABATAN, and KETINTANG. We remove these four neighborhoods from our new dataframe, since they're no longer relevant for further decision-making.
to_remove = ['KETABANG', 'NGAGEL', 'BABATAN', 'KETINTANG']
sby_cafes_to_remove = sby_cafes_not_null[sby_cafes_not_null.Neighborhood.isin(to_remove)]
sby_cafes_to_remove
We can see that BABATAN and KETINTANG don't have any coffee spots in them, hence they didn't appear in the above dataframe. We can also see that KETABANG and NGAGEL lie at indices 3 and 5, respectively. So we simply remove those two rows.
#sby_cafes_not_null.drop([3,5], inplace=True)
sby_cafes_not_null.head(10)
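A safer alternative (a sketch, equivalent in effect to the commented-out drop above) is to filter by neighborhood name instead of by positional index, so the cell can be re-run without worrying about row order:
# Drop the two branch neighborhoods by name rather than by row index (idempotent, safe to re-run)
sby_cafes_not_null = sby_cafes_not_null[~sby_cafes_not_null.Neighborhood.isin(['KETABANG', 'NGAGEL'])]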
This section covers the cleaning and choropleth mapping of Surabaya's population density, per neighborhood.
pop_dens = pd.read_csv("Jumlah_Penduduk_Kelurahan_Surabaya.csv")
pop_dens.head()
We can remove the last two columns, since they're irrelevant for choropleth mapping. We also rename the Penduduk column to COUNT for easier processing and clarity.
pop_dens.drop(['KK', 'Rata-rata Anggota Keluarga'], axis=1, inplace=True)
pop_dens.rename(columns={"Penduduk": "COUNT", "Neighborhood": "KELURAHAN"}, inplace=True)
pop_dens.head()
We then merge this population data into the original df dataframe from section 1.2.1:
merged_df = df.merge(pop_dens, how = 'left', on = ['KELURAHAN'])
# Count missing values per column after the merge
for column in merged_df.columns:
    print(column)
    print(merged_df.isnull()[column].value_counts())
    print("")
We substitute these NaN values with the first quartile of the population count across all neighborhoods. We use the first quartile rather than the mean because it is less sensitive to outliers and gives a conservative estimate for the neighborhoods with missing data.
quantile = merged_df.COUNT.quantile(0.25)
merged_df['COUNT'] = merged_df['COUNT'].fillna(quantile) # Substitute missing values with first quartile
print(merged_df.shape)
merged_df.head()
sby_cafes_to_remove = merged_df[merged_df.KELURAHAN.isin(to_remove)]
sby_cafes_to_remove
We need to remove the rows with indices 11, 56, 63, and 106, because those neighborhoods already host an existing branch.
merged_df.drop([11, 56, 63, 106], inplace=True)
from folium.plugins import HeatMap

m = folium.Map([latitude, longitude], zoom_start=12)

# Each heat point is [lat, lng, weight], where the weight is the population count
data = merged_df[['Latitude', 'Longitude', 'COUNT']].values

HeatMap(data,
        max_val=max(merged_df['COUNT'].values)).add_to(m)
m
sby_geojson = r'Administrasi_Kelurahan_Surabaya.geojson'
density_map = folium.Map(location=[latitude, longitude], zoom_start=12)
density_map.choropleth(
    geo_data=sby_geojson,
    data=merged_df,
    columns=['KELURAHAN', 'COUNT'],
    key_on='feature.properties.DESA',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Population Density in Surabaya'
)
density_map
To summarize, we now have three cleaned dataframes:
1. df, the neighborhood (kelurahan) coordinates;
2. sby_cafes_not_null, the cafe and coffee shop counts per neighborhood;
3. merged_df, the neighborhood coordinates joined with population counts.
The neighborhoods belonging to the existing branches have already been removed from numbers 2 and 3.
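As with processed.csv earlier, these dataframes could also be checkpointed to disk before moving on; the filenames below are only suggestions, not part of the original notebook.
# Optional checkpoints, mirroring the earlier processed.csv approach (filenames are illustrative)
# df.to_csv("neighborhoods_checkpoint.csv")
# sby_cafes_not_null.to_csv("cafes_checkpoint.csv")
# merged_df.to_csv("density_checkpoint.csv")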
This concludes the data gathering phase; we're now ready to use this data for analysis and to produce the report on optimal locations for a new coffee shop branch.
From here, we will cluster the above dataframes using K-means clustering. Then, we can inspect the characteristics of the resulting clusters and determine whether we clustered them the right way.
united_df = sby_cafes_not_null[['Neighborhood', 'Total']].copy(deep=True)
united_df.rename(columns={"Neighborhood": "KELURAHAN"}, inplace=True)
united_df = united_df.merge(merged_df[['KELURAHAN', 'COUNT']], how = 'inner', on = ['KELURAHAN'])
united_df.rename(columns={"COUNT": "Density"}, inplace=True)
united_df.head(10)
The magnitudes of the two numerical columns differ heavily from each other, so we apply normalization. Normalization is a statistical technique that lets distance-based algorithms treat features with different magnitudes and distributions equally. We normalize both columns with the MinMaxScaler() function before further processing.
from sklearn.preprocessing import MinMaxScaler, StandardScaler # Normalization library
from matplotlib import pyplot as plt # Data visualization standard library
from sklearn.cluster import KMeans # Clustering library
%matplotlib inline
col_names = ['Total', 'Density']
features = united_df[col_names] # Get a subset of the united_df dataframe
scaler = MinMaxScaler().fit(features.values) # Instantiate the scaler and fit it to the data (as a numpy array)
features = scaler.transform(features.values) # Scale each feature to the [0, 1] range
print("A sample of the first ten entries of the normalized features:")
print(features[:10])
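If we later need the original units back (for example, when labeling plot axes), the fitted scaler can undo the scaling; a small usage note, assuming the scaler and features objects defined above:
# Recover original-scale values from the normalized features (handy for readable axis labels)
original_scale = scaler.inverse_transform(features)
print(original_scale[:3])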
Before we run the clustering on our feature set, we need to determine the optimal value of k using the elbow method.
distortions = []
k_values = range(1, 10)

for k in k_values:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(features)
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16, 8))
plt.plot(k_values, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
The elbow method shows that the optimal value is k = 4. Thus, we can use this value for further processing.
plt.clf()
clusterNum = 4
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(features)
united_df['Cluster'] = k_means.labels_
united_df.head(10)
We can inspect each Cluster based on the mean of total coffee shops and population density:
united_df.groupby('Cluster').mean()
We can also count the members of each Cluster:
members = united_df['Cluster'].value_counts().rename_axis('Cluster No.').to_frame('Members of Cluster')
members.plot(kind='barh', figsize=(9,9))
Finally, we can list the neighborhoods that belong to each Cluster:
for i in set(united_df['Cluster']):
    neighborhoods_list = united_df[united_df['Cluster'] == i]
    print("For cluster", i)
    print(neighborhoods_list)
    print("---------------------------------------------------")
x_quartiles = [round(np.percentile(united_df['Density'].values, x)) for x in [(20 * x) for x in range(6)]] # Six evenly spaced percentiles (0th, 20th, ..., 100th) used as x-axis tick labels
# Plot the figure
plt.figure(figsize=(12,10))
plt.scatter(features[:, 1], features[:, 0], s = 200, c=k_means.labels_.astype(float), alpha=0.5)
plt.xlabel('Population Count', fontsize=16)
plt.xticks(ticks = np.arange(0.0, 1.1, 0.2), labels = x_quartiles)
plt.yticks(ticks = np.arange(0, 1.1, 0.2), labels = np.arange(0, max(united_df.Total.values) + 1, 2))
plt.ylabel('Total Coffee Shops', fontsize=16)
plt.title("Clusters of Coffee Shops vs Total Population in Surabaya", fontsize=16)
plt.show()
Looking at our scatter plot and at the population density and total coffee shops within each cluster, we can draw conclusions and label each of our clusters:
We can then replace the numerical cluster values with our verdict labels in the united_df dataframe.
united_df.Cluster.replace({3:'High', 0:'Mid-high', 1:'Moderate', 2:'Least'}, inplace=True)
united_df.head()
Displaying the top 5 areas for the High cluster:
united_df[united_df['Cluster'] == 'High'].head()
Displaying the top 5 areas for the Mid-high cluster:
united_df[united_df['Cluster'] == 'Mid-high'].head()
Displaying the top 5 areas for the Moderate cluster:
united_df[united_df['Cluster'] == 'Moderate'].head()
Displaying the top 5 areas for the Least cluster:
united_df[united_df['Cluster'] == 'Least'].head()
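As a possible follow-up, not part of the original notebook, the verdicts could be drawn back onto a folium map by joining united_df with the coordinates in merged_df; a rough sketch, with colors chosen arbitrarily:
# Hypothetical follow-up: color each remaining neighborhood by its verdict
verdict_colors = {'High': 'red', 'Mid-high': 'orange', 'Moderate': 'blue', 'Least': 'green'}
verdict_map = folium.Map(location=[latitude, longitude], zoom_start=12)
plotting_df = united_df.merge(merged_df[['KELURAHAN', 'Latitude', 'Longitude']], on='KELURAHAN')
for _, row in plotting_df.iterrows():
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=folium.Popup('{} ({})'.format(row['KELURAHAN'], row['Cluster']), parse_html=True),
        color=verdict_colors[row['Cluster']],
        fill=True,
        fill_opacity=0.7).add_to(verdict_map)
verdict_map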
Written, tested, and published by Charis Chrisna (portfolio).