Data Science Capstone Project: Kopi Kompleks

Segmenting Prospective Neighborhoods to Expand a Coffee Shop Chain


1. Introduction

In this project, Foursquare geospatial and venue data will be used to determine which area(s) are best suited for expanding a coffee shop chain, Kopi Kompleks, around Surabaya, East Java, Indonesia. I would like to express my most profound gratitude to:

  1. Muhammad Fathin Moerdiwanto, for providing me with ample information on the franchise.
  2. Waode Nurul Jasmin A. Budiman, for helping me obtain Surabaya's map of administration borders in .shp format. Check out her flickr!
  3. Taqy Satriaji Santoso, my best friend since high school, for helping me extract .shp into .geojson with EPSG:4326 format. Check out his art page!

Without their help, this project would not have been completed as it is.

  • Results presentation of the project can be seen here;
  • Technical report of the project can be seen here instead.

1.1. Business Problem Formulation

The franchise in question, Kopi Kompleks, has already opened 4 branches across Surabaya and is looking to open its fifth. Since there are already a lot of coffee shop chains in the city, expanding into the wrong neighborhood could be unprofitable for the franchise.

This report will help the franchise determine which areas are the most and the least prospective. It will also be presented to the franchise's stakeholders and higher-ups, in the hope of aiding their decision-making process.

I will try to:

  1. Detect locations that are not already crowded with cafés.
  2. Detect areas with no cafés/coffee shops in the vicinity, with hopes of being the pioneering venue in said area.
  3. Detect areas with high population density.

1.2. Data

The franchise has already opened four branches, in the following areas:

  1. Kopi Kompleks, Jl. Slamet No.16A, Ketabang, Kec. Genteng, Kota SBY, Jawa Timur 60272
  2. Satu Rasa Kopi by Kopi Kompleks, Jl. Karimata No.1, Ngagel, Kec. Wonokromo, Kota SBY, Jawa Timur 60246
  3. Kopilaborasi by Kopi Kompleks, Jl. Raya Pd. Rosan No.19, Babatan, Kec. Wiyung, Kota SBY, Jawa Timur 60227
  4. Kopi Kompleks at PanaHouse, Jl. Gayung Kebonsari No.61 - 65, Ketintang, Kec. Gayungan, Kota SBY, Jawa Timur 60231

Based on the definition of our problem, the factors that will influence our decision are:

  • The number of existing coffee shops in the neighborhoods.
  • Population density in each of the neighborhoods.

The following data sources will be needed to extract/generate the required information:

  • The number of cafés/coffee shops and their locations in every neighborhood, obtained using the Foursquare API.
  • The population density of each and every neighborhood in Surabaya, obtained through Surabaya's Bureau of Statistics.
  • Longitude and latitude values of each neighborhood in Surabaya, in .shp file format, obtained through this ArcGIS online item.

I exported the .shp file through mapshaper into two formats:

  1. A cleaner, more understandable .csv file format
  2. A .geojson format with an output coordinate system of EPSG:4326
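Not part of the original workflow, but a quick sanity check on the export is possible: EPSG:4326 GeoJSON stores coordinates as [longitude, latitude] in degrees. The snippet below sketches such a check on a hypothetical, inline single-feature FeatureCollection; in practice you would `json.load()` the exported file instead.

```python
import json

# Hypothetical inline sample standing in for the exported
# Administrasi_Kelurahan_Surabaya.geojson file.
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"DESA": "WARUGUNUNG", "KECAMATAN": "KARANG PILANG"},
      "geometry": {"type": "Point", "coordinates": [112.673585, -7.342167]}
    }
  ]
}
""")

def looks_like_epsg4326(feature):
    """Heuristic check: EPSG:4326 uses [longitude, latitude] in degrees."""
    lon, lat = feature["geometry"]["coordinates"][:2]
    return -180 <= lon <= 180 and -90 <= lat <= 90

assert all(looks_like_epsg4326(f) for f in sample["features"])
print(sample["features"][0]["properties"]["DESA"])  # WARUGUNUNG
```

A coordinate pair outside those ranges would suggest the file was exported in a projected coordinate system (meters) rather than EPSG:4326.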

The Foursquare data will first be segmented into several clusters, before eliminating the clusters that are nearest to and/or belong to the franchise's four existing branches. This way, we get an unbiased representation of each neighborhood's venue profile, and can ultimately determine which areas are the most prospective for opening the next branch.

In [1]:
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner
import re # regex library

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe (pd.json_normalize in newer pandas)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

print("Libraries imported!")
Libraries imported!

1.2.1. Surabaya Geospatial Data

This section covers obtaining and cleaning the latitude and longitude values of each neighborhood in Surabaya.

In [2]:
df = pd.read_csv("Administrasi_Kelurahan_Surabaya.csv")
df.head()
Out[2]:
OBJECTID Id DESA KECAMATAN KABKOT PROPINSI X Y Z FillTransp OutlineTra SHAPE_Leng SHAPE_Area
0 1 0 WARUGUNUNG KARANG PILANG SURABAYA NaN 112.673585 -7.342167 1 49.803922 0 9913.141714 3902169.5
1 2 0 KARANG PILANG KARANG PILANG SURABAYA NaN 112.689647 -7.339349 1 49.803922 0 6840.293677 2164425.5
2 3 0 KEBRAON KARANG PILANG SURABAYA NaN 112.700501 -7.330315 1 49.803922 0 7984.106871 2073235.0
3 4 0 KEDURUS KARANG PILANG SURABAYA NaN 112.705912 -7.320099 1 49.803922 0 5973.233996 1752535.5
4 5 0 PAGESANGAN JAMBANGAN SURABAYA NaN 112.712619 -7.334844 1 49.803922 0 5102.418403 1121491.5

We can see that the original .csv file has some unnecessary columns. These columns are:

  • OBJECTID -- Dropped because it's a duplicate of index.
  • Id -- Dropped because it's filled with unrepresentative zeroes.
  • KABKOT -- Dropped because all of them are SURABAYA, which is obvious since we're working with Surabaya dataset.
  • PROPINSI -- Dropped because irrelevant and filled with NaN values.
  • Z -- Dropped because it's filled with unrepresentative ones.
  • FillTransp -- Dropped because this column is only representative when opened with ArcGIS Desktop.
  • OutlineTra -- Dropped because this column is only representative when opened with ArcGIS Desktop.
  • SHAPE_Leng -- Dropped because this column is only representative when opened with ArcGIS Desktop.
  • SHAPE_Area -- Dropped because this column is only representative when opened with ArcGIS Desktop.

We can drop these columns to achieve a clearer, more concise dataframe object.

In [3]:
df.drop(['OBJECTID', 'Id', 'KABKOT', 'PROPINSI', 'Z', 'FillTransp', 'OutlineTra', 'SHAPE_Leng', 'SHAPE_Area'], axis=1, inplace=True) 

df = df.rename(columns={"DESA": "KELURAHAN", "X": "Longitude", "Y": "Latitude"}) # Rename the rest of the columns for clarity
In [4]:
df.head()
Out[4]:
KELURAHAN KECAMATAN Longitude Latitude
0 WARUGUNUNG KARANG PILANG 112.673585 -7.342167
1 KARANG PILANG KARANG PILANG 112.689647 -7.339349
2 KEBRAON KARANG PILANG 112.700501 -7.330315
3 KEDURUS KARANG PILANG 112.705912 -7.320099
4 PAGESANGAN JAMBANGAN 112.712619 -7.334844

Create a map of Surabaya with neighborhoods superimposed on top.

A quick Google search revealed that Surabaya is centered at latitude -7.250445, longitude 112.768845. We assign (slightly adjusted) values to latitude and longitude before rendering the map with folium.

In [5]:
latitude, longitude = -7.265, 112.7107 # Altered slightly to better center the map display

# create map of Surabaya using latitude and longitude values
map_sby = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, kec, kel in zip(df['Latitude'], df['Longitude'], df['KECAMATAN'], df['KELURAHAN']):
    label = '{}, {}'.format(kel, kec)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sby)
    
map_sby
Out[5]:
Make this Notebook Trusted to load map: File -> Trust Notebook

We now have a set of datapoints representing each neighborhood (kelurahan) in Surabaya.
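As a small aside, these centroid coordinates let us compute straight-line distances between neighborhoods. Below is a sketch using the standard haversine formula on the first two rows of df.head(); the helper name haversine_km is my own, not part of the notebook.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Distance between the WARUGUNUNG and KARANG PILANG centroids from df.head()
d = haversine_km(-7.342167, 112.673585, -7.339349, 112.689647)
print(round(d, 2))  # roughly 1.8 km
```

Distances like this are useful later for judging whether a candidate neighborhood sits too close to an existing branch.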

1.2.2. Venue Data from Foursquare API

This section covers cleaning and transforming the data obtained from GET requests to the Foursquare API.

Define Foursquare Credentials and Version

Note that when I share this notebook, I would have to redact the credentials. I hope this is understandable :)

In [53]:
CLIENT_ID = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare ID
CLIENT_SECRET = '■■■■■■■■■■■■■■■■■■■■■■■■■' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Using the following credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
Using the following credentials:
CLIENT_ID: ■■■■■■■■■■■■■■■■■■■■■■■■■
CLIENT_SECRET: ■■■■■■■■■■■■■■■■■■■■■■■■■

Define a few functions to use for exploratory analysis

We're going to use these functions to explore neighborhoods in Surabaya. In the function getNearbyVenues, we treat each neighborhood as a circular area with a radius of 500 meters around its centroid, and query Foursquare for venues within that circle.

In [7]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# function that returns venues in an area, given latitude and longitude
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name) # uncomment to debug; commented out to keep the output concise
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100) # Get the results of the top 100 venues.
            
        # make the GET request
        # print(requests.get(url).json()["response"]) # uncomment to debug in case we've overused our GET request limit.
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
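As a side note, the URL built with str.format above works, but urllib.parse.urlencode escapes query parameters more safely. A hypothetical helper (build_explore_url is my own name, not part of the notebook) could look like:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the Foursquare explore-endpoint URL with safely escaped params."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; round-trip the query string to verify the params
url = build_explore_url("ID", "SECRET", "20180605", -7.342167, 112.673585)
query = parse_qs(urlparse(url).query)
print(query["ll"][0])  # -7.342167,112.673585
```

urlencode also makes it easy to add or drop parameters (e.g. a categoryId filter) without editing a long format string.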
In [8]:
#sby_venues = getNearbyVenues(names=df['KELURAHAN'], latitudes=df['Latitude'], longitudes=df['Longitude'])
#Uncomment to make function call the Foursquare API.

Display the end dataframe

The dataframe contains each venue's properties along with its corresponding neighborhood.

In [9]:
#sby_venues.to_csv("processed.csv") # Write csv file as a checkpoint in case we overuse the API call quota
sby_venues = pd.read_csv("processed.csv")
sby_venues.drop('Unnamed: 0', inplace=True,axis=1)
sby_venues.head()
Out[9]:
Neighborhood Neighborhood Latitude Neighborhood Longitude Venue Venue Latitude Venue Longitude Venue Category
0 WARUGUNUNG -7.342167 112.673585 Hokben Supermall -7.340084 112.676092 Japanese Restaurant
1 KARANG PILANG -7.339349 112.689647 Gerbang Tol Warugunung -7.340210 112.690311 Toll Booth
2 KARANG PILANG -7.339349 112.689647 Sungai karangpilang -7.337978 112.690233 River
3 KARANG PILANG -7.339349 112.689647 Indomaret -7.341636 112.690925 Grocery Store
4 KEBRAON -7.330315 112.700501 Toko LUCKY -7.326822 112.700493 Convenience Store

We can find out how many unique categories were returned across all the venues.

In [10]:
print('There are {} unique categories.'.format(len(sby_venues['Venue Category'].unique())))
There are 202 unique categories.

Get One-Hot dummy variables for each venue category

This way, we can analyze each neighborhood without having to scroll through all entries in the sby_venues dataframe.

In [11]:
# one hot encoding
sby_onehot = pd.get_dummies(sby_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sby_onehot['Neighborhood'] = sby_venues['Neighborhood'] 

# move the neighborhood column to the front
fixed_columns = [sby_onehot.columns[-1]] + list(sby_onehot.columns[:-1])
sby_onehot = sby_onehot[fixed_columns]
In [12]:
sby_onehot.head()
Out[12]:
[Out[12] is a 5 × 203 table: the Neighborhood column followed by one 0/1 indicator column per venue category (Accessories Store through Yoga Studio); each row marks a single venue's category with a 1. Output truncated for readability.]

We can then group rows based on neighborhood names and by taking the sum of all occurrences in each category.

In [13]:
sby_grouped = sby_onehot.groupby('Neighborhood').sum().reset_index()
sby_grouped.head()
Out[13]:
[Out[13] is a 5 × 203 table: one row per neighborhood (AIRLANGGA, ALON-ALON CONTONG, AMPEL, ASEMROWO, BABAT JERAWAT, ...), with each venue-category column holding the summed occurrence counts for that neighborhood. Output truncated for readability.]
Finally, we can trim the above dataframe into a more concise one for clarity. In particular, we're interested in the categories Café and Coffee Shop, so let's do just that:

In [14]:
# Make another dataframe
sby_cafes = sby_grouped.copy(deep=True)[['Neighborhood', 'Café', 'Coffee Shop']]

# Add another column containing the total café and coffee shop venues
sby_cafes['Total'] = sby_cafes[['Café', 'Coffee Shop']].sum(axis=1) # sum only the numeric columns

# Sort rows by the new column, 'Total'
sby_cafes.sort_values(by='Total', ascending=False, inplace=True)
sby_cafes.reset_index(drop=True, inplace=True)
print("The dimension of this dataframe is: ")
print(sby_cafes.shape)
sby_cafes.head()
The dimension of this dataframe is: 
(145, 4)
Out[14]:
Neighborhood Café Coffee Shop Total
0 GUBENG 3 7 10
1 GENTENG 6 3 9
2 KEDUNGDORO 2 6 8
3 KETABANG 1 6 7
4 MANYAR SABRANGAN 3 4 7

We may want to visualize which neighborhoods have the most coffee shops and/or cafés. But first, we need to trim the neighborhoods with zero totals so they don't clog up the chart(s).

In [15]:
sby_cafes_not_null = sby_cafes[sby_cafes.Total != 0] 
print("The dimension of this dataframe is: ")
print(sby_cafes_not_null.shape)
print("We've trimmed",
     sby_cafes.shape[0] - sby_cafes_not_null.shape[0],
      "rows!"
     )
sby_cafes_not_null.head()
The dimension of this dataframe is: 
(63, 4)
We've trimmed 82 rows!
Out[15]:
Neighborhood Café Coffee Shop Total
0 GUBENG 3 7 10
1 GENTENG 6 3 9
2 KEDUNGDORO 2 6 8
3 KETABANG 1 6 7
4 MANYAR SABRANGAN 3 4 7
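A minimal sketch of such a chart, using only the five rows shown in Out[15] as stand-in data; in the notebook itself you would plot sby_cafes_not_null directly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: the top-5 rows from Out[15]
top = pd.DataFrame({
    "Neighborhood": ["GUBENG", "GENTENG", "KEDUNGDORO", "KETABANG",
                     "MANYAR SABRANGAN"],
    "Total": [10, 9, 8, 7, 7],
})

# Horizontal bar chart, highest total on top
ax = top.set_index("Neighborhood")["Total"].plot(kind="barh", figsize=(8, 4))
ax.invert_yaxis()
ax.set_xlabel("Cafés + coffee shops")
plt.tight_layout()
plt.savefig("top_cafes.png")
print(top["Total"].max())  # 10
```

A horizontal layout is used because the neighborhood names are long; for all 63 non-zero neighborhoods a taller figsize would be needed.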

We know that our existing four branches are:

  1. Kopi Kompleks, Jl. Slamet No.16A, Ketabang, Kec. Genteng, Kota SBY, Jawa Timur 60272
  2. Satu Rasa Kopi by Kopi Kompleks, Jl. Karimata No.1, Ngagel, Kec. Wonokromo, Kota SBY, Jawa Timur 60246
  3. Kopilaborasi by Kopi Kompleks, Jl. Raya Pd. Rosan No.19, Babatan, Kec. Wiyung, Kota SBY, Jawa Timur 60227
  4. Kopi Kompleks at PanaHouse, Jl. Gayung Kebonsari No.61 - 65, Ketintang, Kec. Gayungan, Kota SBY, Jawa Timur 60231

Which, respectively, are located in the following neighborhoods:

  1. KETABANG
  2. NGAGEL
  3. BABATAN
  4. KETINTANG

We remove these four neighborhoods from our new dataframe, since they are no longer relevant for further decision-making.

In [16]:
to_remove = ['KETABANG', 'NGAGEL', 'BABATAN', 'KETINTANG']
sby_cafes_to_remove = sby_cafes_not_null[sby_cafes_not_null.Neighborhood.isin(to_remove)]
sby_cafes_to_remove
Out[16]:
Neighborhood Café Coffee Shop Total
3 KETABANG 1 6 7
5 NGAGEL 3 3 6

We can see that BABATAN and KETINTANG don't have any coffee spots, hence they don't appear in the above dataframe. KETABANG and NGAGEL lie at indices 3 and 5, respectively; those are the rows to drop. (The drop below is left commented out because the inner merge with the cleaned merged_df in the Methodology section removes these neighborhoods as well.)

In [17]:
#sby_cafes_not_null.drop([3,5], inplace=True)
sby_cafes_not_null.head(10)
Out[17]:
Neighborhood Café Coffee Shop Total
0 GUBENG 3 7 10
1 GENTENG 6 3 9
2 KEDUNGDORO 2 6 8
3 KETABANG 1 6 7
4 MANYAR SABRANGAN 3 4 7
5 NGAGEL 3 3 6
6 KLAMPIS NGASEM 2 4 6
7 PACAR KELING 3 2 5
8 AIRLANGGA 4 0 4
9 KEPUTRAN 2 2 4

1.2.3. Surabaya Population Density

This section covers the cleaning and choropleth mapping of Surabaya's population density, per neighborhood.

In [18]:
pop_dens = pd.read_csv("Jumlah_Penduduk_Kelurahan_Surabaya.csv")
pop_dens.head()
Out[18]:
Neighborhood Penduduk KK Rata-rata Anggota Keluarga
0 WARUGUNUNG 9697 2982 3.25
1 KARANG PILANG 9725 2963 3.28
2 KEBRAON 30160 9558 3.16
3 KEDURUS 27975 8761 3.19
4 TANAH KALI KEDINDING 60566 17629 3.44

We can remove the last two columns, since they're irrelevant for choropleth mapping. We also rename the Penduduk (population) column to COUNT, and Neighborhood to KELURAHAN, for easier processing and clarity.

In [19]:
pop_dens.drop(['KK', 'Rata-rata Anggota Keluarga'], axis=1, inplace=True)
pop_dens.rename(columns={"Penduduk": "COUNT", "Neighborhood": "KELURAHAN"}, inplace=True)
pop_dens.head()
Out[19]:
KELURAHAN COUNT
0 WARUGUNUNG 9697
1 KARANG PILANG 9725
2 KEBRAON 30160
3 KEDURUS 27975
4 TANAH KALI KEDINDING 60566

We then merge this population data into the original df dataframe from section 1.2.1, and check for missing values:

In [20]:
merged_df = df.merge(pop_dens, how = 'left', on = ['KELURAHAN'])
for column in merged_df.columns:
    print(column)
    print(merged_df[column].isnull().value_counts())
    print("") 
KELURAHAN
False    165
Name: KELURAHAN, dtype: int64

KECAMATAN
False    165
Name: KECAMATAN, dtype: int64

Longitude
False    165
Name: Longitude, dtype: int64

Latitude
False    165
Name: Latitude, dtype: int64

COUNT
False    149
True      16
Name: COUNT, dtype: int64

We can see that some values in the COUNT column are missing.

We substitute these NaN values with the first quartile of the population counts across all neighborhoods. The first quartile is used rather than the mean because population counts are right-skewed: a few very dense neighborhoods would drag the mean upward, while the first quartile stays near the bulk of the distribution.
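A toy illustration of that choice, with made-up population counts (not the project's data): a right-skewed sample drags the mean well above most of the values, while the first quartile remains conservative.

```python
import numpy as np

# Hypothetical right-skewed population counts: mostly ~10k, two outliers
counts = np.array([8000, 9000, 9500, 10000, 11000, 12000, 60000, 75000])

q1 = np.quantile(counts, 0.25)  # first quartile
mean = counts.mean()
print(q1, round(mean, 1))  # 9375.0 24312.5

# The outliers pull the mean far above the median; the first quartile
# stays among the typical smaller neighborhoods
assert q1 < np.median(counts) < mean
```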

In [21]:
quantile = merged_df.COUNT.quantile(0.25)
merged_df['COUNT'] = merged_df['COUNT'].fillna(quantile) # Substitute missing values with first quartile
print(merged_df.shape)
merged_df.head()
(165, 5)
Out[21]:
KELURAHAN KECAMATAN Longitude Latitude COUNT
0 WARUGUNUNG KARANG PILANG 112.673585 -7.342167 9697.0
1 KARANG PILANG KARANG PILANG 112.689647 -7.339349 9725.0
2 KEBRAON KARANG PILANG 112.700501 -7.330315 30160.0
3 KEDURUS KARANG PILANG 112.705912 -7.320099 27975.0
4 PAGESANGAN JAMBANGAN 112.712619 -7.334844 14467.0
In [22]:
sby_cafes_to_remove = merged_df[merged_df.KELURAHAN.isin(to_remove)]
sby_cafes_to_remove
Out[22]:
KELURAHAN KECAMATAN Longitude Latitude COUNT
11 KETINTANG GAYUNGAN 112.727603 -7.318579 16271.0
56 NGAGEL WONOKROMO 112.745784 -7.287062 13039.0
63 BABATAN WIYUNG 112.680195 -7.306552 29404.0
106 KETABANG GENTENG 112.747903 -7.258152 7697.0

We need to remove the rows with indices 11, 56, 63, and 106, since a branch is already established in each of those neighborhoods.

In [23]:
merged_df.drop([11, 56, 63, 106], inplace=True)
from folium.plugins import HeatMap

m = folium.Map([latitude, longitude], zoom_start=12)
data = merged_df[['Latitude', 'Longitude', 'COUNT']].values
HeatMap(data,
        max_val=max(merged_df['COUNT'].values)).add_to(m)
m
Out[23]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [24]:
sby_geojson = r'Administrasi_Kelurahan_Surabaya.geojson'
density_map = folium.Map(location=[latitude, longitude], zoom_start=12)

folium.Choropleth(
    geo_data=sby_geojson,
    data=merged_df,
    columns=['KELURAHAN', 'COUNT'],
    key_on='feature.properties.DESA',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Population Density in Surabaya'
).add_to(density_map)

density_map
Out[24]:
Make this Notebook Trusted to load map: File -> Trust Notebook

In the choropleth above, the four greyed segments are the neighborhoods with existing branches.

So, at this point we have clean dataframes of:

  1. Latitude & longitude values for each neighborhood in Surabaya, df;
  2. The number of coffee shops and/or cafés in each neighborhood, sby_cafes_not_null;
  3. Population density of each neighborhood in Surabaya, along with its visualization, merged_df.

The neighborhoods containing the existing branches have already been removed from numbers 2 and 3.

This concludes the data gathering phase - we're now ready to use this data for analysis and produce the report on optimal locations for a new coffee shop branch.

2. Methodology

From here, we will cluster the above dataframes using K-means clustering. We can then examine the characteristics of the resulting clusters and assess whether the clustering is sensible.
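Before running it on the project data, here is a minimal, self-contained sketch of the K-means step on made-up normalized features (two obvious groups), just to illustrate the scikit-learn API used below.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical normalized features (columns: café total, density),
# not the project's real data: two well-separated groups
features = np.array([
    [0.90, 0.20], [1.00, 0.30], [0.80, 0.25],   # many cafés, modest density
    [0.10, 0.90], [0.20, 0.95], [0.15, 0.85],   # few cafés, high density
])

km = KMeans(init="k-means++", n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(features)
print(labels)

# The two obvious groups should land in different clusters
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
```

random_state is fixed here only to make the sketch reproducible; the notebook's own run below uses n_init = 12 without a seed.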

In [25]:
united_df = sby_cafes_not_null[['Neighborhood', 'Total']].copy(deep=True)
united_df.rename(columns={"Neighborhood": "KELURAHAN"}, inplace=True)
united_df = united_df.merge(merged_df[['KELURAHAN', 'COUNT']], how = 'inner', on = ['KELURAHAN'])
united_df.rename(columns={"COUNT": "Density"}, inplace=True)
united_df.head(10)
Out[25]:
KELURAHAN Total Density
0 GUBENG 10 15361.0
1 GENTENG 9 8731.0
2 KEDUNGDORO 8 25271.0
3 MANYAR SABRANGAN 7 17892.0
4 KLAMPIS NGASEM 6 18147.0
5 PACAR KELING 5 24144.0
6 AIRLANGGA 4 21193.0
7 KEPUTRAN 4 16610.0
8 MULYOREJO 4 18705.0
9 DUKUH MENANGGAL 4 9156.0

The magnitudes of the two numerical columns differ heavily from each other, so we apply normalization.

Normalization is a preprocessing step that rescales features so that distance-based algorithms such as K-means treat features with different magnitudes and distributions equally.

Normalize both columns for further processing with the MinMaxScaler() class.

In [26]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler #  Normalization library
from matplotlib import pyplot as plt # Data visualization standard library
from sklearn.cluster import KMeans # Clustering library
%matplotlib inline

col_names = ['Total', 'Density']
features = united_df[col_names] # Get a subset of the united_df dataframe
scaler = MinMaxScaler().fit(features.values) # Instantiate and fit data, as a numpy array, to the scaler object
features = scaler.transform(features.values) # Scale each feature to the [0, 1] range
print("A sample of the first ten entries of the normalized features:")
print(features[:10])
A sample of the first ten entries of the normalized features:
[[1.         0.17792649]
 [0.88888889 0.05735693]
 [0.77777778 0.35814436]
 [0.66666667 0.22395388]
 [0.55555556 0.22859117]
 [0.44444444 0.33764935]
 [0.33333333 0.28398407]
 [0.33333333 0.20064013]
 [0.33333333 0.23873866]
 [0.33333333 0.06508574]]
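As a hand check on that output: min-max scaling maps each value x to (x - min) / (max - min). The cluster listings further below suggest the Density column's extremes are 5577 (BRINGIN) and 60566 (TANAH KALI KEDINDING); assuming those are indeed the column's min and max, GUBENG's normalized density can be reproduced:

```python
# Min-max formula: x' = (x - min) / (max - min)
d_min, d_max = 5577.0, 60566.0      # assumed extremes of the Density column
gubeng_scaled = (15361.0 - d_min) / (d_max - d_min)
print(round(gubeng_scaled, 8))  # 0.17792649, matching the first normalized row
```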

Before we instantiate the clustering on our feature set, we need to determine which value of k is optimal using the elbow method.

In [27]:
distortions = []
k_values = range(1,10)
for k in k_values:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(features)
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16,8))
plt.plot(k_values, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

We can see that the elbow point is at k = 4. Thus, we can use this value for further processing.

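Reading the elbow off the plot can also be automated. A common heuristic picks the k whose point lies farthest from the straight line joining the curve's endpoints; the distortion values below are made up for illustration, not taken from the actual run:

```python
import numpy as np

ks = np.arange(1, 10)
# Hypothetical distortion curve with a clear bend at k = 4
distortions = np.array([12.0, 6.0, 3.5, 2.0, 1.8, 1.6, 1.5, 1.4, 1.3])

p1 = np.array([ks[0], distortions[0]])    # first point of the curve
p2 = np.array([ks[-1], distortions[-1]])  # last point of the curve
chord = (p2 - p1) / np.linalg.norm(p2 - p1)

# Perpendicular distance of every curve point to the chord (2-D cross product)
vecs = np.stack([ks, distortions], axis=1) - p1
dists = np.abs(vecs[:, 0] * chord[1] - vecs[:, 1] * chord[0])
elbow_k = int(ks[np.argmax(dists)])
print(elbow_k)  # 4
```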
In [42]:
plt.clf()
clusterNum = 4
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(features)
united_df['Cluster'] = k_means.labels_
united_df.head(10)
Out[42]:
KELURAHAN Total Density Cluster
0 GUBENG 10 15361.0 2
1 GENTENG 9 8731.0 2
2 KEDUNGDORO 8 25271.0 2
3 MANYAR SABRANGAN 7 17892.0 2
4 KLAMPIS NGASEM 6 18147.0 2
5 PACAR KELING 5 24144.0 0
6 AIRLANGGA 4 21193.0 0
7 KEPUTRAN 4 16610.0 0
8 MULYOREJO 4 18705.0 0
9 DUKUH MENANGGAL 4 9156.0 0

We can then group our new Cluster column by the mean of total coffee shops and population density:

In [43]:
united_df.groupby('Cluster').mean()
Out[43]:
Total Density
Cluster
0 3.571429 17180.785714
1 1.194444 16305.027778
2 8.000000 17080.400000
3 1.222222 41145.000000

We can also count the members belonging to each Cluster:

In [44]:
members = united_df['Cluster'].value_counts().rename_axis('Cluster No.').to_frame('Members of Cluster')
members.plot(kind='barh', figsize=(9,9))
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x242a445ae88>

If we want to see which neighborhoods belong to which Cluster:

In [45]:
for i in set(united_df['Cluster']):
    neighborhoods_list = united_df[united_df['Cluster'] == i] 
    print("For cluster", i)
    print(neighborhoods_list)
    print("---------------------------------------------------")
For cluster 0
            KELURAHAN  Total  Density  Cluster
5        PACAR KELING      5  24144.0        0
6           AIRLANGGA      4  21193.0        0
7            KEPUTRAN      4  16610.0        0
8           MULYOREJO      4  18705.0        0
9     DUKUH MENANGGAL      4   9156.0        0
10       DUKUH KUPANG      4  16254.0        0
11         DR. SUTOMO      4  17783.0        0
12        PUCANG SEWU      3  15116.0        0
13  NGIDEN JANGKUNGAN      3  16182.0        0
14      RUNGKUT KIDUL      3  14074.0        0
15         BARATAJAYA      3  16996.0        0
16          JAMBANGAN      3  11560.0        0
17          KERTAJAYA      3  26692.0        0
18           KALISARI      3  16066.0        0
---------------------------------------------------
For cluster 1
           KELURAHAN  Total  Density  Cluster
20            GREGES      2  12455.0        1
21             JERUK      2   8956.0        1
22         KALIJUDAN      2  14282.0        1
23         MENANGGAL      2   9945.0        1
24   EMBONG KALIASIN      2  13397.0        1
26             DARMO      2  16182.0        1
27      SIWALANKERTO      2  18006.0        1
28      PANJANG JIWO      1  13607.0        1
29  KREMBANGAN UTARA      1  19156.0        1
30            WIYUNG      1  18963.0        1
31  TENGGILIS MEJOYO      1  11578.0        1
32      TEMBOK DUKUH      1  27724.0        1
33         TEGALSARI      1  19634.0        1
34       TANJUNGSARI      1  12600.0        1
36        TAMBAKREJO      1  20993.0        1
37          KUTISARI      1  19407.0        1
38       NYAMPLUNGAN      1   9473.0        1
39        SEMOLOWARU      1  19992.0        1
41           SAWAHAN      1  20676.0        1
43       PERAK TIMUR      1  16221.0        1
44       PERAK BARAT      1  16685.0        1
47      KEDUNG BARUK      1  16899.0        1
48     KUPANG KRAJAN      1  25570.0        1
49          WONOREJO      1  18257.0        1
50          WONOREJO      1  18257.0        1
51          WONOREJO      1  18257.0        1
52          WONOREJO      1  18257.0        1
53        BALONGSARI      1  10344.0        1
54            BENOWO      1  11491.0        1
56             JAGIR      1  22668.0        1
57           BRINGIN      1   5577.0        1
58      GEBANG PUTIH      1   7775.0        1
59      KALI RUNGKUT      1  24187.0        1
60          GAYUNGAN      1  12455.0        1
61    JEMUR WONOSARI      1  23082.0        1
62    BALAS KLUMPRIK      1  13973.0        1
---------------------------------------------------
For cluster 2
          KELURAHAN  Total  Density  Cluster
0            GUBENG     10  15361.0        2
1           GENTENG      9   8731.0        2
2        KEDUNGDORO      8  25271.0        2
3  MANYAR SABRANGAN      7  17892.0        2
4    KLAMPIS NGASEM      6  18147.0        2
---------------------------------------------------
For cluster 3
               KELURAHAN  Total  Density  Cluster
19                  MOJO      2  48516.0        3
25             WONOKROMO      2  43480.0        3
35  TANAH KALI KEDINDING      1  60566.0        3
40          SAWUNGGALING      1  29006.0        3
42            PUTAT JAYA      1  49249.0        3
45              PEGIRIAN      1  35272.0        3
46               KEBRAON      1  30160.0        3
55            BANYU URIP      1  42459.0        3
63                GADING      1  31597.0        3
---------------------------------------------------

Finally, we can visualize the clusters:

In [46]:
# Six evenly spaced percentile labels (0th, 20th, ..., 100th) of the raw Density
# column, used to label the normalized x-axis with interpretable values
x_tick_labels = [round(np.percentile(united_df['Density'].values, p)) for p in range(0, 101, 20)]

# Plot the figure
plt.figure(figsize=(12,10))
plt.scatter(features[:, 1], features[:, 0], s=200, c=k_means.labels_.astype(float), alpha=0.5)
plt.xlabel('Population Density', fontsize=16)
plt.xticks(ticks=np.arange(0.0, 1.1, 0.2), labels=x_tick_labels)
plt.yticks(ticks=np.arange(0, 1.1, 0.2), labels=np.arange(0, max(united_df.Total.values) + 1, 2))
plt.ylabel('Total Coffee Shops', fontsize=16)
plt.title("Clusters of Coffee Shops vs Population Density in Surabaya", fontsize=16)
plt.show()

Conclusion

Looking at the scatter plot and at each cluster's mean density and total coffee shops, we can draw a conclusion and label each of our clusters:

  1. Cluster 3 has the second-fewest members, only a few coffee shops each, and the most densely populated neighborhoods. This is the most prospective cluster.
  2. Cluster 0 has a moderate number of members, each with a relatively high number of coffee shops. This is the moderately-to-highly prospective cluster.
  3. Cluster 1 has the most members, each with only a few coffee shops. Since the lower half of the population density range belongs to this cluster, it is the moderately prospective cluster.
  4. Cluster 2 has the fewest members but the highest number of coffee shops. This is the least prospective cluster.

We can then replace the numerical cluster labels with these verdicts in our united_df dataframe.

In [47]:
united_df.Cluster.replace({3:'High', 0:'Mid-high', 1:'Moderate', 2:'Least'}, inplace=True) 
united_df.head()
Out[47]:
KELURAHAN Total Density Cluster
0 GUBENG 10 15361.0 Least
1 GENTENG 9 8731.0 Least
2 KEDUNGDORO 8 25271.0 Least
3 MANYAR SABRANGAN 7 17892.0 Least
4 KLAMPIS NGASEM 6 18147.0 Least

Most prospective areas:

Displaying the first five areas in this cluster.

In [48]:
united_df[united_df['Cluster'] == 'High'].head()
Out[48]:
KELURAHAN Total Density Cluster
19 MOJO 2 48516.0 High
25 WONOKROMO 2 43480.0 High
35 TANAH KALI KEDINDING 1 60566.0 High
40 SAWUNGGALING 1 29006.0 High
42 PUTAT JAYA 1 49249.0 High

Moderate-to-high prospective areas:

Displaying the first five areas in this cluster.

In [49]:
united_df[united_df['Cluster'] == 'Mid-high'].head()
Out[49]:
KELURAHAN Total Density Cluster
5 PACAR KELING 5 24144.0 Mid-high
6 AIRLANGGA 4 21193.0 Mid-high
7 KEPUTRAN 4 16610.0 Mid-high
8 MULYOREJO 4 18705.0 Mid-high
9 DUKUH MENANGGAL 4 9156.0 Mid-high

Moderately prospective areas:

Displaying the first five areas in this cluster.

In [50]:
united_df[united_df['Cluster'] == 'Moderate'].head()
Out[50]:
KELURAHAN Total Density Cluster
20 GREGES 2 12455.0 Moderate
21 JERUK 2 8956.0 Moderate
22 KALIJUDAN 2 14282.0 Moderate
23 MENANGGAL 2 9945.0 Moderate
24 EMBONG KALIASIN 2 13397.0 Moderate

Least prospective areas:

Displaying the first five areas in this cluster.

In [51]:
united_df[united_df['Cluster'] == 'Least'].head()
Out[51]:
KELURAHAN Total Density Cluster
0 GUBENG 10 15361.0 Least
1 GENTENG 9 8731.0 Least
2 KEDUNGDORO 8 25271.0 Least
3 MANYAR SABRANGAN 7 17892.0 Least
4 KLAMPIS NGASEM 6 18147.0 Least

Written, tested, and published by Charis Chrisna (portfolio).