Quantum Capital, a Singaporean firm specialising in direct investment, is planning to open a new fast-food fried chicken franchise in Indonesia. To begin with, they have decided to open outlets in Jakarta, the capital of Indonesia. In 2021 alone, they plan to invest in and open 5 chicken outlets.
Jakarta is a metropolitan city with supporting areas. The main area is divided into 5 regencies, consisting of around 40 districts. The analysis compares these districts using any data available on the internet.
With such a large population, Quantum Capital expects tight competition from existing multinational outlets such as KFC and McDonald's, as well as from local brands.
It’s important for Quantum Capital to choose the best location for its pilot project. A successful project will serve as a good portfolio piece and attract more investment funds from clients.
The objectives of this research are to:
1. provide exploratory data analysis using data available from the internet
2. give recommendation on the best possible location for the chicken outlet
3. provide a base recommendation for future research
Jakarta is at a very early stage of implementing its smart city programme, so the amount of open data available on the internet is very limited.
This limited data availability is the main constraint on this exploratory research.
2. Data Acquisition & Cleaning
District and sub-district names, areas and populations for all districts in Jakarta Province are available from various sources on the internet.
We use the district name to pull the latitude and longitude from a geocoder.
For venue data, we pass the latitude and longitude to the Foursquare API and get back the number of venues around that location. We pull only the categories that we identify as direct or potential competitors. Below is the table of categories regarded as potential competitors.
Combining all the data, we get the GDP density, population density and competitor density for each sub-district.
2.1 Data Sources
Below are the links to our data sources:
The Gross Regional Domestic Product (the GDP figure used here) is only available at district level, so it is allocated to sub-district level in proportion to the population of each sub-district.
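The population-weighted allocation and the three density figures can be sketched as follows. This is a minimal illustration rather than the notebook's code: the district total, the sub-district names, populations, areas and venue counts are all made-up placeholder values, and `add_densities` is a hypothetical helper.

```python
# Hypothetical inputs: one district-level GDP figure and three made-up sub-districts.
district_gdp = 1200.0  # placeholder value in arbitrary currency units

sub_districts = [
    {"name": "A", "population": 50_000, "area_km2": 5.0, "venues": 10},
    {"name": "B", "population": 30_000, "area_km2": 2.0, "venues": 4},
    {"name": "C", "population": 20_000, "area_km2": 4.0, "venues": 2},
]

def add_densities(rows, gdp_total):
    """Allocate gdp_total by population share, then divide each figure by area."""
    total_population = sum(r["population"] for r in rows)
    result = []
    for r in rows:
        gdp = gdp_total * r["population"] / total_population
        result.append({
            "name": r["name"],
            "gdp_per_km2": gdp / r["area_km2"],
            "people_per_km2": r["population"] / r["area_km2"],
            "venues_per_km2": r["venues"] / r["area_km2"],
        })
    return result

densities = add_densities(sub_districts, district_gdp)
for row in densities:
    print(row)
```

Note that allocating by population share assumes GDP per person is uniform within a district, so differences in GDP density between sub-districts of the same district come only from differences in population density.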
2.2 Data Cleaning
So far, only standard data cleaning is needed, such as stripping HTML tags and characters from data gathered from HTML pages and removing stray whitespace.
Duplicate entries across venue categories are also removed. For example, KFC is registered under both Fried Chicken Joint and Fast-Food Restaurant.
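Removing venues that appear under more than one category can be sketched as below. The record layout (an `id`, `name` and `category` per venue) is an assumption made for illustration; the actual fields depend on how the Foursquare response was parsed.

```python
# Hypothetical venue records, as they might look after querying two categories.
venues = [
    {"id": "v1", "name": "KFC", "category": "Fried Chicken Joint"},
    {"id": "v1", "name": "KFC", "category": "Fast Food Restaurant"},
    {"id": "v2", "name": "McDonald's", "category": "Fast Food Restaurant"},
]

def deduplicate(records):
    """Keep the first record seen for each venue id."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique

unique_venues = deduplicate(venues)
print(len(unique_venues))
```

Keying on a venue identifier rather than on the name keeps separate branches of the same chain as separate competitors.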
There’s no missing or incomplete data.
2.3 Final Table
The final table is in the form of:
3. Exploratory Data Analysis
3.1 Clustering Process
Clustering is performed on 3 variables:
1. GDP / km2
2. People / km2
3. Venues / km2
We use K-means++ clustering, with DBSCAN as a comparison.
K-means++ is used because it is sufficient for the purpose of this research, which is to find groups in the data, and because it is robust and easy to implement.
DBSCAN is presented as an alternative method for comparison because it approaches clustering from a different angle (point density rather than distance to centroids).
3.2 K-Means Clustering
Clustering performance is evaluated using the elbow method and silhouette analysis.
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick k at the spot where the SSE starts to flatten out and forms an elbow.
The plot shows that the SSE flattens out at k values of around 5 to 7.
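The elbow computation can be sketched with scikit-learn as follows. The data here is synthetic and only stands in for the three density columns, and standardising the features first is our assumption, not something stated in the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three density columns (GDP, people, venues per km2).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [5, 5, 5], [10, 0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(15, 3)) for c in centers])

# Assumption: features are standardised so that no single column dominates.
X_scaled = StandardScaler().fit_transform(X)

sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X_scaled)
    sse[k] = model.inertia_  # sum of squared distances to the closest centroid

for k, value in sse.items():
    print(k, round(value, 2))
```

Plotting `sse` against k gives the elbow curve; on this synthetic data the drop stops being steep after k=3 because the data was generated from three groups.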
The second method, silhouette analysis, measures how well each sample is separated from the neighboring clusters.
The steps to calculate the silhouette coefficient of a sample i are:
1. Compute the average distance from the sample to all other data points in the same cluster (ai).
2. Compute the average distance from the sample to all data points in the closest other cluster (bi).
3. Compute the coefficient: si = (bi - ai) / max(ai, bi)
The coefficient can take values in the interval [-1, 1].
1. If it is 0, the sample is very close to the neighboring clusters.
2. If it is 1, the sample is far away from the neighboring clusters.
3. If it is -1, the sample is assigned to the wrong cluster.
Therefore, we want the coefficients to be as large as possible, close to 1, for a good clustering.
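To make the three steps and the interpretation of the sign concrete, here is a small worked sketch in plain Python. The points are invented toy data and `silhouette` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette coefficient of `point`; `own_cluster` excludes the point itself."""
    # Step 1: average distance to the other members of the same cluster.
    a = sum(math.dist(point, p) for p in own_cluster) / len(own_cluster)
    # Step 2: average distance to the members of the closest other cluster.
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    # Step 3: the coefficient.
    return (b - a) / max(a, b)

# Toy 2-D data, invented for illustration.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(10.0, 0.0), (10.0, 1.0)]

# A point labelled with the cluster it sits in scores close to 1.
well_placed = silhouette((0.0, 0.5), cluster_a, [cluster_b])
# The same point labelled as a member of the far cluster scores close to -1.
misplaced = silhouette((0.0, 0.5), cluster_b, [cluster_a])
print(well_placed, misplaced)
```

In practice the score reported for a given k is the mean of this coefficient over all samples.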
In our results, the silhouette score drops from above 0.4 to below 0.4 when we increase k from 6 to 7.
Based on the model evaluation above, we decide to use k=6.
3.3 DBSCAN Clustering
DBSCAN is controlled by 2 parameters:
1. Minimum samples (“MinPts”): the fewest number of points required to form a cluster
2. ε (epsilon or “eps”): the maximum distance two points can be from one another while still belonging to the same cluster
Where possible, MinPts should follow the rule below:
If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim= the dimensions of your data set (Sander et al., 1998).
Based on the above, with 3 dimensions in our data the recommended MinPts is 6.
The ε value is determined by calculating the average distance between each point and its k nearest neighbours and plotting these distances in ascending order. The optimal value for ε lies at the point of maximum curvature (i.e. the elbow, where the slope of the graph changes most sharply).
Maximum curvature occurs at a value of around 1.5.
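The k-distance curve can be sketched with scikit-learn's `NearestNeighbors` as below. The feature matrix is random synthetic data of the same shape as ours (41 rows, 3 columns), and setting k equal to MinPts is our assumption for the sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the scaled feature matrix (41 rows, 3 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 3))

k = 6  # assumption: same value as MinPts
# n_neighbors is k + 1 because each query point is returned as its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Average distance to the k nearest neighbours, skipping the zero self-distance.
k_distances = np.sort(distances[:, 1:].mean(axis=1))
print(k_distances[:5], k_distances[-5:])
```

Plotting `k_distances` against its index gives the curve from which the elbow, and hence a candidate ε, is read off.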
However, using ε=1.5 and minimum samples=6 does not return a satisfying clustering. The data falls into 1 cluster plus 3 outliers (labelled -1 in the output below).
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Further investigation, iterating over ε from 0.5 to 1.6 and minimum samples from 2 to 6, does not give a satisfying result either: there are either too many outliers or too few clusters. The calculation is presented in our Jupyter Notebook.
We conclude that DBSCAN is not a viable approach for this data.
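A parameter sweep of this kind can be sketched as follows. The data is a toy set (two tight groups and one isolated point), the grid is coarser than the one we actually ran, and `summarise` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of four points and one isolated point.
X = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
    [50, 50],
], dtype=float)

def summarise(X, eps, min_samples):
    """Return (number of clusters, number of outliers) for one parameter pair."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_outliers = int(np.sum(labels == -1))  # DBSCAN labels noise points as -1
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_outliers

results = {}
for eps in (0.5, 1.0, 1.5):
    for min_samples in range(2, 7):
        results[(eps, min_samples)] = summarise(X, eps, min_samples)

for params, summary in results.items():
    print(params, summary)
```

Tabulating cluster and outlier counts per parameter pair makes it easy to see whether any setting yields more than one cluster without discarding too many points.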
4. Findings & Discussion
The result of clustering the districts with k=6 is presented in the table below.
The table is visualised as below:
The mean value for each cluster is presented below:
We then apply a simple heuristic to interpret the clusters and score them:
1) Choose sub-districts with medium, high or very high GDP density, as this represents high purchasing power. Clusters 3, 4 and 5 are therefore the best candidates.
2) Avoid the very high population density of cluster 3, since it may come with serious social problems. This leaves clusters 4 and 5 as candidates.
3) Avoid very high venue density, since it represents high competition.
Green signifies a favorable condition, while red signifies an unfavorable one.
Result: we choose cluster 5, which contains Senen and Kemayoran.
We also look at districts with high population density and medium or low venue density, since medium or low venue density means less competition. These are found in cluster 0 and cluster 3.
A quick internet search on Tambora and Johar Baru shows that these two districts are considered two of the most densely populated areas in Asia and are often troubled by population-related problems such as slums, low incomes, waste problems and occasional small local riots between residents. We exclude them from consideration.
We therefore recommend the districts below as possible locations for the chicken outlets.
The recommendation in this research considers only 4 aspects: GDP, population, number of venues and area. It therefore does not capture other considerations, such as:
- Purchasing power parity
- Elite or non-elite area, which will affect the image of the outlet
- Rent Price
- Legal / Permit
- Property Availability
We recommend that qualitative research be carried out to enhance the results of this paper by exploring the aspects mentioned above.