Clustering the Tokyo, Japan Airbnb superhost listings by room features, booking, and review features

August 11, 2021

This mini project was created alongside Daryll Tumambing for our Data Mining and Wrangling class, held from May to August 2020. We were tasked with choosing any dataset from Inside Airbnb and performing descriptive analytics by solving a clustering problem.

Research question

We were particularly interested in superhost listings in Airbnb Tokyo. Given the highly varied market of guests, what are the different types of superhost listings? How are these types similar to and different from one another?

About the data

We used the December 30, 2019 crawl of the Inside Airbnb Tokyo listings dataset, which corresponds to all 14,550 listings that have existed since Airbnb began in Tokyo.

Methodology

Data collection: Inside Airbnb data was collected from Jojie, AIM’s supercomputer.
Data cleaning and preprocessing: To prepare for clustering, we cleaned the data. Columns that were unnecessary for clustering were dropped, categorical columns were remapped to numerical values, and price columns, originally stored as strings, were converted to numerical format. The preprocessing step was fairly lengthy, so I’d suggest checking the HTML file for the details.
Data standardization: We tried three scaling methods on the data, namely StandardScaler, MinMaxScaler, and RobustScaler. MinMaxScaler produced the best final output.
Principal Component Analysis: We used Scikit-Learn’s PCA (Principal Component Analysis) to reduce the dimensions for better interpretability, less complexity, and easier visualization.
Clustering: Scikit-Learn’s KMeans and PyClustering’s KMedians were run with different numbers of clusters and validated using inertia (the within-cluster sum of squared errors) and the Calinski-Harabasz (CH) score (the ratio of between-cluster dispersion to within-cluster dispersion).
Exploratory data analysis on the various clusters: Using the columns we had for analysis, we then tried to find patterns in the clusters.
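
The preprocessing, scaling, PCA, and cluster-validation steps above can be sketched roughly as follows. This is a minimal illustration and not our actual notebook: the data is randomly generated, the column names and the room_type mapping are assumptions, and only KMeans is shown (the PyClustering KMedians run is left out).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the listings table (random values, not the real data).
rng = np.random.default_rng(0)
listings = pd.DataFrame({
    "price": [f"${v:,.2f}" for v in rng.uniform(3000, 40000, 300)],
    "accommodates": rng.integers(1, 12, 300),
    "review_scores_rating": rng.uniform(80, 100, 300),
    "room_type": rng.choice(
        ["Entire home/apt", "Private room", "Shared room"], 300
    ),
})

# Price strings such as "$12,345.00" are converted to floats.
listings["price"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Categorical values are remapped to integers (this mapping is an assumption).
room_map = {"Shared room": 0, "Private room": 1, "Entire home/apt": 2}
listings["room_type"] = listings["room_type"].map(room_map)

# Scale every feature to [0, 1], then project onto two principal components.
scaled = MinMaxScaler().fit_transform(listings)
components = PCA(n_components=2, random_state=0).fit_transform(scaled)

# Compare candidate cluster counts by inertia and Calinski-Harabasz score.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
    scores[k] = {
        "inertia": model.inertia_,
        "ch": calinski_harabasz_score(components, model.labels_),
    }

for k, s in scores.items():
    print(k, round(s["inertia"], 2), round(s["ch"], 1))
```

Inertia always tends to fall as the number of clusters grows, so it is usually read for an "elbow", while a higher CH score indicates better-separated clusters. Swapping MinMaxScaler for StandardScaler or RobustScaler in the same sketch is one way to compare the scaling methods.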

Insights

From our EDA, we found the following clusters of listings in Airbnb Tokyo.

Feature	Cluster 0	Cluster 1	Cluster 2
Property Type	Apartments, hostels, houses, aparthotels	Apartments, houses	Apartments, houses, condominiums
Price	Least cost-efficient	Cost-efficient for big groups	Cost-efficient for small- to average-sized groups
Rating	Lowest rated overall	Highest rated overall	Good ratings
Amenities	Lacking in amenities	Family-friendly amenities	Similar to Cluster 1, but also offers amenities for business travelers

The majority of listings available in Tokyo are apartments, but each cluster contained a unique mix of property types. Cluster 1 contained more large spaces, as it had more houses, while Cluster 2 had the highest proportion of apartments.
Price tiering was found in the clusters. Cluster 0 had the highest mean price and the lowest mean number of allowable guests, suggesting that these listings are not cost-efficient. Cluster 1 had the highest mean number of allowable guests, making it more cost-efficient than Cluster 0. Finally, Cluster 2 had the lowest mean price.
Guests appear to favor Cluster 0 listings the least, as Cluster 0 had the lowest mean overall rating. Cluster 1 had the highest mean overall rating, while Cluster 2 performed fairly well across specific rating categories.
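
The per-cluster comparisons above boil down to grouping the labelled listings by cluster and comparing means. Below is a small sketch of that idea. The numbers are made up purely for illustration (chosen only to mirror the ordering described above), the column names are assumptions, and price_per_guest is a hypothetical helper metric that was not part of our original analysis.

```python
import pandas as pd

# Hypothetical labelled listings; every value here is invented for illustration.
labelled = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "price": [26000.0, 24000.0, 22000.0, 18000.0, 9000.0, 7000.0],
    "accommodates": [2, 2, 8, 8, 4, 4],
    "review_scores_rating": [88.0, 90.0, 98.0, 97.0, 95.0, 94.0],
})

# Mean of each feature per cluster.
profile = labelled.groupby("cluster").mean()

# Rough cost-efficiency proxy: mean price divided by mean guest capacity.
profile["price_per_guest"] = profile["price"] / profile["accommodates"]

print(profile)
```

A high mean price paired with a low guest capacity shows up as a high price per guest, which is the pattern we describe as not cost-efficient.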

Nika Espiritu

