2 minute read

This mini project was created alongside Daryll Tumambing for our Data Mining and Wrangling class held from May to August 2020. We were tasked to choose any dataset from the Inside Airbnb dataset and perform descriptive analytics by solving a clustering problem.

Research question

We particularly got interested in superhost listings in Airbnb Tokyo. Given the highly varied market of guests, what are the different types of superhost listings? How are these types similar and different from one other?

About the data

We used the December 30, 2019 crawl date of the Inside Airbnb Tokyo listings dataset, which corresponds to all 14,550 listings that have existed since Airbnb began in Tokyo.

Methodology

  1. Data collection: Inside Airbnb data was collected from Jojie’ AIM’s supercomputer.
  2. Data cleaning and preprocessing: To prepare for clustering, we cleaned the data. Some columns were dropped in the process as they were unnecessary for clustering. Categorical columns were also remapped to numerical values. Columns regarding price were originally in string format and then converted into numerical format. We had a fairly lengthy data pre-processing step, so I’d suggest to check the HTML file instead.
  3. Data standardization: There were 3 scaling methods that were tried in the data, namely StandardScaler, MinMaxScaler and RobustScaler. MinMaxScaler provided the best end output.
  4. Principal Component Analysis: We used Scikit-Learn’s PCA (Principal Component Analysis) to reduce the dimensions for better interpretability, less complexity, and easier visualization.
  5. Clustering: Scikit-Learn’s KMeans and PyClustering’s KMedians were performed and validated using different number of clusters inertias (within-cluster sum of squared errors) and Calinski-Harabasz scores (CH) (ratio between the within-cluster dispersion and the between-cluster dispersion)
  6. Exploratory data analysis on the various clusters: Using the columns we had for analysis, we then tried to find patterns in the clusters.

Insights

From our EDA, we have found the following clusters of listings in Airbnb Tokyo.

Feature Cluster 0 Cluster 1 Cluster 2
Property Type Apartments, hostels, houses, aparthotels Apartments, houses Apartments, houses, condominiums
Price Least cost efficient Cost-efficient for big groups Cost efficient for small- to average-sized groups
Rating Lowest rated overall Highest rated overall Good ratings
Amenities Lacking in amenities Family-friendly amenities Similar to Cluster 1, but also offers amenities for business travelers
  1. Majority of listings available in Tokyo are apartments, but each cluster contained a unique mix of property types. Cluster 1 contained more larger spaces as it had more houses, while Cluster 2 had the highest proportion of apartment spaces.
  2. Price tiering was found in the clusters. Cluster 0 had the highest mean price and the lowest mean number of allowable guests, thus suggesting that these are not cost efficient. Cluster 1 has the highest mean number of allowable guests, thus making it more cost efficient than Cluster 0 listings. Finally, Cluster 2 had the lowest mean price.
  3. Guests agree that Cluster 0 listings are to be avoided, as Cluster 0 had the lowest mean overall rating. Cluster 1 ranked the highest mean overall, while Cluster 2 performed fairly well across specific rating categories.