COVID-19, first reported in the Chinese city of Wuhan in December 2019, has spread to 180 countries, according to data compiled by Johns Hopkins University. As of April 30, 2020, 3,308,233 infections and 234,105 deaths had been reported.
As the number of confirmed COVID-19 cases grows, we are interested in how that number differs across countries. Some countries are at higher risk and, thus, are more likely to experience the negative impacts of COVID-19. Which countries are more vulnerable to COVID-19? The primary goal of this analysis is to identify country-level risk factors for COVID-19.
First, we looked at the top 10 countries by number of confirmed COVID-19 cases.
Then, several questions arose in our minds: Why are these countries vulnerable to COVID-19? What are the similarities among these countries? Can publicly available data, combined with Machine Learning, explain this?
To answer these questions, we set out to collect data on risk factors (more than 30, covering demographic, socioeconomic, environmental, and underlying health issues) and examined the relationship between these risk factors and the number of confirmed COVID-19 cases by country.
After data collection, we applied Machine Learning models to see whether they could explain the differences in the number of confirmed COVID-19 cases, in the hope that any findings might shed light on the most important risk factors in this pandemic and on potential key factors for COVID-19 stabilization.
We started by making a cluster map for the data set. In this figure, some factors clearly cluster together by similarity, which allowed us to quickly identify which risk factors are correlated with the number of confirmed COVID-19 cases. The tree diagrams (dendrograms) on the left and top show how the factors are grouped.
In the initial run, we found that the number of confirmed COVID-19 cases is highly correlated with the number of air transportation passengers, tourism expenditure (in millions of US dollars), and the number of arriving tourists/visitors (in thousands). The scatter plot above shows this correlation.
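The kind of correlation check described above can be sketched as follows. This is only an illustration: the numbers are made up (they are not the article's data), the variable names are hypothetical, and the use of a log scale is an assumption based on the log transformation mentioned later in the pre-processing step.

```python
import numpy as np

# Hypothetical toy values for six countries (NOT the article's data).
confirmed_cases = np.array([1200.0, 150.0, 9800.0, 430.0, 56000.0, 2100.0])
air_passengers = np.array([3.1e6, 0.4e6, 2.2e7, 1.5e6, 9.0e7, 5.2e6])

def pearson_log(x, y):
    """Pearson correlation between log1p-transformed x and y."""
    lx, ly = np.log1p(x), np.log1p(y)
    return float(np.corrcoef(lx, ly)[0, 1])

r = pearson_log(confirmed_cases, air_passengers)
print(f"Pearson r on the log scale: {r:.3f}")
```

A value of r close to 1 means the two quantities rise together; it does not by itself say that one causes the other.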
Machine Learning Model Building
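We chose a Random Forest Regression model for these reasons:
1. The Random Forest algorithm can be used for both classification and regression tasks.
2. It usually gives higher accuracy than a single decision tree, because it averages the predictions of many trees.
3. It tolerates imperfect data reasonably well, although whether missing values are handled natively depends on the implementation; in many libraries they have to be imputed first.
4. Adding more trees does not make the model overfit more; averaging over many trees reduces variance, though individual trees can still fit noise.
5. It can handle a large data set with many features (high dimensionality).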
A Random Forest is an ensemble technique capable of performing both regression and classification tasks by combining multiple decision trees with a technique called Bootstrap Aggregation, commonly known as bagging. What is bagging, you may ask? In the Random Forest method, bagging means training each decision tree on a different sample of the data, where the sampling is done with replacement.
The basic idea is to combine many decision trees to determine the final output rather than relying on any individual tree. For regression, the forest's prediction is the average of the trees' predictions.
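To make the bagging idea concrete, here is a minimal sketch that builds a small ensemble by hand: each tree is trained on a bootstrap sample, and the predictions are averaged. It assumes scikit-learn's DecisionTreeRegressor as the base learner and uses synthetic data; the helper names fit_bagged_trees and predict_bagged are hypothetical. A real Random Forest also considers a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        tree = DecisionTreeRegressor(random_state=int(rng.integers(0, 2**31 - 1)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_bagged(trees, X):
    """Average the predictions of all trees (the regression case)."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Synthetic demo data: the target depends mainly on the first feature.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

trees = fit_bagged_trees(X_demo, y_demo)
preds = predict_bagged(trees, X_demo)
print(preds[:5])
```

Because every tree sees a slightly different sample, their individual errors partly cancel out when averaged.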
We built a Random Forest Regression model using Python. (You can find the code here.)
We followed the traditional machine learning pipeline: importing the dataset, data pre-processing (missing values, label encoding, log transformation, etc.), variable selection, training the machine learning model, and evaluation.
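The pipeline steps listed above might look roughly like this in scikit-learn. This is a sketch under stated assumptions, not the article's actual code: the data is synthetic, median imputation and a simple correlation filter stand in for whatever missing-value handling and variable selection were really used, and label encoding is skipped because the stand-in data is purely numeric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Import the dataset (here: a synthetic stand-in with 5 non-negative factors).
X = rng.lognormal(mean=5.0, sigma=1.0, size=(150, 5))
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.lognormal(1.0, 0.5, size=150)
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Pre-processing: impute missing values (assumption: median), then log-transform.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train = np.log1p(imputer.transform(X_train))
X_test = np.log1p(imputer.transform(X_test))
y_train_log, y_test_log = np.log1p(y_train), np.log1p(y_test)

# 3. Variable selection (assumption: keep the 3 features most correlated
#    with the target on the training split).
corr = np.abs([np.corrcoef(X_train[:, j], y_train_log)[0, 1]
               for j in range(X_train.shape[1])])
keep = np.argsort(corr)[::-1][:3]

# 4. Train the model.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train[:, keep], y_train_log)

# 5. Evaluate on held-out data.
r2 = r2_score(y_test_log, model.predict(X_test[:, keep]))
print(f"R-squared on held-out data (log scale): {r2:.3f}")
```

The imputer and the variable selection are fitted on the training split only, so no information from the test split leaks into the model.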
To visualize the random forest analysis, we selected one tree in the forest and saved the whole tree as an image.
Single Full Decision Tree in Forest
This tree is a bit too large to digest easily, so we limited the depth of the trees in the forest to 3 to produce an understandable image.
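Pulling a single depth-limited tree out of a forest can be sketched as below. The data and the feature names are placeholders, and the tree is printed as text with scikit-learn's export_text rather than saved as an image as in the article; sklearn.tree.plot_tree or export_graphviz would be the image-producing counterparts.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic placeholder data; the feature names are hypothetical.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
feature_names = ["factor_a", "factor_b", "factor_c", "factor_d"]

# Limit every tree in the forest to depth 3 so a single tree stays readable.
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Pick one tree out of the fitted forest and render its split rules as text.
single_tree = forest.estimators_[0]
tree_text = export_text(single_tree, feature_names=feature_names)
print(tree_text)
```

Note that limiting the depth makes the picture readable but also makes each tree less expressive, so a depth-limited forest is not necessarily the same model as the full one.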
R2 of Random Forest Regression: 0.9588846597997682
As the last step, we checked how well our model fit the test data. Among the various metrics for evaluating a regression analysis (R-squared, MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and RMSLE (Root Mean Squared Log Error)), we chose RMSLE because we used the log transformation of the actual values. Here is a nice explanation of the differences between RMSE and RMSLE.
In the performance evaluation, our model achieved an RMSLE of 0.22 and an R-squared of 0.95.
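For readers unfamiliar with the difference between the two metrics, here is a small sketch using the common log1p-based definition of RMSLE. The numbers are invented for illustration: every prediction is 50% too high, but the countries differ in size by orders of magnitude.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes absolute differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error: penalizes relative (ratio) differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Invented case counts: each prediction overshoots by 50%.
actual = [100, 1_000, 100_000]
predicted = [150, 1_500, 150_000]

print(f"RMSE:  {rmse(actual, predicted):.1f}")
print(f"RMSLE: {rmsle(actual, predicted):.3f}")
```

RMSE is dominated by the largest country, while RMSLE treats the same relative error similarly at every scale, which suits case counts that span several orders of magnitude.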
The top 3 most important features selected by the Random Forest Regression are the number of tourist/visitor arrivals (in thousands), tourism expenditure (in millions of US dollars), and the number of air transport passengers carried.
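Ranking features by importance can be sketched with scikit-learn's feature_importances_ attribute. The data below is synthetic and deliberately constructed so that the first three columns drive the target; the column names are hypothetical labels, and the output says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical column names for a synthetic data set.
feature_names = [
    "tourist_arrivals", "tourism_expenditure", "air_passengers",
    "median_age", "population_density",
]
X = rng.normal(size=(300, 5))
# By construction, only the first three columns influence the target.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from largest to smallest.
order = np.argsort(model.feature_importances_)[::-1]
top3 = [feature_names[i] for i in order[:3]]
for i in order:
    print(f"{feature_names[i]:>20s}: {model.feature_importances_[i]:.3f}")
```

Impurity-based importances can be skewed when features are strongly correlated with each other, as tourism and air travel figures likely are, so the exact ordering among such features should be read with some caution.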
Based on these results, we checked the top ten countries by number of air transport passengers carried. Half of them (US, UK, Turkey, Germany, Brazil) have a high number of confirmed COVID-19 cases, while the other half (China, Ireland, India, Japan, Indonesia) have a lower number. These five lower-count countries made us wonder whether there are other factors at play that our dataset does not cover.
This analysis suggests that the key factors associated with the difference in the number of COVID-19 cases between countries are related to air transportation and tourism. Based on this, we believe mandatory testing and quarantine should apply to nearly all arrivals from overseas, including citizens. Machine learning models can be used in many ways to help combat COVID-19 across nations.