Goal
Predict whether an individual’s annual income is “<=50K” or “>50K”, and explore how multiple covariates relate to income.
Contributions
- Used the Census Income Data Set from the UCI Machine Learning Repository to build the models
- Performed exploratory data analysis: examined the characteristics of each variable, plotted each variable against income, and checked the correlations between variables
- Extracted a sample dataset, tuned the Logistic Regression, CART, Random Forest, and AdaBoost models on it, and then trained the four tuned models on the complete dataset
- Ran the optimal models on the test data and compared multiple metrics; CART and Random Forest performed best (balanced accuracy of 0.8236)
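
Balanced accuracy, the metric quoted above, is the mean of the per-class recalls, so it is not inflated by the majority class the way plain accuracy can be when one income class dominates. The sketch below is a minimal pure-Python illustration of that definition, not the code used in the project. The function name `balanced_accuracy` and the toy labels are assumptions made up for this example and are not taken from the Census Income data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration only (not project results).
y_true = ["<=50K"] * 6 + [">50K"] * 2
y_pred = ["<=50K"] * 5 + [">50K"] + [">50K", "<=50K"]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy: {score:.4f}, plain accuracy: {plain:.4f}")
```

On these toy labels the recall is 5/6 for “<=50K” and 1/2 for “>50K”, so the balanced accuracy (2/3) is lower than the plain accuracy (3/4), which shows why the metric suits an imbalanced target.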