So last week our research mentor organized a mini hackathon for the interns. We were given a dataset describing various qualities (pH, sweetness, sulfur content, etc.) of roughly 800 different wines. The dataset also labeled each of the wines as either Good or Bad. The challenge for us was to use machine learning on the given dataset to predict whether each wine in a separate test set (which lacks the Good or Bad label) is Good or Bad.
Participating in this small data science challenge exposed me to my first actual use of cross-validation, a technique in which you estimate the accuracy of your prediction model using only the original data you were given, by holding part of it back for testing and training on the rest. This was also my first use of a Support Vector Machine (SVM), which is a tool for supervised machine learning that attempts to draw boundaries between groups of data points so that new points can be categorized.
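To make the cross-validation idea a bit more concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It is not my hackathon script: the toy data is made up, the helper names (`k_fold_indices`, `fit_centroids`, `predict`, `cross_validate`) are hypothetical, and a simple nearest-centroid classifier stands in for the SVM so that the example runs with the standard library alone.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(rows, labels):
    """Stand-in model (not an SVM): the per-class mean of every feature."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(rows, labels, k=5, seed=0):
    """Return the held-out accuracy for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        # Train only on rows outside the current fold.
        model = fit_centroids([rows[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        # Score only on the rows that were held back.
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Made-up toy data (NOT the hackathon dataset): two features per "wine".
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    rows.append([rng.gauss(3.2, 0.1), rng.gauss(2.0, 0.5)])
    labels.append("Good")
    rows.append([rng.gauss(3.6, 0.1), rng.gauss(4.0, 0.5)])
    labels.append("Bad")

scores = cross_validate(rows, labels, k=5)
print("mean accuracy:", sum(scores) / len(scores))
```

Every row gets used for testing exactly once, and the model never sees the rows it is being scored on, which is why the averaged score is a fairer estimate than accuracy on the training data. In practice you would probably reach for a library here; scikit-learn, for example, provides `cross_val_score` and an `SVC` classifier.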
In the end I landed in 5th place out of the 12 interns with an accuracy of 71%, which isn’t too shabby. Feel free to take a look at my Python script, which I whipped up with the help of some of my peers.
Here’s a quick snippet of my code that captures the simple approach I took: