Data Mining
This assignment is about two things:
(1) doing a simple KNN (K-nearest neighbors) analysis using the built-in tools in scikit-learn (sklearn)
(2) making an ‘elbow’ plot.
(1)
(a)
Now, before we get into the data mining, let’s first ‘pre-process’ the data set. We pretend we only have access to 70 percent of the iris data set. If you think of the entire data set as a dataframe, you are only going to use the top 70% of the rows for your ‘data mining’ purposes. You will reserve the bottom 30% of the rows as the data set for checking how well your model performs. To make it more relatable, imagine the data vendor tells you that the bottom 30% is premium data, which you would have to pay a lot of money to access. For the sake of clarity, the top 70% of the data is called ‘seen_data_set’ and the bottom 30% is called ‘unseen_data_set’.
Please write a few lines of code to do this. It should be easy.
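One possible way to do the split is sketched below. It assumes the iris data is loaded through sklearn's load_iris with as_frame=True (the assignment does not say how the data is obtained), and the names iris_df and cutoff are only illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: the iris data comes from sklearn's bundled copy.
# as_frame=True returns a pandas DataFrame; the "target" column holds the class label.
iris_df = load_iris(as_frame=True).frame

# Row position that separates the top 70% from the bottom 30%.
# Integer arithmetic avoids any floating-point rounding surprises.
cutoff = len(iris_df) * 7 // 10

seen_data_set = iris_df.iloc[:cutoff]    # top 70% of the rows
unseen_data_set = iris_df.iloc[cutoff:]  # bottom 30% of the rows

print(seen_data_set.shape, unseen_data_set.shape)
```

The slice is purely positional, so the rows keep their original order. It is worth looking at which class labels end up in each part before moving on.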
(b)
Now we are going to further split this seen_data_set 80/20 into a training set and a test set. Make sure to use random_state = 1234 in the split so that the result is reproducible. Set the K value to 5. Find the corresponding test_score, which measures the precision of your prediction on the test data set.
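A minimal sketch of this step follows. It assumes the same load_iris source and 70% cutoff as in part (a), and it assumes that test_score means the value returned by the classifier's score method, which for KNeighborsClassifier is the mean accuracy. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild seen_data_set as in part (a) (assumed data source: sklearn's iris).
iris_df = load_iris(as_frame=True).frame
cutoff = len(iris_df) * 7 // 10
seen_data_set = iris_df.iloc[:cutoff]

# Separate the feature columns from the class label.
X_seen = seen_data_set.drop(columns="target")
y_seen = seen_data_set["target"]

# 80/20 split of the seen data; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_seen, y_seen, test_size=0.2, random_state=1234
)

# Fit KNN with K = 5 on the training part only.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# score() returns the mean accuracy on the given data (assumed meaning of test_score).
test_score = knn.score(X_test, y_test)
print(f"test_score (K=5): {test_score:.3f}")
```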
(2)
Please write a short report to answer this question.
The test_score you obtain in the first part is probably very good. But how does the K value affect the precision of your prediction? Suppose the K value varies from 3 to 15. Also, how good is our mining model when we apply it to the ‘unseen_data_set’? Imagine the bottom 30% of the original data is so prohibitively expensive that you have to predict it yourself. Make a plot where the X axis shows how the K value varies and the Y axis shows the precision score on the test_data_set of the ‘seen_data_set’.
Make a second plot where the X axis shows how the K value varies and the Y axis shows the precision score when you apply your model to the unseen data set. Write the code required to make these two plots, include the two plots in the report, and make a few comments on both of them.
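The two plots could be produced along the following lines. This is only a sketch under the same assumptions as before: iris is loaded via load_iris, the score is the mean accuracy from the classifier's score method, and the output file name knn_scores_vs_k.png is a made-up placeholder.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the seen/unseen split from part (1)(a) (assumed data source: sklearn's iris).
iris_df = load_iris(as_frame=True).frame
cutoff = len(iris_df) * 7 // 10
seen_data_set = iris_df.iloc[:cutoff]
unseen_data_set = iris_df.iloc[cutoff:]

X_seen, y_seen = seen_data_set.drop(columns="target"), seen_data_set["target"]
X_unseen, y_unseen = unseen_data_set.drop(columns="target"), unseen_data_set["target"]

# Same 80/20 split of the seen data as in part (1)(b).
X_train, X_test, y_train, y_test = train_test_split(
    X_seen, y_seen, test_size=0.2, random_state=1234
)

# Try every K from 3 to 15 inclusive and record both scores.
k_values = list(range(3, 16))
test_scores, unseen_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    test_scores.append(knn.score(X_test, y_test))        # test part of seen_data_set
    unseen_scores.append(knn.score(X_unseen, y_unseen))  # unseen_data_set

# Two panels side by side: one per data set.
fig, (ax_test, ax_unseen) = plt.subplots(1, 2, figsize=(10, 4))
ax_test.plot(k_values, test_scores, marker="o")
ax_test.set(xlabel="K", ylabel="score", title="Test part of seen_data_set")
ax_unseen.plot(k_values, unseen_scores, marker="o")
ax_unseen.set(xlabel="K", ylabel="score", title="unseen_data_set")
fig.tight_layout()

# Hypothetical file name; the saved image can be pasted into the report.
fig.savefig("knn_scores_vs_k.png")
```

When commenting on the plots, it helps to compare the class labels present in seen_data_set with those present in unseen_data_set, since that affects how the two curves should be read.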