Classifying Forest Cover Type
In this assignment, you will analyze the Forest Cover data set, build
a few different classifiers to predict the forest cover type of a parcel of
land, and evaluate the performance of those classifiers. Although we have
previously used the forest cover data from
Kaggle, here we will
use the larger, more complete forest cover data set from
the UCI data repository.
While you will be iteratively wrangling the data, selecting relevant features,
tweaking classifiers, and evaluating the results, your final report (in the form
of a Jupyter Notebook) should only describe the most essential parts of your
efforts (i.e., do not describe in full detail all your missteps and detours).
Your report should discuss:
- goals or objectives of the work and why they are important
- how the data is preprocessed, if at all
- how features are selected
- what classification techniques are used and how their parameters are
selected or tuned
- how performance (i.e., accuracy) is evaluated
- the performance results
- discussion of any insights into the data set or the problem
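To make the expectations concrete, here is a minimal sketch of the kind of baseline comparison the report might contain. It uses synthetic data from `make_classification` as a stand-in for the covertype data (the sizes and parameters below are illustrative assumptions, not properties of the real data set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the covertype features/labels
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# Stratified split so the class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit two different classifiers and compare test-set accuracy
for clf in (DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{type(clf).__name__}: {acc:.3f}")
```

Reporting only the two accuracy numbers would not meet the bar set below; the narrative around a comparison like this is what matters.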
Other considerations:
- At least two different classification techniques must be used
- You may choose any classification techniques you like. If you do not
know which techniques to use, it is recommended that you stick with Decision
Trees and Nearest Neighbor.
- When splitting the data into a training set and a testing set, take care to
preserve the balance between the classes
- You can convert between categorical and binary attributes using
sklearn.preprocessing.LabelBinarizer
- I am looking for evidence of “thoughtful analysis”: I
need to see you trying to make sense of the data and that
does mean trying to understand the domain as well.
- Submissions that consist only of Python code without
any narrative to explain what you have done/observed and
make some sense out of it are not acceptable. Please put in
some narrative. I do not want long essays, but a few
sentences for each result/investigation.
- Submissions that consist only of a comparison of
accuracy between different models, without any
preprocessing, data exploration, or deeper insight, are not
acceptable. Please incorporate some work to analyze and
understand the data – this is NOT a Kaggle competition
where only accuracy matters.
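As a sketch of the LabelBinarizer conversion mentioned above, the snippet below one-hot encodes a hypothetical categorical column (the wilderness-area names are placeholders for illustration, not the actual values in the UCI data):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical categorical attribute, e.g. a wilderness-area column
areas = np.array(["Rawah", "Neota", "Rawah", "Comanche", "Neota"])

lb = LabelBinarizer()
onehot = lb.fit_transform(areas)  # one binary column per category
print(lb.classes_)                # categories in sorted order
print(onehot)                     # shape (5, 3): one row per sample
```

`lb.inverse_transform(onehot)` recovers the original labels, which is handy when going the other way (binary columns back to a single categorical attribute).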
Deliverables
- Jupyter Notebook (ipynb file) describing your work
- Any additional scripts you have used outside of the
Python/Jupyter environment
If you have problems running the processing within a
Jupyter Notebook, you may run the heavy-lifting Python code outside of
the notebook, but you must write up your work in a Jupyter
Notebook for submission.
Do not submit the data set with your submission, but you
must submit any code/scripts needed to run your analysis
starting from the data set downloaded from the website.
Submission Procedure
Submit your files via Laulima->Assignment.
If you have many files, you might want to zip them up into
one archive (zip and tgz are accepted; rar is NOT).