Classifying Forest Cover Type
In this assignment, you will analyze the Forest Cover data set, build
a few different classifiers to predict the forest cover type of a parcel of
land, and evaluate the performance of those classifiers. Although we have
previously used the forest cover data from
Kaggle, here we will
use the larger, more complete forest cover data set from
the UCI data repository.
While you will be iteratively wrangling the data, selecting relevant features,
tweaking classifiers, and evaluating the results, your final report (in the form
of a Jupyter Notebook) should only describe the most essential parts of your
efforts (i.e., do not describe in full detail all your missteps and detours).
Your report should discuss:
- goals or objectives of the work and why they are important
- how the data is preprocessed, if at all
- how features are selected
- what classification techniques are used and how their parameters are
selected or tuned
- how performance (i.e., accuracy) is evaluated
- the performance results
- discussion of any insights into the data set or the problem
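To make the expectations concrete, here is a minimal sketch of the kind of baseline comparison the report might contain. It uses synthetic data from `make_classification` as a stand-in for the covertype data (the sizes and parameters below are illustrative assumptions, not properties of the real data set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the covertype features/labels
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# Stratified split so the class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit two different classifiers and compare test-set accuracy
for clf in (DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{type(clf).__name__}: {acc:.3f}")
```

Reporting only the two accuracy numbers would not meet the bar set below; the narrative around a comparison like this is what matters.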
Other considerations:
- At least two different classification techniques must be used
- You may choose any classification techniques you like. If you do not
know which techniques to use, it is recommended that you stick with Decision
Trees and Nearest Neighbor.
- When splitting the data into a training set and a testing set, take care to
preserve the balance between the classes
- You can convert between categorical and binary attributes using
sklearn.preprocessing.LabelBinarizer
- I am looking for evidence of “thoughtful analysis”: I
need to see you trying to make sense of the data and that
does mean trying to understand the domain as well.
- Submissions that consist only of Python code without
any narrative to explain what you have done/observed and
make some sense out of it are not acceptable. Please put in
some narrative. I do not want long essays, but a few
sentences for each result/investigation.
- Submissions that consist only of a comparison of
accuracy between different models, without any
preprocessing, data exploration, or deeper insight, are not
acceptable. Please incorporate some work to analyze and
understand the data – this is NOT a Kaggle competition
where only accuracy matters.
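As a sketch of the LabelBinarizer conversion mentioned above, the snippet below one-hot encodes a hypothetical categorical column (the wilderness-area names are placeholders for illustration, not the actual values in the UCI data):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical categorical attribute, e.g. a wilderness-area column
areas = np.array(["Rawah", "Neota", "Rawah", "Comanche", "Neota"])

lb = LabelBinarizer()
onehot = lb.fit_transform(areas)  # one binary column per category
print(lb.classes_)                # categories in sorted order
print(onehot)                     # shape (5, 3): one row per sample
```

`lb.inverse_transform(onehot)` recovers the original labels, which is handy when going the other way (binary columns back to a single categorical attribute).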
Deliverables
- Jupyter Notebook (ipynb file) describing your work
- Any additional scripts you have used outside of the
Python/Jupyter environment
If you have problems running the processing within a
Jupyter Notebook, you may run the heavy-lifting Python code outside of
the notebook, but you must write up your work in a Jupyter
Notebook for submission.
Do not submit the data set with your submission, but you
must submit any code/scripts needed to run your analysis
starting from the data set downloaded from the website.
Submission Procedure
Submit your files via Laulima->Assignment.
If you have many files, you might want to zip them up into
one archive (zip and tgz are accepted; rar is NOT).