All source code should be stored in a version control system (Git or Mercurial).
You will need to provide the following:
- a term paper in paper form;
- a term paper in electronic form;
- a zip archive containing your source code repository with full commit history.
Please note that the zip file GitHub offers for download contains only the latest revision, not the whole repository. To obtain the repository with its full commit history, you need to clone it.
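After cloning (for Git, `git clone` followed by the repository URL), the working copy can be packed together with its history. The following is a minimal sketch, not a required procedure; the function name `archive_repository` and the file names in the usage note are made up for illustration.

```python
import shutil
from pathlib import Path


def archive_repository(repo_dir, archive_stem):
    """Zip a cloned working copy together with its history directory."""
    repo_path = Path(repo_dir)
    # A clone keeps its history in .git (Git) or .hg (Mercurial);
    # a snapshot of the latest revision contains neither.
    if not any((repo_path / name).is_dir() for name in (".git", ".hg")):
        raise ValueError(f"{repo_path} has no .git or .hg directory")
    # make_archive appends ".zip" to archive_stem and returns the path.
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=repo_path.parent,
        base_dir=repo_path.name,
    )
```

For example, `archive_repository("term-paper", "submission")` would write `submission.zip` containing `term-paper/` and its `.git` directory. Keep the archive outside the repository directory so that it is not packed into itself.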
Term paper structure
- Title page: 1 page
- Contents: 1 page
- Introduction: 1 page. A short introduction to the domain. Describe the tools you used for development.
- Main content: 30 pages
- Conclusion: 1 page
- References: 1 page
Each figure should be accompanied by a text description.
Each figure should be labeled and numbered using automatic numbering.
Each figure should be referenced from the text.
A figure should not immediately follow a heading.
Text on figures should be readable. Split big figures into smaller ones.
All headings of the same level should be formatted identically (font, spacing, alignment). Do not use empty lines to adjust the spacing before or after headings.
Main text should have the same formatting throughout the paper (font, spacing, alignment).
All headings, except "Contents", "Introduction", "Conclusion", "References", should be numbered using automatic numbering.
The table of contents should be generated automatically and updated before printing.
Each page except the title page should be numbered.
Do not use multiple spaces or line feeds in a row for formatting.
Grading criteria
Code Functionality
- Does the code work?
All code is functional and produces no errors when run.
The code given is sufficient to reproduce the results described.
- Does the project use NumPy and Pandas appropriately?
The project uses NumPy arrays and Pandas Series and DataFrames where appropriate rather than Python lists and dictionaries.
Where possible, vectorized operations and built-in functions are used instead of loops.
- Does the project use good coding practices?
The code makes use of classes and functions to avoid repetitive code.
The code contains good comments and variable names, making it easy to read.
The code conforms to PEP8.
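As an illustration of the criteria above, here is a minimal sketch of vectorized Pandas code organized into small, documented functions. The data, the column names and both helper functions are hypothetical.

```python
import pandas as pd


def add_bmi_column(people):
    """Return a copy of `people` with a body-mass-index column."""
    result = people.copy()
    # One expression over whole columns replaces a row-by-row loop.
    result["bmi"] = result["weight_kg"] / result["height_m"] ** 2
    return result


def standardize(values):
    """Rescale a numeric Series to zero mean and unit deviation."""
    return (values - values.mean()) / values.std()


people = pd.DataFrame({
    "weight_kg": [60.0, 80.0, 100.0],
    "height_m": [2.0, 2.0, 2.0],
})
with_bmi = add_bmi_column(people)
# The same helper is applied to each column instead of repeating
# the formula for every variable.
z_scores = with_bmi[["weight_kg", "bmi"]].apply(standardize)
```

The loop-based equivalent would iterate over rows and append results to a Python list; the column expression is shorter and usually much faster on large data.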
Quality of Analysis
- Are questions clearly posed?
The project clearly states one or more questions, then addresses those questions in the rest of the analysis.
Data Wrangling Phase
- Is the data cleaning well documented?
The project documents any changes that were made to clean the data, such as merging multiple files, handling missing values, etc.
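One possible way to keep the cleaning documented is to record each step next to the code that performs it. The sketch below assumes two small hypothetical tables standing in for raw files; the `cleaning_log` list is an illustrative convention, not a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs standing in for two raw files.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "energy_kcal": [250.0, np.nan, 90.0],
})
categories = pd.DataFrame({
    "product_id": [1, 2, 4],
    "category": ["snack", "drink", "dairy"],
})

cleaning_log = []  # human-readable record of every cleaning step

# Step 1: merge the files; indicator=True marks rows without a match.
merged = products.merge(
    categories, on="product_id", how="left", indicator=True
)
unmatched = int((merged["_merge"] == "left_only").sum())
cleaning_log.append(
    f"Merged products with categories; {unmatched} had no category."
)
merged = merged.drop(columns="_merge")

# Step 2: drop rows with a missing energy value and record the loss.
rows_before = len(merged)
cleaned = merged.dropna(subset=["energy_kcal"])
cleaning_log.append(
    f"Dropped {rows_before - len(cleaned)} row(s) without energy_kcal."
)
```

The log entries can then be quoted in the paper, so the reader sees how many rows each decision affected.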
Exploration Phase
- Is the data explored in many ways?
The project investigates the stated questions from multiple angles.
At least one dependent variable and three independent variables are investigated using both single-variable (1d) and multiple-variable (2d) explorations.
- Are there a variety of relevant visualizations and statistical summaries?
The project's visualizations are varied and show multiple comparisons and trends.
Relevant statistics are computed throughout the analysis when an inference is made about the data.
At least two kinds of plots should be created as part of the explorations.
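The difference between single-variable and multiple-variable exploration can be sketched as follows. The data set is invented: `salary` plays the role of the dependent variable, and the other three columns are the independent variables.

```python
import pandas as pd

# Hypothetical data: one dependent and three independent variables.
staff = pd.DataFrame({
    "salary": [40, 50, 60, 70, 80, 90],
    "years_experience": [1, 2, 3, 4, 5, 6],
    "department": ["ops", "ops", "ops", "dev", "dev", "dev"],
    "remote": [False, True, False, True, False, True],
})

# Single-variable (1d): the distribution of each variable on its own.
salary_summary = staff["salary"].describe()
department_counts = staff["department"].value_counts()

# Multiple-variable (2d): the dependent variable against another one.
mean_salary_by_department = staff.groupby("department")["salary"].mean()
salary_experience_corr = staff["salary"].corr(staff["years_experience"])
```

Each summary here would normally be paired with a plot of the same variables, for instance a histogram for the 1d case and a scatter plot for the 2d case.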
Conclusions Phase
- Has the student correctly communicated the tentativeness of the findings?
The results of the analysis are presented such that any limitations are clear.
The analysis does not state or imply that one change causes another based solely on a correlation.
The project uses statistical tests to draw rigorous conclusions where appropriate.
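As one example of a statistical test, the sketch below implements a two-sided permutation test for a difference in group means using NumPy only. It is an illustration, not the test the project must use; the function name and its parameters are made up, and `Generator.permuted` assumes NumPy 1.20 or later.

```python
import numpy as np


def permutation_test_mean_difference(group_a, group_b,
                                     n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    # Each row is an independent random relabelling of the pooled data.
    shuffled = rng.permuted(
        np.tile(pooled, (n_permutations, 1)), axis=1
    )
    n_a = group_a.size
    differences = (shuffled[:, :n_a].mean(axis=1)
                   - shuffled[:, n_a:].mean(axis=1))
    count_extreme = int(np.sum(np.abs(differences) >= abs(observed)))
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (count_extreme + 1) / (n_permutations + 1)
    return float(observed), p_value
```

A small p-value indicates that the observed difference is unlikely under random group labels. It does not show that group membership causes the difference, so the limitations still have to be stated in the text.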
Communication Phase
- Is the flow of the analysis easy to follow?
Reasoning is provided for each analysis decision, plot, and statistical summary.
- Is the data visualized using appropriate plots and parameter choices?
Visualizations made in the project depict the data in an appropriate manner that allows plots to be readily interpreted.
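A minimal sketch of two kinds of plots with explicit parameter choices is given below. It assumes Matplotlib is installed; the data is synthetic, and the bin count, marker size and transparency are example values rather than recommendations.

```python
import matplotlib.pyplot as plt
import numpy as np

plt.switch_backend("Agg")  # render off-screen, no display required

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram for one variable: the bin count is chosen explicitly.
ax_hist.hist(height_cm, bins=20)
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Number of people")
ax_hist.set_title("Distribution of height")

# Scatter plot for two variables: transparency reveals overlap.
ax_scatter.scatter(height_cm, weight_kg, s=15, alpha=0.5)
ax_scatter.set_xlabel("Height (cm)")
ax_scatter.set_ylabel("Weight (kg)")
ax_scatter.set_title("Weight against height")

fig.tight_layout()
```

Labeled axes with units and a descriptive title let each figure be read without the surrounding text; `fig.savefig` can then export the figure for the paper.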
Example subjects
- Explore and visualize the Open Food Facts dataset
Download dataset
- Principal component analysis of the Human Resources Analytics dataset
Download dataset
- Film recommendation engine
Download dataset