Project description

Timeline

10 Weeks

Topic Deadline GitHub folder & hints
topic-ideas Nov 18 references/
project-proposal Dec 02 references/
draft-analysis Dec 16 Notebooks/
peer-review-draft Dec 23 use the issue-template
report Jan 14 reports/
github-repo Jan 15 “Repo in final state”
presentation Jan 16 “Final presentation”

Introduction

TL;DR: Pick a data set and use the concepts and methods covered in our course (data from Kaggle is not allowed).

The goal of the project is for you to use analytical methods to analyze a data set of your own choosing. You need to perform both regression and classification analysis.

Choose the data based on your group’s interests or work you all have done in other courses or at your company. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this course (and beyond, if you like) and apply them to a data set to analyze it in a meaningful way.

All analyses must be done in Python, and all components of the project must be reproducible (with the exception of the final presentation) placed inside the provided GitHub project repo.

Logistics

  • You will need to work collaboratively using a GitHub Organization and GitHub Projects.

  • One team-member should create a GitHub Organization, clone the project repo to the organizations account and invite the other team members to this repo.

  • Use GitHub Projects to organize your teamwork.

  • Every team member needs to clone the repo to a local machine and contribute to the project. Make use of regular commits (this will be a relevant factor for your grade).

The primary deliverables for the project are (see files in project repo):

  • A draft analysis (“draft-analysis.ipynb”)
  • A written, reproducible report detailing your analysis (“report.ipynb”)
  • A GitHub organization with a Github Project and your project repo
  • Slides
  • Formal peer review on another team’s project
  • Final project presentation

Topic ideas

Identify 2 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.

The purpose of submitting project ideas is to give you time to find data for the project and to make sure you have a data set that can help you be successful in the project.

Therefore, you must use one of the data sets submitted as a topic idea, unless otherwise notified.

The data sets should meet the following criteria:

  • At least 100 observations

  • At least 10 columns

  • At least 6 of the columns must be useful and unique predictor variables.

    • Identifier variables such as “name”, “social security number”, etc. are not useful predictor variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
  • At least one variable that can be identified as a reasonable response variable.

    • You need two response variables: one quantitative for regression and one categorical for classification.
  • Observations should reasonably meet the independence condition. Therefore, avoid data with repeated measures, data collected over time, etc.

  • You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.

Please ask me if you’re unsure whether your data set meets the criteria.

For each data set, include the following:

Introduction and data

  • State the source of the data set.
  • Describe when and how it was originally collected (by the original data curator, not necessarily how you found the data)
  • Describe the observations and the general characteristics being measured in the data

Research question

  • Describe a research question you’re interested in answering using this data.

Overview of data

  • Use the Pandas functions to provide an overview of each data set

Submit the HTML or PDF of the topic ideas to Moodle.

Project proposal

The purpose of the project proposal is to help you think about your analysis strategy early.

Include the following in the proposal:

Introduction

The introduction section includes

  • an introduction to the subject matter you’re investigating
  • the motivation for your research question (citing any relevant literature)
  • the general research question you wish to explore
  • your hypotheses regarding the research question of interest.

Data description

In this section, you will describe the data set you wish to explore. This includes

  • description of the observations in the data set,
  • description of how the data was originally collected (not how you found the data but how the original curator of the data collected it).

Analysis approach

In this section, you will provide a brief overview of your analysis approach. This includes:

  • Description of the response variables.
  • Visualization and summary statistics for the response variables.
  • List of variables that will be considered as predictors
  • Your model types (what kind of model(s) will you use)

Data dictionary

Create a data dictionary for all the variables in your data set. Include the following information for every variable: Name, description, role, type and format.

  • Role: response, predictor, ID (ID columns are not used in a model but can help to better understand the data)
  • Type: nominal, ordinal or numeric
  • Format: int, float, string, category, date or object

Submission

Push all of your final changes to the GitHub repo, and submit the HTML or PDF file of your proposal to Moodle.

Proposal grading

Total 10 pts
Introduction 3 pts
Data description 2 pts
Analysis plan 4 pts
Data dictionary 1 pts

Each component will be graded as follows:

  • Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.

  • Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.

  • Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.

Draft analysis

The purpose of the draft analysis is to give you an opportunity to get early feedback on your analysis. Each team should push their final version of the draft analysis to their GitHub repo by the due date.

The structure of the draft analysis is as follows:

- Introduction
- Setup
- Data
    - Import data
    - Data structure
    - Data corrections
    - Variable lists
    - Data splitting
- Analysis
- Model
    - Select model
    - Training and validation
    - Fit model
    - Evaluation on test set
    - Save model
- Conclusion

Below is a brief description of the sections to focus on in the draft.

  • Introduction: This section includes an introduction to the project motivation, a data dictionary and research question.

  • Setup: Import all necessary Python libraries.

  • Data: Includes all data prepartion steps.

  • Analysis: Focus on the descriptive statistics and EDA for the response variable and a few other interesting variables and relationships. Interpret the results.

  • Model: Explain the reasoning for the type of model you’re fitting and predictor variables considered for the model. Save the model in your models/ folder.

  • Conclusion: This section includes initial interpretations and conclusions drawn from the model.

Peer review draft analysis

Critically reviewing others’ work is a crucial part of the data driven process. Each team will be assigned two other teams’s projects to review.

During the peer review process, you will be provided read-only access to your partner teams’ GitHub repos.

Provide your review in the form of GitHub issues to your partner team’s GitHub repo using the issue template provided (see below).

The peer review will be graded on the extent to which it comprehensively and constructively addresses the components of the partner team’s report: the research context and motivation, exploratory data analysis, modeling, interpretations, and conclusions.

Issue template

Spend ~30 mins to review each team’s project.

  1. Find your team to review in Moodle.
  2. Open the repo of the team you’re reviewing, read their project draft, and browser around the rest of their repo.
  3. Then, go to the Issues tab in that repo, click on New issue and fill out the issue by answering the following questions (copy the following content and replace it with your feedback):

Issue template:


### Peer review team

- Peer review by: \[NAME OF TEAM DOING THE REVIEW\]
- Names of team members that participated in this review: \[FULL NAMES OF TEAM MEMBERS DOING THE REVIEW\]

### Goal of the project

- Describe the goal of the project.

### Data

- Describe the data used or collected, if any. Is the data adequate for the project?

### Approach, tools and methods

- Describe the approaches, tools, and methods that will be used.

### Lack of clarity

- Is there anything that is unclear from the proposal? 

### Possible improvements

- Provide constructive feedback on how the team might be able to improve their project.Make sure your feedback includes at least one comment on the modeling aspect of the project, but do feel free to comment on aspects beyond the modeling.

### Presentation

- What aspect of this project are you most interested in and would like to see highlighted in the final presentation?

### Organization

- Provide constructive feedback on any issues with file and/or code organization.

### Further comments

- (Optional) Any further comments or feedback?

If you are done, click on Submit new issue.

Report

Your final report has to be created in the report.ipynb file (see folder reports/) and must be reproducible. Assume that it will be used to communicate your results to other data analysts who are interested in your findings.

All team members should contribute to the GitHub repository, with regular meaningful commits.

You also need to submit the HTML of your final report in Moodle (the HTML you submit must match the files in your GitHub repository exactly).

The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis and report.

Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.

You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.

The written report is worth 40 points, broken down as follows

Total 40 pts
Introduction/data 6 pts
Methodology 10 pts
Results 14 pts
Discussion + conclusion 6 pts
Organization + formatting 4 pts

Click here for an overview of the written report rubric.

Introduction and data

This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Grading criteria

The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.) The explanatory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.

Methodology

This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model. Additionally, show how you arrived at the final model by describing the model selection process, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.

Grading criteria

The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to select the final model; the approach is clearly described in the report. The model selection process took into account any violations in model conditions. The model conditions and diagnostics are thoroughly and accurately assessed for their model. If violations of model conditions are still present, there was a reasonable attempt to address the violations based on the course content.

Results

This is where you will output the final model with any relevant model fit statistics.

Describe the key results from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Grading criteria

The model fit is clearly assessed, and interesting findings from the model are clearly described. Interpretations of model coefficients are used to support the key findings and conclusions, rather than merely listing the interpretation of every model coefficient. If the primary modeling objective is prediction, the model’s predictive power is thoroughly assessed.

Discussion + Conclusion

In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Grading criteria

Overall conclusions from analysis are clearly described, and the model results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.

Organization + formatting

This is an assessment of the overall presentation and formatting of the written report.

Grading criteria

The report neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.

Presentation

Time constraints

Every group needs to present their project to the whole course.

During the presentation, each student may present for a maximum of 10 minutes to show the main highlights of the project (this means a team of two students can present for a maximum of 20 minutes). You may use as many slides as you prefer and you can also show code and live demos.

Slides

In addition to the report, your team will also create presentation slides (please use Google Slides with an adequate template).

These slides should serve as a brief visual addition to your report and will be graded for content and quality.

Here is a suggested outline as you think through the slides; you do not have to use this exact format

  • Title Slide
  • Introduce the topic and motivation
  • Introduce the data
  • Highlights from EDA
  • Final model
  • Interesting findings from the model
  • Conclusions + future work

Submission

Please share the files online and put a link in the README.md file in the reports/ folder.

GitHub Organization

All written work (with exception of presentation slides) should be reproducible, and the GitHub Organization should be neatly organized.

Do not change the content or structure of the repo after the stated deadline.

You will find an overview of the GitHub structure in the README.md file of the GitHub repo projects.

Points for reproducibility + organization will be based on the reproducibility of the report and the organization of the project GitHub repo. The repo should be neatly organized, there should be no extraneous files, all text in the README should be easily readable.

Overall grading

The grade breakdown is as follows:

Total 100 pts
Topic ideas 5 pts
Project proposal 10 pts
Peer review 10 pts
Report 40 pts
Reproducibility + orga (GitHub) 10 pts
Final presentation 25 pts

Grading summary

Grading of the project will take into account the following:

  • Content - What is the quality of the research question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a convincing argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a convincing argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a convincing argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Late work policy

Projects may be submitted up to 2 days late. There will be a 25% deduction for each 24-hour period the project is late.

Be sure to turn in your work early to avoid any technological mishaps.