Final project
Timeline
Topic ideas due Monday, October 17 (Thursday labs)
Topic ideas due Tuesday, October 18 (Friday labs)
Project proposal due Friday, November 4
Round 1 submission (optional) due Tuesday, November 22
Written report due Friday, December 9 (accepted until December 11)
Video presentation + slides and Reproducibility + organization due Wednesday, December 14
Presentation comments due Friday, December 16
Introduction
TL;DR: Pick a data set and do a regression analysis. That is your final project.
The goal of the final project is for you to use regression analysis to analyze a data set of your own choosing. The data set may already exist or you may collect your own data by scraping the web.
Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.
All analyses must be done in RStudio, and all components of the project must be reproducible (with the exception of the presentation).
Logistics
You will work on the project with your lab groups.
The four primary deliverables for the final project are
- A written, reproducible report detailing your analysis
- A GitHub repository corresponding to your report
- Slides + a video presentation
- Formal peer review on another team’s project
Topic ideas
Identify 2 - 3 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources on the Tips + resources page as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.
The purpose of submitting project ideas is to give you time to find data for the project and to make sure you have a data set that can help set you up for a successful project. Therefore, you must use one of the data sets submitted as a topic idea, unless otherwise notified by the teaching team.
The data sets should meet the following criteria:
- At least 500 observations
- At least 10 columns
- At least 6 of the columns must be useful and unique predictor variables.
- Identifier variables such as “name”, “social security number”, etc. are not useful predictor variables.
- If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
- At least one variable that can be identified as a reasonable response variable.
- The response variable can be quantitative or categorical.
- A mix of quantitative and categorical variables that can be used as predictors.
- Observations should reasonably meet the independence condition. Therefore, avoid data with repeated measures, data collected over time, etc.
- You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
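The countable criteria above (rows, columns, and the mix of variable types) can be screened with a few lines of R. The following is a minimal sketch, not a required tool: the helper name `check_criteria` and the simulated `candidate` data frame are hypothetical. Whether predictors are useful and unique, and whether observations are independent, still has to be judged by your team.

```r
# Hypothetical helper: screens a data frame against the countable criteria.
check_criteria <- function(df, response) {
  predictors <- df[, setdiff(names(df), response), drop = FALSE]
  is_quant <- vapply(predictors, is.numeric, logical(1))
  is_cat <- vapply(
    predictors,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  c(
    enough_rows = nrow(df) >= 500,
    enough_columns = ncol(df) >= 10,
    enough_predictors = ncol(predictors) >= 6,
    has_quantitative = any(is_quant),
    has_categorical = any(is_cat)
  )
}

# Simulated stand-in for a real candidate data set (assumption for illustration).
set.seed(1)
n <- 600
candidate <- data.frame(
  y = rnorm(n),
  matrix(rnorm(n * 7), nrow = n),
  group = sample(c("a", "b", "c"), n, replace = TRUE),
  flag = sample(c("yes", "no"), n, replace = TRUE)
)

result <- check_criteria(candidate, response = "y")
print(result)
```

A `FALSE` entry flags a criterion to revisit before submitting that data set as a topic idea.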
For each data set, include the following:
Introduction and data
- State the source of the data set.
- Describe when and how it was originally collected (by the original data curator, not necessarily how you found the data)
- Describe the observations and the general characteristics being measured in the data
Research question
- Describe a research question you’re interested in answering using this data.
Glimpse of data
- Use the glimpse function to provide an overview of the data set
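As a rough illustration of what this overview looks like, the sketch below calls `glimpse()` from the dplyr package on a small simulated data frame. The `proposed` data frame and its variables are made up for the example, and the base R `str()` fallback is included only so the code runs where dplyr is not installed.

```r
# Simulated stand-in for a proposed data set (assumption for illustration).
set.seed(2)
proposed <- data.frame(
  price = round(runif(5, 100, 500)),
  neighborhood = c("north", "south", "east", "west", "north"),
  bedrooms = sample(1:4, 5, replace = TRUE)
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(proposed)  # one line per variable: name, type, first values
} else {
  str(proposed)             # base R fallback with a similar layout
}
```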
Grading
The Topic Ideas portion of the project is worth 10 points. It will be graded based on meeting the criteria stated above for 2 proposed data sets.
Project proposal
The purpose of the project proposal is to help you think about your analysis strategy early and thoroughly explore the data.
Include the following in the proposal:
Section 1 - Introduction
The introduction section includes
- an introduction to the subject matter you’re investigating
- the motivation for your research question (citing any relevant literature)
- the primary research question you are interested in exploring
- your team’s hypotheses regarding the research question
Section 2 - Data description
In this section, you will describe the data set. This includes
- description of the observations in the data set
- description of how the data were originally collected (not how you found the data but how the original curator of the data collected it).
Section 3 - Exploratory data analysis
In this section, you will explore the data. This includes using visualizations and summary statistics to describe the following:
- univariate distribution of the response variable
- univariate distributions of the potential predictor variables
- relationships between the response and predictors
- relationships between predictors
- potential interaction effects you’re interested in exploring
In this section, you will also describe any data cleaning you need to do to prepare for modeling, such as imputing missing values, collapsing levels for categorical predictors, creating new variables, summarizing data, etc.
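The summary-statistics side of this section might look something like the sketch below, which uses only base R. The data are simulated and the variable names (`score`, `hours`, `program`) are placeholders, not part of the assignment. Dropping incomplete rows is shown as one possible cleaning choice; imputation or another approach may suit your data better, and plots would accompany these summaries in an actual proposal.

```r
# Simulated data; variable names are placeholders (assumption for illustration).
set.seed(3)
n <- 500
eda <- data.frame(
  hours = runif(n, 0, 10),
  program = factor(sample(c("A", "B"), n, replace = TRUE))
)
eda$score <- 50 + 3 * eda$hours + 5 * (eda$program == "B") + rnorm(n, sd = 4)
eda$score[sample(n, 10)] <- NA  # inject some missing responses

# Univariate distributions: response first, then predictors
summary(eda$score)
summary(eda$hours)
table(eda$program)

# Relationships between the response and predictors
cor(eda$score, eda$hours, use = "complete.obs")
group_means <- tapply(eda$score, eda$program, mean, na.rm = TRUE)
group_means

# Data cleaning: count missing values per column, then drop incomplete rows
missing_counts <- colSums(is.na(eda))
eda_clean <- eda[complete.cases(eda), ]
```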
Section 4 - Analysis approach
In this section, you will provide a brief overview of your analysis approach. This includes:
- description of the response variable and list of potential predictors
- regression model technique (multiple linear regression or logistic regression)
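The two model types named above correspond to `lm()` and `glm()` with `family = binomial` in base R. The sketch below fits each to simulated data; the variable names and data-generating values are assumptions made up for illustration.

```r
# Simulated data; names are placeholders (assumption for illustration).
set.seed(4)
n <- 500
d <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("low", "high"), n, replace = TRUE))
)
d$y_quant <- 2 + 1.5 * d$x1 + rnorm(n)                 # quantitative response
d$y_binary <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1))  # binary response

# Multiple linear regression for a quantitative response
fit_linear <- lm(y_quant ~ x1 + x2, data = d)

# Logistic regression for a binary response
fit_logistic <- glm(y_binary ~ x1 + x2, data = d, family = binomial)

coef(fit_linear)
coef(fit_logistic)
```

The choice between the two follows from the response variable: a quantitative response points to linear regression, a binary categorical response to logistic regression.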
Data dictionary (aka code book)
Submit a data dictionary for all the variables in your data set in the README of the data folder. You do not need to include the data dictionary in the PDF.
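One possible way to start the data dictionary is to generate a skeleton from the data frame and fill in the descriptions by hand. This is only a sketch: the helper `make_dictionary` and the tiny `project_data` data frame are hypothetical, and the format of the dictionary in your README is up to your team.

```r
# Hypothetical helper: builds a data dictionary skeleton from a data frame.
make_dictionary <- function(df) {
  data.frame(
    variable = names(df),
    type = unname(vapply(df, function(x) class(x)[1], character(1))),
    description = "",  # to be filled in by hand
    stringsAsFactors = FALSE
  )
}

# Simulated stand-in for the project data (assumption for illustration).
project_data <- data.frame(
  price = c(250, 310),
  neighborhood = c("north", "south"),
  stringsAsFactors = FALSE
)

dictionary <- make_dictionary(project_data)
print(dictionary)
```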
Submission
Grading
The anticipated length, including all graphs, tables, narrative, etc., is 3 - 6 pages; it may not exceed 7 pages.
The proposal is worth 15 points and will be graded based on accurately and comprehensively addressing the criteria stated above. Points will be assigned based on a holistic review of the project proposal.
Excellent (14 - 15 points): All required elements are completed and are accurate. There is a thorough exploration of the data and the team has demonstrated a careful and thoughtful approach exploring the data and preparing it for analysis. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Strong (11 - 13 points): Requirements are mostly met, but there are some elements that are incomplete or inaccurate. Some minor revision of the work is required before the team is ready for modeling.
Satisfactory (8 - 10 points): Requirements are partially met, but there are some elements that are incomplete and/or inaccurate. Major revision of the work is required before the team is ready for modeling.
Needs Improvement (7 or fewer points): Requirements are largely unmet, and there are large elements that are incomplete and/or inaccurate. Substantial revisions of the work are required before the team is ready for modeling.
Round 1 submission (optional)
The Round 1 submission is an opportunity to receive detailed feedback on your analysis and written report. The feedback will only be on the content that is submitted, so more “complete” drafts will receive more detailed feedback. At this stage, you will also be notified of the grade you would receive at that point. You will have the option to keep the grade (and thus you don’t need to turn in an updated report) or resubmit the written report by the final submission deadline for grading.
To submit the draft:
1. Push the updated written-report.qmd and written-report.pdf to your GitHub repo.
2. Open an issue with the title “Round 1 Submission”. You can use the template issue in the GitHub repo. Make sure I am tagged in the issue (@matackett), so I receive notification of your Round 1 submission. See Creating an issue from a repository for instructions on opening an issue. Please ask a member of the teaching team for assistance if you need help opening the issue.
Note that this is optional, so there is no grading penalty for turning in nothing for the Round 1 submission. Due to time constraints at the end of the semester, only high-level feedback will be given for the reports submitted at the final written report deadline in December.
Written report
Your written report must be completed in the written-report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
You will submit the PDF of your final report on GitHub.
The PDF you submit must match the .qmd in your GitHub repository exactly. The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
Grading criteria
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.). The exploratory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.
Methodology
This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
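As a rough sketch of what such a modeling process can involve, the code below compares a main-effects model with an interaction model and then collects standard diagnostics, all in base R. The data are simulated and the names are placeholders. The nested F-test and AIC shown here are two possible selection criteria, not a prescribed procedure, and the cutoff of 3 for standardized residuals is a common rule of thumb rather than a course requirement.

```r
# Simulated data; names are placeholders (assumption for illustration).
set.seed(6)
n <- 500
d <- data.frame(
  x1 = runif(n, 0, 10),
  group = factor(sample(c("a", "b"), n, replace = TRUE))
)
d$y <- 1 + 0.5 * d$x1 + 2 * (d$group == "b") +
  0.8 * d$x1 * (d$group == "b") + rnorm(n)

# Candidate models: main effects only vs. adding an interaction
fit_main <- lm(y ~ x1 + group, data = d)
fit_int <- lm(y ~ x1 * group, data = d)

# Nested F-test and AIC as two possible selection criteria
anova(fit_main, fit_int)
aic_values <- AIC(fit_main, fit_int)
aic_values

# Diagnostics: fitted values, standardized residuals, leverage, Cook's distance
diagnostics <- data.frame(
  fitted = fitted(fit_int),
  std_resid = rstandard(fit_int),
  leverage = hatvalues(fit_int),
  cooks_d = cooks.distance(fit_int)
)
summary(diagnostics)
sum(abs(diagnostics$std_resid) > 3)  # count of unusually large residuals
```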
Grading criteria
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to select the final model; the approach is clearly described in the report. The model selection process took into account potential interaction effects and addressed any violations in model conditions. The model conditions and diagnostics are thoroughly and accurately assessed for their model. If violations of model conditions are still present, there was a reasonable attempt to address the violations based on the course content.
Results
This is where you will output the final model with any relevant model fit statistics.
Describe the key results from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
Grading criteria
The model fit is clearly assessed, and interesting findings from the model are clearly described. Interpretations of model coefficients are used to support the key findings and conclusions, rather than merely listing the interpretation of every model coefficient. If the primary modeling objective is prediction, the model’s predictive power is thoroughly assessed.
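To illustrate the kind of output this section draws on, the sketch below reports coefficient estimates with confidence intervals, adjusted R-squared, and a test-set RMSE for a linear model. The data are simulated, and the 80/20 train/test split is an assumption chosen for the example, not a requirement of the project.

```r
# Simulated data; names are placeholders (assumption for illustration).
set.seed(7)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

# Hold out 20% of rows to assess predictive power
test_rows <- sample(n, size = 0.2 * n)
train <- d[-test_rows, ]
test <- d[test_rows, ]

fit <- lm(y ~ x1 + x2, data = train)

# Coefficient estimates with 95% confidence intervals
results <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
round(results, 3)

# Model fit statistics
adj_r2 <- summary(fit)$adj.r.squared
rmse_test <- sqrt(mean((test$y - predict(fit, newdata = test))^2))
c(adj_r2 = adj_r2, rmse_test = rmse_test)
```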
Discussion + Conclusion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Grading criteria
Overall conclusions from analysis are clearly described, and the model results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Grading criteria
The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.
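One possible way to suppress code, warnings, and messages is to set knitr chunk options once near the top of the report. The sketch below assumes the report is rendered with the knitr engine and that these lines sit in a setup chunk of written-report.qmd; both are assumptions, and the same effect can be achieved in other ways. It also shows `round()` for limiting displayed digits, with a made-up value.

```r
# Intended for a setup chunk near the top of written-report.qmd
# (placement is an assumption; the knitr engine must be in use).
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = FALSE,     # hide code
    warning = FALSE,  # hide warnings
    message = FALSE   # hide messages
  )
}

# Display numerical results with a reasonable number of digits
estimate <- 3.14159265  # made-up value for illustration
rounded <- round(estimate, 2)
rounded
```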
Submission
Video presentation + slides
Slides
In addition to the written report, your team will also create presentation slides and record a video presentation that summarize and showcase your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
- Title Slide
- Slide 1: Introduce the topic and motivation
- Slide 2: Introduce the data
- Slide 3: Highlights from EDA
- Slide 4: Final model
- Slide 5: Interesting findings from the model
- Slide 6: Conclusions + future work
Video presentation
For the video presentation, you can speak over your slide deck, similar to the lecture content videos. The video presentation must be no longer than 7 minutes. It is fine if the video is shorter than 7 minutes, but it cannot exceed 7 minutes. You may use any platform that works best for your group to record your presentation. Below are a few resources on recording videos:
- Recording presentations in Zoom
- Apple Quicktime for screen recording
- Windows 10 built-in screen recording functionality
- Kap for screen recording
Once your video is ready, upload the video to Warpwire, then embed the video in a new discussion post on Sakai.
To upload your video to Warpwire:
- Click the Warpwire tab in the Sakai site for your section.
- Click the “+” and select “Upload files”.
- Locate the video on your computer and click to upload.
- Once you’ve uploaded the video to Warpwire, click to share the video and copy the video’s URL. You will need this when you post the video in the discussion forum.
To post the video to the discussion forum:
- Click the Presentations tab in the course Sakai site.
- Click the Presentations topic.
- Click “Start a new conversation”.
- Make the title “Your Team Name: Project Title”.
- Click the Warpwire icon (between the table and shopping cart icons).
- Select your video, then click “Insert 1 item.” This will embed your video in the conversation.
- Under the video, paste the URL to your video. This is to ensure peers can see the presentation if they’re unable to view the embedded video.
- You’re done!
You can see the Teaching Team example in Sakai.
Presentation comments
Each student will be assigned 2 presentations to watch. Click here to see your viewing assignments.
Watch the group’s video, then click “Reply” to post a question for the group. You may not post a question that’s already been asked on the discussion thread. Additionally, the question should (i) be substantive (e.g., it shouldn’t be “Why did you use a bar plot instead of a pie chart?”), (ii) demonstrate your understanding of the content from the course, and (iii) be relevant to that group’s specific presentation, i.e., demonstrate that you’ve watched the presentation.
This portion of the project will be assessed individually.
Reproducibility + organization
All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
The GitHub repo should have the following structure:
- README: Short project description and data dictionary
- written-report.qmd & written-report.pdf: Final written report
- /data: Folder that contains the data set for the final project.
- /previous-work: Folder that contains the topic-ideas, project-proposal, and written-report-comments.pdf (if applicable) files.
- /presentation: Folder with the presentation slides.
  - If your presentation slides are online, you can put a link to the slides in a README.md file in the presentation folder.
- *.Rproj: File specifying the RStudio project
- .gitignore: File listing all files that are in the local RStudio project but not the GitHub repo
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.
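Before the deadline, a quick check that the expected files and folders exist can catch organizational gaps. The following is a minimal sketch: the helper `check_repo` is hypothetical, the README file name `README.md` is an assumption, and the demonstration runs against a throwaway folder created in a temporary directory rather than a real project.

```r
# Hypothetical helper: reports which expected repo items exist under `root`.
check_repo <- function(root = ".") {
  expected <- c("README.md", "written-report.qmd", "written-report.pdf",
                "data", "previous-work", "presentation", ".gitignore")
  found <- file.exists(file.path(root, expected))
  has_rproj <- length(list.files(root, pattern = "\\.Rproj$")) > 0
  c(setNames(found, expected), Rproj = has_rproj)
}

# Throwaway demo repo in a temporary directory (assumption for illustration).
demo_root <- file.path(tempdir(), "demo-repo")
dir.create(demo_root, showWarnings = FALSE)
for (folder in c("data", "previous-work", "presentation")) {
  dir.create(file.path(demo_root, folder), showWarnings = FALSE)
}
invisible(file.create(file.path(
  demo_root,
  c("README.md", "written-report.qmd", "project.Rproj", ".gitignore")
)))

status <- check_repo(demo_root)
print(status)
```

In this demo the PDF is deliberately left out, so its entry reports `FALSE`. In a real project you would call `check_repo()` from the project root.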
Peer teamwork evaluation
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly.
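The threshold in the example above is half of an equal share of the total effort. The short sketch below works this out for a few team sizes; the range of sizes shown is an assumption for illustration.

```r
# Expected share per member and the half-share threshold, by team size
team_size <- 3:5
expected_share <- 100 / team_size            # percent of total effort
explanation_threshold <- expected_share / 2  # below this, explain in the survey
data.frame(team_size, expected_share, explanation_threshold)
```

For a team of four, the expected share is 25% and the threshold is 12.5%, matching the example in the text.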
Overall grading
The grade breakdown is as follows:
Component | Points |
---|---|
Topic ideas | 10 pts |
Project proposal | 15 pts |
Written report | 45 pts |
Slides + video presentation | 15 pts |
Reproducibility + organization | 5 pts |
Video comments | 5 pts |
Peer teamwork evaluation | 5 pts |
Total | 100 pts |
Note: No late project reports or videos are accepted.
Grading summary
Grading of the project will take into account the following:
- Content - What is the quality of the research and/or policy question and the relevance of the data to that question?
- Correctness - Are statistical procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of scoring is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
- 60%-69%: Struggling effort. Student is making some effort, but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.