Final Project: Linking Writing Processes to Writing Quality
- Timeline
- Release Date: October 30th, 2023
- Due Date: 11:59 PM, Thursday, December 8th, 2023
- Data Information
- We will utilize data from the Kaggle competition on Linking
Writing Processes to Writing Quality.
- At least one team member must register for a Kaggle account and
download the data. Prior to downloading, you will be prompted to read
and accept the competition rules.
- The dataset comprises four files: train_logs.csv, test_logs.csv, train_scores.csv, and sample_submission.csv. Our primary focus will be on the train_logs (covariates) and train_scores (outcome) files; a minimal loading sketch follows below. You also have the option to submit your results to Kaggle for extra credit, as detailed in the report requirements section.
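As a minimal sketch of getting started (the data/ path is a hypothetical location for the unzipped competition files; the id column name follows the Kaggle data description):

```python
import pandas as pd

# Hypothetical path: adjust to wherever you placed the Kaggle files.
logs = pd.read_csv("data/train_logs.csv")      # keystroke-level covariates
scores = pd.read_csv("data/train_scores.csv")  # one score per essay id

# Each essay id spans many rows in the logs but exactly one in the scores.
print(logs.shape, scores.shape)
print(logs["id"].nunique(), "unique essays")
```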
- Submission Guidelines
- Designate one team member as the team lead to
submit the complete report and supplementary files via Gradescope. Use the
group submission feature and list all team
members.
- The full report must be submitted in
.pdf
format. It
should consist of a cover page and a main body not exceeding 12
pages. A supplementary section can extend the report to a
maximum of 20 pages to display additional figures, tables, or results
supporting your findings.
- Clearly indicate the names and NetIDs of all team members on the
cover page of the report.
Project Report Requirements
- [5 Points, 1 page] Project Description and Summary:
This section should provide a concise summary of your project’s goal,
approach, and conclusion. Consider this as the abstract of your report.
Additionally, highlight any unique or specialized methods you have
employed, particularly those not covered in this course, and explain how
they differ from methods you’ve reviewed in the Literature Review
section.
- [10 Points, 1-2 pages] Literature Review: The
dataset has been available for some time, and many researchers have
attempted to analyze it. There are likely to be interesting ideas that
you could incorporate into your project. For instance, on the competition's discussion page, participants share their own solutions and thoughts. You may also use a
Google search with relevant keywords to find other ideas. Perform the
following tasks for this section:
- Best Known Accuracy: What is the best accuracy on
this dataset so far with a known analysis approach? Note that some teams
on the Kaggle
leaderboard may have very high accuracy but may not reveal their
methods. Focus on those who have published their approach and summarize
it.
- Other Approaches: Read and understand at least two
other posts or solutions. Summarize their approaches in a short
paragraph and comment on their advantages and disadvantages.
- Citations and Links: For all the sources you
consult, provide appropriate citations and/or URLs.
- Application to Your Project: Discuss possibilities
of incorporating or adapting the approaches you’ve learned from the
literature into your own project.
- [30 Points, 2-3 pages] Data Processing
- This dataset has a distinct structure, consisting of 5,000 logs of
user inputs. Each log is scored on a scale of 0 to 6 and captures
anonymous keyboard actions, effectively making each log a sequential set
of input data. It’s crucial to read the data
description thoroughly before starting any analysis.
- For this project, you are required to condense each
user’s log, which usually spans several thousand rows, into a single row
of data. Develop your processing algorithm using a randomly selected 80% of
observations (as training data), then apply the same processing to the
remaining 1,000 observations (as testing data); a processing sketch appears
at the end of this section. This processing algorithm may be carried out in
multiple steps and involve various decision points. Exercise your best
judgment in making these decisions. Iterating between this step and your
supervised learning step could be beneficial for arriving at an optimal
data configuration, but is not required.
- Certain variables in the dataset have specialized structures. For
instance, the activity variable is categorized into multiple distinct
types. Another example: when both the down_event and up_event are space,
there is no change in the text. When processing such variables, take care
to preserve their original meanings. Clearly delineate your approach in a
written paragraph.
- After data processing, provide a frequency table, histogram plot, or
summary of all variables in your data. If the processed data is
high-dimensional, summarize the variables efficiently without generating
excessive output. Discuss any outliers at the user level that require
attention and any transformations you aim to apply. If you choose to
implement any modifications, clearly outline your approach.
- Be aware that the choices made during this step will significantly
influence the subsequent supervised and unsupervised learning phases. As
such, careful consideration and planning are needed. You are encouraged to
utilize supervised or unsupervised approaches at this step.
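As referenced above, here is one possible sketch of the condensing step. The specific aggregations, the column names action_time and word_count, and the seed are illustrative assumptions based on the Kaggle data description, not requirements:

```python
import pandas as pd

logs = pd.read_csv("data/train_logs.csv")
scores = pd.read_csv("data/train_scores.csv")

# Condense each user's log into a single row: counts of each activity
# type plus a few simple numeric summaries (illustrative choices only).
activity_counts = (
    logs.pivot_table(index="id", columns="activity",
                     values="event_id", aggfunc="count", fill_value=0)
        .add_prefix("n_")
)
numeric_summary = logs.groupby("id").agg(
    n_events=("event_id", "count"),
    mean_action_time=("action_time", "mean"),
    final_word_count=("word_count", "last"),
)
features = activity_counts.join(numeric_summary).reset_index()

# Attach the outcome and make the 80/20 split (4,000 train, 1,000 test).
data = features.merge(scores, on="id")
train = data.sample(frac=0.8, random_state=432)  # hypothetical seed
test = data.drop(train.index)
print(train.shape, test.shape)
```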
- [10 Points, 2-3 pages] Unsupervised Learning
- For this step, use only the 4,000 training observations you randomly
selected.
- Apply at least two clustering algorithms to the processed data (a
clustering sketch follows this list). How many clusters did you decide to
use?
- Are these clustering results associated with your outcome
variable?
- What can you learn from the clustering results? Are they useful for
your supervised model?
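A minimal sketch of this step, assuming X_train and y_train hold the condensed 4,000 training features and scores from the processing step; the scaling and k = 4 are placeholders to be tuned (e.g., via silhouette scores or an elbow plot):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(X_train)

# Two clustering algorithms on the same processed data.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# A quick check of whether clusters are associated with the outcome:
# compare mean essay scores across the k-means clusters.
print(pd.Series(y_train).groupby(kmeans_labels).mean())
```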
- [30 Points, 3-4 pages] Regression/Classification
Models
- Fit at least three different regression models to the training dataset,
and also fit one classification model (treating 0-6 as unique class
labels, ordered or unordered). Note that all linear and penalized linear
(logistic) models are considered the same model; likewise, KNN and NW
kernel regression/classification are considered the same.
- Your approaches are not limited to the ones we learned in this
course. However, at least two models should be the ones we covered. If
you decide to use/propose a new approach, make sure to properly describe
your model.
- You need to tune all methods properly and clearly describe your tuning
process for each model (a tuning sketch follows at the end of this
section). What are your evaluation criteria? If you use multiple criteria,
discuss their pros and cons.
- For this step, use only the 80% training data you randomly selected.
After tuning and selecting the best model, apply it to your 1,000 testing
observations.
- You need to provide sufficient information (tables/figures and
descriptions) to demonstrate the model fitting results. Which model seems
to perform best? Which variable(s) you constructed seem to be most
predictive, and can you provide interpretations of them?
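As referenced above, one possible sketch of the tuning and final-evaluation workflow. The two models, the grids, and the RMSE criterion are illustrative assumptions; X_train, y_train, X_test, and y_test follow the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Model 1: penalized linear regression with lambda chosen by 5-fold CV.
lasso = LassoCV(cv=5).fit(X_train, y_train)

# Model 2: random forest tuned over a small illustrative grid,
# evaluated with cross-validated RMSE.
rf = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_features": [3, 5, 10], "min_samples_leaf": [1, 5, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_train, y_train)

# Apply the selected model once to the held-out 1,000 observations.
pred = rf.best_estimator_.predict(X_test)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```

Feature importances from the tuned forest (rf.best_estimator_.feature_importances_) are one way to identify which constructed variables are most predictive.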
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning
for your decisions.
- Is it written in a manner such that a reader does not need to be
very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and compact
enough? For example, you should not include an extremely long table or a
huge figure that takes up a whole page.
- Are irrelevant/trivial code and output hidden? Overall, no more than 2
pages should be used to display code.