Final Project: Linking Writing Processes to Writing Quality
- Timeline
- Release Date: October 30th, 2023
- Due Date: 11:59 PM, Thursday, December 8th, 2023
- Data Information
- We will utilize data from the Kaggle competition on Linking
Writing Processes to Writing Quality.
- At least one team member must register for a Kaggle account and
download the data. Prior to downloading, you will be prompted to read
and accept the competition rules.
- The dataset comprises four files: train_logs.csv, test_logs.csv, train_scores.csv, and sample_submission.csv. Our primary focus will be on the train_logs (covariates) and train_scores (outcome) files; a minimal loading sketch follows below. You also have the option to submit your results to Kaggle for extra credit, as detailed in the report requirements section.
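As a minimal sketch of getting started (the data/ path is a hypothetical location for the unzipped competition files; the id column name follows the Kaggle data description):

```python
import pandas as pd

# Hypothetical path: adjust to wherever you placed the Kaggle files.
logs = pd.read_csv("data/train_logs.csv")      # keystroke-level covariates
scores = pd.read_csv("data/train_scores.csv")  # one score per essay id

# Each essay id spans many rows in the logs but exactly one in the scores.
print(logs.shape, scores.shape)
print(logs["id"].nunique(), "unique essays")
```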
- Submission Guidelines
- Designate one team member as the team lead to
submit the complete report and supplementary files via Gradescope. Use the
group submission feature and list all team
members.
- The full report must be submitted in
.pdf
format. It
should consist of a cover page and a main body not exceeding 12
pages. A supplementary section can extend the report to a
maximum of 20 pages to display additional figures, tables, or results
supporting your findings.
- Clearly indicate the names and NetIDs of all team members on the
cover page of the report.
Project Report Requirements
- [5 Points, 1 page] Project Description and Summary:
This section should provide a concise summary of your project’s goal,
approach, and conclusion. Consider this as the abstract of your report.
Additionally, highlight any unique or specialized methods you have
employed, particularly those not covered in this course, and explain how
they differ from methods you’ve reviewed in the Literature Review
section.
- [10 Points, 1-2 pages] Literature Review: The
dataset has been available for some time, and many researchers have
attempted to analyze it. There are likely to be interesting ideas that
you could incorporate into your project. For instance, on the competition's discussion page, participants share their own solutions and thoughts. You may also use a
Google search with relevant keywords to find other ideas. Perform the
following tasks for this section:
- Best Known Accuracy: What is the best accuracy on
this dataset so far with a known analysis approach? Note that some teams
on the Kaggle
leaderboard may have very high accuracy but may not reveal their
methods. Focus on those who have published their approach and summarize
it.
- Other Approaches: Read and understand at least two
other posts or solutions. Summarize their approaches in a short
paragraph and comment on their advantages and disadvantages.
- Citations and Links: For all the sources you
consult, provide appropriate citations and/or URLs.
- Application to Your Project: Discuss possibilities
of incorporating or adapting the approaches you’ve learned from the
literature into your own project.
- [30 Points, 2-3 pages] Data Processing
- This dataset has a distinct structure, consisting of 5,000 logs of
user inputs. Each log is scored on a scale of 0 to 6 and captures
anonymous keyboard actions, effectively making each log a sequential set
of input data. It’s crucial to read the data
description thoroughly before starting any analysis.
- For this project, you are required to condense each
user’s log, which usually spans several thousand rows, into a single row
of data. Develop your processing algorithm using a randomly selected 80% of
observations (as training data), then apply the same processing to the
remaining 1,000 observations (as testing data); a processing sketch appears
at the end of this section. This processing algorithm may be carried out in
multiple steps and involve various decision points. Exercise your best
judgment in making these decisions. Iterating between this step and your
supervised learning step could be beneficial for arriving at an optimal
data configuration, but is not required.
- Certain variables in the dataset have specialized structures. For
instance, the activity variable is categorized into multiple distinct
types. Another example: when both the down_event and up_event are space,
there is no change in the text. When processing such variables, take care
to preserve their original meanings. Clearly delineate your approach in a
written paragraph.
- After data processing, provide a frequency table, histogram plot, or
summary of all variables in your data. If the processed data is
high-dimensional, summarize the variables efficiently without generating
excessive output. Discuss any outliers at the user level that require
attention and any transformations you aim to apply. If you choose to
implement any modifications, clearly outline your approach.
- Be aware that the choices made during this step will significantly
influence the subsequent supervised and unsupervised learning phases. As
such, careful consideration and planning are needed. You are encouraged to
utilize supervised or unsupervised approaches at this step.
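As referenced above, here is one possible sketch of the condensing step. The specific aggregations, the column names action_time and word_count, and the seed are illustrative assumptions based on the Kaggle data description, not requirements:

```python
import pandas as pd

logs = pd.read_csv("data/train_logs.csv")
scores = pd.read_csv("data/train_scores.csv")

# Condense each user's log into a single row: counts of each activity
# type plus a few simple numeric summaries (illustrative choices only).
activity_counts = (
    logs.pivot_table(index="id", columns="activity",
                     values="event_id", aggfunc="count", fill_value=0)
        .add_prefix("n_")
)
numeric_summary = logs.groupby("id").agg(
    n_events=("event_id", "count"),
    mean_action_time=("action_time", "mean"),
    final_word_count=("word_count", "last"),
)
features = activity_counts.join(numeric_summary).reset_index()

# Attach the outcome and make the 80/20 split (4,000 train, 1,000 test).
data = features.merge(scores, on="id")
train = data.sample(frac=0.8, random_state=432)  # hypothetical seed
test = data.drop(train.index)
print(train.shape, test.shape)
```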
- [10 Points, 2-3 pages] Unsupervised Learning
- For this step, use only the 4,000 training observations you randomly
selected.
- Apply at least two clustering algorithms to the processed data (a
clustering sketch follows this list). How many clusters did you decide to
use?
- Are these clustering results associated with your outcome
variable?
- What can you learn from the clustering results? Are they useful for
your supervised model?
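A minimal sketch of this step, assuming X_train and y_train hold the condensed 4,000 training features and scores from the processing step; the scaling and k = 4 are placeholders to be tuned (e.g., via silhouette scores or an elbow plot):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(X_train)

# Two clustering algorithms on the same processed data.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# A quick check of whether clusters are associated with the outcome:
# compare mean essay scores across the k-means clusters.
print(pd.Series(y_train).groupby(kmeans_labels).mean())
```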
- [30 Points, 3-4 pages] Regression/Classification
Models
- Fit at least three different regression models to the training dataset,
and also fit one classification model (treating 0-6 as unique class
labels, ordered or unordered). Note that all linear and penalized linear
(logistic) models are considered the same model; likewise, KNN and NW
kernel regression/classification are considered the same.
- Your approaches are not limited to the ones we learned in this
course. However, at least two models should be the ones we covered. If
you decide to use/propose a new approach, make sure to properly describe
your model.
- You need to tune all methods properly and clearly describe your tuning
process for each model (a tuning sketch follows at the end of this
section). What are your evaluation criteria? If you use multiple criteria,
discuss their pros and cons.
- For this step, use only the 80% training data you randomly selected.
After tuning and selecting the best model, apply it to your 1,000 testing
observations.
- You need to provide sufficient information (tables/figures and
descriptions) to demonstrate the model fitting results. Which model seems
to perform best? Which variable(s) you constructed seem to be most
predictive, and can you provide interpretations of them?
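As referenced above, one possible sketch of the tuning and final-evaluation workflow. The two models, the grids, and the RMSE criterion are illustrative assumptions; X_train, y_train, X_test, and y_test follow the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Model 1: penalized linear regression with lambda chosen by 5-fold CV.
lasso = LassoCV(cv=5).fit(X_train, y_train)

# Model 2: random forest tuned over a small illustrative grid,
# evaluated with cross-validated RMSE.
rf = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_features": [3, 5, 10], "min_samples_leaf": [1, 5, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_train, y_train)

# Apply the selected model once to the held-out 1,000 observations.
pred = rf.best_estimator_.predict(X_test)
print("test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```

Feature importances from the tuned forest (rf.best_estimator_.feature_importances_) are one way to identify which constructed variables are most predictive.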
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning
for your decisions.
- Is it written in a manner such that a reader does not need to be
very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and compact
enough? For example, you should not include an extremely long table or a
huge figure that takes up a whole page.
- Are irrelevant/trivial code and output hidden? Overall, no more than 2
pages should be used to display code.