The use case for this program involved the tokenization and analysis of approximately 500 first-year student writing samples of around 150-200 words per submission. Its output can also be coupled with findings drawn from associated NLP methods geared toward sensemaking and pattern recognition within rhetorically contained linguistic datasets.
Such was the case with the metadiscursive register of students who write about their writing in response to the pedagogical language of an assignment description: one whose parameters narrow the pool of lexical and phraseological expressions among online learners. The instructional design and rhetoric of these writing prompts thus underwent strong qualitative growth year-over-year due to the patterned insights of these learner analytics.
- Python script designed to preprocess, sort, and enumerate frequency values for bigram and trigram tokens within considerably large text corpora.
- A sizeable linguistic dataset drawn from a distributed set of text artifacts
- A plain or rich text file for each linguistic dataset to be concatenated, added as a commit, with its filename wrapped in quotes and passed as input to the script.
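The core preprocessing and counting step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the sample text, the tokenizer regex, and the `ngram_frequencies` helper name are all assumptions made for the example.

```python
import re
from collections import Counter

def ngram_frequencies(text, n):
    """Lowercase and tokenize text, then count its n-gram tuples."""
    # Simple word tokenizer: runs of letters/apostrophes (an assumption,
    # not necessarily the preprocessing used in the actual script).
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample standing in for a concatenated corpus file.
sample = "I write about my writing because I write every day"
bigrams = ngram_frequencies(sample, 2)
trigrams = ngram_frequencies(sample, 3)
print(bigrams.most_common(3))
```

In practice the `text` argument would come from reading the quoted corpus filename mentioned above (e.g. `open(filename).read()`), and `most_common()` yields the sorted frequency values for bigram and trigram tokens.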
This repository comprises part of my final research project in fulfillment of Kyle Gorman's Fall 2019 section of Methods in Computational Linguistics at The Graduate Center, CUNY.