08. Creative Projects — Building a Computational Linguistics Portfolio
Fatima, your academic direction in Linguistics / Computational Linguistics gives you a rare opportunity to merge deep language insight with technical implementation. The committee emphasized that your portfolio should not only show coding ability but also tangible contributions to language data and analysis. In this section, you’ll build a creative, research-oriented portfolio that demonstrates your ability to handle real linguistic data and communicate results effectively.
Project 1: Mini NLP Dataset or Language Documentation Corpus
The committee specifically encouraged you to develop and release a small NLP dataset or language documentation corpus. This project can serve as your “anchor” — a public artifact showing technical rigor and linguistic awareness.
- Goal: Create a small but well-structured dataset (around 1,000–5,000 entries) documenting a linguistic phenomenon, dialect, or semantic pattern.
- Possible Scope: You could focus on morphologically rich languages, regional dialects, or semantic ambiguity patterns. Since your home state is Minnesota, you might explore regional lexical variation or pronunciation shifts — but only if you have authentic data access.
- Technical Stack:
  - Data Collection: Python scripts using `BeautifulSoup` or `requests` for web scraping (if the data is publicly available).
  - Data Cleaning: `pandas` and `regex` for text normalization.
  - Annotation: Use `spaCy` or `NLTK` for tokenization and POS tagging; store metadata in `JSON` or `CSV`.
  - Hosting: Publish the dataset on GitHub or Hugging Face Datasets with a clear README explaining collection ethics and structure.
- Deliverable: A public GitHub repository titled something like “FatimaHassan_LinguisticCorpus2025” containing dataset files, a data dictionary, and a short “Methodology.md” file.
- Impact: Demonstrates initiative, technical fluency, and contribution to open linguistic resources — highly valued by MIT and UMN’s computational linguistics faculty.
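The cleaning and storage steps above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library (`re`, `csv`, `json`) in place of `pandas` and `spaCy`, so the tokenizer is a simple regular expression and not a linguistic model. The field names (`id`, `text`, `tokens`, `region`, `source`), the function names, and the sample sentence are all hypothetical placeholders, not a required schema or real data.

```python
import csv
import io
import json
import re
import unicodedata

# Hypothetical column layout for one corpus entry (an assumption, not a standard).
FIELDS = ["id", "text", "tokens", "region", "source"]


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split into words (keeping internal apostrophes and hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", text)


def build_entry(entry_id: int, raw_text: str, region: str, source: str) -> dict:
    """Clean one raw string and attach its metadata."""
    clean = normalize(raw_text)
    return {
        "id": entry_id,
        "text": clean,
        "tokens": tokenize(clean),
        "region": region,
        "source": source,
    }


def to_csv(entries: list[dict]) -> str:
    """Serialize entries as CSV text; tokens are joined with single spaces."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow({**entry, "tokens": " ".join(entry["tokens"])})
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented example sentence, used only to show the output shape.
    sample = build_entry(1, "  Don't  you want\ta hot-dish? ", "placeholder", "invented")
    print(json.dumps(sample, ensure_ascii=False, indent=2))
    print(to_csv([sample]))
```

Keeping the token list in JSON and a space-joined version in CSV is one possible design choice; whichever format you pick, the data dictionary should describe each field so others can reuse the corpus.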
Project 2: GitHub Portfolio for Linguistic Data Analysis
Your GitHub will act as a living record of your technical growth. The committee highlighted that you should create a GitHub portfolio showcasing code from linguistic data analysis or modeling projects. This project builds directly on the corpus work above.
- Goal: Organize and present your code in a way that reflects both clarity and reproducibility.
- Structure:
  - `/datasets/`: raw and processed linguistic data.
  - `/analysis/`: Jupyter notebooks demonstrating exploratory data analysis (EDA), token frequency plots, or semantic clustering.
  - `/models/`: simple NLP models (e.g., word embeddings, sentiment classifiers).
  - `/docs/`: README, methodology, and visualization exports.
- Technical Stack:
  - `Python` (for data processing and modeling).
  - `Jupyter Notebooks` (for interactive visualization).
  - `matplotlib` and `seaborn` (for linguistic data visualization).
  - `scikit-learn` or `spaCy` (for simple NLP models).
- Best Practices:
  - Use clear commit messages (e.g., “Added token frequency analysis for corpus_v2”).
  - Include data ethics notes if scraping or annotating text.
  - Document dependencies in a `requirements.txt`.
- Deliverable: A GitHub portfolio ready to link in applications and research correspondence. The repository should feel professional — not a collection of school assignments, but a cohesive research workspace.
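As one example of what an `/analysis/` notebook might contain, the token frequency analysis mentioned above could start from a sketch like this. It uses only `collections.Counter` from the standard library; the function names and the tiny two-sentence input are hypothetical, and a real notebook would load tokens from your own corpus files and plot the counts with `matplotlib` or `seaborn`.

```python
from collections import Counter


def token_frequencies(token_lists, lowercase=True):
    """Count word tokens across a corpus given as a list of token lists.

    Tokens with no letter or digit (pure punctuation) are skipped.
    """
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            if not any(ch.isalnum() for ch in token):
                continue
            counts[token.lower() if lowercase else token] += 1
    return counts


def type_token_ratio(counts):
    """Return distinct word types divided by total word tokens (0.0 if empty)."""
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


if __name__ == "__main__":
    # Invented toy corpus, used only to show how the functions are called.
    corpus = [["The", "cat", "sat", "."], ["the", "dog", "sat"]]
    freqs = token_frequencies(corpus)
    print(freqs.most_common(3))
    print(round(type_token_ratio(freqs), 3))
```

Type-token ratio is sensitive to corpus size, so it is best compared only across samples of similar length.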
Project 3: Poster or Presentation on Computational Linguistics Research
The committee also advised that you design a poster or presentation summarizing computational linguistics research for local or online conferences. This project converts your technical work into academic communication — a skill that MIT and UMN value highly.
- Goal: Create a visual and verbal summary of your corpus or analysis project suitable for a student research fair or online symposium.
- Content:
  - Title and abstract describing your linguistic focus.
  - Visuals: word frequency heatmaps, dependency parse trees, or semantic network graphs.
  - Methodology: explain data collection, preprocessing, and modeling steps.
  - Findings: highlight patterns or anomalies discovered in your dataset.
- Tools:
  - `LaTeX` with `beamer` for professional poster formatting, or `Canva` for simpler design.
  - `matplotlib` and `Graphviz` for linguistic visualizations.
- Deliverable: A polished poster PDF and slide deck uploaded to your GitHub or linked portfolio site. If your school hosts a research fair, consider submitting — otherwise, explore online venues such as undergraduate linguistics symposia.
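For the dependency parse trees listed under Visuals, one possible approach is to write the tree out in Graphviz's DOT language and render it with the `dot` command-line tool. The sketch below assumes you already have tokens and head-to-dependent arcs, for instance exported from a parser or annotated by hand; the function name, the input format, and the three-word example are illustrative assumptions only.

```python
def dependency_to_dot(tokens, arcs):
    """Build a Graphviz DOT description of a dependency parse.

    tokens: list of word strings.
    arcs: list of (head_index, dependent_index, relation_label) tuples.
    """
    lines = ["digraph parse {", "  node [shape=plaintext];"]
    for index, token in enumerate(tokens):
        # Escape double quotes so the label stays a valid DOT string.
        label = token.replace('"', '\\"')
        lines.append(f'  n{index} [label="{label}"];')
    for head, dependent, relation in arcs:
        lines.append(f'  n{head} -> n{dependent} [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written toy annotation, used only to show the output format.
    words = ["She", "reads", "books"]
    relations = [(1, 0, "nsubj"), (1, 2, "obj")]
    print(dependency_to_dot(words, relations))
```

Saving the printed text as, say, `parse.dot` and running `dot -Tpdf parse.dot -o parse.pdf` should produce a figure that can be placed on the poster, assuming Graphviz is installed.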
Integration Strategy — Connecting Projects into a Cohesive Portfolio
These three projects should not stand alone. Together, they form a narrative arc: data creation → analysis → communication. This progression mirrors how computational linguistics research develops in professional settings and gives your application tangible evidence of independent inquiry.
- Step 1: Begin with the dataset — this is your foundation.
- Step 2: Use the dataset to generate analytical notebooks and model prototypes.
- Step 3: Translate those findings into a visual presentation or poster.
- Step 4: Host everything on GitHub with clear documentation and links to your poster PDFs.
This structure allows admissions readers to see both your programming and linguistic reasoning skills in one coherent portfolio.
Portfolio Presentation Tips
- Keep your GitHub repositories public and organized; use descriptive titles and concise READMEs.
- Include a “Research Overview.md” file summarizing your computational linguistics interests and linking all projects.
- When ready, consider building a simple personal website (using `GitHub Pages` or `Notion`) to host your portfolio and poster.
- Include ethical and methodological reflections; MIT and UMN appreciate awareness of data bias and linguistic diversity.
Monthly Action Plan (March–September)
| Month | Actions | Target Outcome |
|---|---|---|
| March | | Project framework defined; GitHub initialized. |
| April | | Corpus draft completed; initial analysis notebook published. |
| May | | Functional analysis pipeline established. |
| June | | Poster draft completed; feedback incorporated. |
| July | | Public dataset and poster ready for distribution. |
| August | | Portfolio presentation completed; application-ready. |
| September | | Finalized portfolio integrated into admissions narrative. |
Closing Perspective
Fatima, by completing these projects, you will create a portfolio that embodies both linguistic sensitivity and computational precision. MIT will value the technical depth of your dataset and modeling work; West Chester University will appreciate your clarity of linguistic analysis; and the University of Minnesota–Twin Cities will recognize your regional and academic relevance. Each project builds toward a clear, evidence-based story of your intellectual independence — a hallmark of successful applicants in computational linguistics.