This repository contains solutions for multiple assignments focused on natural language processing (NLP) tasks using a COVID-19-related dataset. The tasks include pre-processing, clustering, word embeddings, and abstract generation.
- Assignment 1: Text Preprocessing and Zipf's Law
- Assignment 2 & 3: Modified COALS Algorithm
- Assignment 4 & 5: Skipgram Model
- Assignment 6, 7 & 8: Abstract Generation
- **Extract Text Content:**
  - Extract text from the JSON dataset using a Python library (e.g., `json` or `pandas`).
- **Preprocessing:**
  - Perform case-folding, removal of numbers, stopword elimination, tokenization, and stemming/lemmatization.
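A minimal sketch of this pipeline using only the standard library; a real run would use NLTK or spaCy for stopwords and stemming, and the stopword list below is a tiny illustrative subset:

```python
import re

# Illustrative subset only; the assignment would use a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def preprocess(text):
    text = text.lower()                      # case-folding
    text = re.sub(r"\d+", " ", text)         # remove numbers
    tokens = re.findall(r"[a-z]+", text)     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword elimination
    # naive suffix stripping as a stand-in for a real stemmer (e.g., Porter)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(preprocess("The 19 patients are recovering in 2020"))
```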
- **Weighted Term Frequency and Zipf's Law:**
  - Compute term frequencies.
  - Plot frequency vs. rank to validate Zipf's Law.
  - Calculate the exponent α.
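One way to estimate the exponent (a sketch, not the repo's exact method): Zipf's Law says frequency ∝ rank^(−α), so a linear fit on the log-log frequency-rank curve gives −α as the slope:

```python
import numpy as np
from collections import Counter

def zipf_alpha(tokens):
    # Sort term frequencies in descending order to get the rank-frequency curve.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    # log(freq) ~ -alpha * log(rank) + c : fit a line in log-log space.
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope
```

Plotting `np.log(ranks)` against `np.log(freqs)` (e.g., with matplotlib) gives the validation plot; a roughly straight line supports Zipf's Law.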
- **Tokens and Vocabulary:**
  - Count the total tokens and unique vocabulary.
- **Tokens vs. Vocabulary:**
  - Use Heaps' Law to plot vocabulary size as a function of tokens.
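The Heaps' Law curve (vocabulary size V(n) ≈ k·n^β) can be traced by recording the number of distinct words seen after each token; a minimal sketch:

```python
def vocab_growth(tokens):
    # Record the running vocabulary size after each token for plotting.
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth
```

Plotting `growth` against token position (1..n) gives the tokens-vs-vocabulary curve.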
- **Co-occurrence Matrix:**
  - Build a co-occurrence matrix using the ratio of probabilities instead of correlation.
  - Limit the vocabulary size to ~7K words.
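A sketch of one reading of the modified association measure (an assumption about the exact formula): each cell holds the ratio p(w, c) / (p(w)·p(c)) estimated from windowed co-occurrence counts, where values above 1 mean the pair co-occurs more often than chance:

```python
import numpy as np

def ratio_matrix(tokens, vocab, window=2):
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count co-occurrences within a symmetric context window.
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                counts[index[w], index[tokens[j]]] += 1
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # marginal p(w)
    pc = counts.sum(axis=0, keepdims=True) / total   # marginal p(c)
    joint = counts / total                           # joint p(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(counts > 0, joint / (pw * pc), 0.0)
    return ratio
```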
- **Vocabulary Size:**
  - Display the vocabulary size and matrix dimensions.
- **Identify Words:**
  - Select five COVID-19-related nouns and verbs.
- **Word Similarities:**
  - Compute and display five similar words for each selected word using cosine distance.
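A nearest-neighbour sketch over row vectors, which works for both the co-occurrence rows here and the embeddings trained later (`vectors` is any `(vocab_size, dim)` matrix):

```python
import numpy as np

def most_similar(word, vocab, vectors, k=5):
    index = {w: i for i, w in enumerate(vocab)}
    v = vectors[index[word]]
    # Cosine similarity of every row against the query vector.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(v)
    sims = vectors @ v / np.where(norms == 0, 1, norms)
    sims[index[word]] = -np.inf          # exclude the query word itself
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```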
- **Visualization:**
  - Use Multi-Dimensional Scaling (MDS) to visualize concepts. Plot three concepts with up to 10 words per concept.
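Classical (Torgerson) MDS can be sketched in plain NumPy from a pairwise-distance matrix; `sklearn.manifold.MDS` is a drop-in alternative for the actual plots:

```python
import numpy as np

def classical_mds(D, dims=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]    # keep the largest eigenvalues
    # Coordinates = eigenvectors scaled by sqrt of (non-negative) eigenvalues.
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

Feeding in cosine distances between the selected words and scatter-plotting the two returned columns (colored per concept) gives the required figure.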
- **Corpus and Vocabulary:**
  - Extract abstracts and create a vocabulary of ~10,000 words.
- **One-Hot Vectors:**
  - Implement dynamic or pre-created one-hot vectors (OHVs) for training.
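The dynamic variant can be sketched as building each vector on the fly from a word's index, avoiding a pre-stored (V, V) identity matrix:

```python
import numpy as np

def one_hot(word, index, vocab_size):
    # Build the OHV on demand from the word's vocabulary index.
    v = np.zeros(vocab_size)
    v[index[word]] = 1.0
    return v
```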
- **Model Architecture:**
  - Describe the neural network architecture for the Skipgram model.
- **Training:**
  - Use Stochastic Gradient Descent (SGD) for optimization.
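One SGD step for Skipgram with the naive softmax can be sketched as follows (an illustration, not the repo's exact code; `W_in` is the `(V, d)` input embedding matrix and `W_out` the `(d, V)` output weights):

```python
import numpy as np

def sgd_step(W_in, W_out, center, context, lr=0.1):
    h = W_in[center].copy()                   # hidden layer = center embedding
    scores = h @ W_out                        # logits over the vocabulary
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # naive softmax
    loss = -np.log(p[context])
    dscores = p.copy()
    dscores[context] -= 1.0                   # softmax cross-entropy gradient
    W_in[center] -= lr * (W_out @ dscores)    # backprop to the input embedding
    W_out -= lr * np.outer(h, dscores)
    return loss
```

Accumulating `loss` per epoch gives the data for the epoch-vs-error plot below.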
- **Epoch vs. Training Error:**
  - Plot the relationship between epochs and training error.
- **Negative Sampling:**
  - Replace naive softmax with negative sampling (5 negative samples per instance).
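A sketch of one negative-sampling update (hedged: sampling of the 5 negatives and the exact update schedule are left to the caller): one sigmoid term for the true `(center, context)` pair plus one per negative word, instead of a full-vocabulary softmax:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(W_in, W_out, center, context, negatives, lr=0.1):
    h = W_in[center].copy()
    loss, grad_h = 0.0, np.zeros_like(h)
    # Positive pair gets label 1, each sampled negative gets label 0.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        s = sigmoid(h @ W_out[:, word])
        loss -= np.log(s if label else 1.0 - s)
        g = s - label                          # d loss / d score
        grad_h += g * W_out[:, word]
        W_out[:, word] -= lr * g * h
    W_in[center] -= lr * grad_h
    return loss
```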
- **Word Analogies:**
  - Test analogies with two COVID-19-related words (avoiding commonly restricted terms).
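Analogies ("a is to b as c is to ?") are typically answered with the vector offset b − a + c and a cosine nearest-neighbour lookup; a sketch:

```python
import numpy as np

def analogy(a, b, c, vocab, vectors):
    index = {w: i for i, w in enumerate(vocab)}
    q = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    for i in np.argsort(-sims):
        if vocab[i] not in (a, b, c):          # skip the query words
            return vocab[i]
```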
- **Matrix Comparisons:**
  - Analyze similarities using `W_in`, `W_out`, and their combinations.
- **Architecture:**
  - Implement a two-layer vanilla RNN, LSTM, or GRU model.
  - Support both forward and backward passes.
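The two-layer vanilla (tanh) RNN variant can be sketched as below; only the forward pass is shown, the backward pass mirrors it with backpropagation through time:

```python
import numpy as np

class TwoLayerRNN:
    def __init__(self, vocab_size, hidden, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.Wx1 = rng.normal(scale=s, size=(hidden, vocab_size))
        self.Wh1 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wx2 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wh2 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wy = rng.normal(scale=s, size=(vocab_size, hidden))

    def forward(self, xs):
        h1 = np.zeros(self.Wh1.shape[0])
        h2 = np.zeros(self.Wh2.shape[0])
        logits = []
        for x in xs:                       # x: one-hot input vector
            h1 = np.tanh(self.Wx1 @ x + self.Wh1 @ h1)   # layer 1 recurrence
            h2 = np.tanh(self.Wx2 @ h1 + self.Wh2 @ h2)  # layer 2 recurrence
            logits.append(self.Wy @ h2)    # per-step vocabulary logits
        return np.array(logits)
```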
- **Training:**
  - Train the model using 100 abstracts initially.
  - Plot the error graph during training.
- **Abstract Generation:**
  - Use the trained model to generate at least three abstracts.
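Generation can be sketched as repeatedly sampling the next word from the softmax of the model's logits and feeding it back in; `step` here is a hypothetical callable mapping `(previous word id, state)` to `(logits, state)`, standing in for the trained RNN:

```python
import numpy as np

def generate(step, start_id, vocab, length=20, seed=0):
    rng = np.random.default_rng(seed)
    state, word, out = None, start_id, []
    for _ in range(length):
        logits, state = step(word, state)
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax over the vocabulary
        word = rng.choice(len(vocab), p=p)   # sample the next word
        out.append(vocab[word])
    return " ".join(out)
```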
- **Discussion:**
  - Identify factors contributing to sub-optimal results.
  - Suggest improvements for better performance.