This repository contains solutions for multiple assignments focused on natural language processing (NLP) tasks using a COVID-19-related dataset. The tasks include pre-processing, clustering, word embeddings, and abstract generation.
- Assignment 1: Text Preprocessing and Zipf's Law
- Assignment 2 & 3: Modified COALS Algorithm
- Assignment 4 & 5: Skipgram Model
- Assignment 6, 7 & 8: Abstract Generation
- **Extract Text Content:**
  - Extract text from the JSON dataset using a Python library (e.g., `json` or `pandas`).
- **Preprocessing:**
  - Perform case-folding, removal of numbers, stopword elimination, tokenization, and stemming/lemmatization.
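A minimal sketch of this pipeline using only the standard library; a real run would use NLTK or spaCy for stopwords and stemming, and the stopword list below is a tiny illustrative subset:

```python
import re

# Illustrative subset only; the assignment would use a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def preprocess(text):
    text = text.lower()                      # case-folding
    text = re.sub(r"\d+", " ", text)         # remove numbers
    tokens = re.findall(r"[a-z]+", text)     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword elimination
    # naive suffix stripping as a stand-in for a real stemmer (e.g., Porter)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(preprocess("The 19 patients are recovering in 2020"))
```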
- **Weighted Term Frequency and Zipf's Law:**
  - Compute term frequencies.
  - Plot frequency vs. rank to validate Zipf's Law.
  - Calculate the exponent α.
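One way to estimate the exponent (a sketch, not the repo's exact method): Zipf's Law says frequency ∝ rank^(−α), so a linear fit on the log-log frequency-rank curve gives −α as the slope:

```python
import numpy as np
from collections import Counter

def zipf_alpha(tokens):
    # Sort term frequencies in descending order to get the rank-frequency curve.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    # log(freq) ~ -alpha * log(rank) + c : fit a line in log-log space.
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope
```

Plotting `np.log(ranks)` against `np.log(freqs)` (e.g., with matplotlib) gives the validation plot; a roughly straight line supports Zipf's Law.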
- **Tokens and Vocabulary:**
  - Count the total tokens and unique vocabulary.
- **Tokens vs. Vocabulary:**
  - Use Heaps' Law to plot vocabulary size as a function of tokens.
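The Heaps' Law curve (vocabulary size V(n) ≈ k·n^β) can be traced by recording the number of distinct words seen after each token; a minimal sketch:

```python
def vocab_growth(tokens):
    # Record the running vocabulary size after each token for plotting.
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth
```

Plotting `growth` against token position (1..n) gives the tokens-vs-vocabulary curve.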
- **Co-occurrence Matrix:**
  - Build a co-occurrence matrix using the ratio of probabilities instead of correlation.
  - Limit the vocabulary size to ~7K words.
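A sketch of one reading of the modified association measure (an assumption about the exact formula): each cell holds the ratio p(w, c) / (p(w)·p(c)) estimated from windowed co-occurrence counts, where values above 1 mean the pair co-occurs more often than chance:

```python
import numpy as np

def ratio_matrix(tokens, vocab, window=2):
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count co-occurrences within a symmetric context window.
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                counts[index[w], index[tokens[j]]] += 1
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # marginal p(w)
    pc = counts.sum(axis=0, keepdims=True) / total   # marginal p(c)
    joint = counts / total                           # joint p(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(counts > 0, joint / (pw * pc), 0.0)
    return ratio
```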
- **Vocabulary Size:**
  - Display the vocabulary size and matrix dimensions.
- **Identify Words:**
  - Select five COVID-19-related nouns and verbs.
- **Word Similarities:**
  - Compute and display five similar words for each selected word using cosine distance.
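A nearest-neighbour sketch over row vectors, which works for both the co-occurrence rows here and the embeddings trained later (`vectors` is any `(vocab_size, dim)` matrix):

```python
import numpy as np

def most_similar(word, vocab, vectors, k=5):
    index = {w: i for i, w in enumerate(vocab)}
    v = vectors[index[word]]
    # Cosine similarity of every row against the query vector.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(v)
    sims = vectors @ v / np.where(norms == 0, 1, norms)
    sims[index[word]] = -np.inf          # exclude the query word itself
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```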
- **Visualization:**
  - Use Multi-Dimensional Scaling (MDS) to visualize concepts. Plot three concepts with up to 10 words per concept.
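Classical (Torgerson) MDS can be sketched in plain NumPy from a pairwise-distance matrix; `sklearn.manifold.MDS` is a drop-in alternative for the actual plots:

```python
import numpy as np

def classical_mds(D, dims=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]    # keep the largest eigenvalues
    # Coordinates = eigenvectors scaled by sqrt of (non-negative) eigenvalues.
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

Feeding in cosine distances between the selected words and scatter-plotting the two returned columns (colored per concept) gives the required figure.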
- **Corpus and Vocabulary:**
  - Extract abstracts and create a vocabulary of ~10,000 words.
- **One-Hot Vectors:**
  - Implement dynamic or pre-created one-hot vectors (OHVs) for training.
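The dynamic variant can be sketched as building each vector on the fly from a word's index, avoiding a pre-stored (V, V) identity matrix:

```python
import numpy as np

def one_hot(word, index, vocab_size):
    # Build the OHV on demand from the word's vocabulary index.
    v = np.zeros(vocab_size)
    v[index[word]] = 1.0
    return v
```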
- **Model Architecture:**
  - Describe the neural network architecture for the Skipgram model.
- **Training:**
  - Use Stochastic Gradient Descent (SGD) for optimization.
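One SGD step for Skipgram with the naive softmax can be sketched as follows (an illustration, not the repo's exact code; `W_in` is the `(V, d)` input embedding matrix and `W_out` the `(d, V)` output weights):

```python
import numpy as np

def sgd_step(W_in, W_out, center, context, lr=0.1):
    h = W_in[center].copy()                   # hidden layer = center embedding
    scores = h @ W_out                        # logits over the vocabulary
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # naive softmax
    loss = -np.log(p[context])
    dscores = p.copy()
    dscores[context] -= 1.0                   # softmax cross-entropy gradient
    W_in[center] -= lr * (W_out @ dscores)    # backprop to the input embedding
    W_out -= lr * np.outer(h, dscores)
    return loss
```

Accumulating `loss` per epoch gives the data for the epoch-vs-error plot below.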
- **Epoch vs. Training Error:**
  - Plot the relationship between epochs and training error.
- **Negative Sampling:**
  - Replace naive softmax with negative sampling (5 negative samples per instance).
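A sketch of one negative-sampling update (hedged: sampling of the 5 negatives and the exact update schedule are left to the caller): one sigmoid term for the true `(center, context)` pair plus one per negative word, instead of a full-vocabulary softmax:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(W_in, W_out, center, context, negatives, lr=0.1):
    h = W_in[center].copy()
    loss, grad_h = 0.0, np.zeros_like(h)
    # Positive pair gets label 1, each sampled negative gets label 0.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        s = sigmoid(h @ W_out[:, word])
        loss -= np.log(s if label else 1.0 - s)
        g = s - label                          # d loss / d score
        grad_h += g * W_out[:, word]
        W_out[:, word] -= lr * g * h
    W_in[center] -= lr * grad_h
    return loss
```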
- **Word Analogies:**
  - Test analogies with two COVID-19-related words (avoiding commonly restricted terms).
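Analogies ("a is to b as c is to ?") are typically answered with the vector offset b − a + c and a cosine nearest-neighbour lookup; a sketch:

```python
import numpy as np

def analogy(a, b, c, vocab, vectors):
    index = {w: i for i, w in enumerate(vocab)}
    q = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    for i in np.argsort(-sims):
        if vocab[i] not in (a, b, c):          # skip the query words
            return vocab[i]
```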
- **Matrix Comparisons:**
  - Analyze similarities using `W_in`, `W_out`, and their combinations.
- **Architecture:**
  - Implement a two-layer vanilla RNN, LSTM, or GRU model.
  - Support both forward and backward passes.
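The two-layer vanilla (tanh) RNN variant can be sketched as below; only the forward pass is shown, the backward pass mirrors it with backpropagation through time:

```python
import numpy as np

class TwoLayerRNN:
    def __init__(self, vocab_size, hidden, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.Wx1 = rng.normal(scale=s, size=(hidden, vocab_size))
        self.Wh1 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wx2 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wh2 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wy = rng.normal(scale=s, size=(vocab_size, hidden))

    def forward(self, xs):
        h1 = np.zeros(self.Wh1.shape[0])
        h2 = np.zeros(self.Wh2.shape[0])
        logits = []
        for x in xs:                       # x: one-hot input vector
            h1 = np.tanh(self.Wx1 @ x + self.Wh1 @ h1)   # layer 1 recurrence
            h2 = np.tanh(self.Wx2 @ h1 + self.Wh2 @ h2)  # layer 2 recurrence
            logits.append(self.Wy @ h2)    # per-step vocabulary logits
        return np.array(logits)
```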
- **Training:**
  - Train the model using 100 abstracts initially.
  - Plot the error graph during training.
- **Abstract Generation:**
  - Use the trained model to generate at least three abstracts.
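Generation can be sketched as repeatedly sampling the next word from the softmax of the model's logits and feeding it back in; `step` here is a hypothetical callable mapping `(previous word id, state)` to `(logits, state)`, standing in for the trained RNN:

```python
import numpy as np

def generate(step, start_id, vocab, length=20, seed=0):
    rng = np.random.default_rng(seed)
    state, word, out = None, start_id, []
    for _ in range(length):
        logits, state = step(word, state)
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax over the vocabulary
        word = rng.choice(len(vocab), p=p)   # sample the next word
        out.append(vocab[word])
    return " ".join(out)
```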
- **Discussion:**
  - Identify factors contributing to sub-optimal results.
  - Suggest improvements for better performance.