
NLP Task: Legal Text Classification

This project has two parts:

  1. Text classification: The dataset contains ~1,000 legal text documents, 900 of which are labeled with their corresponding area of law (e.g., LNIND_1993_DEL_112 is labeled as Criminal Laws). The goal is to predict the correct area of law for the remaining 100 files.
  2. Topic modeling: For a selected area of law, the goal is to extract topics from the documents and thereby analyze the correlation among documents within that area.

Work on improving accuracy is still in progress.

Introduction

In this project I accomplished the following:

  • Text preprocessing
  • Feature extraction and evaluation
  • Model selection, training, and result comparison
  • Setup pipeline and hyperparameter tuning
  • Topic modeling
  • Data Visualization

Process:

Load Data
Text Cleaning and Preprocessing

Tools/Libraries used: NLTK, LexNLP

Used WordNetLemmatizer for lemmatization

Removed punctuation and stopwords
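
A minimal preprocessing sketch, assuming NLTK with the punkt, wordnet, and stopwords resources already downloaded; the function and variable names are illustrative, not taken from the repository:

```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and lemmatize."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    kept = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)
```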

Feature Extraction
  • first attempt: TF-IDF

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

  • second attempt: a bag-of-words count matrix first, followed by a TF-IDF transformation (see the sketch below)
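
A hedged sketch of both feature-extraction attempts with scikit-learn; `train_texts` is a placeholder for the cleaned documents, and the `max_features` value is an arbitrary illustration:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# First attempt: TF-IDF computed directly from the cleaned documents.
tfidf_vec = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf_vec.fit_transform(train_texts)

# Second attempt: bag-of-words counts first, then a TF-IDF re-weighting.
count_vec = CountVectorizer(max_features=20000)
X_counts = count_vec.fit_transform(train_texts)
X_bow_tfidf = TfidfTransformer().fit_transform(X_counts)
```
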
Fit Models

Naive Bayes (worst performance): 32.9%

Logistic Regression: 63.1%

  • Used the binary classifier one-vs-rest to handle the multiclass problem (see the model sketch below)

SVM: 57.3%

  • Decomposition first: SVD (TruncatedSVD)
  • Then standardize the data
  • Before standardizing, SVM accuracy was 18.2%

XGBoost: 61.3%

All performance scores above are test-set accuracy.
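
A minimal sketch of the model comparison; the exact hyperparameters, features, and splits used in the project are not recorded in this README, so `X_train`/`X_test` and the settings below are assumptions:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "naive_bayes": MultinomialNB(),
    # Binary logistic regression applied one-vs-rest to the multiclass labels.
    "logistic_regression": LogisticRegression(max_iter=1000, multi_class="ovr"),
    # SVM: reduce the sparse TF-IDF matrix with SVD, then standardize before fitting.
    "svm": make_pipeline(TruncatedSVD(n_components=100), StandardScaler(), SVC()),
    "xgboost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # test-set accuracy
```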

Prediction
Build Pipeline and Tune Hyperparameters

GridSearchCV

I built the pipeline specifically for the Logistic Regression and XGBoost models since they had the highest accuracy to begin with.
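
A hedged sketch of one such pipeline plus grid search (shown here for Logistic Regression); the parameter grid below is illustrative only and is not the grid used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative search space; the project's actual grid may differ.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)
```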

Topic Modeling

Any document seems to be a mixture of topics, especially legal documents. Essentially, topic modeling is a text-clustering problem.

Here I used LDA (Latent Dirichlet Allocation).

Guessing how many topics a file or area of law contains turned out to be difficult.
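
A minimal LDA sketch with scikit-learn, assuming a bag-of-words count matrix over the documents from one selected area of law; `selected_area_texts` is a placeholder, and `n_components` is the topic count that has to be guessed up front, which is exactly the difficulty noted above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Raw counts (not TF-IDF) are the usual input for LDA.
vectorizer = CountVectorizer(max_features=10000, stop_words="english")
doc_term = vectorizer.fit_transform(selected_area_texts)  # placeholder corpus

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # per-document topic proportions
```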

Data Visualization

I used the `mglearn` library to display the top 10 words within each topic.

The pyLDAvis library was used to visualize the topic models.

Last but not least, I used wordcloud to generate a word cloud over the entire set of legal documents for the selected area of law, highlighting the most recurrent terms.
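
A hedged sketch of the three visualizations, reusing the `lda`, `vectorizer`, `doc_term`, and `selected_area_texts` placeholders from the LDA sketch above:

```python
import numpy as np
import mglearn
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in recent pyLDAvis releases
from wordcloud import WordCloud

# Top 10 words per topic, printed with mglearn's helper.
feature_names = np.array(vectorizer.get_feature_names_out())
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
mglearn.tools.print_topics(topics=range(lda.n_components),
                           feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)

# Interactive topic-model visualization with pyLDAvis.
panel = pyLDAvis.sklearn.prepare(lda, doc_term, vectorizer)
pyLDAvis.save_html(panel, "lda_topics.html")

# Word cloud over all documents in the selected area of law.
cloud = WordCloud(background_color="white").generate(" ".join(selected_area_texts))
cloud.to_file("area_of_law_wordcloud.png")
```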


Problems:

  1. After building several models, I realized that legal documents are a specialized domain for natural language processing, requiring different techniques and tools than ordinary text data. I plan to run information extraction on these texts first and then see whether that improves accuracy.
  2. Guessing how many topics a file or area of law contains is difficult.

I noticed there are some powerful packages, such as LexNLP, that address NLP problems in legal documents.

Conclusions/Discussion

In Progress:
  • Information Extraction

Appendix:

Some Useful Links:

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur

LexNLP: Natural language processing and information extraction for legal and regulatory texts

approaching almost any machine learning problem
