
NLP Task: Legal Text Classification

This project has two parts:

  1. Text classification: The dataset contains ~1,000 legal text documents, 900 of which are labeled with their corresponding area of law (e.g., LNIND_1993_DEL_112 is labeled as Criminal Laws). The goal is to predict the correct area of law for the remaining 100 files.
  2. Topic modeling: For a selected area of law, the goal is to extract topics from the documents and thereby analyze the correlation among documents within that area.

Work on improving accuracy is still in progress.

Introduction

In this project I accomplished the following:

  • Text preprocessing
  • Feature extraction and evaluation
  • Model selection, training, and result comparison
  • Setup pipeline and hyperparameter tuning
  • Topic modeling
  • Data Visualization

Process:

Load Data
Text Cleaning and Preprocessing

Tools/Libraries used: NLTK, LexNLP

Used WordNetLemmatizer for lemmatization

Removed punctuation and stopwords
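
A minimal preprocessing sketch, assuming NLTK with the punkt, wordnet, and stopwords resources already downloaded; the function and variable names are illustrative, not taken from the repository:

```python
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and lemmatize."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    kept = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)
```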

Feature Extraction
  • first attempt: TF-IDF

Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

  • second attempt: a bag-of-words count matrix first, followed by a TF-IDF transformation (see the sketch below)
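
A hedged sketch of both feature-extraction attempts with scikit-learn; `train_texts` is a placeholder for the cleaned documents, and the `max_features` value is an arbitrary illustration:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# First attempt: TF-IDF computed directly from the cleaned documents.
tfidf_vec = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf_vec.fit_transform(train_texts)

# Second attempt: bag-of-words counts first, then a TF-IDF re-weighting.
count_vec = CountVectorizer(max_features=20000)
X_counts = count_vec.fit_transform(train_texts)
X_bow_tfidf = TfidfTransformer().fit_transform(X_counts)
```
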
Fit Models

Naive Bayes (worst performance): 32.9%

Logistic Regression: 63.1%

  • Used the binary classifier one-vs-rest to handle the multiclass problem (see the model sketch below)

SVM: 57.3%

  • Decomposition first: SVD (TruncatedSVD)
  • Then standardize the data
  • Before standardizing, SVM accuracy was 18.2%

XGBoost: 61.3%

All performance scores above are test-set accuracy.
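
A minimal sketch of the model comparison; the exact hyperparameters, features, and splits used in the project are not recorded in this README, so `X_train`/`X_test` and the settings below are assumptions:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "naive_bayes": MultinomialNB(),
    # Binary logistic regression applied one-vs-rest to the multiclass labels.
    "logistic_regression": LogisticRegression(max_iter=1000, multi_class="ovr"),
    # SVM: reduce the sparse TF-IDF matrix with SVD, then standardize before fitting.
    "svm": make_pipeline(TruncatedSVD(n_components=100), StandardScaler(), SVC()),
    "xgboost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # test-set accuracy
```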

Prediction
Build Pipeline and Tune Hyperparameters

GridSearchCV

I built the pipeline specifically for the Logistic Regression and XGBoost models since they had the highest accuracy to begin with.
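
A hedged sketch of one such pipeline plus grid search (shown here for Logistic Regression); the parameter grid below is illustrative only and is not the grid used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative search space; the project's actual grid may differ.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)
```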

Topic Modeling

Any document seems to be a mixture of topics, especially legal documents. Essentially, topic modeling is a text-clustering problem.

Here I used LDA (Latent Dirichlet Allocation).

Guessing how many topics a file or area of law contains turned out to be difficult.
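
A minimal LDA sketch with scikit-learn, assuming a bag-of-words count matrix over the documents from one selected area of law; `selected_area_texts` is a placeholder, and `n_components` is the topic count that has to be guessed up front, which is exactly the difficulty noted above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Raw counts (not TF-IDF) are the usual input for LDA.
vectorizer = CountVectorizer(max_features=10000, stop_words="english")
doc_term = vectorizer.fit_transform(selected_area_texts)  # placeholder corpus

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # per-document topic proportions
```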

Data Visualization

I used the `mglearn` library to display the top 10 words within each topic.

The pyLDAvis library was used to visualize the topic models.

Last but not least, I used wordcloud to generate a word cloud over the entire set of legal documents for the selected area of law, highlighting the most recurrent terms.
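
A hedged sketch of the three visualizations, reusing the `lda`, `vectorizer`, `doc_term`, and `selected_area_texts` placeholders from the LDA sketch above:

```python
import numpy as np
import mglearn
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in recent pyLDAvis releases
from wordcloud import WordCloud

# Top 10 words per topic, printed with mglearn's helper.
feature_names = np.array(vectorizer.get_feature_names_out())
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
mglearn.tools.print_topics(topics=range(lda.n_components),
                           feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)

# Interactive topic-model visualization with pyLDAvis.
panel = pyLDAvis.sklearn.prepare(lda, doc_term, vectorizer)
pyLDAvis.save_html(panel, "lda_topics.html")

# Word cloud over all documents in the selected area of law.
cloud = WordCloud(background_color="white").generate(" ".join(selected_area_texts))
cloud.to_file("area_of_law_wordcloud.png")
```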


Problems:

  1. After building several models, I realized that legal documents are a specialized domain for natural language processing, requiring different techniques and tools than ordinary text data. I plan to run information extraction on these texts first and then see whether that improves accuracy.
  2. Guessing how many topics a file or area of law contains is difficult.

I noticed there are some powerful packages, such as LexNLP, that address NLP problems in legal documents.

Conclusions/Discussion

In Progress:
  • Information Extraction

Appendix:

Some Useful Links:

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur

LexNLP: Natural language processing and information extraction for legal and regulatory texts

approaching almost any machine learning problem
