heeh/legal_advice

Introduction

  • This program classifies legal issues by assigning a binary label for each class in the National Subject Matter Index (NSMI, https://nsmi.lsntap.org/browse-v2). A "category" is one of the 20 top-level indexes. A "class" is a subcategory under a category.

Data

Preprocessing

  • Each article has a binary value (0 or 1) that indicates whether the article is related to a specific legal class.
  • We ignore unlabeled entries when constructing a model.
  • Among the 100+ classes in NSMI-v2, we extracted the 36 classes that have more than 10 positive submissions.
  • After training the classifier, we kept the 16 classes that have reasonable recall.
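
The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the article layout (a dict with `text` and `labels`, where `None` marks an unlabeled entry) and the helper names `select_classes` and `labeled_examples` are assumptions made for the example.

```python
from collections import Counter

MIN_POSITIVES = 10  # keep classes with MORE than this many positive submissions


def select_classes(articles, min_positives=MIN_POSITIVES):
    """Return the class names that have more than `min_positives` positives.

    `articles` is a hypothetical list of dicts shaped like
    {"text": str, "labels": {class_name: 0, 1, or None}}.
    """
    positives = Counter()
    for article in articles:
        for class_name, value in article["labels"].items():
            if value == 1:
                positives[class_name] += 1
    return sorted(name for name, count in positives.items() if count > min_positives)


def labeled_examples(articles, class_name):
    """Return (texts, labels) for one class, skipping unlabeled entries."""
    texts, labels = [], []
    for article in articles:
        value = article["labels"].get(class_name)
        if value is None:  # unlabeled for this class, so it is ignored
            continue
        texts.append(article["text"])
        labels.append(value)
    return texts, labels
```

Calling `select_classes(articles)` gives the candidate classes, and `labeled_examples(articles, name)` gives the per-class training data for one of them.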

Classifier

  • First, the program converts each text into a tf-idf vector.
  • We then fit an sklearn LogisticRegression model with an L1 penalty on these vectors.
  • We evaluate the model with 10-fold cross-validation.
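
A minimal sketch of this pipeline for a single class, assuming scikit-learn is installed. The `liblinear` solver, `C=1.0`, the default `TfidfVectorizer` settings, and the function names are assumptions chosen for illustration. The repository's actual hyperparameters are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline


def build_classifier():
    """tf-idf features followed by L1-penalized logistic regression."""
    return make_pipeline(
        TfidfVectorizer(),
        # liblinear is one of the solvers that supports the L1 penalty;
        # C=1.0 is a placeholder, not a value taken from the project.
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    )


def cross_validated_recall(texts, labels, n_folds=10):
    """Recall for one class, computed from out-of-fold predictions."""
    # Each text is predicted by a model that did not see it during training.
    predictions = cross_val_predict(build_classifier(), texts, labels, cv=n_folds)
    return recall_score(labels, predictions)
```

Running `cross_validated_recall` once per class gives the per-class recall that can be used to decide which classes to keep.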

Test

  • We apply the classifier to 900,000 Reddit submissions.
  • The classifier produces both hard labels (0 or 1) and soft labels (predicted probabilities).
  • We estimate the final prevalence of each class using freq-e (https://github.com/slanglab/freq-e).
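
To make the difference between hard and soft labels concrete, here is a small sketch with two naive prevalence estimates. It does not use or reproduce freq-e, which implements its own estimator. The 0.5 threshold and the function names are assumptions for illustration, and the probabilities would come from something like the classifier's `predict_proba` output.

```python
def hard_labels(probabilities, threshold=0.5):
    """Turn soft labels (probabilities) into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def naive_prevalence(probabilities, threshold=0.5):
    """Two simple baseline estimates of the share of positive documents."""
    hard = hard_labels(probabilities, threshold)
    return {
        # fraction of documents whose hard label is 1
        "from_hard_labels": sum(hard) / len(hard),
        # mean predicted probability over all documents
        "from_soft_labels": sum(probabilities) / len(probabilities),
    }
```

The two numbers can differ noticeably when many probabilities sit near the threshold, which is one reason to use a dedicated prevalence estimator rather than counting hard labels.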