kunal-kumar-chaudhary/SkimLit

SkimLit (Sequence Classification in Medical Abstracts)

This repository contains code for training and evaluating deep learning models for sequence classification in medical abstracts. The models classify each sentence of an abstract into one of a fixed set of categories based on its content.

Overview

The provided code implements several deep learning models using TensorFlow and Keras to classify medical abstracts. The models utilize token embeddings, character embeddings, and hybrid embeddings to encode the textual data and make predictions.

The following models are included:

  1. Conv1D with Token Embeddings: This model uses token embeddings generated by a pre-trained TensorFlow Hub Universal Sentence Encoder. It employs a 1D convolutional layer for sequence processing.

  2. Feature Extraction with Pretrained Token Embeddings: This model utilizes the Universal Sentence Encoder (USE) from TensorFlow Hub as a feature extractor. It extracts high-level features from the text to make predictions.

  3. Conv1D with Character Embeddings: This model employs character-level embeddings and a 1D convolutional layer. It processes the sequences at a character level to capture finer-grained information.

  4. Combining Pretrained Token Embeddings and Character Embeddings (Hybrid Embedding Layer): This model combines token embeddings from the Universal Sentence Encoder with character-level embeddings using a hybrid token embedding layer. It aims to leverage both token and character-level information for improved performance.
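The token-level and character-level models above differ mainly in how a sentence is split before it is embedded. The following is a minimal sketch in plain Python, assuming whitespace tokenization and a space-separated character string as the input to a character vectorizer. `split_tokens` and `split_chars` are hypothetical helper names used for illustration, not functions confirmed to exist in this repository.

```python
from typing import List


def split_tokens(sentence: str) -> List[str]:
    """Split a sentence into lowercase, whitespace-delimited tokens."""
    return sentence.lower().split()


def split_chars(sentence: str) -> str:
    """Return the sentence as space-separated characters.

    A character-level vectorizer can then treat each character as a "word".
    """
    return " ".join(list(sentence))


sentence = "Patients were randomized"
tokens = split_tokens(sentence)   # input view for a token-level model
chars = split_chars(sentence)     # input view for a character-level model

# A hybrid model would receive both views of the same sentence (assumption).
hybrid_input = (sentence, chars)

print(tokens)
print(chars)
```

Feeding both views to one model is what lets a hybrid architecture use word meaning and sub-word spelling cues at the same time.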

Usage

  1. Clone the repository:

    git clone https://github.com/your_username/sequence-classification-medical-abstracts.git
  2. Install the necessary dependencies.

  • You can skip this step if you run the notebook on Google Colaboratory.

PubMed RCT Dataset

The PubMed RCT (Randomized Controlled Trial) dataset is a collection of medical abstracts obtained from PubMed. It consists of abstracts from randomized controlled trials and has been pre-processed for natural language processing tasks, including sentence classification.

Dataset Overview

  • Objective: The dataset aims to facilitate research in the domain of biomedical natural language processing by providing a standardized collection of abstracts from randomized controlled trials.

  • Content: The dataset includes abstracts from various medical research papers, with each abstract containing information about the study's objective, methods, results, and conclusion.

  • Labels: The sentences in the abstracts are classified into five categories:

    1. Objective
    2. Methods
    3. Results
    4. Conclusion
    5. Background (context and prior work that motivate the study)
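Before training, category names such as these are usually turned into numeric targets. The following is a minimal sketch in plain Python, assuming the labels arrive as uppercase strings. The sample labels and the helpers `build_label_index` and `one_hot` are illustrative assumptions, not code taken from this repository.

```python
def build_label_index(labels):
    """Map each distinct label to an integer id, using sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(label, label_index):
    """Return a one-hot list with a 1 at the position of `label`."""
    vector = [0] * len(label_index)
    vector[label_index[label]] = 1
    return vector


# Illustrative labels only; a real run would read them from the dataset files.
sample_labels = ["OBJECTIVE", "METHODS", "METHODS", "RESULTS", "CONCLUSIONS"]

label_index = build_label_index(sample_labels)
encoded = [one_hot(label, label_index) for label in sample_labels]

print(label_index)
print(encoded[0])
```

Building the index from the labels actually seen in the data avoids hard-coding a category list that may not match the files.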

Accessing the Data

The dataset is available on GitHub: PubMed RCT Dataset.

Researchers and practitioners interested in biomedical natural language processing can utilize this dataset for various tasks such as text classification, summarization, information extraction, and more.
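For sentence classification, the raw text files first have to be turned into labelled sentences. The following is a minimal sketch, assuming a file layout in which each abstract starts with a `###<id>` line, each sentence is stored as `LABEL<TAB>text`, and a blank line ends the abstract. That layout, the helper name `parse_abstracts`, and the example lines are assumptions for illustration; check them against the files you download.

```python
def parse_abstracts(lines):
    """Convert raw dataset lines into a list of labelled sentence records."""
    samples = []
    abstract_id = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("###"):
            # Start of a new abstract: remember its id (assumed format).
            abstract_id = line[3:]
        elif line:
            # Sentence line: label and text separated by the first tab.
            label, _, text = line.partition("\t")
            samples.append(
                {"abstract_id": abstract_id, "target": label, "text": text}
            )
        # Blank lines only separate abstracts, so they are skipped.
    return samples


# Made-up example lines in the assumed layout, not real dataset content.
example_lines = [
    "###0001\n",
    "OBJECTIVE\tTo assess the effect of a treatment .\n",
    "METHODS\tParticipants were randomly assigned to two groups .\n",
    "\n",
]

samples = parse_abstracts(example_lines)
for sample in samples:
    print(sample["target"], "|", sample["text"])
```

With a real file, the same function can be called on `open(path).readlines()`.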
