Taking reference from the paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts which presented dataset PubMed 200k RCT consisting of ~200,000 labelled data for Randomized Controlled Trial (RCT) abstracts from different medical papers. Using similar model architecture from the paper Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. Making use of PubMed 20k RCT dataset to train the model and able to achieve ~88% accuracy on unseen test data. The dataset can be found at pubmed-rct
The flask application takes as the input RCT abstract text and uses the model to predict one of the 5 categories {"Background", "Methods", "Objective", "Results", "Conclusions"} for each sentence in the abstract, it is then presented in neat, easy to read format.
Create new environment for the application using Anaconda at the root project folder and install the required packages for python
$ conda create -p venv python==3.7 -y
$ conda activate ./venv
$ pip install -r requirements.txt
Use the following command to run the Flask server on localhost
python ./src/app.py
Open the following URL in the browser of your choice
http://localhost:{port}/
Using character embeddings, word embeddings and line number the text appears on relative to the entire abstract as 4 inputs to the model.
Flask application opened in browser where we can insert RCT abstract.
Insert abstract on the left hand side text area and click on "Beautify" button to get categorized easy to read text on the right hand side.
The experimental colab notebook is present at
./src/experiment/SkimRCT.ipynb
.
The model is trained on PubMed 20k, more data can be used to better train the model with PubMed 200k. The data collection from different papers other than just RCTs can also help expand the scope and generalize the model.
The flask app can be hosted on any cloud platform or can be containerized. Can deploy the app as backend and use a separate Angular or React frontend for better user experience.