Create high-quality instruct fine-tuning datasets for LLMs from just a topic or website

adithya-s-k/Topic2Dataset

Topic2Dataset

👋🏼Introduction

Data collection and preprocessing can be time-consuming and resource-intensive. Topic2Dataset is an open-source project that automates dataset generation from specified topics and websites, eliminating manual data gathering and shortening dataset creation timelines. The project mainly aims to curate datasets for Retrieval-Augmented Generation (RAG) and LLM fine-tuning.

🔑Key Features

  • Effortless Dataset Generation: Generate datasets for pre-training, fine-tuning, and preference modeling with ease.
  • Tailored Instruction and Data: Upload a document and receive customized instructions and data for fine-tuning models.
  • AI-Powered Scraping: Live AI agent-based scrapers ensure relevance, accuracy, and up-to-date data extraction.
  • Time-Dependent Scraping: Keep datasets fresh and relevant with time-dependent scraping capabilities.
  • Seamless Data Export: Export datasets to S3 and other blob storage services effortlessly for further analysis and usage.
  • Insightful Summarization: Receive answers to questions and insights relevant to the document's content for enhanced understanding.

Getting Started

Demo Notebook: Colab

Python Setup🐍

Clone the repository

git clone https://github.com/adithya-s-k/topic2dataset
cd topic2dataset

Install Required Dependencies

pip install -r requirements.txt

Quick Start dataset generation

python generate.py --topic "Clinical Trials" --website '["https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/", "https://clinregs.niaid.nih.gov/#"]' --time_in_hrs 10 --output local
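
The quick-start command takes a multi-word topic and a bracketed list of URLs, both of which a shell will split into separate arguments unless they are quoted. The following is a minimal sketch of building the same invocation from Python so that quoting is handled for you. It assumes that generate.py accepts the website list as a single JSON-style string, which this README does not confirm; the helper name build_command is hypothetical and is not part of the project.

```python
import json
import shlex


def build_command(topic, websites, time_in_hrs, output):
    """Build the argument list for generate.py (hypothetical helper).

    Assumption: --website takes the URL list as one JSON-style string.
    Check generate.py's argument parsing before relying on this.
    """
    return [
        "python", "generate.py",
        "--topic", topic,                  # kept as one argument, spaces and all
        "--website", json.dumps(websites), # list serialized into a single string
        "--time_in_hrs", str(time_in_hrs),
        "--output", output,
    ]


cmd = build_command(
    topic="Clinical Trials",
    websites=[
        "https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/",
        "https://clinregs.niaid.nih.gov/#",
    ],
    time_in_hrs=10,
    output="local",
)

# Print a shell-safe version of the command for copy-pasting.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) from the repository root would run the script without going through a shell, so no manual quoting is needed.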

Contributing

Topic2Dataset is open source. We welcome contributions and collaboration from the community! See the project page for ways to contribute.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citing

If you use Topic2Dataset in your research, please cite the following paper:

@article{Topic2Dataset,
  title={Topic2Dataset: AI-Driven Dataset Curation for RAG and Finetuning},
  author={Adithya S K},
  year={2023}
}

Disclaimer

The resources associated with this project, including code, data, and model weights, are restricted to academic research purposes and cannot be used commercially. The content produced by any version of WizardLM is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of the output. This project accepts no legal liability for the content of the model output and assumes no responsibility for any losses incurred through the use of the associated resources and output results.
