LLM Fine-Tuning

This repository was created to explore fine-tuning models using OpenAI and LLama V2. While that functionality still exists in the code, it is no longer the primary focus. Instead, the repo has shifted towards automating the generation of topic modeling and vector embeddings for data collected by the scraper and stored in an Elasticsearch index.

The original fine-tuning scripts remain in place for potential future use, but the current effort centers on leveraging document summaries for efficient topic extraction and embedding generation.

Libraries Used

The current workflows involve:

Cron Jobs for automation.
Elasticsearch for document storage and querying.
GPT (via OpenAI API) for generating topic models.
SentenceTransformer for generating vector embeddings.

The legacy fine-tuning scripts still support the following libraries:

OpenAI
LLama V2

Current Functionality

The repository now focuses on automating topic modeling and vector embedding based on document summaries. Below are the details of the current functionality:

Overview

Utilizing data collected by the scraper and stored in an Elasticsearch index, this repository runs several nightly cron jobs to automate topic modeling and vector embeddings.

Current Cron Jobs

Daily Topic Modeling Generation (source)
- Queries Elasticsearch for documents without topic modeling across specified sources. It generates primary and secondary topics for each document using GPT and a predefined list of Bitcoin-related topics. The results are stored in CSV files.
Daily Push Topic Modeling to Elasticsearch (source)
- Reads the generated topic modeling CSV files and updates the corresponding documents in Elasticsearch with their primary and secondary topics.
Daily Update Vector Embeddings (source)
- Queries Elasticsearch for documents without vector embeddings across specified sources. It generates vector embeddings using the document's title and summary (or body if summary is unavailable) with SentenceTransformer (intfloat/e5-large-v2). These embeddings are then updated in Elasticsearch.
- Note: This uses an open-source model, so there's no cost for generating embeddings. The embedding size limit is 1024.

Legacy Fine-Tuning Code

For those interested, the original fine-tuning scripts can still be found under their respective directories (openai-finetune and llamav2-finetune). These are not actively maintained but remain available for future experimentation.

Getting Started

To use this repository, you can clone it to your local machine or download it as a zip file. Once you've done that, refer to the scripts or cron jobs relevant to your use case, and follow the instructions in the respective files.

Name		Name	Last commit message	Last commit date
Latest commit History 448 Commits
.github/workflows		.github/workflows
gpt_output		gpt_output
llamav2-finetune		llamav2-finetune
openai-finetune		openai-finetune
src		src
test		test
.gitignore		.gitignore
README.md		README.md
btc_topics.csv		btc_topics.csv
drop_duplicates_from_es.py		drop_duplicates_from_es.py
generate_topic_modeling_csv.py		generate_topic_modeling_csv.py
push_topic_modeling_to_es.py		push_topic_modeling_to_es.py
query_es_embedding_vectors.py		query_es_embedding_vectors.py
requirements.txt		requirements.txt
topic_modeling_using_es.py		topic_modeling_using_es.py
update_vector_embedding_to_es.py		update_vector_embedding_to_es.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Fine-Tuning

Libraries Used

Current Functionality

Overview

Current Cron Jobs

Legacy Fine-Tuning Code

Getting Started

About

Releases

Packages

Contributors 2

Languages

bitcoinsearch/llm-fine-tuning

Folders and files

Latest commit

History

Repository files navigation

LLM Fine-Tuning

Libraries Used

Current Functionality

Overview

Current Cron Jobs

Legacy Fine-Tuning Code

Getting Started

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages