
Use NLLB 200 for translations #80

Merged · merged 1 commit on Nov 23, 2024
Conversation

svenseeberg (Member) commented Nov 22, 2024

Replace the LLM translations with the NLLB-200 3.3B model.

Fix #50
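
For illustration, a minimal sketch of what translating with NLLB-200 3.3B via Hugging Face transformers could look like; the example sentence and the FLORES-200 language codes (deu_Latn, eng_Latn) are assumptions for this sketch, not code from this PR:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Load the NLLB-200 3.3B checkpoint; src_lang selects the source language token.
    tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="deu_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

    inputs = tokenizer("Wie komme ich zum Bahnhof?", return_tensors="pt")

    # Force the first generated token to the target language code.
    translated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    )
    print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])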

@svenseeberg force-pushed the feature/nllb-3b-translations branch 2 times, most recently from 69e4ab8 to 8a63d8b on November 22, 2024 20:27
svenseeberg (Member, Author) commented Nov 22, 2024

We can use chunking to work around the token limit:

def split_text(text, max_length=500):
    """Split text into chunks of at most max_length characters on sentence boundaries."""
    sentences = text.split('.')

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Skip empty fragments, e.g. after a trailing period.
        if not sentence.strip():
            continue
        sentence = sentence.strip() + "."
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += sentence + " "
        else:
            # Close the current chunk (if any) and start a new one.
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
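
A rough sketch of how the chunks could then be fed through an NLLB-200 translation pipeline; the pipeline setup, language codes, and max_length value below are assumptions for illustration, not part of this PR:

    from transformers import pipeline

    # Sketch: NLLB-200 3.3B as a translation pipeline (FLORES-200 language codes).
    translator = pipeline(
        "translation",
        model="facebook/nllb-200-3.3B",
        src_lang="deu_Latn",
        tgt_lang="eng_Latn",
    )

    def translate_text(text):
        # Translate each chunk separately to stay under the token limit,
        # then join the translated chunks back together.
        chunks = split_text(text, max_length=500)
        translated = [
            translator(chunk, max_length=512)[0]["translation_text"]
            for chunk in chunks
        ]
        return " ".join(translated)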


Successfully merging this pull request may close these issues: Evaluate Translation model performance