A Prototype for a Wikidata Question-Answering System

rti/askwikidata

AskWikidata

A Prototype for a Wikidata Question-Answering System

The askwikidata title image

This system allows users to query Wikidata using natural language questions. The responses contain links to sources. If Wikidata does not provide the information requested, the system refuses to answer.

The system is in an early proof-of-concept state.

Demo

A short demo showing the askwikidata repl responding to a question.

Quickstart

To try it out, open ➡️ this Google Colab Notebook or load AskWikidata_Quickstart.ipynb in your own environment.

Implementation

To answer questions based on Wikidata, the system uses retrieval-augmented generation (RAG). First, it transforms Wikidata items into text and generates embeddings for them. The user query is then embedded with the same model. A nearest neighbor search identifies the Wikidata items most relevant to the query, and a reranker model keeps only the best matches among those neighbors. Finally, these matches are inserted into the LLM prompt so that the LLM can generate an answer grounded in Wikidata knowledge.
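
The flow described above can be sketched in a few lines. This is a toy illustration, not the AskWikidata implementation: word counts stand in for the embedding model, a brute-force cosine search stands in for the nearest neighbor index, and a word-overlap score stands in for the reranker model. All function names here are hypothetical; `retrieval_chunks` and `context_chunks` borrow their names from the configuration shown under "Answer a question".

```python
import math
import re
from collections import Counter

# Tiny stand-in corpus; the real system generates such text from Wikidata items.
DOCS = [
    "Berlin is the capital of Germany.",
    "The Spree is a river that flows through Berlin.",
    "Paris is the capital of France.",
]

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k):
    """Brute-force nearest neighbor search over the embedded documents."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query, candidates, k):
    """Toy reranker: fraction of distinct query words found in each candidate."""
    q_words = set(embed(query))
    def score(doc):
        return len(q_words & set(embed(doc))) / max(len(q_words), 1)
    return sorted(candidates, key=score, reverse=True)[:k]

def build_prompt(query, context):
    """Place the selected context in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )

def rag_prompt(query, docs, retrieval_chunks=2, context_chunks=1):
    """Retrieve a wide candidate set, narrow it with the reranker, build the prompt."""
    candidates = retrieve(query, docs, retrieval_chunks)
    best = rerank(query, candidates, context_chunks)
    return build_prompt(query, best)

if __name__ == "__main__":
    print(rag_prompt("What is the capital of France?", DOCS))
```

Retrieving more candidates than are finally used lets the cheap similarity search cast a wide net, while the more expensive reranker decides what actually reaches the prompt.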

All models, including the LLM, can run on the local machine using PyTorch and bitsandbytes quantization. For nearest neighbor search, an Annoy index is used.

Usage

Install dependencies

Nix

On Nix, the dev shell provides all required dependencies.

nix develop .

Pip

Alternatively, install the Python requirements using pip.

pip install -r requirements.txt

Unpack provided caches

For faster execution, the results of some pre-computation steps are cached. In order to use those caches, unpack them:

bunzip2 --keep --force *.json.bz2
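
On systems without `bunzip2`, the same step can be approximated with the Python standard library. The sketch below is an assumption about what is needed, not part of the repository: the helper name `unpack_caches` is hypothetical, and it mirrors `--keep` by leaving the archives in place and `--force` by overwriting existing output files.

```python
import bz2
import shutil
from pathlib import Path

def unpack_caches(directory="."):
    """Decompress every *.json.bz2 in `directory`, keeping the archives.

    Existing .json files with the same name are overwritten.
    Returns the list of files that were written.
    """
    unpacked = []
    for archive in sorted(Path(directory).glob("*.json.bz2")):
        target = archive.with_suffix("")  # drops only the trailing ".bz2"
        with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        unpacked.append(target)
    return unpacked

if __name__ == "__main__":
    for path in unpack_caches("."):
        print(f"unpacked {path}")
```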

Generate dataset

Generate text representations for Wikidata items. The list of items to use is currently hardcoded in text_representation.py.
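
To give an idea of what a text representation might look like, here is a minimal sketch. It does not reproduce the format produced by text_representation.py: the simplified input dictionary, the sentence templates, and the function name `item_to_text` are all assumptions made for illustration. Only the item ID Q64 (Berlin) and the Wikidata URL pattern are real.

```python
def item_to_text(item):
    """Render a simplified item dict as one line per statement, plus a source link."""
    label = item["label"]
    lines = []
    if item.get("description"):
        lines.append(f"{label}: {item['description']}.")
    for prop, values in item["statements"].items():
        lines.append(f"{label} {prop}: {', '.join(values)}.")
    # The source link lets an answer point back to the Wikidata item.
    lines.append(f"Source: https://www.wikidata.org/wiki/{item['id']}")
    return "\n".join(lines)

# Hand-written, simplified example; real Wikidata JSON is considerably more nested.
example_item = {
    "id": "Q64",
    "label": "Berlin",
    "description": "capital of Germany",
    "statements": {
        "country": ["Germany"],
        "capital of": ["Germany"],
    },
}

if __name__ == "__main__":
    print(item_to_text(example_item))
```

Keeping the item URL in the generated text is one simple way to let responses link to their sources.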

python text_representation.py

Answer a question

This Python code uses AskWikidata to answer a single question.

from askwikidata import AskWikidata

config = {
    "chunk_size": 1280,
    "chunk_overlap": 0,
    "index_trees": 1024,
    "retrieval_chunks": 16,
    "context_chunks": 5,
    "embedding_model_name": "BAAI/bge-small-en-v1.5",
    "reranker_model_name": "BAAI/bge-reranker-base",
    "qa_model_url": "Qwen/Qwen2.5-3B-Instruct",
}

askwikidata = AskWikidata(**config)
askwikidata.setup()
print(askwikidata.ask("Who is the current mayor of Berlin? And since when have they been serving?"))
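
The `chunk_size` and `chunk_overlap` settings suggest that the generated text is split into windows before embedding. The sketch below shows one common way such parameters are interpreted, assuming they count characters; this README does not say whether AskWikidata counts characters or tokens, and `chunk_text` is a hypothetical helper, not a function from the repository.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With `chunk_overlap` set to 0, as in the configuration above, the windows simply follow one another without shared text.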

Interactive REPL

A simple interactive read-eval-print loop (REPL) can be used to ask questions.

python repl.py

Run evaluation

A script to evaluate the performance of different configurations is provided.

python eval.py

Configure API Keys

If you do not want to use a local LLM, AskWikidata can access the Hugging Face LLM API. Configure your Hugging Face API key in the HUGGINGFACE_API_KEY environment variable.
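
A script that relies on the remote API can check for the key up front and fail with a clear message. This is a minimal sketch: only the variable name HUGGINGFACE_API_KEY comes from this README, while the helper `get_huggingface_api_key` is hypothetical and not part of AskWikidata.

```python
import os

def get_huggingface_api_key(env=None):
    """Return the API key from the environment, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("HUGGINGFACE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "HUGGINGFACE_API_KEY is not set; export it before using the remote LLM API."
        )
    return key

if __name__ == "__main__":
    try:
        get_huggingface_api_key()
        print("API key found.")
    except RuntimeError as err:
        print(err)
```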

Run tests

To execute the unit test suite, run:

$ python -m unittest

To get a coverage report, run:

$ coverage run -m unittest
$ coverage report --omit="test_*,/nix/*" --show-missing
