A proof of concept for creating natural language descriptions of data returned by the ILOSTAT SDMX API using a chat completion model.
Before following either of the setup options below, you'll need to get a token from Hugging Face.
The easiest way to run the app locally is to pull the Docker image from the container registry.
docker pull ghcr.io/justintemps/ilostat-simple-summarizer/ilostat-simple-summarizer:latest
You can then run the container like this:
docker run -p 7860:7860 --env HUGGING_FACE_TOKEN=yourhuggingfacetoken ghcr.io/justintemps/ilostat-simple-summarizer/ilostat-simple-summarizer:latest
The application should be available at http://127.0.0.1:7860
pip install -r requirements.txt
Create a .env file in the project root containing your personal Hugging Face token:
HUGGING_FACE_TOKEN=yourhuggingfacetoken
python3 main.py
The first time you run this command, the app will cache certain metadata from the ILOSTAT SDMX API. This may take a while.
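The metadata fetch can be sketched as a structure query against the ILO SDMX REST service. This is a minimal illustration, not the app's actual caching code: the base URL, query path, and media type are assumptions based on the standard SDMX 2.1 REST conventions.

```python
# Hypothetical base URL for the ILO SDMX REST service.
ILOSTAT_BASE = "https://sdmx.ilo.org/rest"

def dataflow_url(agency: str = "ILO", resource: str = "all", version: str = "latest") -> str:
    """Build a structure query for the dataflow metadata cached on first run."""
    return f"{ILOSTAT_BASE}/dataflow/{agency}/{resource}/{version}"

def fetch_dataflows() -> bytes:
    """Request SDMX-JSON structure metadata (performs a network call)."""
    import urllib.request
    req = urllib.request.Request(
        dataflow_url(),
        # Standard SDMX media type for structure messages in JSON.
        headers={"Accept": "application/vnd.sdmx.structure+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

print(dataflow_url())
```

Because the full dataflow listing is large, caching the response locally (as the app does on first run) avoids repeating this request on every start.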
After that, the application should start at the local URL http://127.0.0.1:7860
- Select a geographic region and an indicator from ILOSTAT.
- Filter the data by selecting dimensions.
- Click the "Update data" button to generate a prompt.
- See the chart for a visual representation of the data.
- Go to the "Prompt" tab to see the generated prompt.
- Go to the "AI Summary" tab and click the "Generate summary" button to generate a summary using a chat completion model.
Current large language models (LLMs) struggle with summarising tabular data. They're trained entirely on unstructured text and have no framework for understanding two-dimensional data structures like tables. They also lack the numerical reasoning skills that are needed to understand the relationships between numbers in a table, from the values themselves to the time periods they represent. This app proposes a solution to this problem by distilling key insights from tabular data into a prompt that can be used to generate a summary using a chat completion model.
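The distillation step can be illustrated with a small sketch: instead of handing the model a raw table, key facts are computed in code and stated as prose. This is a simplified stand-in for the app's own prompt-building logic; the function name, inputs, and wording are illustrative.

```python
def distill(series: list[tuple[int, float]], indicator: str, area: str) -> str:
    """Turn (year, value) pairs into plain-language facts a chat model can summarise."""
    series = sorted(series)  # order observations by year
    (first_year, first), (last_year, last) = series[0], series[-1]
    peak_year, peak = max(series, key=lambda p: p[1])
    direction = "rose" if last > first else "fell" if last < first else "was unchanged"
    return (
        f"{indicator} in {area} {direction} from {first:.1f} in {first_year} "
        f"to {last:.1f} in {last_year}, peaking at {peak:.1f} in {peak_year}."
    )

facts = distill(
    [(2015, 5.2), (2018, 6.1), (2023, 4.8)],
    "The unemployment rate", "Region X",
)
print(facts)
```

Because the numerical reasoning (trend, peak, endpoints) happens in code, the language model only has to rephrase verified facts rather than interpret a two-dimensional table.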
- facebook/bart-large-cnn is used to summarise metadata about dataflows
- Llama-3.3-70B-Instruct is used by default to generate the data descriptions. In future versions, this could be parametrized to experiment with different chat completion models.
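The default summarisation call can be sketched with the huggingface_hub InferenceClient. The app's actual client code may differ; the system prompt is invented for illustration, and the `model` parameter shows how the call could be parametrized as suggested above.

```python
import os

# Default model named in this README; other chat models could be swapped in.
DEFAULT_MODEL = "meta-llama/Llama-3.3-70B-Instruct"

def build_messages(prompt: str) -> list:
    """Wrap the generated prompt in a chat-completion message list."""
    return [
        {"role": "system", "content": "You summarise labour statistics in plain language."},
        {"role": "user", "content": prompt},
    ]

def summarise(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Send the prompt to a hosted chat model (network call; needs a token)."""
    from huggingface_hub import InferenceClient  # installed via requirements.txt
    client = InferenceClient(token=os.environ["HUGGING_FACE_TOKEN"])
    response = client.chat_completion(messages=build_messages(prompt), model=model)
    return response.choices[0].message.content
```

Reading the token from the HUGGING_FACE_TOKEN environment variable matches how it is passed to the Docker container above.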
Text descriptions like the ones generated here can be used to produce accessible summaries of data for people with visual impairments or limited access to visual content. Together with contextual information from news stories or reports, they can also be used to generate data-driven narratives that help people understand complex world-of-work issues.
This is not a production-ready application. It is a proof of concept that demonstrates how chat completion models can be used to summarize data from ILOSTAT, and it is not intended for use in production environments.
- A similar approach could be combined with a Retrieval Augmented Generation (RAG) system to generate static pages for the ILO's website ilo.org
- The same approach could be used to build a chatbot providing an open interface to the ILO's knowledge base, with full access to its statistical resources.
Many thanks to the ILOSTAT team, especially Weichen Lee, for his support with the ILO SDMX API.