02.02.2026

Custom RAG Solution

An Interim Report from Practice

Many organizations manage a large number of PDF documents: fact sheets, guidelines, regulations, directives, laws, and standards. These documents are binding—for both internal staff and external partners—but are often difficult to access in everyday work. This is exactly where a current client project by mp technology comes in.

Initial Situation and Objectives

Our client wanted a chatbot that:

  • provides well-founded, reliable answers
  • explicitly references relevant documents and chapters
  • can be used by internal employees as well as external users

The challenge: the underlying PDF documents do not share a uniform structure, although most of them are highly structured texts with clear hierarchies, tables, and references.

Semantic Chunking Instead of Simple Text Splitting

To make content usable for Retrieval-Augmented Generation (RAG), documents must be broken down into smaller units—so-called chunks. Instead of purely length-based chunking with overlaps, we rely on custom semantic chunking that preserves the original document structure as much as possible.

Specifically, content is modeled as a tree structure with up to four levels:

  1. Document
  2. Chapter
  3. Subchapter
  4. Sub-subchapter

Each chunk is aware of its structural context, which can be explicitly used when answering questions. Because each chunk covers a self-contained topic, vector similarity search on embeddings works far more accurately than with overlapping text fragments.

In addition, documents are assigned to thematic categories. A single document can belong to multiple topics at the same time.
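The tree model and the category assignment described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name Node, the helper to_chunks, and all field names are assumptions; only the four level names come from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The four levels named in the text; every other name here is hypothetical.
LEVELS = ("document", "chapter", "subchapter", "sub-subchapter")


@dataclass(eq=False)
class Node:
    title: str
    text: str = ""
    categories: set[str] = field(default_factory=set)  # set on the document root
    children: list[Node] = field(default_factory=list)
    parent: Node | None = field(default=None, repr=False)

    def depth(self) -> int:
        return 0 if self.parent is None else self.parent.depth() + 1

    def add(self, child: Node) -> Node:
        """Attach a child one level below this node and return it."""
        if self.depth() + 1 >= len(LEVELS):
            raise ValueError("the tree is limited to four levels")
        child.parent = self
        self.children.append(child)
        return child

    def root(self) -> Node:
        return self if self.parent is None else self.parent.root()

    def path(self) -> list[str]:
        """Titles from the document root down to this node."""
        if self.parent is None:
            return [self.title]
        return self.parent.path() + [self.title]


def to_chunks(node: Node) -> list[dict]:
    """Flatten the tree into chunks that carry their structural context."""
    chunks = []
    if node.text.strip():
        chunks.append({
            "level": LEVELS[node.depth()],
            "path": " > ".join(node.path()),
            "categories": sorted(node.root().categories),
            "text": node.text,
        })
    for child in node.children:
        chunks.extend(to_chunks(child))
    return chunks
```

In such a design, the "path" string is what a chatbot could cite as document and chapter in its answer, and "categories" allows a document to be found under several topics at once.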

Demo screenshot. For this, public documents from the bfu (Beratungsstelle für Unfallverhütung) were used instead of our customer’s documents.

Transparency During Ingestion Is Critical

During development, one thing became clear very quickly:
Without transparency in the PDF conversion and chunking process, quality cannot be controlled.

Only when it is visible

  • where structure is lost
  • whether tables are processed correctly
  • whether document hierarchies are detected accurately

can systematic weaknesses be identified.

Errors introduced during ingestion inevitably lead to incorrect or incomplete answers later on.
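One way to make such weaknesses visible is a small report that runs over the converter output. The sketch below assumes the converter emits Markdown; the function name ingestion_report and the two checks (skipped heading levels, table rows with a deviating cell count) are illustrative assumptions rather than the checks used in the project.

```python
import re

# Matches Markdown ATX headings such as "## 2.1 Helmets".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _cells(row: str) -> int:
    """Number of cells in a Markdown table row such as '| a | b |'.

    Simplification: escaped pipes inside a cell are counted as separators.
    """
    return len(row.strip().strip("|").split("|"))


def ingestion_report(markdown: str) -> list[str]:
    """Return human-readable warnings about structure that looks damaged."""
    warnings = []
    previous_level = 0
    table_width = None  # cell count of the current table, None outside a table

    for number, line in enumerate(markdown.splitlines(), start=1):
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            # A jump of more than one level hints at a lost intermediate heading.
            if level > previous_level + 1:
                warnings.append(
                    f"line {number}: heading level jumps from {previous_level} to {level}"
                )
            previous_level = level

        if line.lstrip().startswith("|"):
            width = _cells(line)
            if table_width is None:
                table_width = width
            elif width != table_width:
                warnings.append(
                    f"line {number}: table row has {width} cells, expected {table_width}"
                )
        else:
            table_width = None

    return warnings
```

A report like this does not prove that a conversion is correct, but it points a reviewer to the places where hierarchy or table structure was most likely lost.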

PDF Conversion: No Tool Is Perfect

No conversion library is reliable in all cases—especially when it comes to tabular content. During the pilot phase, we evaluated Docling, an open-source component from IBM. While the initial results were promising, closer inspection revealed significant shortcomings in the conversion of table-based structures.

We are currently using Marker via the Datalab API, which delivers more robust results for our use case. That said, it is equally clear that this solution is not perfect either.

Our approach therefore consists of two steps:

  • automated correction of systematic issues
  • manual cleanup of isolated errors

Only after both steps is a document admitted into the curated corpus, in the best quality we can achieve.
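The two steps could be wired together roughly as follows. The rules and patterns are hypothetical examples of recurring conversion issues, not the project's actual rule set: automated rules rewrite the text, and anything that cannot be fixed safely is recorded for manual review.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Step 1: hypothetical rules for issues that recur across documents.
RULES = [
    # Rejoin words split across a line break; this can also merge genuine
    # hyphenated compounds, so a real rule set would need to be more careful.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),  # strip trailing whitespace
    (re.compile(r"\n{3,}"), "\n\n"),             # collapse runs of blank lines
]

# Step 2: hypothetical patterns that a person should look at.
SUSPICIOUS = [
    (re.compile("\ufffd"), "replacement character from a failed decode"),
    (re.compile(r"\|\s*\|\s*\|"), "table row with consecutive empty cells"),
]


@dataclass
class CleanupResult:
    text: str
    review_notes: list[str] = field(default_factory=list)

    @property
    def ready_for_corpus(self) -> bool:
        """True only when nothing is left for manual cleanup."""
        return not self.review_notes


def clean(text: str) -> CleanupResult:
    """Apply the automated rules, then list lines that need manual review."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    notes = [
        f"line {number}: {reason}"
        for number, line in enumerate(text.splitlines(), start=1)
        for pattern, reason in SUSPICIOUS
        if pattern.search(line)
    ]
    return CleanupResult(text, notes)
```

A document would be added to the corpus only once ready_for_corpus is true, that is, after a person has resolved every review note and the check has been run again.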

Modularity as a Strategic Decision

Whether it is the document converter, the embedding model, or the LLM, the pace of innovation is extremely high. We expect better components to become available within months.

That is why our architecture is strictly modular. Individual components can be replaced with minimal effort, without reworking the entire system.
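In Python, this kind of modularity can be expressed with small interfaces that the rest of the system depends on. The following sketch is an assumption about how such boundaries might look; the names Converter, Embedder, ChatModel, and Pipeline are hypothetical, and the paragraph split is only a placeholder for the semantic chunking described earlier.

```python
from __future__ import annotations

from typing import Protocol


class Converter(Protocol):
    def to_markdown(self, pdf_bytes: bytes) -> str: ...


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class Pipeline:
    """Depends only on the three interfaces, never on a concrete vendor."""

    def __init__(self, converter: Converter, embedder: Embedder, chat_model: ChatModel):
        self.converter = converter
        self.embedder = embedder
        self.chat_model = chat_model

    def ingest(self, pdf_bytes: bytes) -> list[tuple[str, list[float]]]:
        """Convert a PDF and return (chunk text, embedding) pairs."""
        markdown = self.converter.to_markdown(pdf_bytes)
        # Placeholder split on blank lines, standing in for real chunking.
        chunks = [part.strip() for part in markdown.split("\n\n") if part.strip()]
        return list(zip(chunks, self.embedder.embed(chunks)))

    def answer(self, question: str, context: list[str]) -> str:
        """Ask the chat model a question with retrieved chunks as context."""
        prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
        return self.chat_model.complete(prompt)
```

Replacing a component then means passing a different object to Pipeline; any class with the matching method satisfies the interface, without inheriting from it.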

Technology Stack

The current solution is based on a proven open-source stack:

  • FastAPI for the API layer
  • LangGraph for orchestration and control flows
  • PostgreSQL with pgvector for vector search
  • Redis for caching and state management
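To illustrate how vector search with pgvector might be combined with the thematic categories, the sketch below builds a parameterized similarity query. The table chunks and its columns (id, heading_path, content, embedding, categories) are assumptions made for this example; the operator <=> is pgvector's cosine distance and && is PostgreSQL's array overlap operator.

```python
from __future__ import annotations


def build_search_query(
    query_embedding: list[float],
    categories: list[str],
    k: int = 5,
) -> tuple[str, dict]:
    """Return SQL and parameters for a category-filtered similarity search.

    Assumes a hypothetical table "chunks" with a pgvector column "embedding"
    and a text[] column "categories".
    """
    sql = """
        SELECT id, heading_path, content,
               embedding <=> %(query)s::vector AS distance
        FROM chunks
        WHERE categories && %(categories)s::text[]
        ORDER BY embedding <=> %(query)s::vector
        LIMIT %(k)s
    """
    params = {
        # pgvector accepts vectors in the text form "[0.1,0.2,...]".
        "query": "[" + ",".join(repr(float(x)) for x in query_embedding) + "]",
        "categories": categories,
        "k": k,
    }
    return sql, params
```

With a driver such as psycopg, the pair can be passed to cursor.execute(sql, params); the function itself needs no database connection, which keeps it easy to test.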

Conclusion

A powerful RAG solution does not start with the chatbot—it starts with clean, well-structured knowledge ingestion. Structure, transparency, and curation are essential to delivering reliable answers.

This interim report shows that the effort pays off and lays the foundation for a system that can grow with future requirements.
