Custom RAG Solution

An Interim Report from Practice

Many organizations manage a large number of PDF documents: fact sheets, guidelines, regulations, directives, laws, and standards. These documents are binding for internal staff and external partners alike, yet they are often hard to access in everyday work. This is where a current client project by mp technology comes in.

Initial Situation and Objectives

Our client wanted a chatbot that:

provides well-founded, reliable answers
explicitly references relevant documents and chapters
can be used by internal employees as well as external users

The challenge: the underlying PDF documents do not share a uniform structure, although most of them are highly structured texts with clear hierarchies, tables, and references.

Semantic Chunking Instead of Simple Text Splitting

To make content usable for Retrieval-Augmented Generation (RAG), documents must be broken down into smaller units, known as chunks. Instead of purely length-based chunking with overlaps, we rely on custom semantic chunking that preserves the original document structure as far as possible.

Specifically, content is modeled as a tree structure with up to four levels:

Document
Chapter
Subchapter
Sub-subchapter

Each chunk is aware of its structural context, which can be used explicitly when answering questions. Because each chunk covers a self-contained topic, vector similarity search on the embeddings tends to return more precise matches than it does with overlapping fixed-length fragments.
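The tree model described above can be illustrated with a minimal sketch. The class and function names (Node, Chunk, to_chunks, embedding_input) are hypothetical and not taken from the project, and prefixing the heading path to the text before embedding is our assumption about one way to carry the structural context along.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the tree: document, chapter, subchapter, or sub-subchapter."""
    title: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


@dataclass(frozen=True)
class Chunk:
    path: tuple[str, ...]  # titles from the document down to this node
    text: str


def to_chunks(node: Node, parents: tuple[str, ...] = ()) -> list[Chunk]:
    """Walk the tree depth-first and emit one chunk per node that has body text."""
    path = parents + (node.title,)
    chunks = [Chunk(path, node.text)] if node.text.strip() else []
    for child in node.children:
        chunks.extend(to_chunks(child, path))
    return chunks


def embedding_input(chunk: Chunk) -> str:
    """Prefix the text with its heading path (an assumption, see lead-in)."""
    return " > ".join(chunk.path) + "\n\n" + chunk.text
```

Because every chunk stores its full path, an answer can cite the document and chapter a passage came from without a second lookup.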

In addition, documents are assigned to thematic categories. A single document can belong to multiple topics at the same time.

Demo screenshot, produced with public documents from the bfu (Beratungsstelle für Unfallverhütung) instead of our client’s documents.

Transparency During Ingestion Is Critical

During development, one thing became clear very quickly:
Without transparency in the PDF conversion and chunking process, quality cannot be controlled.

Only when it is visible

where structure is lost
whether tables are processed correctly
whether document hierarchies are detected accurately

can systematic weaknesses be identified.

Errors introduced during ingestion inevitably lead to incorrect or incomplete answers later on.
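One way to make these weaknesses visible is a small report generated for every converted document. The following sketch assumes the converter emits Markdown. The names IngestionReport and inspect_markdown are hypothetical, and the two checks (skipped heading levels, table rows with a deviating column count) are only examples of what such a report might contain.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class IngestionReport:
    headings_per_level: dict[int, int] = field(default_factory=dict)
    level_jumps: list[str] = field(default_factory=list)  # headings that skip a level
    ragged_table_lines: list[int] = field(default_factory=list)  # 1-based line numbers


def inspect_markdown(markdown: str) -> IngestionReport:
    report = IngestionReport()
    previous_level = 0
    table_columns = None  # column count of the table currently being read
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            report.headings_per_level[level] = report.headings_per_level.get(level, 0) + 1
            if level > previous_level + 1:
                report.level_jumps.append(match.group(2))
            previous_level = level
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            # Naive column count; escaped pipes inside cells are not handled.
            columns = stripped.count("|") - 1
            if table_columns is None:
                table_columns = columns
            elif columns != table_columns:
                report.ragged_table_lines.append(number)
        else:
            table_columns = None
    return report
```

A report like this does not repair anything. It only shows where a human or an automated rule should look.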

PDF Conversion: No Tool Is Perfect

No conversion library is reliable in all cases—especially when it comes to tabular content. During the pilot phase, we evaluated Docling, an open-source component from IBM. While the initial results were promising, closer inspection revealed significant shortcomings in the conversion of table-based structures.

We are currently using Marker via the Datalab API, which delivers more robust results for our use case. That said, it is equally clear that this solution is not perfect either.

Our approach therefore consists of two steps:

automated correction of systematic issues
manual cleanup of isolated errors

Only then is a document admitted to the curated corpus, in the best quality we can achieve.
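The two steps could be organized roughly as follows. This is a sketch under stated assumptions, not our production code: the correction rules and review checks are invented examples, and all function names are hypothetical.

```python
import re
from typing import Callable

Correction = Callable[[str], str]


def join_hyphenated_line_breaks(text: str) -> str:
    """Rejoin words split by a hyphen at a line end (also joins real compounds)."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def drop_page_number_lines(text: str) -> str:
    """Remove lines that consist only of a page marker such as 'Page 3 of 12'."""
    return re.sub(r"(?m)^Page \d+ of \d+\n?", "", text)


# Step 1: rules for systematic issues, applied in order to every document.
CORRECTIONS: list[Correction] = [join_hyphenated_line_breaks, drop_page_number_lines]


def auto_correct(text: str) -> str:
    for correction in CORRECTIONS:
        text = correction(text)
    return text


def needs_manual_review(text: str) -> list[str]:
    """Step 2: return reasons why a person should check the document by hand."""
    reasons = []
    if "\ufffd" in text:
        reasons.append("replacement characters from a failed decoding")
    if re.search(r"(?m)^#+\s*$", text):
        reasons.append("heading without a title")
    return reasons
```

Keeping the rules in a plain list means a newly discovered systematic issue becomes one more function rather than a change to the pipeline itself.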

Modularity as a Strategic Decision

Whether it is the document converter, the embedding model, or the LLM, the pace of innovation is extremely high. We expect better components to emerge within months.

That is why our architecture is strictly modular. Individual components can be replaced with minimal effort, without reworking the entire system.
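In Python, this kind of replaceability can be expressed with small interfaces that the rest of the system depends on. The sketch below uses typing.Protocol. The interface names, method signatures, and stand-in implementations are hypothetical and only illustrate the idea.

```python
from typing import Protocol, Sequence


class Converter(Protocol):
    def to_markdown(self, pdf_bytes: bytes) -> str: ...


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class IngestionPipeline:
    """Depends only on the protocols, so each component can be swapped on its own."""

    def __init__(self, converter: Converter, embedder: Embedder) -> None:
        self.converter = converter
        self.embedder = embedder

    def ingest(self, pdf_bytes: bytes) -> list[tuple[str, list[float]]]:
        markdown = self.converter.to_markdown(pdf_bytes)
        # Placeholder split on blank lines; structure-aware chunking would go here.
        chunks = [part.strip() for part in markdown.split("\n\n") if part.strip()]
        return list(zip(chunks, self.embedder.embed(chunks)))


# Stand-in implementations for illustration and tests only.
class FakeConverter:
    def to_markdown(self, pdf_bytes: bytes) -> str:
        return pdf_bytes.decode("utf-8")


class LengthEmbedder:
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]
```

Replacing one converter with another then means writing one new class with a to_markdown method, while IngestionPipeline stays unchanged.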

Technology Stack

The current solution is based on a proven open-source stack:

FastAPI for the API layer
LangGraph for orchestration and control flows
PostgreSQL with pgvector for vector search
Redis for caching and state management
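To give an impression of the vector search part, here is a minimal sketch of a nearest-neighbor query using pgvector's cosine distance operator (<=>). The table and column names are hypothetical and do not reflect our actual schema. The function only builds the query and its parameters, so it runs without a database.

```python
from typing import Sequence

# Hypothetical schema: chunks(id, heading_path, content, embedding vector(n)).
TOP_K_SQL = """
SELECT id, heading_path, content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a sequence as pgvector's text representation, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(value)) for value in embedding) + "]"


def top_k_query(embedding: Sequence[float], k: int = 5) -> tuple[str, tuple[str, int]]:
    """Return the SQL and parameters for the k chunks closest to the embedding."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return TOP_K_SQL, (to_vector_literal(embedding), k)
```

With a driver that uses %s placeholders, such as psycopg, the result could be passed on as cursor.execute(sql, params).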

Conclusion

A powerful RAG solution does not start with the chatbot—it starts with clean, well-structured knowledge ingestion. Structure, transparency, and curation are essential to delivering reliable answers.

This interim report shows that the effort pays off and lays the foundation for a system that can grow with future requirements.
