Mindly

From document to answer.

A multi-tenant RAG API that turns business document corpora into reliable, isolated and auditable conversational assistants. Built for professional use, secure by default, and designed to stay agnostic to the language model.

FastAPI · LangChain · Qdrant · Redis · Firestore · OpenAI · Docker · ~192 tests

Overview

What the platform does

Mindly vectorizes documents, stores them in a per-owner isolated semantic base, and answers user questions using only the relevant passages it retrieves. The reference use case: labor law in New Caledonia, by sector (banking, port handling, retail).

Indexing

PDF upload, cleaning, chunking and vectorization into a collection dedicated to the user.

Semantic search

For each question, retrieval of the closest passages, ordered by priority then by score.

Streaming answer

Token-by-token generation, strictly grounded in the retrieved context, with conversation history.

Multi-user

Two roles, rh (HR) and agent, with isolated spaces; every base has an owner.

Secure

Rotating JWTs, hashed passwords, rate limiting, audit of configuration changes.

Agnostic

An abstraction layer designed to plug in other LLM providers without touching the rest.

Architecture

Three layers, one direction of dependency

The code follows a strict separation: routers validate and delegate, services hold the business logic, modules wrap the infrastructure (Qdrant, Redis, Firestore, security, RAG). A layer never depends on the one above it.

Request flow

The FastAPI routers are deliberately thin: they declare the route, apply Pydantic validation and role checks via dependencies, then call a service. All the real logic lives in services/, which orchestrates the infrastructure modules.

routers/chat.py (actual excerpt)

@router.post("/chat/stream")
@limiter.limit(config.RATE_LIMIT_CHAT)
async def chat_stream(request, body, background_tasks,
                      current_user = Depends(get_current_user)):
    ctx = await build_chat_context(current_user)   # delegated validation
    # ... streaming + history persistence in a background task
    return StreamingResponse(token_stream(), media_type="text/plain")

The user store shows the same discipline: an abstract UserStore class defines the contract, FirestoreUserStore implements it, and a module-level facade lets you inject a fake store in tests. Refresh token handling is isolated in a dedicated mixin.
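
The store pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: the method name get_user_config, the InMemoryUserStore fake and the set_user_store helper are assumptions; only UserStore, FirestoreUserStore and get_user_config_from_store are named in the source.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class UserStore(ABC):
    """Contract every user store must honour (method name is assumed)."""

    @abstractmethod
    async def get_user_config(self, username: str) -> Optional[dict]:
        ...


class InMemoryUserStore(UserStore):
    """Hypothetical fake, injected in tests instead of FirestoreUserStore."""

    def __init__(self, users: dict) -> None:
        self._users = users

    async def get_user_config(self, username: str) -> Optional[dict]:
        return self._users.get(username)


# Module-level facade: callers import the function, never a concrete class.
_store: Optional[UserStore] = None


def set_user_store(store: UserStore) -> None:
    """Swap the active store (hypothetical helper used for injection)."""
    global _store
    _store = store


async def get_user_config_from_store(username: str) -> Optional[dict]:
    if _store is None:
        raise RuntimeError("no user store configured")
    return await _store.get_user_config(username)


set_user_store(InMemoryUserStore({"alice": {"index_name": "banking"}}))
config = asyncio.run(get_user_config_from_store("alice"))
```

Because services only see the facade function, a test can swap the backing store without patching Firestore.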

Request lifecycle

What happens on `POST /chat/stream`

Before a single token is generated, the request goes through a chain of validations that fail fast and clearly. Each step has its own error code.

Sequential steps

services/chat_service.py: build_chat_context (actual excerpt)

user_config = await get_user_config_from_store(username)
if user_config is None:
    raise HTTPException(403, "Utilisateur non enregistre")

index_name = user_config.get("index_name")
if not index_name:
    raise HTTPException(400, "Aucune base vectorielle configuree")

if current_user.get("role") == "rh" and not await user_owns_vector_base(
        int(current_user["id"]), str(index_name)):
    raise HTTPException(404, "La collection n'existe pas dans Qdrant.")

if not await collection_exists(index_name):
    raise HTTPException(404, "La collection n'existe pas dans Qdrant.")

The answer is streamed as text/plain as it comes. The exchange (question + answer) is saved in a background task, after the stream ends, so it never delays what the client sees.
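
The ordering described above (stream first, persist afterwards) can be illustrated with plain asyncio. This is a sketch under assumptions: fake_llm_tokens, stream_and_collect and save_exchange are hypothetical names, and in the real route the save would presumably be registered on the background_tasks parameter rather than awaited inline.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_llm_tokens() -> AsyncIterator[str]:
    # Stand-in for the model's token stream.
    for token in ["Article ", "12 ", "applies."]:
        yield token


async def stream_and_collect(
    tokens: AsyncIterator[str], collected: List[str]
) -> AsyncIterator[str]:
    # Forward each token immediately while keeping a copy for persistence.
    async for token in tokens:
        collected.append(token)
        yield token


saved: List[Tuple[str, str]] = []


async def save_exchange(question: str, answer: str) -> None:
    # Stand-in for the write to the conversation history.
    saved.append((question, answer))


async def main() -> List[str]:
    collected: List[str] = []
    sent: List[str] = []
    async for chunk in stream_and_collect(fake_llm_tokens(), collected):
        sent.append(chunk)  # what the client receives, token by token
    # The stream is finished: only now is the exchange persisted.
    await save_exchange("Which article applies?", "".join(collected))
    return sent


sent = asyncio.run(main())
```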

RAG pipeline

From upload to the relevant passage

Indexing does more than split text. It normalizes, strips recurring noise (headers, footers, page numbers) and deduplicates before vectorizing.

Ingestion & parsing. Loading the PDF with PyMuPDF, page by page.
Cleaning. Unicode normalization, removal of page numbers and boilerplate repeated across most pages.
Chunking. Recursive splitting with separators suited to legal text (Article, Chapitre), controlled overlap.
Deduplication. Hash of the normalized content to drop identical chunks, and a hash of the whole file to reject a document already present (409).
Vectorization. Embeddings stored in the owner's Qdrant collection.
Retrieval. Similarity search, grouped by document priority then sorted by score.
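
The deduplication step above can be sketched with the standard library. The exact normalization and the hash used for chunks are not given in the source, so NFKC plus lowercasing and SHA-256 are assumptions here; the function names are hypothetical.

```python
import hashlib
import unicodedata
from typing import List, Set


def normalize(text: str) -> str:
    # Assumed normalization: NFKC, lowercase, whitespace collapsed.
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def chunk_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(chunks: List[str]) -> List[str]:
    # Keep the first occurrence of each normalized chunk, preserving order.
    seen: Set[str] = set()
    unique: List[str] = []
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def file_hash(data: bytes) -> str:
    # Whole-file digest, used to reject a document that is already indexed.
    return hashlib.sha256(data).hexdigest()


chunks = [
    "Article 1. Working hours.",
    "article 1.  working hours.",  # same content after normalization
    "Article 2. Leave.",
]
unique = deduplicate(chunks)

known_files = {file_hash(b"%PDF-1.7 first upload")}
is_duplicate_upload = file_hash(b"%PDF-1.7 first upload") in known_files
```

In the API, a true is_duplicate_upload would map to the 409 response mentioned in the list.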

modules/text_preprocessing.py: boilerplate detection (actual excerpt)

# a line seen on >= 60% of pages is treated as repeated noise
min_occurrences = max(3, ceil(len(documents) * BOILERPLATE_RATIO_THRESHOLD))
repeated = {
    line_key
    for line_key, count in page_level_counter.items()
    if count >= min_occurrences
}
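
To make the threshold concrete, here is a self-contained sketch of how the per-page counter might be built and applied. The 0.6 value is taken from the comment in the excerpt; the key normalization (strip plus lowercase) and the function name find_repeated_lines are assumptions.

```python
from collections import Counter
from math import ceil
from typing import List, Set

BOILERPLATE_RATIO_THRESHOLD = 0.6  # the 60% mentioned in the excerpt's comment


def find_repeated_lines(pages: List[str]) -> Set[str]:
    counter: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page, not once per occurrence.
        keys = {line.strip().lower() for line in page.splitlines() if line.strip()}
        counter.update(keys)
    # Never flag a line seen on fewer than 3 pages, however short the document.
    min_occurrences = max(3, ceil(len(pages) * BOILERPLATE_RATIO_THRESHOLD))
    return {key for key, count in counter.items() if count >= min_occurrences}


pages = [
    "Convention collective\nArticle 1 text",
    "Convention collective\nArticle 2 text",
    "Convention collective\nArticle 3 text",
    "Annexe\nArticle 4 text",
    "Convention collective\nArticle 5 text",
]
repeated = find_repeated_lines(pages)
```

With five pages the header appears on four of them and is flagged, while each article line appears once and is kept.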

modules/rag.py: priority-ordered retrieval (actual excerpt)

results = store.similarity_search_with_score(query, k=top_k)
grouped = {}
for doc, score in results:
    prio = doc.metadata.get("priority", float("inf"))
    grouped.setdefault(prio, []).append((doc, score))

ordered = []
for prio in sorted(grouped):                       # priority 1 first
    ordered.extend(sorted(grouped[prio],
                          key=lambda t: t[1], reverse=True))   # descending score
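
The same grouping logic can be run on made-up data to see the resulting order. FakeDoc is a hypothetical stand-in for the LangChain document type, and the sketch assumes that a higher score means a closer match, as the descending sort in the excerpt implies.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def order_by_priority(results: List[Tuple[FakeDoc, float]]) -> List[Tuple[FakeDoc, float]]:
    grouped: dict = {}
    for doc, score in results:
        # Documents without a priority sort last.
        prio = doc.metadata.get("priority", float("inf"))
        grouped.setdefault(prio, []).append((doc, score))
    ordered: List[Tuple[FakeDoc, float]] = []
    for prio in sorted(grouped):  # lowest priority number first
        ordered.extend(sorted(grouped[prio], key=lambda t: t[1], reverse=True))
    return ordered


results = [
    (FakeDoc("general code", {"priority": 2}), 0.91),
    (FakeDoc("sector agreement A", {"priority": 1}), 0.72),
    (FakeDoc("unranked memo"), 0.95),
    (FakeDoc("sector agreement B", {"priority": 1}), 0.80),
]
ordered_texts = [doc.page_content for doc, _ in order_by_priority(results)]
```

Note that the unranked memo has the best score yet comes last: priority always wins over similarity.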

The system prompt forces the model to answer only from the provided context, to cite the article and source when it relies on a specific text, and to admit it has no answer rather than make one up.

Isolation

Every base belongs to someone

Multi-tenancy does not rely on a naming convention but on a Firestore registry: each vector base is tied to an owner_user_id. Every operation on a base checks this ownership.

Roles

An rh creates, lists and manages their own bases and documents. An agent consumes a base assigned to them and cannot change its configuration.

Ownership enforced

Chat, upload, deletion, priorities, base deletion: every path goes through user_owns_vector_base before acting.

Transactional creation

Qdrant creation then Firestore registration, with a rollback of the collection if registration fails.

Inconsistent states handled

If a collection has vanished from Qdrant but still exists in Firestore, deletion cleans up the orphan without error.
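
The create-then-register sequence with rollback can be sketched against in-memory fakes. FakeQdrant, FakeRegistry, RegistryError and create_vector_base are hypothetical names chosen for the illustration; the real code talks to Qdrant and Firestore.

```python
import asyncio
from typing import Set


class RegistryError(Exception):
    """Raised by the fake registry to simulate a failed Firestore write."""


class FakeQdrant:
    def __init__(self) -> None:
        self.collections: Set[str] = set()

    async def create_collection(self, name: str) -> None:
        self.collections.add(name)

    async def delete_collection(self, name: str) -> None:
        self.collections.discard(name)


class FakeRegistry:
    def __init__(self, fail: bool = False) -> None:
        self.records: dict = {}
        self.fail = fail

    async def register(self, name: str, owner_user_id: int) -> None:
        if self.fail:
            raise RegistryError("registration failed")
        self.records[name] = {"owner_user_id": owner_user_id}


async def create_vector_base(qdrant, registry, name: str, owner_user_id: int) -> None:
    await qdrant.create_collection(name)
    try:
        await registry.register(name, owner_user_id)
    except Exception:
        # Roll back so no collection exists without an owner record.
        await qdrant.delete_collection(name)
        raise


async def demo() -> Set[str]:
    qdrant = FakeQdrant()
    await create_vector_base(qdrant, FakeRegistry(), "banking", 7)
    try:
        await create_vector_base(qdrant, FakeRegistry(fail=True), "retail", 7)
    except RegistryError:
        pass  # the caller would turn this into an error response
    return qdrant.collections


remaining = asyncio.run(demo())
```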

modules/vector_base_registry.py (actual excerpt)

async def user_owns_vector_base(user_id: int, base_name: str) -> bool:
    if user_id <= 0:
        return False
    record = await get_vector_base_record(base_name)
    return bool(record and record["owner_user_id"] == user_id)

Security

Hardened where it matters

Authentication, secret storage and error exposure were handled seriously rather than with a minimal JWT bolted on top of the business logic.

Typed JWT

Separate access and refresh, jti/iat/exp claims, type checked on decode. The token's user_id is cross-checked against the stored id.

Refresh rotation

On every login or refresh, old tokens are revoked and expired ones purged. The token is hashed before storage.

Passwords

bcrypt_sha256 to get around bcrypt's 72-byte limit. Legacy plaintext passwords cannot authenticate.

Fail-fast

Refuses to start if the JWT key is under 32 characters or the OpenAI key is missing.

Rate limiting

Per-IP limits on login, refresh, upload and chat, to guard against brute force and overuse.

Audit

Every creation, change or deletion of user config is recorded in a Firestore audit collection.
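
The refresh rotation described above can be sketched in memory. This is an illustration under assumptions: a random opaque string stands in for the refresh JWT, SHA-256 is assumed as the storage hash, and RefreshTokenStore with its rotate and is_valid methods is a hypothetical name, not the project's mixin.

```python
import hashlib
import secrets
import time
from typing import Dict, Optional


def hash_token(token: str) -> str:
    # Only this digest is stored, never the token itself.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class RefreshTokenStore:
    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def rotate(self, user_id: int, ttl_seconds: int, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # Purge expired records.
        self._records = {
            digest: rec for digest, rec in self._records.items()
            if rec["expires_at"] > now
        }
        # Revoke every token previously issued to this user.
        for rec in self._records.values():
            if rec["user_id"] == user_id:
                rec["revoked"] = True
        token = secrets.token_urlsafe(32)
        self._records[hash_token(token)] = {
            "user_id": user_id,
            "expires_at": now + ttl_seconds,
            "revoked": False,
        }
        return token

    def is_valid(self, token: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        rec = self._records.get(hash_token(token))
        return bool(rec and not rec["revoked"] and rec["expires_at"] > now)


store = RefreshTokenStore()
first = store.rotate(user_id=1, ttl_seconds=60, now=0.0)
second = store.rotate(user_id=1, ttl_seconds=60, now=10.0)
```

After the second rotation the first token can no longer be used, which limits the value of a stolen refresh token.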

modules/security.py (actual excerpt)

from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["bcrypt_sha256", "bcrypt"], deprecated="auto")

def hash_password(password: str) -> str:
    # bcrypt_sha256: no silent truncation beyond 72 bytes
    return pwd_context.hash(password, scheme="bcrypt_sha256")

In production, debug mode is off: the /docs and /redoc routes disappear and stack traces are never returned to the client, only logged server-side.

Data stores

Three stores, three roles

Each backing store has a clear responsibility, and blocking calls are systematically offloaded to a thread so the asyncio loop never stalls.
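
One common way to offload a blocking client call is asyncio.to_thread (Python 3.9+). The source does not say which mechanism the project uses, so treat this as a sketch; blocking_lookup and get_record are hypothetical names.

```python
import asyncio
import time
from typing import List


def blocking_lookup(name: str) -> dict:
    # Stands in for a synchronous SDK call (for example a document read).
    time.sleep(0.05)
    return {"name": name}


async def get_record(name: str) -> dict:
    # The blocking call runs in a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_lookup, name)


async def main() -> List[dict]:
    # The three lookups overlap instead of running one after another.
    return await asyncio.gather(*(get_record(n) for n in ["a", "b", "c"]))


records = asyncio.run(main())
```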

Qdrant

Vector base, one collection per base. Similarity search, document priority handling, embedding dimension matched to the model.

Redis

Per-user conversation history, length-bounded with a sliding TTL refreshed on every read or write. Cache of retrieved documents.

Firestore

Users and configuration, hashed refresh tokens, audit log, and the ownership registry for vector bases.
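
The bounded history with a sliding TTL maps naturally onto three Redis list commands: RPUSH, LTRIM and EXPIRE. The sketch below assumes a client exposing async rpush, ltrim, lrange and expire methods, as redis-py's asyncio client does; the key layout and the FakeRedis class are hypothetical and exist only to make the example runnable without a server.

```python
import asyncio
from typing import Dict, List


async def append_history(client, username: str, message: str,
                         max_len: int, ttl_seconds: int) -> None:
    key = f"history:{username}"  # key layout is an assumption
    await client.rpush(key, message)
    await client.ltrim(key, -max_len, -1)  # keep only the newest max_len entries
    await client.expire(key, ttl_seconds)  # sliding TTL: reset on every write


async def read_history(client, username: str, ttl_seconds: int) -> List[str]:
    key = f"history:{username}"
    messages = await client.lrange(key, 0, -1)
    await client.expire(key, ttl_seconds)  # reads refresh the TTL as well
    return messages


class FakeRedis:
    """Tiny in-memory stand-in covering only the calls used above."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}
        self.ttls: Dict[str, int] = {}

    async def rpush(self, key: str, value: str) -> None:
        self.lists.setdefault(key, []).append(value)

    async def ltrim(self, key: str, start: int, end: int) -> None:
        # Handles only the (negative start, end=-1) form used in this sketch.
        self.lists[key] = self.lists.get(key, [])[start:]

    async def lrange(self, key: str, start: int, end: int) -> List[str]:
        return list(self.lists.get(key, []))

    async def expire(self, key: str, seconds: int) -> None:
        self.ttls[key] = seconds


async def demo() -> List[str]:
    client = FakeRedis()
    for i in range(5):
        await append_history(client, "alice", f"m{i}", max_len=3, ttl_seconds=3600)
    return await read_history(client, "alice", ttl_seconds=3600)


history = asyncio.run(demo())
```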

Multi-provider

Model-agnostic by design

The architecture is designed to plug in several LLM providers. The seams exist: a model identifier flows from configuration down to the selection point, registries are ready to be extended, and the RAG layer is isolated behind a single function.

The distinction that matters. Two different axes, often conflated. The embedding fixes the vector dimension and ties the collection: it is fixed per base, otherwise you compare incomparable vectors. Generation, on the other hand, is swappable per request. A genuinely multi-provider design separates these two axes instead of driving them with a single field.
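
One way to keep the two axes apart is to model them as separate types: the embedding spec is stored with the base and never changes, while the generation spec travels with each request. The class names and the check_query_embedding helper are hypothetical, and the model names are examples only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    provider: str
    model: str
    dimension: int


@dataclass(frozen=True)
class GenerationSpec:
    provider: str
    model: str


@dataclass(frozen=True)
class VectorBase:
    name: str
    embedding: EmbeddingSpec  # fixed when the base is created


def check_query_embedding(base: VectorBase, query_embedding: EmbeddingSpec) -> None:
    # Queries must be embedded exactly like the stored vectors.
    if query_embedding != base.embedding:
        raise ValueError(
            f"base {base.name!r} was built with {base.embedding.model}, "
            f"not {query_embedding.model}"
        )


base = VectorBase("banking", EmbeddingSpec("openai", "text-embedding-3-small", 1536))

# Generation can differ from one request to the next on the same base.
request_a = GenerationSpec("openai", "gpt-4o-mini")
request_b = GenerationSpec("ollama", "llama3")

check_query_embedding(base, base.embedding)  # passes silently
```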

Natural targets on the generation side: Vertex AI / Gemini, open-source models via vLLM, Ollama on-prem, Azure OpenAI for enterprise, or any provider compatible with the OpenAI API. The wiring comes down to a factory and widening the accepted model type.

provider factory (proposed target)

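from langchain_google_vertexai import ChatVertexAI
from langchain_ollama import ChatOllama
from langchain_openai import AzureChatOpenAI, ChatOpenAI

VLLM_URL = "http://localhost:8000/v1"  # assumed address of a vLLM OpenAI-compatible server

def get_llm(provider: str, model: str):
    # Lambdas defer construction: only the selected client is instantiated.
    return {
        "openai": lambda: ChatOpenAI(model=model, streaming=True),
        "vertex": lambda: ChatVertexAI(model=model),
        "ollama": lambda: ChatOllama(model=model),
        "vllm":   lambda: ChatOpenAI(model=model, base_url=VLLM_URL),  # openai-compatible
        "azure":  lambda: AzureChatOpenAI(deployment_name=model),
    }[provider]()  # an unknown provider raises KeyError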

Today the pipeline runs on OpenAI; the extension points are in place to host the others without rewriting the business logic.

Deployment

Stack & foundations

Containerized

Docker Compose orchestrates the API, Qdrant, Redis and a Firestore emulator for local work, with an optional UI enabled by profile.

GCP target

Firestore in production, GitLab CI integration, Python. Environment variables centralized in a single configuration module.

Configurable

Token lifetimes, history TTL, rate limits, upload size, maximum lengths: everything is overridable via environment variable.

Seedable

An idempotent seed script creates or updates users and their configuration without touching the Qdrant bases.
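
A centralized configuration module with environment overrides and fail-fast checks could look like the sketch below. Only RATE_LIMIT_CHAT appears in the source; the other variable names, the default values and the load_config function are placeholders chosen for the illustration.

```python
import os
from typing import Mapping, Optional


class ConfigError(RuntimeError):
    """Raised at startup when a mandatory setting is missing or too weak."""


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    env = os.environ if env is None else env

    # Fail fast: refuse to start with a weak JWT key or no OpenAI key.
    jwt_key = env.get("JWT_SECRET_KEY", "")
    if len(jwt_key) < 32:
        raise ConfigError("JWT key must be at least 32 characters")
    if not env.get("OPENAI_API_KEY"):
        raise ConfigError("OpenAI key is missing")

    # Everything else has a default that an environment variable can override.
    return {
        "JWT_SECRET_KEY": jwt_key,
        "RATE_LIMIT_CHAT": env.get("RATE_LIMIT_CHAT", "30/minute"),
        "HISTORY_TTL_SECONDS": int(env.get("HISTORY_TTL_SECONDS", "3600")),
        "MAX_UPLOAD_BYTES": int(env.get("MAX_UPLOAD_BYTES", str(10 * 1024 * 1024))),
    }


example_env = {
    "JWT_SECRET_KEY": "x" * 32,
    "OPENAI_API_KEY": "sk-test",
    "HISTORY_TTL_SECONDS": "600",
}
config = load_config(example_env)
```

Passing the environment as a mapping keeps the function testable without touching the real process environment.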

Core dependencies

The backend's core building blocks and their role in the pipeline.

Package	Role
fastapi	Async API framework: routing, validation, streaming
langchain	RAG orchestration, LCEL chain, prompt templates
langchain-openai	ChatOpenAI, OpenAIEmbeddings
langchain-qdrant	QdrantVectorStore, similarity_search_with_score
qdrant-client	Qdrant client: scroll, set_payload, delete, count
redis	Async Redis client: list ops, scan, TTL
google-cloud-firestore	User store, refresh tokens, audit, ownership
PyMuPDF	PDF text extraction (PyMuPDFLoader)
passlib[bcrypt]	bcrypt_sha256, password hashing
bcrypt	Pinned bcrypt backend (passlib compat)
PyJWT	JWT HS256 creation and decoding
slowapi	Per-IP rate limiting
python-multipart	Multipart file upload
uvicorn	ASGI server, production mode without reload
aiofiles	Async read, SHA-256 hash of the PDF

Tests

Around 192 automated tests

Coverage goes well beyond a smoke test: security, preprocessing, business services, the indexing pipeline, and API integration tests.

Security. Hashing, generation and validation of JWT tokens.
Preprocessing. PDF cleaning, page-number detection, chunk deduplication.
Services. Chat context, base ownership, users, documents, vectorstore.
Indexing. Loader, splitter, deduplication, Qdrant insertion, full upload with temp-file cleanup.
API integration. Login, authentication, password and filename validation.

API reference

API surface

Every protected route expects an Authorization: Bearer <access_token> header. Management routes are restricted to the rh role.

Authentication

POST /login  credentials → token pair

POST /refresh  renews the pair

POST /logout  revokes the current refresh token

Users · rh

GET /list_users

GET /get_user_config

POST /create_new_user

PUT /modify_user_config

DELETE /users/{username}

PUT /users/{username}/config

Chat

POST /chat/stream  streaming answer

GET /chat/retrieved_documents  passages from the last answer

History

GET /show_history

DELETE /clean_history

GET /show_history_length

PUT /modify_history_length

Documents · rh

GET /document_list

POST /upload_document

DELETE /delete_document

PUT /set_new_documents_priority

PUT /reset_document_priority

Vectorstore · rh

GET /get_vectorbase_list

POST /create_new_vector_base

DELETE /delete_vector_base