Mindly
A multi-tenant RAG API that turns business document corpora into reliable, isolated and auditable conversational assistants. Built for professional use, secure by default, and designed to stay agnostic to the language model.
What the platform does
Mindly vectorizes documents, stores them in a per-owner isolated semantic base, and answers user questions using only the relevant passages it retrieves. The reference use case: labor law in New Caledonia, by sector (banking, port handling, retail).
Indexing
PDF upload, cleaning, chunking and vectorization into a collection dedicated to the user.
Semantic search
For each question, retrieval of the closest passages, ordered by priority then by score.
Streaming answer
Token-by-token generation, strictly grounded in the retrieved context, with conversation history.
Multi-user
Two roles, rh (HR) and agent, with isolated spaces; every base has an owner.
Secure
Rotating JWTs, hashed passwords, rate limiting, audit of configuration changes.
Agnostic
An abstraction layer designed to plug in other LLM providers without touching the rest.
Three layers, one direction of dependency
The code follows a strict separation: routers validate and delegate, services hold the business logic, modules wrap the infrastructure (Qdrant, Redis, Firestore, security, RAG). A layer never depends on the one above it.
The FastAPI routers are deliberately thin: they declare the route, apply Pydantic validation and role checks via dependencies, then call a service. All the real logic lives in services/, which orchestrates the infrastructure modules.
    @router.post("/chat/stream")
    @limiter.limit(config.RATE_LIMIT_CHAT)
    async def chat_stream(request, body, background_tasks,
                          current_user=Depends(get_current_user)):
        ctx = await build_chat_context(current_user)  # delegated validation
        # ... streaming + history persistence in a background task
        return StreamingResponse(token_stream(), media_type="text/plain")
The user store shows the same discipline: an abstract UserStore class defines the contract, FirestoreUserStore implements it, and a module-level facade lets you inject a fake store in tests. Refresh token handling is isolated in a dedicated mixin.
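The store pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the method name `get_user_config`, the `InMemoryUserStore` fake and the `set_user_store` injection helper are assumptions; only `UserStore` and `get_user_config_from_store` are names taken from this document.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class UserStore(ABC):
    """Contract that every user store must satisfy."""

    @abstractmethod
    async def get_user_config(self, username: str) -> Optional[dict]:
        """Return the stored configuration, or None if the user is unknown."""


class InMemoryUserStore(UserStore):
    """Hypothetical fake for tests; a Firestore-backed class would
    implement the same method against the real database."""

    def __init__(self, users: dict):
        self._users = users

    async def get_user_config(self, username: str) -> Optional[dict]:
        return self._users.get(username)


# Module-level facade: callers import the function, tests swap the store.
_store: Optional[UserStore] = None


def set_user_store(store: UserStore) -> None:
    """Inject the store used by the facade (assumed helper name)."""
    global _store
    _store = store


async def get_user_config_from_store(username: str) -> Optional[dict]:
    if _store is None:
        raise RuntimeError("no user store configured")
    return await _store.get_user_config(username)
```

In a test, `set_user_store(InMemoryUserStore({...}))` is enough to exercise the services without a Firestore emulator.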
What happens on POST /chat/stream
Before a single token is generated, the request goes through a chain of validations that fail fast and clearly. Each failure maps to an explicit HTTP status code.
    user_config = await get_user_config_from_store(username)
    if user_config is None:
        raise HTTPException(403, "Utilisateur non enregistre")  # user not registered

    index_name = user_config.get("index_name")
    if not index_name:
        raise HTTPException(400, "Aucune base vectorielle configuree")  # no vector base configured

    # an rh user must own the base being queried
    if current_user.get("role") == "rh" and not await user_owns_vector_base(
        int(current_user["id"]), str(index_name)
    ):
        raise HTTPException(404, "La collection n'existe pas dans Qdrant.")  # collection not found

    if not await collection_exists(index_name):
        raise HTTPException(404, "La collection n'existe pas dans Qdrant.")
The answer is streamed as text/plain as it comes. The exchange (question + answer) is saved in a background task, after the stream ends, so it never delays what the client sees.
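The ordering described above (stream first, persist afterwards) can be illustrated without a web framework. In this sketch the token source, `stream_and_collect` and `handle_chat` are hypothetical stand-ins; in the real route the persistence step would be scheduled as a FastAPI background task rather than awaited inline.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_source() -> AsyncIterator[str]:
    """Stand-in for the model's token stream (hypothetical)."""
    for token in ["Notice ", "period: ", "one ", "month."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield token


async def stream_and_collect(
    source: AsyncIterator[str], sink: List[str]
) -> AsyncIterator[str]:
    """Forward each token immediately while keeping a copy for later."""
    async for token in source:
        sink.append(token)
        yield token


async def handle_chat(question: str, save_exchange) -> str:
    collected: List[str] = []
    delivered: List[str] = []
    async for token in stream_and_collect(fake_token_source(), collected):
        delivered.append(token)  # the client would receive this right away
    # Persistence happens only once the stream is exhausted.
    await save_exchange(question, "".join(collected))
    return "".join(delivered)
```

Because the answer is accumulated while it is forwarded, nothing has to be regenerated or re-read to save the exchange.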
From upload to the relevant passage
Indexing does more than split text. It normalizes, strips recurring noise (headers, footers, page numbers) and deduplicates before vectorizing.
- Ingestion & parsing. Loading the PDF with PyMuPDF, page by page.
- Cleaning. Unicode normalization, removal of page numbers and boilerplate repeated across most pages.
- Chunking. Recursive splitting with separators suited to legal text (Article, Chapitre), controlled overlap.
- Deduplication. Hash of the normalized content to drop identical chunks, and a hash of the whole file to reject a document already present (409).
- Vectorization. Embeddings stored in the owner's Qdrant collection.
- Retrieval. Similarity search, grouped by document priority then sorted by score.
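The deduplication step in the list above can be sketched as follows. The normalization rules (NFKC plus whitespace collapsing) and the use of SHA-256 for chunk hashes are assumptions made for illustration; the function names are hypothetical.

```python
import hashlib
import unicodedata
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def content_hash(text: str) -> str:
    """Hash the normalized text so cosmetic differences do not matter."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate_chunks(chunks: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct normalized chunk."""
    seen: Set[str] = set()
    unique: List[str] = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def file_hash(data: bytes) -> str:
    """Hash of the raw file, usable to reject a re-upload with a 409."""
    return hashlib.sha256(data).hexdigest()
```

Hashing after normalization means two chunks that differ only in spacing or Unicode form count as the same chunk, while the file hash compares raw bytes.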
    # a line seen on >= 60% of pages is treated as repeated noise
    min_occurrences = max(3, ceil(len(documents) * BOILERPLATE_RATIO_THRESHOLD))
    repeated = {
        line_key
        for line_key, count in page_level_counter.items()
        if count >= min_occurrences
    }

    # retrieval: group hits by document priority, then sort each group by score
    results = store.similarity_search_with_score(query, k=top_k)
    grouped = {}
    for doc, score in results:
        prio = doc.metadata.get("priority", float("inf"))  # no priority sorts last
        grouped.setdefault(prio, []).append((doc, score))
    ordered = []
    for prio in sorted(grouped):  # priority 1 first
        ordered.extend(sorted(grouped[prio],
                              key=lambda t: t[1], reverse=True))  # descending score
The system prompt forces the model to answer only from the provided context, to cite the article and source when it relies on a specific text, and to admit it has no answer rather than make one up.
Every base belongs to someone
Multi-tenancy does not rely on a naming convention but on a Firestore registry: each vector base is tied to an owner_user_id. Every operation on a base checks this ownership.
Roles
An rh creates, lists and manages their own bases and documents. An agent consumes a base assigned to them and cannot change its configuration.
Ownership enforced
Chat, upload, deletion, priorities, base deletion: every path goes through user_owns_vector_base before acting.
Transactional creation
Qdrant creation then Firestore registration, with a rollback of the collection if registration fails.
Inconsistent states handled
If a collection has vanished from Qdrant but still exists in Firestore, deletion cleans up the orphan without error.
    async def user_owns_vector_base(user_id: int, base_name: str) -> bool:
        if user_id <= 0:
            return False
        record = await get_vector_base_record(base_name)
        return bool(record and record["owner_user_id"] == user_id)
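The transactional creation mentioned above (create in Qdrant, register in Firestore, roll back on failure) might look like the sketch below. The three callables are hypothetical placeholders for the real Qdrant and Firestore operations, passed in so the example runs on its own.

```python
import asyncio
from typing import Awaitable, Callable


async def create_vector_base(
    name: str,
    owner_user_id: int,
    create_collection: Callable[[str], Awaitable[None]],
    register_owner: Callable[[str, int], Awaitable[None]],
    delete_collection: Callable[[str], Awaitable[None]],
) -> None:
    """Create the collection, then register its owner.

    If registration fails, the freshly created collection is deleted
    so that no unowned collection is left behind.
    """
    await create_collection(name)
    try:
        await register_owner(name, owner_user_id)
    except Exception:
        await delete_collection(name)  # roll back the first step
        raise  # let the caller see the original failure
```

Re-raising after the rollback keeps the original error visible to the caller.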
Hardened where it matters
Authentication, secret storage and error exposure were handled seriously rather than with a minimal JWT bolted on top of the business logic.
Typed JWT
Separate access and refresh, jti/iat/exp claims, type checked on decode. The token's user_id is cross-checked against the stored id.
Refresh rotation
On every login or refresh, old tokens are revoked and expired ones purged. The token is hashed before storage.
Passwords
bcrypt_sha256 to get around bcrypt's 72-byte limit. Legacy plaintext passwords cannot authenticate.
Fail-fast
Refuses to start if the JWT key is under 32 characters or the OpenAI key is missing.
Rate limiting
Per-IP limits on login, refresh, upload and chat, to guard against brute force and overuse.
Audit
Every creation, change or deletion of user config is recorded in a Firestore audit collection.
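The fail-fast rule listed above can be expressed as a small startup check. This is a sketch under assumptions: the environment variable names `JWT_SECRET_KEY` and `OPENAI_API_KEY` and the `ConfigError` type are illustrative, not taken from the project.

```python
import os
from typing import Mapping

MIN_JWT_KEY_LENGTH = 32  # threshold stated in the feature list above


class ConfigError(RuntimeError):
    """Raised at startup when a mandatory secret is missing or too weak."""


def validate_startup_config(env: Mapping[str, str]) -> None:
    """Raise ConfigError instead of letting the app start misconfigured."""
    jwt_key = env.get("JWT_SECRET_KEY", "")  # assumed variable name
    if len(jwt_key) < MIN_JWT_KEY_LENGTH:
        raise ConfigError(
            f"JWT key must be at least {MIN_JWT_KEY_LENGTH} characters"
        )
    if not env.get("OPENAI_API_KEY"):  # assumed variable name
        raise ConfigError("OpenAI API key is missing")


if __name__ == "__main__":
    # Typically called once, before the ASGI app is created.
    validate_startup_config(os.environ)
```

Taking the environment as a parameter keeps the check testable without touching the real process environment.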
    from passlib.context import CryptContext

    pwd_context = CryptContext(schemes=["bcrypt_sha256", "bcrypt"], deprecated="auto")

    def hash_password(password: str) -> str:
        # bcrypt_sha256: no silent truncation beyond 72 bytes
        return pwd_context.hash(password, scheme="bcrypt_sha256")
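The refresh rotation rule (revoke old tokens, purge expired ones, store only a hash) can be illustrated with a simplified in-memory model. Note the assumptions: opaque random strings stand in for the refresh JWTs, a dict stands in for Firestore, SHA-256 is assumed as the storage hash, and the class and method names are hypothetical.

```python
import hashlib
import secrets
import time
from typing import Dict, Optional


class RefreshTokenStore:
    """Keeps only SHA-256 digests of refresh tokens, never the tokens."""

    def __init__(self, ttl_seconds: int = 3600):
        self._ttl = ttl_seconds
        self._records: Dict[str, dict] = {}  # digest -> {"user_id", "exp"}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: int, now: Optional[float] = None) -> str:
        """Purge expired tokens, revoke the user's old ones, issue a new one."""
        now = time.time() if now is None else now
        self._records = {
            digest: rec
            for digest, rec in self._records.items()
            if rec["exp"] > now and rec["user_id"] != user_id
        }
        token = secrets.token_urlsafe(32)
        self._records[self._digest(token)] = {
            "user_id": user_id,
            "exp": now + self._ttl,
        }
        return token

    def rotate(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Exchange a valid token for a new one; None if revoked or expired."""
        now = time.time() if now is None else now
        record = self._records.get(self._digest(token))
        if record is None or record["exp"] <= now:
            return None
        return self.issue(record["user_id"], now)
```

Since only digests are stored, a leaked copy of the store does not reveal usable refresh tokens.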
/docs and /redoc routes disappear and stack traces are never returned to the client, only logged server-side.
Three stores, three roles
Each backing store has a clear responsibility, and blocking calls are systematically offloaded to a thread so the asyncio loop never stalls.
Qdrant
Vector base, one collection per base. Similarity search, document priority handling, embedding dimension matched to the model.
Redis
Per-user conversation history, length-bounded with a sliding TTL refreshed on every read or write. Cache of retrieved documents.
Firestore
Users and configuration, hashed refresh tokens, audit log, and the ownership registry for vector bases.
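Offloading a blocking client call to a thread, as described at the top of this section, can be done with the standard library. The sketch below uses `asyncio.to_thread` (Python 3.9+); whether the project uses this helper or an executor is not stated, and `blocking_lookup` is a hypothetical stand-in for a synchronous store call.

```python
import asyncio
import time
from typing import List


def blocking_lookup(key: str) -> str:
    """Stand-in for a synchronous client call (hypothetical)."""
    time.sleep(0.05)  # simulates network latency that would block the loop
    return f"value-for-{key}"


async def lookup(key: str) -> str:
    # Run the blocking function in a worker thread; the event loop
    # stays free to serve other requests in the meantime.
    return await asyncio.to_thread(blocking_lookup, key)


async def main() -> List[str]:
    # The three lookups overlap instead of running one after another.
    results = await asyncio.gather(*(lookup(k) for k in ("a", "b", "c")))
    return list(results)
```

`asyncio.gather` returns results in the order the awaitables were passed, regardless of which thread finishes first.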
Model-agnostic by design
The architecture is designed to plug in several LLM providers. The seams exist: a model identifier flows from configuration down to the selection point, registries are ready to be extended, and the RAG layer is isolated behind a single function.
Natural targets on the generation side: Vertex AI / Gemini, open-source models via vLLM, Ollama on-prem, Azure OpenAI for enterprise, or any provider compatible with the OpenAI API. The wiring comes down to a factory and widening the accepted model type.
    def get_llm(provider: str, model: str):
        return {
            "openai": lambda: ChatOpenAI(model=model, streaming=True),
            "vertex": lambda: ChatVertexAI(model=model),
            "ollama": lambda: ChatOllama(model=model),
            "vllm": lambda: ChatOpenAI(model=model, base_url=VLLM_URL),  # openai-compatible
            "azure": lambda: AzureChatOpenAI(deployment_name=model),
        }[provider]()
Today the pipeline runs on OpenAI; the extension points are in place to host the others without rewriting the business logic.
Stack & foundations
Containerized
Docker Compose orchestrates the API, Qdrant, Redis and a Firestore emulator for local work, with an optional UI enabled by profile.
GCP target
Firestore in production, GitLab CI integration, Python. Environment variables centralized in a single configuration module.
Configurable
Token lifetimes, history TTL, rate limits, upload size, maximum lengths: everything is overridable via environment variable.
Seedable
An idempotent seed script creates or updates users and their configuration without touching the Qdrant bases.
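The environment-driven configuration described under "Configurable" could be centralized with a helper like the one below. The variable names and default values are illustrative assumptions, not the project's actual settings.

```python
import os
from typing import Dict, Mapping, Optional


def env_int(name: str, default: int,
            env: Optional[Mapping[str, str]] = None) -> int:
    """Read a positive integer setting, falling back to a default."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Gather every tunable in one place (names and defaults are assumed)."""
    return {
        "access_token_ttl_minutes": env_int("ACCESS_TOKEN_TTL_MINUTES", 15, env),
        "history_ttl_seconds": env_int("HISTORY_TTL_SECONDS", 3600, env),
        "max_upload_mb": env_int("MAX_UPLOAD_MB", 20, env),
    }
```

Rejecting non-positive values at load time keeps a mistyped variable from silently disabling a limit.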
Core dependencies
The backend's core building blocks and their role in the pipeline.
| Package | Role |
|---|---|
| fastapi | Async API framework: routing, validation, streaming |
| langchain | RAG orchestration, LCEL chain, prompt templates |
| langchain-openai | ChatOpenAI, OpenAIEmbeddings |
| langchain-qdrant | QdrantVectorStore, similarity_search_with_score |
| qdrant-client | Qdrant client: scroll, set_payload, delete, count |
| redis | Async Redis client: list ops, scan, TTL |
| google-cloud-firestore | User store, refresh tokens, audit, ownership |
| PyMuPDF | PDF text extraction (PyMuPDFLoader) |
| passlib[bcrypt] | bcrypt_sha256, password hashing |
| bcrypt | Pinned bcrypt backend (passlib compat) |
| PyJWT | JWT HS256 creation and decoding |
| slowapi | Per-IP rate limiting |
| python-multipart | Multipart file upload |
| uvicorn | ASGI server, production mode without reload |
| aiofiles | Async read, SHA-256 hash of the PDF |
Around 192 unit tests
Coverage goes well beyond a smoke test: security, preprocessing, business services, the indexing pipeline, and API integration tests.
- Security. Hashing, generation and validation of JWT tokens.
- Preprocessing. PDF cleaning, page-number detection, chunk deduplication.
- Services. Chat context, base ownership, users, documents, vectorstore.
- Indexing. Loader, splitter, deduplication, Qdrant insertion, full upload with temp-file cleanup.
- API integration. Login, authentication, password and filename validation.
API surface
Every protected route expects an Authorization: Bearer <access_token> header. Management routes are restricted to the rh role.
- /login: credentials → token pair
- /refresh: renews the pair
- /logout: revokes the current refresh
- /list_users
- /get_user_config
- /create_new_user
- /modify_user_config
- /users/{username}
- /users/{username}/config
- /chat/stream: streaming answer
- /chat/retrieved_documents: passages from the last answer
- /show_history
- /clean_history
- /show_history_length
- /modify_history_length
- /document_list
- /upload_document
- /delete_document
- /set_new_documents_priority
- /reset_document_priority
- /get_vectorbase_list
- /create_new_vector_base
- /delete_vector_base