Knowledge Base
Upload documents and enable grounded, accurate agent responses.
The Knowledge Base is OpenAgent's RAG (Retrieval-Augmented Generation) system. It lets your agents answer questions based on your own documents rather than relying solely on the LLM's training data.
How it works
Document Upload
│
▼
┌─────────────────┐
│ Document │ ← parse text from PDF, DOCX, XLSX, etc.
│ Parsing │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Text │ ← split into overlapping chunks
│ Splitting │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Embedding │ ← convert chunks to vectors
└────────┬────────┘
│
▼
┌─────────────────┐
│ Vector Store │ ← store vectors + metadata
└────────┬────────┘
│
At query time:
│
▼
┌─────────────────┐
│ Semantic │ ← embed query, find similar chunks
│ Search │
└────────┬────────┘
│
▼
Top-K chunks injected into agent context
Supported Document Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text extraction + OCR for scanned pages |
| Word | .docx, .doc | Preserves headings and structure |
| Excel | .xlsx, .xls | Tabular data per sheet |
| CSV / TSV | .csv, .tsv | Row-by-row ingestion |
| PowerPoint | .pptx | Slide text and notes |
| Plain Text | .txt | Direct ingestion |
| Markdown | .md, .mdx | Code-aware splitting |
Scanned PDFs are processed with OCR automatically. Accuracy depends on scan quality.
Creating a Knowledge Base
Name your knowledge base
Go to Knowledge Bases → New Knowledge Base. Choose a descriptive name (e.g., "Product Documentation", "HR Handbook").
Choose an embedding model
Select the embedding model used to convert text chunks to vectors. All documents in a knowledge base must use the same embedding model — choose carefully, as changing it later requires re-indexing.
Recommended: text-embedding-3-small (OpenAI) — fast, affordable, and high quality.
Upload documents
Click Upload and select one or more files. You can upload multiple files at once. The indexing pipeline starts automatically.
Monitor indexing progress
Each file shows a status indicator:
- Queued — waiting to be processed
- Indexing — currently being chunked and embedded
- Ready — available for agent queries
- Failed — an error occurred (check the error details)
Attach to an agent
Open your agent settings and select this knowledge base under Knowledge Base. The agent will now search it automatically for every user message.
Chunking Strategy
OpenAgent splits documents into overlapping chunks before embedding. The default settings work well for most documents, but you can tune them:
| Parameter | Default | Description |
|---|---|---|
| chunk_size | 512 tokens | Maximum tokens per chunk |
| chunk_overlap | 50 tokens | Overlap between adjacent chunks to preserve context |
| split_strategy | recursive | How to split: recursive, sentence, paragraph |
When to adjust:
- Technical documentation — larger chunks (1024) to keep code examples intact
- FAQ-style content — smaller chunks (256) so each Q&A is a discrete unit
- Dense tables — use the paragraph strategy to keep rows together
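To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size splitting with overlap. It is an illustration only, not OpenAgent's actual splitter: the function name `split_into_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
from typing import List


def split_into_chunks(
    tokens: List[str], chunk_size: int = 512, chunk_overlap: int = 50
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous
    one, so adjacent windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: words stand in for tokens produced by a real tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = split_into_chunks(words)
print([len(c) for c in chunks])
```

With the defaults, a 1,200-token document yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next one. Larger overlap preserves more context across boundaries at the cost of storing and embedding more duplicated text.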
Retrieval Configuration
When an agent queries the knowledge base, you can configure:
| Parameter | Default | Description |
|---|---|---|
| top_k | 5 | Number of chunks to retrieve |
| similarity_threshold | 0.7 | Minimum similarity score (0–1) |
| reranking | disabled | Rerank results with a cross-encoder for better precision |
Raising top_k gives the agent more context but uses more tokens per request. Start with the default and increase if agents miss relevant information.
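The sketch below shows how top_k and similarity_threshold could combine at query time. It assumes cosine similarity as the scoring function, which this page does not confirm, and uses tiny hand-written vectors in place of real embeddings. The names `cosine_similarity` and `retrieve` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    top_k: int = 5,
    similarity_threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Score every chunk, drop those below the threshold, return the best top_k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    kept = [item for item in scored if item[1] >= similarity_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Assumption: 2-dimensional toy vectors stand in for real embeddings.
index = [
    ("Refunds are issued within 14 days.", [1.0, 0.0]),
    ("Install the CLI with pip.", [0.0, 1.0]),
    ("Refund requests need an order number.", [0.8, 0.6]),
]
results = retrieve([1.0, 0.0], index, top_k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

The threshold is applied before the top_k cut, so a query with few good matches returns fewer than top_k chunks instead of padding the context with weak ones.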
Multi-Knowledge Base Agents
A single agent can query multiple knowledge bases. For example, a customer support agent might search both:
- Product Documentation — for technical questions
- Company Policy — for billing and return questions
When multiple knowledge bases are attached, OpenAgent performs parallel searches and merges the results, ranking by relevance score.
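As a rough illustration of the parallel search and merge step, the sketch below queries each knowledge base in its own thread and then ranks all hits together by score. The `search_all` function, the `search_fn` callback, and the canned results are all hypothetical stand-ins, not OpenAgent APIs or real data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (chunk text, relevance score)


def search_all(
    kb_names: List[str],
    query: str,
    search_fn: Callable[[str, str], List[Hit]],
    top_k: int = 5,
) -> List[Tuple[str, str, float]]:
    """Query every knowledge base concurrently, then rank all hits together."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(search_fn, name, query) for name in kb_names}
        per_kb = {name: future.result() for name, future in futures.items()}

    # Flatten to (knowledge base, chunk text, score) and sort by score, best first.
    merged = [
        (name, text, score)
        for name, hits in per_kb.items()
        for text, score in hits
    ]
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:top_k]


# Assumption: canned results stand in for real per-knowledge-base searches.
FAKE_RESULTS: Dict[str, List[Hit]] = {
    "Product Documentation": [
        ("Reset the device by holding the power button.", 0.91),
        ("Firmware updates install automatically.", 0.74),
    ],
    "Company Policy": [("Returns are accepted within 30 days.", 0.88)],
}


def fake_search(kb_name: str, query: str) -> List[Hit]:
    return FAKE_RESULTS[kb_name]


merged = search_all(list(FAKE_RESULTS), "How do I return a device?", fake_search, top_k=2)
for kb, text, score in merged:
    print(f"{score:.2f}  [{kb}] {text}")
```

Keeping the knowledge base name on each hit lets the agent attribute a retrieved chunk to its source after the merge.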
Keeping Knowledge Fresh
- Manual re-upload — delete the old file and upload the updated version
- API-based updates — use the REST API to programmatically sync documents from your CMS, database, or file system
- Scheduled sync — configure a sync job to pull from a URL or S3 bucket on a schedule
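Whichever update path you use, a sync job has to decide which documents actually changed. One common approach, sketched below, compares content hashes of the current source files against hashes recorded at the last upload. This is only an assumption about how such a job might be structured; `plan_sync` and the manifest format are hypothetical, and the actual upload and delete calls are left out because this page does not describe the REST API.

```python
import hashlib
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(
    source_docs: Dict[str, bytes], indexed_hashes: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare current source documents with hashes saved at the last sync.

    Returns the names to (re-)upload because they are new or changed, and
    the names to delete because they no longer exist at the source.
    """
    upload = sorted(
        name
        for name, data in source_docs.items()
        if indexed_hashes.get(name) != content_hash(data)
    )
    delete = sorted(name for name in indexed_hashes if name not in source_docs)
    return {"upload": upload, "delete": delete}


# Assumption: in-memory bytes stand in for files from a CMS, database, or disk.
source_docs = {"pricing.md": b"v2 pricing", "faq.md": b"unchanged"}
indexed_hashes = {
    "pricing.md": content_hash(b"v1 pricing"),
    "faq.md": content_hash(b"unchanged"),
    "legacy.md": content_hash(b"retired page"),
}
plan = plan_sync(source_docs, indexed_hashes)
print(plan)
```

Skipping unchanged files avoids re-chunking and re-embedding documents whose content is already indexed.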
Best Practices
Organize by topic — separate knowledge bases for different domains keep retrieval precise and let you attach only the relevant base to each agent.
Prefer clean text — if possible, export documents as plain text or markdown rather than converting complex PDFs. Better input → better retrieval.
Include metadata — add document title and creation date to help the agent cite sources correctly.
Test retrieval — use the Search tab in the knowledge base view to test what chunks are returned for representative queries before deploying to production.