Knowledge Base

The Knowledge Base is OpenAgent's RAG (Retrieval-Augmented Generation) system. It lets your agents answer questions based on your own documents rather than relying solely on the LLM's training data.

How it works

Document Upload
      │
      ▼
┌─────────────────┐
│   Document      │  ← parse text from PDF, DOCX, XLSX, etc.
│   Parsing       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Text          │  ← split into overlapping chunks
│   Splitting     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Embedding     │  ← convert chunks to vectors
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Vector Store  │  ← store vectors + metadata
└────────┬────────┘
         │
      At query time:
         │
         ▼
┌─────────────────┐
│   Semantic      │  ← embed query, find similar chunks
│   Search        │
└────────┬────────┘
         │
         ▼
    Top-K chunks injected into agent context

Supported Document Formats

Format	Extension	Notes
PDF	`.pdf`	Text extraction + OCR for scanned pages
Word	`.docx`, `.doc`	Preserves headings and structure
Excel	`.xlsx`, `.xls`	Tabular data per sheet
CSV / TSV	`.csv`, `.tsv`	Row-by-row ingestion
PowerPoint	`.pptx`	Slide text and notes
Plain Text	`.txt`	Direct ingestion
Markdown	`.md`, `.mdx`	Code-aware splitting

Scanned PDFs are processed with OCR automatically. Accuracy depends on scan quality.

Creating a Knowledge Base

Name your knowledge base

Go to Knowledge Bases → New Knowledge Base. Choose a descriptive name (e.g., "Product Documentation", "HR Handbook").

Choose an embedding model

Select the embedding model used to convert text chunks to vectors. All documents in a knowledge base must use the same embedding model — choose carefully, as changing it later requires re-indexing.

Recommended: text-embedding-3-small (OpenAI) — fast, affordable, and high quality.

Upload documents

Click Upload and select one or more files. Indexing starts automatically as soon as the upload completes.

Monitor indexing progress

Each file shows a status indicator:

Queued — waiting to be processed
Indexing — currently being chunked and embedded
Ready — available for agent queries
Failed — an error occurred (check the error details)

Attach to an agent

Open your agent settings and select this knowledge base under Knowledge Base. The agent will now search it automatically for every user message.

Chunking Strategy

OpenAgent splits documents into overlapping chunks before embedding. The default settings work well for most documents, but you can tune them:

Parameter	Default	Description
`chunk_size`	512 tokens	Maximum tokens per chunk
`chunk_overlap`	50 tokens	Overlap between adjacent chunks to preserve context
`split_strategy`	`recursive`	How to split: `recursive`, `sentence`, `paragraph`

When to adjust:

Technical documentation — larger chunks (1024) to keep code examples intact
FAQ-style content — smaller chunks (256) so each Q&A is a discrete unit
Dense tables — use paragraph strategy to keep rows together
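
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is an illustration only: the function name `split_with_overlap` is hypothetical, the input is assumed to be an already tokenized list, and it does not reproduce OpenAgent's `recursive`, `sentence`, or `paragraph` strategies, which also take document structure into account.

```python
def split_with_overlap(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into chunks of at most chunk_size tokens.

    Each chunk after the first starts with the last chunk_overlap tokens
    of the previous chunk. Hypothetical helper, not OpenAgent's code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
```

With the defaults, the window advances 462 tokens at a time, so a larger `chunk_overlap` preserves more context across chunk boundaries at the cost of storing and embedding more chunks per document.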

Retrieval Configuration

You can configure how an agent retrieves chunks from the knowledge base:

Parameter	Default	Description
`top_k`	5	Number of chunks to retrieve
`similarity_threshold`	0.7	Minimum similarity score (0–1)
`reranking`	disabled	Rerank results with a cross-encoder for better precision

Raising `top_k` gives the agent more context but uses more tokens per request. Start with the default and increase it if agents miss relevant information.
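
The sketch below shows how `top_k` and `similarity_threshold` could interact during a search. It assumes cosine similarity as the scoring function and uses tiny hand-written vectors in place of real embeddings; the names `cosine_similarity`, `retrieve`, and `indexed_chunks` are hypothetical and not part of OpenAgent's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, indexed_chunks, top_k=5, similarity_threshold=0.7):
    """Return up to top_k (score, text) pairs scoring at or above the threshold."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in indexed_chunks]
    # Drop weak matches first, then keep the best top_k of what remains.
    kept = [pair for pair in scored if pair[0] >= similarity_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
indexed_chunks = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Return shipping is free for defective items.", [0.8, 0.3, 0.1]),
]
results = retrieve([1.0, 0.2, 0.0], indexed_chunks, top_k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Under these assumptions the threshold acts as a filter and `top_k` as a cap, so a query with no sufficiently similar chunks can return fewer than `top_k` results, or none at all.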

Multi-Knowledge Base Agents

A single agent can query multiple knowledge bases. For example, a customer support agent might search both:

Product Documentation — for technical questions
Company Policy — for billing and return questions

When multiple knowledge bases are attached, OpenAgent performs parallel searches and merges the results, ranking by relevance score.
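
As a rough illustration of the merge step, the sketch below combines per-knowledge-base results into one ranked list. It assumes the searches have already run and that scores are directly comparable across knowledge bases (for example, because they share an embedding model). The function `merge_results` and the sample data are hypothetical.

```python
import heapq


def merge_results(results_per_kb, top_k=5):
    """Merge {kb_name: [(score, text), ...]} into one list of the top_k hits.

    Each returned hit is (score, kb_name, text), highest score first.
    """
    tagged = [
        (score, kb_name, text)
        for kb_name, hits in results_per_kb.items()
        for score, text in hits
    ]
    return heapq.nlargest(top_k, tagged, key=lambda hit: hit[0])


results_per_kb = {
    "Product Documentation": [
        (0.91, "Reset the device by holding the power button."),
        (0.74, "Firmware updates install automatically."),
    ],
    "Company Policy": [
        (0.88, "Returns are accepted within 30 days."),
        (0.71, "Invoices are sent monthly."),
    ],
}
merged = merge_results(results_per_kb, top_k=3)
for score, kb_name, text in merged:
    print(f"{score:.2f}  [{kb_name}]  {text}")
```

Keeping the knowledge base name on each hit lets the agent attribute an answer to the right source after the lists are merged.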

Keeping Knowledge Fresh

Manual re-upload — delete the old file and upload the updated version
API-based updates — use the REST API to programmatically sync documents from your CMS, database, or file system
Scheduled sync — configure a sync job to pull from a URL or S3 bucket on a schedule
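
Whichever method you use, a sync job has to decide which documents actually changed. The sketch below shows one common approach, comparing content hashes against the hashes recorded at the last indexing run. It covers only the decision step: the helper names are hypothetical, and the upload and delete calls to OpenAgent's REST API are deliberately left out because their endpoints are not described here.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Decide what to re-upload and what to delete.

    source_docs:    {name: bytes} as currently found in the source system
    indexed_hashes: {name: hash} recorded when each document was last indexed
    Returns (to_upload, to_delete) as sorted lists of document names.
    """
    to_upload = sorted(
        name for name, data in source_docs.items()
        if indexed_hashes.get(name) != content_hash(data)  # new or changed
    )
    to_delete = sorted(
        name for name in indexed_hashes if name not in source_docs
    )
    return to_upload, to_delete


source_docs = {"faq.md": b"v2", "pricing.md": b"v1", "new.md": b"hello"}
indexed_hashes = {
    "faq.md": content_hash(b"v1"),
    "pricing.md": content_hash(b"v1"),
    "old.md": content_hash(b"x"),
}
to_upload, to_delete = plan_sync(source_docs, indexed_hashes)
print(to_upload, to_delete)
```

Skipping unchanged documents avoids re-chunking and re-embedding content that is already indexed.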

Best Practices

Organize by topic — separate knowledge bases for different domains keep retrieval precise and let you attach only the relevant base to each agent.

Prefer clean text — if possible, export documents as plain text or markdown rather than converting complex PDFs. Better input → better retrieval.

Include metadata — add document title and creation date to help the agent cite sources correctly.

Test retrieval — use the Search tab in the knowledge base view to test what chunks are returned for representative queries before deploying to production.
