Chunking strategy strongly affects retrieval-augmented generation (RAG) quality:
```typescript
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// Basic chunking: try separators in order, from paragraph breaks down to characters
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', ' ', ''],
});

const chunks = await splitter.splitText(document);
```
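To make the `chunkSize`/`chunkOverlap` mechanics concrete, here is a minimal, dependency-free sketch of fixed-size chunking with overlap (the real splitter additionally respects the separator hierarchy; `chunkWithOverlap` is a hypothetical helper name, not part of LangChain):

```typescript
// Hypothetical helper: fixed-size character chunks with overlap.
// RecursiveCharacterTextSplitter also tries separators in order;
// this sketch only illustrates the size/overlap arithmetic.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance by chunk size minus overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final chunk reached
  }
  return chunks;
}
```

Each chunk repeats the last `overlap` characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.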
## Semantic Chunking
```typescript
async function semanticChunk(text: string, maxTokens = 500): Promise<string[]> {
  // Split into sentences on terminal punctuation; fall back to the whole text
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let current: string[] = [];
  let tokenCount = 0;

  for (const sentence of sentences) {
    const tokens = sentence.split(/\s+/).length; // Approximate tokens by word count
    if (tokenCount + tokens > maxTokens && current.length) {
      chunks.push(current.join(' '));
      current = [];
      tokenCount = 0;
    }
    current.push(sentence);
    tokenCount += tokens;
  }

  if (current.length) chunks.push(current.join(' '));
  return chunks;
}
```
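The whitespace split above only approximates token counts. A slightly different dependency-free heuristic divides character count by four, a common rule of thumb for English text with GPT-style tokenizers (`estimateTokens` is a hypothetical name; for exact counts use the model's own tokenizer, e.g. tiktoken):

```typescript
// Hypothetical heuristic: ~4 characters per token for English text.
// Swap in the model's real tokenizer when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```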
## Best Practices
- Chunk size: 500-1000 tokens works well for most embedding models
- Overlap: 10-20% of chunk size, so context isn't lost at boundaries
- Preserve semantic boundaries (sentences, paragraphs, sections) rather than cutting mid-thought
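Under those guidelines, the overlap can be derived directly from the chunk size; a small sketch, picking the middle of the 10-20% range (`overlapFor` is a hypothetical helper name):

```typescript
// Hypothetical helper: overlap at ~15% of chunk size,
// the midpoint of the 10-20% range suggested above.
function overlapFor(chunkSize: number, fraction = 0.15): number {
  return Math.round(chunkSize * fraction);
}
```

For example, a 1000-token chunk gets a 150-token overlap, matching the 1000/200 configuration shown earlier within the recommended band.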
