Product Introduction
- Document Parser by Contextual AI is a specialized tool designed to convert unstructured documents into structured Markdown or JSON formats for Retrieval-Augmented Generation (RAG) applications. It supports PDF, DOC/DOCX, and PPT/PPTX file formats, leveraging LibreOffice for file conversion to ensure compatibility and accurate page count calculations. The parser operates with a 100MB file size limit and handles documents up to 400 pages, prioritizing reliability for enterprise-scale workflows.
- The core value lies in its ability to extract semantically meaningful structures from complex documents while minimizing hallucinations and preserving hierarchical relationships. It enables AI systems to process technical reports, financial statements, and other modality-rich documents with improved accuracy for downstream tasks like question answering and data retrieval.
Main Features
- The parser implements document-level understanding through hierarchical heading detection, automatically generating a table of contents when enable_document_hierarchy is activated. This feature identifies H1-H6 heading levels and maintains parent-child relationships between sections, even in scanned or image-heavy documents processed in standard parse mode.
- Advanced table handling splits large tables into smaller row-wise segments using the enable_split_tables and max_split_table_cells parameters, preserving header context in each split segment. This lets LLMs process tabular data without losing structural context, which is particularly useful for financial reports with tables of 50 or more rows.
- Multi-modal processing combines OCR, layout analysis, and computer vision to extract figures, equations, and text blocks in their native spatial arrangement. The figure_caption_mode parameter offers concise or detailed descriptions, with beta features for chemical diagrams and mathematical notation recognition.
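The header-preserving table split described above can be illustrated with a minimal Python sketch. This is not the parser's implementation: the function name split_table is hypothetical, and treating max_cells as an upper bound on cells per segment (header included) is an assumption made for illustration. The real max_split_table_cells parameter may instead act as a threshold that triggers splitting.

```python
from typing import List

Row = List[str]


def split_table(header: Row, rows: List[Row], max_cells: int) -> List[List[Row]]:
    """Split a table into row-wise segments, repeating the header in each.

    Hypothetical helper: max_cells is assumed to cap the number of cells
    per segment, counting the repeated header row.
    """
    width = len(header)
    if width == 0:
        raise ValueError("header must have at least one column")
    if not rows:
        return [[header]]
    # Reserve one row per segment for the header; always keep at least one data row.
    rows_per_segment = max(1, max_cells // width - 1)
    segments = []
    for start in range(0, len(rows), rows_per_segment):
        segments.append([header] + rows[start:start + rows_per_segment])
    return segments


if __name__ == "__main__":
    header = ["Quarter", "Revenue", "Cost"]
    rows = [[f"Q{i}", str(100 + i), str(50 + i)] for i in range(10)]
    for segment in split_table(header, rows, max_cells=12):
        print(segment)
```

Because every segment starts with the same header row, each one can be embedded or passed to an LLM independently without losing column meaning.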
Problems Solved
- Addresses the challenge of information loss in traditional PDF parsers when handling nested tables, multi-column layouts, and embedded images. The solution prevents merged text blocks and misordered content common in open-source parsing libraries.
- Targets AI engineers building RAG pipelines who require structured document representations for vector databases. Typical users include teams developing enterprise search systems, technical documentation analyzers, and regulatory compliance checkers.
- Optimized for processing scanned contracts with handwritten annotations, research papers containing mathematical notation, and annual reports with embedded charts. Supports use cases requiring precise page range extraction through page_range values such as 0-5,10,15-20.
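To make the page_range syntax concrete, here is a small sketch of how such a string could be expanded into page indices on the client side. The helper parse_page_range is hypothetical and not part of the product. It assumes hyphenated ranges are inclusive at both ends, which the text above does not state explicitly.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec like "0-5,10,15-20" into sorted 0-based page indices.

    Assumption: hyphenated ranges include both endpoints.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page range: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # int("") raises ValueError, so negative numbers are rejected too.
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_range("0-5,10,15-20"))
```

Under the inclusive-range assumption, the example string selects 13 pages: 0 through 5, page 10, and 15 through 20.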
Unique Advantages
- Unlike AWS Textract or Google Document AI, this parser maintains original document hierarchy through semantic analysis rather than relying solely on coordinate-based layout detection. This enables reconstruction of section nesting even in documents without native heading styles.
- Implements active hallucination suppression through constrained decoding algorithms that limit speculative text generation. The system achieves 98.7% accuracy on the PubLayNet benchmark for layout recognition, outperforming commercial alternatives by 12%.
- Combines LibreOffice-based preprocessing with proprietary layout analysis engines, enabling consistent handling of both native digital files and scanned documents. The dual-mode captioning system (concise vs. detailed) adapts to different RAG requirements without retraining models.
Frequently Asked Questions (FAQ)
- What file formats and size limits does the parser support? The API accepts PDF, DOC/DOCX, and PPT/PPTX files under 100MB with a 400-page maximum. DOC/PPT files are converted to PDF via LibreOffice, which may alter page counts compared to native viewers.
- How does page range specification work for partial parsing? Use 0-based indexing with comma-separated values or hyphenated ranges like 0-5,10,15-20 to select specific pages. Continuous ranges starting from page 0 are required when using document hierarchy features.
- What distinguishes basic and standard parse modes? Basic mode processes text-only documents with simple layouts, while standard mode activates advanced features like table splitting, hierarchical headings, and multi-modal analysis for complex documents. Standard mode requires 2x more processing time but enables RAG-optimized outputs.
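The limits in this FAQ can be checked before a file is submitted. The sketch below is a hypothetical client-side pre-flight check, not part of the parser. It assumes 1 MB = 1024 * 1024 bytes and that a file of exactly 100MB is still accepted, and it conservatively accepts only a single range of the form 0 or 0-N when document hierarchy is requested. The 400-page limit is not checked, because page counts depend on the LibreOffice conversion.

```python
import os
import re
from typing import List, Optional

MAX_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx", ".ppt", ".pptx"}


def preflight(filename: str, size_bytes: int,
              page_range: Optional[str] = None,
              use_hierarchy: bool = False) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    # Assumption: the limit is inclusive, so only sizes above it are flagged.
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds the 100MB limit")
    if use_hierarchy and page_range is not None:
        # Conservative check: accept only "0" or "0-N" as a continuous
        # range starting from page 0.
        if not re.fullmatch(r"0(-\d+)?", page_range.strip()):
            problems.append("hierarchy needs a continuous range starting at page 0")
    return problems


if __name__ == "__main__":
    print(preflight("annual_report.pdf", 5_000_000, "0-5,10", use_hierarchy=True))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a file at once.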