PDF/DOCX: Extract test questions and images to create a master document?

A user seeks to extract test questions and images from 10 exam documents and organize them by historical topic into a master DOCX file with answer keys. AI assistants like Claude and ChatGPT have produced errors during extraction, including missing images, overlooked passages, and incorrectly sized images. The primary challenge is assigning each question to the correct topic while preserving its associated images and text.

Detailed Analysis

A user working with historical exam materials has run into a practical ceiling in what current large language models can do for structured document extraction and reorganization. The workflow involves ten separate exam documents with AP World History-style content, including the World in 1750, Revolutions, Nationalism, Imperialism, and World War I. The user wants to consolidate them into a single master DOCX file, sectioned by topic, with associated images, text passages, and auto-generated answer keys. Despite converting all source PDFs to DOCX format using Adobe Acrobat Pro, both Claude and ChatGPT have struggled to reliably extract multimodal content, particularly images embedded within questions, and to map each item to its corresponding thematic section.

The core technical challenge the user describes reflects a known limitation in how current AI systems handle multimodal document processing at scale. While LLMs have made significant progress in reading and interpreting text, reliably extracting and repositioning embedded images within document structures, while preserving their semantic relationship to the surrounding question text, remains error-prone. Errors such as missing images, oversized renderings, or incomplete passages indicate that the models are processing document content inconsistently. A likely cause is that image placement in DOCX files varies widely (inline, floating, or nested inside text boxes), even after a clean PDF-to-DOCX conversion. The classification task of assigning questions to the correct historical topic adds a second layer of complexity that compounds these extraction failures.
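To make the image-to-question relationship concrete, here is a minimal sketch of how a DOCX file ties pictures to paragraphs. It assumes the common case in which each picture appears as a DrawingML `a:blip` element whose `r:embed` attribute points, via `word/_rels/document.xml.rels`, to a file under `word/media/`. The function name `paragraphs_with_images` is hypothetical, and legacy VML images or pictures linked from headers and footers are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside a DOCX package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def paragraphs_with_images(docx):
    """Return one (text, [image targets]) tuple per paragraph, in document order.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        body = ET.fromstring(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId5") to targets (e.g. "media/image1.png").
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
    }

    result = []
    for para in body.iter(f"{{{W}}}p"):
        # Concatenate every text run inside this paragraph.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        images = []
        for blip in para.iter(f"{{{A}}}blip"):
            rel_id = blip.get(f"{{{R}}}embed")
            if rel_id in targets:
                images.append(targets[rel_id])
        result.append((text, images))
    return result
```

Because each paragraph carries its own image list, a question and its picture can travel together when the content is reorganized. One caveat of this sketch: a paragraph nested inside a text box is visited both on its own and as part of its parent paragraph, so real converter output may need de-duplication.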

This use case highlights a growing gap between AI capabilities as demonstrated in controlled demos and AI performance in real-world, multi-step document workflows. Users increasingly expect AI tools to function as end-to-end document automation pipelines, not merely as text-generation assistants. The repeated failure mode here, where Claude and ChatGPT both produce plausible but incomplete outputs, is particularly telling: it suggests the issue is systemic to the current generation of LLMs rather than specific to any one model. Gemini's reported inability to engage with the task at all further underscores how uneven the AI landscape is for this class of problem.

The user's practical suggestion of pre-converting PDFs to DOCX via Adobe Acrobat Pro is a sound preprocessing step, as it normalizes document structure before AI ingestion. However, a more robust solution would likely involve a pipeline approach: using a dedicated document parsing tool such as Adobe PDF Extract API, Amazon Textract, or Google Document AI to handle the structured extraction of text and images independently, then feeding the structured output into an LLM solely for the classification and organization task. This separation of concerns — extraction handled by purpose-built OCR/parsing tools, semantic sorting handled by LLMs — would play to the respective strengths of each technology rather than asking a single AI system to perform both reliably in sequence.
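The separation of concerns described above can be sketched as follows. This is an illustrative outline, not a definitive implementation: it assumes a parser has already produced one record per question with `text` and `images` fields, and it uses a hypothetical `keyword_classifier` as a stand-in for the LLM classification call. The topic names come from the article; the keywords are invented for the example.

```python
from collections import defaultdict

# Topic names are from the article; the keyword lists are illustrative assumptions.
TOPIC_KEYWORDS = {
    "The World in 1750": ["1750", "mughal", "ottoman"],
    "Revolutions": ["revolution", "bastille"],
    "Nationalism": ["nationalism", "unification"],
    "Imperialism": ["imperialism", "colony", "berlin conference"],
    "World War I": ["world war i", "trench", "versailles"],
}


def keyword_classifier(text):
    """Hypothetical stand-in for an LLM call: first topic with a matching keyword."""
    lowered = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return topic
    return "Unsorted"


def build_master(questions, classify=keyword_classifier):
    """Group question records by topic, leaving text and image lists untouched."""
    master = defaultdict(list)
    for question in questions:
        master[classify(question["text"])].append(question)
    return dict(master)


def check_complete(questions, master):
    """Raise ValueError if any question or image reference was lost in sorting."""
    images_in = sorted(img for q in questions for img in q["images"])
    images_out = sorted(
        img for group in master.values() for q in group for img in q["images"]
    )
    count_out = sum(len(group) for group in master.values())
    if count_out != len(questions) or images_in != images_out:
        raise ValueError("questions or images were lost during sorting")
```

The design point is that the classifier only ever returns a label. Text and images pass through unchanged, so the failure modes reported by the user (dropped images, truncated passages) cannot be introduced at the sorting stage, and `check_complete` flags them if an upstream step introduces them. Items that match no topic land in an "Unsorted" group for manual review rather than being silently misfiled.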

More broadly, this user's experience reflects a trend in AI adoption: professionals are discovering that LLMs excel at reasoning and language generation but still require significant scaffolding to operate as reliable document processing systems. Demand for agentic, multi-step workflows that handle diverse content types (text, images, structured data) is growing faster than the underlying model capabilities can uniformly support. Tools like Claude are increasingly being used in contexts that require precision and completeness rather than approximation, which is driving both user frustration and product development pressure on AI labs to improve document-native reliability, structured output fidelity, and multimodal coherence in complex real-world tasks.
