RAG Systems for Defense Intelligence

How Retrieval-Augmented Generation Systems Work for Defense Intelligence Applications

Retrieval-augmented generation (RAG) combines a large language model with a document retrieval system to ground generated text in specific source material — enabling intelligence analysts to query large document collections in natural language and receive answers that cite their sources. In defense contexts, RAG architectures require domain-specific modifications to chunking, embedding, retrieval, and generation to meet the accuracy, traceability, and classification requirements that distinguish military intelligence workflows from commercial applications.

RAG was first formalized by Lewis et al. in their 2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS), which demonstrated that coupling a pre-trained language model with a retrieval mechanism over external documents improved factual accuracy on knowledge-intensive tasks. According to GDIT's 2025 analysis, RAG is the most reliable deployment methodology for generative AI services in defense contexts because it allows organizations to ground model outputs in their own authoritative document collections rather than relying on the model's training data.

The RAG Pipeline: Four Stages

A defense RAG system operates in four stages — document ingestion, retrieval, augmented generation, and provenance verification — with each stage requiring defense-specific engineering decisions that differ from commercial RAG implementations.

Stage 1: Document Ingestion and Chunking

Raw intelligence documents — structured reports, unstructured cables, signals intercepts, imagery analysis notes — are split into chunks and converted into vector embeddings for storage in a vector database.

The chunking strategy is the first critical decision. Commercial RAG systems typically use fixed-token windows (e.g., 512 tokens). Defense intelligence documents, however, have internal structure — distinct sections for indicators, assessments, source lists, and classification markings — that fixed-token splitting destroys.

Schema-aware chunking, which splits documents along their logical section boundaries and preserves metadata (report ID, section type, sentence offsets, classification level), produces chunks that are individually citable. According to a 2025 comprehensive survey on RAG architectures by Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, chunking strategy directly determines the quality of retrieved evidence, and semantic chunking outperforms fixed-token approaches on structured document types.
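The schema-aware approach can be sketched as follows. This is a minimal illustration, not a production ingester: the section headers (`INDICATORS`, `ASSESSMENT`, `SOURCES`) and the report layout are hypothetical, and real intelligence report schemas vary by producer and format.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    report_id: str
    section: str
    classification: str
    text: str

# Hypothetical section headers; real report schemas vary by producer.
SECTION_HEADERS = ("INDICATORS", "ASSESSMENT", "SOURCES")

def schema_chunk(report_id: str, classification: str, raw: str) -> list[Chunk]:
    """Split a report along its logical section boundaries, tagging each
    chunk with metadata so it remains individually citable."""
    chunks: list[Chunk] = []
    section, lines = None, []
    for line in raw.splitlines():
        header = line.strip().rstrip(":")
        if header in SECTION_HEADERS:
            # Close out the previous section before starting a new one.
            if section and lines:
                chunks.append(Chunk(report_id, section, classification,
                                    "\n".join(lines).strip()))
            section, lines = header, []
        elif section:
            lines.append(line)
    if section and lines:
        chunks.append(Chunk(report_id, section, classification,
                            "\n".join(lines).strip()))
    return chunks
```

Because each chunk carries its report ID, section type, and classification level, a downstream citation can point at "RPT-001, ASSESSMENT" rather than an arbitrary 512-token window that may straddle two sections.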

Stage 2: Retrieval

When an analyst submits a query, the system converts it to a vector embedding and searches the database for the most similar document chunks. The top-k chunks (typically 5–20) are retrieved as candidate evidence.
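The retrieval step above reduces to a nearest-neighbor search in embedding space. A minimal sketch using brute-force cosine similarity over an in-memory index (production systems use an approximate-nearest-neighbor vector database; the embeddings here are toy two-dimensional vectors):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    """index: list of (chunk_id, embedding) pairs. Returns the IDs of the
    k chunks whose embeddings are most similar to the query embedding."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

At corpus scale, the sort is replaced by an ANN index (e.g., HNSW or IVF), but the contract is the same: a query embedding in, the k most similar chunk IDs out.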

Retrieval accuracy at this stage determines the quality of everything downstream. Domain-specific embedding fine-tuning, as documented in a 2024 Voyage AI study, improves retrieval accuracy by 6 to 7 percentage points on average compared to general-purpose embeddings. For defense intelligence documents — where "targeting" means something different from its commercial usage, and "engagement" is a military term of art — the vocabulary gap between general-purpose and domain-tuned embeddings is larger than in most commercial verticals.

DLRA's internal benchmarks show this gap manifests as a top-5 retrieval accuracy difference of 87.3% (general-purpose) versus 94.2% (domain-tuned) on the same defense intelligence evaluation set. Karpukhin et al.'s 2020 paper Dense Passage Retrieval for Open-Domain Question Answering (EMNLP) demonstrated that retrieval quality is primarily an encoder problem — making embedding fine-tuning the single highest-impact intervention for defense RAG accuracy.

Stage 3: Augmented Generation

The retrieved chunks are injected into the language model's context alongside the analyst's query. The model generates a response grounded in the retrieved evidence.

The generation prompt must instruct the model to cite specific chunks rather than synthesizing freely. Without explicit citation constraints, language models tend to paraphrase across multiple sources — producing fluent but untraceable text. In intelligence workflows, every claim must trace to a specific source document and passage.
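One way to impose that citation constraint is in the prompt itself. The sketch below assembles a prompt that requires a bracketed chunk ID after every sentence; the prompt wording and the chunk dictionary schema (`id`, `text` keys) are illustrative, not a specific system's format.

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble a generation prompt that forces per-claim citations.
    Each chunk dict carries 'id' and 'text' keys (illustrative schema)."""
    evidence = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer the question using ONLY the evidence below.\n"
        "End every sentence with the bracketed ID of the chunk that "
        "supports it, e.g. [RPT-001#2]. If the evidence is insufficient, "
        "say so rather than guessing.\n\n"
        f"EVIDENCE:\n{evidence}\n\n"
        f"QUESTION: {query}\n"
    )
```

Prompt-level constraints alone are not a guarantee — models can still cite the wrong chunk — which is why the verification stage that follows checks the citations rather than trusting them.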

Stage 4: Provenance Verification

Defense RAG systems add a verification layer that commercial systems typically omit: provenance checking. Each generated claim is mapped back to its source chunk, and the analyst can inspect the exact passage that supports each statement.

DLRA's SynthBrief system implements this as sentence-level provenance — each generated sentence is linked to the specific chunk and sentence offsets that support it. In controlled evaluation with partner-agency analysts, this design reduced average intelligence brief generation time from 4.2 hours to 47 minutes.

Defense-Specific RAG Challenges

Defense RAG systems face four challenges that commercial implementations rarely encounter: narrow-domain vocabulary, classification-level handling, adversarial robustness, and the absolute requirement for auditable provenance.

Narrow-Domain Vocabulary

Defense intelligence text contains terminology that general-purpose language models and embedding models have limited exposure to. Terms carry context-dependent technical meanings. According to the Voyage AI 2024 study, domain-specific embeddings outperform general-purpose variants by 6 to 7 percentage points on domain benchmarks — and the gap is wider in narrower vocabularies.

Classification-Level Handling

Chunks from documents at different classification levels must be tagged and managed separately. A RAG system that retrieves a TOP SECRET chunk in response to a SECRET-level query has committed a classification violation. The vector database must enforce access controls at the chunk level, and the retrieval mechanism must filter by the analyst's clearance and the session's classification ceiling.
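The essential property is that the ceiling check happens before similarity ranking, so higher-classification material can never reach the generation stage. A minimal sketch, assuming a simple linear ordering of levels (real markings add compartments, caveats, and dissemination controls that a flat ordering cannot capture):

```python
# Illustrative classification ordering; real marking systems are richer.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

def filter_by_ceiling(chunks: list[dict], session_ceiling: str) -> list[dict]:
    """Drop any chunk whose classification exceeds the session ceiling
    BEFORE similarity ranking, so over-classified material is never a
    retrieval candidate. Each chunk dict carries a 'classification' key."""
    ceiling = LEVELS[session_ceiling]
    return [c for c in chunks if LEVELS[c["classification"]] <= ceiling]
```

In practice this filter runs inside the vector database as a metadata predicate on the search itself, not as post-filtering in application code, so the access control cannot be bypassed by a bug in the caller.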

According to Pacific Northwest National Laboratory's 2025 research, secure RAG implementations require that "sensitive information stays walled off in secure libraries" while the LLM functions as a processing engine with no persistent access to classified content.

Adversarial Robustness

A 2025 comprehensive survey on RAG robustness highlighted that RAG systems remain vulnerable to retrieval noise, hallucinations, and adversarial attacks. In defense contexts, adversarial injection into the document corpus is a non-theoretical threat. Retrieval filters and provenance verification serve as partial mitigations.

Auditable Provenance

Intelligence products require attribution chains — every analytic judgment must be traceable to source reporting. A RAG system that summarizes without attribution produces text that cannot be used in formal intelligence products. Provenance must be preserved from ingestion through generation.

"For defense use cases, RAG is the most reliable deployment methodology for generative AI services." — GDIT, How Adaptive RAG Makes Generative AI More Reliable for Defense Missions, 2025

Comparison: Commercial RAG vs. Defense RAG

Commercial and defense RAG systems share the same architectural pattern but diverge on seven implementation dimensions.

Dimension                | Commercial RAG                           | Defense RAG
Chunking                 | Fixed-token windows (e.g., 512 tokens)   | Schema-aware semantic chunks with metadata
Embeddings               | General-purpose (e.g., text-embedding-3) | Domain fine-tuned on defense corpora
Retrieval accuracy       | ~87% top-5 on domain benchmarks          | ~94% with domain fine-tuning
Classification handling  | Not applicable                           | Chunk-level classification tagging and access control
Provenance               | Optional                                 | Required: sentence-level attribution
Adversarial threat model | Spam, SEO poisoning                      | Deliberate corpus poisoning by adversaries
Output format            | Conversational answer                    | Structured brief with per-claim citations

Current Deployment Landscape

RAG-based systems are deployed across three tiers of the defense AI ecosystem — as features within frontier model platforms, as core architecture in defense-native software, and as purpose-built systems by domain-specific organizations.

The Pentagon's FY2026 budget includes $13.4 billion for AI and autonomy, according to CDO Magazine, with $1.2 billion allocated specifically for software and cross-domain integration — the budget category most relevant to RAG system deployment.

According to Deloitte's 2024 report, IC analysts could reclaim roughly 364 hours per analyst per year if document processing and evidence assembly steps could be compressed safely. RAG systems are the primary architectural approach for achieving this compression without sacrificing traceability.

Frequently Asked Questions

What is RAG and why is it used for defense intelligence? Retrieval-augmented generation combines a large language model with a document retrieval system. The model generates answers grounded in retrieved source documents rather than relying on its training data. For defense intelligence, this ensures that every generated claim can be traced to a specific source report — a requirement for formal intelligence products.

How does defense RAG differ from commercial RAG? Defense RAG requires domain-specific embedding fine-tuning (improving retrieval accuracy from approximately 87% to 94%), schema-aware chunking that preserves document structure and classification markings, chunk-level access controls for classification enforcement, and sentence-level provenance linking every generated claim to its source passage.

What retrieval accuracy do current defense RAG systems achieve? General-purpose embeddings achieve approximately 87% top-5 retrieval accuracy on defense-domain benchmarks. Domain-tuned embeddings achieve approximately 94%, consistent with improvements reported by Voyage AI (2024) and Cisco/NVIDIA (2024).

Can RAG systems handle classified documents? RAG architectures can be deployed on classified networks with appropriate infrastructure accreditation. The vector database and retrieval system operate within the classified environment, and the LLM processes retrieved chunks without persistent access to the classified corpus. Microsoft Azure OpenAI received IL6 authorization from DISA in 2025.

What are the main risks of using RAG for defense intelligence? The primary risks are retrieval errors (surfacing irrelevant evidence), hallucination (generating claims not supported by retrieved passages), adversarial corpus poisoning (injecting misleading documents), and classification spills (retrieving higher-classification material in a lower-classification session).

How long does it take to deploy a defense RAG system? Frontier model API integration can be achieved in days. Defense-native platform deployment takes months due to integration and accreditation requirements. Domain-specific RAG systems require additional months for embedding fine-tuning, evaluation set construction, and operational testing — but achieve higher retrieval accuracy on the target domain.