What is OpenSearch?
OpenSearch is a highly scalable, open-source search and analytics suite used for full-text search, log analytics, observability, security analytics, vector search, and other data-intensive applications. It is built on Apache Lucene, the high-performance Java search library that powers many modern search platforms. OpenSearch extends Lucene with a distributed cluster architecture, REST APIs, near real-time indexing, high availability, security features, analytics capabilities, dashboards, and a growing ecosystem of plugins and integrations.
OpenSearch originated as a fork of Elasticsearch 7.10.2 and Kibana 7.10.2, but it has since evolved into its own Apache 2.0 licensed, community-driven project. Today, OpenSearch is not just a classic search engine. It is also a platform for observability, vector search, semantic search, AI-assisted search applications, and operational analytics.
As of April 7, 2026, the latest version is OpenSearch 3.6. This release adds agent-powered search and observability features, OpenSearch Launchpad, automated relevance tuning, improvements to vector search performance and storage efficiency, APM capabilities, and enhanced PPL query support.
This article explains the architecture of OpenSearch, including its Lucene foundation, cluster design, node types, data organization, indexing flow, search flow, storage structures, and aggregation framework.
This is part of a series of articles about OpenSearch.
OpenSearch and Apache Lucene
At the core of OpenSearch is Apache Lucene, an open-source search library written in Java. Lucene is responsible for low-level indexing and search operations, including inverted indexes, tokenization, scoring, and segment-based storage.
Lucene is extremely powerful, but it is a library rather than a complete distributed search platform. OpenSearch builds on top of Lucene by adding:
- Distributed clustering
- Horizontal scaling
- Replication and high availability
- REST APIs
- Index and shard management
- Security and access control
- Query DSL, SQL, and PPL interfaces
- OpenSearch Dashboards
- Observability, alerting, anomaly detection, machine learning, and vector search features
The most important Lucene concept behind OpenSearch is the inverted index. Instead of scanning every document for a search term, Lucene stores a mapping from terms to the documents that contain them. This makes text search extremely fast because OpenSearch can look up a term and immediately identify matching documents. Lucene also supports multiple search patterns, including full-text queries, phrase queries, wildcard queries, fuzzy matching, proximity searches, and relevance scoring. OpenSearch exposes these capabilities through its Query DSL and higher-level APIs.
What OpenSearch Adds on Top of Lucene
Lucene handles search at the library level. OpenSearch turns Lucene into a distributed search and analytics engine.
Distributed Architecture
OpenSearch stores and processes data across a cluster of nodes. A cluster can contain one node or many nodes, depending on the size and availability requirements of the workload. Data is divided into shards and distributed across the cluster so that indexing and search operations can run in parallel.
Scalability
OpenSearch scales horizontally by adding more nodes to a cluster. This is important because search and analytics workloads often grow beyond the limits of a single server. Instead of relying only on vertical scaling, OpenSearch distributes data and query processing across multiple machines.
High Availability
OpenSearch supports replica shards, zone-aware allocation, snapshots, and recovery mechanisms to reduce the risk of downtime or data loss. If a node fails, replica shards can continue serving queries while the cluster reallocates data. OpenSearch also supports segment replication, which copies Lucene segment files across shard copies instead of indexing the same documents independently on every shard copy. This can improve indexing throughput and reduce resource usage, though it may increase network utilization.
Database-Like Document Operations
OpenSearch is not a relational database, but it provides document-store capabilities such as:
- Unique document IDs
- Document indexing
- Document updates
- Document deletes
- Versioning
- Optimistic concurrency control
- Reindexing
Documents are stored as JSON and are grouped into indexes.
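To make optimistic concurrency control concrete, here is a minimal in-memory sketch in Python. It models only the core idea: a write that is based on a stale sequence number is rejected. The `TinyIndex` class, the `VersionConflict` exception, and the method names are hypothetical, and the sketch ignores primary terms. In OpenSearch, the corresponding request parameters are `if_seq_no` and `if_primary_term`.

```python
class VersionConflict(Exception):
    """Raised when a conditional write is based on a stale copy."""


class TinyIndex:
    """Toy document store that assigns a sequence number to every write."""

    def __init__(self):
        self._docs = {}          # doc_id -> (seq_no, source)
        self._next_seq_no = 0

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, if_seq_no=None):
        current = self._docs.get(doc_id)
        # Conditional write: succeed only if the caller saw the latest change.
        if if_seq_no is not None:
            current_seq_no = current[0] if current else None
            if current_seq_no != if_seq_no:
                raise VersionConflict(
                    f"expected seq_no {if_seq_no}, found {current_seq_no}"
                )
        seq_no = self._next_seq_no
        self._next_seq_no += 1
        self._docs[doc_id] = (seq_no, dict(source))
        return seq_no


store = TinyIndex()
first = store.index("C-10045", {"loyalty_status": "silver"})

# Two clients read the same version, then both try to update it.
seen_by_a = seen_by_b = store.get("C-10045")[0]
store.index("C-10045", {"loyalty_status": "gold"}, if_seq_no=seen_by_a)

try:
    store.index("C-10045", {"loyalty_status": "bronze"}, if_seq_no=seen_by_b)
    conflict = False
except VersionConflict:
    conflict = True  # client B has to re-read the document and retry
```

The second conditional write fails because the document changed after client B read it, so B cannot silently overwrite A's update.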
Search and Analytics
OpenSearch supports both search and analytics use cases. It can run full-text searches, filters, aggregations, metric calculations, faceted navigation, time-series analysis, and pipeline aggregations.
APIs and Query Languages
OpenSearch provides REST APIs for indexing, search, cluster management, security, and plugin functionality. It also supports several query interfaces, including:
- Query DSL for structured search
- SQL for SQL-like access
- PPL, or Piped Processing Language, for log and observability-style queries
OpenSearch 3.6 continues to expand PPL with new and enhanced query capabilities.
Security and Access Control
OpenSearch includes security features such as TLS encryption, authentication, authorization, role-based access control, index-level permissions, document-level security, field-level security, audit logging, and integration with identity providers.
OpenSearch Cluster Architecture
An OpenSearch cluster is a collection of one or more nodes that work together to store, index, search, and analyze data. Each node runs OpenSearch and can perform one or more roles.
A cluster is horizontally scalable. As data volume or query traffic increases, more nodes can be added to increase capacity. In production, it is common to separate node responsibilities so that critical cluster management, indexing, search, ingest, and machine learning workloads do not compete for the same resources.
Cluster Manager Nodes
Older Elasticsearch-based terminology often used “master node.” In OpenSearch, the preferred term is cluster manager node.
A cluster manager node is responsible for managing cluster-level operations, including:
- Tracking cluster state
- Creating and deleting indexes
- Allocating shards
- Managing node membership
- Coordinating cluster metadata updates
- Monitoring cluster health
Only one cluster manager node is active at a time, but production clusters should have multiple cluster-manager-eligible nodes so that a new cluster manager can be elected if the current one fails. A common best practice is to use an odd number of cluster-manager-eligible nodes, often three, to support quorum-based elections and reduce split-brain risk.
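The arithmetic behind the odd-number recommendation can be sketched in a few lines of Python. This is only the majority calculation, not the election protocol itself, and the function names are illustrative.

```python
def quorum(eligible_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return eligible_nodes // 2 + 1


def tolerated_failures(eligible_nodes: int) -> int:
    """How many eligible nodes can be lost while a majority remains."""
    return eligible_nodes - quorum(eligible_nodes)


# Print: eligible nodes, votes needed, failures tolerated.
for n in range(1, 6):
    print(n, quorum(n), tolerated_failures(n))
```

Three eligible nodes tolerate one failure, and so do four, which is why adding a fourth cluster-manager-eligible node adds cost without adding fault tolerance.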
Data Nodes
Data nodes store index data and handle most indexing, search, and aggregation work. These nodes are usually the most resource-intensive part of an OpenSearch cluster because they perform disk I/O, CPU-heavy search operations, memory-intensive aggregations, and segment merges.
Data nodes should be sized based on:
- Data volume
- Query rate
- Indexing rate
- Shard count
- Aggregation complexity
- Storage type
- Availability requirements
OpenSearch workloads usually benefit from horizontal scaling rather than simply increasing the size of a single node.
Coordinating Nodes
Any OpenSearch node can act as a coordinating node. A coordinating node receives client requests, routes them to the relevant shards, gathers responses, reduces the results, and returns the final response to the client.
In larger clusters, dedicated coordinating nodes can help isolate client traffic from data nodes. This is especially useful for search-heavy workloads where the reduce phase can become expensive.
During a distributed search, the coordinating node:
- Receives the query.
- Sends the query to relevant primary or replica shards.
- Collects the top results from each shard.
- Sorts and merges those results.
- Fetches the final documents.
- Returns the response to the client.
Ingest Nodes
Ingest nodes preprocess documents before they are indexed. They run ingest pipelines that can transform, enrich, parse, or normalize incoming data.
Common ingest processors include:
- Grok parsing
- Date parsing
- Field renaming
- GeoIP enrichment
- Script processors
- Attachment processing
- Set, remove, convert, and lowercase processors
Because some ingest processors are CPU-intensive, production clusters often use dedicated ingest nodes to avoid slowing down search and aggregation workloads.
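Conceptually, an ingest pipeline is an ordered list of small transformations applied to each document before it is indexed. The Python sketch below imitates three of the processor types listed above (rename, lowercase, convert) on a plain dictionary. The helper names and the sample log document are assumptions for illustration, not the OpenSearch ingest API.

```python
from functools import reduce


def rename(old, new):
    """Build a processor that moves a value from one field name to another."""
    def processor(doc):
        doc = dict(doc)
        if old in doc:
            doc[new] = doc.pop(old)
        return doc
    return processor


def lowercase(field):
    """Build a processor that lowercases a string field."""
    def processor(doc):
        doc = dict(doc)
        if isinstance(doc.get(field), str):
            doc[field] = doc[field].lower()
        return doc
    return processor


def convert(field, to_type):
    """Build a processor that converts a field to another type."""
    def processor(doc):
        doc = dict(doc)
        if field in doc:
            doc[field] = to_type(doc[field])
        return doc
    return processor


def run_pipeline(processors, doc):
    """Apply each processor in order and return the transformed copy."""
    return reduce(lambda current, proc: proc(current), processors, doc)


pipeline = [
    rename("lvl", "level"),
    lowercase("level"),
    convert("status", int),
]

raw = {"lvl": "ERROR", "status": "503", "message": "upstream timeout"}
processed = run_pipeline(pipeline, raw)
```

Order matters here: the lowercase step only finds the `level` field because the rename step ran first.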
Specialized Nodes
Depending on installed plugins and workload requirements, OpenSearch clusters can also include specialized node roles, such as:
- Machine learning nodes
- Remote cluster client nodes
- Search-only or warm-tier nodes
- Dedicated coordinating nodes
- Dedicated ingest nodes
This separation helps improve cluster stability and makes it easier to scale individual parts of the architecture.
Data Organization in OpenSearch
OpenSearch organizes data into indexes, documents, fields, shards, and segments.
Index
An index is a logical collection of documents. It is similar to a table in a relational database, but it stores JSON documents rather than rows.
For example, an ecommerce site might use separate indexes for:
- products
- orders
- customers
- inventory
- logs
- clickstream-events
Each index has settings and mappings that control how documents are stored, analyzed, and searched.
An OpenSearch index is not a single Lucene index. Instead, it is divided into one or more shards, and each shard is a Lucene index.
Document
A document is a JSON object stored in an index. It contains fields and values.
Example:
```json
{
  "customer_id": "C-10045",
  "name": "Dana Smith",
  "city": "New York",
  "signup_date": "2026-02-14",
  "loyalty_status": "gold"
}
```
OpenSearch adds metadata fields to documents, such as:
- _id
- _index
- _source
- _version
- _seq_no
- _primary_term
The _source field stores the original JSON document. This is important for retrieving documents, updating documents, highlighting, and reindexing.
Fields
Fields are the key-value pairs inside a document. Each field has a data type, and the type determines how OpenSearch indexes, stores, and searches the field.
Common field types include:
- text
- keyword
- Numeric types
- Date types
- Boolean
- Object
- Nested
- Geo point
- Geo shape
- Vector fields
Mappings define how fields are interpreted. OpenSearch can infer mappings dynamically, but production systems usually benefit from explicit mappings to avoid unexpected field types.
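As a sketch of what an explicit mapping can look like, the Python snippet below builds a create-index request body for the customer document shown earlier and prints it as JSON. The field type choices, the shard and replica counts, and the use of `"dynamic": "strict"` to reject unmapped fields are assumptions for this example, not recommendations from the article.

```python
import json

# Request body for creating an index with explicit field types.
create_index_body = {
    "settings": {
        "index": {"number_of_shards": 3, "number_of_replicas": 1}
    },
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "customer_id": {"type": "keyword"},
            "name": {"type": "text"},
            "city": {"type": "keyword"},
            "signup_date": {"type": "date"},
            "loyalty_status": {"type": "keyword"},
        },
    },
}

print(json.dumps(create_index_body, indent=2))
```

Declaring `customer_id` and `loyalty_status` as keyword fields keeps them available for exact matching and aggregations, while `name` remains analyzed for full-text search.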
Understanding Shards and Replicas
Primary Shards
A shard is a smaller unit of an index. Each primary shard is a Lucene index that stores part of the index’s data.
Sharding allows OpenSearch to distribute data across multiple nodes. This improves scalability because indexing and search operations can run in parallel.
When creating an index, you configure the number of primary shards. Choosing the right number of shards is important. Too few shards may limit scalability, while too many shards can increase cluster overhead.
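The reason the primary shard count matters so much is that each document is assigned to a shard by hashing a routing value (by default the document ID) and taking the result modulo the number of primary shards. The sketch below illustrates the idea in Python. It uses `zlib.crc32` purely as a stand-in hash, since OpenSearch uses its own Murmur3-based routing hash, and the `shard_for` function and document IDs are hypothetical.

```python
import zlib


def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Pick a shard by hashing the routing value (stand-in hash, see above)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


doc_ids = [f"order-{i}" for i in range(100)]

# The same ID always lands on the same shard for a fixed shard count.
placement_3 = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}

# With a different shard count, many IDs would map to a different shard.
placement_4 = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
moved = sum(1 for doc_id in doc_ids if placement_3[doc_id] != placement_4[doc_id])
print(f"{moved} of {len(doc_ids)} documents would change shards")
```

Because the shard count is part of the formula, it cannot simply be changed on a live index. Changing it means rewriting the data, for example by reindexing.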
Replica Shards
Replica shards are copies of primary shards. They provide high availability and can also improve search throughput because queries can be served by either primary or replica shards.
Replica shards should be allocated to different nodes from their primary shards. This way, if a node fails, OpenSearch can still serve the data from a replica.
Replicas increase storage requirements and can add overhead to indexing, but they are essential for production availability.
Lucene Segments
Each OpenSearch shard is made up of Lucene segments. A segment is an immutable file-system-level structure that stores indexed data.
When documents are indexed, Lucene writes them into new segments. Once a segment is created, it is not modified. Updates and deletes are handled by marking old documents as deleted and writing new versions of documents into new segments.
Over time, many small segments are merged into larger segments. Segment merging helps:
- Reclaim space from deleted documents
- Improve search performance
- Reduce the number of segments that must be searched
- Optimize disk layout
Segment merging is important but can be resource-intensive, especially during heavy indexing.
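The following toy model in Python shows how immutable segments, deletion markers, and merges fit together. It is a deliberately simplified sketch: the `Segment` and `Shard` classes are hypothetical, documents are plain dictionaries, and real Lucene merge policies are far more selective about which segments they combine.

```python
class Segment:
    """A batch of documents that is never rewritten after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> source
        self.deleted = set()     # deletion markers, tracked beside the data

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}


class Shard:
    def __init__(self):
        self.segments = []

    def delete(self, doc_id):
        # A delete only marks the old copy; the segment data stays in place.
        for segment in self.segments:
            if doc_id in segment.docs:
                segment.deleted.add(doc_id)

    def write_segment(self, docs):
        # An update is a delete of the old copy plus a new copy in a new segment.
        for doc_id in docs:
            self.delete(doc_id)
        self.segments.append(Segment(docs))

    def merge(self):
        # Copy only live documents into one new segment, dropping deleted ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment.live_docs())
        self.segments = [Segment(merged)]

    def stored_doc_count(self):
        return sum(len(s.docs) for s in self.segments)


shard = Shard()
shard.write_segment({"1": {"title": "a"}, "2": {"title": "b"}})
shard.write_segment({"2": {"title": "b v2"}, "3": {"title": "c"}})
shard.delete("1")

before = (len(shard.segments), shard.stored_doc_count())
shard.merge()
after = (len(shard.segments), shard.stored_doc_count())
```

Before the merge the shard holds four stored document copies across two segments, two of which are marked deleted. After the merge a single segment holds only the two live documents, which is where the reclaimed space comes from.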
Translog, Refresh, and Flush
OpenSearch is near real time, not strictly real time. This means newly indexed documents become searchable shortly after indexing, usually after a refresh.
Translog
Before changes are fully committed to Lucene segments, OpenSearch records operations in a transaction log called the translog. The translog helps protect against data loss if a node fails before changes are committed.
Refresh
A refresh makes newly indexed data searchable by opening a new searcher over recent changes. By default, OpenSearch refreshes indexes periodically, commonly every second for actively searched indexes.
A refresh makes data searchable, but it does not necessarily mean the data has been fully committed for durable persistence.
Flush
A flush commits Lucene segments and clears older translog entries. OpenSearch documentation explains that during a flush, the shard fsyncs Lucene segments so they become durably persisted, and the translog entries are no longer needed for durability.
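The interplay between the translog, refresh, and flush can be illustrated with a small Python model. This is a conceptual sketch under simplifying assumptions: the `ToyShard` class is hypothetical, lists stand in for segments and files, and a crash is simulated by rebuilding state from the last commit plus the translog.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []       # indexed, but not yet visible to searches
        self.searchable = []   # visible to searches after a refresh
        self.committed = []    # durably persisted by the last flush
        self.translog = []     # operations that can be replayed after a crash

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Makes recent documents searchable; says nothing about durability.
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def flush(self):
        # Persists everything indexed so far, so the translog can be trimmed.
        self.committed = self.searchable + self.buffer
        self.translog.clear()

    def search(self):
        return list(self.searchable)

    def recover_after_crash(self):
        # In-memory state is lost: start from the commit, replay the translog.
        recovered = ToyShard()
        recovered.searchable = list(self.committed)
        recovered.committed = list(self.committed)
        for operation in self.translog:
            recovered.index(operation)
        recovered.refresh()
        return recovered


shard = ToyShard()
shard.index({"id": 1})
visible_before_refresh = shard.search()   # nothing is searchable yet
shard.refresh()
visible_after_refresh = shard.search()
shard.flush()

shard.index({"id": 2})                    # not flushed before the "crash"
recovered = shard.recover_after_crash()
recovered_ids = [doc["id"] for doc in recovered.search()]
```

Document 2 survives the simulated crash even though it was never flushed, because it can be replayed from the translog.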
Document Indexing Flow
When a document is indexed into OpenSearch, several steps happen:
- The client sends a JSON document to OpenSearch.
- If an ingest pipeline is configured, it processes the document first.
- The coordinating node routes the request to the correct primary shard.
- Field mappings determine how each field is handled.
- Text fields are analyzed into tokens.
- Terms are written into Lucene data structures.
- The operation is recorded in the translog.
- The operation is replicated to replica shards.
- After refresh, the document becomes searchable.
OpenSearch stores both the searchable indexed representation and, by default, the original document in _source.
Text Analysis Process
Analysis is the process of converting raw text into searchable terms. It is used primarily for text fields.
An analyzer is made of three parts:
- Character filters
- Tokenizer
- Token filters
Character Filters
Character filters modify text before tokenization. For example, an HTML stripping filter can remove HTML tags and decode HTML entities.
Tokenizers
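A tokenizer breaks text into tokens. For example, a tokenizer that splits on word boundaries produces three tokens from the input below:
Input:
```
OpenSearch architecture guide
```
Tokens:
```
OpenSearch
architecture
guide
```
The tokenizer itself preserves case. Lowercasing, which turns OpenSearch into opensearch, is the job of a token filter.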
Token Filters
Token filters modify, remove, or add tokens. They can lowercase terms, remove stop words, apply stemming, fold accented characters, or add synonyms.
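The three analyzer stages can be imitated in a few lines of Python. The sketch below is not how OpenSearch implements analysis. It is a simplified stand-in with a regex-based tokenizer and a tiny, made-up stop word list, meant only to show the order in which the stages run.

```python
import html
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative list only


def strip_html(text):
    """Character filter: drop tags, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Tokenizer: split into runs of letters and digits, preserving case."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    # Stage 1: character filters. Stage 2: tokenizer. Stage 3: token filters.
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<p>The OpenSearch &amp; Lucene architecture guide</p>")
print(tokens)  # ['opensearch', 'lucene', 'architecture', 'guide']
```

Filter order matters: the stop word filter only removes "The" because the lowercase filter has already turned it into "the".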
Normalizers
Normalizers are similar to analyzers, but they produce a single token and are used for keyword fields. They are useful for case-insensitive sorting, filtering, and aggregations.
Common OpenSearch Field Types
Text
The text type is used for full-text search. It is appropriate for fields such as product descriptions, article bodies, log messages, comments, and support tickets.
Text fields are analyzed, which makes them suitable for relevance-based search, phrase queries, fuzzy search, and match queries.
Keyword
The keyword type is used for structured values such as IDs, tags, categories, email addresses, status codes, and exact-match fields.
Keyword fields are commonly used for:
- Filtering
- Sorting
- Aggregations
- Exact matching
Numeric Types
OpenSearch supports numeric field types such as integer, long, float, double, scaled float, and others. Numeric fields are used for prices, counts, measurements, durations, and metrics.
Choose the smallest numeric type that safely fits the expected value range to reduce storage overhead.
Date
Date fields are used for timestamps and time-based filtering. They are essential for log analytics, observability, monitoring, and time-series use cases.
Geo Point
The geo_point type stores latitude and longitude coordinates. It supports distance queries, bounding box queries, and location-based sorting.
Vector Fields
Vector fields support similarity search for machine learning and semantic search use cases. OpenSearch has continued to improve vector search in recent releases. OpenSearch 3.6 adds 1-bit scalar quantization across Faiss and Lucene engines, storage optimizations, prefetching, and latency improvements for vector search workloads.
What Is the Inverted Index?
The inverted index is the core data structure that makes OpenSearch fast for text search.
Instead of storing a document-to-terms mapping, an inverted index stores a term-to-documents mapping.
For example:
| Term | Matching documents |
|---|---|
| opensearch | 1, 3, 7 |
| architecture | 1, 2 |
| analytics | 2, 4, 9 |
When a user searches for “OpenSearch architecture,” OpenSearch can quickly look up both terms and find documents that contain them.
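A minimal inverted index fits in a few lines of Python. The sketch below splits on whitespace and lowercases terms instead of running a real analyzer, and it requires every query term to match, which is a simplification: an OpenSearch match query uses OR semantics by default and ranks results by score. The sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "OpenSearch architecture overview",
    2: "analytics architecture patterns",
    3: "OpenSearch vector search",
}

index = build_inverted_index(docs)
hits = search_all_terms(index, "OpenSearch architecture")
print(hits)  # [1]
```

The lookup never scans document text. It fetches one posting set per term and intersects them, which is why the cost depends on the number of matching documents rather than on the total size of the index.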
What Are the Term Dictionary and Posting Lists?
The term dictionary contains sorted terms found in an index. Each term points to a posting list.
A posting list may contain:
- Document IDs
- Term frequency
- Positions
- Offsets
- Scoring metadata
This information supports relevance scoring, phrase matching, proximity queries, and highlighting.
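For high-cardinality fields, term lookup must be efficient. Rather than loading every term into memory, Lucene keeps a compact prefix index, implemented as a finite-state transducer, that points into blocks of the on-disk term dictionary.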
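Positions stored in posting lists are what make phrase matching possible. The Python sketch below builds a positional index and checks whether the terms of a phrase occur at consecutive positions. It is an illustration only: the data layout, function names, and sample documents are assumptions, and Lucene's on-disk encoding is far more compact.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_match(index, phrase):
    """Return IDs of documents where the terms appear next to each other."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The n-th term of the phrase must sit n positions after the first.
            if all(start + n in index[term][doc_id] for n, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "opensearch architecture guide",
    2: "architecture of opensearch",
    3: "guide to opensearch architecture",
}

index = build_positional_index(docs)
hits = phrase_match(index, "opensearch architecture")
print(hits)  # [1, 3]
```

Document 2 contains both terms but not in adjacent positions, so it matches a plain term query but not the phrase. The length of each position list is also the term frequency used in scoring.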
Search Flow in OpenSearch
OpenSearch uses a distributed search process commonly described as the query phase and fetch phase.
Query Phase
During the query phase:
- The coordinating node receives the search request.
- It identifies the relevant shards.
- It sends the query to one copy of each relevant shard, either the primary or a replica.
- Each shard runs the query locally.
- Each shard returns top matching document IDs, scores, and metadata.
The actual full documents are usually not returned during this phase.
Fetch Phase
During the fetch phase:
- The coordinating node merges and sorts shard-level results.
- It determines the final top results.
- It requests the full documents from the relevant shards.
- It returns the final search response to the client.
This two-phase process allows OpenSearch to search many shards in parallel while minimizing unnecessary document retrieval.
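The query and fetch phases can be simulated in Python with a handful of in-memory shards. In this sketch the per-document scores are precomputed stand-ins for running the query on each shard, and the shard contents and function names are hypothetical.

```python
import heapq

# shard_id -> {doc_id: (score, source)}; scores stand in for local query execution.
SHARDS = {
    0: {"a": (2.4, {"title": "A"}), "b": (0.7, {"title": "B"}), "c": (1.1, {"title": "C"})},
    1: {"d": (3.1, {"title": "D"}), "e": (0.2, {"title": "E"})},
    2: {"f": (1.9, {"title": "F"}), "g": (2.8, {"title": "G"})},
}


def query_phase(shard, size):
    """Each shard returns only (score, doc_id) pairs for its local top hits."""
    scored = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(size, scored)


def search(shards, size):
    # Query phase: gather lightweight candidates from every shard.
    candidates = []
    for shard_id, shard in shards.items():
        for score, doc_id in query_phase(shard, size):
            candidates.append((score, shard_id, doc_id))

    # Reduce: merge shard-level results into a global top list.
    top = heapq.nlargest(size, candidates)

    # Fetch phase: load full documents only for the final winners.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_id][doc_id][1]}
        for score, shard_id, doc_id in top
    ]


results = search(SHARDS, size=3)
print([hit["_id"] for hit in results])  # ['d', 'g', 'a']
```

Each shard has to return its own top `size` hits because any of them could end up in the global top list, but only three full documents are fetched out of the seven stored.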
Document Scoring
OpenSearch uses similarity algorithms to score matching documents. The default similarity is BM25, a ranking function based on TF-IDF concepts with improvements for term frequency saturation and field-length normalization.
BM25 considers:
Term Frequency
How often the search term appears in a field. A term that appears more often may indicate higher relevance.
Inverse Document Frequency
How rare the term is across the index. Rare terms usually carry more relevance than common terms.
Field Length Normalization
Shorter fields can be considered more relevant when they contain the same term frequency as longer fields.
For example, if the term “OpenSearch” appears twice in a short title, that may be more meaningful than appearing twice in a long article body.
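The three factors can be seen working together in a small Python function. This is the textbook form of BM25 for a single term, using the commonly cited defaults k1 = 1.2 and b = 0.75. It is a sketch for intuition: Lucene's implementation differs in constant factors and in how it encodes field lengths, so absolute scores will not match what OpenSearch returns.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one field with textbook BM25."""
    # Inverse document frequency: rare terms get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: long fields dampen the term frequency.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency with saturation: each extra occurrence adds less.
    return idf * freq * (k1 + 1) / (freq + norm)


# Same term frequency in a short title and in a long body.
short_field = bm25_term_score(2, doc_len=5, avg_doc_len=100,
                              doc_count=1000, docs_with_term=10)
long_field = bm25_term_score(2, doc_len=500, avg_doc_len=100,
                             doc_count=1000, docs_with_term=10)

# Saturation: going from 1 to 2 occurrences helps more than going from 10 to 11.
gain_low = bm25_term_score(2, 100, 100, 1000, 10) - bm25_term_score(1, 100, 100, 1000, 10)
gain_high = bm25_term_score(11, 100, 100, 1000, 10) - bm25_term_score(10, 100, 100, 1000, 10)
```

With these inputs the short field scores higher than the long one, and the gain from an extra occurrence shrinks as the term frequency grows.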
Aggregations
OpenSearch is also an analytics engine. Aggregations let you summarize, group, and analyze documents.
Common aggregation use cases include:
- Faceted search
- Sales summaries
- Log analytics
- Time-series dashboards
- Error rate calculations
- Security event analysis
- Metrics reporting
Metric Aggregations
Metric aggregations calculate values from documents.
Examples include:
- sum
- avg
- min
- max
- stats
- extended_stats
- cardinality
- percentiles
Example: Total order value
```json
GET book_store_orders/_search
{
  "size": 0,
  "aggs": {
    "total_orders": {
      "sum": {
        "field": "order_price"
      }
    }
  }
}
```
Bucket Aggregations
Bucket aggregations group documents into buckets.
Examples include:
- terms
- date_histogram
- range
- histogram
- filters
- composite
Example: Sales stats by category
```json
GET book_store_orders/_search
{
  "size": 0,
  "aggs": {
    "sales_by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      },
      "aggs": {
        "sales_stats": {
          "stats": {
            "field": "order_price"
          }
        }
      }
    }
  }
}
```
Pipeline Aggregations
Pipeline aggregations operate on the output of other aggregations. They are useful for calculating trends, derivatives, moving averages, and summary metrics across buckets.
Example: Maximum order price across categories
```json
GET book_store_orders/_search
{
  "size": 0,
  "aggs": {
    "sales_by_category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "max_order_price": {
          "max": {
            "field": "order_price"
          }
        }
      }
    },
    "max_order_price_across_categories": {
      "max_bucket": {
        "buckets_path": "sales_by_category>max_order_price"
      }
    }
  }
}
```
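To see what this request computes, the same three steps can be reproduced locally in Python: group by category (bucket), take the maximum per group (metric), then pick the largest of those maxima (pipeline). The sample orders are invented for illustration, and the sketch ignores distribution across shards.

```python
from collections import defaultdict

orders = [
    {"category": "fiction", "order_price": 12.0},
    {"category": "fiction", "order_price": 30.0},
    {"category": "science", "order_price": 45.5},
    {"category": "science", "order_price": 20.0},
    {"category": "history", "order_price": 18.0},
]

# Bucket aggregation (terms): group order prices by category.
buckets = defaultdict(list)
for order in orders:
    buckets[order["category"]].append(order["order_price"])

# Metric aggregation (max): one value per bucket.
max_order_price = {category: max(prices) for category, prices in buckets.items()}

# Pipeline aggregation (max_bucket): works on the bucket results, not on documents.
best_category = max(max_order_price, key=max_order_price.get)
max_order_price_across_categories = {
    "value": max_order_price[best_category],
    "keys": [best_category],
}
print(max_order_price_across_categories)  # {'value': 45.5, 'keys': ['science']}
```

The pipeline step never touches the orders themselves. It reads only the per-bucket maxima, which is what the `buckets_path` in the request expresses.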
OpenSearch Dashboards
OpenSearch Dashboards is the visualization and user interface layer for OpenSearch. It allows users to explore data, build dashboards, create visualizations, manage indexes, run observability workflows, configure alerts, and interact with plugins.
OpenSearch 3.6 expands the Dashboards experience with agent-powered relevance tooling, APM features, agent traces for generative AI applications, and observability enhancements.
Modern OpenSearch: Vector Search, AI, and Observability
OpenSearch architecture has expanded beyond traditional keyword search. Modern OpenSearch deployments often include semantic search, hybrid search, vector databases, AI-powered relevance tuning, and observability pipelines.
OpenSearch 3.6 highlights this direction with:
- OpenSearch Launchpad for guided search application development
- Agent-powered relevance tuning
- Agentic search improvements
- 1-bit scalar quantization for vector search
- Faiss quantization optimizations
- Vector metadata compression
- Prefetching for ANN and exact vector search
- Application Performance Monitoring
- Agent Traces for generative AI workloads
- Enhanced PPL commands
These features show how OpenSearch is evolving from a search and log analytics engine into a broader platform for search modernization, observability, and AI-driven applications.
Conclusion
OpenSearch is a distributed search and analytics platform built on Apache Lucene. Its architecture combines Lucene’s efficient indexing and search capabilities with distributed clustering, shard-based scaling, replication, REST APIs, security, dashboards, aggregations, observability features, and vector search.
The key architectural building blocks are:
- Clusters and nodes
- Cluster manager nodes
- Data nodes
- Coordinating nodes
- Ingest nodes
- Indexes
- Documents
- Fields
- Primary and replica shards
- Lucene segments
- Translog, refresh, and flush operations
- Inverted indexes
- Query and fetch phases
- Aggregations
With OpenSearch 3.6, the platform continues to move beyond traditional search into AI-assisted search development, semantic relevance tuning, high-performance vector search, and full-stack observability. For teams building scalable search, analytics, monitoring, or AI-powered retrieval systems, understanding OpenSearch architecture is the foundation for designing reliable and performant deployments.