Technical: Elasticsearch, Monday 21st December 2020

What Is Elasticsearch?

By Thomas Griffiths

Elasticsearch is an open source scalable search and analytics engine. It takes the open source search engine Apache Lucene and enables it to scale across many machines in a cluster to handle large volumes of data at high speed. Much like the index at the back of a book, the core data structure of Lucene is an inverted index. An inverted index keeps track of files or documents by the terms contained within them. Using an inverted index is as simple as looking up a given term to acquire a list of the relevant documents or items that contain that term.
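To make the lookup concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the data structure, not how Lucene implements it: the sample documents, the whitespace tokenizer, and the function name build_inverted_index are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document id.
documents = {
    1: "the beagle is a small hound",
    2: "darwin sailed on the beagle",
    3: "hounds track by scent",
}

index = build_inverted_index(documents)

# Looking up a term returns the ids of the documents that contain it.
print(sorted(index["beagle"]))
```

A real search engine would also normalize terms (for example, treating "hound" and "hounds" as the same term) and store positions and frequencies alongside each document id.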

Indexing provides a different data access pattern to other database and storage technologies that might also be used to store text or analytics information. Elasticsearch, while not a database, stores data as well as the indices needed to quickly look it up. Beyond classical search, this access pattern has proven useful for quickly searching through other types of data such as web server, security, and application logs.

Apache Lucene

Apache Lucene is a popular open source search engine written in Java. It was first released in 1999 and is distributed under the open source Apache License. It has over two decades of development work from a wide community. It implements modern, performant inverted indexes, ranking functions, and the other features needed for powerful search.

Lucene doesn’t have an inherent scalability story to take the technology beyond a single machine. Scaling Lucene, as well as providing various interfaces and improved ergonomics, is what Elasticsearch was built to do. An example of this is Wikipedia, which was using its own internal Lucene-based search before running into problems and deciding to move to Elasticsearch.

Elasticsearch

Elasticsearch, released in 2010, enables Apache Lucene to scale by dividing the search problem up into pieces, called shards, and providing a means to distribute these across multiple nodes participating in a scalable cluster.
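As a rough illustration of dividing data into shards, the sketch below assigns each document to a shard by hashing its id. This is a simplified stand-in: Elasticsearch uses its own hash function and routing rules, and the function name pick_shard, the use of CRC32, and the sample ids are assumptions made for this example.

```python
import zlib


def pick_shard(doc_id, num_shards):
    """Choose a shard number for a document id.

    A stable hash means the same id always lands on the same shard.
    CRC32 is used here only as an illustrative stand-in hash.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


NUM_SHARDS = 3  # assumption: an index split into three shards
doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]

# Group the document ids by the shard they are routed to.
shards = {shard: [] for shard in range(NUM_SHARDS)}
for doc_id in doc_ids:
    shards[pick_shard(doc_id, NUM_SHARDS)].append(doc_id)

for shard, ids in shards.items():
    print(shard, ids)
```

Because the shard is computed from the id and the shard count, a lookup by id only needs to visit one shard, while a search fans out to all of them.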

Any index on the cluster is split into shards, each of which is itself an index responsible for a limited portion of the data. There are also different shard types. Primary shards contain the canonical data and are responsible for indexing new documents; replica shards shadow the data in primary shards, are able to service read requests, and can take over as the primary in the case of a failure.

A central allocator ensures a balance of these shards across the actual nodes within the cluster. When new nodes are added, shards can be migrated to them automatically. Replica shards are placed on a different node from the primary they are following, which ensures robustness and helps balance the load.
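The placement rule can be sketched with a simple round-robin allocator. This is a toy model under stated assumptions, not Elasticsearch's actual allocation algorithm: the function name allocate and the node names are hypothetical, and the main copy of each shard is labelled "primary", the term Elasticsearch's own documentation uses.

```python
def allocate(num_shards, num_replicas, nodes):
    """Return {node: [(shard, role), ...]} with every copy of a shard on a distinct node."""
    if num_replicas + 1 > len(nodes):
        # This sketch simply refuses; a real cluster would leave copies unassigned.
        raise ValueError("need at least one node per shard copy")
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        for copy in range(num_replicas + 1):
            # Offsetting by the copy number keeps copies of one shard apart.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement


placement = allocate(num_shards=3, num_replicas=1, nodes=["node-a", "node-b", "node-c"])
for node, copies in placement.items():
    print(node, copies)
```

With three shards, one replica each, and three nodes, every node ends up holding two shard copies, and losing any single node still leaves one copy of every shard available.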

Replicas are not merely redundant copies as they can also service read requests, and adding replicas is a key way of expanding the performance of a given cluster. Elasticsearch provides the cluster coordination mechanisms, including leader elections, to manage these indices, and REST APIs for querying and ingesting data as JSON documents.
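As an illustration of the REST interface, the sketch below builds (but does not send) two HTTP requests using only the Python standard library: one that stores a JSON document and one that runs a simple match query. The base URL, index name, and document fields are assumptions for this example; the /_doc and /_search endpoints and the match query are part of the Elasticsearch REST API.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured test cluster


def index_request(index, doc_id, document):
    """Build a request that stores one JSON document under the given id."""
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a request that runs a full text match query on one field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index name and document.
put = index_request("articles", "1", {"title": "Beagle", "body": "A small hound."})
search = search_request("articles", "body", "hound")

print(put.get_method(), put.full_url)
print(search.get_method(), search.full_url, search.data.decode("utf-8"))
```

Passing either request to urllib.request.urlopen would send it to a running cluster; that step is left out here so the example runs without one.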

This picture shows a single node inside a cluster with multiple Apache Lucene indices inside the node.

The diagram above shows a single node in an Elasticsearch cluster. You can see it is responsible for multiple shards of data, each of which has a Lucene index behind it.

Top Elasticsearch Use Cases

Elasticsearch can be used for full text searches and enterprise search applications where high availability and scale are required. Lucene was built to search text so Elasticsearch can provide results to queries at low latency for real-time search.

Elasticsearch for Analytics

The capability to scale makes Elasticsearch useful for data contained in analytics and logs. This type of data can often be much larger than what is found in classical text search problems. Rather than indexing entries as text like a product and its description, in this use case Elasticsearch can index textual logs on the keywords they contain, amounts represented in certain fields, time, and date. This enables observability data to be sliced, retrieved, and then paired with Kibana and visualized.
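A query of this kind might look like the sketch below, which builds a search body that filters logs by keyword and time and then buckets the matches over time. The field names message and @timestamp and the function name log_query are assumptions about how the logs were indexed; the bool, match, range, and date_histogram constructs are part of the Elasticsearch query DSL (fixed_interval assumes a 7.x release).

```python
import json


def log_query(keyword, since, interval):
    """Build a search body: logs matching a keyword since a given time, bucketed by interval."""
    return {
        "size": 0,  # return only the aggregation, not the individual log entries
        "query": {
            "bool": {
                "must": [{"match": {"message": keyword}}],
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
            }
        },
    }


body = log_query(keyword="timeout", since="now-1h", interval="1m")
print(json.dumps(body, indent=2))
```

The resulting per-minute counts are the kind of series a Kibana chart would plot.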

Elasticsearch for Application Performance Monitoring (APM)

Application Performance Monitoring is a popular use of Elasticsearch. APM consists of user actions as well as backend status in web and mobile applications being ingested, indexed, and analyzed. This gives companies a view into the performance of their applications, particularly where they are deployed on machines that the company does not directly control and on which it can’t install any observability infrastructure (e.g. an end user’s mobile device). Here the application itself has to send the data back to the company to help it explore application performance as well as usage patterns.

Elasticsearch for Security Information and Event Management (SIEM)

Elasticsearch has been making considerable inroads into the security space. Just as it is useful for log data generally, it is equally useful when exploring security logs. Security data that is kept locally can be subject to deletion by attackers. Data protected in a separate environment enables security monitoring personnel to see security information in real time and to investigate events after the fact.

Active monitoring and investigations require digging into and slicing data, and Elasticsearch and its ecosystem are particularly well suited to these kinds of powerful transformations. Simple dashboard tools are not able to empower security personnel in the same way that Elasticsearch’s flexibility, paired with Kibana, can. This offers users an open source alternative with similar capabilities to the SIEM systems found in other products like Splunk.

The ELK Stack

The ELK stack, sometimes called the Elastic stack, stands for a set of products commonly used together: Elasticsearch, Logstash, and Kibana. This stack of tools extends Elasticsearch by adding capabilities for shipping data into the system in clever ways, via Logstash, as well as visualizing and analyzing it, with Kibana.

Kibana

Kibana is an open source web application that provides a browser-based interface to query and visualize Elasticsearch data, and even administer some Elasticsearch features.

This diagram shows the multiple types of data that can flow into Elasticsearch before it is accessed and visualised by Kibana.

Logstash

Logstash is a log shipping utility that enables you to separate your log shipping to Elasticsearch from your application layer. It also enables you to richly annotate the logs you are shipping by making sure the correct fields (like time) are represented rather than the record being simply a raw unannotated log file. With an ecosystem of plugins and transforms it is a powerful tool for adding value to the data you place into Elasticsearch.
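The kind of annotation described here can be sketched in Python. This is not Logstash itself or its configuration syntax, only an illustration of turning a raw log line into a structured record with an explicit time field. The log line format, the field names, the parse_failure tag, and the function name annotate are all assumptions made for this example.

```python
import re
from datetime import datetime, timezone

# Assumed line format: "<ISO timestamp>Z <LEVEL> <free text message>"
LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def annotate(raw_line):
    """Turn a raw log line into a structured event, or tag it if it cannot be parsed."""
    match = LINE.match(raw_line)
    if match is None:
        return {"message": raw_line, "tags": ["parse_failure"]}
    try:
        parsed = datetime.strptime(match["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return {"message": raw_line, "tags": ["parse_failure"]}
    return {
        "@timestamp": parsed.replace(tzinfo=timezone.utc).isoformat(),
        "level": match["level"],
        "message": match["message"],
    }


event = annotate("2020-12-21T10:15:00Z ERROR payment timeout after 30s")
print(event)
```

Once the time and level are separate fields rather than part of one string, they can be indexed, filtered, and aggregated on their own.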

The Open Distro for Elasticsearch

The Open Distro for Elasticsearch is a distribution of Elasticsearch built by a coalition of companies and is entirely licensed under the Apache 2.0 licence. The distribution derives from a problem with the “open core” business model of offering an open source project with proprietary extras. Elasticsearch development was following this model and security was one of the things held back and made proprietary.

After many high-profile data breaches, the community responded to the need for genuine security for all Elasticsearch deployments and created completely open source add-on packs to fill these various community needs. This effort seeks to minimize any deviation and retains the same core codebase while providing extra functionality, from security to machine learning, that’s completely open source.

Open Distro for Elasticsearch also puts the power back in the customers’ hands by providing them with a choice of customer-focused managed service providers. Our own managed version of Elasticsearch uses the Open Distro for Elasticsearch. Being able to get an Elasticsearch service managed, and to not be held hostage by a single provider, allows enterprises to pursue an open source strategy across their application and data layer.

Frequently Asked Questions About Elasticsearch

What’s an Inverted Index?

An inverted index is at the core of classical search. It operates like the index at the back of a book, listing terms first and then the associated documents (pages) that contain them. 

An index of Wikipedia would have an entry for “beagle” under which every article in which the word appeared would be listed. You can see from a search of Wikipedia for “beagle” that the documents returned all contain the word “beagle”.

A large set of logs could be indexed by both date and other terms enabling entries to be quickly sliced by time and topic. Due to the utility of inverted indexes, many classical databases have added them. However, these additions frequently lack the extensive features, scalability, and years of development present in Apache Lucene.
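Slicing by time and topic amounts to intersecting the entries listed under a date with the entries listed under a term. The sketch below shows that intersection with plain Python sets; the sample log entries and the names by_date and by_term are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical log entries: (entry id, date, message).
logs = [
    (0, "2020-12-20", "disk full on node-a"),
    (1, "2020-12-21", "login failed for admin"),
    (2, "2020-12-21", "disk warning on node-b"),
]

by_date = defaultdict(set)  # date -> ids of entries logged on that date
by_term = defaultdict(set)  # term -> ids of entries containing that term
for entry_id, date, message in logs:
    by_date[date].add(entry_id)
    for term in message.lower().split():
        by_term[term].add(entry_id)

# Entries from 21 December that mention "disk": intersect the two id sets.
matches = by_date["2020-12-21"] & by_term["disk"]
print(sorted(matches))
```

Neither lookup scans the log text itself, which is why this style of query stays fast as the number of entries grows.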

Weighting the Index: TF-IDF, BM25, and BM25F

Weighting an index allows documents or records not only to belong to each inverted index entry, but also to be ordered by a ranking function. This primarily derives from work done in the 1970s by Karen Spärck Jones and others on Term Frequency – Inverse Document Frequency (TF-IDF).

The intuition behind these techniques is that documents that use a term a lot should be ranked higher than documents that use it infrequently. Some frequently used words aren’t as important, however. For example, most documents have a high term frequency for the word “the”, but it rarely assists in information retrieval, so terms that appear in many documents are given less weight.
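A small worked example may help. The sketch below uses one common textbook form of TF-IDF (term frequency as a share of the document, inverse document frequency as a natural logarithm); other variants exist, and the tiny corpus and the function name tf_idf are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score a term for one document: frequency in the document times rarity in the corpus."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    containing = sum(1 for tokens in corpus if term in tokens)
    # A term that appears in every document gets log(1) = 0 and so contributes nothing.
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


# Hypothetical corpus of three tokenized documents.
corpus = [
    "the beagle chased the ball".split(),
    "the cat ignored the ball".split(),
    "the beagle slept".split(),
]

for term in ["the", "ball", "cat"]:
    print(term, round(tf_idf(term, corpus[1], corpus), 4))
```

In the second document, "the" is the most frequent word but scores zero because it appears in every document, while "cat", which appears in only one document, scores highest.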

BM25, sometimes called Okapi BM25, is a higher-performing variant of the TF-IDF scheme with extra parameters to take into account factors like document length. A variant known as BM25F is field aware, so that fields like titles, headings, or link text can be given more importance.
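The effect of the extra parameters can be seen in a short sketch of the BM25 scoring formula. This is an illustration rather than Lucene's implementation: the corpus and the function name bm25_score are assumptions, and k1 = 1.2 and b = 0.75 are commonly cited default values.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum the BM25 contribution of each query term for one document."""
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        containing = sum(1 for tokens in corpus if term in tokens)
        # Rare terms get a larger inverse document frequency.
        idf = math.log(1 + (n_docs - containing + 0.5) / (containing + 0.5))
        # b controls how strongly longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        # k1 controls how quickly repeated occurrences stop adding to the score.
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


# Hypothetical corpus: a short and a long document that each mention "beagle" once.
short_doc = ["beagle", "puppy"]
long_doc = ["beagle"] + ["filler"] * 20
other_doc = ["cat", "nap"]
corpus = [short_doc, long_doc, other_doc]

print(bm25_score(["beagle"], short_doc, corpus))
print(bm25_score(["beagle"], long_doc, corpus))
```

Both documents contain the query term exactly once, but the short document scores higher because the term makes up a larger share of it.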

If you are interested in finding out a little more about the theory, the book Introduction to Information Retrieval is available online for free.

The Enduring Power of Elasticsearch

Elasticsearch’s capability to scale, store, and index data means that although it’s not strictly a database, DB-Engines.com ranks it as one of the top 10 data layer technologies in the world. The usefulness of indexing data and visualizing it this way is indisputable across many domains. You can try out Elasticsearch for free on our platform, or get in touch with us if you have further questions.