“Search out thy wit for secret policies, and we will make thee famous through the world.” [Henry VI Part I, III, 3]
Recently I decided to learn about a new addition to Instaclustr’s managed open source platform, the scalable full-text search technology Elasticsearch. Being in COVID-19 lockdown at the time, I found an online course that seemed to have a good mix of theory and practice: Elasticsearch 7 and the Elastic Stack – In Depth & Hands On! This blog reveals some of the most interesting things that I learnt on my first (and definitely not last) “test drive” of Elasticsearch!
1. Don’t Panic!
Mr Prosser: But, Mr Dent, the plans have been available in the local planning office for the last nine months.
Arthur: Oh yes, well as soon as I heard I went straight round to see them, yesterday afternoon. You hadn’t exactly gone out of your way to call attention to them had you? I mean like actually telling anybody or anything.
Mr Prosser: But the plans were on display…
Arthur: On display? I eventually had to go down to the cellar to find them.
Mr Prosser: That’s the display department.
Arthur: With a torch.
Mr Prosser: The lights had probably gone out.
Arthur: So had the stairs.
Mr Prosser: But look, you found the notice, didn’t you?
Arthur: Yes yes I did. It was on display at the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying beware of the leopard.
[Arthur’s conversation with Mr Prosser, the local council worker trying to demolish Arthur’s house, in “The Hitchhiker’s Guide to the Galaxy”, by Douglas Adams]
His house being demolished turned out to be the least of Arthur’s problems when a Vogon Constructor Fleet arrived to demolish the Earth to make way for a hyperspace express route, much to the surprise of everyone on Earth. They too had been informed by the Galactic Hyperspace Planning Council that all the “planning charts and demolition orders have been on display at your local planning department in Alpha Centauri for 50 of your Earth years”. Being able to find a document can sometimes be extremely important (e.g. if you don’t want your house or planet demolished without warning).
2. What Do You Already Know About Text Search?
About to launch myself into a crash course in Elasticsearch during the COVID-19 lockdown, I wondered what (if anything!) I knew about text search in general, so I did a “self” search…
I don’t recall covering it in my computer science degree (a long time ago). I suspect this is because text search has its origins in Information Retrieval which was taught by the Information Systems (run by the accounting and management department). From the 1940’s there was considerable progress on the computerized automation of information retrieval, building on manual card catalogues in libraries! (also called index cards, typically you could only search by author, title or subject!)
But then I wondered which things related to text I’ve come across, and the results surprised me.
My first encounter with “text” was in the context of Natural Language Understanding and parsing, when I wrote a Prolog DCGs (define clause grammars) parser for a large subset of the English language. This was actually my second attempt to implement the grammar, my first attempt used LISP and was a lot more complicated. One of the influential thinkers in Natural Language Processing at the time was Terry Winograd. I’ve still got the first (and only) volume of a planned series—Language As A Cognitive Process, Syntax, Volume 1 (here’s a PDF version). Interestingly, he was Larry Page’s (co-founder of Google) PhD supervisor!
So what made Larry’s search engine better than others? Well, the idea was surprisingly simple (but harder to implement at scale),that the World Wide Web was just a citation system for pages, the more links there are to a page, the more important that page it is likely to be. The algorithm was called PageRank (this is the original paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”). Significantly, the size of web scale search required a new approach to high performance computing and resulted in MapReduce (easy to program scalable parallel computing on commodity clusters), which in turn evolved into Apache Hadoop and Apache Spark, and other web scale technologies such as Apache Cassandra and Apache Kafka.
Another parallel development was in structured document processing using XML (Extensible Markup Language). XML and Web Services were an important part of the research and industry landscapes in the 2000’s, and I made extensive use of the rich set of standards and tools for XML transformation and search such as XSLT, XPATH, and XQuery (these are still W3C standards). Typical of most technologies, Web Services were built on previous ideas, and in turn evolved further into more “modern” REST and JSON web APIs.
Finally, while at NICTA (National ICT Australia) I did some performance modeling for IP Australia —they are the Australian government department who look after patents and trademarks. They were re-architecting their patent and trademark search systems and wanted to know how well they would perform (response times) and what capacity (throughput) they would have for a realistic mix of search types. The majority of searches were “simple”, so they were fast and didn’t consume many resources, however, there were some more complex searches that took a lot longer to return results and were resource hungry.
So, given this limited and somewhat mixed background to text search, some of my questions at the start of learning Elasticsearch included:
- What is the relevant linguistic background/theory for understanding text search?
- Does Elasticsearch use any grammatical knowledge for searching?
- How is relevance computed? (Assuming there are no links to the document as in the original Google PageRank algorithm.)
- Can Elasticsearch search structured or unstructured documents?
- Are more “complex” searches slower?
After 9 hours and 96 Chapters of online videos, plus several days of hands-on exercises, I had 30+ pages of handwritten notes, lots of things that I learnt about Elasticsearch that I didn’t know to start with, and still more questions. In the rest of this blog I’ll focus on some of the things that I learnt. In the next blog I’ll ask one of our resident Elasticsearch gurus about some of the open questions I had.
3. What Did I Learn About Elasticsearch?
3.1 What Sort of API Does Elasticsearch Have?
A REST API (HTTP + JSON).
3.2 Are There Other Ways of Using Elasticsearch?
Yes! There’s a simple URI request query without a JSON body. However, you do have to worry about encoding special characters correctly.
About half of the course used Kibana; this is a cool GUI which enables you to create searches visually, and JSON is generated as well so you can inspect the JSON and edit it etc.) It’s also used to graph search results. You can use Kibana with the Instaclustr managed Elasticsearch.
Another way of interacting with Elasticsearch is the Elasticsearch SQL, although this has some limitations (e.g. for queries and aggregations only).
3.3 What Is a Document?
“A document in madness! Thoughts and remembrance fitted.” [Hamlet, IV, 5]
Elasticsearch searches documents. But what is a document? Basically an Elasticsearch document is a JSON object which is stored and indexed in Elasticsearch. Each document is stored in at least one index and can be searched in many different ways.
How big can/should a document be? The default maximum size is 100MB, and the absolute maximum is 2GB (the Lucene limit). The real problem with large documents is that when you do a search, the result is just a list of documents. It’s probably not very useful knowing that the word “duck” appears somewhere (but where?) in a single large document (e.g. The Complete Works of William Shakespeare, for example “Duck” or a variant appears 16 times). So how small should a document be? A single word? Possible in theory. For Shakespeare a sensible granularity may be one document per line (or speech perhaps?) Then you want to ensure that each line of the play has meta-data including which play it’s in, who’s speaking, the line number, etc., so that when the results are returned you have sufficient context for the document.
One of only 235 surviving copies of Shakespeare’s original First Folio, worth $6M!
Another challenge to searching the original Shakespear text is the Elizabethan spelling (not to mention spelling variations, archaic words, or words that have changed their meaning etc.) This is the actual wording on the “To the Reader” page above:
This Figure, that thou here feest put,
It was for gentle Shakespeare cut:
Wherein the Grauer had a strife
with Naure, to out-doo the life:
O, could he but haue dravvne his vvit
As vvell in frasse, as he hath hit
Hisface; the Print vvould then surpasse
All, that vvas euer in frasse.
But, since he cannot, Reader, looke
Not on his picture, but his Booke.
Some of the language analysis and search techniques supported by Elasticsearch help solve this type of problem (but unfortunately probably won’t help to decipher my hand written notes from the course), keep on reading to find out more.
3.4 How Are Documents Indexed?
“The presentation of but what I was; The flattering index of a direful pageant” [Richard III, IV, 4]
Before you can search documents in Elasticsearch, they have to be indexed (in theory you could search documents by scanning all of the documents, but this would be way too slow).
You can index documents either manually or dynamically (which I discovered after the course) . To manually index, you have to define mappings, then create an index, and finally add documents to the index. Using the dynamic way, you can just index a document and everything else is done for you by Elasticsearch. Elasticsearch just tries to guess the data types found in a document. Here’s some good documentation I found on dynamic mapping.
3.5 What Are Mappings?
“I am near to the place where they should meet, if Pisanio have mapped it truly.” [Cymbeline, IV, 1]
Mappings are basically the index schema, and map the fields in documents to Elasticsearch data types. Depending on the data type, the search semantics may differ. There are simple data types, complex data types (that support hierarchical JSON objects), geo data types, and more specialized data types.
String data can be mapped to either text or keyword types. What’s the difference? Text is full-text searchable (and analyzed to create searchable index terms). Text isn’t used for sorting or aggregation. Keywords are for strings that can only be searched by their exact value, but can be sorted, filtered, and aggregated (e.g. a postcode). But what if you need both? You can have multi-fields using the fields parameter. For example, you may want the Shakespearean characters to be mapped as both text and keyword, to allow you to filter by a specific name (e.g. Lord Caputius), or search for all characters who have a particular title or rank (e.g. Lord or Captain etc.) I found this the most surprising thing about Elasticsearch and still haven’t fully understood the subtleties, but this is a better explanation of the difference between Term-level vs full-text queries.
3.6 What Are the Steps for Indexing and Searching Documents?
“Tell her, we measure them by weary steps.” [Love’s Labour’s Lost, V, 2]
The basic lifecycle of an Elasticsearch workflow consists of the following steps:
- Define mappings
- Create index
- Add documents to iIndex
- Search the index to find documents
- (Try to) change mappings/number of primary shards, etc.
- Delete the index to start over!
Elasticsearch has a number of things that are immutable, and which can’t be easily changed after creating them.
If you want to change the mappings for an index you can’t, you have to create a new index with the new mappings. In the course this meant starting from scratch each time, which assumed you had all the documents still to add to the new index each time. After the course I found out you can reindex, i.e. You can move documents from one index to another index! You can also add mappings after index creation.
You also can’t change the number of primary shards (used for writes, but you can add replica shards, used for reads).
Documents are also immutable. What happens if you want to update them? Simple! You get a new version.
3.7 How Does Elasticsearch Handle Concurrent Updates (Conflicts)?
“Who’s this? O God! it is my father’s face, Whom in this conflict I unwares have kill’d.” [Henry VI, Part III, II, 5]
I thought that this next thing I learnt about Elasticsearch was interesting as it shows that Elasticsearch is more than “just” Lucene, and actually distributed—and it’s also something I knew about already.
What happens if you try to change (create, update, or delete) a document at the same time from multiple clients? This can cause a potential problem as Elasticsearch is intrinsically distributed, asynchronous, and concurrent! It also doesn’t support ACID transactions, and documents are immutable so any changes result in a new version. How can it prevent an older version of a document from overwriting a newer version? It uses a well known mechanism from distributed databases called “optimistic concurrency control” (which I first came across when trying to answer the question “How scalable is J2EE technology?”, it also enables efficient cache utilization). Optimistic concurrency allows for higher concurrency (compared with “pessimistic” concurrency which prevents conflicts by using locks), and “optimistically” assumes it has the most up to date version, but which may result in an exception if an inconsistency is eventually detected. Elasticsearch uses sequence_number (which is different to version) and primary_term to identify changes. Here’s a Java example. You can set the retry_on_conflict to a non-zero value (the default is no retries) to get automatic retries up to a maximum value.
So now you’ve got an index, and some documents in the index. What can you do? Search! Search options are complex and include filters, queries, sorting, etc. However, many of these assume an exact match between the searched and indexed terms, what happens if they don’t match exactly?
“And stemming it with hearts of controversy;
But ere we could arrive the point proposed,
Caesar cried ‘Help me, Cassius, or I sink!’” [Julius Caesar, I, 2]
This is where we could easily fall down the computational linguistics rabbit hole. Stemming is the process of reducing a word to its root or stem (which may not even be a word), e.g. “vac” (empty) is the root of both “evacuation” and “vacation”. Here’s an online dictionary of roots to give you an idea of how complex this is (given that English is derived from multiple other languages including Latin and Greek). Stemming sounds potentially painful! For example, here’s a flower stemming machine (it removes leaves, prickles, etc., leaving just the bare stem):
Why is stemming useful in text searching? Well, typically you want to find documents that contain different parts of speech of a root, and you may not even be using the root, and some roots aren’t words—e.g. “argu” is the root of “argue”, “argument”, etc. Searching for “duck” in Shakespeare, you may want to return speeches which include all of these duck-like words, not only exact matches, i.e. all of the following words with duck appear: “a-ducking”, “wild-duck”, “silly-ducking”, “duck”, “ducks”, “Duck”, “Ducks”. Stemming uses simple purely syntactic methods to find a root (typically removing prefixes and suffixes) so would work correctly for this example.
Elasticsearch supports stemming, and other similar approaches to tokenizing words (e.g. lowercasing) using analyzers. You specify analyzers when you define mappings for an index, and you can have different analyzers for each field, or a default for the whole index. This online book has more on stemming.
“Here, master doctor, in perplexity and doubtful dilemma.” [Merry Wives of Windsor, IV, 5]
Given that finding the “Stem” of a word is called “Stemming”, then is finding the “Lemma” of a word called “Lemming”? No! It’s called Lemmatization (although a simple stemming approach would assume that both Stemming and Lemmatization have “Lemm” as the root!)
The Lemmings Amiga game!
Lemmatization has the same goal as Stemming, but uses more sophisticated linguistic analysis, and finds a root that is a real word, so is better than Stemming. But this requires both a dictionary and potentially grammatical analysis to be 100% accurate. For example, stemming will not find the root of “better” and “best”, which is “good” (not “be” as a Stemmer may guess).
Any of the Elasticsearch stemming algorithms that use dictionaries are actually Lemmatization based approaches. There are some open source lemmatization plugins for Elasticsearch, e.g. LemmaGen, and a list of other open source lemmatizers.
3.10 Levenshtein Fuzzy Queries!
From information theory, the Levenshtein (edit) distance is a metric for measuring the difference between two sequences—the minimum number of edits to change one sequence into the other. It’s similar to the better known Hamming distance, but is more general as it allows for deletions and insertions (as well as substitutions) so the sequences can be different lengths. In Elasticsearch you use a fuzzy query, and you may need to set the “fuzziness” value (options are either auto which automatically determines the difference based on the word length, or manually set). For example, the Muppet character, Fozzie Bear (who was definitely Fuzzy), has a fuzziness of 3 between the strings “Fuzzy” and “Fozzie”, i.e. (1) replace “u” with “o” giving “Fozzy”, (2) replace “y” with “i” giving “Fozzi”, (3) insert “e” giving “Fozzie”!
Fuzzy Fozzie Bear (with Kermit)
In computer science an n-gram is just a contiguous subset of a longer sequence. They are therefore applied to domains such as genetics as well as languages. In Natural Language Processing n-grams are typically applied at the word level, and are useful to predict what the next word is likely to be given an existing sequence and a Markov model or Bayesian probability. A well known example of n-grams at the word level is the Google Books Ngram viewer. Here’s an example graphing the occurrence of n-grams “text search”, “natural language processing”, “index card”, “catalogue card” in books from 1800 to the present day.
In Elasticsearch n-grams are used for efficient approximate matching (as they are computed at index time so are fast at search time, and match on partial words—use the ngram tokenizer), and also for autocompletion suggestions (use the edge_ngram tokenizer). The defaults are to compute a unigram and a bigram, but the min_gram and max_gram lengths can be customized.
How does this help match incorrectly spelt words? Because a longer list of n-gram terms is generated at index time, and the same ngram analyzer is applied to the search term, then a match occurs for any of the ngrams. For example, if we are looking for documents containing “duck” and have indexed them with bigrams, giving terms [du, uc, ck], then searching for “duk” will result in bigrams [du, uk], of which “du” matches an indexed bigrams. The best explanation I’ve found is in “Elasticsearch in Action”. Here’s another good example of Elasticsearch n-grams.
Apparently “duk” is a correct spelling of “duck” if you are the “local” Canberra band “Peking Duk”.
“Signior Romeo, bon jour! There’s a French salutation to your French slop.” [Romeo and Juliet, II, 4]
What if you want to search for documents with multiple terms occurring in order and “close” to each other? By default the match phrase query only returns exact matches. You can specify a “slop” parameter to change the maximum distance the words can be apart (0 is no gap, 1 is 1 word in between etc). The phrase “And hold-fast is the only dog my duck” in Shakespeare would be found with a phrase query of “dog duck” with a slop of >= 1 but not 0.
There are some good query examples in this article, including using slop.
3.13 Partial Matching With Prefix, Wildcards and Regular Expression Queries
“O, they have lived long on the alms-basket of words.
I marvel thy master hath not eaten thee for a word;
for thou art not so long by the head as
honorificabilitudinitatibus: thou art easier
swallowed than a flap-dragon.” [Love’s Labour’s Lost, V, 1]
If analysis of the text doesn’t result in indexed tokens that are sensible to search for exactly (or if you don’t expect users to be able to remember long words!), another option is to use partial matching, either by prefix, wildcards, or regular expressions. For example, if you want to find the word “Honorificabilitudinitatibus” (the longest word in Shakespeare), then a prefix query of “Honour” will find it (and fast). Wildcard queries allow pattern matching (on simple patterns) using “?” for any single character and “*” for 0 or more characters. So a wildcard query of “Honour*” will also find our longest word. In theory you could also find “Honorificabilitudinitatibus” using a Wildcard query of “*bus”, but this may be slow! More general pattern matching is supported with the regular expression query (regexp), but also may be slow depending on the exact pattern. Here’s the regular expression syntax.
So it turns out I did know something about Elasticsearch after all. I’ve previously used the Lucene prefix search in my “Geospatial Anomaly Detection: Part 4 – Cassandra Lucene” blog, and it turned out to be one of the fastest Lucene queries for geospatial searching (using a geohash) that I experimented with.
Everything from words to Galaxies have Morphology
It turns out I did know something about Morphology (“the study of the shape of things”) after all! In Linguistics, morphology includes stemming, lemmatization, etc. Hidden in Appendix B in the final pages in Terry Winograd’s “Syntax” book was a discussion on morphology and why it was hard to do with a rules based approach. Obviously I didn’t get to the end of the book! If I had, I would have learnt that the stem of the longest word in the English language, “antidisestablishmentarianism”, is “establish” (which can be found by the simpler stemming algorithms), and may have been less surprised by some of the above Elasticsearch features.
So that’s it for part 1 of this Elasticsearch blog. I only covered what I found particularly interesting from the course, so it’s not an exhaustive introduction to Elasticsearch, and it omits much of the course material that dealt with the relatively complicated aspects of Elasticsearch clustering and operations. If you use Instaclustr’s Managed Elasticsearch you don’t need to worry about much of this at all, although there are a few things you still need to consider including how many nodes of each type are needed in your cluster, and how many primary and replica shards etc. In fact, I still had lots of unanswered questions after completing the course, so I decided to send my questions to one of Instaclustr’s Elasticsearch gurus to see what he would make of them. The next part of the Elasticsearch blog will take the form of a Q&A around these.
And now I’ll leave the last word to Shakespeare:
“The search, sir, was profitable” [All’s Well that Ends Well, IV, 3]
4. Further Resources
Elasticsearch is a mature technology so there’s lots of documentation around. The catch is that much of it is out of date or for proprietary versions. Some of the Lucene documentation is also relevant.
Open Distro for Elasticsearch
Lucene documentation is at Apache Lucene 7.4.0 Documentation.
Some useful blogs
A good blog including search score and more.
This is a nice introduction with a real example.
23 Useful Elasticsearch Example Queries, the title says it all!
An Elasticsearch QA with one of the elasticsearch developers.
Terry Winograd, Language as a cognitive process: Volume 1: Syntax, 1983, PDF download
Encyclopedia of Language and Linguistics, this chapter covers the history of document search.
I asked some colleagues about materials on search and Thomas Griffiths recommended the following:
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
TF–IDF on Wikipedia and Karen Spärck Jones’ original paper (1972), A STATISTICAL INTERPRETATION OF TERM SPECIFICITY AND ITS APPLICATION IN RETRIEVAL, Journal of Documentation, Vol. 28 No. 1, pp. 11-21., PDF Download: A statistical interpretation of term specificity and its application in retrieval
BM25 The Next Generation of Lucene Relevance, Doug Turnbull, October 16, 2015
Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit, 1st Ed. 2009
Finally, some of Instaclustr’s online/virtual Elasticsearch webinars are here (under on-demand).
You can also try out Instaclustr’s Managed Elasticsearch
And read our Understanding Elasticsearch whitepaper