Comparing all the replicas of each piece of data that exist (or are supposed to) and updating each replica to the newest version.
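As an illustration of the repair idea described above, here is a minimal sketch that reconciles replicas by keeping the value with the newest timestamp. The function name `reconcile` and the dictionary layout are assumptions made for this example; they are not Cassandra's internal API.

```python
def reconcile(replicas):
    """Bring every replica up to the newest version of each key.

    Each replica is a dict mapping key -> (timestamp, value).
    This layout is hypothetical and only serves the illustration.
    """
    newest = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.items():
            # Keep the version with the highest timestamp seen so far.
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    for replica in replicas:
        # Copy the winning versions onto every replica,
        # including replicas that were missing the key entirely.
        replica.update(newest)
    return newest
```

After the call, every replica holds the same, newest version of each key.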
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Apache Lucene™ is an open-source, high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is the base search technology used by Apache Solr and Elasticsearch.
Solr (pronounced “solar”) is an open source enterprise search platform, written in Java, from the Apache Lucene project.
Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Application Programming Interface (API) is a set of functions and procedures that allow the creation of applications which access the features or data of an operating system, application, or other service.
The Apache Software Foundation (ASF) is a non-profit corporation that oversees development of Apache software.
Process of establishing the true identity of a user or application.
Process of establishing permissions to database resources through roles.
Pausing or blocking the buffering of incoming requests after reaching the threshold until the internal processing of buffered requests catches up.
An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.
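The membership check that a Bloom filter performs can be sketched as follows. This is a toy version under stated assumptions: the class name, the bit-array size, and the use of salted SHA-256 hashes are choices made for the example, not what Cassandra uses internally. The key property holds regardless: the filter may answer "maybe present" for a key that was never added, but it never answers "absent" for a key that was added.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent", so the disk read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("user:1", "user:2"):
    sstable_filter.add(row_key)
```

A read path would consult `might_contain` first and touch the SSTable on disk only when it returns `True`.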
The process by which new nodes join the cluster transparently gathering the data needed from existing nodes.
C# is an object-oriented programming language developed by Microsoft as part of their .NET initiative, and later approved as a standard by ECMA and ISO. C# has a procedural, object oriented syntax based on C++ that includes aspects of several other programming languages (most notably Delphi, Visual Basic, and Java) with a particular emphasis on simplification (fewer symbolic requirements than C++, fewer decorative requirements than Java).
C++, originally named “C with Classes,” is a high-level programming language developed by Bjarne Stroustrup at Bell Labs. C++ adds object-oriented features to its predecessor, C. C++ is a statically typed, free-form, multi-paradigm language supporting procedural programming, data abstraction, object-oriented programming, and generic programming. C++ is one of the most popular programming languages. The C++ programming language standard was ratified in 1998 as ISO/IEC 14882:1998 and revised in 2003 as ISO/IEC 14882:2003. Later revisions followed, beginning with C++11 (known informally during its development as C++0x).
The number of unique values in a column.
The latest versions of Cassandra use virtual nodes, or vnodes.
A collection of data centers.
The storage engine process that creates an index and keeps data in order based on the index.
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is the position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
The smallest increment of data, which contains a name, a value, and a timestamp.
A container for rows, similar to the table in a relational system. Called table in CQL 3.
A file to which Cassandra appends changed data for recovery in the event of a hardware failure.
The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. The available compaction strategies are:
- DateTieredCompactionStrategy (DTCS)
- LeveledCompactionStrategy (LCS)
- SizeTieredCompactionStrategy (STCS)
- TimeWindowCompactionStrategy (TWCS)
Composite Partition Key
A partition key consisting of multiple columns.
Compound Primary Key
A primary key consisting of the partition key, which determines on which node data is stored, and one or more additional columns that determine clustering.
The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.
A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
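The arithmetic behind consistency levels can be sketched briefly. A quorum is a majority of the replicas, and reads are guaranteed to see the latest acknowledged write when the number of replicas that acknowledge a write plus the number that answer a read exceeds the replication factor. The function names below are assumptions made for this example.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(write_acks, read_responses, replication_factor):
    """True when the write set and the read set must overlap in at least one replica."""
    return write_acks + read_responses > replication_factor
```

For example, with a replication factor of 3, writing and reading at quorum (2 replicas each) overlaps in at least one replica, while writing and reading at a single replica does not.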
The node that determines which nodes in the ring should get the request based on the cluster configured snitch.
A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a physical data center.
DateTieredCompactionStrategy (DTCS) is deprecated starting in Apache Cassandra 3.8. This strategy is particularly useful for time series data. It stores data written within a certain period of time in the same SSTable. For example, Apache Cassandra can store your last hour of data in one SSTable time window, and the next 4 hours of data in another time window, and so on. The most common queries for time series workloads retrieve the last hour/day/month of data.
Denormalization refers to the process of optimizing the read performance of a database by adding redundant data or by grouping data. In Cassandra, this process is accomplished by duplicating data in multiple tables, grouping data for queries.
Elassandra is an open-source software product that integrates Cassandra and Elasticsearch. Elassandra takes the advantages of both and combines them to provide a distributed, highly available, multi-datacenter search and secondary index data store.
Elasticsearch is an open source, RESTful search engine built on top of Apache Lucene and released under an Apache license. It is Java-based and can search and index document files in diverse formats.
Cassandra maximizes availability and partition tolerance. Cassandra ensures eventual consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed. This is Cassandra’s way of ensuring that any query always returns the most recent version of the result set, and that all replicas of any given row will eventually become completely consistent with each other.
An approach used in several NoSQL databases, including (optionally) Apache Cassandra, where the results of a successful write are not guaranteed to be reflected immediately in subsequent reads. Eventual consistency can be used to provide high availability with lower infrastructure costs than strong consistency.
Faceted search is the dynamic clustering of items or search results into categories that uses any value in any field to drill into search results, or even skip searching entirely.
An index structure over the entire graph.
A peer-to-peer communication protocol for exchanging location and state information between nodes.
A collection of vertices and edges.
A hard disk drive (HDD) or spinning disk is a data storage device used for storing and retrieving digital information using one or more rigid rapidly rotating disks. Compare to SSD.
Hadoop Distributed File System (HDFS) stores data on nodes to improve performance. HDFS is a necessary component in addition to MapReduce in a Hadoop distribution.
The amount of disk space required by a process (such as compaction) in addition to the space occupied by the data being processed.
An operation that can occur multiple times without changing the result, such as Cassandra performing the same update multiple times without affecting the outcome.
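A minimal sketch of why a timestamped update is idempotent: applying the same write a second time leaves the row unchanged. The row layout (column name mapped to a value and timestamp pair) and the function name `apply_write` are assumptions for illustration only.

```python
def apply_write(row, column, value, timestamp):
    """Apply a timestamped write; the newest timestamp wins.

    `row` maps column name -> (value, timestamp). Hypothetical layout.
    """
    current = row.get(column)
    if current is None or timestamp >= current[1]:
        row[column] = (value, timestamp)
    return row


row = {}
apply_write(row, "email", "a@example.com", 100)
once = dict(row)
apply_write(row, "email", "a@example.com", 100)  # replay of the same write
```

Here `row` after the replay is identical to `once`, which is what makes retrying a failed or timed-out write safe.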
Data on disk that cannot be overwritten or changed.
A native Cassandra capability for finding a column in the database that does not involve using the primary key.
A namespace container that defines how data is replicated on nodes. Similar to a “database” in an RDBMS.
An open-source API gateway and microservices management layer, delivering high performance and reliability.
This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the previous. Disk I/O is more uniform and predictable on higher than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process can improve performance for reads, because Cassandra can determine which SSTables in each level to check for the existence of row key data.
Also called serializable consistency. The restriction that one operation cannot be executed unless and until another operation has completed. To ensure linearizable consistency in writes, Cassandra supports Lightweight transactions. The first phase of a Lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level are executed without Cassandra’s built-in read repair operations.
Hadoop’s parallel processing engine that can process large data sets relatively quickly. A necessary component in addition to HDFS in a Hadoop distribution.
A materialized view is a table with data that is automatically inserted and updated from another base table. The materialized view has a primary key that differs from the base table, so that different queries can be implemented.
A Cassandra table-specific, in-memory data structure that resembles a write-back cache.
A single cluster can run transactional, search, and analytics nodes.
A mutation is either an insert, update or delete.
A node is the storage layer within a server.
Normalization refers to a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema. In Cassandra, this process is inefficient because joining data in multiple tables for queries requires accessing more nodes.
Online Analytical Processing (OLAP) performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. Compare to OLTP.
Online transaction processing (OLTP) is characterized by a large number of short on-line transactions for data entry and retrieval. Compare to OLAP.
A list of primary keys and the start position of data.
The limits of the partition that differ depending on the configured partitioner. Murmur3Partitioner (default) range is -2^63 to +2^63 and RandomPartitioner range is 0 to 2^127-1.
A subset of the partition index. By default, 1 partition key out of every 128 is sampled.
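The sampling described above can be sketched as follows. This is a simplified model, assuming the partition index is a sorted list of keys; the function names are hypothetical, and a real index is ordered by token rather than by raw key. The summary keeps one entry out of every 128, and a lookup uses it to find where in the full index to start scanning.

```python
import bisect


def build_summary(index_keys, sample_every=128):
    """Keep every Nth key of the sorted index, paired with its index position."""
    return [(key, pos) for pos, key in enumerate(index_keys)
            if pos % sample_every == 0]


def find_scan_start(summary, key):
    """Return the index position from which to start scanning for `key`."""
    sampled_keys = [k for k, _ in summary]
    # Last sampled key that is <= the requested key.
    i = bisect.bisect_right(sampled_keys, key) - 1
    return summary[max(i, 0)][1]


index_keys = list(range(1000))          # stand-in for sorted partition keys
summary = build_summary(index_keys)     # 8 sampled entries instead of 1000
```

The summary is small enough to hold in memory, and a lookup then scans at most `sample_every` index entries.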
Distributes the data across the cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.
The partition key, optionally followed by clustering columns. One or more columns that uniquely identify a row in a table.
A key-value pair that describes some attribute of either a vertex or an edge. Property key is used to describe the key in the key-value pair. All properties are global in DSE Graph, meaning that a property can be used for any vertices. For example, “name” can be used for all vertices in a graph.
A change in the expanse of tokens assigned to a node.
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
A process that updates Cassandra replicas with the most recent version of frequently-read data.
A process that makes all data on a replica consistent.
Replica Placement Strategy
A specification that determines the replicas for each row of data.
Replication Factor (RF)
The total number of replicas across the cluster is referred to as the replication factor (RF). A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.
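To make the replication factor concrete, the sketch below picks replicas by walking the ring clockwise from the node that owns a token. It is a rack-unaware simplification with hypothetical node names and tokens, and it assumes the replication factor does not exceed the number of nodes.

```python
import bisect


def replicas_for(ring, token, replication_factor):
    """Return the nodes that hold a copy of the row with this token.

    `ring` is a list of (node_token, node_name) sorted by node_token.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is >= the row token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = [(-6000, "node-a"), (0, "node-b"), (6000, "node-c")]
```

With a replication factor of 1 each row lives on a single node; raising it to 2 adds the next node on the ring as a second, equally important copy.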
Set of permissions assigned to users that limit access to database resources.
A procedure that is performed during upgrading nodes in a cluster for zero downtime. Nodes are upgraded and restarted one at a time, while other nodes continue to operate online.
- Columns that have the same primary key.
- A collection of cells per combination of columns in the storage engine.
A Cassandra component for improving performance of read-intensive operations. The row cache, in off-heap memory, holds rows most recently read from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, Cassandra returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables. Cassandra uses LRU (least-recently-used) eviction to ensure that the row cache is refreshed with the most frequently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.
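The LRU eviction mentioned above can be sketched with a small cache built on `collections.OrderedDict`. The class name and the capacity measured in rows are assumptions for the example; Cassandra sizes its row cache in memory, as configured in cassandra.yaml.

```python
from collections import OrderedDict


class LRURowCache:
    """Toy row cache that evicts the least recently used row when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._rows = OrderedDict()

    def get(self, key):
        if key not in self._rows:
            return None  # cache miss: the caller falls back to the SSTables
        self._rows.move_to_end(key)  # mark as most recently used
        return self._rows[key]

    def put(self, key, row):
        self._rows[key] = row
        self._rows.move_to_end(key)
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # drop the least recently used row
```

A read that hits the cache returns immediately; a miss would continue to the Bloom filter, partition key cache, and the rest of the read path.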
A graph query that traverses an entire graph or large sections of the graph.
- Authentication: Defines a service used for authentication and/or role assignment, such as Kerberos or LDAP.
- Database: Describes all database resources.
A seed, or seed node, is used to bootstrap the gossip process for new nodes joining a cluster. A seed node provides no other function and is not a single point of failure for a cluster.
The default compaction strategy. This strategy triggers a minor compaction when there are a number of similar sized SSTables on disk as configured by the table subproperty, min_threshold. A minor compaction does not involve all the tables in a keyspace. Also see STCS compaction subproperties in the relevant CQL documentation.
A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.
The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and data centers. There are several types of snitches. The type of snitch affects the request routing mechanism.
A solid-state drive (SSD) is a solid-state storage device that uses integrated circuits to persistently store data. Compare to HDD.
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are stored on disk sequentially and maintained for each Cassandra table.
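The relationship between a memtable and its SSTables can be sketched as below. This is a deliberately simplified model: the class name, the flush threshold counted in rows, and the linear scan on reads are assumptions for illustration, and the commit log is omitted.

```python
class ToyTable:
    """Writes go to an in-memory memtable; a flush produces an immutable, sorted SSTable."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}
        self.sstables = []  # each SSTable is a tuple of sorted (key, value) pairs

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Tuples are immutable, mirroring the fact that SSTables are never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer SSTables take precedence over older ones.
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Because flushed files are never rewritten in place, updates accumulate across several SSTables until compaction merges them.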
A special column that is shared by all rows of a partition. Cassandra 2.0.6 and later.
A component that handles the exchange of data (parts of SSTable files) among nodes in the cluster. Examples include:
- When bootstrapping a new node, the new node gets data from existing nodes using streaming.
- When running nodetool repair, nodes exchange out-of-sync data using streaming.
- When bulk loading data from backup, sstableloader uses streaming to complete the task.
By default, each installation of Cassandra includes a superuser account named cassandra whose password is also cassandra. A superuser grants initial permissions to access Cassandra data, and subsequently a user may or may not be given the permission to grant/revoke permissions.
The successor to DTCS. This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. Then the next time window starts and the compaction process repeats. Each TWCS time window contains data within a specified range and contains varying amounts of data.
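The grouping of SSTables into time windows can be sketched as follows. The function name and the input format (SSTable name paired with a timestamp in seconds) are assumptions for this example; it only illustrates the bucketing idea, not the strategy's actual implementation.

```python
from collections import defaultdict


def group_by_window(sstables, window_seconds):
    """Group SSTables by the time window their timestamp falls into.

    `sstables` is a list of (name, timestamp_seconds) pairs (hypothetical format).
    Returns a dict mapping window start time -> list of SSTable names.
    """
    windows = defaultdict(list)
    for name, timestamp in sstables:
        window_start = timestamp // window_seconds * window_seconds
        windows[window_start].append(name)
    return dict(windows)
```

SSTables that land in the same window are candidates to be compacted together, while SSTables from different windows are left apart.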
An element on the ring that depends on the partitioner. A token determines the node’s position on the ring and the portion of data it is responsible for. The range for the Murmur3Partitioner (default) is -2^63 to +2^63. The range for the RandomPartitioner is 0 to 2^127-1.
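How a token determines ownership can be sketched with a sorted ring and a binary search. The node names and token values are hypothetical, and the sketch assumes one token per node (no vnodes). Each node is taken to own the range that ends at its own token, with the ring wrapping around after the highest token.

```python
import bisect


def owner(ring, token):
    """Return the node responsible for `token`.

    `ring` is a list of (node_token, node_name) sorted by node_token.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is >= the data token; wrap past the last node.
    i = bisect.bisect_left(tokens, token)
    return ring[i % len(ring)][1]


ring = [(-6000, "node-a"), (0, "node-b"), (6000, "node-c")]
```

In a real cluster the data token would come from the partitioner's hash of the partition key; here tokens are passed in directly to keep the example independent of any particular hash function.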
A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.
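The effect of tombstones during compaction can be sketched as a merge that keeps the newest version of each key and then drops keys whose newest version is a deletion marker. The names `TOMBSTONE` and `compact` are assumptions for the example. As a simplification, the sketch drops tombstones immediately, whereas Cassandra keeps them for a grace period (the table's gc_grace_seconds setting) before removing them.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def compact(sstables):
    """Merge SSTables, keep the newest version per key, discard deleted keys.

    Each SSTable is a dict mapping key -> (timestamp, value).
    """
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # A key whose newest version is a tombstone is removed from the output.
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
```

Note that an older value cannot resurface: the tombstone wins the timestamp comparison before it is discarded.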
Time-to-live (TTL) is an optional expiration period for values that are inserted into a column. TTL is a very useful technique for time series data that is only relevant for a limited period.
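The expiry check implied by a TTL can be sketched in a few lines. The function name and the use of plain seconds for all arguments are assumptions for this example.

```python
def is_live(write_time, ttl_seconds, now):
    """Return True while a value is still readable.

    A value written without a TTL (ttl_seconds is None) never expires.
    """
    if ttl_seconds is None:
        return True
    return now < write_time + ttl_seconds
```

For instance, a sensor reading written at time 1000 with a TTL of 60 seconds is still live at time 1059 and expired from time 1060 onward.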
Cassandra ensures in all circumstances that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, Cassandra can be tuned to provide 100% consistency for specified operations, datacenters or clusters.
A change in the database that updates a specified column in a row if the column exists or inserts if it does not exist.
Vnode is a virtual node. Normally, nodes are responsible for a single partitioning range in the full token range of a cluster. If vnodes are enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in the cluster. Doing so can reduce the risk of hotspotting, or straining one node in the cluster.