Technical — Kafka
Thursday 5th November 2020

A Brief Introduction To Schema Evolution In Kafka Schema Registry

By Lewis DiFelice

Introduction to Kafka Schema Registry

The Confluent Schema Registry for Kafka (hereafter called Kafka Schema Registry or Schema Registry) provides a serving layer for your Kafka metadata. It provides a RESTful interface for storing and retrieving your Avro®, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility settings.

In this article, we look at the available compatibility settings, which schema changes are permitted by each compatibility type, and how the Schema Registry enforces these rules.

The Need for Formatted Messages

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Messages are serialized at the producer, sent to the broker, and then deserialized at the consumer. Kafka takes bytes as input and sends bytes as output; it knows nothing about the format of the message, and no data verification or format verification takes place. It is this constraint-free protocol that makes Kafka flexible, powerful, and fast. 

There is an implicit assumption that the messages exchanged between producers and consumers use the same format and that this format does not change. Should a producer switch to a different message format due to evolving business requirements, parsing errors will occur at the consumer. 

Avro Schemas

Apache Avro is a data serialization framework that produces a compact binary message format. The structure of the message is defined by a schema written in JSON. Avro supports a number of primitive and complex data types.  Messages are sent by the producer with the schema attached. The consumer uses the schema to deserialize the data.

The Kafka Schema Registry (also called the Confluent Kafka Schema Registry) solves this problem by enabling Kafka clients to write and read messages using a well-defined and agreed-upon schema. It enforces compatibility rules between Kafka producers and consumers. As schemas continue to change, the Schema Registry provides centralized schema management and compatibility checks.

The Kafka Schema Registry

Kafka Schema Registry handles the distribution of schemas between the consumer and producer and stores them for long-term availability.   Producers and consumers are able to update and evolve their schemas independently with assurances that they can read new and old data. Schema Registry provides operational efficiency by avoiding the need to include the schema with every data message. 

Architecture

A Kafka Schema Registry lives outside of and separately from your Kafka brokers. It is an additional component that can be deployed alongside any Kafka cluster and uses Kafka as its storage mechanism. You can use the same Schema Registry for multiple Kafka clusters. 

It provides serializers that plug into Apache Kafka clients and handle schema storage and retrieval for Kafka messages sent in any of the supported formats.

Your producers and consumers still talk to Kafka to publish and read data (messages) on topics, but they now also talk to the Schema Registry to send and retrieve the schemas that describe the data models for those messages.

At first, only Avro schemas were supported. Support for Google Protocol Buffers (Protobuf) and JSON Schema formats was added in Confluent Platform 5.5. 

In a Schema Registry, the context for compatibility is the subject, which is a set of mutually compatible schemas (i.e. different versions of the base schema). Each subject belongs to a topic, but a topic can have multiple subjects. 

Each schema is associated with a topic and has a unique ID and a version number. The schema ID avoids the overhead of having to package the schema with each message. When a producer produces an event, its serializer checks the Schema Registry for the schema. If the schema is new, it is registered and assigned a unique ID. If the schema already exists, its ID is returned. Either way, the ID is stored together with the event and sent to the consumer. When a consumer encounters an event with a schema ID, it uses the ID to look up the schema, and then uses the schema to deserialize the data.
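
As a rough sketch of what this looks like in client code (the broker address, registry URL, topic name, and schema below are illustrative assumptions, not taken from this article), a Java producer only needs to point its Avro serializer at the Schema Registry; schema registration and ID lookup happen inside the serializer:

  import java.util.Properties;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class RegistryAwareProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");                // placeholder broker
          props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
                    "io.confluent.kafka.serializers.KafkaAvroSerializer"); // talks to the registry
          props.put("schema.registry.url", "http://localhost:8081");       // placeholder registry URL

          // A minimal illustrative schema; the serializer registers it (or finds its
          // existing ID) in the Schema Registry and stores only the ID with each message.
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}");
          GenericRecord payment = new GenericData.Record(schema);
          payment.put("id", "txn-1001");
          payment.put("amount", 99.95);

          try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
              producer.send(new ProducerRecord<>("transactions", "txn-1001", payment));
          }
      }
  }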

The Schema Registry provides a RESTful interface for managing schemas and storing a versioned history of them, and it supports schema compatibility checks for Kafka.  

Schema Formats

A Schema Registry supports three data serialization formats:

  • Apache Avro is a data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols. Avro is the default format. 
  • Protocol Buffers (Protobuf) is Google’s language-neutral mechanism for serializing structured data.
  • JSON Schema is a method to describe and validate JSON documents.

Schema Registry stores and supports multiple formats at the same time. For example, you can have Avro schemas in one subject and Protobuf schemas in another. Furthermore, both Protobuf and JSON Schema have their own compatibility rules, so you can have your Protobuf schemas evolve in a backward or forward compatible manner, just as with Avro.

Create the Avro Schema

Before you can produce or consume messages using Avro and the Schema Registry, you first need to define the data schema. An Avro schema in Kafka is defined using JSON. 

The example Avro schema defines a simple payment record with two fields: id, defined as a string, and amount, defined as a double.
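
As a sketch (the record name and namespace are assumptions, not taken from this article), the schema would look something like this:

  {
    "type": "record",
    "name": "Payment",
    "namespace": "io.example.payments",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "amount", "type": "double"}
    ]
  }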

A Kafka Avro Schema Registry example can be found here.  It covers how to generate the Avro object class. 

Centralized Schema Management

When you start modifying schemas you need to take into account a number of issues: whether to upgrade consumers or producers first; how consumers can handle old events that are still stored in Kafka; how long you need to wait before upgrading consumers; and how old consumers will handle events written by new producers. 

These issues are discussed in the following sections. 

Schema Evolution and Compatibility

An important aspect of data management is schema evolution. After the initial schema is defined, applications may need to evolve it over time. When a format change happens, it’s critical that the new message format does not break the consumers. 

From a Kafka perspective, schema evolution happens only during deserialization at the consumer (read). If the consumer’s schema is different from the producer’s schema, then the value or key is automatically modified during deserialization to conform to the consumer’s read schema if possible.
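
The snippet below is a minimal sketch of that resolution step using the plain Avro API outside Kafka (the schemas and field names are illustrative); it assumes a consumer whose read schema has dropped the amount field. The Kafka Avro deserializer performs the equivalent resolution internally.

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryDecoder;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.EncoderFactory;

  public class SchemaResolutionSketch {
      public static void main(String[] args) throws Exception {
          // Writer (producer) schema: two fields.
          Schema writerSchema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"}]}");
          // Reader (consumer) schema: the amount field has been removed.
          Schema readerSchema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");

          // Serialize a record with the writer schema.
          GenericRecord rec = new GenericData.Record(writerSchema);
          rec.put("id", "txn-1001");
          rec.put("amount", 99.95);
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, encoder);
          encoder.flush();

          // Deserialize with both schemas: Avro resolves the difference and simply
          // skips the amount field that the reader schema no longer declares.
          BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
          GenericRecord resolved =
              new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
          System.out.println(resolved); // prints something like {"id": "txn-1001"}
      }
  }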

Schema compatibility checking is implemented in Schema Registry by versioning every single schema. 

When a schema is first created for a subject, it gets a unique ID and a version number, i.e., version 1. When the schema is updated (and passes the compatibility checks), it gets a new unique ID and an incremented version number, i.e., version 2.

Compatibility Types

How a schema may change without breaking the consumer is determined by the compatibility type property defined for a schema in the Schema Registry. The Schema Registry supports four base compatibility types: Backward, Forward, Full, and None, along with transitive variants of the first three. While there are some differences between the Avro, Protobuf, and JSON Schema formats, the rules are as follows:

BACKWARD compatibility means that consumers using the new schema can read data produced with the last schema. An example of a BACKWARD compatible change is the removal of a field. A consumer that was developed to process events without this field will be able to process events that were written with the old schema and contain the field; the consumer will just ignore that field.

BACKWARD_TRANSITIVE compatibility is the same as BACKWARD except consumers using the new schema can read data produced with any previously registered schemas. 

FORWARD compatibility means that data produced with a new schema can be read by consumers using the last schema, even though they may not be able to use the full capabilities of the new schema. An example of a FORWARD compatible schema modification is adding a new field.

FORWARD_TRANSITIVE compatibility is the same as FORWARD but data produced with a new schema can be read by a consumer using any previously registered schemas. 

FULL compatibility means the new schema is forward and backward compatible with the latest registered schema.

FULL_TRANSITIVE means the new schema is forward and backward compatible with all previously registered schemas.

NONE disables schema compatibility checks.

Examples 

If there are three schemas for a subject that change in order V1, V2, and V3:

BACKWARD: consumers using schema V3 can read data produced with schema V3 or V2.

BACKWARD_TRANSITIVE: consumers using schema V3 can read data produced with schema V3, V2, or V1.

FORWARD: data produced using schema V3 can be read by consumers with schema V3 or V2.

FORWARD_TRANSITIVE: data produced using schema V3 can be read by consumers with schema V3, V2, or V1.

FULL: BACKWARD and FORWARD compatibility between schemas V3 and V2.

FULL_TRANSITIVE: BACKWARD and FORWARD compatibility among schemas V3, V2, and V1.

Evolving Schemas

The compatibility type assigned to a topic also determines the order for upgrading consumers and producers. 

  • BACKWARD or BACKWARD_TRANSITIVE: there is no assurance that consumers using older schemas can read data produced using the new schema. Therefore, upgrade all consumers before you start producing new events.
  • FORWARD or FORWARD_TRANSITIVE: there is no assurance that consumers using the new schema can read data produced using older schemas. Therefore, first upgrade all producers to using the new schema and make sure the data already produced using the older schemas are not available to consumers, then upgrade the consumers.
  • FULL or FULL_TRANSITIVE: there are assurances that consumers using older schemas can read data produced using the new schema and that consumers using the new schema can read data produced using older schemas. Therefore, you can upgrade the producers and consumers independently.
  • NONE: compatibility checks are disabled. Therefore, you need to be cautious about when to upgrade clients.

Failing Compatibility Checks

Compatibility checks fail when the producer: 

  • adds a required field and the consumer uses BACKWARD or FULL compatibility,
  • deletes a required field and the consumer uses FORWARD or FULL compatibility.

Passing Compatibility Checks

Compatibility checks succeed when the producer: 

  • Adds a required field and the consumer uses FORWARD compatibility. 
  • Adds an optional field (one with a default value) and the consumer uses BACKWARD compatibility, as illustrated in the sketch after this list. 
  • Deletes optional fields and the consumer uses FORWARD or FULL compatibility. 
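
In Avro, “optional” effectively means the field declares a default value, so consumers on the new schema can fill the field in when older records do not contain it. As a hedged sketch (the currency field and its default are chosen purely for illustration), a backward compatible evolution of the payment schema could look like this:

  {
    "type": "record",
    "name": "Payment",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "amount", "type": "double"},
      {"name": "currency", "type": "string", "default": "USD"}
    ]
  }

Consumers using this new version can still read records written before the change: the missing currency field is simply populated with the default.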

Changing Compatibility Type

The default compatibility type is BACKWARD, but you may change it globally or per subject.
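
For example, assuming a Schema Registry listening on http://localhost:8081 (the URL and subject name here are placeholders), the compatibility type for a single subject can be changed with a request along these lines; omitting the subject (PUT /config) changes the global default instead:

  curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data '{"compatibility": "FORWARD"}' \
    http://localhost:8081/config/transactions-value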

Managing Schemas 

You manage schemas in the Schema Registry with its REST API, which allows the following operations: 

  • Stores schemas for keys and values of Kafka records
  • Lists schemas by subject
  • Lists all versions of a subject (schema)
  • Retrieves a schema by version
  • Retrieves a schema by ID
  • Retrieves the latest version of a schema
  • Performs compatibility checks
  • Sets compatibility level globally

For example, here is the curl command to list the latest schema in the subject transactions-value. 
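
Assuming the Schema Registry is reachable at http://localhost:8081 (adjust the host and port for your deployment), the request takes this form:

  curl -X GET http://localhost:8081/subjects/transactions-value/versions/latest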

The expected output would be:
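
Assuming the payment schema described earlier has been registered as the first version of the subject, the response is a JSON object along these lines (the id value will vary by deployment):

  {
    "subject": "transactions-value",
    "version": 1,
    "id": 1,
    "schema": "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"}]}"
  }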

Supported Clients

A Kafka Avro producer uses the KafkaAvroSerializer to send messages of Avro type to Kafka. The consumer uses the KafkaAvroDeserializer to receive messages of an Avro type. Schema Registry also supports serializers for Protobuf and JSON Schema formats. 
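
A minimal consumer sketch follows (the broker address, registry URL, group ID, and topic name are placeholders); the KafkaAvroDeserializer reads the schema ID from each message and fetches the matching schema from the registry before decoding the payload:

  import java.time.Duration;
  import java.util.Collections;
  import java.util.Properties;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class RegistryAwareConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");                   // placeholder broker
          props.put("group.id", "payments-reader");                           // placeholder group
          props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
                    "io.confluent.kafka.serializers.KafkaAvroDeserializer");  // fetches schema by ID
          props.put("schema.registry.url", "http://localhost:8081");          // placeholder registry URL

          try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(Collections.singletonList("transactions"));
              ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(5));
              for (ConsumerRecord<String, GenericRecord> record : records) {
                  System.out.printf("id=%s amount=%s%n",
                                    record.value().get("id"), record.value().get("amount"));
              }
          }
      }
  }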

Serializers and deserializers for Avro, Protobuf, and JSON Schema are currently available for C/C++, C#, Go, Python, and Java. You can also configure ksqlDB for use with the supported formats. 

Kafka Connect and Schema Registry work together to capture schema information from connectors. For additional information, see Using Kafka Connect with Schema Registry.

Where to Get Schema Registry

Instaclustr offers Kafka Schema Registry as an add-on to its Apache Kafka Managed Service. To take advantage of this offering, you can now select ‘Kafka Schema Registry’ as an option when creating a new Apache Kafka cluster.

Confluent includes Schema Registry in the Confluent Platform.

The Confluent Schema Registry for Docker containers is on Docker Hub.

You can also download the code for Kafka Schema Registry from GitHub.

Karapace is an open source version of the Confluent Schema Registry, available under the Apache 2.0 license.

Summary 

In Kafka, an Avro schema is used to apply a structure to a producer’s message. Schema Registry is an add-on to Kafka that enables developers to manage their schemas. Although not part of Kafka, it stores Avro, Protobuf, and JSON Schema schemas in a special Kafka topic. You manage schemas in the Schema Registry using its REST API. By the careful use of compatibility types, schemas can be modified without causing errors. 

Where to Find More Information