Why your data streaming environment needs a real-time database

These days, if almost everything is a stream, can you survive without a database?

December 21, 2023 | 11 min read
Steve Tuohy
Director of Product Marketing

Amid all the innovation and messaging in the data ecosystem is an indisputable path towards more data and more successful real-time use cases in production. The streaming ecosystem is one area that generates a lot of attention and often some confusion. This piece examines more closely how a real-time database thrives alongside streaming technologies. In short, they are essential technologies that work best in tandem and are rarely alternatives to each other.

Streaming enables more real-time use cases

As the messaging and queuing world of JMS and ActiveMQ has been pushed forward by distributed event streaming tools like Kafka and Pulsar, the scale and throughput of data in motion have exploded. This increase in data in motion has expanded the market for real-time databases like Aerospike, making them a differentiator for not just banks and AdTech firms but retailers, telecom providers, manufacturers with IoT data, and many more.

Some people wonder if the proliferation of streaming technologies and capabilities obviates the humble database. If everything is a stream, can you survive without a database?

The truth is that the high-performance, distributed database was a real-time component of companies’ data environments even before the widespread adoption of distributed streaming platforms like Kafka. The growth in streaming architectures is a significant advancement for the real-time database as it enables more organizations to broadly “go real-time.” Real-time databases commonly deliver workloads of over 1 million transactions per second with latencies under 100, 10, or even 1 millisecond at the 99th percentile. But honestly, this performance is only compelling in environments where the latencies of the other parts of the application pipeline are minimal. Suppose your application and your messaging queue take two seconds to process. In that case, a database with sub-millisecond transactions is not a game changer versus a more pedestrian database that does, say, 500 ms transactions.
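To put rough numbers on that argument, here is a minimal arithmetic sketch. The two-second pipeline and the 500 ms database come from the example above; the 0.5 ms figure standing in for "sub-millisecond" and the 5 ms "lean pipeline" are assumptions chosen only for illustration.

```python
# Illustrative numbers only. 0.5 ms stands in for "sub-millisecond" and the
# 5 ms lean pipeline is a hypothetical figure, not a measurement.
SLOW_PIPELINE_MS = 2000.0  # application + messaging queue from the example
LEAN_PIPELINE_MS = 5.0     # assumed low-latency pipeline


def end_to_end_ms(database_ms: float, pipeline_ms: float) -> float:
    """Total response time when the database call is one step in the pipeline."""
    return pipeline_ms + database_ms


def relative_gain(pipeline_ms: float, slow_db_ms: float, fast_db_ms: float) -> float:
    """Fraction of end-to-end time saved by swapping the slow database for the fast one."""
    slow = end_to_end_ms(slow_db_ms, pipeline_ms)
    fast = end_to_end_ms(fast_db_ms, pipeline_ms)
    return (slow - fast) / slow


gain_in_slow_pipeline = relative_gain(SLOW_PIPELINE_MS, 500.0, 0.5)
gain_in_lean_pipeline = relative_gain(LEAN_PIPELINE_MS, 500.0, 0.5)

print(f"slow pipeline: {gain_in_slow_pipeline:.0%} faster end to end")
print(f"lean pipeline: {gain_in_lean_pipeline:.0%} faster end to end")
```

Under these assumptions, the faster database trims about a fifth off the two-second pipeline but removes nearly all of the response time in the lean one, which is why the surrounding architecture matters so much.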

Financial services and AdTech firms were the early adopters of distributed NoSQL databases, and they would typically connect the database directly or closely to data sources and destinations. Even today, the customers with the most demanding applications typically connect directly to a real-time bidding API, and organizations like TransUnion extend their database into financial organizations’ data centers to reduce the latency in fraud detection deployments.

Data streaming platforms typically need a database

Streaming has gotten so popular that some think of it as a silver bullet and believe every data architecture should be built specifically for streaming. Again, some might argue that if everything is an event, you only need streaming topics, not databases and tables. Technically, you can make ksqlDB your database, but it is not a robust database at any scale. For real-time use cases that incorporate diverse and historical data, it won’t hold up.

The power of modern, distributed streaming services like Kafka and Pulsar is that they can handle huge numbers of stream publishers and subscribers. Newer streaming platforms get the most out of modern hardware and have accelerated speed and performance. Microservices application environments depend on the fluidity of data in motion, with many services consuming and contributing different subsets of data.

Consider a basic online game. A player’s clicks are added to a stream topic, a server subscribed to the topic applies an action, and the server publishes the result back to the game environment. This real-time system doesn’t even need a database.

That said, much of this data in motion has historical value beyond the basic in-game mechanics and should be persisted. Many processes and decisions need to incorporate more than just the last few real-time clicks, as historical data aids in understanding player behaviors, detecting gameplay anomalies, and informing game development. This includes both streaming data of the past and other non-streaming data sources. Even fairly basic streaming architectures typically enrich real-time data streams with past data.
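As a rough sketch of that enrichment step, the snippet below joins each incoming click with persisted history before anything acts on it. A plain dictionary stands in for the database, and the event shape, field names, and player data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClickEvent:
    """A single in-game click arriving on the stream (hypothetical shape)."""
    player_id: str
    action: str


# Stand-in for a database of persisted player history (made-up data).
PLAYER_HISTORY = {
    "p1": {"lifetime_clicks": 12000, "flagged_before": False},
    "p2": {"lifetime_clicks": 40, "flagged_before": True},
}


def enrich(event: ClickEvent, history: dict) -> dict:
    """Merge the live event with whatever past data exists for that player."""
    past = history.get(event.player_id, {})
    return {"player_id": event.player_id, "action": event.action, **past}


stream = [ClickEvent("p1", "jump"), ClickEvent("p2", "fire"), ClickEvent("p3", "jump")]
enriched = [enrich(event, PLAYER_HISTORY) for event in stream]

for record in enriched:
    print(record)
```

Each lookup is a random read by key, which is the access pattern a database serves well. A player with no history, like "p3" here, simply passes through unenriched.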

Streaming systems are also not optimized for random data access patterns like most databases are. Use cases like inventory management, customer support, and preventive maintenance are among many that often require low-latency access to specific data points.

Everything adds latency – even streaming

Streaming technologies are rightly associated with real-time use cases. But like every additional step in a data flow, they do introduce latency. Many use cases benefit from connecting an application directly to a high-performance database and omitting any streaming steps.

As fast as distributed streaming has become, it is still an additional stop in an end-to-end data process. Redpanda, a Kafka-compatible alternative, aims to reduce Kafka’s latency and add efficiency. In one Kafka versus Redpanda comparison, even the best latencies are four milliseconds, with many tail latencies of 100+ milliseconds, and much higher for larger workloads.

For many demanding applications, it often makes more sense to connect the database straight to the source and/or the destination API. As described before, those who are winning in milliseconds, like financial services and AdTech, still architect their priority applications to minimize the hops, including bypassing streaming technologies altogether.

Data consistency for distributed systems

Organizations that need to deliver consistency guarantees across distributed deployments cannot rely on streaming alone. Streaming systems focus on high availability, not strong consistency, i.e., they can’t provide a mode that balances both consistency and the partition tolerance of distributed deployments.

The CAP theorem encapsulates the tradeoffs of distributed systems. Because a system is distributed across nodes, when a network partition separates them or a node fails, the system must choose: respond to every request, even if the answer may be stale (availability), or return only the most current write, which can mean returning an error (consistency). The creators of Kafka at LinkedIn designated it as a CA system, meaning it is not designed to tolerate network partitions. If your network fails, you cannot rely on streams as a consistent source of truth.

Think about how an application hits a database to know the current value of a record. Imagine there are 20 credit card transactions in a streaming topic for an individual. Each one impacts whether they’re exceeding their credit limit. The decision whether to allow their next purchase depends on where you are in the topic. Because the topic is an eventually consistent model, there is no assurance that all previous purchases have been evaluated before the credit limit is computed, so using the streaming topic as a database would fail in this use case.
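The credit limit scenario can be sketched in a few lines. This is a simplified model, not Kafka code: a Python list stands in for the topic, an integer offset stands in for how far a consumer has read, and the limit and purchase amounts are made up.

```python
CREDIT_LIMIT = 1000.0

# 20 earlier purchases of 45.00 each, in arrival order (hypothetical amounts).
topic = [45.0] * 20


def approve_from_stream(events, consumer_offset, new_purchase, limit=CREDIT_LIMIT):
    """Decide using only the events this consumer has processed so far."""
    balance_seen = sum(events[:consumer_offset])
    return balance_seen + new_purchase <= limit


def approve_from_database(current_balance, new_purchase, limit=CREDIT_LIMIT):
    """Decide against a single, current balance record."""
    return current_balance + new_purchase <= limit


# A consumer that has read only 10 of the 20 events sees a balance of 450.
lagging_decision = approve_from_stream(topic, consumer_offset=10, new_purchase=150.0)

# A consumer that has read everything sees 900, as does the database record.
caught_up_decision = approve_from_stream(topic, consumer_offset=20, new_purchase=150.0)
database_decision = approve_from_database(sum(topic), new_purchase=150.0)

print(lagging_decision, caught_up_decision, database_decision)
```

The lagging consumer approves a purchase that pushes the balance past the limit, while the decision made against the current balance declines it. The outcome depends entirely on the reader's position in the topic.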

Similarly, suppose you need to quickly load and process many records that are updated frequently, say 10 or more updates per second. In a streaming-only architecture built on Kafka, identifying the latest version of a record requires a time-consuming scan of the entire Kafka topic. Deciding whether to take a value at a specific timestamp or the current value at the head of the topic can add a further challenge, since Kafka is optimized for sequential, log-based reads, not for high-speed random reads. Consider this challenge in a real-time application like a stock trading platform, where real-time access to the latest data is crucial. While Kafka excels at managing large data streams, its design is less suited for scenarios requiring immediate, random data access.
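The difference between the two access patterns can be illustrated with a toy model. Here a list of (key, value) tuples stands in for an append-only topic and a dictionary stands in for a keyed database table; neither reflects a real Kafka or Aerospike API, and the ticker prices are invented.

```python
# Append-only log of (key, value) updates, oldest first (made-up prices).
log = [
    ("AAPL", 190.1),
    ("MSFT", 370.0),
    ("AAPL", 190.4),
    ("MSFT", 369.5),
    ("AAPL", 190.2),
]


def latest_from_log(records, key):
    """Sequentially scan the whole log; return (latest value, records examined)."""
    latest = None
    examined = 0
    for record_key, value in records:
        examined += 1
        if record_key == key:
            latest = value
    return latest, examined


def build_table(records):
    """Apply each update to a keyed table so the newest value overwrites the old."""
    table = {}
    for record_key, value in records:
        table[record_key] = value
    return table


table = build_table(log)

value_from_scan, records_examined = latest_from_log(log, "AAPL")
value_from_table = table["AAPL"]  # one keyed lookup, independent of log length

print(value_from_scan, records_examined, value_from_table)
```

Both paths return the same price, but the scan touches every record in the log on each request, whereas the keyed lookup does constant work no matter how many updates have accumulated.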

Streaming raises all databases – some more than others

Streaming raises the bar for real-time capabilities in many organizations’ environments. When we talk to potential new customers, it’s a great sign if they have embraced Kafka, Pulsar, or Redpanda because it indicates they appreciate the benefits of low latency and performance at scale. As discussed, a real-time database is less helpful when surrounded by a wider architecture with high latency and low performance.

On the other hand, some less performant databases embrace Kafka and other streaming tools because they can be a way to make up for their database shortcomings. In high-volume environments, organizations sometimes have to put a queue in front of their database because the ingest speed of the database can’t keep up. The streaming engine or queue handles all the streaming events, and it buffers the writes for the database.

However, databases that deliver low latency at high throughput do not need this crutch. A real-time database ingests data faster than streaming at scale.
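A small simulation shows why the buffer becomes necessary only when the database is the slower side. The rates below are arbitrary units per tick, chosen purely for illustration, and the function name is hypothetical.

```python
from collections import deque


def simulate_backlog(ingest_per_tick, db_writes_per_tick, ticks):
    """Return the queue depth after each tick for the given arrival and write rates."""
    queue = deque()
    depths = []
    next_event_id = 0
    for _ in range(ticks):
        # Events arrive from the stream and wait in the buffer.
        for _ in range(ingest_per_tick):
            queue.append(next_event_id)
            next_event_id += 1
        # The database drains as many buffered writes as it can this tick.
        for _ in range(min(db_writes_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


# Database slower than the stream: the backlog grows every tick.
slow_db_depths = simulate_backlog(ingest_per_tick=100, db_writes_per_tick=60, ticks=5)

# Database faster than the stream: the buffer stays empty.
fast_db_depths = simulate_backlog(ingest_per_tick=100, db_writes_per_tick=150, ticks=5)

print(slow_db_depths)
print(fast_db_depths)
```

With the slower database, the backlog grows by 40 events per tick and the data the application reads falls further behind. With the faster one, the queue never holds anything at the end of a tick, so the buffer adds a hop without adding value.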

Narrowing the operational-analytical gap

One primary streaming use case is moving data between operational and analytical data systems. In the past, this was accomplished through batch ETL jobs that moved data from an operational database to a data warehouse once or twice daily. Now, there are modern databases, countless data sources, streaming engines, and cloud warehouses, and the cadence is by the hour, minute, or second. The data not only flows to analytics use cases, but increasingly, the analyzed and modeled data flows back to operational systems.

Business Intelligence tools and dashboards from modern data warehouses like Snowflake have improved so much that users expect reports to be current and “near real-time.” This is leading organizations to feed analytic data back into real-time operational workloads. Native Snowflake connectors, Kafka, and other streaming tools can move the data back to the operational systems.

This need for real-time data has expanded the opportunity for real-time operational databases. When customers start depending on Snowflake not just for historical analytics but in an operational capacity, there are limits to how fast Snowflake can serve and ingest large volumes. Some customers connect real-time databases like Aerospike directly to their Snowflake warehouse as a cache for their real-time demands.

An alternative approach to narrowing the operational-analytical gap is eliminating the ETL pipelines and managing operational and analytical workloads from a single database. A hybrid transactional/analytical processing (HTAP) database (also referred to as a translytical database) can eliminate the delay and complexity of transferring data. Similarly, tools like Trino (formerly PrestoSQL) can provide the analytical depth directly on data that remains in the operational database, querying and analyzing this data without needing ETL or streaming to a separate warehouse.

Aerospike – enhancing the streaming world

The areas above highlight the myriad challenges and opportunities in the dynamic world of data streaming. Streaming technologies have revolutionized how data movement is handled, but they also reveal the need for robust, efficient databases capable of complementing these advancements. This is where solutions like Aerospike come into play, offering capabilities and integrations that elevate the potential of streaming data architectures. Let’s delve deeper into how Aerospike fits into this evolving landscape and enhances it with its features and integrations.

Aerospike Connect – architected for performance

Aerospike was designed as a shared-nothing, distributed database from the start. It exploits the latest computing capabilities and economics to address internet-scaling needs. The wide adoption of streaming platforms has similar roots. Kafka and Pulsar, in particular, can horizontally scale across independent nodes nearly without limit. There are no bottlenecks when ingesting streams to Aerospike nor when publishing from Aerospike to stream topics.

The Aerospike Connectors use our unique architecture to work hand-in-hand with the streaming tools to match their throughput without adding latency. The same technology we use for providing active-active configurations across geographic zones is behind these Connectors. Specifically, Aerospike Cross-Data Center Replication (XDR) uses a change data capture (CDC) mechanism (Change Notifications) to ingest from and push to Kafka, Pulsar, JMS, and other streaming technologies.

Aerospike Connect – tight integration

To bridge the operational-analytical gap within a single database framework, Aerospike provides native secondary index capabilities and integrates tightly with Trino and Presto. Aerospike Connect for Presto-Trino and Aerospike SQL enable robust analytics directly on the Aerospike database. This integration approach showcases Aerospike’s architecture, which is adept at managing mixed workloads and multi-tenancy, efficiently handling read-heavy analytics alongside real-time operational workloads.

The range of deployment preferences among users is diverse, with many opting to transfer data into specialized analytical systems. To facilitate this, Aerospike connectors come equipped with outbound features, allowing for seamless integration into a variety of architectural setups. These connectors offer significant advantages over custom coding or third-party integration platforms in terms of efficiency and time savings. The rise of streaming platforms has notably enhanced capabilities in real-time analytics. Aerospike’s connectors, including those for Kafka, Pulsar, and JMS, seamlessly incorporate Aerospike data into broader analytics pipelines. Aerospike Spark and Elasticsearch connectors provide further integration for search and advanced analytics.

Better together

In the rapidly evolving data landscape, achieving real-time capabilities is critical, and streaming technologies play a pivotal role in meeting this demand. Aerospike pairs seamless integration with streaming platforms with a robust real-time database architecture.

To see specific examples of this integration, check out Aerospike customer architectures and use cases at Cybereason, DBS Bank, Dream11, Globe Telecom, and Wayfair. Each runs streaming platforms alongside Aerospike for low latency, high performance, and efficient data processing.