Developing applications for speed, scale, and efficiency in Aerospike 7+
Explore how Aerospike 7's enhanced features support high-speed data access and improve performance for real-time bidding and large-scale applications.
Aerospike was designed for applications that need predictable sub-millisecond access to billions or trillions of objects and must store many terabytes or petabytes of data while maintaining a small cluster size and low operational costs. Combining very large data volumes with a small cluster requires high-capacity, high-speed data storage on each node.
Aerospike pioneered database technology by effectively using SSDs for high-capacity, high-speed, persistent storage per node. Key innovations include:
Accessing SSDs like direct addressable memory for superior performance
Supporting a Hybrid Memory Architecture for index and data in DRAM, PMEM, or SSD
Implementing proprietary algorithms for consistent, resilient, and scalable storage across cluster nodes
Providing a Smart Client for single-hop data access that automatically adapts to cluster changes
Simply put, the Aerospike Database enables application performance at scale. However, as with all powerful tools, you can only tap into that power by thoroughly understanding your requirements and planning carefully. For example, The Trade Desk, the world’s largest advertising firm after Google and Facebook, processes around 10 million queries a second (800 billion queries per day). You don’t build a global, real-time bidding platform and grow it into a worldwide powerhouse without careful planning.
Businesses rely on applications for success much in the same way that applications rely on databases for success. The whole value equation is driven by understanding and satisfying business requirements. The foundation of the tech stack is the database, which allows you to drive the value equation with careful planning and implementation.
For this reason, we will dive into two new and improved features – secondary indexes (now on BLOBs) and persistent map indexes – that developers can leverage to improve speed, scale, and efficiency.
We already have a large collection of resources about designing and planning, such as Data modeling in Aerospike, Data Modeling for speed at scale and Data modeling for speed at scale (Part 2). This blog post reviews new and deprecated features and their implications for application performance, scalability, and efficiency.
New features
Aerospike Database 7 introduces multiple improvements.
Secondary indexes on BLOBs
The Aerospike Database has long been able to efficiently retrieve data matching a particular predicate across all the data in the cluster with secondary indexes. Secondary indexes key on bin values (like an RDBMS column), similar to using a WHERE clause in SQL. Aerospike Database can also store a binary large object (BLOB) in a bin. However, BLOBs could only be queried by key, not by the information contained in the BLOB.
Now, a secondary index can be created to query BLOB bins. As you may know, BLOBs can hold any binary data, such as images, videos, spreadsheets, documents, archives/backups, executable files, data files, and configuration files. A common example of using a BLOB in a database is storing information (like identifiers) in a binary format for compactness of data. Secondary indexes on BLOBs can then be used to find this data.
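To make the idea of looking up records by the content of a BLOB concrete, here is a minimal sketch in plain Java. It is not the Aerospike API: the class BlobIndexSketch, the 12-byte layout (an 8-byte device ID followed by a 4-byte campaign ID), and the record keys are all hypothetical, chosen only to show how compactly packed identifiers can be matched by equality on their bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a secondary index on a BLOB bin.
// It is NOT the Aerospike API; it only illustrates equality lookup on packed bytes.
class BlobIndexSketch {
    // ByteBuffer hashes and compares by content, so it can key a map by byte[] value.
    private final Map<ByteBuffer, List<String>> index = new HashMap<>();

    // Pack two identifiers into a fixed 12-byte blob (assumed layout: long + int).
    static byte[] pack(long deviceId, int campaignId) {
        return ByteBuffer.allocate(12).putLong(deviceId).putInt(campaignId).array();
    }

    // Index a record's blob value under its (hypothetical) primary key.
    void put(String recordKey, byte[] blob) {
        index.computeIfAbsent(ByteBuffer.wrap(blob.clone()), k -> new ArrayList<>())
             .add(recordKey);
    }

    // Equality query: keys of all records whose blob equals the given bytes.
    List<String> query(byte[] blob) {
        return index.getOrDefault(ByteBuffer.wrap(blob), List.of());
    }

    public static void main(String[] args) {
        BlobIndexSketch idx = new BlobIndexSketch();
        idx.put("rec-1", pack(42L, 7));
        idx.put("rec-2", pack(42L, 8));
        idx.put("rec-3", pack(42L, 7));
        System.out.println(idx.query(pack(42L, 7))); // prints [rec-1, rec-3]
    }
}
```

The sketch indexes by exact byte content, which is why a fixed, deterministic packing matters: two blobs that encode the same identifiers must be byte-for-byte identical to match.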
Aerospike stores secondary indexes in memory by default for fast lookup. Secondary indexes are built on every node in the cluster and co-located with the primary index. Each secondary index entry only contains references to records located on that node, and queries run in parallel across the cluster. Aerospike tools or the API are used to dynamically create and remove indexes based on bins and data types.
We’ve removed the need to download a BLOB to operate on it. Aerospike supports a rich set of bitwise operations to manipulate large BLOB bins on the server without needing to pull the entire BLOB to the client. This simplifies the code and workflow while removing a costly download/upload.
Persistent key-ordered map indexes
Aerospike records are made up of schema-less bins (fields). A bin can hold collection data types (CDTs) such as List (object field values are stored in a specific order) and Map (object fields stored as key-value pairs). Lists and Maps can also be nested within other CDTs, allowing for arbitrarily complex data structures.
You can now persist value-ordered Map indexes to improve performance: a value-sorted index is stored along with the record, so operations on the values (not the keys) become faster. This costs a little more storage, but the performance improvements for value-based operations are dramatic because the elements are already sorted. The new MapPolicy option can be applied when creating a new Map bin or with the Map set_type operation.
MapPolicy(MapOrder order, int flags, boolean persistIndex)
Let’s dig into this functionality a bit. This is a useful technique for systems that require ultra-fast data access, such as real-time bidding in AdTech, but only if they’re using value-style operations. An ad platform may have 10 milliseconds to determine the best ad to show a user browsing a website, but in those 10 milliseconds, the platform must assess the user’s interests, browsing history, and other relevant data. User profiles store information about users, typically audience segmentation data pertaining to the user’s interests. Typically, a user is identified by a cookie or device ID, rather than as an individual.
Audience segmentation retrieval for real-time bidding occurs tens of millions of times per second. AdTech companies bid to display an ad on users’ screens based on what ads they have been paid to display and how likely they believe the user is to click on the ad. This requires maintaining a list of interests for each user, and since someone is usually only interested in something for a period of time, AdTech platforms have user interests time out in order to target bids using the most current information. For example, if someone has not viewed information about laptops in the past 30 days, then they are probably no longer shopping for a laptop. Many AdTech companies expire audience segments after 30 days.
Many AdTech companies attack this problem with Aerospike. They store a single record per user with a primary key of the user ID. Interests are stored in a Map with the segment ID as the Map key, and the expiry time and other information as the Map value. A call to removeByValueRange can be used to remove expired segments.
Prior to Aerospike 7, there was a CPU cost to this operation for records stored on SSD because Maps only persisted the data sorted by the key. The index pointing to the values in value order was not persisted, resulting in the need for Aerospike to perform a sort every time values needed to be accessed in order. Aerospike 7 provides the ability to persist these indexes. The trade-off is exactly what you’d expect - you’ll need extra space on SSD to store the indexes, but you won’t need to use CPU cycles to perform the sort.
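The trade-off can be illustrated with a small, self-contained sketch in plain Java. It does not use the Aerospike client or reproduce the server's internals: the class SegmentExpirySketch and its methods are hypothetical, and the Map value is simplified to just an expiry timestamp. It keeps a second, value-ordered structure alongside the key-addressed data, so removing everything that expires before a given time is a range operation on already-sorted entries rather than a sort on every call.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of one user's segment Map with a maintained value-ordered index.
// Not the Aerospike implementation; it only illustrates the space-for-CPU trade-off.
class SegmentExpirySketch {
    // Primary data: segment ID -> expiry time (simplified to a single long).
    private final Map<Integer, Long> segments = new HashMap<>();
    // Extra storage: expiry time -> segment IDs, kept sorted by expiry.
    private final TreeMap<Long, Set<Integer>> byExpiry = new TreeMap<>();

    // Insert or refresh a segment, keeping the value-ordered index in sync.
    void put(int segmentId, long expiry) {
        Long old = segments.put(segmentId, expiry);
        if (old != null) {
            Set<Integer> ids = byExpiry.get(old);
            if (ids != null) {
                ids.remove(segmentId);
                if (ids.isEmpty()) byExpiry.remove(old);
            }
        }
        byExpiry.computeIfAbsent(expiry, k -> new HashSet<>()).add(segmentId);
    }

    // Remove every segment whose expiry is strictly before 'now'; return the count.
    int removeExpired(long now) {
        Map<Long, Set<Integer>> expired = byExpiry.headMap(now, false);
        int removed = 0;
        for (Set<Integer> ids : expired.values()) {
            for (int id : ids) {
                segments.remove(id);
                removed++;
            }
        }
        expired.clear(); // headMap is a view, so this deletes the range from byExpiry
        return removed;
    }

    int size() { return segments.size(); }

    boolean contains(int segmentId) { return segments.containsKey(segmentId); }

    public static void main(String[] args) {
        SegmentExpirySketch profile = new SegmentExpirySketch();
        profile.put(101, 1000L);
        profile.put(102, 2000L);
        profile.put(103, 3000L);
        int removed = profile.removeExpired(2500L);
        System.out.println("removed=" + removed + " remaining=" + profile.size());
        // prints removed=2 remaining=1
    }
}
```

Without the byExpiry structure, each removal would have to scan or sort all values first. With it, the cost moves to extra space and a little bookkeeping on every write, which mirrors the trade-off described above.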
A more detailed explanation and data model are available in Building a fast user profile store for real-time bidding.
Removed or deprecated features
The following features were removed or deprecated in order to improve performance and simplify configuration and usage.
Single-bin. Support for single-bin namespaces was removed in Aerospike Database 6.4 to support the unified storage format. Single-bin namespaces provided savings of several bytes per record, an appealing feature when Aerospike Database was a new product, flash storage was much slower, and SSD and NVMe drives were much smaller. Today, common modern flash drives are significantly faster and larger, Aerospike’s storage formats are far more efficient than they were when single-bin was introduced, and Aerospike EE has a compression feature that helps reduce disk storage consumption. See Aerospike Database 6.4: Improved query and data distribution.
Data-in-index. Support for single-bin namespaces and data-in-index were removed in Aerospike Database 6.4 to enable the unified storage format. This optimization, no longer necessary, stored integer or float values in index space for faster retrieval. Removing single-bin and data-in-index frees up primary index (PI) space. See Aerospike Database 6.4: Improved query and data distribution.
The limit of 64K unique bin names per namespace was removed in version 7.0. This means operators no longer need to advise developers to restrict how many bin names their applications write into a namespace. As a result, the bins info command, the bin_names_quota, and the available_bin_names namespace statistics were removed in version 7.0.
The limit on unique set names per namespace was raised from 1023 to 4095, allowing for set-level segregation of more tenants on the same Aerospike cluster.
The write-block-size configuration item was removed in version 7.1. Users are advised to use flush-size and max-record-size instead.
The device_size and memory_size expressions are now deprecated. Use the new expression record_size instead.
For storage-engine memory namespaces without backing, the commit-to-device option is no longer configurable in 7.1, as the behavior is now automatic.
In the Aerospike Tools package, the truncate command was removed from the aql tool in version 10.0.0, which supports Aerospike Database 7.0.
Continuously improving the developer experience
Aerospike Database 7 includes a number of improvements that can affect application speed, scale, and efficiency.
Everyone knows that the true value of any tool lies in the hands of the user. Our expertise builds capabilities into Aerospike, and your expertise unlocks agility and innovation. It’s challenging to build always-on global cloud applications – you need a database that redefines the limits as you build expertise and push applications further.