How We Built the New JSON API for Cassandra and Astra DB

Recently, we began to consider how to make Apache Cassandra more accessible to a wider audience of developers, particularly for the largest community of all: Node.js developers. JSON is an important part of that developer ecosystem because of its flexibility.

Many Node.js applications use an object document mapper called Mongoose.js that simplifies the process of converting JavaScript objects to and from JSON documents stored in a document database. Mongoose has approximately 2 million downloads a week on npm, and 3.7 million public GitHub repositories list it as a dependency.

We began looking at the Mongoose project as a representation of the kinds of data-access patterns that Node.js developers need, especially in terms of filtering, projection and updating. This includes features such as:

Most of these data access patterns aren’t supported by Cassandra out of the box. Because it’s a distributed, table-oriented database, developers were traditionally encouraged to denormalize data to support reads and to prioritize “no look” upserts. So we began to explore how we could provide an API on top of Cassandra that could implement these access patterns with improved scalability and performance.

Our goal became to build the best backend for Mongoose. To accomplish this, we designed an API called the JSON API that is usable by Mongoose.js with only a configuration change, or from any other language via the HTTP API. Earlier this month, we announced that the new JSON API is available in DataStax Astra DB vector databases and can also be used as part of the open source Stargate project, a data API gateway, against self-hosted Cassandra clusters.

In this article, we’ll explain the details of the JSON API design and describe how it takes advantage of new Cassandra features to yield a rich set of document-oriented functionality for Node.js with demonstrably good performance and scalability.

The key elements of this architecture are shown below.

Client applications include the Mongoose JavaScript library along with the stargate-mongoose driver, both packages available via npm. The developer just needs to configure the JSON API endpoint and things are ready to go.

In designing the JSON API, we discovered that we could push the vast majority of the querying and filtering logic that Mongoose requires down into the Cassandra nodes themselves with a few key enhancements, especially improvements to the Storage Attached Index (SAI) implementation first introduced in Cassandra 5.0.

To understand our design approach for the JSON API, it’s helpful to take a quick look back to set some context. Our first attempt back in 2021 at building a document-style API on top of Cassandra was the Stargate Docs API, based on a “document shredding” approach. While we were able to make some performance optimizations to this API, a key challenge was that the original shredding approach broke each document into components spread across multiple Cassandra rows.

Although the resulting schema provided useful flexibility to help implement some of the desired “exact match” filtering operations, more complex filtering required overfetching documents and filtering them in memory. This design also required multiple queries for document insertion, retrieval, update and deletion operations. This hurt performance and added complexity, because consistency had to be ensured across multiple rows while those queries were in flight.

For the JSON API, we’re using an improved approach known as “super shredding,” which Aaron Morton described in a recent talk. The design of super shredding was developed via a logical thought process to create a performant, scalable solution:

Although the Mongoose.js API provides user-level control over indexing, we decided to build the JSON API to support efficient querying on all fields in a document without the user needing to create indexes.

To achieve this, we separated the two concerns of the schema:

Separating the two concerns let us optimize the design for each one, which created a more robust model.

Consider the following example, which we’ll use to describe the super shredding table schema and how it works.

The JSON API supports the concepts of namespaces and collections, which correspond to Cassandra keyspaces and tables, respectively. If a user created a namespace called purchase_database and a collection called purchase, the following Cassandra table would be created.

Let’s look at how these columns are used. Several of them are always populated for every document (row) that is inserted:

Other columns are optionally populated based on the contents of the document in order to support various application queries:

We created SAI indexes on the exist_keys, query_* and array_* columns to support fast filtering on lookups. We’ll see an example of this below.

Next we’ll see what happens when a client application inserts a JSON document. We’ll focus on what happens in the JSON API and how it uses Cassandra. For a view of what the experience is like from the perspective of a client JavaScript application, see the blog post “Build a Text and Image Search App with Astra DB Vector Search, NodeJS, Stargate’s New JSON API, and Stargate-Mongoose.”

Let’s assume the client application inserts the following document:

As you can see, this document has some nesting, including arrays and subdocuments, and uses a timestamp to represent the purchase date. This document provides a useful example for examining various types of queries, and in fact it is used in the JSON API Postman collection, which you can use to learn more about the JSON API.

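
The idea of populating path-keyed columns such as exist_keys and query_text_values from a nested document can be sketched roughly as follows. This is a minimal illustration, not the JSON API's actual shredder: the function `shred`, the sample document, and the column names `query_dbl_values` and `query_bool_values` are assumptions made for the example, and arrays (the array_* columns) are deliberately left out.

```python
from typing import Any


def shred(doc: dict[str, Any]) -> dict[str, Any]:
    """Flatten a nested document into path-keyed lookup structures.

    Illustrative only: the real JSON API shredding rules are not shown
    in the article, so the details here are assumptions.
    """
    shredded: dict[str, Any] = {
        "exist_keys": set(),        # every dotted path present in the document
        "query_text_values": {},    # dotted path -> string value
        "query_dbl_values": {},     # hypothetical column: path -> number
        "query_bool_values": {},    # hypothetical column: path -> boolean
    }

    def walk(value: Any, path: str) -> None:
        if path:
            shredded["exist_keys"].add(path)
        if isinstance(value, dict):
            # Descend into subdocuments, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, bool):
            # bool is checked before numbers because bool is a subclass of int.
            shredded["query_bool_values"][path] = value
        elif isinstance(value, (int, float)):
            shredded["query_dbl_values"][path] = float(value)
        elif isinstance(value, str):
            shredded["query_text_values"][path] = value
        # Lists and nulls only register their path in this sketch.

    walk(doc, "")
    return shredded


# Hypothetical sample document, loosely modeled on a purchase record.
example_doc = {
    "purchase_type": "Online",
    "customer": {"name": "Jim A.", "address": {"city": "New York"}},
    "amount": 131.05,
    "preferred_customer": True,
}
shredded = shred(example_doc)
print(shredded["query_text_values"]["customer.address.city"])
```

Because every field ends up in a map keyed by its full dotted path, a filter on a nested field can become a single map-entry lookup in one row instead of a join across several rows.
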
One thing we should note while we’re talking about document structure: While use of Mongoose does require creation of a schema (it is, after all, an Object-Document Mapper), the JSON API itself does not enforce any schema, so you can insert whatever documents you wish if using it directly.

Another key design goal of the JSON API is not to leak “Cassandra-isms” into the interface. If you’re a Cassandra user, you know some of these, such as the Partition Key versus the Clustering Keys. Users will not see any CQL or “Cassandra-isms” when using the API or stargate-mongoose. However, curious Cassandra developers will be interested to see the CQL row that was inserted into the purchase table, which looks something like the output below (with some formatting and values omitted for readability):

We’ll focus on the contents of the query_text_values field to demonstrate other aspects of the super shredding design.

Next, let’s look at what happens if the client application queries for documents with a specific city. Here is the JSON API query:

The JSON API takes this query and interprets the requested value for customer.address.city as a string. Therefore, it performs the following CQL query using the query_text_values column:

SELECT key, tx_id, doc_json FROM purchase_database.purchase WHERE query_text_values['customer.address.city'] = 'New York'

The document inserted above will match this query. This query works because when the client application created the purchase collection, the JSON API created an SAI index on the entries of the query_text_values column:

CREATE CUSTOM INDEX IF NOT EXISTS purchase_query_text_values ON purchase_database.purchase (entries(query_text_values)) USING 'StorageAttachedIndex';

This is a simple equality query, but SAI also supports more complex inequality or NOT queries.

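
The translation from a document-style filter to the CQL shown above can be sketched as a small function. This is only an illustration under stated assumptions: `build_text_equality_query` is a hypothetical helper, the single-field filter shape `{"path": "value"}` is assumed because the article's query example is not reproduced here, and the real JSON API handles many more operators and types.

```python
import re

# Conservative check for keyspace/table names, since identifiers cannot be
# passed as bind parameters and are interpolated into the statement text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_text_equality_query(namespace: str, collection: str,
                              filter_doc: dict[str, object]) -> tuple[str, tuple[str, str]]:
    """Turn a one-field string equality filter into CQL plus bind values.

    Hypothetical helper for illustration; not part of the JSON API.
    """
    for name in (namespace, collection):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    if len(filter_doc) != 1:
        raise ValueError("this sketch supports exactly one filter field")

    ((path, value),) = filter_doc.items()
    if not isinstance(value, str):
        # Other types would target other query_* columns in this design.
        raise TypeError("this sketch only handles string values")

    # The map key (dotted path) and the value are both bound as parameters,
    # so the lookup is served by the entries() index on query_text_values.
    cql = (
        f"SELECT key, tx_id, doc_json FROM {namespace}.{collection} "
        "WHERE query_text_values[?] = ?"
    )
    return cql, (path, value)


cql, params = build_text_equality_query(
    "purchase_database", "purchase", {"customer.address.city": "New York"}
)
print(cql)
print(params)
```

Binding the path and value as parameters rather than formatting them into the statement avoids the quoting pitfalls of hand-built CQL, where string literals must use single quotes.
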
Make sure to check out the JSON API documentation to see all the supported options.

The JSON API also supports the rich set of update commands expected by Mongoose.js for partial or full documents, including unsetting fields or removing subdocuments, as well as optionally returning projections of the original or updated document.

For example, the following JSON API query could be used to unset the preferred customer field from a document and return the updated document:

This demonstrates some of the complexity of dealing with JSON types; in this case, a Boolean value can be true, false, null or unset, and the app can use unset, false or null to represent “not a preferred customer.”

To implement the requested update, the JSON API must use a read-modify-write pattern; that is, the service pulls the JSON document into memory, updates it, then writes it back to the database. The resulting write looks something like this:

The actual values have been omitted for brevity. Notice the use of the CQL IF clause, which checks that the tx_id still has the value that was obtained from the initial document read. The IF clause implies use of a lightweight transaction (LWT) to ensure consistent updates; in this case, it ensures that the document contents have not been changed since the document was read. While not all possible update commands strictly require this protection, the most correct design is to execute them all using this read-modify-write pattern. We designed this pattern to be able to take advantage of the new Accord-based transactions coming in Cassandra 5.0 for improved performance.

Experienced Cassandra users might have some questions about the performance implications of some elements of the design.

For example:

While these are valid concerns, it’s important to keep in mind that the performance expectations for a document store are different from those that many of us in the Cassandra community are accustomed to.

To validate that our implementation supports performance in keeping with the typical expectations of a document database, we executed benchmark tests of the JSON API running with a DataStax Astra serverless database. The tests were performed using NoSQLBench and the Fallout framework, with a variety of queries and documents of different size and complexity. You can find the files used to execute these tests in the JSON API GitHub repository.

One test consisted of a warmup phase with multiple concurrent inserts, followed by a main phase including seven different operations running in parallel. The test executed each operation 10,000 times at a rate of 25 operations per second, for a combined rate of 175 ops/s. The results are shown in the chart below:

As these results show, the JSON API was able to sustain consistent performance under a reasonably aggressive operational load. We’re continuing to work on performance testing and optimization, and look forward to taking advantage of improvements in Cassandra to get even more speed and scalability.

Coming up, we’ll explain how we extended the JSON API to handle vector search, including how that affects the super shredding design. Future enhancements of the JSON API include adoption of additional Cassandra features from 5.0 and beyond. For example, the new Accord feature will enable the JSON API to improve performance of document update queries. In the meantime, we’ll be working on hardening and performance improvements as we look toward an official general availability release in the near future.

But you don’t have to wait to start using this exciting new capability. Stargate-Mongoose is now available as an npm package, and we’ve just launched a public preview of the JSON API in DataStax Astra.

Please try these new releases and let us know what you think, whether online or in person. We’d love to see you at Aaron’s talk, “Fast and Flexible JSON Retrieval in Cassandra Using SAI,” at Cassandra Summit, held December 12-13 in San Jose, California.

For more details, see our JSON API documentation: “JSON API QuickStart with Mongoose” and “Developing with the JSON API.”



This post first appeared on VedVyas Articles.