
Data Access API over Data Lake Tables Without the Complexity

Data lake tables are mostly utilized by data engineering teams using big data compute engines, such as Spark or Flink, as well as by data analysts and scientists creating models and reports with heavy SQL query engines, such as Trino or Redshift. These compute engines have become the standard for accessing data in data lakes because they were designed to efficiently handle the challenges that big data wrangling typically involves: scanning large volumes of data, dealing with cloud-based object storage, reading and writing query-optimized file formats such as Parquet or ORC, and so on.

However, it is also a common requirement to make big data products (or some aggregated view of them) accessible to thinner clients, such as internal microservices, through some sort of API. Suppose we have a data lake table that stores real-time statistics about our customers, generated by some Spark application. This data might be used primarily for internal reporting, but it might also be valuable to other services in our organization. Although this is a common requirement, it is far from simple, mostly because it calls for quite a different tool set. Making Parquet files in an S3 bucket available through a low-latency HTTP API is not straightforward, especially when the files are continuously updated and some transformation is required before making them accessible.

To make such a use case work, we typically need a database that can process queries at customer-facing latency. Likewise, we need ETL jobs that can handle and transform the data files in S3 and load them into that database. Finally, we have to create a proper API endpoint to serve clients' queries. In short, giving thin clients the ability to query data lake files fast usually comes at the price of adding more moving parts and processes to our pipeline, in order to either copy and ingest data into more expensive customer-facing warehouses or aggregate and transform it to fit low-latency databases.

The purpose of this post is to explore and demonstrate a different and simpler approach to this requirement, using lighter in-process query engines. Specifically, I show how we can use in-process engines such as DuckDB and Arrow DataFusion to create services that can both handle data lake files and volumes and act as a fast in-memory store that serves low-latency API calls. Using this approach, we can consolidate the required functionality into a single query service, which can be horizontally scaled, that loads the data, aggregates and stores it in memory, and serves API calls efficiently and fast.

In what follows, Section 2 outlines the main requirements and building blocks for this service and explains how they help us solve the main challenges it involves. Section 3 delves into the core of the service, the data loading and query functionality, and demonstrates how it can be implemented using DuckDB and Go. (This post focuses on DuckDB and Go, but you can find an implementation of this concept in Rust using Arrow DataFusion in the repo linked below.) Section 4 adds the GraphQL API serving layer on top, and Section 5 concludes.

It should be noted that the basic assumption of this approach is that the data we want to make available through our API can fit in the memory of our service, or of the machine that runs it.
For some use cases this might be a problematic restriction, but I believe it is less restrictive than it seems. For example, I have used this approach with an in-memory relational table of 2M records and 10 columns, with memory usage of around 350MB. We tend to forget that the data we actually serve is often much smaller than the data we store or process. At any rate, it is an important consideration to keep in mind.

An independent service that can operate as a simpler and more attractive alternative to the common architecture described above should answer, at minimum, the following requirements: it should load and transform data directly from the source, store it efficiently in memory, and execute queries on it fast enough to serve customer-facing API calls. Simply put, instead of extending our application's infrastructure with a database and an additional ETL process, we would ideally like a single service that can do all three.

In my view, the combination of these 3 features is one of the greatest advantages that DuckDB (which is the focus of this post) and Arrow DataFusion bring with them. It is true that in-memory databases are nothing new, but the game changer with DuckDB and Arrow DataFusion is their extensibility: extensions make it easy to add capabilities for reading and writing data directly in different formats and locations, and to do so fast and at scale.

Therefore, our service will consist of 3 main components or layers that wrap or encapsulate each other: a low-level data component that encapsulates a DuckDB connection (to which I shall refer as a DataDriver), a DAO component that uses the driver to execute queries and handle API requests, and an API resolver that serves it. In other words, in terms of dependencies and their relations, we have the following structure: the API resolver encapsulates a DAO struct, which encapsulates a DataDriver struct, which encapsulates a DuckDB connection. The next section focuses on the lower layers (the DAO struct and the DataDriver), while the section that follows discusses the top API layer and how it brings everything together.

In this section, we are going to create the driver component that wraps a DuckDB connection. It will take care of initializing DuckDB and expose an interface for executing SQL statements and queries. We will use the excellent go-duckdb library, which provides a database/sql interface for DuckDB by statically linking to its C library.

As mentioned, we will wrap the sql.DB interface provided by go-duckdb with a struct named DuckDBDriver that takes care of initializing it properly. We initialize DuckDB by executing a number of initialization statements (bootQueries) using a Connector object. The connector executes statements that set AWS credentials (as we want to load data from S3) and install and load the extensions required by our service: the parquet extension (for reading Parquet files) and httpfs (for reading directly from HTTP-based object storage such as S3). The function getBootQueries() simply returns this collection of init statements as strings (the full list is available in the companion repo). Because the init statements are executed by the connector, when we call OpenDB() we get back DuckDB as an sql.DB connection, already loaded with the required extensions and secrets.
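The driver itself is only a thin wrapper. Here is a minimal sketch of what it might look like with go-duckdb; the struct, function, and credential-placeholder names are illustrative rather than the exact ones used in the companion repo:

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"fmt"

	"github.com/marcboeker/go-duckdb"
)

// DuckDBDriver wraps an in-memory DuckDB instance behind the standard
// database/sql interface provided by go-duckdb.
type DuckDBDriver struct {
	db *sql.DB
}

// getBootQueries returns the init statements run on every new connection:
// installing and loading the httpfs and parquet extensions and setting the
// AWS credentials needed to read from S3. The credential values are
// placeholders; in practice they would come from the environment or config.
func getBootQueries(s3Region, s3Key, s3Secret string) []string {
	return []string{
		"INSTALL httpfs",
		"LOAD httpfs",
		"INSTALL parquet",
		"LOAD parquet",
		fmt.Sprintf("SET s3_region='%s'", s3Region),
		fmt.Sprintf("SET s3_access_key_id='%s'", s3Key),
		fmt.Sprintf("SET s3_secret_access_key='%s'", s3Secret),
	}
}

// NewDuckDBDriver opens an in-memory DuckDB (empty DSN) through a Connector
// whose init callback executes the boot queries on each new connection.
func NewDuckDBDriver(s3Region, s3Key, s3Secret string) (*DuckDBDriver, error) {
	bootQueries := getBootQueries(s3Region, s3Key, s3Secret)
	connector, err := duckdb.NewConnector("", func(execer driver.ExecerContext) error {
		for _, q := range bootQueries {
			if _, err := execer.ExecContext(context.Background(), q, nil); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		return nil, err
	}
	return &DuckDBDriver{db: sql.OpenDB(connector)}, nil
}

// Execute runs a statement, e.g. the CTAS that loads the Parquet data.
func (d *DuckDBDriver) Execute(ctx context.Context, stmt string) error {
	_, err := d.db.ExecContext(ctx, stmt)
	return err
}

// Query runs a query against the in-memory tables and returns the rows.
func (d *DuckDBDriver) Query(ctx context.Context, query string, args ...any) (*sql.Rows, error) {
	return d.db.QueryContext(ctx, query, args...)
}
```

Running the boot queries inside the connector's init callback means that every connection handed out by the sql.DB pool already has the extensions installed and the S3 credentials set.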
Because go-duckdb provides a database/sql interface to the DuckDB connection, the driver's main query functionality can be implemented and exposed quite simply, as the thin Execute and Query wrappers in the sketch above show. The DuckDB data driver struct is essentially a utility class that wraps the connection to DuckDB, while all query executions are managed by a DAO struct that exposes a set of business-logic functions built on the driver's methods. The DAO struct will, in turn, wrap the DuckDBDriver struct.

In the final stage of initializing the service's data back end, we load the data we want to serve from Parquet files in S3 into a memory table. To do so, we use our driver's execute() function with a CTAS query that creates a named table, using any transformation we can express in SQL, from a read_parquet() function. An example will make this clearer.

Suppose we have a Parquet table that consists of data describing our users. We want to create a service that exposes just 3 fields from this table for fast API access: name, last_name, and age. We also want the age field to be accessible as an integer, although it is saved in the Parquet file as a string. To do this, after our DuckDB driver has been initialized with the required extensions and the required AWS credentials have been set, we simply execute an SQL statement that selects the data we want, directly from the Parquet files in S3, into memory, using the read_parquet() function.
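A sketch of what such a statement could look like, continuing the driver sketch above (the S3 path is a placeholder and the helper name is illustrative):

```go
// LoadUsers creates the in-memory "users" table directly from Parquet files
// in S3, keeping only the three fields we want to serve and casting age from
// a string to an integer. The bucket path below is a placeholder.
func LoadUsers(ctx context.Context, d *DuckDBDriver) error {
	ctas := `
		CREATE OR REPLACE TABLE users AS
		SELECT
			name,
			last_name,
			CAST(age AS INTEGER) AS age
		FROM read_parquet('s3://my-bucket/users/*.parquet')
	`
	return d.Execute(ctx, ctas)
}
```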
In this statement, we essentially create an in-memory table named users that consists of the fields we select from the Parquet files in the location given to the read_parquet() function. We could use any SQL function and syntax that DuckDB supports, including complex query statements and aggregations. The companion repo linked below contains a fuller example of how this approach is used to initialize the service. After the service has been initialized and the data has been loaded, we can execute any SQL query we need directly on the memory table, represented by our data driver, in order to serve our API with sub-second latency.

Now that we have a connection to a memory table loaded with cached Parquet data, the last step is to create a GraphQL endpoint over it in order to answer data queries efficiently. To do so, we are going to use the gqlgen library by 99designs, which makes this task rather straightforward. An in-depth introduction to gqlgen is, unfortunately, beyond the scope of this post. Readers who are less familiar with GraphQL are encouraged to skim its documentation, which is clearly presented and explained; some familiarity with GraphQL concepts is sufficient to follow this section and get the idea.

Exposing a GraphQL endpoint using gqlgen typically involves 3 main steps: (1) creating a schema, (2) generating resolver code and stubs, and (3) adding resolver code that implements the API functions. We start with a schema that describes a User in our table and declares 2 main functions for fetching users data: a general fetch-all, and a fetch by email. After creating our schema, we run gqlgen's code-generation procedure in our project directory (typically `go run github.com/99designs/gqlgen generate`). Running the generate procedure produces a lot of code, including the actual User struct (our data model struct), a corresponding resolver declaration template, and the resolver implementation stubs. Let's discuss them in order.

The resolver struct is generated in a file named resolver.go with just 2 statements: a struct type declaration with no properties or members, and a constructor (a new() method) that initializes it. As we shall shortly see, the resolver is our API serving layer, which implements a function for each of the API's methods. The purpose of the resolver.go file is for us to inject any required dependency into the resolver, or to add to it whatever we need in order to serve queries for our API. Recall that this is exactly the purpose of our DAO struct: it wraps the DuckDB data driver, which holds the connection to our in-memory tables, and is in charge of "translating" API requests for data into SQL queries. Therefore, we simply inject an initialized DAO object into the resolver, so that the resolver can use it to execute queries.

The next file that is generated (and re-generated every time we run the gqlgen generate procedure) is schema.resolvers.go, which holds the implementation of the resolver's methods. The generated schema.resolvers file essentially contains the method signatures of the API functions declared in the schema; in our case, it therefore consists of 2 methods, one for the fetch-all query and one for the fetch-by-email query. To implement these functions, we first need corresponding methods in our DAO struct, but let's start with the resolver side for the sake of example and then complete the required DAO code (both are sketched below). Because our DAO was injected into the resolver struct, the resolver can simply call its functions. This structure keeps the API layer's code very clean and simple.

Now let's write the actual implementation of the required function in the DAO struct. As you can see below, the required code is fairly simple. Although I am using some helper functions (which you can see in the companion GitHub repo), the GetUsers() function simply executes an SQL query on our in-memory DuckDB table and builds a list of users (recall that the model.User struct was generated by gqlgen from our schema).

Now we essentially have all the required layers, and we just need to chain them together: a data driver struct (which encapsulates a DB connection) is injected into a DAO struct, which implements and serves as an interface for all the required API functions, which are in turn invoked by the resolver, our API handler. The relationship between the components and their roles is clearly expressed in the way they are chained together in the main server.go file that bootstraps our service and its dependencies. When our service is initialized, we first initialize our driver, which obtains a connection to our DuckDB in-memory store. Next, we inject the driver into the NewStore method, which creates a DAO and uses the driver to load the data from the Parquet files into memory. Finally, we inject the DAO struct into the API handler, which invokes its functions when serving API requests.
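As a sketch of the two pieces just described, here is roughly what the fetch-all resolver method and the corresponding DAO function might look like. The resolver signature is the kind gqlgen generates for a users query, and the Store, Dao, and model.User field names are assumptions for illustration rather than the exact names in the companion repo:

```go
// schema.resolvers.go (sketch): the generated resolver method for the
// fetch-all query simply delegates to the DAO injected via the Resolver struct.
func (r *queryResolver) Users(ctx context.Context) ([]*model.User, error) {
	return r.Dao.GetUsers(ctx)
}

// store.go (sketch): the DAO method runs a plain SQL query against the
// in-memory users table through the DuckDB driver and maps the rows into
// the model.User structs generated by gqlgen.
func (s *Store) GetUsers(ctx context.Context) ([]*model.User, error) {
	rows, err := s.driver.Query(ctx, "SELECT name, last_name, age FROM users")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var users []*model.User
	for rows.Next() {
		var u model.User
		if err := rows.Scan(&u.Name, &u.LastName, &u.Age); err != nil {
			return nil, err
		}
		users = append(users, &u)
	}
	return users, rows.Err()
}
```

The fetch-by-email resolver would follow the same pattern, passing the email argument down to a parameterized query in the DAO.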
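And a sketch of the server.go wiring described above, using gqlgen's standard handler boilerplate. For brevity it assumes the generated code, the driver, and the DAO all live in one package (a real project would split them up), and the port and helper names (NewDuckDBDriver, NewStore) are the illustrative ones from the earlier sketches:

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/99designs/gqlgen/graphql/handler"
	"github.com/99designs/gqlgen/graphql/playground"
)

func main() {
	// 1. Initialize the data driver: an in-memory DuckDB connection with the
	// httpfs and parquet extensions loaded and the S3 credentials set.
	driver, err := NewDuckDBDriver(os.Getenv("AWS_REGION"),
		os.Getenv("AWS_ACCESS_KEY_ID"), os.Getenv("AWS_SECRET_ACCESS_KEY"))
	if err != nil {
		log.Fatal(err)
	}

	// 2. Inject the driver into the DAO, which loads the Parquet data from S3
	// into the in-memory users table (e.g. via the CTAS statement shown earlier).
	dao, err := NewStore(driver)
	if err != nil {
		log.Fatal(err)
	}

	// 3. Inject the DAO into the resolver and expose it through the GraphQL handler.
	srv := handler.NewDefaultServer(NewExecutableSchema(Config{Resolvers: &Resolver{Dao: dao}}))
	http.Handle("/", playground.Handler("GraphQL playground", "/query"))
	http.Handle("/query", srv)

	log.Println("serving GraphQL API on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```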
The purpose of this post was to offer an alternative approach to enabling HTTP API access over data lake tables for thin clients. This use case is becoming increasingly common, and it usually requires adding a number of moving parts, monitoring, and resources to our pipeline. In this post I proposed a simpler alternative, which I believe can fit many use cases, and demonstrated how we can use DuckDB's query performance and extensions to let a service load data from remote object storage, save it in an in-memory relational table, and query it with sub-second latency. More generally, I tried to give an example of the great capabilities that DuckDB's extensions can bring to our services, as well as the ease with which it can be embedded.

Hope this will be useful!

** All images, unless otherwise noted, are by the author


