
Stream Processing: Is SQL Good Enough?

Yingjun Wu · Better Programming

Two weeks ago, Current 2023, the biggest data streaming event in the world, took place in San Jose. It ranks among my favorite events of 2023. This isn't solely because the conference venue was a mere ten-minute drive from my home; it's also arguably the only annual conference where data streaming experts from around the globe convene to discuss technology openly.

If you missed the event, fret not. My friend Yaroslav Tkachenko from Goldsky has penned a comprehensive blog detailing the key takeaways. Among the many insights he shared, the one that particularly piqued my interest was his commentary on streaming databases.

As the founder of RisingWave, a leading streaming database (of which I am shamelessly proud), I found that Yaroslav's observations prompted a whirlwind of reflections. They are not rooted in disagreement; on the contrary, I wholeheartedly concur with his viewpoints. His post simply spurred me to contemplate SQL stream processing, streaming databases, and their use cases from various angles, and I'm eager to share these musings with the data streaming community and the wider realm of data engineering.

Streaming databases let users process streaming data in the style of a database, with SQL naturally serving as the primary language. Like most modern databases, RisingWave and several other streaming databases prioritize SQL, and they also offer user-defined functions (UDFs) in languages like Python and Java. What these databases do not really provide are lower-level programmatic APIs.

So the question we grapple with is: is the expressiveness of SQL (even with UDF support) sufficient?

In my conversations with hundreds of data streaming practitioners, many argue that SQL alone doesn't suffice for stream processing. The top three use cases that immediately come to mind are (1) rule-based fraud detection, (2) financial trading, and (3) machine learning.

In fraud detection, many applications continue to lean on a rule-based paradigm. For these, Java often proves more straightforward for expressing rules and integrating directly with applications. Why? Primarily because numerous system backends are written in Java; if the streaming data doesn't need to be persisted in a database, expressing the logic in a single, consistent language is considerably more manageable.

In financial trading, some contend that SQL falls short, particularly when specialized expressions outside the scope of standard SQL are needed. They could embed this logic in UDFs, but concerns loom over the latency UDFs introduce: traditional UDF implementations typically involve hosting a separate UDF server and are notorious for adding significant latency.

As for machine learning, practitioners have a penchant for Python, with the bulk of their applications written in it. Libraries like pandas and NumPy are their favorite toolkits, and resorting to SQL to express their logic doesn't come naturally.

While I could list numerous real-world scenarios where SQL (with or without UDF support) sufficiently addresses the need, I won't deny that SQL's expressiveness doesn't quite match that of Java or Python.

However, one question worth discussing is this: if SQL can satisfy their stream processing needs, will people choose a SQL-centric interface or still resort to a Java-centric framework? Most people will choose SQL. The crux of my argument lies in SQL's ubiquity and foundational nature. Every data engineer, analyst, and scientist is versed in it. If a basic tool can be harnessed to meet the need, why gravitate toward a more intricate solution?
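To make the SQL-centric approach concrete, here is a minimal sketch of how such a rule-based fraud check might be expressed in a streaming database. The source, column names, and threshold are hypothetical, and the exact connector options and windowing syntax vary from system to system (the snippet loosely follows RisingWave-flavored SQL), so treat it as an illustration rather than a recipe.

    -- Ingest a hypothetical Kafka topic of card transactions.
    CREATE SOURCE transactions (
        card_id    VARCHAR,
        amount     NUMERIC,
        event_time TIMESTAMP
    ) WITH (
        connector = 'kafka',
        topic = 'transactions',
        properties.bootstrap.server = 'kafka:9092'
    ) FORMAT PLAIN ENCODE JSON;

    -- A simple rule: flag any card with more than 10 transactions in a one-minute window.
    -- The view is maintained incrementally as new events stream in.
    CREATE MATERIALIZED VIEW suspicious_cards AS
    SELECT
        card_id,
        window_start,
        COUNT(*)    AS txn_count,
        SUM(amount) AS total_amount
    FROM TUMBLE(transactions, event_time, INTERVAL '1 MINUTE')
    GROUP BY card_id, window_start
    HAVING COUNT(*) > 10;

A rule like this stays within plain SQL; it is the rules that require bespoke expressions or tight coupling with a Java backend where the concerns above start to bite.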
Most data systems that emerged during the big data era, like Hadoop, Spark, Hive, and Flink, were Java-centric. Newer systems, like ClickHouse, RisingWave, and DuckDB, are fundamentally database systems that prioritize SQL. So who exactly uses the Java-centric systems, and who uses the SQL-centric ones?

I often find it challenging to convince established companies founded before 2015, such as LinkedIn, Uber, and Pinterest, to adopt SQL for stream processing. To be sure, many of them passed on my pitch simply because they already had well-established data infrastructures and preferred to focus on more application-level projects. I know many companies these days are looking into LLMs, but a closer examination of their data infrastructure reveals some recurring patterns. Several factors make it arduous to persuade such enterprises to embrace stream processing.

While selling SQL-centric systems to these corporations can be daunting, don't be disheartened: we do have success stories, and there are some promising indicators to watch for. Interestingly, the rise of SQL stream processing has been buoyed by the advocacy of Java-centric big data technologies like Apache Spark Streaming and Apache Flink. Although they began with a Java interface, these platforms increasingly emphasize SQL. The prevailing trend is that most newcomers to these platforms begin their journey with SQL, which predisposes them toward the SQL ecosystem over the Java one. And even for those who started with Java-centric systems, a future transition to a SQL streaming database might be a smoother pivot than one might expect.

Before delving into the market size of SQL stream processing, it's essential to consider the broader data streaming market. We must recognize that the data streaming market, as it stands, is somewhat niche compared to the batch market; debating this would be fruitless. A simple comparison of today's market value of Confluent (the leading streaming company) with that of Snowflake (the dominant batch company) illustrates the point. Regardless of its current stature, though, the streaming market is undoubtedly booming: an increasing amount of venture capital is being invested, and major data infrastructure players, including Snowflake, Databricks, and MongoDB, are beginning to develop their own modern streaming systems.

It's plausible to suggest that the stream processing market will eventually mirror the batch processing market in its patterns and trends. So, within the batch processing market, what's the size of the SQL segment? The revenue figures for SQL-centric products like Snowflake, Redshift, and BigQuery speak volumes. What about the market for products that primarily offer Java interfaces? At least in the data infrastructure space, I haven't seen any strong cash cow. Someone may mention Databricks, the rapidly growing pre-IPO company commercializing Spark. No one can deny that Spark is the most widely used big data system in the world, but a closer look at Databricks' offerings and marketing strategy shows that the SQL-centric data lakehouse is what they are betting on.

This observation raises a paradox: SQL's expressiveness might be limited compared to Java, yet SQL-centric data systems manage to generate more revenue. Why is that?
First, as highlighted in Yaroslav's blog, SQL caters to roughly 50–80% of use cases. While the exact figure remains elusive, it's evident that SQL suffices for a significant proportion of organizational needs. Hypothetically, if a company determines that SQL stream processing aligns with its use cases, which system would it likely opt for: a SQL-centric one or a Java-centric one? If you're unsure, consider this analogy: if you aim to cut a beef rib, would you reach for a specialized beef rib knife or a Swiss Army knife? The preference is clear.

Second, consider the audience for Java. Individuals with a computer science background might navigate Java proficiently and grasp system-specific Java APIs, but expecting those without such a background to master Java is unrealistic. And even if they did, wouldn't they prefer to manage the service on their own? While it's not absolute, companies with a robust engineering team seem less inclined to outsource.

We've discussed SQL stream processing extensively; it's time to pivot to streaming databases. Classic stream processing engines like Spark Streaming and Flink have incorporated SQL, and, as mentioned, they have begun using it as the entry-level language. Vendors are primarily building on SQL, with Confluent's Flink offering standing as a notable example. Given that Spark Streaming and Flink provide SQL interfaces, why the push for streaming databases?

The distinctions are significant. Big data SQL fundamentally diverges from database SQL. For instance, big data SQL often lacks standard SQL statements common in database systems, such as CREATE, DROP, ALTER, INSERT, UPDATE, and DELETE. Digging deeper, a pivotal difference lies in storage: streaming databases have storage capabilities, while stream processing engines typically do not. This discrepancy influences design, implementation, system efficiency, cost, performance, and various other dimensions. For those intrigued by these distinctions, I'd recommend my QCon San Francisco 2023 talk (slide deck here).

Furthermore, the fundamental idea of a database is distinct from that of a computation engine. To illustrate, database users often rely on BI tools or client libraries to visualize results, while users of computation engines typically depend on an external system for storage and querying.
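As a hedged illustration of that workflow difference, the sketch below shows the kind of database-style statements a streaming database accepts directly. The table, the ad hoc query, and the reference to the suspicious_cards view from the earlier sketch are all hypothetical, and syntax details vary by system.

    -- Database-style DDL and DML: the data lives inside the streaming database itself.
    CREATE TABLE blocked_cards (card_id VARCHAR PRIMARY KEY);
    INSERT INTO blocked_cards VALUES ('card_123');
    DELETE FROM blocked_cards WHERE card_id = 'card_123';

    -- Results can be queried ad hoc, straight from a BI tool or a database client,
    -- with no external serving store required.
    SELECT card_id, txn_count
    FROM suspicious_cards
    WHERE window_start >= NOW() - INTERVAL '1 hour'
    ORDER BY txn_count DESC
    LIMIT 10;

A stream processing engine without storage would instead write its results out to Kafka or to an external database before anything could query them.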
Some perceive a streaming database as a fusion of a stream processing engine and a traditional database. Technically, you could construct one by merging, say, Flink (a stream processing engine) and Postgres (a database). However, such an endeavor would present myriad challenges, including maintenance, consistency, failure recovery, and other intricate technical issues. I'll delve into these in an upcoming article.

Debating SQL's expressiveness is moot. SQL is expressive enough for many scenarios, while languages like Java and Python outshine it in specific use cases. However, the decision to adopt SQL stream processing isn't driven solely by expressiveness; often, it's shaped by a company's current standing and where it is on its data infrastructure journey. Mirroring trends observed in the batch domain, one can hypothesize about the future size of the SQL stream processing market. Regardless, streaming databases offer capabilities that extend well beyond stream processing. It'll be fascinating to see how the landscape evolves in the coming years.

Yingjun Wu is the founder of RisingWave (risingwave.com), a distributed SQL streaming database. Previously at AWS Redshift and IBM Research Almaden. NUS PhD, CMU-DB alumnus.


