Streaming Annotated Monthly – April 2021
Open community to discuss the latest news and articles about streaming platforms and processing frameworks
This was quite a complex month for me, but here we go with a new newsletter full of content: 76 articles and 5 new tools. Amazing! One of the interesting things I’ve discovered this month is another streaming data compilation by Guido Schmutz. If I had known about it earlier, I would never have started this one!
I’m quite happy with the community we are building here and I plan to focus more on that. The Telegram group is working quite well and this is only the beginning. I have in mind to organize some kind of activity. Would a Telegram Voice Chat about the articles in the newsletter be of interest to you? Let’s chat about it in the group.
That’s all! Learn a lot and stay safe!
Design and Architecture
Use an Event-Driven Data Mesh to Avoid Drowning in the (Data) Lake: Data Mesh & event-driven = hype^2
The Rise of Streamzilla: Leveraging Real-time Data in the Cloud: streaming at Porsche. Streaming with Apache NiFi?
Real-time Data Infrastructure at Uber (paper): 🌶️🌶️🌶️
Under the Hood of Real-Time Analytics with Apache Kafka and Pinot: good article, Pinot and Kafka seem to play well together 🌶️
Apache Kafka and MQTT series
Monitoring Your Event Streams: Integrating Confluent with Prometheus and Grafana: it’s always good to discover more Grafana dashboards and see how people are organizing the data 🌶️🌶️
Reducing database queries to a minimum with DataLoaders: this seems an interesting article but my lack of JS knowledge makes it hard to evaluate.
Securing the Infrastructure of Confluent with HashiCorp Vault: Vault is a solid piece of software 🌶️🌶️
Understanding Kafka Topic Partitions: good introduction for Kafka beginners.
Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper: we have a winner in the Kafka section, awesome article by Ben and Ismael 🌶️🌶️🌶️
Ververica + StreamNative: Cloud Partners: it’s interesting to see all the alliances between open-source Cloud services. I predict there are more coming.
Apache Pulsar Hackathon (6-7 May)
Taking an In-Depth Look at How to Achieve Isolation in Pulsar: another good example of the flexibility to scale up/scale down 🌶️
Kafka Streams, Kafka Connect, etc.
How to Tune RocksDB for Your Kafka Streams Application: we never get tired of articles about RocksDB optimization. This is a good introduction and it can also be quite helpful 🌶️🌶️
Kafka Connect JDBC Sink deep-dive: Working with Primary Keys: Robin never fails in this section 🌶️
Building your own Kafka Connect image with a custom resource: automate all the things! 🌶️🌶️
Apache Flink: Towards a 20x throughput improvement using in-memory buffers: this is a great deep dive, mandatory if you are into Flink optimization pipelines 🌶️🌶️🌶️
A Rundown of Batch Execution Mode in the DataStream API: using the same API for streaming and batch seems like a great idea 🌶️
Building Riviera: A Declarative Real-Time Feature Engineering Framework: it’s amazing what some people are doing out there 🌶️🌶️
Pinterest Flink Deployment Framework: It’s fine but I was expecting a bit more from Pinterest on this.
Window Of Vulnerability: this is a surprising deep-dive on Fault Tolerance in Flink.
Data Driven Development for Stream Processing: this is more about observability. I liked it because I learned a couple of things that aren’t usually explained about how to organize your dashboards 🌶️🌶️
Apache Flink Roadmap: the feature radar is superb! Great idea 🌶️🌶️🌶️
Spark Structured Streaming
Beam College: a free educational program to provide hands-on training. Highly recommended if you want to learn Beam 🌶️🌶️🌶️
Getting Started with Snowflake and Apache Beam: there are many conversations about streaming ingestion in Snowflake. A space which should evolve very soon.
Understanding JSON Schema compatibility: Robert’s blog is a hidden treasure 🌶️🌶️🌶️
Change Data Capture
Capturing Every Change From Shopify’s Sharded Monolith: if you are going to read only one article from this newsletter, choose this one! Many of us are working with CDC and struggling to scale it to the organization level. This article is full of insights and experiences from doing it 🌶️🌶️🌶️
The Journey from Batch to Real-time with Change Data Capture: it includes a comparison of Debezium and AWS Database Migration Service (DMS).
Creating and managing schemas: it’s in preview and with limited functionality, but it’s great to see a cloud provider supporting schema management 🌶️
Introducing Apache Spark Structured Streaming connector for Pub/Sub Lite: Pub/Sub Lite may decrease your bill significantly and the ecosystem around it is becoming more stable.
Pub/Sub push subscriptions can now be created with Cloud Run service endpoints protected by VPC Service Controls (preview)
Pub/Sub is now available in the europe-central2 region (Warsaw).
Dataflow Execution details are now available in Preview.
Dataflow SQL now supports user-defined functions (UDFs) written using SQL. For more information, see Dataflow SQL user-defined functions (preview).
Dataflow is now able to use workers, Dataflow Shuffle, Streaming Engine, FlexRS, and regional endpoints in zones in europe-central2 (Warsaw)
Integrating Azure and Confluent: Real-Time Search Powered by Azure Cache for Redis and Spring Cloud: it’s amazing how well Microsoft integrates different open-source Cloud services in their own cloud 🌶️
Awesome Open-Source Contribs for Apache Kafka: a curated list of awesome open-source frameworks, libraries, tools and examples for the Apache Kafka project.
A Great Day Out With... Apache Kafka: superb idea by @gunnarmorling and @hpgrahsl 🙌
ksqlDB GraphQL poc: a setup that serves as a proof of concept for using Kafka with ksqlDB in combination with the query language GraphQL.