Streaming Annotated Monthly – March 2021

Receive the latest news and interesting articles about streaming platforms and processing frameworks in your mailbox

Mar 13, 2021

Here we go with another month of the Streaming Annotated Newsletter! I am still pursuing my personal goal of making this more collaborative. The newsletter numbers are solid, but I miss having more interaction.

So let’s try something different: a Telegram group. I already participate in a couple of them (mainly Java related and StreamingHispano) and they are great. You can read from your phone and configure notifications by group. I will post articles with 🌶️🌶️🌶️ there, and we can comment on them or discuss anything you want related to streaming.

Does it sound interesting? Join the Streaming Annotated Telegram group!

Architecture and design

Making Sense of Unbounded Data.
Why You Need To Set SLAs for Your Data Pipelines: an unusual but very important topic. 🌶️🌶️
Scaling Reporting at Reddit. 🌶️

Kafka

Examining Apache Kafka Performance Metrics ft. Alok Nikhil: podcast.
Real-time monitoring of Formula 1 telemetry data on Kubernetes with Grafana, Apache Kafka, and Strimzi.
Announcing the Confluent Community Forum.
User authentication and authorization in Apache Kafka.
Docker free Kafka integration tests.
Introducing Confluent Platform 6.1.
Automatic Observer Promotion Brings Fast and Safe Multi-Datacenter Failover with Confluent Platform 6.1: there are several options for geographic redundancy; Multi-Region Clusters is one of the commercial options. 🌶️
42 Things You Can Stop Doing Once ZooKeeper Is Gone from Apache Kafka: one of the most exciting changes coming to Kafka this year.
Pragmatic Guide to Apache Kafka’s Exactly Once Semantics: excellent video from Gwen Shapira covering the typical gotchas and misunderstandings with Exactly Once semantics in Kafka. 🌶️🌶️
Lessons Learned from Running Apache Kafka at Scale at Pinterest: highly recommended article. I liked the part covering Kafka upgrades and balancing of partitions. 🌶️🌶️🌶️
Twitter thread about Kafka monitoring and observability. 🌶️
Microcks 1.2.0 release: it now supports Kafka and Avro. A guide, Kafka, Avro and Schema Registry, is now available. 🌶️
Visualizing Kafka: very good introductory article to Kafka.
Preview: Apache Kafka Log4j2 Support (KIP-653): this is a bit low-level, but I appreciate learning more about logging in Kafka and the CVEs associated with it.
Kafka Monthly Digest – February 2021. 🌶️🌶️🌶️
Streaming microservices with ZIO and Kafka: I didn’t know ZIO Streams. It’s interesting. 🌶️

Kafka on Kubernetes

Apache Beam

Gotchas of Stream Processing: Data Skewness: good overview and tips, most of the content applies to other frameworks. 🌶️🌶️
Apache Beam 2.28.0.
Beam College: free Apache Beam training! 🌶️🌶️🌶️

Change Data Capture (CDC)

STREAMS Explained: Snowflake Change Data Capture Using Streams with a Snowpipe (Revised): what Snowflake is building is very interesting. It seems big news is coming.
Change Data Capture in Postgres With Debezium.
A Gentle Introduction to Event-driven Change Data Capture.
Debezium 1.5.0.Alpha1 Released.
Oracle CDC Source Premium Connector is Now Generally Available.
Oracle to Kafka — Playing with Confluent’s new Oracle CDC Source Connector in Docker.
From Oracle to Google Big Query by Kafka.
https://www.infoq.com/articles/saga-orchestration-outbox/: another great post by Gunnar Morling. His Java articles covering new features are also superb. 🌶️🌶️🌶️

Google Cloud

Orchestration with Workflows: codelab with Pub/Sub.
Introducing real-time data integration for BigQuery with Cloud Data Fusion.
Dataflow now supports Dataflow Shuffle, Streaming Engine, FlexRS.
We had an incident, and it was great: article describing the experience with a “poison pills” incident on Dataflow. 🌶️🌶️🌶️
How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020: posts with this technical level are highly appreciated. In this case, it covers Sort Merge Bucket, an optimization that reduces shuffle by doing work up front on the producer side. 🌶️🌶️🌶️
Announcing the Launch of Databricks on Google Cloud.
Google Cloud Functions Sink Connector for Confluent Cloud.
Monitoring your Dataflow pipelines: an overview: it’s always good to see metrics and dashboards for data pipelines. 🌶️🌶️
An Apache Spark connector is now available for Pub/Sub Lite.
Architect your data lake on Google Cloud with Data Fusion and Composer.

AWS

Azure

RocksDB

It isn’t exactly streaming but it’s quite relevant for Flink, Kafka Streams, etc.

Tools

Streams Explorer: it allows examining Apache Kafka data pipelines.
3rd party command line tools for Apache Kafka: this is a solid compilation of Kafka CLI tools.
Kafka Connect FileSystem Connector: amazing work by Mario Molina.
klustr: An open source monitoring tool for Apache Kafka.
kafka-connect-transform-kryptonite: Kafka Connect SMT to do field-level encryption/decryption of records.

That’s all! Comments? Drop me a message on Twitter or join the Streaming Annotated Telegram group!
