Streaming Annotated Monthly – February 2021

Receive the latest news and interesting articles about streaming platforms and processing frameworks in your mailbox

Feb 08, 2021

When I published the previous newsletter with only one subscriber (myself), I never expected it would be so well received. 148 subscribers and 784 views are more than enough of an indicator of interest for me. Much more important than the numbers, I had some interesting conversations with people who read the newsletter: I received several messages on Twitter and LinkedIn and even organized some virtual coffee breaks.

I’m always happy to share my experience and (limited) knowledge and learn from others, and this newsletter seems to be an excellent way to do it. So here we go with another month’s list and a couple of novelties:

We have a new design and logo for the newsletter, thanks to the amazing Andrea Magan. We are very lucky to have her in the community.
After some conversations with subscribers, I’ve added a final section with tools and libraries. It’s nothing commercial, just the typical projects you may find on GitHub that can save your day.

As usual, I would love it if you shared with me the articles I missed, or your thoughts about the newsletter or any streaming topic. My DMs are open.

Architecture and design

High Throughput Ingestion with Iceberg: full of details, highly recommended if you are working on ingestion 🌶️🌶️🌶️
FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format 🌶️
Better to Be Wrong Than Vague: Apache Kafka and Software Architecture Predictions for 2021
Apache Kafka vs Apache Pulsar: a lot of common sense in this article 🌶️
Should IAS transition from Kafka to Pulsar: I wouldn’t draw conclusions from this article alone, but it’s a good representation of many others with the same approach and conclusions. It’s hard to compare one system with another when you know the older one much better.
Trident - Real-time event processing at scale 🌶️
Streaming Integration: A Guide to Realizing the Value of Real-Time Data: free e-book.
Building a data lake: from batch to real-time using Kafka 🌶️
Manas Realtime — Enabling changes to be searchable in a blink of an eye: I enjoyed this article a lot 🌶️🌶️🌶️
How Intuit Built a Self-serve Stream Processing Platform with Flink: self-service platforms are my thing. It’s a good article but it would be great to have more details.
4 big data architectures, Data Streaming, Lambda architecture, Kappa architecture and Unifield architecture
Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability: great example of continuous optimization 🌶️🌶️🌶️
Architecting messaging solutions with Apache ActiveMQ Artemis: this is a great post and many things apply to other broker technologies.
Vectorized: free and source available, cloud native infrastructure for real-time applications.

Event-driven architecture

Apache Kafka

Kafka Monthly Digest – January 2021 🌶️🌶️🌶️
4 Steps for Kafka Rebalance - Notes From the Field 🌶️🌶️🌶️
Intro to Apache Kafka: How Kafka Works: this is a must if you are starting with Kafka.
Implementing mTLS and Securing Apache Kafka at Zendesk 🌶️
How Microcks Can Speed-Up Your AsyncAPI Adoption - Part 2
Kafka As A Database? Yes Or No: it’s the new tabs vs. spaces!
Property Based Testing Confluent Cloud Storage for Fun and Safety: this article about testing Kafka features is very refreshing and full of new approaches 🌶️🌶️🌶️
Top 5 Things Every Apache Kafka Developer Should Know
Combining strict order with massive parallelism using Kafka 🌶️
How Apache Kafka Enables Podium to “Ship It and See What Happens”: the title is a bit fuzzy but it’s an interesting article describing their pipeline for ingesting data into Elasticsearch 🌶️
Top 3 Kafka Books and Tutorials: it’s always nice to read about tech books. This is a solid list.

Kafka client libs, Kafka Connect, Kafka Streams, etc.

Apache Pulsar

Datastax acquires Kesque as it gets into data streaming: it isn’t a technical article but it’s interesting to see how database companies are moving into streaming.
Watch Your Streams: Implementing OpenTelemetry with Apache Pulsar: video.
Workshop Apache Cassandra™ and Apache Pulsar™: very interesting workshop, we want more content like this! 🌶️🌶️🌶️

Flink

Stateful Functions 2.2.2 Release Announcement
What’s New in the Pulsar Flink Connector 2.7.0
Apache Flink 1.12.1 Released
Using RocksDB State Backend in Apache Flink: When and How 🌶️🌶️🌶️
Exploring fine-grained recovery of bounded data sets on Flink 🌶️
2021 Apache Flink Meetup - Hosted by Netflix
Batch, Streaming & DevRel Outer Space: interview with Marta Paes!
Apache Flink 1.10.3 Released
Flink setup for development (and some IntelliJ Idea cool tricks): self-hype 😇

Spark Structured Streaming

Apache Beam

Cloud

What’s new on the cloud for data engineers - part 2 (11.2020-01.2021): this is awesome. I don’t know how Bartosz was able to cover every cloud provider but it’s very useful 🌶️🌶️🌶️

Change Data Capture

Tools, libs and scripts (free and open source)

KLoadGen - Kafka + (Avro/Json Schema) Load Generator: CoruNet is a great company and this tool is a good example.
A Clojure Apache Kafka client with core.async api
Kafka Configs Metrics Exporter
Sampler. Visualization for any shell command: cool CLI alternative to Grafana with a Kafka example.
Helpful Tools for Apache Kafka Developers: kafkacat, REST Proxy, jq, Kafka Streams Topology Visualizer
kafka-encryption: a Java framework that eases the encryption/decryption of Kafka record’s value at the serializer/deserializer level.
kattlo: Apache Kafka Configuration Made Easy
voluble: a generator of data

That’s all! If you found it useful, please share it with your network.

Cheers from the rainy (but beautiful) Galicia!
