Streaming Annotated Monthly – February 2021
Receive the latest news and interesting articles about streaming platforms and processing frameworks in your mailbox
When I published the previous newsletter with only one subscriber (myself), I never expected it to be so well received. 148 subscribers and 784 views are more than enough of an indicator of interest for me. Much more important than the numbers, I had some interesting conversations with people who read the newsletter: I received several messages on Twitter and LinkedIn and even organized some virtual coffee breaks.
I’m always happy to share my experience and (limited) knowledge and learn from others, and this newsletter seems to be an excellent way to do it. So here we go with another month’s list and a couple of novelties:
We have a new design and logo for the newsletter, thanks to the amazing Andrea Magan. We are very lucky to have her in the community.
After some conversations with subscribers, I’ve added a final section with tools and libraries. It’s nothing commercial, just the kind of projects you may find on GitHub that can save your day.
As usual, I would love it if you shared with me the articles I missed, or your thoughts about the newsletter or any streaming topic. My DMs are open.
Architecture and design
High Throughput Ingestion with Iceberg: full of details, highly recommended if you are working on ingestion 🌶️🌶️🌶️
FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format 🌶️
Better to Be Wrong Than Vague: Apache Kafka and Software Architecture Predictions for 2021
Apache Kafka vs Apache Pulsar: a lot of common sense in this article 🌶️
Should IAS transition from Kafka to Pulsar: I wouldn’t draw conclusions from this article alone, but it’s a good representation of many others with the same approach and conclusions. It’s hard to compare one system with another when you have a lot more knowledge of the older one.
Streaming Integration: A Guide to Realizing the Value of Real-Time Data: free e-book.
Building a data lake: from batch to real-time using Kafka 🌶️
Manas Realtime — Enabling changes to be searchable in a blink of an eye: I enjoyed this article a lot 🌶️🌶️🌶️
How Intuit Built a Self-serve Stream Processing Platform with Flink: self-service platforms are my thing. It’s a good article but it would be great to have more details.
Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability: great example of continuous optimization 🌶️🌶️🌶️
Architecting messaging solutions with Apache ActiveMQ Artemis: this is a great post and many things apply to other broker technologies.
Vectorized: free and source available, cloud native infrastructure for real-time applications.
Event-driven architecture
Apache Kafka
Intro to Apache Kafka: How Kafka Works: this is a must if you are starting with Kafka.
Kafka As A Database? Yes Or No: it’s the new tabs vs. spaces!
Property Based Testing Confluent Cloud Storage for Fun and Safety: this article about testing Kafka features is very refreshing and full of new approaches 🌶️🌶️🌶️
Combining strict order with massive parallelism using Kafka 🌶️
How Apache Kafka Enables Podium to “Ship It and See What Happens”: the title is a bit fuzzy but it’s an interesting article describing their pipeline for ingesting data into Elasticsearch 🌶️
Top 3 Kafka Books and Tutorials: it’s always nice to read about tech books. This is a solid list.
Kafka client libs, Kafka Connect, Kafka Streams, etc.
Streaming data into Kafka S01/E04 — Parsing log files using Grok Expressions
Kafka doc: new section on geo-replication and cross-cluster data replication. Thanks to @miguno.
Apache Pulsar
Datastax acquires Kesque as it gets into data streaming: it isn’t a technical article but it’s interesting to see how database vendors are moving into streaming.
Watch Your Streams: Implementing OpenTelemetry with Apache Pulsar: video.
Workshop Apache Cassandra™ and Apache Pulsar™: very interesting workshop, we want more content like this! 🌶️🌶️🌶️
Flink
Using RocksDB State Backend in Apache Flink: When and How 🌶️🌶️🌶️
Exploring fine-grained recovery of bounded data sets on Flink 🌶️
Batch, Streaming & DevRel Outer Space: interview with Marta Paes!
Flink setup for development (and some IntelliJ Idea cool tricks): self-hype 😇
Spark Structured Streaming
Apache Beam
Cloud
What’s new on the cloud for data engineers - part 2 (11.2020-01.2021): this is awesome. I don’t know how Bartosz was able to cover every cloud provider but it’s very useful 🌶️🌶️🌶️
Google Cloud
Using Dataflow snapshots (new preview feature) 🌶️
Microsoft Azure
Introducing Kubernetes Event Grid Bridge: Bringing Kubernetes events to Microsoft Azure.
Introducing seamless integration between Microsoft Azure and Confluent Cloud
Amazon AWS
MSK: Updating the broker type: You can now change the broker type for an existing cluster 🌶️
Change Data Capture
BordeauxJUG - Janv. 2021 - Gunnar Morling and Katia Aresti - Let’s discover Infinispan and Debezium: Video. Katia and Gunnar together!
Change Data Capture — Convert your database into a stream with Debezium
Tools, libs and scripts (free and open source)
KLoadGen - Kafka + (Avro/Json Schema) Load Generator: CoruNet is a great company and this tool is a good example.
Sampler. Visualization for any shell command: cool CLI alternative to Grafana with a Kafka example.
Helpful Tools for Apache Kafka Developers: kafkacat, REST Proxy, jq, Kafka Streams Topology Visualizer
kafka-encryption: a Java framework that eases the encryption/decryption of Kafka record values at the serializer/deserializer level.
That’s all! If you found it useful, please share it with your network.
Cheers from the rainy (but beautiful) Galicia!