Data Engineering Zoomcamp

@dezoomcamp

04.03.2026 10:59

We're starting our stream about streaming!

This will be part of module 7 on streaming: a reworked version of last year's workshop, updated to the latest version of PyFlink

Stream: https://www.youtube.com/watch?v=YDUgFeHQzJU
Workshop: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/07-streaming/workshop

Watch now or catch the recording later!

PyFlink Stream Processing Tutorial: Build a Real-Time Pipeline with Kafka, Redpanda and Python

In this workshop, Alexey Grigorev breaks down the complexities of real-time data engineering, moving from basic Python-based Kafka consumers to enterprise-grade Apache Flink pipelines. This workshop, part of the Data Engineering Zoomcamp, provides a hands-on look at how to handle high-velocity data, out-of-order events, and stateful aggregations.

You'll learn about:
- Kafka vs. Redpanda: Understand why Redpanda is a faster, simpler alternative to Kafka for developers.
- Producer & Consumer: Learn to turn Python data into streamable bytes and read them back.
- Database Integration: How to move data from a live stream into a permanent PostgreSQL database.
- Flink Basics: Learn how Job Managers (the brain) and Task Managers (the muscle) run streaming jobs.
- Handling Late Data: Use Watermarks to manage events that arrive late or out of order.
- Time Windows: Group data into time blocks (like "every 5 minutes") to calculate real-time totals.

Links:
- Course: https://github.com/DataTalksClub/data-engineering-zoomcamp
- Workshop: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/07-streaming/workshop/README.md
- DTC Courses: https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html

TIMECODES:
00:00 Workshop Introduction and Course Context
01:34 Finding the Workshop Materials and Course Links
03:01 Streaming Stack Overview: Kafka, Redpanda, and Flink
04:56 Credits, Workshop Changes, and Environment Setup Plan
06:32 Creating a Repository and Launching Codespaces
10:18 Python Project Setup with UV
13:09 Starting Redpanda and Defining Producer and Consumer Roles
15:37 Loading Taxi Data and Modeling Ride Events
23:48 Building a Kafka Producer and Sending JSON Events
27:16 Creating a Consumer and Cleaning Up Serialization Logic
40:02 Streaming Many Events and Writing Them to Postgres
48:41 Why Flink for Stream Processing
52:46 Flink Architecture and Docker Setup
01:03:24 First Flink Job: Kafka to Postgres Pass-Through
01:11:30 From Pass-Through to Real-Time Aggregations
01:14:50 Generating Real-Time and Late Events
01:18:58 Window Aggregations and Watermarks in Flink
01:28:57 Wrap-Up, Watermark Discussion, and Closing

Connect with DataTalks.Club:
- Join the community - https://datatalks.club/slack.html
- Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
- Check other upcoming events - https://lu.ma/dtc-events
- GitHub - https://github.com/DataTalksClub
- LinkedIn - https://www.linkedin.com/company/datatalks-club/
- Twitter - https://twitter.com/DataTalksClub
- Website - https://datatalks.club/

Connect with Alexey:
- Twitter - https://twitter.com/Al_Grigor
- LinkedIn - https://www.linkedin.com/in/agrigorev/

Check our free online courses:
- ML Engineering course - http://mlzoomcamp.com
- Data Engineering course - https://github.com/DataTalksClub/data-engineering-zoomcamp
- MLOps course - https://github.com/DataTalksClub/mlops-zoomcamp
- LLM course - https://github.com/DataTalksClub/llm-zoomcamp
- Open-source LLM course - https://github.com/DataTalksClub/open-source-llm-zoomcamp
- AI Dev Tools course - https://github.com/DataTalksClub/ai-dev-tools-zoomcamp

👉🏼 Read about all our courses in one place - https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html

👋🏼 Support/inquiries
If you want to support our community, use this link - https://github.com/sponsors/alexeygrigorev
If you're a company, reach us at alexey@datatalks.club

#apacheflink #kafka #redpanda #python #dataengineering #streaming #postgresql #docker #pyflink #realtimeanalytics #bigdata #softwareengineering #codespaces #datapipelines #streamprocessing #eventdriven #coding #techworkshop #learninginpublic #datatalksclub
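
For a feel of the producer/consumer, watermark, and time-window topics listed above before you watch, here is a rough sketch in plain Python (standard library only, no Kafka or Flink). It is not the workshop code. The `Ride` fields, the 5-second lateness bound, and the `aggregate` helper are all assumptions made up for illustration, and the late-event rule is a simplified version of what Flink does.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, asdict


@dataclass
class Ride:
    # Hypothetical event shape, not the workshop's actual schema.
    pickup_location_id: int
    total_amount: float
    event_time_ms: int


def serialize(ride: Ride) -> bytes:
    """Producer side: turn a Python object into JSON bytes."""
    return json.dumps(asdict(ride)).encode("utf-8")


def deserialize(payload: bytes) -> Ride:
    """Consumer side: read the bytes back into a Python object."""
    return Ride(**json.loads(payload.decode("utf-8")))


WINDOW_MS = 5 * 60 * 1000        # "every 5 minutes" tumbling window
ALLOWED_LATENESS_MS = 5 * 1000   # assumed out-of-orderness bound


def aggregate(payloads):
    """Sum total_amount per 5-minute window, dropping events whose
    window has already been passed by the watermark (simplified rule)."""
    totals = defaultdict(float)
    dropped = []
    max_event_time = None
    for payload in payloads:
        ride = deserialize(payload)
        if max_event_time is None or ride.event_time_ms > max_event_time:
            max_event_time = ride.event_time_ms
        # Watermark trails the largest event time seen so far.
        watermark = max_event_time - ALLOWED_LATENESS_MS
        window_start = ride.event_time_ms - ride.event_time_ms % WINDOW_MS
        window_end = window_start + WINDOW_MS
        if window_end <= watermark:
            dropped.append(ride)  # too late: its window is already closed
            continue
        totals[window_start] += ride.total_amount
    return dict(totals), dropped


events = [
    Ride(1, 10.0, 0),
    Ride(1, 5.0, 60_000),
    Ride(2, 7.0, 310_000),   # moves the watermark past the first window
    Ride(1, 3.0, 120_000),   # arrives late, belongs to the first window
]
payloads = [serialize(r) for r in events]
totals, dropped = aggregate(payloads)
print(totals)
print([r.event_time_ms for r in dropped])
```

The last event belongs to the first 5-minute window, but it arrives after the watermark has moved past that window's end, so it is dropped instead of changing a total that would already have been emitted. A real Flink job keeps this state per key and across restarts.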
