PyFlink Stream Processing Tutorial: Build a Real-Time Pipeline with Kafka, Redpanda and Python
In this workshop, part of the Data Engineering Zoomcamp, Alexey Grigorev walks through real-time data engineering, moving from basic Python Kafka producers and consumers to Apache Flink pipelines. It is a hands-on look at how to handle high-velocity data, out-of-order events, and stateful aggregations.
You’ll learn about:
- Kafka vs. Redpanda: Understand why Redpanda is a faster, simpler alternative to Kafka for developers.
- Producer & Consumer: Learn to turn Python data into streamable bytes and read them back.
- Database Integration: Learn how to move data from a live stream into a persistent PostgreSQL database.
- Flink Basics: Learn how Job Managers (the brain) and Task Managers (the muscle) run streaming jobs.
- Handling Late Data: Use Watermarks to manage events that arrive late or out of order.
- Time Windows: Group data into time blocks (like "every 5 minutes") to calculate real-time totals.
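As a rough illustration of the Producer & Consumer point above, here is a minimal sketch of turning a Python object into bytes and reading it back. The `Ride` class and its fields are hypothetical stand-ins for the taxi ride events, not the workshop's actual schema, and JSON is assumed as the wire format. No broker is needed to run it.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Ride:
    # Hypothetical fields chosen for illustration only.
    pickup_location_id: int
    dropoff_location_id: int
    trip_distance: float
    pickup_ts_ms: int


def serialize(ride: Ride) -> bytes:
    """Producer side: dataclass -> dict -> JSON text -> UTF-8 bytes."""
    return json.dumps(asdict(ride)).encode("utf-8")


def deserialize(payload: bytes) -> Ride:
    """Consumer side: UTF-8 bytes -> JSON text -> dict -> dataclass."""
    return Ride(**json.loads(payload.decode("utf-8")))


ride = Ride(
    pickup_location_id=186,
    dropoff_location_id=79,
    trip_distance=1.72,
    pickup_ts_ms=1_700_000_000_000,
)
payload = serialize(ride)        # what would be sent to a topic
restored = deserialize(payload)  # what a consumer would rebuild
```

With a Kafka client, functions like these are typically handed to the producer and consumer as the value serializer and deserializer.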
Links:
- Course: https://github.com/DataTalksClub/data-engineering-zoomcamp
- Workshop: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/07-streaming/workshop/README.md
- DTC Courses: https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html
TIMECODES:
00:00 Workshop Introduction and Course Context
01:34 Finding the Workshop Materials and Course Links
03:01 Streaming Stack Overview: Kafka, Redpanda, and Flink
04:56 Credits, Workshop Changes, and Environment Setup Plan
06:32 Creating a Repository and Launching Codespaces
10:18 Python Project Setup with UV
13:09 Starting Redpanda and Defining Producer and Consumer Roles
15:37 Loading Taxi Data and Modeling Ride Events
23:48 Building a Kafka Producer and Sending JSON Events
27:16 Creating a Consumer and Cleaning Up Serialization Logic
40:02 Streaming Many Events and Writing Them to Postgres
48:41 Why Flink for Stream Processing
52:46 Flink Architecture and Docker Setup
01:03:24 First Flink Job: Kafka to Postgres Pass-Through
01:11:30 From Pass-Through to Real-Time Aggregations
01:14:50 Generating Real-Time and Late Events
01:18:58 Window Aggregations and Watermarks in Flink
01:28:57 Wrap-Up, Watermark Discussion, and Closing
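The window and watermark segments can be pictured with a plain Python sketch. This is not Flink code and it simplifies Flink's semantics: it assumes 5-minute tumbling windows, a watermark equal to the largest event time seen minus a fixed delay, and that events for an already closed window are dropped. The names `WINDOW_SIZE`, `MAX_DELAY`, and `aggregate` are hypothetical.

```python
from collections import defaultdict

WINDOW_SIZE = 300  # 5-minute tumbling windows, in seconds (assumption)
MAX_DELAY = 5      # assumed bound on how late an event may arrive, in seconds


def window_start(ts: int) -> int:
    """Start of the tumbling window that contains event time ts."""
    return ts - ts % WINDOW_SIZE


def aggregate(events):
    """Sum amounts per window; events are (event_time, amount) in arrival order."""
    open_windows = defaultdict(float)  # window start -> running total
    results = {}                       # closed windows and their final totals
    dropped = []                       # events that arrived after their window closed
    watermark = float("-inf")

    for ts, amount in events:
        start = window_start(ts)
        if start + WINDOW_SIZE <= watermark:
            # The watermark has already passed this window's end: too late.
            dropped.append((ts, amount))
            continue
        open_windows[start] += amount
        # The watermark trails the largest event time seen by MAX_DELAY.
        watermark = max(watermark, ts - MAX_DELAY)
        # Emit every window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark):
            results[s] = open_windows.pop(s)

    return results, dict(open_windows), dropped


events = [
    (10, 5.0),
    (290, 2.0),
    (302, 1.0),
    (298, 3.0),  # out of order, but within MAX_DELAY, so still counted
    (306, 4.0),  # moves the watermark past 300 and closes the first window
    (120, 9.0),  # arrives after its window closed, so it is dropped
]
results, still_open, dropped = aggregate(events)
```

In this sketch the event at 298 is accepted because the watermark has not yet reached the end of its window, while the event at 120 is discarded because it has.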
Connect with DataTalks.Club:
- Join the community - https://datatalks.club/slack.html
- Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
- Check other upcoming events - https://lu.ma/dtc-events
- GitHub: https://github.com/DataTalksClub
- LinkedIn - https://www.linkedin.com/company/datatalks-club/
- Twitter - https://twitter.com/DataTalksClub
- Website - https://datatalks.club/
Connect with Alexey:
- Twitter - https://twitter.com/Al_Grigor
- LinkedIn - https://www.linkedin.com/in/agrigorev/
Check our free online courses:
- ML Engineering course - http://mlzoomcamp.com
- Data Engineering course - https://github.com/DataTalksClub/data-engineering-zoomcamp
- MLOps course - https://github.com/DataTalksClub/mlops-zoomcamp
- LLM course - https://github.com/DataTalksClub/llm-zoomcamp
- Open-source LLM course: https://github.com/DataTalksClub/open-source-llm-zoomcamp
- AI Dev Tools course: https://github.com/DataTalksClub/ai-dev-tools-zoomcamp
👉🏼 Read about all our courses in one place - https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html
👋🏼 Support/inquiries
If you want to support our community, use this link - https://github.com/sponsors/alexeygrigorev
If you’re a company, reach us at alexey@datatalks.club
#apacheflink #kafka #redpanda #python #dataengineering #streaming #postgresql #docker #pyflink #realtimeanalytics #bigdata #softwareengineering #codespaces #datapipelines #streamprocessing #eventdriven #coding #techworkshop #learninginpublic #datatalksclub