Building a Restart-Safe Kafka to BigQuery Pipeline with Durable Checkpointing Using Apache Beam/Dataflow

Problem Statement:
I needed a robust way to ingest data from Kafka to BigQuery using Apache Beam/Dataflow, with at-least-once delivery, durable checkpointing, and safe offset progression—even when there are source gaps or invalid payloads. Additionally, I wanted the ability to restart the pipeline from the last committed checkpoint after a failure or restart.

The Apache Beam Dataflow runner does not natively support checkpointing for records already written to BigQuery.

Solution:
Implement a layered Beam pipeline that:

  • Reads from Kafka.

  • Selects contiguous offsets, handling source gaps and timeouts.

  • Writes to BigQuery and extracts both successful and failed offsets from write results.

  • Merges all accounted-for offsets (BigQuery successes, BigQuery errors, source gaps, and invalid payloads).

  • Commits them using an external checkpoint authority (e.g., Firestore).

  • After an abrupt failure, can simply be restarted, with the Kafka consumer reset to the last committed checkpoint.
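The "contiguous offsets" step above can be sketched in plain Python. This is a minimal illustration, not the author's code: the function name `advance_checkpoint` and the `gap_deadlines` structure are my own, and in the real pipeline this logic would live inside a stateful Beam transform.

```python
def advance_checkpoint(last_committed, done_offsets, gap_deadlines, now):
    """Advance the checkpoint over contiguous accounted-for offsets.

    done_offsets: offsets fully accounted for (BigQuery success, BigQuery
    error, or invalid payload).
    gap_deadlines: offset -> wall-clock time after which a missing offset
    is declared a source gap and skipped, so one hole cannot stall
    checkpoint progression forever.
    """
    next_off = last_committed + 1
    while True:
        if next_off in done_offsets:
            # The next offset is fully processed: safe to commit past it.
            last_committed = next_off
        elif now >= gap_deadlines.get(next_off, float("inf")):
            # The offset never arrived within the timeout: treat it as a
            # source gap and step over it.
            last_committed = next_off
        else:
            # A still-pending offset blocks further progression; stop here
            # so the checkpoint only ever covers a contiguous prefix.
            break
        next_off += 1
    return last_committed
```

The key property is that the checkpoint only moves across a contiguous, fully-accounted-for prefix, which is what makes a restart from the committed offset safe under at-least-once semantics.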

This design ensures restart safety, monotonic checkpoint progression, and predictable drain behavior.
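The merge-and-commit step can be sketched as follows, with an in-memory dict standing in for the external checkpoint authority (Firestore in the actual design). The class and method names here are illustrative assumptions, not the author's API:

```python
class CheckpointAuthority:
    """In-memory stand-in for a durable checkpoint store such as Firestore."""

    def __init__(self):
        self._committed = {}  # (topic, partition) -> highest committed offset

    def commit(self, topic, partition, offset):
        """Commit an offset, enforcing monotonic progression.

        A stale or duplicate commit (e.g. one replayed after a restart) is a
        no-op, so retries can never move the checkpoint backwards.
        """
        key = (topic, partition)
        if offset > self._committed.get(key, -1):
            self._committed[key] = offset
            return True
        return False

    def restart_position(self, topic, partition):
        """Offset to seek the Kafka consumer to after a restart."""
        return self._committed.get((topic, partition), -1) + 1


# Offsets from every outcome stream are merged before committing:
bq_success = {10, 11, 13}
bq_error = {12}            # routed to a dead-letter path, but accounted for
invalid_payload = {14}
accounted = bq_success | bq_error | invalid_payload

authority = CheckpointAuthority()
authority.commit("events", 0, max(accounted))   # True: checkpoint advances to 14
authority.commit("events", 0, 12)               # False: stale commit is ignored
print(authority.restart_position("events", 0))  # resume at offset 15
```

Treating BigQuery errors and invalid payloads as "accounted for" (after routing them elsewhere) is what lets the checkpoint keep progressing instead of stalling on bad records.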

Dataflow Diagram:
[diagram: restart-safe Kafka to BigQuery pipeline with durable checkpointing]

Link to GitHub:
Code Link

I'm open to suggestions for refining this solution further.
