Apache Spark – Job completion tracking across Kafka

I have a distributed pipeline that reads jobs from a queue (SQS). Each job produces millions of messages, which are sent to a Kafka topic, TopicA, for analysis. After the analysis is done, each message is sent to another Kafka topic, TopicB, which is consumed to store the results in a data store.

The problem I’m facing right now is that I have to track the completion of each job. A job is considered complete when all the messages it generated have been stored in the data store.
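To make that completion criterion concrete, here is a minimal in-memory sketch of it. It assumes that every message carries a job id and a unique message id, and that the total number of messages a job produced is known once the job finishes producing; none of that is guaranteed by the pipeline as described. The names `CompletionTracker`, `register_job` and `record_stored` are hypothetical and are not part of any Spark or Kafka API.

```python
from dataclasses import dataclass, field


@dataclass
class JobProgress:
    """Progress of one job (hypothetical structure, kept in memory for illustration)."""
    expected: int                                   # total messages the job produced
    stored_ids: set = field(default_factory=set)    # ids confirmed written to the data store

    def is_complete(self) -> bool:
        return len(self.stored_ids) >= self.expected


class CompletionTracker:
    """Tracks, per job, how many distinct messages have reached the data store."""

    def __init__(self):
        self._jobs = {}

    def register_job(self, job_id: str, expected: int) -> None:
        # Assumption: the producer can report the total message count for the job.
        self._jobs[job_id] = JobProgress(expected=expected)

    def record_stored(self, job_id: str, message_id: str) -> bool:
        """Record one stored message; return True once the job is complete."""
        progress = self._jobs[job_id]
        # A set makes redelivered (duplicate) messages count only once.
        progress.stored_ids.add(message_id)
        return progress.is_complete()


tracker = CompletionTracker()
tracker.register_job("job-1", expected=3)
# "m2" arrives twice to mimic a redelivery from the TopicB consumer.
results = [tracker.record_stored("job-1", m) for m in ["m1", "m2", "m2", "m3"]]
```

In this sketch the job only reports complete on the last distinct message, and the duplicate does not advance the count. Keeping every message id in memory is only for illustration; it is not something I expect to work at the volumes I’m dealing with.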

Since there are hundreds of millions of messages in flight at any given time, I’m having a hard time coming up with something that can track the completion of a job. One option I’m exploring is Spark Streaming, which I’m totally new to.

Is Spark Streaming a feasible way to solve this problem?