Kafka Streams repartitioning skews data onto a single partition, causing performance issues

I have a use case where I need to count the number of events based on a field in the payload value.

The data in the stream looks something like this:
Key -> String (unique per record, so it distributes well across partitions)
Value -> JSON (has a lot of fields; one of them is a date, which is the same for a large chunk of records)

Now I have to count records grouped by that common date field.

The problem is:

When I repartition the stream on the new key (the date field) using a groupBy, map, or selectKey operation followed by an aggregation, I end up with all the messages in the stream skewed onto one partition of the repartition topic.

This hurts performance, because a single partition (and therefore a single stream task) has to handle nearly all of the load.

The impact gets worse as the number of records sharing the same date grows.
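
To illustrate the skew described above, here is a minimal, self-contained sketch in plain Java (it does not use the Kafka Streams API). It assumes key-based hash partitioning, where the partition is derived from a hash of the key modulo the partition count. `String.hashCode()` is only a stand-in for the hash that Kafka applies to the serialized key, and the class name `SkewDemo`, the partition count of 6, and the sample date are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkewDemo {

    // Stand-in for key-hash partitioning (assumption: not Kafka's actual hash).
    // The property that matters is that identical keys always map to the same partition.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many records land on each partition for the given keys.
    public static int[] countPerPartition(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numPartitions = 6; // hypothetical partition count
        List<String> originalKeys = new ArrayList<>();
        List<String> dateKeys = new ArrayList<>();

        for (int i = 0; i < 10_000; i++) {
            originalKeys.add("event-" + i); // distinct key per record
            dateKeys.add("2024-01-15");     // every record re-keyed to the same date
        }

        // Distinct keys spread over the partitions.
        System.out.println("original keys: "
                + Arrays.toString(countPerPartition(originalKeys, numPartitions)));
        // A single shared key puts every record on one partition.
        System.out.println("date keys:     "
                + Arrays.toString(countPerPartition(dateKeys, numPartitions)));
    }
}
```

With the date as the key, every record that shares a date hashes to the same partition, so the aggregation after the repartition is effectively handled by one task no matter how many partitions the repartition topic has.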

Is there a better alternative for this use case?

#Kafkastream
