Kafka Streams repartitioning skews data onto a single partition, causing performance issues
I have a use case in which I have to count the number of events based on a parameter in the payload value.
Data in the stream looks something like this:
Key -> String (distinct, unique, and distributes well)
Value -> JSON (has a lot of fields; one field is a date, which is the same for a large chunk of records)
Now I have to count records based on the common date field.
The problem is:
When I repartition the stream on the new key (the date field) using a groupBy, map, or selectKey operation followed by an aggregation,
I end up with all messages in the stream skewed onto one partition of the repartition topic,
which hampers performance.
The impact gets rapidly worse as the number of records on the stream with the same date grows.
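To make the skew concrete, here is a minimal sketch in plain Java (not Kafka Streams code) of why re-keying by date behaves this way. It assumes a simplified partitioner based on `String.hashCode()`; Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, but the property that matters is the same: equal keys always map to the same partition. The class name, key names, and partition count are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SkewDemo {
    // Simplified stand-in for key-based partitioning (assumption: not Kafka's murmur2).
    // The same key always yields the same partition number.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many records would land on each partition for the given keys.
    static Map<Integer, Integer> distribution(Iterable<String> keys, int numPartitions) {
        Map<Integer, Integer> perPartition = new TreeMap<>();
        for (String key : keys) {
            perPartition.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return perPartition;
    }

    public static void main(String[] args) {
        int numPartitions = 8; // hypothetical partition count
        List<String> originalKeys = new ArrayList<>();
        List<String> dateKeys = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            originalKeys.add("event-" + i); // unique keys, as in the source topic
            dateKeys.add("2024-01-01");     // every record shares the same date
        }
        // Unique keys spread over several partitions; the date key uses exactly one.
        System.out.println("by original key: " + distribution(originalKeys, numPartitions));
        System.out.println("by date key:     " + distribution(dateKeys, numPartitions));
    }
}
```

With the date as the key, a single partition (and therefore a single stream task) receives every record for that date, no matter how many partitions the repartition topic has.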
Is there a better alternative for this use case?
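One direction that is often suggested for this kind of hot key is to "salt" the new key and count in two stages: first count per (date, salt) so the work is spread out, then sum the partial counts per date. The sketch below shows only the idea in plain Java, not a Kafka Streams topology; the `#` separator, the number of salt buckets, and all names are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class SaltedCount {
    // Stage 1 key: the date plus a salt derived from the original, well-distributed key.
    static String saltedKey(String originalKey, String date, int saltBuckets) {
        int salt = Math.floorMod(originalKey.hashCode(), saltBuckets);
        return date + "#" + salt;
    }

    // Stage 1: partial counts per salted key. In a real topology these keys
    // would hash to different partitions instead of a single one.
    static Map<String, Long> partialCounts(Map<String, String> dateByOriginalKey, int saltBuckets) {
        Map<String, Long> partials = new HashMap<>();
        for (Map.Entry<String, String> e : dateByOriginalKey.entrySet()) {
            partials.merge(saltedKey(e.getKey(), e.getValue(), saltBuckets), 1L, Long::sum);
        }
        return partials;
    }

    // Stage 2: strip the salt and sum the partial counts for each date.
    static Map<String, Long> mergeByDate(Map<String, Long> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : partials.entrySet()) {
            String date = e.getKey().substring(0, e.getKey().lastIndexOf('#'));
            totals.merge(date, e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, String> records = new HashMap<>(); // original key -> date field
        for (int i = 0; i < 100; i++) records.put("a" + i, "2024-01-01");
        for (int i = 0; i < 50; i++) records.put("b" + i, "2024-01-02");

        Map<String, Long> partials = partialCounts(records, 4);
        System.out.println("partial counts: " + partials);
        System.out.println("totals by date: " + mergeByDate(partials));
    }
}
```

The second stage still funnels each date to one place, but it only has to handle one small partial count per salt bucket rather than every raw record. The trade-off is an extra aggregation step and one more repartition.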
