Few queries #54
Hello, thanks for your interest.
a. I'm not a CDC / Debezium expert, however here are some elements:
b. @larousso is working on an Akka-free version using Reactor instead.
c. Current Kafka publication speed is enough for our needs, therefore we chose to keep it simple for Kafka publication. I think it may be possible to add an opt-in option for Avro / Protobuf support, but it's definitely not on our roadmap. Perhaps it could be a good contribution, what do you think @larousso?
d.
e. The main difference is in the driver used to communicate with Postgres: the non-blocking JOOQ/Akka implementation is based on the reactive Vert.x driver for Postgres, while the standard JOOQ/Kafka implementation uses a non-reactive driver (see the driver sketch just after this comment).
f. Could you elaborate on this point?
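For illustration only, here is a minimal sketch of the two Postgres driver styles mentioned in point e. This is not Thoth's actual setup API; the host, database, credentials and pool size are placeholders.

```java
// Hedged illustration of the two Postgres driver styles discussed above,
// not Thoth's configuration API. Connection parameters are placeholders.
import io.vertx.pgclient.PgConnectOptions;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.PoolOptions;
import org.postgresql.ds.PGSimpleDataSource;

public class DriverStyles {

    // Reactive, non-blocking Vert.x client: queries return futures, so no thread
    // is blocked while waiting for Postgres (the non-blocking JOOQ/Akka flavour).
    static PgPool reactiveClient() {
        PgConnectOptions connect = new PgConnectOptions()
                .setHost("localhost")
                .setPort(5432)
                .setDatabase("eventstore")
                .setUser("user")
                .setPassword("password");
        return PgPool.pool(connect, new PoolOptions().setMaxSize(5));
    }

    // Classic blocking JDBC DataSource: each query occupies a thread until the
    // database answers (the standard JOOQ/Kafka flavour).
    static PGSimpleDataSource blockingDataSource() {
        PGSimpleDataSource ds = new PGSimpleDataSource();
        ds.setURL("jdbc:postgresql://localhost:5432/eventstore");
        ds.setUser("user");
        ds.setPassword("password");
        return ds;
    }
}
```

Which one fits depends mainly on whether the surrounding application is already reactive.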
Hi,
a. With the current implementation, the events are published to Kafka in near real time. I think that with Debezium there would be some latency.
b. The first version of Thoth used Akka as the reactive-streams implementation, but with the change of licence the Akka implementation was kept and a Reactor implementation was added.
c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that (a sketch follows this comment).
d.
e. I don't know if @ptitFicus uses JDBC in production, but I think the non-blocking one is the more stable solution.
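To make point c concrete, here is a hedged sketch of what a custom Avro serializer could look like. AccountEvent is a hypothetical Avro-generated class used purely for illustration; none of this is part of Thoth's API.

```java
// A minimal sketch of a custom Kafka value serializer using Avro binary encoding.
// AccountEvent is a hypothetical Avro-generated class, not a Thoth type.
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.serialization.Serializer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AccountEventAvroSerializer implements Serializer<AccountEvent> {

    private final SpecificDatumWriter<AccountEvent> writer =
            new SpecificDatumWriter<>(AccountEvent.class);

    @Override
    public byte[] serialize(String topic, AccountEvent event) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        try {
            // Write the event with Avro's compact binary encoding
            writer.write(event, encoder);
            encoder.flush();
        } catch (IOException e) {
            throw new RuntimeException("Avro serialization failed", e);
        }
        return out.toByteArray();
    }
}
```

Such a serializer would then be registered on the Kafka producer through the usual `value.serializer` configuration.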
Thanks to both @larousso and @ptitFicus for the detailed responses.
It preserves the order of events, and the latency is there but very low. An asynchronous process runs in the background that copies the event records from the outbox table to Kafka. If it is restarted, it reads again, which may cause duplicates in Kafka. This ensures at-least-once delivery (a rough sketch of such a relay is included after this comment).
I don't know how you have implemented it, but what if the process that keeps track of these messages itself dies?
It was left by mistake; I merged this question into point c. I wanted to ask whether it is possible (or makes sense while using ES) to put a Kafka proxy in front of Kafka so that all events written to or read from it by microservices go through this proxy. There are several benefits: for instance, if there are hundreds of microservices, they don't need to know the Kafka cluster details such as its location, they just need to know about this proxy. It has other benefits too, which are stated here: [gRPC/REST proxy for Kafka]
Correct. With regards to both of you
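Purely as an illustration of the outbox-relay behaviour discussed above (duplicates after a restart, hence at-least-once delivery), here is a simplified sketch. It is not Thoth's actual implementation; the table and column names are assumptions.

```java
// Simplified outbox-relay sketch, not Thoth's actual code. It polls unpublished
// events in insertion order, publishes them to Kafka, then marks them as published.
// A crash between the send and the update means the same events are re-read on
// restart, which is why consumers can see duplicates (at-least-once delivery).
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OutboxRelay {

    private final Connection connection;
    private final KafkaProducer<String, String> producer;

    public OutboxRelay(Connection connection, KafkaProducer<String, String> producer) {
        this.connection = connection;
        this.producer = producer;
    }

    // One polling pass; a background scheduler would call this in a loop.
    public void relayOnce() throws Exception {
        String select = "SELECT id, entity_id, payload FROM outbox "
                + "WHERE published = false ORDER BY id LIMIT 100";
        try (PreparedStatement ps = connection.prepareStatement(select);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                long id = rs.getLong("id");
                // Keying by entity id keeps per-entity ordering inside Kafka
                producer.send(new ProducerRecord<>("events",
                        rs.getString("entity_id"), rs.getString("payload"))).get();
                // If the process dies before this update, the event is re-sent on restart
                try (PreparedStatement update = connection.prepareStatement(
                        "UPDATE outbox SET published = true WHERE id = ?")) {
                    update.setLong(1, id);
                    update.executeUpdate();
                }
            }
        }
    }
}
```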
Hi @larousso and @ptitFicus
This project looks valuable. I have a few queries relating to it.
a. Why have you decided not to use CDC / Debezium, and how are you able to achieve this without using CDC? Pretty interesting. The documentation says "It keeps trying writing to Kafka until succeeded"; won't that impact performance?
b. What is the role of Akka in this project? Can we do away with it, since it is under the BSL now?
c. While writing to Kafka, the Avro/Protobuf formats, which are really fast, have not been used. Any reason for that? Is it possible to integrate Thoth with this gRPC/REST proxy for Kafka?
d. Does it support these: optimistic concurrency (multiple users writing events to the same table at the same time), event replay (right from the beginning, for audits etc.), snapshots, topic compaction (delete/overwrite a message record for a given key), and event versioning?
e. What is the difference between the non-blocking JOOQ and the standard JOOQ/Kafka implementations? For which use cases should we use which implementation?
f. Is there no need for a Kafka proxy?
Thanks