
provide task_attempt table #50

Merged: 11 commits into main, Nov 9, 2024
Conversation

maxcountryman (Owner):

This reworks how task executions are tracked by providing a new table, task_attempt which is a log of each dequeue of a given task from the queue. Put another way, each time a task is executed, a row is inserted into the attempts table and the result of that execution is stored on that row.

As such, the task table is now a container for the fixed metadata of a specific enqueue of a task. This means that metadata associated with an attempt now lives on the task_attempt table. For example, error messages are stored on the attempts table and likewise creation and completion timestamps also live on this table.

Importantly some fields of the task row have been removed as they are now derived from the attempts log. For instance, task_attempt provides an updated_at column which allows for e.g. determining succeeded_at or failed_at timestamps of the task enqueue. Similarly retry_count is a function of the number of attempt rows associated to a given task.

Furthermore, the retry policy is now embedded as a JSONB column on the task row.

Related discussion: #17
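Pieced together from the column fragments quoted in the review threads, the new table might look roughly like this. This is a sketch only; the primary key and exact constraints are assumptions, not the actual migration:

```sql
-- Sketch: approximate shape of the new attempts table, reconstructed from
-- the fragments quoted in the review. Constraints are assumptions.
create table underway.task_attempt (
    task_id         uuid not null,
    task_queue_name text not null,
    attempt_number  integer not null,

    -- Task state.
    state underway.task_state not null default 'pending',

    -- Error metadata.
    error_message text,

    created_at timestamp with time zone not null default now(),
    updated_at timestamp with time zone not null default now(),

    primary key (task_id, task_queue_name, attempt_number)
);
```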

state underway.task_state not null default 'pending',

-- Error metadata.
error_message text,
maxcountryman (Owner, Author):

I considered adding an error_details JSONB column, but it's not clear what would live there. For example, std::error::Error is the limit of the API surface we have; at most it seems error sources could be of interest, but it's unclear whether that's actually valuable, and it could be added later.

kirillsalykin:

In Clojure such things were simpler to handle because you can just throw an exception containing any object, which allows attaching metadata to the error itself (mostly for domain-specific errors, like "Order not in correct state").
Maybe there indeed isn't much data to save.

maxcountryman (Owner, Author):

I think e.g. anyhow, etc provide ways of enriching errors (e.g. they provide context methods) and that's potentially additional metadata that could be stored.

However, I don't like the idea of using anyhow here: we have a concrete error type (task::Error) and so we can directly augment it if we want. That said, I don't know what additional metadata we'd want. Most of it will be provided by the Display impl I think and so to_string is already giving us that.

I'm certainly open to increasing the fidelity here, so if you have ideas about more context we could embed I'll be happy to explore that.

kirillsalykin:

It feels like it would be nice to have more info about the error, but I can't figure out what to put there.


create table underway.task_attempt (
task_id uuid not null,
task_queue_name text not null,

kirillsalykin:

I thought a task always runs in the same queue, so this seems like a task-level field. Am I missing something?

maxcountryman (Owner, Author):

You're right. It's included because task has a composite key, and the queue name is therefore required to create the foreign key.

kirillsalykin:

Ah, I was sure task_id was unique...
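Concretely, the constraint being described would look something like this. A sketch only: the referenced column names on the task table are assumptions:

```sql
-- Sketch: the attempt must reference the task table's composite key,
-- which is why task_queue_name is repeated on task_attempt.
-- Referenced column names are assumptions.
alter table underway.task_attempt
    add constraint task_attempt_task_fk
    foreign key (task_id, task_queue_name)
    references underway.task (id, task_queue_name);
```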

attempt_number integer not null,

-- Task state.
state underway.task_state not null default 'pending',

kirillsalykin:

Small one: I don't think it can be pending. I would expect the attempt to be inserted once it happens, so it's either a failure or a success.

maxcountryman (Owner, Author):

I agree "pending" is probably not right. The attempt is created upon dequeue (this is when we actually start the attempt). So this should likely be "in_progress" as a reasonable default.

kirillsalykin:

In my implementation I didn't bother with in_progress; I inserted only finished attempts (failed or succeeded).

maxcountryman (Owner, Author):

I thought about that too but I think I prefer it this way for instrumentation purposes: by creating the row when we dequeue we'll be able to derive more accurate metrics about task attempts (i.e. it'll incorporate the time it took to execute the task).

It's also a little easier to reason about derived properties knowing that the attempt row will always exist once it's been dequeued.

kirillsalykin:

I'm under the impression that, because of the transactional nature, no one will ever see an attempt in the in_progress state.
Please correct me if I'm wrong.

maxcountryman (Owner, Author):

The queue itself doesn't require a transaction (e.g. dequeue takes a PgConnection). The provided worker implementation does use a transaction, so in that case it should be true.

Holding a transaction open for the duration of task execution is probably okay most of the time but I want to retain the possibility for extending the crate with workers that might not work that way.

Even if we don't end up providing that right away, creating attempt rows in tandem with a dequeue is beneficial in at least a couple of ways:

  1. Encapsulating the state transition in the queue itself is arguably a more sound design (a caller doesn't have to remember to create an attempt row later, or even know about attempt rows at all, which keeps the overall design more coherent in my view) and,
  2. Because attempt rows are created during the dequeue, their timing metrics will be aligned with the moment a task was dequeued; this could be done manually, but I'm not sure what we'd gain by doing that.

If later we're positive that we'll never support out-of-transaction workers I think we could still make a change that constrains the states and removes "in progress" so this isn't necessarily set in stone.
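One way creating the row at dequeue pays off for instrumentation: attempts that have been in progress for too long become directly queryable. A sketch, assuming the column names discussed in this thread:

```sql
-- Sketch: surface attempts that have been running (or stuck) beyond a
-- threshold. Only possible because the row is created at dequeue time.
select task_id, attempt_number, created_at
from underway.task_attempt
where state = 'in_progress'
  and created_at < now() - interval '5 minutes';
```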

-- Error metadata.
error_message text,

created_at timestamp with time zone not null default now(),

kirillsalykin:

I think having started_at/completed_at allows catching when a partial attempt happens.

maxcountryman (Owner, Author):

Correct me if I'm wrong but I think those are always derivable:

  • started_at is always created_at (since an attempt is created upon dequeue)
  • completed_at is updated_at when state is "succeeded" or "failed"

We could rename created_at if that's more clear and technically we could set a separate completed_at column to be sure updated_at isn't overloaded unnecessarily.

kirillsalykin:

you are correct, my bad

maxcountryman (Owner, Author):

Do you think it's worth having a separate completed_at column? I can't immediately think of use cases for changing updated_at after an attempt is completed, but it would ensure that the semantic isn't conflated.

kirillsalykin:

I would prefer started_at/created_at

maxcountryman (Owner, Author):

Great, I'll update that too: I think this is started_at and completed_at, correct?

maxcountryman (Owner, Author):

I've added this now.
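With started_at and completed_at in place, per-attempt execution time becomes directly derivable. A sketch, assuming the column and state names discussed above:

```sql
-- Sketch: per-attempt execution time, derivable once started_at and
-- completed_at exist on the attempt row.
select task_id, attempt_number,
       completed_at - started_at as execution_time
from underway.task_attempt
where state in ('succeeded', 'failed');
```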

drop column if exists last_failed_at;

alter table underway.task
add column if not exists retry_policy jsonb not null default '{"max_attempts": 5, "initial_interval_ms": 1000, "max_interval_ms": 60000, "backoff_coefficient": 2.0}';

kirillsalykin:

I am not a big fan of JSON; knowing the shape of the data helps to reason about things better.
I understand this is more work and maybe isn't worth it, but what do you think about having a custom type here?

maxcountryman (Owner, Author):

Can you go into more detail? What does a custom type entail? Is that a separate table or something else SQLx supports?

kirillsalykin:

I think it is a PostgreSQL-only feature, something like this:

CREATE TYPE book AS (name varchar(255), author varchar(255), rating varchar(5));

CREATE TABLE users
  (
     id    SERIAL,
     name  VARCHAR(255),
     email VARCHAR(100),
     books book
  );

maxcountryman (Owner, Author):

Aha, I didn't realize this was possible. Yes, I'll look at implementing it that way. Thank you.

kirillsalykin:

But again, I am not sure it's worth it, to be honest.
I now think it probably is not...

maxcountryman (Owner, Author):

I think it's not too difficult, something like this:

-- Define retry policies as their own type
create type underway.task_retry_policy as (
    max_attempts         int,
    initial_interval_ms  int,
    max_interval_ms      int,
    backoff_coefficient  float
);

alter table underway.task
    add column if not exists retry_policy underway.task_retry_policy not null 
    default row(5, 1000, 60000, 2.0)::underway.task_retry_policy;

And we can derive the type with SQLx:

#[derive(Debug, Clone, Copy, PartialEq, sqlx::Type)]
#[sqlx(type_name = "task_retry_policy")]
pub struct RetryPolicy {
    pub(crate) max_attempts: i32,
    pub(crate) initial_interval_ms: i32,
    pub(crate) max_interval_ms: i32,
    // SQL `float` is double precision, which SQLx maps to f64 (f32 maps to `real`).
    pub(crate) backoff_coefficient: f64,
}

maxcountryman (Owner, Author):

This is added. It was pretty straightforward to do. I'm not sure if there are downsides I might be overlooking here but it seems nice to create the custom type as you suggested.
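As a side note on semantics: the policy fields suggest a capped exponential backoff. Here is a minimal sketch of how a worker might compute the delay before a given attempt under that assumption; the delay_ms method is hypothetical, not the crate's actual API, and the real implementation may differ:

```rust
// Sketch (assumption): capped exponential backoff derived from the policy
// fields. Not the crate's actual implementation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RetryPolicy {
    pub max_attempts: i32,
    pub initial_interval_ms: i32,
    pub max_interval_ms: i32,
    pub backoff_coefficient: f64,
}

impl RetryPolicy {
    /// Delay before the given attempt (1-based), in milliseconds.
    pub fn delay_ms(&self, attempt: i32) -> i32 {
        let exponent = (attempt - 1).max(0);
        let delay = self.initial_interval_ms as f64
            * self.backoff_coefficient.powi(exponent);
        // Cap at max_interval_ms so delays don't grow without bound.
        (delay as i64).min(self.max_interval_ms as i64) as i32
    }
}

fn main() {
    // Matches the migration's default policy.
    let policy = RetryPolicy {
        max_attempts: 5,
        initial_interval_ms: 1000,
        max_interval_ms: 60_000,
        backoff_coefficient: 2.0,
    };
    assert_eq!(policy.delay_ms(1), 1000);
    assert_eq!(policy.delay_ms(2), 2000);
    // Uncapped this would be 64_000 ms; the policy caps it at 60_000.
    assert_eq!(policy.delay_ms(7), 60_000);
    println!("backoff sketch ok");
}
```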

maxcountryman (Owner, Author):

I should point out that I have not added tests here that specifically target task attempt rows. If we agree this is the right direction, I will plan to add those before we release this.

maxcountryman force-pushed the task-attempts-table branch 2 times, most recently from bdb9c73 to d367106 on November 9, 2024 04:38
Here we rename `DequeuedTask` to `InProgressTask` and rework it to fully
encapsulate state transitions for "in-progress" tasks.

This helps ensure that state transitions are both more ergonomic and
correct.
maxcountryman marked this pull request as ready for review on November 9, 2024 13:15
maxcountryman (Owner, Author):

This is ready for review.

Testing could still be improved but I've added cases for the state transitions that check attempt rows are being created as we'd expect.

kirillsalykin:

it looks good to me

kirillsalykin:

Maybe I'll have more comments once I start using it

maxcountryman (Owner, Author):

@kirillsalykin thanks for all your feedback thus far: very much appreciate it. 🙏

maxcountryman merged commit b2a1bd4 into main on Nov 9, 2024
4 checks passed
maxcountryman deleted the task-attempts-table branch on November 9, 2024 17:18