
Secure Jsonnet Pool context deadline exceeded #3923

Open
3 of 5 tasks
Sytten opened this issue May 15, 2024 · 7 comments
Labels
bug Something is not working.

Comments

Sytten commented May 15, 2024

Preflight checklist

Ory Network Project

No response

Describe the bug

We are trying to upgrade from commit https://github.com/ory/kratos/tree/0c5ea9bf735a67ef35011ba41d7f98afc6f8e118 to v1.1.

We get this error:

time=2024-05-15T17:00:51Z level=error msg=ExecutePostRegistrationPostPersistHook hook failed with an error. audience=application error=map[message:jsonnetsecure: eval: context deadline exceeded] executor=*hook.WebHook executor_position=1 executors=[*hook.Verifier *hook.WebHook] flow_method=password

The timeout is weird because it fires after a bit more than one second, and our Jsonnet is very simple:

function(ctx) { identity: ctx.identity }

It didn't fail locally on macOS, but it did fail on our cloud provider, which suggests something odd is going on with the new pool/VM system that was introduced in caido/kratos@master...ory:kratos:master#diff-ab967ab1a2f3a1b769106eeb7bfe892ef0e81d1d27811fa15be08e6749feee1fR58-R77

The cloud machine is small, but I would not expect the Jsonnet to take 1s to resolve. So maybe the timeout is warranted, but somehow the pool is stuck.

Reproducing the bug

  1. Use Docker on Linux
  2. Set up Kratos with a registration webhook using function(ctx) { identity: ctx.identity }
  3. Try to register a new user

Relevant log output

time=2024-05-15T17:00:51Z level=error msg=ExecutePostRegistrationPostPersistHook hook failed with an error. audience=application error=map[message:jsonnetsecure: eval: context deadline exceeded] executor=*hook.WebHook executor_position=1 executors=[*hook.Verifier *hook.WebHook] flow_method=password

Relevant configuration

No response

Version

v1.1

On which operating system are you observing this issue?

Linux

In which environment are you deploying?

Docker

Additional Context

No response

Sytten added the bug label on May 15, 2024
Sytten (Author) commented May 15, 2024

So this 1s timeout seems to be hardcoded: https://github.com/ory/x/pull/755/files#diff-e8a7ac45033a2903ea00a56a5e7957e57ea7b60a3e9e7d4f23666f375fe2eec5R185

Why the machine takes 1s to process it is the real question. Looking at my logs from the previous version, the whole flow doesn't take anywhere near 1s:

time=2024-05-15T17:54:39Z level=info msg=A new identity has registered using self-service registration. audience=audit 
time=2024-05-15T17:54:39Z level=info msg=Dispatching webhook audience=application host:cloud-backend:6000 method:POST path:/webhook/kratos/registration query:<nil> remote: scheme:http] s
time=2024-05-15T17:54:39Z level=info msg=[DEBUG] POST http://cloud-backend:6000/webhook/kratos/registration audience=application
time=2024-05-15T17:54:39Z level=info msg=completed handling request size:1043 status:200 text_status:OK took:575.330798ms]

Sytten changed the title from "Jsonnet context deadline exceeded" to "Secure Jsonnet Pool context deadline exceeded" on May 15, 2024
Sytten (Author) commented May 15, 2024

I think it would be nice to have an option to disable the pool and revert to the older in-process VM. @alnr, if you agree, I am happy to push a PR adding that setting.

Sytten (Author) commented May 15, 2024

It's odd, because the previous version already spawned a new process per execution, so in theory a pool should only improve performance; it also rules out the idea that the platform is somehow unable to spawn processes.
That leaves a few possibilities:

  • The timeout is simply too short, since it includes the Acquire (https://github.com/ory/x/pull/755/files#diff-e8a7ac45033a2903ea00a56a5e7957e57ea7b60a3e9e7d4f23666f375fe2eec5R198), which means that if the pool is already busy it is very easy to exceed the timeout (say one render takes 200ms and you have only one worker: it takes just 5 concurrent requests to go over it).
  • The Acquire is not fair, which often happens in pools that are not carefully implemented; some tasks end up waiting much longer than others.
  • The worker died or got stuck, but the parent process was not notified.

I think option 3 is the most likely, since this was happening in staging without any load (only one user trying to register). I am not sure how to get logs from that worker process, though.
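The first possibility is easy to demonstrate with a toy model (this is a sketch, not Ory's actual pool code): one worker, a 300ms render, six concurrent requests, and a 1s per-request deadline that covers both Acquire and evaluation. Only the requests that acquire the worker before their deadline succeed; the rest surface as "context deadline exceeded" without any render ever being slow.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	// Hypothetical single-worker pool modeled as a 1-slot semaphore.
	pool := make(chan struct{}, 1)

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		ok       int
		timedOut int
	)

	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The 1s deadline covers BOTH acquiring a worker and evaluating.
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()

			select {
			case pool <- struct{}{}: // Acquire: time spent waiting here counts
				time.Sleep(300 * time.Millisecond) // simulated 300ms render
				<-pool                             // Release
				mu.Lock()
				ok++
				mu.Unlock()
			case <-ctx.Done(): // surfaces as "context deadline exceeded"
				mu.Lock()
				timedOut++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Printf("completed: %d, deadline exceeded: %d\n", ok, timedOut)
}
```

With these numbers, renders finish at roughly 300/600/900/1200ms, so four requests acquire the worker inside 1s and two time out while still queued.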

alnr (Contributor) commented May 15, 2024

Not sure the process pool is actually active in the OSS release. Are you seeing persistent kratos jsonnet -0 processes?

alnr (Contributor) commented May 15, 2024

We've been running the process pool in production for a while without any regressions but with 100x better performance.

alnr (Contributor) commented May 15, 2024

Are you running your process with GOMAXPROCS=1 or similar perhaps?

Sytten (Author) commented May 15, 2024

Not sure the process pool is actually active in the OSS release. Are you seeing persistent kratos jsonnet -0 processes?

I reverted, but I will take a look. I am fairly sure it is active, since the "spawn a process every time" option doesn't have a timeout AFAIK.

We've been running the process pool in production for a while without any regressions but with 100x better performance.

Interesting. It was very easy to trigger for us, but it's a super small machine since it's a staging environment. What I am not sure about is why it would work with a spawned process per execution but hang for 1s with the pool. Unless before it was running in-process?

Are you running your process with GOMAXPROCS=1 or similar perhaps?

No, but it's a 0.25-CPU machine on render.com, so I am guessing a similar outcome.
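This is worth checking directly: on a fractionally-provisioned container, the Go runtime still sees the host's CPU count, so GOMAXPROCS defaults far above the 0.25-CPU quota and everything (including pool workers) contends for a sliver of CPU. A quick sketch to inspect what the runtime believes it has:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU reports the host's logical CPUs, not the container's quota.
	fmt.Println("NumCPU:", runtime.NumCPU())
	// GOMAXPROCS(0) queries the current value without changing it;
	// it defaults to NumCPU unless overridden (env var or automaxprocs).
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

If GOMAXPROCS is much larger than the CPU quota, something like uber-go/automaxprocs (or setting the GOMAXPROCS env var) may reduce scheduling thrash.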
