ActionRunners get stuck when there is transient connection loss to RabbitMQ/ Zookeeper #6097

Open
sravs-dev opened this issue Dec 14, 2023 · 1 comment

Comments

@sravs-dev
Contributor

SUMMARY

We run an st2 HA setup in a k8s environment, with Zookeeper as the coordination backend. We observed that actionrunners, schedulers, and workflowengines hang when there are transient connectivity issues with Mongo/RabbitMQ/Zookeeper.
The Mongo/RabbitMQ/Zookeeper containers get restarted during k8s maintenance operations.

STACKSTORM VERSION

st2 3.7.0, on Python 3.6.8

OS, environment, install method

st2 helm charts in kubernetes. CentOS base image.

Steps to reproduce the problem

Introduce connectivity errors by restarting RabbitMQ/Mongo.
When st2 services lose their connection to RabbitMQ or Mongo, they try to reconnect automatically. Once the retry count is exceeded, the service hangs, even after RabbitMQ/Mongo comes back up.

Expected Results

The number of retries and the backoff time between retries should be configurable so that they can be tuned for individual st2 deployments.
Or
St2 services should be configurable to exit on connectivity failures once the retry threshold is reached. In a k8s environment, k8s automatically restarts the container when the process with PID 1 dies.
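
To illustrate the behaviour we are asking for, here is a minimal sketch in plain Python. It is not st2 code: `run_with_retries`, `max_retries`, `backoff_seconds`, and the `exit_on_error` flag are all hypothetical names, and the exponential backoff is just one possible policy.

```python
import sys
import time


def run_with_retries(connect, max_retries=5, backoff_seconds=1.0,
                     exit_on_error=True, sleep=time.sleep):
    """Call connect() until it succeeds or the retry budget is spent.

    All parameter names are hypothetical, for illustration only.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            if attempt > max_retries:
                if exit_on_error:
                    # Exit with a non-zero code instead of hanging, so the
                    # container runtime can restart the process.
                    sys.exit(1)
                raise
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff_seconds * 2 ** (attempt - 1))
```

With both knobs exposed, a deployment could choose between a long retry window and a fast exit that hands recovery over to k8s.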

Actual Results

St2 services (actionrunner, scheduler, workflowengine, rulesengine) hang and are unable to serve traffic. A manual restart of these services is needed to resolve the issue.
In an HA setup, restarting all services takes about 15-20 minutes, which amounts to an outage.

Recommendation

The RabbitMQ errors seem to come from here: https://github.com/StackStorm/st2/blob/master/st2common/st2common/transport/consumers.py#L197 . An exit_on_error option could be added to st2.conf, defaulting to false.
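
As a rough sketch of how such an option might be read, assuming it lives under a `[messaging]` section (the section placement and the `load_exit_on_error` helper are assumptions, and the standard-library `configparser` is used here only for illustration, not as the way st2 actually loads its configuration):

```python
import configparser

# Hypothetical st2.conf fragment; exit_on_error does not exist today.
SAMPLE_CONF = """
[messaging]
exit_on_error = True
"""


def load_exit_on_error(text, section="messaging", option="exit_on_error"):
    """Return the flag from INI text, defaulting to False when it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # fallback=False keeps the current behaviour unless the operator opts in.
    return parser.getboolean(section, option, fallback=False)
```

Defaulting to false would keep existing deployments unchanged, while k8s deployments could opt in.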

A similar setting for Mongo would help. I would like to hear the maintainers' thoughts. Happy to fix and test with some guidance.

@sravs-dev
Contributor Author

sravs-dev commented Dec 18, 2023

Related issues
#4775
#4958
#4731
#4020

@sravs-dev sravs-dev changed the title from "ActionRunners get stuck when there is transient connection loss to Rabbit/MQ" to "ActionRunners get stuck when there is transient connection loss to RabbitMQ/ Zookeeper" on Dec 21, 2023