
Memory and redis issues on Worker instance #489

Open
leobarcellos opened this issue Aug 20, 2024 · 9 comments

Comments

@leobarcellos (Contributor)

Hey there! I'm experiencing two weird errors and I don't know what else to do anymore.

The first one looks like some sort of memory leak:
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory


I already tried increasing the instance's memory (it's at 4 GB now) and changing Node's max old space size with the NODE_OPTIONS="--max-old-space-size=3072" env var, but it's still happening. I'm not sure what else I could do to mitigate this.

The second one is related to Redis: I often receive ReplyError: READONLY You can't write against a read only replica. My Redis instance is separate from the worker, but I'm not sure what's causing this. This one is happening frequently.


Any thoughts on how to handle these?

@leobarcellos (Contributor, Author)

I just ran these commands on Redis; hopefully this will prevent the second error mentioned above.

SLAVEOF NO ONE
REPLICAOF NO ONE
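If the client library in use is ioredis, there is also a client-side mitigation for transient READONLY errors (e.g. after a failover swaps master and replica roles): the reconnectOnError option, which forces a reconnect so the client re-resolves the master. This is a sketch of that pattern, assuming ioredis; the predicate is kept as a plain function so it can be tested on its own:

```javascript
// Redis prefixes the error reply with READONLY when a write hits a replica.
function isReadonlyError(err) {
  return err.message.startsWith('READONLY');
}

// Hypothetical wiring, assuming ioredis is the Redis client in use:
// const Redis = require('ioredis');
// const redis = new Redis(process.env.REDIS_URL, {
//   // Returning true tells ioredis to reconnect, re-resolving the master node.
//   reconnectOnError: (err) => isReadonlyError(err),
// });

console.log(isReadonlyError(new Error("READONLY You can't write against a read only replica.")));
```

This only treats the symptom, though; if a standalone Redis instance is reporting itself as a replica, something (a restart with stale config, or an external command) is demoting it, and SLAVEOF/REPLICAOF NO ONE will only hold until that happens again.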

@pushchris (Contributor)

I haven't seen either of those before! How do you have Redis set up? Are you in cluster mode? Using ElastiCache or running your own instance?

@leobarcellos (Contributor, Author)

@pushchris I'm running on AWS ECS Fargate, with one task for the worker and a separate task for Redis.
The "slave" error went away with the commands I mentioned earlier. However, the first issue is the real problem here: I often get this memory heap error, and it causes deadlock issues on the databases, etc.

@leobarcellos (Contributor, Author)

Update: the Redis issue was not related to the memory issue.

I just moved my deployment to an EC2 "monolith" (ui / api / worker / redis) with AWS RDS.

It's running more smoothly than on ECS, but I just experienced the same error:

[screenshot]

@pushchris We have about 10 projects and each of them has a lot of lists. I've noticed that all lists get updated at the same time when I open a project in the UI (is this the normal behavior?). Also, every time I get this heap memory error, I have around 2 to 4 campaigns running, sending 1k-2k emails each.

Another piece of info that might be useful: when I check the worker logs, I see a huge number of queue:job:started entries, like really SEVERAL. The screenshot is just illustrative because it prints a lot of them.

[screenshot]

Is this normal, or is something wrong here?

@leobarcellos (Contributor, Author)

By the way, I just found this answer on Stack Overflow:
https://stackoverflow.com/questions/55613789/how-to-fix-fatal-error-ineffective-mark-compacts-near-heap-limit-allocation-fa

I just updated the env for the worker like this: NODE_OPTIONS=--max_old_space_size=7580

Let's see if this helps.

@leobarcellos (Contributor, Author)

Well, it keeps happening, but I'm not experiencing deadlock issues or stuck campaigns anymore. I guess those were only happening on ECS Fargate.

[screenshot]

It's throwing errors, but at least it's working. Not sure what else I could do to fix this.

@pushchris (Contributor)

Unfortunately, heap errors like that are really hard to debug without actually taking a memory dump. Typically they're caused by some sort of memory leak that builds up over time.

To address your other question: what do you have your log level set to? All of that output comes from being at the info level; if you drop down to warn you'll get less console logging, which should help somewhat with memory (though I don't expect anything substantial) and mostly help with speed.

I would be very curious to know which job specifically is causing that memory issue, and how large the data being passed around is. Do you have a way of monitoring memory on those instances over time? A big spike that causes the crash would be different from a slow, gradual increase, and knowing which would be helpful.
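For the kind of over-time monitoring mentioned above, a minimal sketch using Node's built-in process.memoryUsage() can distinguish a spike from a slow climb (the sampler function and interval here are illustrative, not part of the project):

```javascript
// Minimal periodic memory sampler: a slow, steady climb in heapUsed across
// samples points to a leak; a sudden jump points to one oversized job payload.
function sampleMemoryMb() {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  const mb = (n) => Math.round(n / 1024 / 1024);
  return { rss: mb(rss), heapUsed: mb(heapUsed), heapTotal: mb(heapTotal), external: mb(external) };
}

// Log one sample per minute alongside the worker's normal output:
// setInterval(() => console.log(new Date().toISOString(), sampleMemoryMb()), 60_000);

console.log(sampleMemoryMb());
```

Correlating the timestamps of these samples with queue:job:started log entries would also narrow down which job type the growth coincides with.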

@leobarcellos (Contributor, Author)

Yeah, I was at the debug level to actually see what's going on. I've set it back to the error log level, but it's still happening.

I'll drop the screenshots here; it often causes a deadlock error on the journey_process job.

Same errors, different days (it's actually happening every day). However, it seems to be working just fine besides the errors.

[screenshots]

@leobarcellos (Contributor, Author)

More info: the worker instance's memory keeps increasing, which is why things start to fail at a certain point:

[screenshot]

It starts at around 50 MB... 70 MB, and then keeps increasing: 300 MB... 700 MB... 1 GB... until it hits the maximum available.

I just changed the Docker Compose config to cap the worker at 2 GB of memory, and I'm spawning 5 workers with docker compose up worker --scale worker=5 -d. Hopefully that will contain the problem.
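For reference, a hypothetical compose fragment for that containment strategy (service and image names are illustrative, not the project's actual config); capping each replica means a leak kills one container, which Docker restarts, instead of exhausting the whole host:

```yaml
services:
  worker:
    image: parcelvoy/platform   # adjust to your actual worker image
    environment:
      # Keep V8's heap limit below the container cap so Node fails with a
      # catchable heap error rather than being OOM-killed by the kernel.
      NODE_OPTIONS: "--max-old-space-size=1536"
    deploy:
      resources:
        limits:
          memory: 2g
    restart: unless-stopped
```

One caveat: the V8 heap limit should stay comfortably under the container limit, since rss includes buffers and native memory outside the JS heap.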

However, I don't know exactly where this memory leak is happening.
