We had a spike in errors and after that 100% of errors are getting dropped, could someone help me figure out why? #2960

Open
edgariscoding opened this issue Apr 15, 2024 · 9 comments


@edgariscoding
Contributor

Self-Hosted Version

24.3.0 unknown

CPU Architecture

x86_64

Docker Version

24.0.7

Docker Compose Version

2.21.0

Steps to Reproduce

On April 8th (Monday) we experienced a spike in errors dropped. There was nothing peculiar going on that day, and we didn't receive any complaints of downtime for our web application.

[Screenshots: stats page showing the spike and the subsequent 100% of errors dropped]

According to the stats page, this started at 9 a.m. on April 8th, and from then until today 100% of errors have been dropped.

I have rate limiting set up, but that doesn't seem to be the cause, as can be seen in the screenshots below.

I don't see any warnings in the System Warnings page in the admin panel.

Anybody have any suggestions?

I'd love it if Sentry showed a reason why the errors were dropped.

Expected Result

Expected errors to not be dropped.

Actual Result

Docker compose logs:
https://pastebin.com/raw/TXHJL7i3

[Screenshots: stats page and rate limiting configuration]

Event ID

No response

@hubertdeng123
Member

That is indeed interesting. I'm seeing Net Exception: Socket is not connected, Stack trace in your ClickHouse logs. Maybe your Sentry instance lost its connection there?
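
If you want to gauge how often that error is occurring, something like the following should do it. This is just a sketch; the clickhouse service name is assumed from the default self-hosted docker-compose.yml.

# Count "Socket is not connected" errors in the ClickHouse logs over the last 24 hours
# (the "clickhouse" service name is assumed from the default compose file).
docker compose logs --since 24h clickhouse | grep -c "Socket is not connected"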

@edgariscoding
Contributor Author

@hubertdeng123 I'm not sure. It seems like there was a RAM bottleneck along with a storage bottleneck; the Docker directory ballooned to over 60 GB. I increased the storage and RAM and reinstalled.

Now Sentry is logging errors, and I can see them come in... but the stats page shows that there were 32 errors and 32 of them were dropped.
[Screenshot: stats page showing 32 errors received and 32 dropped]

But if I look at the list of issues for this project for the last 7 days, there are about 350 pages of issues.

Errors are coming in, but Sentry isn't counting them and is treating them as dropped.
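
For anyone hitting the same disk problem, this is roughly how the usage can be tracked down. It is only a sketch and assumes a default Docker installation on Linux, with data under /var/lib/docker.

# Summary of space used by images, containers, and volumes.
docker system df

# Per-volume breakdown; the ClickHouse and Kafka volumes are usually the largest
# ones in a self-hosted Sentry install.
sudo du -sh /var/lib/docker/volumes/* | sort -h | tail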

@azaslavsky
Contributor

It's quite difficult to debug this remotely: Sentry knows that some errors didn't make it all the way through the pipeline, but that's really all it knows; otherwise they wouldn't be dropped errors. Usually these sorts of things are related to connection issues between the various containers (hence the dropping), memory limitations, or configuration at the orchestrator or cloud provider level.
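
A rough first pass at checking those three angles could look like the sketch below. The service names are taken from the stock self-hosted docker-compose.yml and may differ in your setup.

# Any containers restarting or marked unhealthy?
docker compose ps

# Per-container memory usage, to spot OOM pressure.
docker stats --no-stream

# Recent logs from the consumers that feed events into ClickHouse; dropped events
# often surface as errors here first (service names are assumptions).
docker compose logs --since 1h snuba-errors-consumer events-consumer | tail -n 100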

@edgariscoding
Contributor Author

@azaslavsky Do you know if there's a guide on how to rebuild/reinstall from scratch while retaining data like the projects themselves, user accounts, settings, etc.? I don't care if I lose all of the issues.

Running ./install.sh doesn't seem to be enough for me; I keep having issues.

@azaslavsky
Contributor

Yep, there is a backup/restore tool for exactly this use case: https://develop.sentry.dev/self-hosted/backup/#partial-json-backup
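
For reference, the partial JSON backup described on that page boils down to something like the following, run from the self-hosted install directory. The exact invocation can vary between releases, so double-check it against the docs for your version.

# Export users, organizations, projects, and settings to a JSON file.
docker compose run --rm -T -e SENTRY_LOG_LEVEL=CRITICAL web export > backup.json

# After a fresh ./install.sh, import it into the new instance.
docker compose run --rm -T web import < backup.json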

@csvan

csvan commented May 4, 2024

That is indeed interesting. I'm seeing Net Exception: Socket is not connected, Stack trace in your ClickHouse logs. Maybe your Sentry instance lost its connection there?

@hubertdeng123 I'm having this exact issue and getting absolutely spammed by the log lines you mention above:

clickhouse-1                                    | 2024.05.04 21:52:34.085404 [ 281 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    |
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x101540cd in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6fd5 in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
clickhouse-1                                    | 9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
clickhouse-1                                    | 10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

It is not clear to me at all why this started happening. Our instance has run for months without incident, and there have been no changes that I am aware of. What could cause Sentry to lose its connection to ClickHouse?
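
Some sanity checks that might help narrow it down, sketched under the assumption that the clickhouse and snuba-api service names and port 8123 match the default self-hosted compose file:

# Is ClickHouse itself up and accepting queries?
docker compose exec clickhouse clickhouse-client --query "SELECT 1"

# Can another container reach ClickHouse over its HTTP port?
# (assumes curl is available in the snuba-api image)
docker compose exec snuba-api curl -sS http://clickhouse:8123/ping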

@azaslavsky
Contributor

@csvan Have you updated your install recently?

@edgariscoding
Contributor Author

I'm not sure what happened, but after updating to version 24.4.2 everything SEEMS to be working fine; I no longer have 100% of errors dropped. I didn't change anything on our server.

