
Plunk API Fails Periodically - Self Hosted #114

Open · ejscheepers opened this issue Oct 16, 2024 · 11 comments

@ejscheepers (Contributor)

Every now and again the API fails and does not restart.

Server Logs:

node:internal/deps/undici/undici:13185
2024-10-16T04:29:01.190454265Z Error.captureStackTrace(err);
2024-10-16T04:29:01.190459825Z ^
2024-10-16T04:29:01.190463705Z
2024-10-16T04:29:01.190467185Z TypeError: fetch failed
2024-10-16T04:29:01.190470865Z at node:internal/deps/undici/undici:13185:13
2024-10-16T04:29:01.190474745Z at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
2024-10-16T04:29:01.190478825Z [cause]: AggregateError [ETIMEDOUT]:
2024-10-16T04:29:01.190482625Z at internalConnectMultiple (node:net:1122:18)
2024-10-16T04:29:01.190486185Z at internalConnectMultiple (node:net:1190:5)
2024-10-16T04:29:01.190489785Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190493465Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190498985Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190502665Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190506065Z [errors]: [
2024-10-16T04:29:01.190509545Z Error: connect ETIMEDOUT 188.114.97.3:443
2024-10-16T04:29:01.190513105Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190516705Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190520425Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190524025Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190527745Z errno: -110,
2024-10-16T04:29:01.190531145Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190534585Z syscall: 'connect',
2024-10-16T04:29:01.190538065Z address: '188.114.97.3',
2024-10-16T04:29:01.190542545Z port: 443
2024-10-16T04:29:01.190545865Z },
2024-10-16T04:29:01.190549545Z Error: connect ENETUNREACH 2a06:98c1:3121::3:443 - Local (:::0)
2024-10-16T04:29:01.190553745Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190558345Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190580945Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190585225Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190589025Z errno: -101,
2024-10-16T04:29:01.190594705Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190598105Z syscall: 'connect',
2024-10-16T04:29:01.190601545Z address: '2a06:98c1:3121::3',
2024-10-16T04:29:01.190605065Z port: 443
2024-10-16T04:29:01.190608985Z },
2024-10-16T04:29:01.190612345Z Error: connect ETIMEDOUT 188.114.96.3:443
2024-10-16T04:29:01.190616065Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190619665Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190623585Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190627105Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190630745Z errno: -110,
2024-10-16T04:29:01.190634105Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190637505Z syscall: 'connect',
2024-10-16T04:29:01.190640905Z address: '188.114.96.3',
2024-10-16T04:29:01.190644305Z port: 443
2024-10-16T04:29:01.190647745Z },
2024-10-16T04:29:01.190651065Z Error: connect ENETUNREACH 2a06:98c1:3120::3:443 - Local (:::0)
2024-10-16T04:29:01.190655825Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190659665Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190663545Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190667145Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190670745Z errno: -101,
2024-10-16T04:29:01.190674145Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190677785Z syscall: 'connect',
2024-10-16T04:29:01.190681225Z address: '2a06:98c1:3120::3',
2024-10-16T04:29:01.190684665Z port: 443
2024-10-16T04:29:01.190687986Z }
2024-10-16T04:29:01.190691346Z ]
2024-10-16T04:29:01.190695146Z }
2024-10-16T04:29:01.190698506Z }
2024-10-16T04:29:01.190701826Z
2024-10-16T04:29:01.190705146Z Node.js v22.9.0

If I restart the container, it starts working again.

@ejscheepers (Contributor, Author)

Not sure if it's related, but here are the logs from the Postgres DB:

2024-10-16T14:40:47.565962634Z 2024-10-16 14:40:47.565 UTC [884] FATAL: role "postgres" does not exist
2024-10-16T14:40:52.620820389Z 2024-10-16 14:40:52.618 UTC [891] FATAL: role "postgres" does not exist
2024-10-16T14:40:57.660249044Z 2024-10-16 14:40:57.660 UTC [898] FATAL: role "postgres" does not exist
2024-10-16T14:41:02.701285029Z 2024-10-16 14:41:02.701 UTC [906] FATAL: role "postgres" does not exist
2024-10-16T14:41:07.741504375Z 2024-10-16 14:41:07.741 UTC [913] FATAL: role "postgres" does not exist
2024-10-16T14:41:12.775925703Z 2024-10-16 14:41:12.775 UTC [920] FATAL: role "postgres" does not exist
2024-10-16T14:41:17.819197070Z 2024-10-16 14:41:17.817 UTC [928] FATAL: role "postgres" does not exist
2024-10-16T14:41:22.866831741Z 2024-10-16 14:41:22.866 UTC [935] FATAL: role "postgres" does not exist
2024-10-16T14:41:27.908494833Z 2024-10-16 14:41:27.908 UTC [942] FATAL: role "postgres" does not exist
2024-10-16T14:41:32.946391915Z 2024-10-16 14:41:32.946 UTC [949] FATAL: role "postgres" does not exist
2024-10-16T14:41:37.981018911Z 2024-10-16 14:41:37.980 UTC [956] FATAL: role "postgres" does not exist
2024-10-16T14:41:43.017404840Z 2024-10-16 14:41:43.017 UTC [963] FATAL: role "postgres" does not exist

@ejscheepers (Contributor, Author)

Just a bit more context:

  • I am self-hosting on Coolify
  • The domain is proxied by Cloudflare
  • I have whitelisted my server IP on the Cloudflare WAF (I thought it might be rate limiting)
  • My Docker Compose file is below:
version: '3'
services:
  plunk:
    image: driaug/plunk
    depends_on:
      postgresql:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      - SERVICE_FQDN_PLUNK_3000
      - 'REDIS_URL=redis://redis:6379'
      - 'DATABASE_URL=postgresql://${SERVICE_USER_POSTGRES}:${SERVICE_PASSWORD_POSTGRES}@postgresql/plunk?schema=public'
      - 'JWT_SECRET=${SERVICE_PASSWORD_JWT_SECRET}'
      - 'AWS_REGION=${AWS_REGION}'
      - 'AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}'
      - 'AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}'
      - 'AWS_SES_CONFIGURATION_SET=${AWS_SES_CONFIGURATION_SET}'
      - 'NEXT_PUBLIC_API_URI=${SERVICE_FQDN_PLUNK}/api'
      - 'APP_URI=${SERVICE_FQDN_PLUNK}'
      - 'API_URI=${SERVICE_FQDN_PLUNK}/api'
      - DISABLE_SIGNUPS=False
    entrypoint:
      - /app/entry.sh
    healthcheck:
      test:
        - CMD
        - wget
        - '-q'
        - '--spider'
        - 'http://127.0.0.1:3000'
      interval: 2s
      timeout: 10s
      retries: 15
  postgresql:
    image: 'postgres:16-alpine'
    environment:
      - POSTGRES_USER=$SERVICE_USER_POSTGRES
      - POSTGRES_PASSWORD=$SERVICE_PASSWORD_POSTGRES
      - 'POSTGRES_DB=${POSTGRES_DB:-plunk}'
    volumes:
      - 'postgresql-data:/var/lib/postgresql/data'
    healthcheck:
      test:
        - CMD-SHELL
        - 'pg_isready -U postgres -d postgres'
      interval: 5s
      timeout: 10s
      retries: 20
  redis:
    image: 'redis:7.4-alpine'
    volumes:
      - 'redis-data:/data'
    healthcheck:
      test:
        - CMD
        - redis-cli
        - PING
      interval: 5s
      timeout: 10s
      retries: 20
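
A side note on the Postgres FATAL logs above: the postgresql healthcheck runs `pg_isready -U postgres` every 5 seconds, but the container is created with `POSTGRES_USER=$SERVICE_USER_POSTGRES`, so a `postgres` role never exists - which matches the 5-second cadence of those log lines. A sketch of a healthcheck that probes the role that actually exists (assuming Coolify substitutes these variables at deploy time):

```yaml
  postgresql:
    healthcheck:
      test:
        - CMD-SHELL
        # check against the provisioned role instead of the default "postgres"
        - 'pg_isready -U ${SERVICE_USER_POSTGRES} -d ${POSTGRES_DB:-plunk}'
      interval: 5s
      timeout: 10s
      retries: 20
```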

@ejscheepers
Copy link
Contributor Author

Not sure if it's possible @driaug, but adding an API health route would be very useful in the meantime. If the container crashes, we could use it to trigger a restart.

At the moment I am using:

 (wget -S --spider http://127.0.0.1:3000/api/users/@me 2>&1 | grep -q 'HTTP/1.1 [1-4]') 

Before, I was only checking http://127.0.0.1:3000, but this would give a false positive if only the dashboard was running.
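
In Compose form, that check might look like this (a sketch reusing the timing values from the file above):

```yaml
    healthcheck:
      test:
        - CMD-SHELL
        # any 1xx-4xx response from the API counts as healthy
        - "wget -S --spider http://127.0.0.1:3000/api/users/@me 2>&1 | grep -q 'HTTP/1.1 [1-4]'"
      interval: 2s
      timeout: 10s
      retries: 15
```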

emreloper added a commit to emreloper/plunk that referenced this issue Oct 29, 2024
`PLUNK_API_URI` is a placeholder for `NEXT_PUBLIC_API_URI` inside the Dockerfile. The `API_URI` variable can be an internal URI like `http://plunk:3000`, while `NEXT_PUBLIC_API_URI` has to be public.

Separating the internal and public URI variables can also help with performance by avoiding network overhead for server-side requests.

This commit also solves issue useplunk#114.
@ardasevinc commented Nov 16, 2024

I second adding a healthcheck route. I'm using CapRover to deploy - here's my captain-definition/one-click-app file for reference.

I think the issue stems from IPv6. I added the env var NODE_OPTIONS=--dns-result-order=ipv4first. Currently testing this; no crashes yet.

edit: I can verify that the Node option above fixed this issue, please test. @ejscheepers @driaug
edit2: I have switched to the --no-network-family-autoselection Node option; the DNS result order flag didn't work. This is probably an issue with Node.js's Happy Eyeballs implementation.
edit3: Still going strong after a week, no crashes yet with the --no-network-family-autoselection arg.
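
Applied to the Compose file posted earlier in this thread, that flag would look something like this (a sketch; only the NODE_OPTIONS line is new):

```yaml
services:
  plunk:
    environment:
      # disable Node's Happy Eyeballs / network-family autoselection
      - NODE_OPTIONS=--no-network-family-autoselection
```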

@Code42Cate (Contributor)

@ardasevinc does the new healthcheck route work for you? It should be available in the latest version.

@ardasevinc

> @ardasevinc does the new healthcheck route work for you? It should be available in the latest version.

I don't think I can change the healthcheck in CapRover after it's been deployed. Will test later.

As far as this issue is concerned, I have solved it in my case with the --no-network-family-autoselection option for Node. The healthcheck itself is not related to the crashes, but it is very nice to have.

@rkcreation

I also get random errors with the two flags set in the NODE_OPTIONS env var:

--no-network-family-autoselection --dns-result-order=ipv4first (also tested each one separately)

2024-11-27T22:53:00.439408401Z node:internal/deps/undici/undici:13185
2024-11-27T22:53:00.439446827Z       Error.captureStackTrace(err);
2024-11-27T22:53:00.439457347Z             ^
2024-11-27T22:53:00.439459732Z 
2024-11-27T22:53:00.439461743Z TypeError: fetch failed
2024-11-27T22:53:00.439463823Z     at node:internal/deps/undici/undici:13185:13
2024-11-27T22:53:00.439465919Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
2024-11-27T22:53:00.439480844Z   [cause]: Error: connect ECONNREFUSED 127.0.1.1:443
2024-11-27T22:53:00.439482971Z       at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1607:16) {
2024-11-27T22:53:00.439485669Z     errno: -111,
2024-11-27T22:53:00.439488752Z     code: 'ECONNREFUSED',
2024-11-27T22:53:00.439504355Z     syscall: 'connect',
2024-11-27T22:53:00.439507075Z     address: '127.0.1.1',
2024-11-27T22:53:00.439509697Z     port: 443
2024-11-27T22:53:00.439512196Z   }
2024-11-27T22:53:00.439514461Z }


@ardasevinc

> I also get random errors with the two flags set in the NODE_OPTIONS env var:
>
> --no-network-family-autoselection --dns-result-order=ipv4first (also tested each one separately)
>
> [ECONNREFUSED 127.0.1.1:443 log quoted above]

That's odd. Have you tried recreating the container? Maybe force rebuilding.

I'm self-hosting Plunk via CapRover and the --no-network-family-autoselection option worked for me. This issue is definitely related to the Happy Eyeballs implementation in Node after v18. It could be fixed for good by changing the listen URI.

emreloper pushed a commit to emreloper/plunk that referenced this issue Dec 2, 2024 (same commit message as above).
@ardasevinc commented Dec 3, 2024

> I also get random errors with the two flags set in the NODE_OPTIONS env var:
>
> [ECONNREFUSED 127.0.1.1:443 log quoted above]

> That's odd. Have you tried recreating the container? Maybe force rebuilding.
>
> I'm self-hosting Plunk via CapRover and the --no-network-family-autoselection option worked for me. This issue is definitely related to the Happy Eyeballs implementation in Node after v18. It could be fixed for good by changing the listen URI.

This solution lasted 2-3 weeks. It seems the issue has to do with the volume of requests: when the Plunk API is used frequently, the failing period shortens. I did not get the ECONNREFUSED or ETIMEDOUT errors this time, though; it seems the last option I tried fixed those. Will investigate further.

2024-12-03T13:50:10.251439516Z node:internal/deps/undici/undici:13185
2024-12-03T13:50:10.251476036Z Error.captureStackTrace(err);
2024-12-03T13:50:10.251479836Z ^
2024-12-03T13:50:10.251482356Z
2024-12-03T13:50:10.251484716Z TypeError: fetch failed
2024-12-03T13:50:10.251487076Z at node:internal/deps/undici/undici:13185:13
2024-12-03T13:50:10.251489396Z at processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-12-03T13:50:10.251492036Z at runNextTicks (node:internal/process/task_queues:64:3)
2024-12-03T13:50:10.251494476Z at process.processImmediate (node:internal/timers:454:9) {
2024-12-03T13:50:10.251496916Z [cause]: ConnectTimeoutError: Connect Timeout Error
2024-12-03T13:50:10.251499276Z at onConnectTimeout (node:internal/deps/undici/undici:2331:28)
2024-12-03T13:50:10.251519516Z at node:internal/deps/undici/undici:2283:50
2024-12-03T13:50:10.251521716Z at Immediate._onImmediate (node:internal/deps/undici/undici:2315:13)
2024-12-03T13:50:10.251523556Z at process.processImmediate (node:internal/timers:483:21) {
2024-12-03T13:50:10.251525196Z code: 'UND_ERR_CONNECT_TIMEOUT'
2024-12-03T13:50:10.251526916Z }
2024-12-03T13:50:10.251528436Z }
2024-12-03T13:50:10.251529876Z
2024-12-03T13:50:10.251531356Z Node.js v20.18.0

related: https://undici.nodejs.org/#/?id=network-address-family-autoselection

@ardasevinc commented Dec 16, 2024

I've been trying another fix for ~2 weeks. It seems to be working for now, no crashes. The fix is:

  • use the VERCEL_UNDICI=1 env variable. More info in this comment; also check the related issue
  • use the new /health endpoint. I use it by hitting http://localhost:5000/api/health as a Docker service healthcheck. Since healthchecks execute inside the container, we have to use port 5000.

I haven't had any crashes or errors in the logs since I implemented these fixes. A Compose sketch combining both changes is below.
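
For reference, a minimal Compose sketch of both changes together (service and image names follow the Compose file earlier in the thread; the timing values are placeholders):

```yaml
services:
  plunk:
    image: driaug/plunk
    environment:
      # per the linked comment, opts fetch into Vercel's undici build
      - VERCEL_UNDICI=1
    healthcheck:
      test:
        - CMD
        - wget
        - '-q'
        - '--spider'
        # the API listens on port 5000 inside the container
        - 'http://localhost:5000/api/health'
      interval: 30s
      timeout: 10s
      retries: 5
```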

related: https://x.com/cramforce/status/1836415221683941556
