
Plunk API Fails Periodically - Self Hosted #114

Open · ejscheepers opened this issue Oct 16, 2024 · 11 comments

@ejscheepers (Contributor)

Every now and again the API fails and does not restart.

Server Logs:

node:internal/deps/undici/undici:13185
2024-10-16T04:29:01.190454265Z Error.captureStackTrace(err);
2024-10-16T04:29:01.190459825Z ^
2024-10-16T04:29:01.190463705Z
2024-10-16T04:29:01.190467185Z TypeError: fetch failed
2024-10-16T04:29:01.190470865Z at node:internal/deps/undici/undici:13185:13
2024-10-16T04:29:01.190474745Z at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
2024-10-16T04:29:01.190478825Z [cause]: AggregateError [ETIMEDOUT]:
2024-10-16T04:29:01.190482625Z at internalConnectMultiple (node:net:1122:18)
2024-10-16T04:29:01.190486185Z at internalConnectMultiple (node:net:1190:5)
2024-10-16T04:29:01.190489785Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190493465Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190498985Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190502665Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190506065Z [errors]: [
2024-10-16T04:29:01.190509545Z Error: connect ETIMEDOUT 188.114.97.3:443
2024-10-16T04:29:01.190513105Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190516705Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190520425Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190524025Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190527745Z errno: -110,
2024-10-16T04:29:01.190531145Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190534585Z syscall: 'connect',
2024-10-16T04:29:01.190538065Z address: '188.114.97.3',
2024-10-16T04:29:01.190542545Z port: 443
2024-10-16T04:29:01.190545865Z },
2024-10-16T04:29:01.190549545Z Error: connect ENETUNREACH 2a06:98c1:3121::3:443 - Local (:::0)
2024-10-16T04:29:01.190553745Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190558345Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190580945Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190585225Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190589025Z errno: -101,
2024-10-16T04:29:01.190594705Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190598105Z syscall: 'connect',
2024-10-16T04:29:01.190601545Z address: '2a06:98c1:3121::3',
2024-10-16T04:29:01.190605065Z port: 443
2024-10-16T04:29:01.190608985Z },
2024-10-16T04:29:01.190612345Z Error: connect ETIMEDOUT 188.114.96.3:443
2024-10-16T04:29:01.190616065Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190619665Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190623585Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190627105Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190630745Z errno: -110,
2024-10-16T04:29:01.190634105Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190637505Z syscall: 'connect',
2024-10-16T04:29:01.190640905Z address: '188.114.96.3',
2024-10-16T04:29:01.190644305Z port: 443
2024-10-16T04:29:01.190647745Z },
2024-10-16T04:29:01.190651065Z Error: connect ENETUNREACH 2a06:98c1:3120::3:443 - Local (:::0)
2024-10-16T04:29:01.190655825Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190659665Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190663545Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190667145Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190670745Z errno: -101,
2024-10-16T04:29:01.190674145Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190677785Z syscall: 'connect',
2024-10-16T04:29:01.190681225Z address: '2a06:98c1:3120::3',
2024-10-16T04:29:01.190684665Z port: 443
2024-10-16T04:29:01.190687986Z }
2024-10-16T04:29:01.190691346Z ]
2024-10-16T04:29:01.190695146Z }
2024-10-16T04:29:01.190698506Z }
2024-10-16T04:29:01.190701826Z
2024-10-16T04:29:01.190705146Z Node.js v22.9.0

If I restart the container, it starts working again.

@ejscheepers (Contributor, Author)

Not sure if it's related, but here are the logs from the Postgres DB:

2024-10-16T14:40:47.565962634Z 2024-10-16 14:40:47.565 UTC [884] FATAL: role "postgres" does not exist
2024-10-16T14:40:52.620820389Z 2024-10-16 14:40:52.618 UTC [891] FATAL: role "postgres" does not exist
2024-10-16T14:40:57.660249044Z 2024-10-16 14:40:57.660 UTC [898] FATAL: role "postgres" does not exist
2024-10-16T14:41:02.701285029Z 2024-10-16 14:41:02.701 UTC [906] FATAL: role "postgres" does not exist
2024-10-16T14:41:07.741504375Z 2024-10-16 14:41:07.741 UTC [913] FATAL: role "postgres" does not exist
2024-10-16T14:41:12.775925703Z 2024-10-16 14:41:12.775 UTC [920] FATAL: role "postgres" does not exist
2024-10-16T14:41:17.819197070Z 2024-10-16 14:41:17.817 UTC [928] FATAL: role "postgres" does not exist
2024-10-16T14:41:22.866831741Z 2024-10-16 14:41:22.866 UTC [935] FATAL: role "postgres" does not exist
2024-10-16T14:41:27.908494833Z 2024-10-16 14:41:27.908 UTC [942] FATAL: role "postgres" does not exist
2024-10-16T14:41:32.946391915Z 2024-10-16 14:41:32.946 UTC [949] FATAL: role "postgres" does not exist
2024-10-16T14:41:37.981018911Z 2024-10-16 14:41:37.980 UTC [956] FATAL: role "postgres" does not exist
2024-10-16T14:41:43.017404840Z 2024-10-16 14:41:43.017 UTC [963] FATAL: role "postgres" does not exist

@ejscheepers (Contributor, Author)

Just a bit more context:

  • I am self-hosting on Coolify
  • The domain is proxied by Cloudflare
  • I have whitelisted my server IP on the Cloudflare WAF (I thought it might be rate limiting)
  • My Docker Compose file is below:
version: '3'
services:
  plunk:
    image: driaug/plunk
    depends_on:
      postgresql:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      - SERVICE_FQDN_PLUNK_3000
      - 'REDIS_URL=redis://redis:6379'
      - 'DATABASE_URL=postgresql://${SERVICE_USER_POSTGRES}:${SERVICE_PASSWORD_POSTGRES}@postgresql/plunk?schema=public'
      - 'JWT_SECRET=${SERVICE_PASSWORD_JWT_SECRET}'
      - 'AWS_REGION=${AWS_REGION}'
      - 'AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}'
      - 'AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}'
      - 'AWS_SES_CONFIGURATION_SET=${AWS_SES_CONFIGURATION_SET}'
      - 'NEXT_PUBLIC_API_URI=${SERVICE_FQDN_PLUNK}/api'
      - 'APP_URI=${SERVICE_FQDN_PLUNK}'
      - 'API_URI=${SERVICE_FQDN_PLUNK}/api'
      - DISABLE_SIGNUPS=False
    entrypoint:
      - /app/entry.sh
    healthcheck:
      test:
        - CMD
        - wget
        - '-q'
        - '--spider'
        - 'http://127.0.0.1:3000'
      interval: 2s
      timeout: 10s
      retries: 15
  postgresql:
    image: 'postgres:16-alpine'
    environment:
      - POSTGRES_USER=$SERVICE_USER_POSTGRES
      - POSTGRES_PASSWORD=$SERVICE_PASSWORD_POSTGRES
      - 'POSTGRES_DB=${POSTGRES_DB:-plunk}'
    volumes:
      - 'postgresql-data:/var/lib/postgresql/data'
    healthcheck:
      test:
        - CMD-SHELL
        - 'pg_isready -U postgres -d postgres'
      interval: 5s
      timeout: 10s
      retries: 20
  redis:
    image: 'redis:7.4-alpine'
    volumes:
      - 'redis-data:/data'
    healthcheck:
      test:
        - CMD
        - redis-cli
        - PING
      interval: 5s
      timeout: 10s
      retries: 20
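
A side note on the Postgres FATAL logs above: the postgresql healthcheck runs `pg_isready -U postgres` every 5 seconds, but the container is created with `POSTGRES_USER=$SERVICE_USER_POSTGRES`, so a `postgres` role never exists - which matches the 5-second cadence of those log lines. A sketch of a healthcheck that probes the role that actually exists (assuming Coolify substitutes these variables at deploy time):

```yaml
  postgresql:
    healthcheck:
      test:
        - CMD-SHELL
        # check against the provisioned role instead of the default "postgres"
        - 'pg_isready -U ${SERVICE_USER_POSTGRES} -d ${POSTGRES_DB:-plunk}'
      interval: 5s
      timeout: 10s
      retries: 20
```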

@ejscheepers
Copy link
Contributor Author

Not sure if it's possible @driaug, but adding an API health route would be very useful in the meantime. If the container crashes, we could use it to trigger a restart.

At the moment I am using:

 (wget -S --spider http://127.0.0.1:3000/api/users/@me 2>&1 | grep -q 'HTTP/1.1 [1-4]') 

Before, I was only checking http://127.0.0.1:3000, but this would give a false positive if only the dashboard was running.
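
In Compose form, that check might look like this (a sketch reusing the timing values from the file above):

```yaml
    healthcheck:
      test:
        - CMD-SHELL
        # any 1xx-4xx response from the API counts as healthy
        - "wget -S --spider http://127.0.0.1:3000/api/users/@me 2>&1 | grep -q 'HTTP/1.1 [1-4]'"
      interval: 2s
      timeout: 10s
      retries: 15
```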

emreloper added a commit to emreloper/plunk that referenced this issue Oct 29, 2024
`PLUNK_API_URI` is a placeholder for `NEXT_PUBLIC_API_URI` inside the Dockerfile. The `API_URI` variable can be an internal URI like `http://plunk:3000`, while `NEXT_PUBLIC_API_URI` has to be public.

Separating the internal and public URI variables can also help with performance by avoiding network overhead for server-side requests.

This commit also solves issue useplunk#114.
@ardasevinc commented Nov 16, 2024

I second adding a healthcheck route. I'm using CapRover to deploy - here's my captain-definition/one-click-app file for reference.

I think the issue stems from IPv6. I added the env var NODE_OPTIONS=--dns-result-order=ipv4first. Currently testing this; no crashes yet.

edit: I can verify that the Node option above fixed this issue, please test. @ejscheepers @driaug
edit2: I have switched to the --no-network-family-autoselection Node option; the DNS result order flag didn't work. This is probably an issue with Node.js's Happy Eyeballs implementation.
edit3: Still going strong after a week, no crashes yet with the --no-network-family-autoselection arg.
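
Applied to the Compose file posted earlier in this thread, that flag would look something like this (a sketch; only the NODE_OPTIONS line is new):

```yaml
services:
  plunk:
    environment:
      # disable Node's Happy Eyeballs / network-family autoselection
      - NODE_OPTIONS=--no-network-family-autoselection
```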

@Code42Cate (Contributor)

@ardasevinc does the new healthcheck route work for you? It should be available in the latest version.

@ardasevinc

> @ardasevinc does the new healthcheck route work for you? It should be available in the latest version.

I don't think I can change the healthcheck in CapRover after it's been deployed. Will test later.

As far as this issue is concerned, I have solved it in my case with the --no-network-family-autoselection option for Node. The healthcheck itself is not related to the crashes, but it is very nice to have.

@rkcreation

I also get random errors with the two flags set in the NODE_OPTIONS env var:

--no-network-family-autoselection --dns-result-order=ipv4first (also tested each one separately)

2024-11-27T22:53:00.439408401Z node:internal/deps/undici/undici:13185
2024-11-27T22:53:00.439446827Z       Error.captureStackTrace(err);
2024-11-27T22:53:00.439457347Z             ^
2024-11-27T22:53:00.439459732Z 
2024-11-27T22:53:00.439461743Z TypeError: fetch failed
2024-11-27T22:53:00.439463823Z     at node:internal/deps/undici/undici:13185:13
2024-11-27T22:53:00.439465919Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
2024-11-27T22:53:00.439480844Z   [cause]: Error: connect ECONNREFUSED 127.0.1.1:443
2024-11-27T22:53:00.439482971Z       at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1607:16) {
2024-11-27T22:53:00.439485669Z     errno: -111,
2024-11-27T22:53:00.439488752Z     code: 'ECONNREFUSED',
2024-11-27T22:53:00.439504355Z     syscall: 'connect',
2024-11-27T22:53:00.439507075Z     address: '127.0.1.1',
2024-11-27T22:53:00.439509697Z     port: 443
2024-11-27T22:53:00.439512196Z   }
2024-11-27T22:53:00.439514461Z }


@ardasevinc

> I also get random errors with the two flags set in the NODE_OPTIONS env var:
>
> --no-network-family-autoselection --dns-result-order=ipv4first (also tested each one separately)
>
> [ECONNREFUSED 127.0.1.1:443 log quoted above]

That's odd. Have you tried recreating the container? Maybe force rebuilding.

I'm self-hosting Plunk via CapRover and the --no-network-family-autoselection option worked for me. This issue is definitely related to the Happy Eyeballs implementation in Node after v18. It could be fixed for good by changing the listen URI.

emreloper pushed a commit to emreloper/plunk that referenced this issue Dec 2, 2024 (same commit message as above).
@ardasevinc commented Dec 3, 2024

> I also get random errors with the two flags set in the NODE_OPTIONS env var:
>
> [ECONNREFUSED 127.0.1.1:443 log quoted above]

> That's odd. Have you tried recreating the container? Maybe force rebuilding.
>
> I'm self-hosting Plunk via CapRover and the --no-network-family-autoselection option worked for me. This issue is definitely related to the Happy Eyeballs implementation in Node after v18. It could be fixed for good by changing the listen URI.

This solution lasted 2-3 weeks. It seems the issue has to do with the volume of requests: when the Plunk API is used frequently, the failing period shortens. I did not get the ECONNREFUSED or ETIMEDOUT errors this time, though; it seems the last option I tried fixed those. Will investigate further.

2024-12-03T13:50:10.251439516Z node:internal/deps/undici/undici:13185
2024-12-03T13:50:10.251476036Z Error.captureStackTrace(err);
2024-12-03T13:50:10.251479836Z ^
2024-12-03T13:50:10.251482356Z
2024-12-03T13:50:10.251484716Z TypeError: fetch failed
2024-12-03T13:50:10.251487076Z at node:internal/deps/undici/undici:13185:13
2024-12-03T13:50:10.251489396Z at processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-12-03T13:50:10.251492036Z at runNextTicks (node:internal/process/task_queues:64:3)
2024-12-03T13:50:10.251494476Z at process.processImmediate (node:internal/timers:454:9) {
2024-12-03T13:50:10.251496916Z [cause]: ConnectTimeoutError: Connect Timeout Error
2024-12-03T13:50:10.251499276Z at onConnectTimeout (node:internal/deps/undici/undici:2331:28)
2024-12-03T13:50:10.251519516Z at node:internal/deps/undici/undici:2283:50
2024-12-03T13:50:10.251521716Z at Immediate._onImmediate (node:internal/deps/undici/undici:2315:13)
2024-12-03T13:50:10.251523556Z at process.processImmediate (node:internal/timers:483:21) {
2024-12-03T13:50:10.251525196Z code: 'UND_ERR_CONNECT_TIMEOUT'
2024-12-03T13:50:10.251526916Z }
2024-12-03T13:50:10.251528436Z }
2024-12-03T13:50:10.251529876Z
2024-12-03T13:50:10.251531356Z Node.js v20.18.0

related: https://undici.nodejs.org/#/?id=network-address-family-autoselection

@ardasevinc commented Dec 16, 2024

I've been trying another fix for ~2 weeks. It seems to be working for now, no crashes. The fix is:

  • use the VERCEL_UNDICI=1 env variable. More info in this comment; also check the related issue
  • use the new /health endpoint. I use it by hitting http://localhost:5000/api/health as a Docker service healthcheck. Since healthchecks execute inside the container, we have to use port 5000.

I haven't had any crashes or errors in the logs since I implemented these fixes. A Compose sketch combining both changes is below.
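
For reference, a minimal Compose sketch of both changes together (service and image names follow the Compose file earlier in the thread; the timing values are placeholders):

```yaml
services:
  plunk:
    image: driaug/plunk
    environment:
      # per the linked comment, opts fetch into Vercel's undici build
      - VERCEL_UNDICI=1
    healthcheck:
      test:
        - CMD
        - wget
        - '-q'
        - '--spider'
        # the API listens on port 5000 inside the container
        - 'http://localhost:5000/api/health'
      interval: 30s
      timeout: 10s
      retries: 5
```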

related: https://x.com/cramforce/status/1836415221683941556
