Frequent "Transport endpoint is not connected" across nodes #4435

Open
eeraser710 opened this issue Dec 4, 2024 · 2 comments

eeraser710 commented Dec 4, 2024

Overview:
We originally had three nodes in the Gluster pool. Due to a hardware issue, node3 crashed and could not be recovered, so we added a fourth node to the pool. Healing initially worked as expected, but after a couple of days we started getting "Transport endpoint is not connected" randomly across the available nodes (node1, node2 and node4). To keep the application working we have to stop the application, unmount the share, remount it, and start the application again.

We are now hitting the issue more and more frequently and cannot figure out the exact cause. We need help finding and fixing the underlying issue.

The main issues we are observing are:

  1. Frequent "Transport endpoint is not connected" errors across all available nodes (node1, node2, node4)
  2. Very slow healing of the node4 brick
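
For reference, the recovery described above amounts to roughly the following sequence (mount point, volume name and server are taken from the samples below; the lazy-unmount flag is an assumption for the case where the FUSE mount is already hung, and any extra mount options are omitted):

# umount -l /srv/atlassian-shared-data
# mount -t glusterfs node2:/ocistack /srv/atlassian-shared-data
# df -h /srv/atlassian-shared-data

The application is stopped before the unmount and started again only after df confirms the mount is healthy.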

gluster peer status (from node2)

Number of Peers: 3

Hostname: instance-03
Uuid: 15269a81-be4d-46d6-83d9-6bcc4d612833
State: Peer in Cluster (Disconnected)

Hostname: instance-01
Uuid: 1e32d4d8-e0ac-4902-8bdb-0609945eca45
State: Peer in Cluster (Connected)

Hostname: instance-02
Uuid: 3ec8e252-ef7e-454a-932c-7d62c756e5b3
State: Peer in Cluster (Connected)

Healthy gluster mount point (sample)

# df -kh /srv/atlassian-shared-data
Filesystem       Size  Used Avail Use% Mounted on
node2:/ocistack   15T  2.6T   12T  18% /srv/atlassian-shared-data

df output (when the issue is encountered)

# df -h
df: '/srv/atlassian-shared-data': Transport endpoint is not connected

Messages in the logs when the issue is encountered

[2024-12-03 16:37:18.801483] I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 13-dict: key 'trusted.afr.ocistack-client-0' would not be sent on wire in the future [Invalid argument]
[2024-12-03 16:37:18.801685] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/6dev/xlator/performance/open-behind.so(+0x3d6c) [0x7f2651dded6c] -->/usr/lib64/glusterfs/6dev/xlator/performance/open-behind.so(+0x3bc6) [0x7f2651ddebc6] -->/lib64/libglusterfs.so.0(dict_ref+0x5d) [0x7f2660cc4c8d] ) 13-dict: dict is NULL [Invalid argument]
[2024-12-03 16:37:18.802191] W [fd-lk.c:90:fd_lk_ctx_ref] (-->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x6d049) [0x7f2653002049] -->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x3b76b) [0x7f2652fd076b] -->/lib64/libglusterfs.so.0(fd_lk_ctx_ref+0x5d) [0x7f2660d2f3bd] ) 13-fd-lk: invalid argument [Invalid argument]
[2024-12-03 16:37:17.207431] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 13-epoll: Failed to dispatch handler
[2024-12-03 16:37:18.801480] I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 13-dict: key 'trusted.afr.ocistack-client-3' would not be sent on wire in the future [Invalid argument]
The message "I [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 13-dict: key 'trusted.afr.ocistack-client-2' would not be sent on wire in the future [Invalid argument]" repeated 2 times between [2024-12-03 16:37:18.801303] and [2024-12-03 16:37:18.801481]
pending frames:
frame : type(1) op(FSTAT)
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(1) op(RENAME)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
[2024-12-03 16:37:18.802480] W [fd-lk.c:90:fd_lk_ctx_ref] (-->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x6d049) [0x7f2653002049] -->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x3b76b) [0x7f2652fd076b] -->/lib64/libglusterfs.so.0(fd_lk_ctx_ref+0x5d) [0x7f2660d2f3bd] ) 13-fd-lk: invalid argument [Invalid argument]
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
[2024-12-03 16:37:18.802564] W [fd-lk.c:90:fd_lk_ctx_ref] (-->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x6d049) [0x7f2653002049] -->/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x3b76b) [0x7f2652fd076b] -->/lib64/libglusterfs.so.0(fd_lk_ctx_ref+0x5d) [0x7f2660d2f3bd] ) 13-fd-lk: invalid argument [Invalid argument]
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2024-12-03 16:37:18
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6dev
pending frames:
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(1) op(RENAME)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2024-12-03 16:37:18
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6dev
/lib64/libglusterfs.so.0(+0x26fb0)[0x7f2660cd0fb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f2660cdb364]
/lib64/libc.so.6(+0x36400)[0x7f265f331400]
/lib64/libglusterfs.so.0(__gf_realloc+0x3b)[0x7f2660cfbbdb]
/lib64/libglusterfs.so.0(__fd_ctx_set+0xa6)[0x7f2660cfa396]
/lib64/libglusterfs.so.0(fd_ctx_set+0x4c)[0x7f2660cfa43c]
/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x16e6e)[0x7f2652fabe6e]
/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x3b7a8)[0x7f2652fd07a8]
/usr/lib64/glusterfs/6dev/xlator/protocol/client.so(+0x6d049)[0x7f2653002049]
/lib64/libgfrpc.so.0(+0xec80)[0x7f2660a9cc80]
/lib64/libgfrpc.so.0(+0xf053)[0x7f2660a9d053]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f2660a98f23]
/usr/lib64/glusterfs/6dev/rpc-transport/socket.so(+0xa3db)[0x7f26552e33db]
/lib64/libglusterfs.so.0(+0x8afe9)[0x7f2660d34fe9]
/lib64/libpthread.so.0(+0x8105)[0x7f265fb34105]
/lib64/libc.so.6(clone+0x6d)[0x7f265f3f9b2d]
---------
/lib64/libglusterfs.so.0(+0x26fb0)[0x7f2660cd0fb0]

Mandatory info:
- The output of the gluster volume info command:

instance-02# gluster volume info

Volume Name: ocistack
Type: Replicate
Volume ID: e1ebb04b-db38-46f2-a89e-0911d32a3f8f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: instance-01:/srv/glusterfs/brick1/ocistack
Brick2: instance-02:/srv/glusterfs/brick1/ocistack
Brick3: instance-03:/srv/glusterfs/brick1/ocistack
Brick4: instance-04:/srv/glusterfs/brick1/ocistack
Options Reconfigured:
diagnostics.brick-log-level: ERROR
performance.least-prio-threads: 16
cluster.shd-wait-qlength: 2048
cluster.shd-max-threads: 16
cluster.heal-wait-queue-length: 128
cluster.self-heal-readdir-size: 4KB
cluster.data-self-heal-algorithm: full
cluster.self-heal-window-size: 32
cluster.heal-timeout: 300
cluster.background-self-heal-count: 16
client.event-threads: 32
server.event-threads: 32
cluster.readdir-optimize: on
performance.io-thread-count: 16
cluster.lookup-optimize: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

- The output of the gluster volume status command:

instance-02# gluster volume status

Status of volume: ocistack
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick instance-01:/srv/glusterfs/b
rick1/ocistack                              49152     0          Y       64071
Brick instance-02:/srv/glusterfs/b
rick1/ocistack                              49153     0          Y       36552
Brick instance-04:/srv/glusterfs/b
rick1/ocistack                              49153     0          Y       2537
Self-heal Daemon on localhost               N/A       N/A        Y       36579
Self-heal Daemon on instance-01    N/A       N/A        Y       16056
Self-heal Daemon on instance-04    N/A       N/A        Y       2550

Task Status of Volume ocistack
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:
Taking very long to complete, probably because there are still a lot of files to be healed.
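
For a quicker read on the backlog than the full `heal info` listing, the per-brick pending counts can be checked; a minimal sketch using the standard gluster CLI (volume name taken from this report):

# gluster volume heal ocistack info summary
# gluster volume heal ocistack statistics heal-count

If the counts keep growing rather than shrinking, that points at the heal not keeping up rather than just a large initial backlog.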

- Provide logs present on following locations of client and server nodes -
Attached: [logs.tar.gz](https://github.com/user-attachments/files/18005138/logs.tar.gz)

- Is there any crash? Provide the backtrace and coredump:
Yes.
NOTE: the coredump file is too big to upload; providing the backtrace output below.

gdb core.43609
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.0.3.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
[New LWP 43639]
[New LWP 43679]
[New LWP 43615]
[New LWP 43609]
[New LWP 43613]
[New LWP 43610]
[New LWP 43614]
[New LWP 43645]
[New LWP 43649]
[New LWP 43646]
[New LWP 43647]
[New LWP 43648]
[New LWP 43651]
[New LWP 43653]
[New LWP 43634]
[New LWP 43652]
[New LWP 43654]
[New LWP 43656]
[New LWP 43664]
[New LWP 43662]
[New LWP 43658]
[New LWP 43668]
[New LWP 43675]
[New LWP 43660]
[New LWP 43665]
[New LWP 43715]
[New LWP 43672]
[New LWP 43655]
[New LWP 90735]
[New LWP 43670]
[New LWP 43657]
[New LWP 43673]
[New LWP 43666]
[New LWP 43678]
[New LWP 43676]
[New LWP 15031]
[New LWP 45590]
[New LWP 14745]
[New LWP 43677]
[New LWP 43671]
[New LWP 4668]
[New LWP 43716]
[New LWP 43612]
[New LWP 43667]
[New LWP 43681]
[New LWP 43682]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/7e/5af82b8ddad1a975a6fb0399b08f959d41e6c9
Core was generated by `/usr/sbin/glusterfs --fuse-mountopts=noatime --process-name fuse --volfile-serv'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f6f77619f90 in ?? ()
"/tmp/dumps_n_logs/core.43609" is a core file.
Please specify an executable to debug.
(gdb) bt
#0 0x00007f6f77619f90 in ?? ()
#1 0x00007f6f698c2215 in ?? ()
#2 0x00007f6f5cd6db10 in ?? ()
#3 0x00007f6f69aced30 in ?? ()
#4 0x00007f6f5ce8c218 in ?? ()
#5 0x00007f6f5cd6db10 in ?? ()
#6 0x00007f6f78b4dbc0 in ?? ()
#7 0x00007f6eca532d58 in ?? ()
#8 0x00007f6f78b4dc40 in ?? ()
#9 0x00007f6f64020c00 in ?? ()
#10 0x00007f6ebcfc8c78 in ?? ()
#11 0x00007f6f698c2b39 in ?? ()
#12 0x00007f6f78a9dd08 in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x00007f6ebcfc8cf8 in ?? ()
#15 0x00007f6f5ce8c218 in ?? ()
#16 0x00007f6f78b4dc40 in ?? ()
#17 0x00007f6f698c2cbe in ?? ()
#18 0x00007f6f5cd6db10 in ?? ()
#19 0x00007f6f788345c2 in ?? ()
#20 0x00007f6eca53ee20 in ?? ()
#21 0x00007f6f64020c00 in ?? ()
#22 0x00007f6f5cd6db90 in ?? ()
#23 0x00007f6f5cd6db90 in ?? ()
#24 0x0000000000000028 in ?? ()
#25 0xb2a4d64e6f246100 in ?? ()
#26 0x0000557a2200d240 in ?? ()
#27 0x00007f6f5ce8e448 in ?? ()
#28 0x00007f6eca5401c8 in ?? ()
#29 0x00007f6f64020c00 in ?? ()
#30 0x00007f6f698c2e50 in ?? ()
#31 0x00007f6f5ce8e448 in ?? ()
#32 0x00007f6f5d19df38 in ?? ()
#33 0x00007f6f698c2e83 in ?? ()
#34 0x00007f6eca5401c8 in ?? ()
#35 0x00007f6f76ef3aad in ?? ()
#36 0x00007f6f696a1c80 in ?? ()
#37 0x00007f6f787e9e25 in ?? ()
#38 0x00007f6eca5469e8 in ?? ()
#39 0x00007f6f787e0bbb in ?? ()
#40 0x00007f6eca53ee08 in ?? ()
#41 0x00007f6f64022a30 in ?? ()
#42 0x00007f6eca5401c8 in ?? ()
#43 0x00007f6f64022a30 in ?? ()
#44 0x00007f6eca5401c8 in ?? ()
#45 0x00007f6f64022a30 in ?? ()
#46 0x00007f6f698c2e50 in ?? ()
#47 0x00007f6f696a1e9e in ?? ()
#48 0x00007f6f00000000 in ?? ()
#49 0x0000000000000000 in ?? ()
(gdb)
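
Note that the frames above are all unresolved because gdb was opened on the core file alone, without the executable or debuginfo (gdb says so itself: "Please specify an executable to debug"). A minimal sketch for getting a symbolized backtrace, reusing the paths gdb already reports; the debuginfo install line is the one gdb itself suggests above:

# yum --enablerepo='debug' install /usr/lib/debug/.build-id/7e/5af82b8ddad1a975a6fb0399b08f959d41e6c9
# gdb /usr/sbin/glusterfs /tmp/dumps_n_logs/core.43609
(gdb) thread apply all bt

With symbols loaded, the backtrace should name the crashing function instead of raw addresses; it may line up with the trace printed in the client log above (__gf_realloc via fd_ctx_set in protocol/client).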

- The operating system / glusterfs version:

Oracle Linux Server 7.9
glusterfs 6dev

RPMs:
glusterfs-cli-6dev-0.163.gitbd4d8b1.el7.x86_64
glusterfs-fuse-6dev-0.163.gitbd4d8b1.el7.x86_64
glusterfs-6dev-0.163.gitbd4d8b1.el7.x86_64
glusterfs-client-xlators-6dev-0.163.gitbd4d8b1.el7.x86_64
glusterfs-server-6dev-0.163.gitbd4d8b1.el7.x86_64
glusterfs-libs-6dev-0.163.gitbd4d8b1.el7.x86_64
gluster_mon-1.1-1.x86_64
glusterfs-api-6dev-0.163.gitbd4d8b1.el7.x86_64
@anon314159

Silly question, have you removed the failed brick from the volume and peer from the trusted pool?
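
(For reference, a minimal sketch of what that cleanup usually looks like on a pure replica volume, using the brick path from the volume info above; the target replica count must match the intended layout, and detach may need force because the peer is down:)

# gluster volume remove-brick ocistack replica 3 instance-03:/srv/glusterfs/brick1/ocistack force
# gluster peer detach instance-03 force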

@eeraser710
Author

Yes, we removed node3 completely from the pool.

The current setup looks like below. However, we observed a delay in accessing the shared path after adding the node4 brick to the cluster, and when we stopped Gluster on node4 the delay was gone.

We are not sure what could be causing this; any help in figuring out the issue would be appreciated.

# gluster volume info

Volume Name: ocistack
Type: Replicate
Volume ID: e1ebb04b-db38-46f2-a89e-0911d32a3f8f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: instance-01:/srv/glusterfs/brick1/ocistack
Brick2: instance-02:/srv/glusterfs/brick1/ocistack
Brick3: instance-04:/srv/glusterfs/brick1/ocistack
Options Reconfigured:
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
performance.io-thread-count: 16
cluster.readdir-optimize: on
server.event-threads: 32
client.event-threads: 32
cluster.background-self-heal-count: 16
cluster.heal-timeout: 5
cluster.self-heal-window-size: 32
cluster.data-self-heal-algorithm: full
cluster.self-heal-readdir-size: 4KB
cluster.heal-wait-queue-length: 128
cluster.shd-max-threads: 16
cluster.shd-wait-qlength: 2048
performance.least-prio-threads: 16
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
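
For reference, one hedged way to narrow down where the extra latency with node4 comes from is per-brick profiling with the standard gluster CLI (volume name taken from this report; profiling adds some overhead, so stop it after sampling):

# gluster volume profile ocistack start
# gluster volume profile ocistack info
# gluster volume profile ocistack stop

The per-brick latency and FOP counts in the profile output should show whether requests hitting the node4 brick are the slow ones.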
