Created attachment 2081029 [details]
coredump file

Description of problem:
Create a 2-node HA cluster based on pacemaker/corosync, using redundant links with transport knet, link_mode=active, and transport=sctp on both links. Create the cluster with one node, then add a second node; corosync on the first node crashes.

Version-Release number of selected component (if applicable):
OS: Red Hat Enterprise Linux release 9.5 (Plow)
corosync: Corosync Cluster Engine, version '3.1.8'
Copyright (c) 2006-2021 Red Hat, Inc.
Built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Available crypto models: nss openssl
Available compression models: zlib lz4 lz4hc lzo2 lzma bzip2 zstd

How reproducible:

Steps to Reproduce:
1. pcs host auth node0 -u hacluster -p password
2. pcs cluster setup virtstack node0 addr=169.254.88.30 addr=192.168.0.50 transport knet link_mode=active link linknumber=0 transport=sctp link linknumber=1 transport=sctp
3. pcs cluster start --all
4. pcs property set stonith-enabled=false
5. crm_verify -L -V
6. pcs resource defaults resource-stickiness=100
7. pcs resource op defaults timeout=240s

Add the second node:
8. pcs host auth node1 -u hacluster -p password
9. pcs cluster node add node1 addr=169.254.88.31 addr=192.168.0.51 --start --wait=120

Actual results:
corosync on node0 dumps core:

systemctl status corosync.service
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; preset: disabled)
     Active: failed (Result: core-dump) since Thu 2025-03-20 17:41:39 CST; 25s ago
   Duration: 1min 30.458s
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 2293 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=dumped, signal=SEGV)
   Main PID: 2293 (code=dumped, signal=SEGV)
        CPU: 866ms

Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] host: host: 2 has 0 active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] host: host: 2 has 0 active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET ] pmtud: MTU manually set to: 0
Mar 20 17:41:37 node0 corosync[2293]:   [KNET ] sctp: SCTP getsockopt() on connecting socket 37 failed: Bad file descriptor
Mar 20 17:41:39 node0 systemd-coredump[2455]: Process 2293 (corosync) of user 0 dumped core.
Stack trace of thread 2296:
#0  0x00007f70658e3b0c _reconnect_socket (libknet.so.1 + 0x14b0c)
#1  0x00007f70658e6e18 _sctp_connect_thread (libknet.so.1 + 0x17e18)
#2  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
#3  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)

Stack trace of thread 2293:
#0  0x00007f706570de3e epoll_wait (libc.so.6 + 0x10de3e)
#1  0x00007f70659f6240 _poll_and_add_to_jobs_ (libqb.so.100 + 0x1e240)
#2  0x00007f70659e85f7 qb_loop_run (libqb.so.100 + 0x105f7)
#3  0x00005624d380c3e3 main (corosync + 0xb3e3)
#4  0x00007f70656295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
#5  0x00007f7065629680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
#6  0x00005624d380d0e5 _start (corosync + 0xc0e5)

Stack trace of thread 2299:
#0  0x00007f706570de3e epoll_wait (libc.so.6 + 0x10de3e)
#1  0x00007f70658e5e98 _handle_send_to_links_thread (libknet.so.1 + 0x16e98)
#2  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
#3  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)

Stack trace of thread 2297:
#0  0x00007f70656d3c35 clock_nanosleep.5 (libc.so.6 + 0xd3c35)
#1  0x00007f70656d8847 __nanosleep (libc.so.6 + 0xd8847)
#2  0x00007f70657046e9 usleep (libc.so.6 + 0x1046e9)
#3  0x00007f70658d7eca _handle_pmtud_link_thread (libknet.so.1 + 0x8eca)
#4  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
#5  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)

Stack trace of thread 2298:
#0  0x00007f7065686839 __futex_abstimed_wait_common (libc.so.6 + 0x86839)
#1  0x00007f706568f30b pthread_rwlock_rdlock.5 (libc.so.6 + 0x8f30b)
#2  0x00007f70658d7db0 shutdown_in_progress (libknet.so.1 + 0x8db0)
#3  0x00007f70658de00e _handle_dst_link_handler_thread (libknet.so.1 + 0xf00e)
#4  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)

Expected results:
Adding the second node succeeds.

Additional info:
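For reference, the totem/nodelist part of the corosync.conf that pcs generates for this setup should look roughly like the excerpt below. It is reconstructed from the setup command in step 2 rather than copied from the affected node, so treat the exact option names and values as an approximation (crypto settings omitted):

totem {
    version: 2
    cluster_name: virtstack
    transport: knet
    link_mode: active

    interface {
        linknumber: 0
        knet_transport: sctp
    }
    interface {
        linknumber: 1
        knet_transport: sctp
    }
}

nodelist {
    node {
        ring0_addr: 169.254.88.30
        ring1_addr: 192.168.0.50
        name: node0
        nodeid: 1
    }
}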
I've found a simple way to work around this: sleep 5 between steps 8 and 9, then run step 9 to add and start node1; with that delay, node0's corosync process does not crash. I think this is a corosync bug: the corosync process may still be loading and applying the new configuration file when node1 joins.
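Concretely, the sequence that avoids the crash for me is roughly:

pcs host auth node1 -u hacluster -p password
sleep 5    # give node0's corosync time to settle before the new node is added and started
pcs cluster node add node1 addr=169.254.88.31 addr=192.168.0.51 --start --wait=120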
I don't know why the sleep helps in your case, but the fundamental problem (when I try to reproduce it, at least) is that the sctp kernel module is not being loaded. If you do modprobe sctp on the nodes first, then the cluster forms correctly because the sockets can be created.
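On a RHEL 9 node that should just be something like:

modprobe sctp                              # load the SCTP module now
echo sctp > /etc/modules-load.d/sctp.conf  # and have it loaded again on every boot
lsmod | grep sctp                          # confirm it is present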
[root@node0 ~]# lsmod|grep sctp
sctp                  544768  4
ip6_udp_tunnel         16384  1 sctp
udp_tunnel             36864  1 sctp
libcrc32c              12288  2 xfs,sctp

[root@node0 ~]# cat /etc/modules-load.d/sctp.conf
sctp

[root@node0 ~]# cat /etc/modprobe.d/sctp-blacklist.conf
# This kernel module can be automatically loaded by non-root users. To
# enhance system security, the module is blacklisted by default to ensure
# system administrators make the module available for use as needed.
# See https://access.redhat.com/articles/3760101 for more details.
#
# Remove the blacklist by adding a comment # at the start of the line.
#blacklist sctp

[root@node0 ~]# cat /etc/modprobe.d/sctp_diag-blacklist.conf
# This kernel module can be automatically loaded by non-root users. To
# enhance system security, the module is blacklisted by default to ensure
# system administrators make the module available for use as needed.
# See https://access.redhat.com/articles/3760101 for more details.
#
# Remove the blacklist by adding a comment # at the start of the line.
#blacklist sctp_diag
Of course, this may be a special case: the 'sleep 5' is not needed on other platforms, for example VMware Workstation/ESXi, KVM, and Huawei/Supermicro/Dell servers; the crash only occurs on the Advantech server.
[root@node0 ~]# pcs host auth node0 -u hacluster -p VirtStack@1302 && pcs cluster setup virtstack node0 addr=169.254.88.30 addr=192.168.0.50 transport knet link_mode=active link linknumber=0 transport=sctp link linknumber=1 transport=sctp && pcs cluster start --all && pcs property set stonith-enabled=false && crm_verify -L -V && pcs resource defaults resource-stickiness=100 && pcs resource op defaults timeout=240s
node0: Authorized
Destroying cluster on hosts: 'node0'...
node0: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node0'
node0: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node0'
node0: successful distribution of the file 'corosync authkey'
node0: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node0'
node0: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
node0: Starting Cluster...
Deprecation Warning: This command is deprecated and will be removed. Please use 'pcs resource defaults update' instead.
Warning: Defaults do not apply to resources which override them with their own defined values
Deprecation Warning: This command is deprecated and will be removed. Please use 'pcs resource op defaults update' instead.
Warning: Defaults do not apply to resources which override them with their own defined values

[root@node0 ~]# pcs host auth node1 -u hacluster -p VirtStack@1302 && pcs cluster node add node1 addr=169.254.88.31 addr=192.168.0.51 && pcs cluster start node1 && sleep 5 && systemctl status corosync.service
node1: Authorized
Disabling sbd...
node1: sbd disabled
Sending 'corosync authkey', 'pacemaker authkey' to 'node1'
node1: successful distribution of the file 'corosync authkey'
node1: successful distribution of the file 'pacemaker authkey'
Sending updated corosync.conf to nodes...
node0: Succeeded
node1: Succeeded
node0: Corosync configuration reloaded
node1: Starting Cluster...
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; preset: disabled)
     Active: failed (Result: core-dump) since Fri 2025-03-28 20:16:29 CST; 4s ago
   Duration: 32.242s
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 4278 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=dumped, signal=SEGV)
   Main PID: 4278 (code=dumped, signal=SEGV)

Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] host: host: 2 has 0 active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] host: host: 2 has 0 active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET ] pmtud: MTU manually set to: 0
Mar 28 20:16:26 node0 corosync[4278]:   [KNET ] sctp: SCTP getsockopt() on connecting socket 37 failed: Bad file descriptor
Mar 28 20:16:27 node0 corosync[4278]:   [KNET ] sctp: SCTP getsockopt() on connecting socket 39 failed: Bad file descriptor
Mar 28 20:16:29 node0 systemd[1]: corosync.service: Main process exited, code=dumped, status=11/SEGV
Mar 28 20:16:29 node0 systemd[1]: corosync.service: Failed with result 'core-dump'.
If you need any other info, please tell me.
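The attached core was captured by systemd-coredump, so if a fuller backtrace would help I believe it can also be pulled straight from the journal on node0, e.g.:

coredumpctl list corosync
coredumpctl info corosync    # metadata plus the short backtrace
coredumpctl gdb corosync     # then 'thread apply all bt full' (needs debuginfo packages)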
With the sctp module loaded on both nodes I can't reproduce this. Are there any messages from the kernel when this happens? Also, can you post the rpm versions of the kernel, libknet1 and corosync, please?
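For example, something like this on node0, using the timestamps from your last crash:

rpm -q kernel corosync libknet1
journalctl -k --since "2025-03-28 20:15" --until "2025-03-28 20:17"   # kernel messages around the crash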