Bug 2353527 - corosync coredump with transport=sctp while add node
Summary: corosync coredump with transport=sctp while add node
Keywords:
Status: NEW
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: corosync-epel
Version: epel9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Davide Cavalca
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2025-03-20 09:58 UTC by tangyla
Modified: 2025-06-04 11:55 UTC
CC: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments (Terms of Use)
coredump file (322.66 KB, application/octet-stream)
2025-03-20 09:58 UTC, tangyla

Description tangyla 2025-03-20 09:58:33 UTC
Created attachment 2081029 [details]
coredump file

Description of problem:

Create a 2-node HA cluster based on pacemaker/corosync, using redundant links with transport knet, link_mode=active, and transport=sctp on both links.
After creating the cluster on one node, adding the second node makes corosync on the first node crash with SIGSEGV.

Version-Release number of selected component (if applicable):

OS:

Red Hat Enterprise Linux release 9.5 (Plow)

corosync:

Corosync Cluster Engine, version '3.1.8'
Copyright (c) 2006-2021 Red Hat, Inc.

Built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Available crypto models: nss openssl
Available compression models: zlib lz4 lz4hc lzo2 lzma bzip2 zstd



How reproducible:


Steps to Reproduce:
1. pcs host auth node0 -u hacluster -p password
2. pcs cluster setup virtstack node0 addr=169.254.88.30 addr=192.168.0.50 transport knet link_mode=active link linknumber=0 transport=sctp link linknumber=1 transport=sctp
3. pcs cluster start --all
4. pcs property set stonith-enabled=false
5. crm_verify -L -V
6. pcs resource defaults resource-stickiness=100
7. pcs resource op defaults timeout=240s

Add another node:

8. pcs host auth node1 -u hacluster -p password
9. pcs cluster node add node1 addr=169.254.88.31 addr=192.168.0.51 --start --wait=120
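For reference, the setup command in step 2 should generate a corosync.conf with a totem/interface layout roughly like the following sketch (node name, node ID, and ring addresses are taken from the commands above; exact option names and defaults may differ by pcs version):

```
totem {
    version: 2
    cluster_name: virtstack
    transport: knet
    link_mode: active

    interface {
        linknumber: 0
        knet_transport: sctp
    }
    interface {
        linknumber: 1
        knet_transport: sctp
    }
}

nodelist {
    node {
        ring0_addr: 169.254.88.30
        ring1_addr: 192.168.0.50
        name: node0
        nodeid: 1
    }
}
```

Step 9 then appends a second node entry to the nodelist and reloads this configuration on node0, which is the point where the crash occurs.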


Actual results:
coredump
systemctl status corosync.service

× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; preset: disabled)
     Active: failed (Result: core-dump) since Thu 2025-03-20 17:41:39 CST; 25s ago
   Duration: 1min 30.458s
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 2293 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=dumped, signal=SEGV)
   Main PID: 2293 (code=dumped, signal=SEGV)
        CPU: 866ms

Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] host: host: 2 has 0 active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] host: host: 2 has 0 active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] host: host: 2 has no active links
Mar 20 17:41:36 node0 corosync[2293]:   [KNET  ] pmtud: MTU manually set to: 0
Mar 20 17:41:37 node0 corosync[2293]:   [KNET  ] sctp: SCTP getsockopt() on connecting socket 37 failed: Bad file descriptor
Mar 20 17:41:39 node0 systemd-coredump[2455]: Process 2293 (corosync) of user 0 dumped core.
                                              
                                              Stack trace of thread 2296:
                                              #0  0x00007f70658e3b0c _reconnect_socket (libknet.so.1 + 0x14b0c)
                                              #1  0x00007f70658e6e18 _sctp_connect_thread (libknet.so.1 + 0x17e18)
                                              #2  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
                                              #3  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)
                                              
                                              Stack trace of thread 2293:
                                              #0  0x00007f706570de3e epoll_wait (libc.so.6 + 0x10de3e)
                                              #1  0x00007f70659f6240 _poll_and_add_to_jobs_ (libqb.so.100 + 0x1e240)
                                              #2  0x00007f70659e85f7 qb_loop_run (libqb.so.100 + 0x105f7)
                                              #3  0x00005624d380c3e3 main (corosync + 0xb3e3)
                                              #4  0x00007f70656295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
                                              #5  0x00007f7065629680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
                                              #6  0x00005624d380d0e5 _start (corosync + 0xc0e5)
                                              
                                              Stack trace of thread 2299:
                                              #0  0x00007f706570de3e epoll_wait (libc.so.6 + 0x10de3e)
                                              #1  0x00007f70658e5e98 _handle_send_to_links_thread (libknet.so.1 + 0x16e98)
                                              #2  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
                                              #3  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)
                                              
                                              Stack trace of thread 2297:
                                              #0  0x00007f70656d3c35 clock_nanosleep.5 (libc.so.6 + 0xd3c35)
                                              #1  0x00007f70656d8847 __nanosleep (libc.so.6 + 0xd8847)
                                              #2  0x00007f70657046e9 usleep (libc.so.6 + 0x1046e9)
                                              #3  0x00007f70658d7eca _handle_pmtud_link_thread (libknet.so.1 + 0x8eca)
                                              #4  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)
                                              #5  0x00007f706570e800 __clone3 (libc.so.6 + 0x10e800)
                                              
                                              Stack trace of thread 2298:
                                              #0  0x00007f7065686839 __futex_abstimed_wait_common (libc.so.6 + 0x86839)
                                              #1  0x00007f706568f30b pthread_rwlock_rdlock.5 (libc.so.6 + 0x8f30b)
                                              #2  0x00007f70658d7db0 shutdown_in_progress (libknet.so.1 + 0x8db0)
                                              #3  0x00007f70658de00e _handle_dst_link_handler_thread (libknet.so.1 + 0xf00e)
                                              #4  0x00007f70656897e2 start_thread (libc.so.6 + 0x897e2)


Expected results:
Adding the second node succeeds.

Additional info:

Comment 1 tangyla 2025-03-27 02:05:20 UTC
I've found a simple workaround: sleep 5 between steps 8 and 9, then execute step 9 to start node1; with that delay, node0's corosync process doesn't crash.

I think this is a corosync bug; the crash seems to happen while the corosync process on node0 is still loading and applying the new configuration.

Comment 2 Christine Caulfield 2025-03-27 09:53:55 UTC
I don't know why the sleep helps in your instance, but the fundamental problem (when I try to reproduce it, at least) is that the sctp kernel module is not loaded. If you run modprobe sctp on the nodes first, the cluster forms correctly because the sockets can be created.

Comment 3 tangyla 2025-03-28 10:08:50 UTC
[root@node0 ~]# lsmod|grep sctp
sctp                  544768  4
ip6_udp_tunnel         16384  1 sctp
udp_tunnel             36864  1 sctp
libcrc32c              12288  2 xfs,sctp
[root@node0 ~]# cat /etc/modules-load.d/sctp.conf 
sctp
[root@node0 ~]# cat /etc/modprobe.d/sctp-blacklist.conf
# This kernel module can be automatically loaded by non-root users. To
# enhance system security, the module is blacklisted by default to ensure
# system administrators make the module available for use as needed.
# See https://access.redhat.com/articles/3760101 for more details.
#
# Remove the blacklist by adding a comment # at the start of the line.
#blacklist sctp
[root@node0 ~]# cat /etc/modprobe.d/sctp_diag-blacklist.conf
# This kernel module can be automatically loaded by non-root users. To
# enhance system security, the module is blacklisted by default to ensure
# system administrators make the module available for use as needed.
# See https://access.redhat.com/articles/3760101 for more details.
#
# Remove the blacklist by adding a comment # at the start of the line.
#blacklist sctp_diag

Comment 4 tangyla 2025-03-28 10:13:09 UTC
Of course, this may be a special case: the 'sleep 5' isn't needed on other server platforms, e.g. VMware Workstation/ESXi/KVM or Huawei/Supermicro/Dell hardware. It only occurs on an Advantech server.

Comment 5 tangyla 2025-03-28 12:17:54 UTC
[root@node0 ~]# pcs host auth node0 -u hacluster -p VirtStack@1302 && pcs cluster setup virtstack node0 addr=169.254.88.30 addr=192.168.0.50 transport knet link_mode=active link linknumber=0 transport=sctp link linknumber=1 transport=sctp && pcs cluster start --all && pcs property set stonith-enabled=false && crm_verify -L -V && pcs resource defaults resource-stickiness=100 && pcs resource op defaults timeout=240s
node0: Authorized
Destroying cluster on hosts: 'node0'...
node0: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node0'
node0: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node0'
node0: successful distribution of the file 'corosync authkey'
node0: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'node0'
node0: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
node0: Starting Cluster...
Deprecation Warning: This command is deprecated and will be removed. Please use 'pcs resource defaults update' instead.
Warning: Defaults do not apply to resources which override them with their own defined values
Deprecation Warning: This command is deprecated and will be removed. Please use 'pcs resource op defaults update' instead.
Warning: Defaults do not apply to resources which override them with their own defined values
[root@node0 ~]# pcs host auth node1 -u hacluster -p VirtStack@1302 && pcs cluster node add node1 addr=169.254.88.31 addr=192.168.0.51 && pcs cluster start node1 && sleep 5 && systemctl status corosync.service
node1: Authorized
Disabling sbd...
node1: sbd disabled
Sending 'corosync authkey', 'pacemaker authkey' to 'node1'
node1: successful distribution of the file 'corosync authkey'
node1: successful distribution of the file 'pacemaker authkey'
Sending updated corosync.conf to nodes...
node0: Succeeded
node1: Succeeded
node0: Corosync configuration reloaded
node1: Starting Cluster...
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; preset: disabled)
     Active: failed (Result: core-dump) since Fri 2025-03-28 20:16:29 CST; 4s ago
   Duration: 32.242s
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 4278 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=dumped, signal=SEGV)
   Main PID: 4278 (code=dumped, signal=SEGV)

Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] host: host: 2 has 0 active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] host: host: 2 has 0 active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] host: host: 2 has no active links
Mar 28 20:16:25 node0 corosync[4278]:   [KNET  ] pmtud: MTU manually set to: 0
Mar 28 20:16:26 node0 corosync[4278]:   [KNET  ] sctp: SCTP getsockopt() on connecting socket 37 failed: Bad file descriptor
Mar 28 20:16:27 node0 corosync[4278]:   [KNET  ] sctp: SCTP getsockopt() on connecting socket 39 failed: Bad file descriptor
Mar 28 20:16:29 node0 systemd[1]: corosync.service: Main process exited, code=dumped, status=11/SEGV
Mar 28 20:16:29 node0 systemd[1]: corosync.service: Failed with result 'core-dump'.

Comment 6 tangyla 2025-03-28 12:18:43 UTC
If you need any other info, please tell me.

Comment 7 Christine Caulfield 2025-06-04 11:55:09 UTC
With the sctp module loaded on both nodes I can't reproduce this. Are there any messages from the kernel when this happens?

Also, can you post the rpm versions of the kernel, libknet1 and corosync, please?

