Bug 1397923 - [RHGS + Ganesha] : Corosync crashes and dumps core when glusterd/nfs-ganesha are restarted amidst continuous I/O
Summary: [RHGS + Ganesha] : Corosync crashes and dumps core when glusterd/nfs-ganesha...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1402344
Depends On:
Blocks: 1402308 1402344
 
Reported: 2016-11-23 14:59 UTC by Ambarish
Modified: 2023-09-14 03:34 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1402308 (view as bug list)
Environment:
Last Closed: 2017-01-16 12:04:05 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Ambarish 2016-11-23 14:59:39 UTC
Description of problem:
-----------------------

4-node Ganesha cluster.
4 clients mounted the volume, 2 via NFSv3 and 2 via NFSv4.
A Linux tarball untar was running inside 4 different subdirs from 4 different clients.
I restarted nfs-ganesha, followed by glusterd, on all four nodes.

One of the machines went for a toss. I could see the following backtrace from corosync when it came up:

Thread 2 (Thread 0x7f9be7773700 (LWP 22667)):
#0  0x00007f9bea37c79b in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, 
    futex=0x7f9bea7f0ca0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
#1  do_futex_wait (sem=sem@entry=0x7f9bea7f0ca0, abstime=0x0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:223
#2  0x00007f9bea37c82f in __new_sem_wait_slow (sem=0x7f9bea7f0ca0, abstime=0x0)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:292
#3  0x00007f9bea37c8cb in __new_sem_wait (sem=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:28
#4  0x00007f9bea5a0621 in qb_logt_worker_thread () from /lib64/libqb.so.0
#5  0x00007f9bea376dc5 in start_thread (arg=0x7f9be7773700) at pthread_create.c:308
#6  0x00007f9bea0a573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f9beae26740 (LWP 22666)):
#0  0x00007f9be9fe31d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9be9fe4a08 in __GI_abort () at abort.c:119
#2  0x00007f9be9fdc146 in __assert_fail_base (fmt=0x7f9bea12d3a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7f9beaa0fad9 "token_memb_entries >= 1", file=file@entry=0x7f9beaa0fa70 "totemsrp.c", 
    line=line@entry=1324, function=function@entry=0x7f9beaa10f10 <__PRETTY_FUNCTION__.8928> "memb_consensus_agreed")
    at assert.c:92
#3  0x00007f9be9fdc1f2 in __GI___assert_fail (assertion=assertion@entry=0x7f9beaa0fad9 "token_memb_entries >= 1", 
    file=file@entry=0x7f9beaa0fa70 "totemsrp.c", line=line@entry=1324, 
    function=function@entry=0x7f9beaa10f10 <__PRETTY_FUNCTION__.8928> "memb_consensus_agreed") at assert.c:101
#4  0x00007f9bea9f9f70 in memb_consensus_agreed (instance=0x7f9beade9010) at totemsrp.c:1324
#5  0x00007f9beaa030f8 in memb_consensus_agreed (instance=0x7f9beade9010) at totemsrp.c:1301
#6  0x00007f9beaa061c8 in memb_join_process (instance=instance@entry=0x7f9beade9010, 
    memb_join=memb_join@entry=0x7f9beca4cec8) at totemsrp.c:4247
#7  0x00007f9beaa099db in message_handler_memb_join (instance=0x7f9beade9010, msg=<optimized out>, 
    msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:4519
#8  0x00007f9beaa00f01 in rrp_deliver_fn (context=0x7f9beca4cc10, msg=0x7f9beca4cec8, msg_len=519) at totemrrp.c:1952
#9  0x00007f9bea9fcf9e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7f9beca4ce60)
    at totemudpu.c:499
#10 0x00007f9bea59783f in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#11 0x00007f9bea5973d0 in qb_loop_run () from /lib64/libqb.so.0
#12 0x00007f9beae4c7d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1405
 

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

corosync-2.4.0-4.el7.x86_64
glusterfs-server-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
pacemaker-1.1.15-11.el7.x86_64
pcs-0.9.152-10.el7.x86_64
glibc-2.17-157.el7.x86_64

[root@gqas009 tmp]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[root@gqas009 tmp]# 

[root@gqas009 tmp]# uname -r
3.10.0-514.el7.x86_64
[root@gqas009 tmp]# 

Keeping the setup in the same state, in case someone wants to take a look.

How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
------------------

Not sure how reproducible this is. Still, this is what I did:

1. Create a 2x2 volume. Mount it on 4 clients - 2 via NFSv3 and 2 via NFSv4.

2. Run continuous IO.

3. Restart glusterd and nfs-ganesha on all nodes (a rough command sketch is below).
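
A rough reproducer sketch, under assumptions not taken from this report: the node/client hostnames, brick paths, mount points, and tarball name below are placeholders; only the restart order (nfs-ganesha first, then glusterd) follows the description above.

# Hypothetical sketch; hostnames, paths, and the tarball name are placeholders.

# 1. Create and start a 2x2 (distributed-replicate) volume, export it via Ganesha.
gluster volume create testvol replica 2 \
    node1:/bricks/testvol_brick0 node2:/bricks/testvol_brick1 \
    node3:/bricks/testvol_brick2 node4:/bricks/testvol_brick3
gluster volume start testvol
gluster volume set testvol ganesha.enable on

# 2. On each of the 4 clients, mount (2 via NFSv3, 2 via NFSv4) and start an
#    untar of a Linux tarball in a per-client subdirectory.
mount -t nfs -o vers=3 node1:/testvol /mnt/testvol      # clients 1 and 2
mount -t nfs -o vers=4.0 node1:/testvol /mnt/testvol    # clients 3 and 4
mkdir -p /mnt/testvol/$(hostname)
tar -xf linux.tar.xz -C /mnt/testvol/$(hostname) &

# 3. While the untar is still running, restart nfs-ganesha and then glusterd
#    on all four server nodes.
for node in node1 node2 node3 node4; do
    ssh "$node" 'systemctl restart nfs-ganesha && systemctl restart glusterd'
done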

Actual results:
---------------

Corosync crashed on one of the nodes.

Expected results:
-----------------

Corosync should not crash.

Additional info:
---------------

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 23f6d9c5-ff8a-44c7-be85-8a7b5e83f54a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
features.cache-invalidation: on
ganesha.enable: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
cluster.enable-shared-storage: enable
nfs-ganesha: enable
[root@gqas009 tmp]# 
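
For reference, a hedged sketch of re-applying the per-volume options listed above using the standard "gluster volume set VOLNAME key value" syntax; the last two settings are cluster-wide and use their own commands. Only the volume name and option list come from the output above; the rest is illustrative.

# Sketch: re-apply the "Options Reconfigured" list on an existing volume.
# $opt is left unquoted on purpose so it word-splits into "key value".
for opt in \
    "nfs.disable on" \
    "performance.readdir-ahead on" \
    "transport.address-family inet" \
    "performance.stat-prefetch off" \
    "server.allow-insecure on" \
    "features.cache-invalidation on" \
    "ganesha.enable on" \
    "cluster.lookup-optimize on" \
    "server.event-threads 4" \
    "client.event-threads 4"
do
    gluster volume set testvol $opt
done

# Cluster-wide settings (not per-volume options):
gluster volume set all cluster.enable-shared-storage enable
gluster nfs-ganesha enable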



*pcs status post crash* :

[root@gqas009 tmp]# pcs status
Cluster name: G1474623742.03
Stack: corosync
Current DC: gqas010.sbu.lab.eng.bos.redhat.com (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Wed Nov 23 09:58:23 2016		Last change: Wed Nov 23 08:07:50 2016 by root via crm_attribute on gqas009.sbu.lab.eng.bos.redhat.com

4 nodes and 24 resources configured

Online: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas009.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas008.sbu.lab.eng.bos.redhat.com-group
     gqas008.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas008.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas008.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gqas008.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas009.sbu.lab.eng.bos.redhat.com-group
     gqas009.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas009.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas009.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gqas008.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas010.sbu.lab.eng.bos.redhat.com-group
     gqas010.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gqas010.sbu.lab.eng.bos.redhat.com
     gqas010.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas010.sbu.lab.eng.bos.redhat.com
     gqas010.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gqas010.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas014.sbu.lab.eng.bos.redhat.com-group
     gqas014.sbu.lab.eng.bos.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started gqas014.sbu.lab.eng.bos.redhat.com
     gqas014.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas014.sbu.lab.eng.bos.redhat.com
     gqas014.sbu.lab.eng.bos.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started gqas014.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-grace_monitor_0 on gqas009.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=77, status=complete, exitreason='none',
    last-rc-change='Wed Nov 23 08:16:45 2016', queued=0ms, exec=68ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: inactive/disabled
[root@gqas009 tmp]#

Comment 4 Jan Friesse 2016-11-28 09:08:24 UTC
@Ambarish:
This assertion usually fails when either:
- an ifdown happens
- there is more than one cluster with the same configuration on the network
- the nodes use different encryption methods

For this BZ, ifdown looks like the cause: "[22660] gqas009.sbu.lab.eng.bos.redhat.com corosyncnotice  [TOTEM ] The network interface is down." appears in gqas009/corosync.log.
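
A small, hedged sketch of how to check for that event; the log path below is a common default and should be adjusted to whatever "logfile" is set to in corosync.conf on this setup:

# Look for the interface-down event around the time of the crash.
grep -n "The network interface is down" /var/log/cluster/corosync.log

# If nothing matches, confirm where corosync is actually logging.
grep -n "logfile" /etc/corosync/corosync.conf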

Please never do ifdown; it breaks corosync. If you are using NetworkManager (recommended in RHEL 7), install NetworkManager-config-server.
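
A minimal sketch of that step on RHEL 7, assuming yum and that NetworkManager is restarted (or reloaded) so the server profile takes effect:

yum install -y NetworkManager-config-server
systemctl restart NetworkManager    # or "systemctl reload NetworkManager"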

Can you please retest without ifdown?

Comment 5 Ambarish 2016-11-28 10:12:09 UTC
Jan,

Thanks for your comment.

But I didn't do an ifdown explicitly, and I'm not really sure what triggered it. I just restarted the gluster and ganesha daemons while pumping I/O from my mounts, when one of my servers bounced.

Also, you mentioned corosync breaking on ifdown. Is that a known issue? Can you please point me to a BZ?

Comment 6 Jan Friesse 2016-11-28 10:37:36 UTC
Corosync + ifdown is a long-term issue. It's very hard to fix, and we plan a solution for corosync 3.x (so RHEL.next), so a fix is probably not going to happen in RHEL 7/6.

I don't really know what the main reason for the ifdown was, but it definitely happened (as noted, please see "The network interface is down" in the logs).

Make sure to install NetworkManager-config-server if you are using NM.

Please retest and check whether the bug happens again even when "The network interface is down" does not appear in corosync.log.

Comment 7 Jan Friesse 2016-12-07 12:24:30 UTC
*** Bug 1402344 has been marked as a duplicate of this bug. ***

Comment 9 Red Hat Bugzilla 2023-09-14 03:34:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

