Bug 1397923
| Summary: | [RHGS + Ganesha] : Corosync crashes and dumps core when glusterd/nfs-ganesha are restarted amidst continuous I/O | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ambarish <asoman> | |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> | |
| Status: | CLOSED DEFERRED | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 7.3 | CC: | amukherj, asoman, bturner, ccaulfie, cluster-maint, jthottan, kkeithle, rhinduja, skoduri | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 1402308 (view as bug list) | Environment: | ||
| Last Closed: | 2017-01-16 12:04:05 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1402308, 1402344 | |||
@Ambarish: This assertion usually happens when either:
- ifdown happens,
- there is more than one cluster with the same configuration on the network, or
- different encryption methods are in use.

For this BZ, ifdown looks like the case: "[22660] gqas009.sbu.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] The network interface is down." appears in gqas009/corosync.log.

Please never do ifdown; ifdown makes corosync break. If you are using NM (recommended in RHEL 7), install NetworkManager-config-server. Can you please retest without ifdown?

Jan,

Thanks for your comment. But I didn't do an ifdown explicitly, and I am not really sure what triggered it. I just restarted the gluster and ganesha daemons while pumping I/O from my mounts, when one of my servers bounced.

Also, you mentioned corosync breaking on ifdowns. Is it a known issue? Can you please point me to a BZ?

Corosync + ifdown is a long-term issue. It is very hard to fix and we plan a solution for corosync 3.x (so RHEL.next), so a fix is probably not going to happen in RHEL 7/6. I don't really know what the underlying reason for the ifdown was, but it certainly happened (as noted, please see "The network interface is down" in the logs). Make sure to install NetworkManager-config-server if you are using NM. Please retest and report if the bug happens again even when "The network interface is down" is not in corosync.log.

*** Bug 1402344 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
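A minimal sketch of the checks suggested in the comment above, assuming the cluster log is written to /var/log/cluster/corosync.log (the actual path depends on the "logfile" setting in corosync.conf):

```sh
# Check whether corosync logged the interface going down before the assertion;
# the log path below is an assumption - adjust it to your corosync.conf logfile setting.
grep "The network interface is down" /var/log/cluster/corosync.log

# On RHEL 7 with NetworkManager, install the server configuration profile so NM
# leaves configured interfaces alone instead of taking them down, then restart NM
# to pick up the dropped-in config snippet.
yum install -y NetworkManager-config-server
systemctl restart NetworkManager
```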
Description of problem:
-----------------------
4 node Ganesha cluster. 4 clients mounted the volume, 2 via v3 and 2 via v4. A Linux tarball untar was running inside 4 different subdirs from 4 different clients.

I restarted nfs-ganesha, followed by glusterd, on all four nodes. One of the machines went for a toss. I could see the following BT from corosync when it came up:

```
Thread 2 (Thread 0x7f9be7773700 (LWP 22667)):
#0  0x00007f9bea37c79b in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7f9bea7f0ca0)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
#1  do_futex_wait (sem=sem@entry=0x7f9bea7f0ca0, abstime=0x0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:223
#2  0x00007f9bea37c82f in __new_sem_wait_slow (sem=0x7f9bea7f0ca0, abstime=0x0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:292
#3  0x00007f9bea37c8cb in __new_sem_wait (sem=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:28
#4  0x00007f9bea5a0621 in qb_logt_worker_thread () from /lib64/libqb.so.0
#5  0x00007f9bea376dc5 in start_thread (arg=0x7f9be7773700) at pthread_create.c:308
#6  0x00007f9bea0a573d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f9beae26740 (LWP 22666)):
#0  0x00007f9be9fe31d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9be9fe4a08 in __GI_abort () at abort.c:119
#2  0x00007f9be9fdc146 in __assert_fail_base (fmt=0x7f9bea12d3a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f9beaa0fad9 "token_memb_entries >= 1", file=file@entry=0x7f9beaa0fa70 "totemsrp.c",
    line=line@entry=1324, function=function@entry=0x7f9beaa10f10 <__PRETTY_FUNCTION__.8928> "memb_consensus_agreed") at assert.c:92
#3  0x00007f9be9fdc1f2 in __GI___assert_fail (assertion=assertion@entry=0x7f9beaa0fad9 "token_memb_entries >= 1",
    file=file@entry=0x7f9beaa0fa70 "totemsrp.c", line=line@entry=1324,
    function=function@entry=0x7f9beaa10f10 <__PRETTY_FUNCTION__.8928> "memb_consensus_agreed") at assert.c:101
#4  0x00007f9bea9f9f70 in memb_consensus_agreed (instance=0x7f9beade9010) at totemsrp.c:1324
#5  0x00007f9beaa030f8 in memb_consensus_agreed (instance=0x7f9beade9010) at totemsrp.c:1301
#6  0x00007f9beaa061c8 in memb_join_process (instance=instance@entry=0x7f9beade9010, memb_join=memb_join@entry=0x7f9beca4cec8) at totemsrp.c:4247
#7  0x00007f9beaa099db in message_handler_memb_join (instance=0x7f9beade9010, msg=<optimized out>, msg_len=<optimized out>,
    endian_conversion_needed=<optimized out>) at totemsrp.c:4519
#8  0x00007f9beaa00f01 in rrp_deliver_fn (context=0x7f9beca4cc10, msg=0x7f9beca4cec8, msg_len=519) at totemrrp.c:1952
#9  0x00007f9bea9fcf9e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7f9beca4ce60) at totemudpu.c:499
#10 0x00007f9bea59783f in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#11 0x00007f9bea5973d0 in qb_loop_run () from /lib64/libqb.so.0
#12 0x00007f9beae4c7d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1405
(gdb)
```

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
```
corosync-2.4.0-4.el7.x86_64
glusterfs-server-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
pacemaker-1.1.15-11.el7.x86_64
pcs-0.9.152-10.el7.x86_64
glibc-2.17-157.el7.x86_64

[root@gqas009 tmp]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
```
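For reference, a backtrace like the one above can be regenerated from the corosync core; a minimal sketch, assuming the core file name and location (which depend on abrt / kernel.core_pattern settings) and the RHEL 7 debuginfo channel being enabled:

```sh
# Install debug symbols so the totemsrp.c frames resolve to source lines.
debuginfo-install -y corosync libqb glibc

# Open the core against the corosync binary; the core path below is hypothetical.
gdb /usr/sbin/corosync /tmp/core.22666

# Inside gdb:
#   (gdb) thread apply all bt full     # dump all threads
#   (gdb) frame 4                      # memb_consensus_agreed () at totemsrp.c:1324
```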
```
[root@gqas009 tmp]# uname -r
3.10.0-514.el7.x86_64
```

Keeping the setup in the same state, in case someone wants to take a look.

How reproducible:
-----------------
Reporting the first occurrence.

Steps to Reproduce:
-------------------
Not sure how reproducible this is, but this is what I did (a hedged command sketch is included at the end of this report):
1. Create a 2x2 volume. Mount it on 4 clients - v3 and v4 - 2 each.
2. Run continuous I/O.
3. Restart glusterd and nfs-ganesha.

Actual results:
---------------
Corosync crashed on one of the nodes.

Expected results:
-----------------
Corosync should not crash.

Additional info:
----------------
*Vol Config* :

```
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 23f6d9c5-ff8a-44c7-be85-8a7b5e83f54a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: off
server.allow-insecure: on
features.cache-invalidation: on
ganesha.enable: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
cluster.enable-shared-storage: enable
nfs-ganesha: enable
```

*pcs status post crash* :

```
[root@gqas009 tmp]# pcs status
Cluster name: G1474623742.03
Stack: corosync
Current DC: gqas010.sbu.lab.eng.bos.redhat.com (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Wed Nov 23 09:58:23 2016
Last change: Wed Nov 23 08:07:50 2016 by root via crm_attribute on gqas009.sbu.lab.eng.bos.redhat.com

4 nodes and 24 resources configured

Online: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas009.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas008.sbu.lab.eng.bos.redhat.com gqas010.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas009.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas008.sbu.lab.eng.bos.redhat.com-group
     gqas008.sbu.lab.eng.bos.redhat.com-nfs_block    (ocf::heartbeat:portblock): Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas008.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):    Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas008.sbu.lab.eng.bos.redhat.com-nfs_unblock  (ocf::heartbeat:portblock): Started gqas008.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas009.sbu.lab.eng.bos.redhat.com-group
     gqas009.sbu.lab.eng.bos.redhat.com-nfs_block    (ocf::heartbeat:portblock): Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas009.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):    Started gqas008.sbu.lab.eng.bos.redhat.com
     gqas009.sbu.lab.eng.bos.redhat.com-nfs_unblock  (ocf::heartbeat:portblock): Started gqas008.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas010.sbu.lab.eng.bos.redhat.com-group
     gqas010.sbu.lab.eng.bos.redhat.com-nfs_block    (ocf::heartbeat:portblock): Started gqas010.sbu.lab.eng.bos.redhat.com
     gqas010.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):    Started gqas010.sbu.lab.eng.bos.redhat.com
     gqas010.sbu.lab.eng.bos.redhat.com-nfs_unblock  (ocf::heartbeat:portblock): Started gqas010.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas014.sbu.lab.eng.bos.redhat.com-group
     gqas014.sbu.lab.eng.bos.redhat.com-nfs_block    (ocf::heartbeat:portblock): Started gqas014.sbu.lab.eng.bos.redhat.com
     gqas014.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):    Started gqas014.sbu.lab.eng.bos.redhat.com
     gqas014.sbu.lab.eng.bos.redhat.com-nfs_unblock  (ocf::heartbeat:portblock): Started gqas014.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-grace_monitor_0 on gqas009.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=77, status=complete, exitreason='none',
    last-rc-change='Wed Nov 23 08:16:45 2016', queued=0ms, exec=68ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: inactive/disabled
[root@gqas009 tmp]#
```
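A minimal command sketch of the Steps to Reproduce above; host names and brick paths are taken from the volume info in this report, but the exact flow, the virtual IP placeholder <VIP>, and the client mount points are assumptions:

```sh
# 1. Create and start a 2x2 distributed-replicated volume, then export it via ganesha.
gluster volume create testvol replica 2 \
    gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol
gluster volume set testvol ganesha.enable on

# 2. Mount on four clients, two over NFSv3 and two over NFSv4, and run continuous
#    I/O (e.g. untarring a kernel tarball into a per-client subdirectory).
mount -t nfs -o vers=3   <VIP>:/testvol /mnt/testvol   # clients 1-2
mount -t nfs -o vers=4.0 <VIP>:/testvol /mnt/testvol   # clients 3-4

# 3. While the I/O is running, restart the daemons on all four servers.
systemctl restart nfs-ganesha
systemctl restart glusterd
```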