Bug 1018232

Summary: gfs_controld[1438]: cpg_initialize error 100 after fenced node rejoins and mounts GFS2
Product: Red Hat Enterprise Linux 6 Reporter: Jaroslav Kortus <jkortus>
Component: corosyncAssignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.5CC: ccaulfie, cluster-maint, fdinitto, rpeterso, teigland, tlavigne
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-1.4.1-18.el6 Doc Type: Bug Fix
Doc Text:
This BZ doesn't need documentation. It's only for newer versions of kernels (newer then included in RHEL 6).
Story Points: ---
Clone Of:
: 1019343 (view as bug list) Environment:
Last Closed: 2014-10-14 07:11:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1055584    
Bug Blocks: 1019343, 1023565    
Attachments:
Description Flags
Reproducer
none
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop)
none
Proposed patch - part 2 - Set SO_PASSCRED on listener socket none

Description Jaroslav Kortus 2013-10-11 14:00:59 UTC
Description of problem:
Sometimes after the fenced node is rebooted the mount of GFS2 file system fails.
[2013-10-11 15:39:44.443563] [revolver-gfs2] mounting /dev/mapper/STSRHTS8560-STSRHTS85600 on /mnt/STSRHTS85600 on virt-139.cluster-qe.lab.eng.brq.redhat.com
[2013-10-11 15:39:44.528422] [revolver-gfs2] gfs_controld join error: -107
[2013-10-11 15:39:44.528636] [revolver-gfs2] error mounting lockproto lock_dlm

With debug package of corosync we can get more details:
Oct 11 15:39:44 virt-139 qarshd[2266]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 0, uid = 4294967295, gid = 4294967295
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security invalid
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] Invalid IPC credentials.
Oct 11 15:39:44 virt-139 gfs_controld[1438]: cpg_initialize error 100
Oct 11 15:39:44 virt-139 gfs_controld[1438]: join: group join error -107
Oct 11 15:39:44 virt-139 gfs_controld[1438]: process_dlmcontrol no mg STSRHTS85600


Version-Release number of selected component (if applicable):
gfs2-utils-3.0.12.1-59.el6.x86_64
kernel-2.6.32-421.el6.x86_64
(20131007.n.0 snap)

How reproducible:
like 30% of my jobs fail on this

Steps to Reproduce:
1. setup 3x gfs2 filesystem
2. mount and run I/O on it
3. fence a node, wait for rejoin and remount the file systems
4. (we call this revolver)

Actual results:
FS not mounted, IPC credentials errors printed in syslog

Expected results:
FS mounted

Additional info:

Comment 1 Jaroslav Kortus 2013-10-11 14:04:38 UTC
Successful run of the same command later:

Oct 11 15:55:00 virt-139 qarshd[2642]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Starting up
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Ready
Oct 11 15:55:00 virt-139 kernel: GFS2 (built Sep 30 2013 12:03:53) installed
Oct 11 15:55:00 virt-139 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS8560:STSRHTS85600"
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: Joined cluster. Now mounting FS...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1, already locked for use
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Looking at journal...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Done


root      1438  0.0  0.1 134248  1904 ?        Ssl  15:39   0:00 gfs_controld
root      1382  0.0  0.5 218228  5948 ?        Ssl  15:39   0:00 dlm_controld

Comment 4 David Teigland 2013-10-11 15:25:26 UTC
It looks like something may have changed with IPC since 6.4.

Comment 5 Jan Friesse 2013-10-14 07:29:33 UTC
Jardo,
thanks for running test with debug packages. It helped me reduce problem space a lot (internal uidgid handling looks ok). Problem is really in CRED CMSG. Sadly that code wasn't touched for ages, so it seems to be ether new feature of kernel of libc, or ...

This is serious blocker (= must be included in 6.5) so I've changed priority.

Comment 7 Jan Friesse 2013-10-15 14:01:46 UTC
After digging to problem a little more. Problem seems to be kernel problem (not corosync one) and related to  SO_PASSCRED.

Next comment includes very simple (I would call it as minimal) reproducer.

Steps to reproduce:
gcc usock-to-stdout.c
./a.out /tmp/a

on different console execute:
while true;do echo "ahjo" | nc -U /tmp/a;done

Reproducible with kernel 2.6.32-419.el6.x86_64.

Comment 8 Jan Friesse 2013-10-15 14:04:50 UTC
Created attachment 812543 [details]
Reproducer

Reproducer

Comment 9 Jan Friesse 2013-10-18 09:15:09 UTC
Created attachment 813659 [details]
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop)

Comment 10 Jan Friesse 2013-10-18 09:16:16 UTC
Created attachment 813660 [details]
Proposed patch - part 2 - Set SO_PASSCRED on listener socket

This patch is needed only if http://patchwork.ozlabs.org/patch/284366/ is accepted without any changes.

Comment 15 errata-xmlrpc 2014-10-14 07:11:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html