Bug 1018232 - gfs_controld[1438]: cpg_initialize error 100 after fenced node rejoins and mounts GFS2
gfs_controld[1438]: cpg_initialize error 100 after fenced node rejoins and mo...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync (Show other bugs)
6.5
Unspecified Unspecified
medium Severity medium
: rc
: ---
Assigned To: Jan Friesse
Cluster QE
:
Depends On: 1055584
Blocks: 1023565 1019343
  Show dependency treegraph
 
Reported: 2013-10-11 10:00 EDT by Jaroslav Kortus
Modified: 2014-10-14 03:11 EDT (History)
6 users (show)

See Also:
Fixed In Version: corosync-1.4.1-18.el6
Doc Type: Bug Fix
Doc Text:
This BZ doesn't need documentation. It's only for newer versions of kernels (newer then included in RHEL 6).
Story Points: ---
Clone Of:
: 1019343 (view as bug list)
Environment:
Last Closed: 2014-10-14 03:11:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Reproducer (2.70 KB, text/x-csrc)
2013-10-15 10:04 EDT, Jan Friesse
no flags Details
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop) (1.85 KB, patch)
2013-10-18 05:15 EDT, Jan Friesse
no flags Details | Diff
Proposed patch - part 2 - Set SO_PASSCRED on listener socket (1.26 KB, patch)
2013-10-18 05:16 EDT, Jan Friesse
no flags Details | Diff

  None (edit)
Description Jaroslav Kortus 2013-10-11 10:00:59 EDT
Description of problem:
Sometimes after the fenced node is rebooted the mount of GFS2 file system fails.
[2013-10-11 15:39:44.443563] [revolver-gfs2] mounting /dev/mapper/STSRHTS8560-STSRHTS85600 on /mnt/STSRHTS85600 on virt-139.cluster-qe.lab.eng.brq.redhat.com
[2013-10-11 15:39:44.528422] [revolver-gfs2] gfs_controld join error: -107
[2013-10-11 15:39:44.528636] [revolver-gfs2] error mounting lockproto lock_dlm

With debug package of corosync we can get more details:
Oct 11 15:39:44 virt-139 qarshd[2266]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 0, uid = 4294967295, gid = 4294967295
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security invalid
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] Invalid IPC credentials.
Oct 11 15:39:44 virt-139 gfs_controld[1438]: cpg_initialize error 100
Oct 11 15:39:44 virt-139 gfs_controld[1438]: join: group join error -107
Oct 11 15:39:44 virt-139 gfs_controld[1438]: process_dlmcontrol no mg STSRHTS85600


Version-Release number of selected component (if applicable):
gfs2-utils-3.0.12.1-59.el6.x86_64
kernel-2.6.32-421.el6.x86_64
(20131007.n.0 snap)

How reproducible:
like 30% of my jobs fail on this

Steps to Reproduce:
1. setup 3x gfs2 filesystem
2. mount and run I/O on it
3. fence a node, wait for rejoin and remount the file systems
4. (we call this revolver)

Actual results:
FS not mounted, IPC credentials errors printed in syslog

Expected results:
FS mounted

Additional info:
Comment 1 Jaroslav Kortus 2013-10-11 10:04:38 EDT
Successful run of the same command later:

Oct 11 15:55:00 virt-139 qarshd[2642]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Starting up
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Ready
Oct 11 15:55:00 virt-139 kernel: GFS2 (built Sep 30 2013 12:03:53) installed
Oct 11 15:55:00 virt-139 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS8560:STSRHTS85600"
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: Joined cluster. Now mounting FS...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1, already locked for use
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Looking at journal...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Done


root      1438  0.0  0.1 134248  1904 ?        Ssl  15:39   0:00 gfs_controld
root      1382  0.0  0.5 218228  5948 ?        Ssl  15:39   0:00 dlm_controld
Comment 4 David Teigland 2013-10-11 11:25:26 EDT
It looks like something may have changed with IPC since 6.4.
Comment 5 Jan Friesse 2013-10-14 03:29:33 EDT
Jardo,
thanks for running test with debug packages. It helped me reduce problem space a lot (internal uidgid handling looks ok). Problem is really in CRED CMSG. Sadly that code wasn't touched for ages, so it seems to be ether new feature of kernel of libc, or ...

This is serious blocker (= must be included in 6.5) so I've changed priority.
Comment 7 Jan Friesse 2013-10-15 10:01:46 EDT
After digging to problem a little more. Problem seems to be kernel problem (not corosync one) and related to  SO_PASSCRED.

Next comment includes very simple (I would call it as minimal) reproducer.

Steps to reproduce:
gcc usock-to-stdout.c
./a.out /tmp/a

on different console execute:
while true;do echo "ahjo" | nc -U /tmp/a;done

Reproducible with kernel 2.6.32-419.el6.x86_64.
Comment 8 Jan Friesse 2013-10-15 10:04:50 EDT
Created attachment 812543 [details]
Reproducer

Reproducer
Comment 9 Jan Friesse 2013-10-18 05:15:09 EDT
Created attachment 813659 [details]
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop)
Comment 10 Jan Friesse 2013-10-18 05:16:16 EDT
Created attachment 813660 [details]
Proposed patch - part 2 - Set SO_PASSCRED on listener socket

This patch is needed only if http://patchwork.ozlabs.org/patch/284366/ is accepted without any changes.
Comment 15 errata-xmlrpc 2014-10-14 03:11:38 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html

Note You need to log in before you can comment on or make changes to this bug.