RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1018232 - gfs_controld[1438]: cpg_initialize error 100 after fenced node rejoins and mounts GFS2
Summary: gfs_controld[1438]: cpg_initialize error 100 after fenced node rejoins and mo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.5
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 1055584
Blocks: 1019343 1023565
TreeView+ depends on / blocked
 
Reported: 2013-10-11 14:00 UTC by Jaroslav Kortus
Modified: 2014-10-14 07:11 UTC (History)
6 users (show)

Fixed In Version: corosync-1.4.1-18.el6
Doc Type: Bug Fix
Doc Text:
This BZ doesn't need documentation. It's only for newer versions of kernels (newer then included in RHEL 6).
Clone Of:
: 1019343 (view as bug list)
Environment:
Last Closed: 2014-10-14 07:11:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Reproducer (2.70 KB, text/x-csrc)
2013-10-15 14:04 UTC, Jan Friesse
no flags Details
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop) (1.85 KB, patch)
2013-10-18 09:15 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 2 - Set SO_PASSCRED on listener socket (1.26 KB, patch)
2013-10-18 09:16 UTC, Jan Friesse
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:1508 0 normal SHIPPED_LIVE corosync bug fix update 2014-10-14 01:22:31 UTC

Description Jaroslav Kortus 2013-10-11 14:00:59 UTC
Description of problem:
Sometimes after the fenced node is rebooted the mount of GFS2 file system fails.
[2013-10-11 15:39:44.443563] [revolver-gfs2] mounting /dev/mapper/STSRHTS8560-STSRHTS85600 on /mnt/STSRHTS85600 on virt-139.cluster-qe.lab.eng.brq.redhat.com
[2013-10-11 15:39:44.528422] [revolver-gfs2] gfs_controld join error: -107
[2013-10-11 15:39:44.528636] [revolver-gfs2] error mounting lockproto lock_dlm

With debug package of corosync we can get more details:
Oct 11 15:39:44 virt-139 qarshd[2266]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 0, uid = 4294967295, gid = 4294967295
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security invalid
Oct 11 15:39:44 virt-139 corosync[1309]:   [IPC   ] Invalid IPC credentials.
Oct 11 15:39:44 virt-139 gfs_controld[1438]: cpg_initialize error 100
Oct 11 15:39:44 virt-139 gfs_controld[1438]: join: group join error -107
Oct 11 15:39:44 virt-139 gfs_controld[1438]: process_dlmcontrol no mg STSRHTS85600


Version-Release number of selected component (if applicable):
gfs2-utils-3.0.12.1-59.el6.x86_64
kernel-2.6.32-421.el6.x86_64
(20131007.n.0 snap)

How reproducible:
like 30% of my jobs fail on this

Steps to Reproduce:
1. setup 3x gfs2 filesystem
2. mount and run I/O on it
3. fence a node, wait for rejoin and remount the file systems
4. (we call this revolver)

Actual results:
FS not mounted, IPC credentials errors printed in syslog

Expected results:
FS mounted

Additional info:

Comment 1 Jaroslav Kortus 2013-10-11 14:04:38 UTC
Successful run of the same command later:

Oct 11 15:55:00 virt-139 qarshd[2642]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1438, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Starting up
Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Ready
Oct 11 15:55:00 virt-139 kernel: GFS2 (built Sep 30 2013 12:03:53) installed
Oct 11 15:55:00 virt-139 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS8560:STSRHTS85600"
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:00 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred found
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - client_pid = 1382, uid = 0, gid = 0
Oct 11 15:55:01 virt-139 corosync[1309]:   [IPC   ] IPC cred - Security valid
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: Joined cluster. Now mounting FS...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1, already locked for use
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Looking at journal...
Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Done


root      1438  0.0  0.1 134248  1904 ?        Ssl  15:39   0:00 gfs_controld
root      1382  0.0  0.5 218228  5948 ?        Ssl  15:39   0:00 dlm_controld

Comment 4 David Teigland 2013-10-11 15:25:26 UTC
It looks like something may have changed with IPC since 6.4.

Comment 5 Jan Friesse 2013-10-14 07:29:33 UTC
Jardo,
thanks for running test with debug packages. It helped me reduce problem space a lot (internal uidgid handling looks ok). Problem is really in CRED CMSG. Sadly that code wasn't touched for ages, so it seems to be ether new feature of kernel of libc, or ...

This is serious blocker (= must be included in 6.5) so I've changed priority.

Comment 7 Jan Friesse 2013-10-15 14:01:46 UTC
After digging to problem a little more. Problem seems to be kernel problem (not corosync one) and related to  SO_PASSCRED.

Next comment includes very simple (I would call it as minimal) reproducer.

Steps to reproduce:
gcc usock-to-stdout.c
./a.out /tmp/a

on different console execute:
while true;do echo "ahjo" | nc -U /tmp/a;done

Reproducible with kernel 2.6.32-419.el6.x86_64.

Comment 8 Jan Friesse 2013-10-15 14:04:50 UTC
Created attachment 812543 [details]
Reproducer

Reproducer

Comment 9 Jan Friesse 2013-10-18 09:15:09 UTC
Created attachment 813659 [details]
Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop)

Comment 10 Jan Friesse 2013-10-18 09:16:16 UTC
Created attachment 813660 [details]
Proposed patch - part 2 - Set SO_PASSCRED on listener socket

This patch is needed only if http://patchwork.ozlabs.org/patch/284366/ is accepted without any changes.

Comment 15 errata-xmlrpc 2014-10-14 07:11:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html


Note You need to log in before you can comment on or make changes to this bug.