Hide Forgot
Description of problem: Sometimes after the fenced node is rebooted the mount of GFS2 file system fails. [2013-10-11 15:39:44.443563] [revolver-gfs2] mounting /dev/mapper/STSRHTS8560-STSRHTS85600 on /mnt/STSRHTS85600 on virt-139.cluster-qe.lab.eng.brq.redhat.com [2013-10-11 15:39:44.528422] [revolver-gfs2] gfs_controld join error: -107 [2013-10-11 15:39:44.528636] [revolver-gfs2] error mounting lockproto lock_dlm With debug package of corosync we can get more details: Oct 11 15:39:44 virt-139 qarshd[2266]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600 Oct 11 15:39:44 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:39:44 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 0, uid = 4294967295, gid = 4294967295 Oct 11 15:39:44 virt-139 corosync[1309]: [IPC ] IPC cred - Security invalid Oct 11 15:39:44 virt-139 corosync[1309]: [IPC ] Invalid IPC credentials. Oct 11 15:39:44 virt-139 gfs_controld[1438]: cpg_initialize error 100 Oct 11 15:39:44 virt-139 gfs_controld[1438]: join: group join error -107 Oct 11 15:39:44 virt-139 gfs_controld[1438]: process_dlmcontrol no mg STSRHTS85600 Version-Release number of selected component (if applicable): gfs2-utils-3.0.12.1-59.el6.x86_64 kernel-2.6.32-421.el6.x86_64 (20131007.n.0 snap) How reproducible: like 30% of my jobs fail on this Steps to Reproduce: 1. setup 3x gfs2 filesystem 2. mount and run I/O on it 3. fence a node, wait for rejoin and remount the file systems 4. (we call this revolver) Actual results: FS not mounted, IPC credentials errors printed in syslog Expected results: FS mounted Additional info:
Successful run of the same command later: Oct 11 15:55:00 virt-139 qarshd[2642]: Running cmdline: mount -t gfs2 /dev/mapper/STSRHTS8560-STSRHTS85600 /mnt/STSRHTS85600 Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1438, uid = 0, gid = 0 Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1438, uid = 0, gid = 0 Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Starting up Oct 11 15:55:00 virt-139 kernel: Slow work thread pool: Ready Oct 11 15:55:00 virt-139 kernel: GFS2 (built Sep 30 2013 12:03:53) installed Oct 11 15:55:00 virt-139 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "STSRHTS8560:STSRHTS85600" Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:00 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred found Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - client_pid = 1382, uid = 0, gid = 0 Oct 11 15:55:01 virt-139 corosync[1309]: [IPC ] IPC cred - Security valid Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: Joined cluster. Now mounting FS... Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1, already locked for use Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Looking at journal... Oct 11 15:55:01 virt-139 kernel: GFS2: fsid=STSRHTS8560:STSRHTS85600.1: jid=1: Done root 1438 0.0 0.1 134248 1904 ? Ssl 15:39 0:00 gfs_controld root 1382 0.0 0.5 218228 5948 ? Ssl 15:39 0:00 dlm_controld
It looks like something may have changed with IPC since 6.4.
Jardo, thanks for running test with debug packages. It helped me reduce problem space a lot (internal uidgid handling looks ok). Problem is really in CRED CMSG. Sadly that code wasn't touched for ages, so it seems to be ether new feature of kernel of libc, or ... This is serious blocker (= must be included in 6.5) so I've changed priority.
After digging to problem a little more. Problem seems to be kernel problem (not corosync one) and related to SO_PASSCRED. Next comment includes very simple (I would call it as minimal) reproducer. Steps to reproduce: gcc usock-to-stdout.c ./a.out /tmp/a on different console execute: while true;do echo "ahjo" | nc -U /tmp/a;done Reproducible with kernel 2.6.32-419.el6.x86_64.
Created attachment 812543 [details] Reproducer Reproducer
Created attachment 813659 [details] Proposed patch - part 1 - (required) Backport LibQB handling of CMSG (for loop)
Created attachment 813660 [details] Proposed patch - part 2 - Set SO_PASSCRED on listener socket This patch is needed only if http://patchwork.ozlabs.org/patch/284366/ is accepted without any changes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1508.html