Bug 315741 - Groupd: service cman start fails if local gfs2 is mounted
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Platform: All Linux
Priority: low   Severity: low
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2007-10-02 14:31 EDT by Robert Peterson
Modified: 2009-04-16 19:03 EDT
CC List: 2 users

Doc Type: Bug Fix
Last Closed: 2009-01-20 16:53:00 EST

Attachments
patch to test (1.80 KB, text/plain)
2008-07-09 10:58 EDT, David Teigland
patch to test (1.78 KB, text/plain)
2008-07-10 15:06 EDT, David Teigland

Description Robert Peterson 2007-10-02 14:31:11 EDT
Description of problem:
I had a local gfs2 file system that was mounted at init time due to an
entry in /etc/fstab.  When I tried to start the cman service script, it
failed.

I took the entry out of fstab and was able to recreate the failure just
by mounting my local gfs2 file system before starting the cman init
script.

Version-Release number of selected component (if applicable):
RHEL51 Beta

How reproducible:
At will.

Steps to Reproduce:
mkfs.gfs2 -O /dev/sdb1
mount -tgfs2 /dev/sdb1 /mnt/gfs2
service cman start

Actual results:
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... failed
                                                           [FAILED]

Expected results:
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]

Additional info:
Apparently reported by Russell Cattelan a long time ago, but it never
made it into a bug record until now.

Excerpt from syslog:

Oct  2 12:53:42 roth-01 ccsd[3055]: Initial status:: Quorate 
Oct  2 12:53:42 roth-01 groupd[3073]: found uncontrolled kernel object sdb1 in /sys/fs/gfs2
Oct  2 12:53:42 roth-01 groupd[3073]: local node must be reset to clear 1 uncontrolled instances of gfs and/or dlm
Oct  2 12:53:42 roth-01 openais[3061]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application
Oct  2 12:53:42 roth-01 fence_node[3074]: Fence of "roth-01" was unsuccessful 
Oct  2 12:53:42 roth-01 fenced[3080]: cman_init error 0 111
Oct  2 12:53:42 roth-01 gfs_controld[3092]: cman_init error 111
Oct  2 12:53:52 roth-01 dlm_controld[3086]: group_init error 0 111
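The message about the uncontrolled kernel object comes from groupd
scanning the gfs/dlm sysfs trees at startup; anything already present
there that groupd did not set up itself is treated as "uncontrolled"
and the node is told to reset.  Roughly, the scan amounts to something
like the following (an illustrative sketch, not the actual groupd code):

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
	DIR *d = opendir("/sys/fs/gfs2");
	struct dirent *de;
	int count = 0;

	if (!d)
		return 0;	/* no gfs2 filesystems known to the kernel */

	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		/* each entry here is a gfs2 filesystem the kernel has
		   mounted; groupd treats any it did not set up itself
		   as "uncontrolled" */
		printf("found kernel object %s in /sys/fs/gfs2\n", de->d_name);
		count++;
	}
	closedir(d);
	return count ? 1 : 0;
}

A lock_nolock mount made before the cman init script runs therefore
shows up in that scan, which is exactly what the syslog excerpt shows.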
Comment 3 David Teigland 2008-07-09 10:58:07 EDT
Created attachment 311380 [details]
patch to test

Here's a fix, if someone wants to give it a try.
Comment 4 Robert Peterson 2008-07-10 14:51:09 EDT
I gave this fix a try.  It's a good theory, but the problem is that 
/sys/fs/gfs2/<table>/lock_module/* is not set up in
sysfs for lock_nolock.  Right now, only lock_dlm creates that, so
it would require a change to lock_nolock.  I can code that up if
there's not a better way.  There's also the matter of gfs (gfs1) mounts,
which won't work under this scheme either.

Perhaps it would be better (and not require a kernel change nor gfs
kernel change) to have groupd keep track of all gfs and gfs2 mount
points found in /proc/mounts when it starts up, and ignore them if
they're in that list.  After all, if they appear as type "gfs" in
/proc/mounts prior to groupd running, they must be lock_nolock and
they should be treated as outside the concern of groupd.
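
Roughly what I have in mind (just a sketch of the idea, with made-up
names, not a patch): at startup groupd would record every gfs/gfs2
mount already listed in /proc/mounts, and later skip those entries
when deciding what counts as uncontrolled.

#include <stdio.h>
#include <string.h>
#include <mntent.h>

#define MAX_IGNORED 64

static char ignored_fs[MAX_IGNORED][256];
static int ignored_count;

/* Record gfs/gfs2 filesystems that were already mounted before we
   started; they must have been mounted outside the cluster (e.g.
   lock_nolock), so they are not our concern. */
static void record_preexisting_mounts(void)
{
	FILE *f = setmntent("/proc/mounts", "r");
	struct mntent *me;

	if (!f)
		return;

	while ((me = getmntent(f)) != NULL) {
		if (strcmp(me->mnt_type, "gfs") && strcmp(me->mnt_type, "gfs2"))
			continue;
		if (ignored_count < MAX_IGNORED) {
			snprintf(ignored_fs[ignored_count],
				 sizeof(ignored_fs[0]), "%s", me->mnt_fsname);
			ignored_count++;
		}
	}
	endmntent(f);
}

/* Later, when an "uncontrolled" object is found, skip it if it was
   recorded above. */
static int is_preexisting(const char *fsname)
{
	int i;

	for (i = 0; i < ignored_count; i++)
		if (!strcmp(ignored_fs[i], fsname))
			return 1;
	return 0;
}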
Comment 5 David Teigland 2008-07-10 15:06:50 EDT
Created attachment 311504 [details]
patch to test

Oops, forgot that nolock didn't add anything to sysfs, but that just
makes the test simpler since the open will fail for nolock.  AFAIK
this should work for either gfs1 or gfs2.
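
So the check becomes: for each name found under the gfs/gfs2 sysfs
directory, try to open one of its lock_module files; if the open
fails, the filesystem must be using lock_nolock and can be ignored.
Something like this (an illustrative sketch with hypothetical names;
the real change is in the attached patch):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>

/* Return 1 if the filesystem named `name` under `sysfs_dir` (e.g.
   "/sys/fs/gfs2") looks like a cluster mount, 0 if it looks like
   lock_nolock.  Only lock_dlm creates the lock_module entries, so a
   failed open means the object is not ours to control. */
static int is_cluster_fs(const char *sysfs_dir, const char *name)
{
	char path[PATH_MAX];
	int fd;

	snprintf(path, sizeof(path), "%s/%s/lock_module/block",
		 sysfs_dir, name);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return 0;	/* no lock_module: assume lock_nolock */

	close(fd);
	return 1;
}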
Comment 6 Robert Peterson 2008-07-10 18:28:54 EDT
This patch works okay for mounting, but now we still have a problem
with stopping the cman service.  I haven't tracked it down.  Observe:

[root@exxon-03 ~]# mount -tgfs2 /dev/sdc /mnt/gfs
[root@exxon-03 ~]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@exxon-03 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:   1 logical volume(s) in volume group "exxon_vg" now active
  2 logical volume(s) in volume group "VolGroup00" now active
                                                           [  OK  ]
[root@exxon-03 ~]# mount -tgfs2 /dev/exxon_vg/exxon_lv /mnt/gfs2
[root@exxon-03 ~]# ls /mnt/gfs2/
exxon-01
[root@exxon-03 ~]# umount /mnt/gfs2
[root@exxon-03 ~]# service clvmd stop
Deactivating VG exxon_vg:   0 logical volume(s) in volume group "exxon_vg" now active
  clvmd not running on node exxon-02
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
[root@exxon-03 ~]# service cman stop
Stopping cluster: 
   Stopping fencing... failed

                                                           [FAILED]
[root@exxon-03 ~]# mount | grep gfs2
/dev/sdc on /mnt/gfs type gfs2 (rw,localflocks,localcaching)
[root@exxon-03 ~]# umount /mnt/gfs
[root@exxon-03 ~]# ps ax | tail
 2981 ?        S<     0:00 [glock_workqueue]
 2982 ?        S<     0:00 [glock_workqueue]
 3057 ?        Ssl    0:00 /sbin/ccsd
 3066 ?        SLl    0:00 aisexec
 3074 ?        Ss     0:00 /sbin/groupd
 3082 ?        Ss     0:00 /sbin/fenced
 3088 ?        Ss     0:00 /sbin/dlm_controld
 3094 ?        Ss     0:00 /sbin/gfs_controld
 3300 pts/0    R+     0:00 ps ax
 3301 pts/0    S+     0:00 tail

Note: the cman and clvmd services were started and stopped
simultaneously on all three nodes through cssh.  The cman and
clvmd services stopped correctly on the other two nodes, but the
daemons were left running on the node that had the lock_nolock
file system mounted beforehand.

Dump of gfs_controld logs (from the umount only):
1215728220 kernel: remove@ bobs_exxon:exxon_lv
1215728220 exxon_lv get open /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/id
error -1 2
1215728220 exxon_lv ping_kernel_mount -1
1215728220 client 6: leave /mnt/gfs2 gfs2 0
1215728220 client 6 fd 11 dead
1215728220 client 6 fd -1 dead
1215728220 groupd cb: stop exxon_lv
1215728220 exxon_lv set /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/block to 1
1215728220 exxon_lv set open /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/block
error -1 2
1215728220 exxon_lv do_stop skipped fs unmounted
1215728220 groupd cb: terminate exxon_lv
1215728220 exxon_lv purged 0 plocks for 0
1215728220 exxon_lv termination of our unmount leave
1215728769 client 6: dump

Dump of groupd logs (again, only from the umount onward):
1215728220 got client 4 leave
1215728220 1:exxon_lv got leave
1215728220 1:exxon_lv cpg_leave ok
1215728220 1:exxon_lv confchg left 1 joined 0 total 2
1215728220 1:exxon_lv confchg removed node 3 reason 2
1215728220 1:exxon_lv process_node_leave 3
1215728220 1:exxon_lv cpg del node 3 total 2
1215728220 1:exxon_lv make_event_id 300020002 nodeid 3 memb_count 2 type 2
1215728220 1:exxon_lv queue leave event for nodeid 3
1215728220 1:exxon_lv process_current_event 300020002 3 LEAVE_BEGIN
1215728220 1:exxon_lv action for app: stop exxon_lv
1215728220 got client 4 stop_done
1215728220 1:exxon_lv send stopped
1215728220 1:exxon_lv waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 3 stopped
1215728220 1:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 1 stopped
1215728220 1:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 2 stopped
1215728220 1:exxon_lv process_current_event 300020002 3 LEAVE_ALL_STOPPED
1215728220 1:exxon_lv finalize_our_leave
1215728220 1:exxon_lv action for app: terminate exxon_lv
1215728220 got client 5 leave
1215728220 2:exxon_lv got leave
1215728220 2:exxon_lv cpg_leave ok
1215728220 2:exxon_lv confchg left 1 joined 0 total 2
1215728220 2:exxon_lv confchg removed node 3 reason 2
1215728220 2:exxon_lv process_node_leave 3
1215728220 2:exxon_lv cpg del node 3 total 2
1215728220 2:exxon_lv make_event_id 300020002 nodeid 3 memb_count 2 type 2
1215728220 2:exxon_lv queue leave event for nodeid 3
1215728220 2:exxon_lv process_current_event 300020002 3 LEAVE_BEGIN
1215728220 2:exxon_lv action for app: stop exxon_lv
1215728220 got client 5 stop_done
1215728220 2:exxon_lv send stopped
1215728220 2:exxon_lv waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 3 stopped
1215728220 2:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 1 stopped
1215728220 2:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 2 stopped
1215728220 2:exxon_lv process_current_event 300020002 3 LEAVE_ALL_STOPPED
1215728220 2:exxon_lv finalize_our_leave
1215728220 2:exxon_lv action for app: terminate exxon_lv
1215728226 1:clvmd confchg left 1 joined 0 total 2
1215728226 1:clvmd confchg removed node 2 reason 2
1215728226 1:clvmd process_node_leave 2
1215728226 1:clvmd cpg del node 2 total 2
1215728226 1:clvmd make_event_id 200020002 nodeid 2 memb_count 2 type 2
1215728226 1:clvmd queue leave event for nodeid 2
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_BEGIN
1215728226 1:clvmd action for app: stop clvmd
1215728226 got client 4 stop_done
1215728226 1:clvmd send stopped
1215728226 1:clvmd waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 2 stopped
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 3 stopped
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 1 stopped
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_ALL_STOPPED
1215728226 1:clvmd app node leave: del 2 total 2
1215728226 1:clvmd action for app: start clvmd 13 3 2 3 1
1215728226 got client 4 start_done
1215728226 1:clvmd send started
1215728226 1:clvmd mark node 1 started
1215728226 1:clvmd mark node 3 started
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_ALL_STARTED
1215728226 1:clvmd action for app: finish clvmd 13
1215728226 got client 4 leave
1215728226 1:clvmd got leave
1215728226 1:clvmd cpg_leave ok
1215728226 1:clvmd confchg left 1 joined 0 total 1
1215728226 1:clvmd confchg removed node 3 reason 2
1215728226 1:clvmd process_node_leave 3
1215728226 1:clvmd cpg del node 3 total 1
1215728226 1:clvmd make_event_id 300010002 nodeid 3 memb_count 1 type 2
1215728226 1:clvmd queue leave event for nodeid 3
1215728226 1:clvmd process_current_event 300010002 3 LEAVE_BEGIN
1215728226 1:clvmd action for app: stop clvmd
1215728226 got client 4 stop_done
1215728226 1:clvmd send stopped
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd mark node 3 stopped
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd mark node 1 stopped
1215728226 1:clvmd process_current_event 300010002 3 LEAVE_ALL_STOPPED
1215728226 1:clvmd finalize_our_leave
1215728226 1:clvmd action for app: terminate clvmd
1215728240 0:default confchg left 1 joined 0 total 2
1215728240 0:default confchg removed node 1 reason 2
1215728240 0:default process_node_leave 1
1215728240 0:default cpg del node 1 total 2
1215728240 0:default make_event_id 100020002 nodeid 1 memb_count 2 type 2
1215728240 0:default queue leave event for nodeid 1
1215728240 0:default process_current_event 100020002 1 LEAVE_BEGIN
1215728240 0:default action for app: stop default
1215728240 got client 3 stop_done
1215728240 0:default send stopped
1215728240 0:default waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 1 stopped
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 2 stopped
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 3 stopped
1215728240 0:default process_current_event 100020002 1 LEAVE_ALL_STOPPED
1215728240 0:default app node leave: del 1 total 2
1215728240 0:default action for app: start default 15 3 2 3 2
1215728240 got client 3 start_done
1215728240 0:default send started
1215728240 0:default mark node 3 started
1215728240 0:default mark node 2 started
1215728240 0:default process_current_event 100020002 1 LEAVE_ALL_STARTED
1215728240 0:default action for app: finish default 15
1215728240 0:default confchg left 1 joined 0 total 1
1215728240 0:default confchg removed node 2 reason 2
1215728240 0:default process_node_leave 2
1215728240 0:default cpg del node 2 total 1
1215728240 0:default make_event_id 200010002 nodeid 2 memb_count 1 type 2
1215728240 0:default queue leave event for nodeid 2
1215728240 0:default process_current_event 200010002 2 LEAVE_BEGIN
1215728240 0:default action for app: stop default
1215728240 got client 3 stop_done
1215728240 0:default send stopped
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default mark node 2 stopped
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default mark node 3 stopped
1215728240 0:default process_current_event 200010002 2 LEAVE_ALL_STOPPED
1215728240 0:default app node leave: del 2 total 1
1215728240 0:default action for app: start default 16 3 1 3
1215728240 got client 3 start_done
1215728240 0:default send started
1215728240 0:default mark node 3 started
1215728240 0:default process_current_event 200010002 2 LEAVE_ALL_STARTED
1215728240 0:default action for app: finish default 16
1215728257 cman: node 1 removed
1215728257 add_recovery_set_cman nodeid 1
1215728257 cman: lost quorum
1215728257 cman: node 2 removed
1215728257 add_recovery_set_cman nodeid 2
1215728257 groupd confchg total 2 left 1 joined 0
1215728257 add_recovery_set_cpg nodeid 1 procdown 0
1215728257 free unused recovery set 1 cpg
1215728272 cman: have quorum
1215728272 groupd confchg total 1 left 1 joined 0
1215728272 add_recovery_set_cpg nodeid 2 procdown 0
1215728272 free unused recovery set 2 cpg
1215728831 client connection 7
1215728831 got client 7 dump

fence log (the whole thing):
[root@exxon-03 ~]# group_tool dump fence
1215728182 our_nodeid 3 our_name exxon-03
1215728182 listen 4 member 5 groupd 7
1215728183 client 3: join default
1215728183 delay post_join 1800s post_fail 20s
1215728183 added 3 nodes from ccs
1215728183 setid default 65538
1215728183 start default 1 members 3 2 
1215728183 do_recovery stop 0 start 1 finish 0
1215728183 finish default 1
1215728183 stop default
1215728183 start default 2 members 1 3 2 
1215728183 do_recovery stop 1 start 2 finish 1
1215728183 finish default 2
1215728240 stop default
1215728240 start default 15 members 3 2 
1215728240 do_recovery stop 2 start 15 finish 2
1215728240 add node 1 to list 3
1215728240 finish default 15
1215728240 stop default
1215728240 start default 16 members 3 
1215728240 do_recovery stop 15 start 16 finish 15
1215728240 add node 2 to list 3
1215728240 finish default 16
1215728993 client 3: dump
[root@exxon-03 ~]# 
Comment 7 David Teigland 2008-07-31 13:52:56 EDT
Pushed to git, even though I've not been able to verify it.
RHEL5 commit aee3ab578619a22e03cc9c9d024647927df479b0
STABLE2 commit 405dbcb97c1dbd5136664948f96c2bec285ac489
Comment 11 errata-xmlrpc 2009-01-20 16:53:00 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html
