Bug 315741
| Summary: | groupd: service cman start fails if local gfs2 is mounted |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | cman |
| Version: | 5.0 |
| Reporter: | Robert Peterson <rpeterso> |
| Assignee: | David Teigland <teigland> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | ccaulfield, cluster-maint |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | low |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2009-01-20 21:53:00 UTC |
Created attachment 311380 [details]
patch to test

Here's a fix, if someone wants to give it a try.

I gave this fix a try. It's a good theory, but the problem is that /sys/fs/gfs2/<table>/lock_module/* is not set up in sysfs for lock_nolock. Right now only lock_dlm creates those files, so this would require a change to lock_nolock. I can code that up if there's no better way. There's also the matter of gfs1 mounts, which won't work under this scheme either. Perhaps it would be better (and would require no change to either the gfs or gfs2 kernel modules) to have groupd record all gfs and gfs2 mount points found in /proc/mounts when it starts up, and ignore any file system on that list. After all, if a file system appears as type "gfs" or "gfs2" in /proc/mounts before groupd is running, it must be lock_nolock and should be treated as outside groupd's concern.

Created attachment 311504 [details]
patch to test
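The /proc/mounts bookkeeping suggested above amounts to something like the following sketch. This is illustrative only, not the actual groupd code; the sample file and mount entries are fabricated for the demonstration.

```shell
#!/bin/sh
# Illustrative sketch: list gfs/gfs2 mount points the way the proposed
# groupd startup scan would. /proc/mounts fields are:
#   device mountpoint fstype options dump pass
list_preexisting_gfs() {
    awk '$3 == "gfs" || $3 == "gfs2" { print $2 }' "$1"
}

# Fabricated /proc/mounts contents (hypothetical devices and paths)
cat > /tmp/mounts.sample <<'EOF'
/dev/sda1 / ext3 rw 0 0
/dev/sdc /mnt/gfs gfs2 rw,localflocks,localcaching 0 0
proc /proc proc rw 0 0
EOF

list_preexisting_gfs /tmp/mounts.sample
```

Any mount point printed by the scan would go on an "ignore" list that groupd consults before complaining about uncontrolled kernel objects.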
Oops, forgot that nolock didn't add anything to sysfs, but that just
makes the test simpler since the open will fail for nolock. AFAIK
this should work for either gfs1 or gfs2.
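The check the patch relies on can be sketched in shell. Only lock_dlm populates /sys/fs/gfs2/<table>/lock_module/, so a failed open means the mount is lock_nolock (or gfs1) and not cluster-managed; the table name below is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch (not the actual patch): probe the sysfs file
# that lock_dlm creates and lock_nolock does not.
probe_lock_module() {
    # $1 = file system table name, e.g. "bobs_exxon:exxon_lv" (hypothetical)
    if [ -r "/sys/fs/gfs2/$1/lock_module/id" ]; then
        echo managed        # lock_dlm: the cluster owns this mount
    else
        echo ignore         # open fails: nolock/gfs1, skip it
    fi
}

probe_lock_module "bobs_exxon:exxon_lv"
```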
This patch works okay for mounting, but now we still have a problem
with stopping the cman service. I haven't tracked it down. Observe:
[root@exxon-03 ~]# mount -tgfs2 /dev/sdc /mnt/gfs
[root@exxon-03 ~]# service cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]
[root@exxon-03 ~]# service clvmd start
Starting clvmd: [ OK ]
Activating VGs: 1 logical volume(s) in volume group "exxon_vg" now active
2 logical volume(s) in volume group "VolGroup00" now active
[ OK ]
[root@exxon-03 ~]# mount -tgfs2 /dev/exxon_vg/exxon_lv /mnt/gfs2
[root@exxon-03 ~]# ls /mnt/gfs2/
exxon-01
[root@exxon-03 ~]# umount /mnt/gfs2
[root@exxon-03 ~]# service clvmd stop
Deactivating VG exxon_vg: 0 logical volume(s) in volume group "exxon_vg" now active
clvmd not running on node exxon-02
[ OK ]
Stopping clvm: [ OK ]
[root@exxon-03 ~]# service cman stop
Stopping cluster:
Stopping fencing... failed
[FAILED]
[root@exxon-03 ~]# mount | grep gfs2
/dev/sdc on /mnt/gfs type gfs2 (rw,localflocks,localcaching)
[root@exxon-03 ~]# umount /mnt/gfs
[root@exxon-03 ~]# ps ax | tail
2981 ? S< 0:00 [glock_workqueue]
2982 ? S< 0:00 [glock_workqueue]
3057 ? Ssl 0:00 /sbin/ccsd
3066 ? SLl 0:00 aisexec
3074 ? Ss 0:00 /sbin/groupd
3082 ? Ss 0:00 /sbin/fenced
3088 ? Ss 0:00 /sbin/dlm_controld
3094 ? Ss 0:00 /sbin/gfs_controld
3300 pts/0 R+ 0:00 ps ax
3301 pts/0 S+ 0:00 tail
Note: the cman and clvmd services were started and stopped
simultaneously on all three nodes through cssh. The cman and
clvmd services stopped correctly on the other two nodes, but
the daemons were left running on the node that had the
lock_nolock file system mounted beforehand.
Dump of gfs_controld logs (from the umount only):
1215728220 kernel: remove@ bobs_exxon:exxon_lv
1215728220 exxon_lv get open /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/id
error -1 2
1215728220 exxon_lv ping_kernel_mount -1
1215728220 client 6: leave /mnt/gfs2 gfs2 0
1215728220 client 6 fd 11 dead
1215728220 client 6 fd -1 dead
1215728220 groupd cb: stop exxon_lv
1215728220 exxon_lv set /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/block to 1
1215728220 exxon_lv set open /sys/fs/gfs2/bobs_exxon:exxon_lv/lock_module/block
error -1 2
1215728220 exxon_lv do_stop skipped fs unmounted
1215728220 groupd cb: terminate exxon_lv
1215728220 exxon_lv purged 0 plocks for 0
1215728220 exxon_lv termination of our unmount leave
1215728769 client 6: dump
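In the gfs_controld dump above, the two numbers after "error" are the return value and errno: "error -1 2" is a failed open with errno 2 (ENOENT), i.e. the sysfs tree for this mount never existed, which is exactly the lock_nolock case. A quick illustration (the table name is fabricated):

```shell
#!/bin/sh
# errno 2 (ENOENT) is what gfs_controld logs as "error -1 2" when the
# sysfs tree for a lock_nolock mount is missing. Demonstrate with a
# path that does not exist:
if ! cat /sys/fs/gfs2/no_such_table/lock_module/block 2>/tmp/err.txt; then
    grep -q 'No such file' /tmp/err.txt && echo "open failed: ENOENT (2)"
fi
```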
Dump of groupd logs (again, only from the umount on):
1215728220 got client 4 leave
1215728220 1:exxon_lv got leave
1215728220 1:exxon_lv cpg_leave ok
1215728220 1:exxon_lv confchg left 1 joined 0 total 2
1215728220 1:exxon_lv confchg removed node 3 reason 2
1215728220 1:exxon_lv process_node_leave 3
1215728220 1:exxon_lv cpg del node 3 total 2
1215728220 1:exxon_lv make_event_id 300020002 nodeid 3 memb_count 2 type 2
1215728220 1:exxon_lv queue leave event for nodeid 3
1215728220 1:exxon_lv process_current_event 300020002 3 LEAVE_BEGIN
1215728220 1:exxon_lv action for app: stop exxon_lv
1215728220 got client 4 stop_done
1215728220 1:exxon_lv send stopped
1215728220 1:exxon_lv waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 3 stopped
1215728220 1:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 1 stopped
1215728220 1:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 1:exxon_lv mark node 2 stopped
1215728220 1:exxon_lv process_current_event 300020002 3 LEAVE_ALL_STOPPED
1215728220 1:exxon_lv finalize_our_leave
1215728220 1:exxon_lv action for app: terminate exxon_lv
1215728220 got client 5 leave
1215728220 2:exxon_lv got leave
1215728220 2:exxon_lv cpg_leave ok
1215728220 2:exxon_lv confchg left 1 joined 0 total 2
1215728220 2:exxon_lv confchg removed node 3 reason 2
1215728220 2:exxon_lv process_node_leave 3
1215728220 2:exxon_lv cpg del node 3 total 2
1215728220 2:exxon_lv make_event_id 300020002 nodeid 3 memb_count 2 type 2
1215728220 2:exxon_lv queue leave event for nodeid 3
1215728220 2:exxon_lv process_current_event 300020002 3 LEAVE_BEGIN
1215728220 2:exxon_lv action for app: stop exxon_lv
1215728220 got client 5 stop_done
1215728220 2:exxon_lv send stopped
1215728220 2:exxon_lv waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 3 stopped
1215728220 2:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 1 stopped
1215728220 2:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728220 2:exxon_lv mark node 2 stopped
1215728220 2:exxon_lv process_current_event 300020002 3 LEAVE_ALL_STOPPED
1215728220 2:exxon_lv finalize_our_leave
1215728220 2:exxon_lv action for app: terminate exxon_lv
1215728226 1:clvmd confchg left 1 joined 0 total 2
1215728226 1:clvmd confchg removed node 2 reason 2
1215728226 1:clvmd process_node_leave 2
1215728226 1:clvmd cpg del node 2 total 2
1215728226 1:clvmd make_event_id 200020002 nodeid 2 memb_count 2 type 2
1215728226 1:clvmd queue leave event for nodeid 2
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_BEGIN
1215728226 1:clvmd action for app: stop clvmd
1215728226 got client 4 stop_done
1215728226 1:clvmd send stopped
1215728226 1:clvmd waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 2 stopped
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 3 stopped
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728226 1:clvmd mark node 1 stopped
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_ALL_STOPPED
1215728226 1:clvmd app node leave: del 2 total 2
1215728226 1:clvmd action for app: start clvmd 13 3 2 3 1
1215728226 got client 4 start_done
1215728226 1:clvmd send started
1215728226 1:clvmd mark node 1 started
1215728226 1:clvmd mark node 3 started
1215728226 1:clvmd process_current_event 200020002 2 LEAVE_ALL_STARTED
1215728226 1:clvmd action for app: finish clvmd 13
1215728226 got client 4 leave
1215728226 1:clvmd got leave
1215728226 1:clvmd cpg_leave ok
1215728226 1:clvmd confchg left 1 joined 0 total 1
1215728226 1:clvmd confchg removed node 3 reason 2
1215728226 1:clvmd process_node_leave 3
1215728226 1:clvmd cpg del node 3 total 1
1215728226 1:clvmd make_event_id 300010002 nodeid 3 memb_count 1 type 2
1215728226 1:clvmd queue leave event for nodeid 3
1215728226 1:clvmd process_current_event 300010002 3 LEAVE_BEGIN
1215728226 1:clvmd action for app: stop clvmd
1215728226 got client 4 stop_done
1215728226 1:clvmd send stopped
1215728226 1:clvmd waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd mark node 3 stopped
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 3
1215728226 1:clvmd mark node 1 stopped
1215728226 1:clvmd process_current_event 300010002 3 LEAVE_ALL_STOPPED
1215728226 1:clvmd finalize_our_leave
1215728226 1:clvmd action for app: terminate clvmd
1215728240 0:default confchg left 1 joined 0 total 2
1215728240 0:default confchg removed node 1 reason 2
1215728240 0:default process_node_leave 1
1215728240 0:default cpg del node 1 total 2
1215728240 0:default make_event_id 100020002 nodeid 1 memb_count 2 type 2
1215728240 0:default queue leave event for nodeid 1
1215728240 0:default process_current_event 100020002 1 LEAVE_BEGIN
1215728240 0:default action for app: stop default
1215728240 got client 3 stop_done
1215728240 0:default send stopped
1215728240 0:default waiting for 3 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 1 stopped
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 2 stopped
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 1
1215728240 0:default mark node 3 stopped
1215728240 0:default process_current_event 100020002 1 LEAVE_ALL_STOPPED
1215728240 0:default app node leave: del 1 total 2
1215728240 0:default action for app: start default 15 3 2 3 2
1215728240 got client 3 start_done
1215728240 0:default send started
1215728240 0:default mark node 3 started
1215728240 0:default mark node 2 started
1215728240 0:default process_current_event 100020002 1 LEAVE_ALL_STARTED
1215728240 0:default action for app: finish default 15
1215728240 0:default confchg left 1 joined 0 total 1
1215728240 0:default confchg removed node 2 reason 2
1215728240 0:default process_node_leave 2
1215728240 0:default cpg del node 2 total 1
1215728240 0:default make_event_id 200010002 nodeid 2 memb_count 1 type 2
1215728240 0:default queue leave event for nodeid 2
1215728240 0:default process_current_event 200010002 2 LEAVE_BEGIN
1215728240 0:default action for app: stop default
1215728240 got client 3 stop_done
1215728240 0:default send stopped
1215728240 0:default waiting for 2 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default mark node 2 stopped
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1215728240 0:default mark node 3 stopped
1215728240 0:default process_current_event 200010002 2 LEAVE_ALL_STOPPED
1215728240 0:default app node leave: del 2 total 1
1215728240 0:default action for app: start default 16 3 1 3
1215728240 got client 3 start_done
1215728240 0:default send started
1215728240 0:default mark node 3 started
1215728240 0:default process_current_event 200010002 2 LEAVE_ALL_STARTED
1215728240 0:default action for app: finish default 16
1215728257 cman: node 1 removed
1215728257 add_recovery_set_cman nodeid 1
1215728257 cman: lost quorum
1215728257 cman: node 2 removed
1215728257 add_recovery_set_cman nodeid 2
1215728257 groupd confchg total 2 left 1 joined 0
1215728257 add_recovery_set_cpg nodeid 1 procdown 0
1215728257 free unused recovery set 1 cpg
1215728272 cman: have quorum
1215728272 groupd confchg total 1 left 1 joined 0
1215728272 add_recovery_set_cpg nodeid 2 procdown 0
1215728272 free unused recovery set 2 cpg
1215728831 client connection 7
1215728831 got client 7 dump
fence log (the whole thing):
[root@exxon-03 ~]# group_tool dump fence
1215728182 our_nodeid 3 our_name exxon-03
1215728182 listen 4 member 5 groupd 7
1215728183 client 3: join default
1215728183 delay post_join 1800s post_fail 20s
1215728183 added 3 nodes from ccs
1215728183 setid default 65538
1215728183 start default 1 members 3 2
1215728183 do_recovery stop 0 start 1 finish 0
1215728183 finish default 1
1215728183 stop default
1215728183 start default 2 members 1 3 2
1215728183 do_recovery stop 1 start 2 finish 1
1215728183 finish default 2
1215728240 stop default
1215728240 start default 15 members 3 2
1215728240 do_recovery stop 2 start 15 finish 2
1215728240 add node 1 to list 3
1215728240 finish default 15
1215728240 stop default
1215728240 start default 16 members 3
1215728240 do_recovery stop 15 start 16 finish 15
1215728240 add node 2 to list 3
1215728240 finish default 16
1215728993 client 3: dump
[root@exxon-03 ~]#
Pushed to git even though I've not been able to verify it.

RHEL5 commit aee3ab578619a22e03cc9c9d024647927df479b0
STABLE2 commit 405dbcb97c1dbd5136664948f96c2bec285ac489

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html
Description of problem:
I had a local gfs2 file system mounting at init due to an entry in /etc/fstab. When I tried to start the cman service script, it failed. I took the entry out of fstab and was able to recreate the failure just by mounting my local gfs2 file system before starting the cman init script.

Version-Release number of selected component (if applicable):
RHEL51 Beta

How reproducible:
At will.

Steps to Reproduce:
1. mkfs.gfs2 -O /dev/sdb1
2. mount -tgfs2 /dev/sdb1 /mnt/gfs2
3. service cman start

Actual results:
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... failed
[FAILED]

Expected results:
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]

Additional info:
Apparently reported by Russell Cattelan a long time ago, but it never made it into a bug record until now.

Excerpt from syslog:
Oct 2 12:53:42 roth-01 ccsd[3055]: Initial status:: Quorate
Oct 2 12:53:42 roth-01 groupd[3073]: found uncontrolled kernel object sdb1 in /sys/fs/gfs2
Oct 2 12:53:42 roth-01 groupd[3073]: local node must be reset to clear 1 uncontrolled instances of gfs and/or dlm
Oct 2 12:53:42 roth-01 openais[3061]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application
Oct 2 12:53:42 roth-01 fence_node[3074]: Fence of "roth-01" was unsuccessful
Oct 2 12:53:42 roth-01 fenced[3080]: cman_init error 0 111
Oct 2 12:53:42 roth-01 gfs_controld[3092]: cman_init error 111
Oct 2 12:53:52 roth-01 dlm_controld[3086]: group_init error 0 111
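The groupd message "found uncontrolled kernel object sdb1 in /sys/fs/gfs2" comes from a startup scan: anything already present in the gfs/gfs2 (or dlm) sysfs tree when groupd starts is a kernel object it did not set up. A hedged sketch of that scan, taking the directory as an argument so it can run against a fabricated tree:

```shell
#!/bin/sh
# Illustrative sketch of the startup scan behind groupd's
# "found uncontrolled kernel object" message; not the actual code.
scan_uncontrolled() {
    dir="$1"
    [ -d "$dir" ] || return 0
    for obj in "$dir"/*; do
        [ -e "$obj" ] || continue   # glob matched nothing: empty dir
        echo "found uncontrolled kernel object $(basename "$obj") in $dir"
    done
}

# Hypothetical demo against a fabricated sysfs tree:
mkdir -p /tmp/fake_sysfs_gfs2/sdb1
scan_uncontrolled /tmp/fake_sysfs_gfs2
```

In the real failure, the pre-existing lock_nolock mount puts sdb1 in /sys/fs/gfs2 before groupd starts, so groupd concludes the node must be reset and the fencing startup step fails.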