Description of problem:
We don't have a clear picture of the problem yet because there are so many
different things going on and very little debugging is enabled in the default
RHEL5 installation. Here's what we know:

- Paul had a two-node GFS cluster on which he was storing the image files for
  virtual machines that are forming their own (non-GFS) cluster.
- He attempted to enable the rgmanager virtual service (on the command line
  with clusvcadm) for one of the virtual machines that was running.
- GFS on one of the physical machines withdrew after the dlm returned an
  unknown error:

Mar 26 10:01:58 ask-07 clurgmgrd[19198]: <notice> Recovering failed service vm:pk_4
Mar 26 10:02:00 ask-07 kernel: tap tap-10-51712: 2 getting info
Mar 26 10:02:01 ask-07 kernel: device vif10.0 entered promiscuous mode
Mar 26 10:02:01 ask-07 kernel: ADDRCONF(NETDEV_UP): vif10.0: link is not ready
Mar 26 10:02:01 ask-07 clurgmgrd[19198]: <notice> Service vm:pk_4 started
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: function = gfs_glock_xmote_th
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: file = /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_xen/src/gfs/glock.c, line = 1162
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: time = 1174921325
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: about to withdraw from the cluster
Mar 26 10:02:05 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: telling LM to withdraw
Mar 26 10:02:06 ask-07 kernel: GFS: fsid=ask_cluster:gfs.1: withdrawn
Mar 26 10:02:06 ask-07 kernel: [<ee73b8ae>] gfs_lm_withdraw+0x76/0x82 [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee74f812>] gfs_assert_withdraw_i+0x26/0x31 [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee7345b4>] gfs_glock_xmote_th+0x138/0x17c [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee7330e2>] run_queue+0x236/0x2f9 [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee733702>] gfs_glock_nq+0x31d/0x36f [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee74353a>] do_read_direct+0x76/0x17f [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee7421b0>] walk_vm+0xb5/0x274 [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee742499>] __gfs_read+0x43/0x4a [gfs]
Mar 26 10:02:06 ask-07 kernel: [<ee7434c4>] do_read_direct+0x0/0x17f [gfs]
Mar 26 10:02:06 ask-07 kernel: [<c047b6b5>] aio_pread+0x25/0x6f
Mar 26 10:02:06 ask-07 kernel: [<c047b690>] aio_pread+0x0/0x6f
Mar 26 10:02:06 ask-07 kernel: [<c047c023>] aio_run_iocb+0xbd/0x141
Mar 26 10:02:06 ask-07 kernel: [<c047c79c>] io_submit_one+0x24b/0x2a1
Mar 26 10:02:06 ask-07 kernel: [<c047cc39>] sys_io_submit+0x82/0xd8
Mar 26 10:02:06 ask-07 kernel: [<c0404cff>] syscall_call+0x7/0xb

Some of the processes that were running:

 3301 S    xenstored --pid-file /var/r  -
 3311 S    python /usr/sbin/xend start  wait
 3312 Sl   python /usr/sbin/xend start  stext
 3314 Sl   xenconsoled                  stext
 3316 Ssl  blktapctrl                   stext
 3708 Ss   /sbin/dhclient -1 -q -lf /v  -
 3840 Ssl  /sbin/ccsd                   stext
 3848 SLl  aisexec                      stext
 3864 Ss   /sbin/groupd                 -
 3872 Ss   /sbin/fenced                 -
 3878 Ss   /sbin/dlm_controld           -
 3884 Ss   /sbin/gfs_controld           -
 3971 Ssl  clvmd -T20                   stext
 3977 S<   [dlm_astd]                   dlm_astd
 3978 S<   [dlm_scand]                  -
 3979 S<   [dlm_recvd]                  dlm_recvd
 3980 S<   [dlm_sendd]                  dlm_sendd
 3981 S<   [dlm_recoverd]               dlm_recoverd
 4394 S<   [gfs_scand]                  -
 4395 S<   [gfs_glockd]                 gfs_glockd
 4398 S<   [gfs_recoverd]               -
 4399 S<   [gfs_logd]                   -
 4400 S<   [gfs_quotad]                 -
 4401 S<   [gfs_inoded]                 -
 4745 Rs   sshd: root@pts/8             -
 4755 Ss   -bash                        wait
 4918 R+   ps ax -o pid,stat,cmd,wchan  -
18461 S<s  ricci -u 100                 -
19197 S<s  clurgmgrd                    wait
19198 S<l  clurgmgrd                    stext
19199 S<   [dlm_recoverd]               dlm_recoverd
20496 Sl   tapdisk /dev/xen/tapctrlwri  stext
20520 Sl   /usr/lib/xen/bin/xen-vncfb   stext
20587 Sl   tapdisk /dev/xen/tapctrlwri  stext
20600 Sl   /usr/lib/xen/bin/xen-vncfb   stext
20676 Sl   tapdisk /dev/xen/tapctrlwri  stext
20689 Sl   /usr/lib/xen/bin/xen-vncfb   stext
20795 S<   [xvd 7]                      tap_blkif_schedule
20796 S<   [xvd 8]                      tap_blkif_schedule
20797 S<   [xvd 9]                      tap_blkif_schedule
29559 Ss   sshd: root@pts/0             -
29566 Ss+  -bash                        -
29657 S    /usr/libexec/gconfd-2 8      -
29669 S    python /usr/share/virt-mana  -
29931 Ss   sshd: root@pts/4             -
29933 Ss   -bash                        wait
30021 S+   tail -f /var/log/messages    -
31529 Dl   tapdisk /dev/xen/tapctrlwri  glock_wait_internal
31553 Sl   /usr/lib/xen/bin/xen-vncfb   stext
31637 S<   [xvd 10]                     tap_blkif_schedule

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
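For reference, the backtrace shows the read coming in through the AIO interface
with direct i/o (sys_io_submit -> aio_pread -> do_read_direct), which matches
the tapdisk processes in the ps listing. A minimal sketch of that access
pattern, assuming libaio and a made-up image path, would be something like:

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libaio.h>

#define BLKSZ 4096           /* must be a multiple of the sector size */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        /* O_DIRECT is what sends GFS down the do_read_direct path;
           the path here is just a placeholder */
        fd = open("/mnt/gfs/vm_image", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); exit(1); }

        /* O_DIRECT requires an aligned buffer */
        if (posix_memalign(&buf, BLKSZ, BLKSZ)) exit(1);

        if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); exit(1); }

        /* queue one read at offset 0 and wait for it, like aio_pread above */
        io_prep_pread(&cb, fd, buf, BLKSZ, 0);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); exit(1); }
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); exit(1); }

        printf("read %lu bytes\n", (unsigned long)ev.res);
        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}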
Hmm... this is working on the simpsons cluster - they're x86_64, using GFS to hold the virtual machines' root disks. Any hints as to what the VM was doing at the time (writing to disk, maybe)? Ryan has seen strange behavior where he has to rebuild the root disk images; I wonder if this is related.
The behavior I was seeing--apparent filesystem corruption in the guests--was happening when the guest disk images were LV-backed with the LVs on a clustered VG. I'm testing now with file-backed guest images on GFS to see if I can reproduce the behavior. So far, no problems, though.
Does this only happen with the direct I/O and AIO combination? How about the default mode (buffered I/O)?
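(For anyone trying the buffered mode: the difference is just the open flags. A
minimal sketch, again with a made-up path -- a plain read() without O_DIRECT
goes through GFS's buffered read path instead of the direct one:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = open("/mnt/gfs/vm_image", O_RDONLY);  /* no O_DIRECT */
        if (fd < 0) { perror("open"); return 1; }
        ssize_t n = read(fd, buf, sizeof(buf));        /* plain buffered read */
        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
}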
I think the problem is that we're not sure exactly what was doing the writing at the time of the panic... xend? Xen doing some hypervisor magic? That makes it hard to try it with other access methods :(
Yeah, I'm surprised the kernel didn't print a pid in the backtrace.... Would it be "tapdisk /dev/xen/tapctrlwri" from the ps?
Latest backtrace did not include aio/dio functions:

xenbr0: port 3(vif1.0) entering learning state
xenbr0: topology change detected, propagating
xenbr0: port 3(vif1.0) entering forwarding state
GFS: fsid=ask_cluster:gfs.0: fatal: assertion "!(lck_ret & LM_OUT_ERROR)" failed
GFS: fsid=ask_cluster:gfs.0: function = gfs_glock_xmote_th
GFS: fsid=ask_cluster:gfs.0: file = /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_xen/src/gfs/glock.c, line = 1162
GFS: fsid=ask_cluster:gfs.0: time = 1175069757
GFS: fsid=ask_cluster:gfs.0: about to withdraw from the cluster
GFS: fsid=ask_cluster:gfs.0: telling LM to withdraw
GFS: fsid=ask_cluster:gfs.0: withdrawn
 [<ee7388ae>] gfs_lm_withdraw+0x76/0x82 [gfs]
 [<ee74c812>] gfs_assert_withdraw_i+0x26/0x31 [gfs]
 [<ee7315b4>] gfs_glock_xmote_th+0x138/0x17c [gfs]
 [<ee7300e2>] run_queue+0x236/0x2f9 [gfs]
 [<ee730702>] gfs_glock_nq+0x31d/0x36f [gfs]
 [<ee732d6b>] gfs_glock_nq_atime+0x145/0x291 [gfs]
 [<ee740697>] do_read_buf+0x54/0x104 [gfs]
 [<ee73f1b0>] walk_vm+0xb5/0x274 [gfs]
 [<ee7407e7>] gfs_open+0x0/0x126 [gfs]
 [<c0460c76>] __dentry_open+0xea/0x1ab
 [<ee73f499>] __gfs_read+0x43/0x4a [gfs]
 [<ee740643>] do_read_buf+0x0/0x104 [gfs]
 [<ee73f4db>] gfs_read+0x0/0xe [gfs]
 [<ee73f4e6>] gfs_read+0xb/0xe [gfs]
 [<c046285c>] vfs_read+0x9f/0x141
 [<c0462caa>] sys_read+0x3c/0x63
 [<c0404cff>] syscall_call+0x7/0xb
=======================
This is probably related to the "altmode" code in the dlm that's used by gfs for apps doing direct-io (like blktap). It may also be related to the conversion-deadlock resolution code which is the current subject of bug 236277.

ask-08

Resource ebc532c0 Name (len=24) " 2 1faf64d"
Local Copy, Master is node 1
Granted Queue
Conversion Queue
000103d0 PR (--) Master: 00010028
Waiting Queue

ask-07

Resource e9846dc0 Name (len=24) " 2 1faf64d"
Master Copy
Granted Queue
00010028 PR Remote: 2 000103d0
Conversion Queue
Waiting Queue

 5111 Dl   tapdisk /dev/xen/tapctrlwri  glock_wait_internal
There are flags to enable two specialized features in the dlm:

1. CONVDEADLK causes the dlm to resolve conversion deadlocks internally by
   changing the granted mode of locks to NL.

2. ALTPR/ALTCW cause the dlm to change the requested mode of locks to PR or
   CW in order to grant them if the main mode can't be granted.

GFS direct i/o exercises both of these features, especially when mixed with
buffered i/o. The dlm has problems with both of them.

The first problem is on the master node. If the dlm demotes a lock as a part
of converting it, the actual step of converting the lock isn't being done
after the demotion; the lock is just left sitting on the granted queue with a
granted mode of NL. I think the mistaken assumption was that the call to
grant_pending_locks() would grant it, but that function naturally doesn't
look at locks on the granted queue.

The second problem is on the process node. If the master either demotes a
lock or gives it an altmode, the munging of the gr/rq modes is never done in
the process copy of the lock, leaving the master and process copies out of
sync.
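To make the first problem concrete, here's a toy model of the grant pass (not
the real dlm code -- the structs, queues, and names like grant_pending() are
simplified stand-ins for the description above). It shows why a lock whose
granted mode was munged to NL during conversion-deadlock resolution has to be
converted explicitly or left on the conversion queue: a scan that only walks
the conversion queue never revisits it.

/* Toy model of the conversion-deadlock problem described above.
 * Not the real dlm code: structs and names are simplified. */
#include <stdio.h>

enum mode { NL, CR, CW, PR, PW, EX };
static const char *mname[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

/* classic VMS/dlm lock-mode compatibility matrix */
static const int compat[6][6] = {
        /*        NL CR CW PR PW EX */
        /* NL */ { 1, 1, 1, 1, 1, 1 },
        /* CR */ { 1, 1, 1, 1, 1, 0 },
        /* CW */ { 1, 1, 1, 0, 0, 0 },
        /* PR */ { 1, 1, 0, 1, 0, 0 },
        /* PW */ { 1, 1, 0, 0, 0, 0 },
        /* EX */ { 1, 0, 0, 0, 0, 0 },
};

struct lkb {
        const char *name;
        int gr;          /* granted mode */
        int rq;          /* requested mode, -1 if no conversion pending */
        int converting;  /* still on the conversion queue? */
};

/* is lkb's requested mode compatible with every other granted mode? */
static int can_be_granted(struct lkb *all, int n, struct lkb *lkb)
{
        for (int i = 0; i < n; i++)
                if (&all[i] != lkb && !compat[lkb->rq][all[i].gr])
                        return 0;
        return 1;
}

/* like grant_pending_locks(): only considers locks still marked as
 * converting -- it never looks at locks parked on the granted queue */
static void grant_pending(struct lkb *all, int n)
{
        for (int i = 0; i < n; i++) {
                struct lkb *lkb = &all[i];
                if (lkb->converting && can_be_granted(all, n, lkb)) {
                        printf("granting %s %s -> %s\n", lkb->name,
                               mname[lkb->gr], mname[lkb->rq]);
                        lkb->gr = lkb->rq;
                        lkb->rq = -1;
                        lkb->converting = 0;
                }
        }
}

int main(void)
{
        /* two holders at PR, both converting to EX: neither EX request
         * is compatible with the other's granted PR -- conversion deadlock */
        struct lkb locks[2] = {
                { "A", PR, EX, 1 },
                { "B", PR, EX, 1 },
        };

        /* CONVDEADLK-style resolution: demote B's granted mode to NL
         * so A's conversion can proceed... */
        locks[1].gr = NL;
        /* BUG (as described above): B is also taken off the conversion
         * queue, leaving it sitting on the granted queue at NL with its
         * own conversion to EX never retried */
        locks[1].converting = 0;

        grant_pending(locks, 2);   /* grants A, never revisits B */

        for (int i = 0; i < 2; i++)
                printf("%s: gr=%s rq=%s\n", locks[i].name,
                       mname[locks[i].gr],
                       locks[i].rq < 0 ? "-" : mname[locks[i].rq]);
        return 0;
}

If B were instead left on the conversion queue (or converted immediately after
the demotion), the next grant pass would retry its EX request once A dropped or
downgraded its lock. And per the second problem, the same gr/rq munging done on
the master would also have to be applied to the process copy to keep the two in
sync.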
*** Bug 236277 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 153038 [details] patch posted to rhkernel on 04/19
Created attachment 153444 [details] same change as above, but the patch applies against the 5.0 kernel
in 2.6.18-17.el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html