Bug 832156
| Summary: | RFE: Support customizable actions when sanlock leases are lost |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Reporter: | Daniel Berrangé <berrange> |
| Component: | libvirt |
| Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 6.4 |
| CC: | acathrow, ajia, cpelland, dallan, dyasny, dyuan, fsimonce, lsu, mzhan, rwu, teigland, weizhan |
| Target Milestone: | rc |
| Keywords: | FutureFeature |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | libvirt-0.10.2-4.el6 |
| Doc Type: | Enhancement |
| Type: | Bug |
| Last Closed: | 2013-02-21 07:17:32 UTC |
| Bug Depends On: | 820173, 886421 |
Description
Daniel Berrangé
2012-06-14 17:08:34 UTC
*** Bug 829316 has been marked as a duplicate of this bug. ***

This is now implemented upstream by commits v0.10.2-123-gd0ea530 through v0.10.2-128-g8936476:
commit d0ea530b00e69801043fee52e78226cd44eb3194
Author: Jiri Denemark <jdenemar>
Date: Thu Sep 6 21:56:49 2012 +0200
conf: Rename life cycle actions to event actions
While current on_{poweroff,reboot,crash} action configuration is about
configuring life cycle actions, they can all be considered events and
actions that need to be done on a particular event. Let's generalize the
code by renaming life cycle actions to event actions so that it can be
reused later for non-lifecycle events.
commit 76f5bcabe611d90cca202fe365340a753f8cd0c3
Author: Jiri Denemark <jdenemar>
Date: Thu Sep 6 22:17:01 2012 +0200
conf: Add on_lockfailure event configuration
Using this new element, one can configure an action that should be
performed when resource locks are lost.
commit e55ff49cbc99d50149c6daf491c1cac566150d90
Author: Jiri Denemark <jdenemar>
Date: Mon Sep 17 15:12:53 2012 +0200
locking: Add const char * parameter to avoid ugly typecasts
commit d236f3fc3881c97c1655023a6a2d4e5486613569
Author: Jiri Denemark <jdenemar>
Date: Mon Sep 17 15:36:47 2012 +0200
locking: Pass hypervisor driver name when acquiring locks
This is required in case a lock manager needs to contact libvirtd in
case of an unexpected event.
commit 297c704a1ce2122f35871e1a1c93cad7b79afc58
Author: Jiri Denemark <jdenemar>
Date: Tue Sep 18 13:40:13 2012 +0200
locking: Add support for lock failure action
commit 893647671b052cba67f2241bb910df56f3191f2e
Author: Jiri Denemark <jdenemar>
Date: Tue Sep 18 13:41:26 2012 +0200
locking: Implement lock failure action in sanlock driver
While the changes to sanlock driver should be stable, the actual
implementation of sanlock_helper is supposed to be replaced in the
future. However, before we can implement a better sanlock_helper, we
need an administrative interface to libvirtd so that the helper can just
pass a "leases lost" event to the particular libvirt driver and
everything else will be taken care of internally. This approach will
also allow libvirt to pass such event to applications and use
appropriate reasons when changing domain states.
The temporary implementation handles all actions directly by calling
appropriate libvirt APIs (which among other things means that it needs
to know the credentials required to connect to libvirtd).
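For reference, the new element sits alongside the existing event actions in the domain XML. A minimal sketch of the resulting fragment (the <on_lockfailure> values exercised later in this bug are ignore, restart, poweroff, and pause; the surrounding values are taken from the test XML below):

  <!-- existing lifecycle/event actions -->
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <!-- new: what to do when the resource (sanlock) leases are lost -->
  <on_lockfailure>poweroff</on_lockfailure>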
(In reply to comment #3)
> In POST:
> http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-October/
> msg00505.html

Hi Jiri,

Unfortunately, the new sanlock feature doesn't work for me on libvirt-0.10.2-4.el6; I can't successfully start a guest with the new lock configuration. These are my test steps:

1. Configure sanlock:

# tail -1 /etc/libvirt/qemu.conf
lock_manager = "sanlock"
# tail -3 /etc/libvirt/qemu-sanlock.conf
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 1
auto_disk_leases = 1
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock/
total 1028
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__

2. Append <on_lockfailure>ignore</on_lockfailure> to the guest XML:

# virsh dumpxml foo
<domain type='kvm'>
  <name>foo</name>
  <uuid>cae86633-904f-1c90-6bc4-9c579f70e699</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='x86_64' machine='rhel6.2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic eoi='off'/>
    <pae/>
  </features>
  <clock offset='localtime'>
    <timer name='kvmclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <on_lockfailure>ignore</on_lockfailure>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/foo'/>
      <target dev='hda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    ......                <--- ignore
  </devices>
</domain>

3. Start the guest:

# virsh start foo-2
error: Failed to start domain foo-2
error: Child quit during startup handshake: Input/output error
# ll -Z /var/lib/libvirt/sanlock/
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 63170d7446adfb743772450ffb7a6af3
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__

4. Check libvirtd.log:

2012-10-18 07:46:30.314+0000: 1555: error : virCommandHandshakeWait:2528 : Child quit during startup handshake: Input/output error

Please help confirm this issue, thanks.

Alex

Interesting, I can't reproduce your error. What version of sanlock do you have installed? Could you attach more context of the error from libvirtd.log and the domain's log file found in /var/log/libvirt/qemu/foo-2.log? Also, please check /var/log/messages for anything from sanlock at the time you're trying to start the domain.

(In reply to comment #7)
> Interesting, I can't reproduce your error. What version of sanlock do you
> have installed?

Oh, I forgot to add them.

# rpm -q libvirt-lock-sanlock sanlock
libvirt-lock-sanlock-0.10.2-4.el6.x86_64
sanlock-2.6-1.el6.x86_64

> Could you attach more context of the error from libvirtd.log
> and the domain's log file found in /var/log/libvirt/qemu/foo-2.log? Also,
> please check /var/log/messages if there's anything from sanlock at the time
> you're trying to start the domain.

1. Retest:

# tailf -2 /etc/libvirt/libvirtd.conf
log_filters="3:remote 4:event 1:qemu 1:libvirt 3:conf 1:locking"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# virsh start foo
error: Failed to start domain foo
error: Child quit during startup handshake: Input/output error

Note: please check attachment libvirtd-1.log.

# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
root      2460     1  0 Oct03 ?        00:00:36 sanlock daemon -w 0

Note: this is an old sanlock configuration.

2. Use the new default sanlock configuration:

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     myRHEL6                        running
# service sanlock restart
Sending stop signal sanlock (2460): [ OK ]
Waiting for sanlock (2460) to stop: [ OK ]
Starting sanlock: [ OK ]
# virsh domstate myRHEL6
shut off

Note: this should be a new issue. I only appended '<on_lockfailure>ignore</on_lockfailure>' to guest foo and restarted the sanlock service after failing to start foo; however, a previously running guest, myRHEL6, was stopped. For details, please see libvirtd-2.log.

# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
sanlock  12016     1  0 13:11 ?        00:00:00 sanlock daemon -U sanlock -G sanlock
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# virsh start foo
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
# service libvirtd status
libvirtd dead but subsys locked

Note: libvirtd is dead.

(In reply to comment #8)
> # service libvirtd restart
> Stopping libvirtd daemon: [ OK ]
> Starting libvirtd daemon: [ OK ]
>
> Note: please see libvirtd-3.log.
>
> # ll -Z /var/lib/libvirt/sanlock
> -rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0
> 25944cbb94ba4a6a496d284b8683cf76
> -rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0
> __LIBVIRT__DISKS__
>
> # virsh start foo
> error: Failed to reconnect to the hypervisor
> error: no valid connection
> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such
> file or directory
>
> # service libvirtd status
> libvirtd dead but subsys locked
>
> Note: libvirtd is dead.

Created attachment 629770 [details]
libvirtd-1.log
Created attachment 629771 [details]
libvirtd-2.log
Created attachment 629772 [details]
libvirtd-3.log
Created attachment 629773 [details]
guest log
In addition, I saw some sanlock errors and AVC denials in /var/log/messages. I previously filed an SELinux bug (831908) against sanlock; maybe I should upgrade the selinux-policy packages to >= 3.7.19-155.el6_3.4, but even so, libvirtd shouldn't die. For details, please check the messages log.

My current selinux-policy version:

# rpm -qa|grep selinux-policy
selinux-policy-3.7.19-153.el6.noarch
selinux-policy-targeted-3.7.19-153.el6.noarch

Created attachment 629777 [details]
/var/log/messages
Upgrading the selinux-policy packages and retesting:

# rpm -qa|grep selinux-policy
selinux-policy-3.7.19-173.el6.noarch
selinux-policy-targeted-3.7.19-173.el6.noarch

Note: this version fixes bug 831908.

# getsebool -a|grep sanlock
sanlock_use_fusefs --> off
sanlock_use_nfs --> on
sanlock_use_samba --> off
virt_use_sanlock --> on

Note: virt_use_sanlock --> on.

# service sanlock restart
Sending stop signal sanlock (12725): [ OK ]
Waiting for sanlock (12725) to stop: [ OK ]
Starting sanlock: [ OK ]
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
sanlock  13597     1  0 14:07 ?        00:00:00 sanlock daemon -U sanlock -G sanlock
# tailf /var/log/messages

Note: no more AVC denial errors.

# service libvirtd restart
Stopping libvirtd daemon: [FAILED]
Starting libvirtd daemon: [ OK ]
# service libvirtd status
libvirtd dead but subsys locked
# ll -Z /var/lib/libvirt/sanlock/
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# tailf /var/log/messages
Oct 19 14:07:33 localhost libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Oct 19 14:07:33 localhost dnsmasq[2412]: read /etc/hosts - 6 addresses
Oct 19 14:07:33 localhost dnsmasq[17434]: read /etc/hosts - 6 addresses
Oct 19 14:07:33 localhost sanlock[13597]: 2012-10-19 14:07:33+0800 1389213 [13713]: open error -13 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
Oct 19 14:07:33 localhost sanlock[13597]: 2012-10-19 14:07:33+0800 1389213 [13713]: s1 open_disk /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ error -13
Oct 19 14:07:34 localhost sanlock[13597]: 2012-10-19 14:07:34+0800 1389214 [13602]: s1 add_lockspace fail result -19

Oh, it looks like you hit bug 820173. Could you modify /etc/sysconfig/sanlock to contain:

SANLOCKUSER="root"
SANLOCKOPTS="-w 0"

and try again after restarting the sanlock service? However, even if the feature appears to be working after that, don't verify this bug, since we need to check that it works in the default configuration.

(In reply to comment #17)
> Oh, it looks like you hit bug 820173. Could you modify /etc/sysconfig/sanlock
> to contain:
>
> SANLOCKUSER="root"
> SANLOCKOPTS="-w 0"
>
> and try again after restarting the sanlock service? However, even if the
> feature appears to be working after that, don't verify this bug, since we
> need to check that it works in the default configuration.

# tailf -2 /etc/sysconfig/sanlock
SANLOCKUSER="root"
SANLOCKOPTS="-w 0"
# service sanlock restart
Sending stop signal sanlock (13597): [ OK ]
Waiting for sanlock (13597) to stop: [ OK ]
Starting sanlock: [ OK ]
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
root      5608     1  0 18:33 ?        00:00:00 sanlock daemon -w 0
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# service libvirtd status
libvirtd (pid 5641) is running...
# virsh start foo       <--------- <on_lockfailure>ignore</on_lockfailure>
Domain foo started

Yeah, as you said, although this configuration works for us, it can't verify this bug, because the default configuration doesn't work. Will there be another patch for the default configuration in this bug?

Well, the non-default configuration was just a workaround allowing this bug to be tested until bug 820173 is fixed. I believe this should now work even without changing /etc/sysconfig/sanlock.

(In reply to comment #20)
> Well, the non-default configuration was just a workaround allowing this bug
> to be tested until bug 820173 is fixed. I believe this should now work even
> without changing /etc/sysconfig/sanlock.

Hi Jiri,

Bug 820173 was verified, with libvirtd still dying in the end, according to https://bugzilla.redhat.com/show_bug.cgi?id=820173#c54 . How should we handle this one, then?

Well, since bug 820173 is verified, it should mean libvirt works with sanlock in the normal case, shouldn't it? That is, if you start with a clean state, start libvirtd, and let it do its job without restarting it too early, it should work fine, and thus you should be able to test this bug. If this is not the case, however, I don't understand why that bug was verified.

Hi Jiri,

With 820173 verified, I'm trying to verify this one, but some issues are blocking me. BTW, everything is OK now: libvirtd works well, both the lease file and __LIBVIRT__DISKS__ are generated successfully, and the guest starts well.

How do I make sanlock_helper run? I noticed that it receives parameters as an independent program; my question is how to pass the parameters to it and run it. What I did was configure sanlock, start a guest with the <on_lockfailure> element, and delete the lease files. Then, if sanlock_helper runs, it should log some errors if the configuration isn't right, or just execute the on_lockfailure event. Could you tell me if I missed anything? Thanks.

sanlock_helper is run by the sanlock daemon whenever it thinks disk leases are lost. When libvirtd starts a new domain, it tells sanlockd what parameters to use for sanlock_helper according to the domain's on_lockfailure configuration. That is, you just need to configure on_lockfailure and make sanlockd think it lost the leases (see the sketch below):

- the "sanlock client status" command can be used to list active lockspaces (they start with an "s" prefix)
- the "sanlock client rem_lockspace -s <lockspace>" command can be used to manually remove a lockspace

If you need further assistance with how to make this work, Federico Simoncelli knows much more about this stuff (and how vdsm wants to use it) than I do. Anyway, some of the on_lockfailure actions are known not to work with automatic disk leases; you need to configure the leases manually in the domain XML (which is how VDSM is going to use this).
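A minimal sketch of that procedure; the lockspace string passed to rem_lockspace (format <name>:<host_id>:<path>:<offset>) is whatever the status command printed after the "s" prefix, here using the TEST_LS values from the transcript that follows:

# list active lockspaces; entries prefixed with "s" are lockspaces
sanlock client status
# force-remove a lockspace so sanlockd believes the leases were lost (testing only)
sanlock client rem_lockspace -s TEST_LS:1:/var/lib/libvirt/sanlock/TEST_LS:0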
Ah, Jiri, I tested both the automatic and the manual way and still got the same issue as mentioned in the mail. Any suggestions?

qemu-sanlock.conf (automatic leases disabled):
auto_disk_leases = 0

1. <on_lockfailure>ignore</on_lockfailure>

# sanlock client status
daemon 68bb75a1-da1d-4abe-94d5-3452d9be2b4c.intel-e312
p -1 helper
p -1 listener
p 10928 test
p -1 rem_lockspace
p -1 status
s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
r TEST_LS:sles11sp2-disk-resource-lock:/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock:0:4 p 10928

Guest XML:

<lease>
  <lockspace>TEST_LS</lockspace>
  <key>sles11sp2-disk-resource-lock</key>
  <target path='/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock'/>
</lease>

Then:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//It will hang here

2. <on_lockfailure>restart</on_lockfailure>

The guest still just shuts down and does not come back, with the same configuration as above.

(In reply to comment #26)
> 1. <on_lockfailure>ignore</on_lockfailure>
>
> Then
> # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> rem_lockspace
> //It will hang here

What hangs here?

> 2. <on_lockfailure>restart</on_lockfailure>
> The guest still just shuts down and does not come back, with the same
> configuration as above

Does anything appear in /var/log/libvirtd.log that would be evidence of sanlock_helper connecting to libvirtd and restarting the domain? Is anything logged by sanlockd to /var/log/messages?

(In reply to comment #27)
> > # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> > rem_lockspace
> > //It will hang here
>
> What hangs here?

Well, normally the command looks like this:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
rem_lockspace done 0
#

After adding the ignore event, it becomes:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//Just stops here

But if I destroy the guest in another terminal, then the rem_lockspace finishes:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//stopped until the guest is destroyed in another terminal
//virsh destroy test, then it continues and outputs
rem_lockspace done 0

One more thing: if I cancel (Ctrl+C) the hanging rem_lockspace process, then destroy the guest and start it again, the guest fails to start with a "No space left on device" error:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
^C
# virsh destroy test
# virsh start test
error: Failed to start domain test
error: Failed to acquire lock: No space left on device

> Does anything appear in /var/log/libvirtd.log that would be evidence of
> sanlock_helper connecting to libvirtd and restarting the domain?
> Is anything logged by sanlockd to /var/log/messages?

Sanlock log:

1. Ignore:
[9339]: s8 kill 20746 sig 100 count 1

2. Ignore_cancel:
[9344]: r17 cmd_acquire 2,9,21056 invalid lockspace found -1 failed 0 name TEST_LS

3. Restart:
[9339]: s9 kill 20813 sig 100 count 1
[9339]: dead 20813 ci 2 count 1
[9344]: r14 cmd_acquire 2,9,20854 invalid lockspace found -1 failed 0 name TEST_LS

The attached libvirtd.log files were recorded from the moment the sanlock rem_lockspace command began to execute, for three situations:
1. Ignore event
2. Ignore event, then cancelling the rem_lockspace
3. Restart event

Created attachment 676623 [details]
ignore_libvirtd
Created attachment 676624 [details]
ignore_cancel_libvirtd
Created attachment 676625 [details]
restart_libvirtd
(In reply to comment #28)
> After adding the ignore event, it becomes:
>
> # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> rem_lockspace
> //Just stops here

It depends on how long you waited. The regular timeout is about 3 minutes. Please also check what is logged in the sanlock log (/var/log/sanlock/sanlock.log).

(In reply to comment #32)
> It depends on how long you waited. The regular timeout is about 3 minutes.
> Please also check what is logged in the sanlock log
> (/var/log/sanlock/sanlock.log).

There are some issues with manually configuring the lockspace; I always get an error -19 on add. I'm researching it and will post any update here.

For <ignore> and an automatic lease, I still waited 12 minutes and then cancelled it. The sanlock log is the same as I commented before:

1. Ignore:
[9339]: s8 kill 20746 sig 100 count 1

2. Ignore_cancel:
[9344]: r17 cmd_acquire 2,9,21056 invalid lockspace found -1 failed 0 name TEST_LS

# time sanlock client rem_lockspace -s __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
rem_lockspace
^C
real    12m18.657s
user    0m0.001s
sys     0m0.001s

(In reply to comment #33)
> There are some issues with manually configuring the lockspace; I always get
> an error -19 on add. I'm researching it and will post any update here.

Have you tried to temporarily disable selinux? (setenforce 0)

Yeah, I tried, but it doesn't help... and audit.log has nothing related to it. BTW, I had made a mistake when configuring the sanlock lockspace; now it works. This is the record from last night:

# time sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
^C
real    878m14.682s
user    0m0.000s
sys     0m0.002s

During the "hang time", libvirtd.log shows it repeating the same procedure again and again. Any suggestions? Jiri && Federico, thanks.

Created attachment 679298 [details]
full hang log
Any suggestions, Federico? Thanks

The on_lockfailure policies to check are:

poweroff
========
If I understand correctly this is currently working. The vm is shut down.

restart
=======
I don't think we'll ever see this working with sanlock. Once you have removed the lockspace you're not able to start the VM. Anyway, this depends on the implementation of "restart": if libvirt is actually killing/shutting down the qemu process, then my assumption is correct. If the restart is handled in some way so that the qemu process remains the same, then this would appear (in sanlock's view) as an "ignore" (see below).

pause
=====
If I understand correctly this is currently working. The vm is paused and the sanlock resource is released (double check this).

ignore
======
This is not supposed to work with sanlock. If the qemu process is ignoring the request and not releasing the resources, then sanlock should escalate to kill, kill -9, and eventually rebooting the host. From what I saw, the escalation is not happening on the sanlock side. David, do you want to take a look? Thanks.

(In reply to comment #38)
> restart
> =======
> I don't think we'll ever see this working with sanlock. Once you have
> removed the lockspace you're not able to start the VM. Anyway, this depends
> on the implementation of "restart": if libvirt is actually killing/shutting
> down the qemu process, then my assumption is correct.

Yes, as requested in the bug description, libvirt kills the process and tries to start it again.

Sorry I couldn't follow all the discussion above very well, so I'll probably repeat some obvious background to make sure that we're all expecting the same things.

pause
-----
You need to pass sanlock the path to a kill script/program that sanlock will run against the vm when the lock fails. In the libvirt case we expect this program to result in the following (probably done within libvirtd):
1. pause/suspend the vm
2. inquire and save the lease state from sanlock
3. release the sanlock leases for the vm

When the sanlock daemon sees that the leases are gone, it will no longer trigger the watchdog reset.

ignore
------
You should not set killpath if you don't want sanlock to use it. In this case, sanlock will use SIGTERM and SIGKILL against the vm when its lock fails. If the pid does not exit from either of those, then the host will be reset by the watchdog. If this is not happening, could you run "sanlock client log_dump > log.txt" and send that to me?

Finally, I'm not sure what rem_lockspace is being used for above; it should probably not be used to test lock failure. The way I usually simulate lock failures is by using dmsetup to load the error target under the leases lv.
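A rough shell sketch of that dmsetup technique, assuming the lease storage sits on a device-mapper volume named vg-leases (a hypothetical name); the volume's table is swapped for the dm "error" target so every I/O to the leases fails:

# save the current mapping so it can be restored afterwards (vg-leases is hypothetical)
dmsetup table vg-leases > /tmp/leases.table
# device size in 512-byte sectors, as dm tables expect
SIZE=$(blockdev --getsz /dev/mapper/vg-leases)
# replace the whole device with the "error" target: all lease I/O now fails
dmsetup suspend vg-leases
dmsetup load vg-leases --table "0 $SIZE error"
dmsetup resume vg-leases
# ... observe sanlock's lock failure handling, then restore the saved table:
dmsetup suspend vg-leases
dmsetup load vg-leases /tmp/leases.table
dmsetup resume vg-leases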
(In reply to comment #40)
> ignore
> ------
> You should not set killpath if you don't want sanlock to use it. In this
> case, sanlock will use SIGTERM and SIGKILL against the vm when its lock
> fails.

I think that we have a misconception here: the "ignore" policy was implemented (Jiri, correct me if I'm wrong) as "ignoring" the fact that sanlock is requesting to release the resource. In this situation sanlock should escalate anyway ("ignore" == forced reboot in the sanlock implementation).

> If the pid does not exit from either of those, then the host will be
> reset by the watchdog. If this is not happening, could you run "sanlock
> client log_dump > log.txt" and send that to me?

# sanlock client log_dump
2013-01-24 18:43:42+0800 6085 [2735]: sanlock daemon started 2.6 host 7d9676dc-9af3-4d63-bc91-dc5ba9e50a7e.intel-8400
2013-01-24 18:43:50+0800 6094 [2739]: cmd_add_lockspace 2,9 TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0 flags 0 timeout 0
2013-01-24 18:43:50+0800 6094 [2739]: s1 lockspace TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
2013-01-24 18:43:50+0800 6094 [2849]: s1 delta_acquire begin TEST_LS:1
2013-01-24 18:43:51+0800 6094 [2849]: s1 delta_acquire write 1 1 6094 7d9676dc-9af3-4d63-bc91-dc5ba9e50a7e.intel-8400
2013-01-24 18:43:51+0800 6094 [2849]: s1 delta_acquire delta_short_delay 20
2013-01-24 18:44:11+0800 6114 [2849]: s1 delta_acquire done 1 1 6094
2013-01-24 18:44:11+0800 6115 [2739]: s1 add_lockspace done
2013-01-24 18:44:11+0800 6115 [2739]: cmd_add_lockspace 2,9 done 0
2013-01-24 18:46:30+0800 6253 [2735]: cmd_register ci 2 fd 9 pid 2913
2013-01-24 18:46:30+0800 6253 [2740]: cmd_killpath 2,9,2913 flags 0
2013-01-24 18:46:31+0800 6254 [2735]: cmd_restrict ci 2 fd 9 pid 2913 flags 1
2013-01-24 18:46:31+0800 6254 [2739]: cmd_acquire 2,9,2913 ci_in 3 fd 12 count 1
2013-01-24 18:46:31+0800 6254 [2739]: s1:r1 resource TEST_LS:sles11sp2-disk-resource-lock:/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock:0 for 2,9,2913
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire begin 0 0 0
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire leader 0 free
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 phase1 mbal 1
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 phase2 bal 1 inp 1 1 6254 q_max -1
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 commit self owner 1 1 6254
2013-01-24 18:46:31+0800 6254 [2739]: r1 acquire_disk rv 1 lver 1 at 6254
2013-01-24 18:46:31+0800 6254 [2739]: cmd_acquire 2,9,2913 result 0 pid_dead 0
2013-01-25 02:04:10+0800 32513 [2740]: cmd_rem_lockspace 3,12 TEST_LS flags 0
2013-01-25 02:04:10+0800 32513 [2735]: s1 set killing_pids check 0 remove 1
2013-01-25 02:04:10+0800 32513 [2735]: s1:r1 client_using_space pid 2913
2013-01-25 02:04:10+0800 32513 [2735]: s1 kill 2913 sig 100 count 1
2013-01-25 02:05:49+0800 32612 [2735]: s1 killing pids stuck 1
<...nothing else, 5 minutes passed...>

(In reply to comment #41)
> I think that we have a misconception here: the "ignore" policy was
> implemented (Jiri, correct me if I'm wrong) as "ignoring" the fact that
> sanlock is requesting to release the resource. In this situation sanlock
> should escalate anyway ("ignore" == forced reboot in the sanlock
> implementation).

Actually, let me correct myself: "ignore" == vm is abruptly killed (and eventually we might escalate to the reboot).

The first problem is as I mentioned above: rem_lockspace is not equivalent to a failed lock and should not be used to test that. (This does reveal a possible problem with a forced rem_lockspace, though, which I will look into.)
There might also be a problem with the killpath program, because the lease is not removed or the pid does not exit; we'd expect one of those results from running killpath. (If the lockspace had actually failed, then sanlock would have escalated when the killpath did not do anything.)

After talking with Jiri on IRC, I am verifying this bug, as the "poweroff" and "pause" actions work well. For the other two actions:

Ignore: leads to sanlock getting stuck
Restart: can shut down the guest successfully but fails to start it again

I will create two bugs to track them respectively in 6.5. Thanks for your help, David, Federico, and Jiri.

I created the two bugs and rewrote the steps; if I missed anything or made any mistake, please correct me:

Bug 905280 - Lockfailure action Ignore will lead to sanlock rem_lockspace stuck
Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to start it

(In reply to comment #45)
> Bug 905280 - Lockfailure action Ignore will lead to sanlock rem_lockspace
> stuck

I think David already has a fix for this.

> Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to
> start it

I don't think there's a way to fix this. It's probably NOTABUG.

(In reply to comment #46)
> > Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to
> > start it
>
> I don't think there's a way to fix this. It's probably NOTABUG.

Not really; libvirt should at least refuse to create a domain with the restart lockfailure action if sanlock is used as the lock manager, in case it can't be fixed, of course. But anyway, let's move further discussion to the new bugs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html