Bug 832156
| Summary: | RFE: Support customizable actions when sanlock leases are lost |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Reporter: | Daniel Berrangé <berrange> |
| Component: | libvirt |
| Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 6.4 |
| CC: | acathrow, ajia, cpelland, dallan, dyasny, dyuan, fsimonce, lsu, mzhan, rwu, teigland, weizhan |
| Target Milestone: | rc |
| Keywords: | FutureFeature |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | libvirt-0.10.2-4.el6 |
| Doc Type: | Enhancement |
| Type: | Bug |
| Last Closed: | 2013-02-21 07:17:32 UTC |
| Bug Depends On: | 820173, 886421 |
Description
Daniel Berrangé
2012-06-14 17:08:34 UTC
*** Bug 829316 has been marked as a duplicate of this bug. ***

This is now implemented upstream by commits v0.10.2-123-gd0ea530 through v0.10.2-128-g8936476:
commit d0ea530b00e69801043fee52e78226cd44eb3194
Author: Jiri Denemark <jdenemar>
Date: Thu Sep 6 21:56:49 2012 +0200
conf: Rename life cycle actions to event actions
While current on_{poweroff,reboot,crash} action configuration is about
configuring life cycle actions, they can all be considered events and
actions that need to be done on a particular event. Let's generalize the
code by renaming life cycle actions to event actions so that it can be
reused later for non-lifecycle events.
commit 76f5bcabe611d90cca202fe365340a753f8cd0c3
Author: Jiri Denemark <jdenemar>
Date: Thu Sep 6 22:17:01 2012 +0200
conf: Add on_lockfailure event configuration
Using this new element, one can configure an action that should be
performed when resource locks are lost.
commit e55ff49cbc99d50149c6daf491c1cac566150d90
Author: Jiri Denemark <jdenemar>
Date: Mon Sep 17 15:12:53 2012 +0200
locking: Add const char * parameter to avoid ugly typecasts
commit d236f3fc3881c97c1655023a6a2d4e5486613569
Author: Jiri Denemark <jdenemar>
Date: Mon Sep 17 15:36:47 2012 +0200
locking: Pass hypervisor driver name when acquiring locks
This is required in case a lock manager needs to contact libvirtd in
case of an unexpected event.
commit 297c704a1ce2122f35871e1a1c93cad7b79afc58
Author: Jiri Denemark <jdenemar>
Date: Tue Sep 18 13:40:13 2012 +0200
locking: Add support for lock failure action
commit 893647671b052cba67f2241bb910df56f3191f2e
Author: Jiri Denemark <jdenemar>
Date: Tue Sep 18 13:41:26 2012 +0200
locking: Implement lock failure action in sanlock driver
While the changes to sanlock driver should be stable, the actual
implementation of sanlock_helper is supposed to be replaced in the
future. However, before we can implement a better sanlock_helper, we
need an administrative interface to libvirtd so that the helper can just
pass a "leases lost" event to the particular libvirt driver and
everything else will be taken care of internally. This approach will
also allow libvirt to pass such event to applications and use
appropriate reasons when changing domain states.
The temporary implementation handles all actions directly by calling
appropriate libvirt APIs (which among other things means that it needs
to know the credentials required to connect to libvirtd).
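For reference, the new element sits alongside the existing event actions in the domain XML. A minimal sketch of the resulting fragment (the <on_lockfailure> values exercised later in this bug are ignore, restart, poweroff, and pause; the surrounding values are taken from the test XML below):

  <!-- existing lifecycle/event actions -->
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <!-- new: what to do when the resource (sanlock) leases are lost -->
  <on_lockfailure>poweroff</on_lockfailure>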
(In reply to comment #3)
> In POST:
> http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-October/
> msg00505.html

Hi Jiri,

Unfortunately, the new sanlock feature doesn't work for me on libvirt-0.10.2-4.el6; I can't successfully start a guest with the new lock configuration. These are my test steps:

1. Configure sanlock:

# tail -1 /etc/libvirt/qemu.conf
lock_manager = "sanlock"
# tail -3 /etc/libvirt/qemu-sanlock.conf
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 1
auto_disk_leases = 1
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock/
total 1028
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__

2. Append <on_lockfailure>ignore</on_lockfailure> to the guest XML:

# virsh dumpxml foo
<domain type='kvm'>
  <name>foo</name>
  <uuid>cae86633-904f-1c90-6bc4-9c579f70e699</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='x86_64' machine='rhel6.2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic eoi='off'/>
    <pae/>
  </features>
  <clock offset='localtime'>
    <timer name='kvmclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <on_lockfailure>ignore</on_lockfailure>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/foo'/>
      <target dev='hda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    ......                <--- ignore
  </devices>
</domain>

3. Start the guest:

# virsh start foo-2
error: Failed to start domain foo-2
error: Child quit during startup handshake: Input/output error
# ll -Z /var/lib/libvirt/sanlock/
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 63170d7446adfb743772450ffb7a6af3
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__

4. Check libvirtd.log:

2012-10-18 07:46:30.314+0000: 1555: error : virCommandHandshakeWait:2528 : Child quit during startup handshake: Input/output error

Please help confirm this issue, thanks.

Alex

Interesting, I can't reproduce your error. What version of sanlock do you have installed? Could you attach more context of the error from libvirtd.log and the domain's log file found in /var/log/libvirt/qemu/foo-2.log? Also, please check /var/log/messages for anything from sanlock at the time you're trying to start the domain.

(In reply to comment #7)
> Interesting, I can't reproduce your error. What version of sanlock do you
> have installed?

Oh, I forgot to add them.

# rpm -q libvirt-lock-sanlock sanlock
libvirt-lock-sanlock-0.10.2-4.el6.x86_64
sanlock-2.6-1.el6.x86_64

> Could you attach more context of the error from libvirtd.log
> and the domain's log file found in /var/log/libvirt/qemu/foo-2.log? Also,
> please check /var/log/messages if there's anything from sanlock at the time
> you're trying to start the domain.

1. Retest:

# tailf -2 /etc/libvirt/libvirtd.conf
log_filters="3:remote 4:event 1:qemu 1:libvirt 3:conf 1:locking"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# virsh start foo
error: Failed to start domain foo
error: Child quit during startup handshake: Input/output error

Note: please check attachment libvirtd-1.log.

# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
root      2460     1  0 Oct03 ?        00:00:36 sanlock daemon -w 0

Note: this is an old sanlock configuration.

2. Use the new default sanlock configuration:

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     myRHEL6                        running
# service sanlock restart
Sending stop signal sanlock (2460): [ OK ]
Waiting for sanlock (2460) to stop: [ OK ]
Starting sanlock: [ OK ]
# virsh domstate myRHEL6
shut off

Note: this should be a new issue. I only appended '<on_lockfailure>ignore</on_lockfailure>' to guest foo and restarted the sanlock service after failing to start foo; however, a previously running guest, myRHEL6, was stopped. For details, please see libvirtd-2.log.

# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
sanlock  12016     1  0 13:11 ?        00:00:00 sanlock daemon -U sanlock -G sanlock
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# ll -Z /var/lib/libvirt/sanlock
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# virsh start foo
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
# service libvirtd status
libvirtd dead but subsys locked

Note: libvirtd is dead.

(In reply to comment #8)
> # service libvirtd restart
> Stopping libvirtd daemon: [ OK ]
> Starting libvirtd daemon: [ OK ]
>
> Note: please see libvirtd-3.log.
>
> # ll -Z /var/lib/libvirt/sanlock
> -rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0
> 25944cbb94ba4a6a496d284b8683cf76
> -rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0
> __LIBVIRT__DISKS__
>
> # virsh start foo
> error: Failed to reconnect to the hypervisor
> error: no valid connection
> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such
> file or directory
>
> # service libvirtd status
> libvirtd dead but subsys locked
>
> Note: libvirtd is dead.

Created attachment 629770 [details]
libvirtd-1.log
Created attachment 629771 [details]
libvirtd-2.log
Created attachment 629772 [details]
libvirtd-3.log
Created attachment 629773 [details]
guest log
In addition, I saw some sanlock errors and AVC denials in /var/log/messages. I previously filed an SELinux bug (831908) against sanlock; maybe I should upgrade the selinux-policy packages to >= 3.7.19-155.el6_3.4, but even so, libvirtd shouldn't die. For details, please check the messages log.

My current selinux-policy version:

# rpm -qa|grep selinux-policy
selinux-policy-3.7.19-153.el6.noarch
selinux-policy-targeted-3.7.19-153.el6.noarch

Created attachment 629777 [details]
/var/log/messages
Upgrading the selinux-policy packages and retesting:

# rpm -qa|grep selinux-policy
selinux-policy-3.7.19-173.el6.noarch
selinux-policy-targeted-3.7.19-173.el6.noarch

Note: this version fixes bug 831908.

# getsebool -a|grep sanlock
sanlock_use_fusefs --> off
sanlock_use_nfs --> on
sanlock_use_samba --> off
virt_use_sanlock --> on

Note: virt_use_sanlock --> on.

# service sanlock restart
Sending stop signal sanlock (12725): [ OK ]
Waiting for sanlock (12725) to stop: [ OK ]
Starting sanlock: [ OK ]
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
sanlock  13597     1  0 14:07 ?        00:00:00 sanlock daemon -U sanlock -G sanlock
# tailf /var/log/messages

Note: no more AVC denial errors.

# service libvirtd restart
Stopping libvirtd daemon: [FAILED]
Starting libvirtd daemon: [ OK ]
# service libvirtd status
libvirtd dead but subsys locked
# ll -Z /var/lib/libvirt/sanlock/
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 25944cbb94ba4a6a496d284b8683cf76
-rw-------. root root unconfined_u:object_r:virt_var_lib_t:s0 __LIBVIRT__DISKS__
# tailf /var/log/messages
Oct 19 14:07:33 localhost libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Oct 19 14:07:33 localhost dnsmasq[2412]: read /etc/hosts - 6 addresses
Oct 19 14:07:33 localhost dnsmasq[17434]: read /etc/hosts - 6 addresses
Oct 19 14:07:33 localhost sanlock[13597]: 2012-10-19 14:07:33+0800 1389213 [13713]: open error -13 /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__
Oct 19 14:07:33 localhost sanlock[13597]: 2012-10-19 14:07:33+0800 1389213 [13713]: s1 open_disk /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ error -13
Oct 19 14:07:34 localhost sanlock[13597]: 2012-10-19 14:07:34+0800 1389214 [13602]: s1 add_lockspace fail result -19

Oh, it looks like you hit bug 820173. Could you modify /etc/sysconfig/sanlock to contain:

SANLOCKUSER="root"
SANLOCKOPTS="-w 0"

and try again after restarting the sanlock service? However, even if the feature appears to be working after that, don't verify this bug, since we need to check that it works in the default configuration.

(In reply to comment #17)
> Oh, it looks like you hit bug 820173. Could you modify /etc/sysconfig/sanlock
> to contain:
>
> SANLOCKUSER="root"
> SANLOCKOPTS="-w 0"
>
> and try again after restarting the sanlock service? However, even if the
> feature appears to be working after that, don't verify this bug, since we
> need to check that it works in the default configuration.

# tailf -2 /etc/sysconfig/sanlock
SANLOCKUSER="root"
SANLOCKOPTS="-w 0"
# service sanlock restart
Sending stop signal sanlock (13597): [ OK ]
Waiting for sanlock (13597) to stop: [ OK ]
Starting sanlock: [ OK ]
# ps -fC sanlock
UID        PID  PPID  C STIME TTY          TIME CMD
root      5608     1  0 18:33 ?        00:00:00 sanlock daemon -w 0
# service libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]
# service libvirtd status
libvirtd (pid 5641) is running...
# virsh start foo       <--------- <on_lockfailure>ignore</on_lockfailure>
Domain foo started

Yeah, as you said, although this configuration works for us, it can't verify this bug, because the default configuration doesn't work. Will there be another patch for the default configuration in this bug?

Well, the non-default configuration was just a workaround allowing this bug to be tested until bug 820173 is fixed. I believe this should now work even without changing /etc/sysconfig/sanlock.

(In reply to comment #20)
> Well, the non-default configuration was just a workaround allowing this bug
> to be tested until bug 820173 is fixed. I believe this should now work even
> without changing /etc/sysconfig/sanlock.

Hi Jiri,

Bug 820173 was verified, with libvirtd still dying in the end, according to https://bugzilla.redhat.com/show_bug.cgi?id=820173#c54 . How should we handle this one, then?

Well, since bug 820173 is verified, it should mean libvirt works with sanlock in the normal case, shouldn't it? That is, if you start with a clean state, start libvirtd, and let it do its job without restarting it too early, it should work fine, and thus you should be able to test this bug. If this is not the case, however, I don't understand why that bug was verified.

Hi Jiri,

With 820173 verified, I'm trying to verify this one, but some issues are blocking me. BTW, everything is OK now: libvirtd works well, both the lease file and __LIBVIRT__DISKS__ are generated successfully, and the guest starts well.

How do I make sanlock_helper run? I noticed that it receives parameters as an independent program; my question is how to pass the parameters to it and run it. What I did was configure sanlock, start a guest with the <on_lockfailure> element, and delete the lease files. Then, if sanlock_helper runs, it should log some errors if the configuration isn't right, or just execute the on_lockfailure event. Could you tell me if I missed anything? Thanks.

sanlock_helper is run by the sanlock daemon whenever it thinks disk leases are lost. When libvirtd starts a new domain, it tells sanlockd what parameters to use for sanlock_helper according to the domain's on_lockfailure configuration. That is, you just need to configure on_lockfailure and make sanlockd think it lost the leases (see the sketch below):

- the "sanlock client status" command can be used to list active lockspaces (they start with an "s" prefix)
- the "sanlock client rem_lockspace -s <lockspace>" command can be used to manually remove a lockspace

If you need further assistance with how to make this work, Federico Simoncelli knows much more about this stuff (and how vdsm wants to use it) than I do. Anyway, some of the on_lockfailure actions are known not to work with automatic disk leases; you need to configure the leases manually in the domain XML (which is how VDSM is going to use this).
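A minimal sketch of that procedure; the lockspace string passed to rem_lockspace (format <name>:<host_id>:<path>:<offset>) is whatever the status command printed after the "s" prefix, here using the TEST_LS values from the transcript that follows:

# list active lockspaces; entries prefixed with "s" are lockspaces
sanlock client status
# force-remove a lockspace so sanlockd believes the leases were lost (testing only)
sanlock client rem_lockspace -s TEST_LS:1:/var/lib/libvirt/sanlock/TEST_LS:0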
Ah, Jiri, I tested both the automatic and the manual way and still got the same issue as mentioned in the mail. Any suggestions?

qemu-sanlock.conf (automatic leases disabled):
auto_disk_leases = 0

1. <on_lockfailure>ignore</on_lockfailure>

# sanlock client status
daemon 68bb75a1-da1d-4abe-94d5-3452d9be2b4c.intel-e312
p -1 helper
p -1 listener
p 10928 test
p -1 rem_lockspace
p -1 status
s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
r TEST_LS:sles11sp2-disk-resource-lock:/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock:0:4 p 10928

Guest XML:

<lease>
  <lockspace>TEST_LS</lockspace>
  <key>sles11sp2-disk-resource-lock</key>
  <target path='/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock'/>
</lease>

Then:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//It will hang here

2. <on_lockfailure>restart</on_lockfailure>

The guest still just shuts down and does not come back, with the same configuration as above.

(In reply to comment #26)
> 1. <on_lockfailure>ignore</on_lockfailure>
>
> Then
> # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> rem_lockspace
> //It will hang here

What hangs here?

> 2. <on_lockfailure>restart</on_lockfailure>
> The guest still just shuts down and does not come back, with the same
> configuration as above

Does anything appear in /var/log/libvirtd.log that would be evidence of sanlock_helper connecting to libvirtd and restarting the domain? Is anything logged by sanlockd to /var/log/messages?

(In reply to comment #27)
> > # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> > rem_lockspace
> > //It will hang here
>
> What hangs here?

Well, normally the command looks like this:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
rem_lockspace done 0
#

After adding the ignore event, it becomes:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//Just stops here

But if I destroy the guest in another terminal, then the rem_lockspace finishes:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
//stopped until the guest is destroyed in another terminal
//virsh destroy test, then it continues and outputs
rem_lockspace done 0

One more thing: if I cancel (Ctrl+C) the hanging rem_lockspace process, then destroy the guest and start it again, the guest fails to start with a "No space left on device" error:

# sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
^C
# virsh destroy test
# virsh start test
error: Failed to start domain test
error: Failed to acquire lock: No space left on device

> Does anything appear in /var/log/libvirtd.log that would be evidence of
> sanlock_helper connecting to libvirtd and restarting the domain?
> Is anything logged by sanlockd to /var/log/messages?

Sanlock log:

1. Ignore:
[9339]: s8 kill 20746 sig 100 count 1

2. Ignore_cancel:
[9344]: r17 cmd_acquire 2,9,21056 invalid lockspace found -1 failed 0 name TEST_LS

3. Restart:
[9339]: s9 kill 20813 sig 100 count 1
[9339]: dead 20813 ci 2 count 1
[9344]: r14 cmd_acquire 2,9,20854 invalid lockspace found -1 failed 0 name TEST_LS

The attached libvirtd.log files were recorded from the moment the sanlock rem_lockspace command began to execute, for three situations:
1. Ignore event
2. Ignore event, then cancelling the rem_lockspace
3. Restart event

Created attachment 676623 [details]
ignore_libvirtd
Created attachment 676624 [details]
ignore_cancel_libvirtd
Created attachment 676625 [details]
restart_libvirtd
(In reply to comment #28)
> After adding the ignore event, it becomes:
>
> # sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
> rem_lockspace
> //Just stops here

It depends on how long you waited. The regular timeout is about 3 minutes. Please also check what is logged in the sanlock log (/var/log/sanlock/sanlock.log).

(In reply to comment #32)
> It depends on how long you waited. The regular timeout is about 3 minutes.
> Please also check what is logged in the sanlock log
> (/var/log/sanlock/sanlock.log).

There are some issues with manually configuring the lockspace; I always get an error -19 on add. I'm researching it and will post any update here.

For <ignore> and an automatic lease, I still waited 12 minutes and then cancelled it. The sanlock log is the same as I commented before:

1. Ignore:
[9339]: s8 kill 20746 sig 100 count 1

2. Ignore_cancel:
[9344]: r17 cmd_acquire 2,9,21056 invalid lockspace found -1 failed 0 name TEST_LS

# time sanlock client rem_lockspace -s __LIBVIRT__DISKS__:1:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
rem_lockspace
^C
real    12m18.657s
user    0m0.001s
sys     0m0.001s

(In reply to comment #33)
> There are some issues with manually configuring the lockspace; I always get
> an error -19 on add. I'm researching it and will post any update here.

Have you tried to temporarily disable selinux? (setenforce 0)

Yeah, I tried, but it doesn't help... and audit.log has nothing related to it. BTW, I had made a mistake when configuring the sanlock lockspace; now it works. This is the record from last night:

# time sanlock client rem_lockspace -s TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
rem_lockspace
^C
real    878m14.682s
user    0m0.000s
sys     0m0.002s

During the "hang time", libvirtd.log shows it repeating the same procedure again and again. Any suggestions? Jiri && Federico, thanks.

Created attachment 679298 [details]
full hang log
Any suggestions, Federico? Thanks

The on_lockfailure policies to check are:

poweroff
========
If I understand correctly this is currently working. The vm is shut down.

restart
=======
I don't think we'll ever see this working with sanlock. Once you have removed the lockspace you're not able to start the VM. Anyway, this depends on the implementation of "restart": if libvirt is actually killing/shutting down the qemu process, then my assumption is correct. If the restart is handled in some way so that the qemu process remains the same, then this would appear (in sanlock's view) as an "ignore" (see below).

pause
=====
If I understand correctly this is currently working. The vm is paused and the sanlock resource is released (double check this).

ignore
======
This is not supposed to work with sanlock. If the qemu process is ignoring the request and not releasing the resources, then sanlock should escalate to kill, kill -9, and eventually rebooting the host. From what I saw, the escalation is not happening on the sanlock side. David, do you want to take a look? Thanks.

(In reply to comment #38)
> restart
> =======
> I don't think we'll ever see this working with sanlock. Once you have
> removed the lockspace you're not able to start the VM. Anyway, this depends
> on the implementation of "restart": if libvirt is actually killing/shutting
> down the qemu process, then my assumption is correct.

Yes, as requested in the bug description, libvirt kills the process and tries to start it again.

Sorry I couldn't follow all the discussion above very well, so I'll probably repeat some obvious background to make sure that we're all expecting the same things.

pause
-----
You need to pass sanlock the path to a kill script/program that sanlock will run against the vm when the lock fails. In the libvirt case we expect this program to result in the following (probably done within libvirtd):
1. pause/suspend the vm
2. inquire and save the lease state from sanlock
3. release the sanlock leases for the vm

When the sanlock daemon sees that the leases are gone, it will no longer trigger the watchdog reset.

ignore
------
You should not set killpath if you don't want sanlock to use it. In this case, sanlock will use SIGTERM and SIGKILL against the vm when its lock fails. If the pid does not exit from either of those, then the host will be reset by the watchdog. If this is not happening, could you run "sanlock client log_dump > log.txt" and send that to me?

Finally, I'm not sure what rem_lockspace is being used for above; it should probably not be used to test lock failure. The way I usually simulate lock failures is by using dmsetup to load the error target under the leases lv.
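A rough shell sketch of that dmsetup technique, assuming the lease storage sits on a device-mapper volume named vg-leases (a hypothetical name); the volume's table is swapped for the dm "error" target so every I/O to the leases fails:

# save the current mapping so it can be restored afterwards (vg-leases is hypothetical)
dmsetup table vg-leases > /tmp/leases.table
# device size in 512-byte sectors, as dm tables expect
SIZE=$(blockdev --getsz /dev/mapper/vg-leases)
# replace the whole device with the "error" target: all lease I/O now fails
dmsetup suspend vg-leases
dmsetup load vg-leases --table "0 $SIZE error"
dmsetup resume vg-leases
# ... observe sanlock's lock failure handling, then restore the saved table:
dmsetup suspend vg-leases
dmsetup load vg-leases /tmp/leases.table
dmsetup resume vg-leases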
(In reply to comment #40)
> ignore
> ------
> You should not set killpath if you don't want sanlock to use it. In this
> case, sanlock will use SIGTERM and SIGKILL against the vm when its lock
> fails.

I think that we have a misconception here: the "ignore" policy was implemented (Jiri, correct me if I'm wrong) as "ignoring" the fact that sanlock is requesting to release the resource. In this situation sanlock should escalate anyway ("ignore" == forced reboot in the sanlock implementation).

> If the pid does not exit from either of those, then the host will be
> reset by the watchdog. If this is not happening, could you run "sanlock
> client log_dump > log.txt" and send that to me?

# sanlock client log_dump
2013-01-24 18:43:42+0800 6085 [2735]: sanlock daemon started 2.6 host 7d9676dc-9af3-4d63-bc91-dc5ba9e50a7e.intel-8400
2013-01-24 18:43:50+0800 6094 [2739]: cmd_add_lockspace 2,9 TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0 flags 0 timeout 0
2013-01-24 18:43:50+0800 6094 [2739]: s1 lockspace TEST_LS:1:var/lib/libvirt/sanlock/TEST_LS:0
2013-01-24 18:43:50+0800 6094 [2849]: s1 delta_acquire begin TEST_LS:1
2013-01-24 18:43:51+0800 6094 [2849]: s1 delta_acquire write 1 1 6094 7d9676dc-9af3-4d63-bc91-dc5ba9e50a7e.intel-8400
2013-01-24 18:43:51+0800 6094 [2849]: s1 delta_acquire delta_short_delay 20
2013-01-24 18:44:11+0800 6114 [2849]: s1 delta_acquire done 1 1 6094
2013-01-24 18:44:11+0800 6115 [2739]: s1 add_lockspace done
2013-01-24 18:44:11+0800 6115 [2739]: cmd_add_lockspace 2,9 done 0
2013-01-24 18:46:30+0800 6253 [2735]: cmd_register ci 2 fd 9 pid 2913
2013-01-24 18:46:30+0800 6253 [2740]: cmd_killpath 2,9,2913 flags 0
2013-01-24 18:46:31+0800 6254 [2735]: cmd_restrict ci 2 fd 9 pid 2913 flags 1
2013-01-24 18:46:31+0800 6254 [2739]: cmd_acquire 2,9,2913 ci_in 3 fd 12 count 1
2013-01-24 18:46:31+0800 6254 [2739]: s1:r1 resource TEST_LS:sles11sp2-disk-resource-lock:/var/lib/libvirt/sanlock/sles11sp2-disk-resource-lock:0 for 2,9,2913
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire begin 0 0 0
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2013-01-24 18:46:31+0800 6254 [2739]: r1 paxos_acquire leader 0 free
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 phase1 mbal 1
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 phase2 bal 1 inp 1 1 6254 q_max -1
2013-01-24 18:46:31+0800 6254 [2739]: r1 ballot 1 commit self owner 1 1 6254
2013-01-24 18:46:31+0800 6254 [2739]: r1 acquire_disk rv 1 lver 1 at 6254
2013-01-24 18:46:31+0800 6254 [2739]: cmd_acquire 2,9,2913 result 0 pid_dead 0
2013-01-25 02:04:10+0800 32513 [2740]: cmd_rem_lockspace 3,12 TEST_LS flags 0
2013-01-25 02:04:10+0800 32513 [2735]: s1 set killing_pids check 0 remove 1
2013-01-25 02:04:10+0800 32513 [2735]: s1:r1 client_using_space pid 2913
2013-01-25 02:04:10+0800 32513 [2735]: s1 kill 2913 sig 100 count 1
2013-01-25 02:05:49+0800 32612 [2735]: s1 killing pids stuck 1
<...nothing else, 5 minutes passed...>

(In reply to comment #41)
> I think that we have a misconception here: the "ignore" policy was
> implemented (Jiri, correct me if I'm wrong) as "ignoring" the fact that
> sanlock is requesting to release the resource. In this situation sanlock
> should escalate anyway ("ignore" == forced reboot in the sanlock
> implementation).

Actually, let me correct myself: "ignore" == vm is abruptly killed (and eventually we might escalate to the reboot).

The first problem is as I mentioned above: rem_lockspace is not equivalent to a failed lock and should not be used to test that. (This does reveal a possible problem with a forced rem_lockspace, though, which I will look into.)
There might also be a problem with the killpath program, because the lease is not removed or the pid does not exit; we'd expect one of those results from running killpath. (If the lockspace had actually failed, then sanlock would have escalated when the killpath did not do anything.)

After talking with Jiri on IRC, I am verifying this bug, as the "poweroff" and "pause" actions work well. For the other two actions:

Ignore: leads to sanlock getting stuck
Restart: can shut down the guest successfully but fails to start it again

I will create two bugs to track them respectively in 6.5. Thanks for your help, David, Federico, and Jiri.

I created the two bugs and rewrote the steps; if I missed anything or made any mistake, please correct me:

Bug 905280 - Lockfailure action Ignore will lead to sanlock rem_lockspace stuck
Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to start it

(In reply to comment #45)
> Bug 905280 - Lockfailure action Ignore will lead to sanlock rem_lockspace
> stuck

I think David already has a fix for this.

> Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to
> start it

I don't think there's a way to fix this. It's probably NOTABUG.

(In reply to comment #46)
> > Bug 905282 - Lockfailure action Restart can shutdown the guest but fail to
> > start it
>
> I don't think there's a way to fix this. It's probably NOTABUG.

Not really; libvirt should at least refuse to create a domain with the restart lockfailure action if sanlock is used as the lock manager, in case it can't be fixed, of course. But anyway, let's move further discussion to the new bugs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html