Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1140437

Summary: spmProtects fences SPM if fails to renew ISO domain leases file during Attach ISO Domain flow
Product: Red Hat Enterprise Virtualization Manager Reporter: Marina Kalinin <mkalinin>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Aharon Canan <acanan>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.3.0CC: bazulay, danken, ecohen, fsimonce, gklein, iheim, laravot, lpeer, scohen, tnisan, yeylon
Target Milestone: ---Keywords: Triaged
Target Release: 3.5.0   
Hardware: All   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-24 17:18:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
leases file in the correct format none

Description Marina Kalinin 2014-09-11 01:23:35 UTC
Description of problem:
spmProtects fences SPM if fails to renew leases file during Attach ISO Domain flow.
Customer environment: 
vdsm-4.14.7-3.el6ev.x86_64
rhevm-3.3.3-0.52.el6ev.noarch

I could not reproduce the problem locally with latest 3.4.2. But I don't know how to fail "leases renewal process".
In customer environment it cause fencing the SPM host and loosing all production VMs running on it.

From customer logs:
spm-lock.log [created once we are accessing nfs iso domain]:
~~~~~~~~~~~~~
[2014-08-27 13:55:13] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:13] Trying to acquire lease - sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 lease_file=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases id=3 lease_time_ms=5000 io_op_to_ms=1000
[2014-08-27 13:55:15] Lease acquired sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases, TS=1409147713666790
[2014-08-27 13:55:15] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:15] Started renewal process (pid=21899) for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
[2014-08-27 13:55:21] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254
[2014-08-27 13:55:23] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254
[2014-08-27 13:55:28] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
/bin/kill -9 71254
[2014-08-27 13:55:30] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
kill 71254: No such process
/bin/kill -9 71254
kill 71254: No such process
sudo /sbin/reboot -f
sudo /sbin/reboot -f


vdsm.log:
~~~~~~~~~
[mku: start adding the domain]
Thread-655::DEBUG::2014-08-27 13:55:13,532::fileSD::150::Storage.StorageDomain::(__init__) Reading domain in path /rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0
Thread-655::DEBUG::2014-08-27 13:55:13,540::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with FileMetadataRW backend
Thread-655::DEBUG::2014-08-27 13:55:13,552::persistentDict::234::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Iso', 'DESCRIPTION=ISO_DOMAIN', 'IOOPTIMEOUTSEC=1', 'LEASERETRIES=3', 'LEASETIMESEC=5', 'LOCKPOLICY=', 'LOCKRENEWALINTERVALSEC=5', 'MASTER_VERSION=0', 'POOL_UUID=', 'REMOTE_PATH=no.one.reads.this:/rhev', 'ROLE=Regular', 'SDUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0', 'TYPE=NFS', 'VERSION=0', '_SHA_CKSUM=41f363f51f2c3b6db27dfd989d16130cb301791c']

[mku: starting spmprotect.sh here]
Thread-655::DEBUG::2014-08-27 13:55:13,571::clusterlock::86::SafeLease::(acquireHostId) Host id for domain 50885a30-5bec-4b34-98ed-ed3c3306bad0 successfully acquired (id: 3)
Thread-655::DEBUG::2014-08-27 13:55:13,575::clusterlock::100::SafeLease::(acquire) Acquiring cluster lock for domain 50885a30-5bec-4b34-98ed-ed3c3306bad0
Thread-655::DEBUG::2014-08-27 13:55:13,582::clusterlock::114::Storage.Misc.excCmd::(acquire) '/usr/bin/sudo -n /usr/bin/setsid /usr/bin/ionice -c 1 -n 0 /bin/su vdsm -s /bin/sh -c "/usr/libexec/vdsm/spmprotect.s
h start 50885a30-5bec-4b34-98ed-ed3c3306bad0 3 5 /rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases 5000 1000 3"' (cwd /usr/libexec/vdsm)
Thread-655::DEBUG::2014-08-27 13:55:15,836::clusterlock::114::Storage.Misc.excCmd::(acquire) SUCCESS: <err> = ''; <rc> = 0
Thread-655::DEBUG::2014-08-27 13:55:15,845::clusterlock::117::SafeLease::(acquire) Clustered lock acquired successfully
Thread-655::DEBUG::2014-08-27 13:55:15,862::persistentDict::234::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Iso', 'DESCRIPTION=ISO_DOMAIN', 'IOOPTIMEOUTSEC=1', 'LEASERETRIES=3', 'LEASETIMESEC=5', 'LOCKPOLICY=', 'LOCKRENEWALINTERVALSEC=5', 'MASTER_VERSION=0', 'POOL_UUID=', 'REMOTE_PATH=no.one.reads.this:/rhev', 'ROLE=Regular', 'SDUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0', 'TYPE=NFS', 'VERSION=0', '_SHA_CKSUM=41f363f51f2c3b6db27dfd989d16130cb301791c']

Thread-655::DEBUG::2014-08-27 13:55:15,871::persistentDict::299::Storage.PersistentDict::(flush) about to write lines (FileMetadataRW)=['CLASS=Iso', 'DESCRIPTION=ISO_DOMAIN', 'IOOPTIMEOUTSEC=1', 'LEASERETRIES=3', 'LEASETIMESEC=5', 'LOCKPOLICY=', 'LOCKRENEWALINTERVALSEC=5', 'MASTER_VERSION=0', 'POOL_UUID=7f7534f7-e5d4-44b3-b34a-4d0743734e7d', 'REMOTE_PATH=no.one.reads.this:/rhev', 'ROLE=Regular', 'SDUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0', 'TYPE=NFS', 'VERSION=0', '_SHA_CKSUM=48d35ae1db7fa17f61c5b88d3e80ccbc10132aad']
Thread-655::DEBUG::2014-08-27 13:55:15,891::persistentDict::175::Storage.PersistentDict::(transaction) Finished transaction
Thread-655::DEBUG::2014-08-27 13:55:15,894::persistentDict::167::Storage.PersistentDict::(transaction) Starting transaction
Thread-655::DEBUG::2014-08-27 13:55:15,895::persistentDict::173::Storage.PersistentDict::(transaction) Flushing changes
Thread-655::DEBUG::2014-08-27 13:55:15,897::persistentDict::299::Storage.PersistentDict::(flush) about to write lines (VGTagMetadataRW)=['CLASS=Data', 'DESCRIPTION=SERV_R6_DS01_02e5', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIMESEC=60', 'LOCKPOLICY=', 'LOCKRENEWALINTERVALSEC=5', 'LOGBLKSIZE=512', 'MASTER_VERSION=1', 'PHYBLKSIZE=512', 'POOL_DESCRIPTION=VCP', u'POOL_DOMAINS=52f06006-7205-4618-ae83-aa1caa3f3d25:Active,f737ce8c-d2c1-4cdf-8340-63fd449d2542:Active,07c1f411-8c34-4c3e-a096-48a133216016:Active,6090b90e-482e-47fa-bf64-10c2139bf903:Active,3997fbce-3f77-41b9-8250-a9981319f72e:Active,7f591cbf-1787-435d-881a-d55fb77ee636:Active,8b7ef24a-57c0-43e6-9f9e-b71a8701eacc:Active,08089fe3-ba56-428a-aea9-2155927db99f:Active,e4ecec40-eb11-4469-b4fb-a7119b24774a:Active,50d5c200-87ef-4780-9466-e927dd49f333:Active,dbc82e39-ead3-4868-899f-6208b8a34fd8:Attached,58953ce4-c34d-49e1-85a8-1d0033978059:Active,145656b4-c764-4b4f-8932-f0cc790ac7f5:Active,1851daa3-660f-44b2-8949-3e1a821b9b87:Active,beb7a15c-2041-4298-95b9-7c73657a140e:Active,216ae8e3-888f-41f0-9770-82af114a845f:Active,19c222a0-2957-4bd1-b96a-2282d9c2e691:Active,d8341ef4-e7cf-4387-92ed-7feb3b603bc6:Active,fe1c8078-1bf1-4cab-bea5-e0d9f665d082:Active,e9f39c0b-da13-4034-b42d-525c142a3a09:Active,80767ba5-7c87-45dd-9a32-7330ed889af5:Active,8b292f23-83d7-4c59-bdbd-6e675b10313c:Active,9b66b8d0-1a39-444a-bef5-c3b9b432cb49:Active,9479d431-1a05-4137-819e-507a4dae1c06:Active,ea38354b-1638-427f-b07a-fb7d68965573:Active,21b318d7-6324-496a-b480-db0943cb5054:Active,99bd47ea-2c85-4988-ad28-ae925a0ea3f6:Active,47e996a5-7031-4a67-bbd3-a982ddbaa385:Active,1f8ab569-e10c-4db2-89ea-1299c33657dc:Active,fadcbebe-e7e9-41d7-acc2-c6f7c88ddc21:Active,16ff2d50-2d03-44e7-aee4-bbde70bbc866:Active,8fa43837-ac15-48f4-ad00-c928899709a1:Active,0488b020-3c2b-4a20-822c-b92b3e498b03:Active,582f520c-cf40-498a-b999-6c5e0056509c:Active,f60fde72-53e0-4ead-b998-a5d985112707:Active,e8867e76-8e99-4681-afed-20bb0df04ac0:Active,15552cb3-d62d-4868-bc1c-64dcf8542d12:Active,50885a30-5bec-4b34-98ed-ed3c3306bad0:Attached,fdd19121-7026-41d7-8255-1ba29acd324c:Active,8f881682-847b-422f-97aa-3474f738d5d8:Active,c14a4d42-7e18-493c-96ed-8549bdd4b0bb:Active,8ea8a6d5-b5e6-4120-bed0-539dd45a2457:Active,caf20a16-d9bc-42fa-80aa-d6def94166a1:Active,bcdb0886-bd89-47c9-8b36-4d426e169a75:Active,57e4de3a-a5dd-4b94-8062-d2665d294268:Active,a7f1dba6-492e-43c8-84d0-74cc02a50519:Active,c5084f93-aa00-4f65-8532-65b3ca9dea7e:Active', 'POOL_SPM_ID=3', 'POOL_SPM_LVER=35', 'POOL_UUID=7f7534f7-e5d4-44b3-b34a-4d0743734e7d', u'PV0=pv:360060e8005719a000000719a000002e5,uuid:2OM14M-yrHa-XDMj-UqXj-kFNE-4Z31-NpDJfx,pestart:0,pecount:9009,mapoffset:0', 'ROLE=Master', 'SDUUID=80767ba5-7c87-45dd-9a32-7330ed889af5', 'TYPE=FCP', 'VERSION=3', 'VGUUID=3yCHe0-XH4g-rPhW-DXhk-4eDL-Q5K9-q5uxfa', '_SHA_CKSUM=c32dee7509906e5654b01cb4457e523810922d97']
...
MainThread::INFO::2014-08-27 13:55:30,989::vdsm::119::vds::(run) (PID: 22452) I am the actual vdsm 4.14.7-3.el6ev vcpfscrx900h01.vas.vodacom.corp (2.6.32-431.17.1.el6.x86_64)

Comment 1 Marina Kalinin 2014-09-11 01:26:47 UTC
My suggestion is to remove the fence command from spmprotect logic, since there is no reason to fence vdsm or a host, if ISO domain encounters a failure. ISO domain is not critical and should endanger the whole environment.

From /usr/libexec/vdsm/spmprotect.sh 
~~~
function start_renewal_loop() {
    local renewed curr i tl TPID
    while true ; do
        curr=`date +%s`
        debug "last renewal = $LAST_RENEWAL, curr = $curr"
        tl=$((LEASE_TIME_MS/1000-(curr*1000000-LAST_RENEWAL)/1000000))
        if [ "$tl" -gt "0" ] ; then
            (sleep $tl && fence) 2>/dev/null &
            disown
            TPID=$!
        else
            fence
        fi
~~~

Comment 7 Marina Kalinin 2014-09-15 21:43:00 UTC
Copying here Federico's comment for another bug, which I believe was meant for this bug:
~~~
This happens only when "acquire" succeeds and "renew" fails. The positive flow is quite fast usually (few seconds) but if the iso domains become unreachable between acquire and renew then the host is fenced.

So if you try to attach an iso that is not reachable you usually don't get fenced (unless as I said the first write is successful).
~~~

Comment 8 Marina Kalinin 2014-09-15 21:52:16 UTC
Reading the comments above, does not seem right to me, what we see in the log and what the code wants us to do.

Let me copy the spm-lock.log here again:
~~~~~~~~~~~~~
[2014-08-27 13:55:13] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:13] Trying to acquire lease - sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 lease_file=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases id=3 lease_time_ms=5000 io_op_to_ms=1000
[2014-08-27 13:55:15] Lease acquired sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases, TS=1409147713666790
[2014-08-27 13:55:15] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:15] Started renewal process (pid=21899) for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases

---> trying to fence 6 seconds later
[2014-08-27 13:55:21] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254

---> 2 seconds later - another attempt to fence
[2014-08-27 13:55:23] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254

---> 13 seconds later from initial attempt to renew - not 20 seconds.
[2014-08-27 13:55:28] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
/bin/kill -9 71254
[2014-08-27 13:55:30] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
kill 71254: No such process
/bin/kill -9 71254
kill 71254: No such process
sudo /sbin/reboot -f
sudo /sbin/reboot -f
~~~~

Comment 9 Marina Kalinin 2014-09-15 21:56:06 UTC
And yes, I understand, removing the fence from the logic completely is wrong for the case when we have Data storage domains of type V1, i.e. NFS storage DC.

However, the logic of spmProtect should be changed to distinguish between Data Domains and non-Data domains.

Comment 11 Marina Kalinin 2014-09-23 16:56:11 UTC
As discussed with Fede:

1. This is a very rare condition happening when we succeeded to acquire the lease, but fail to renew it right away. Cannot reproduce locally.

2. Ask the customer to reproduce, since not reproducable locally.
Test first, if we can remove the leases file and create new one from scratch. If works suggest customer to do this as well. Do it on customer site as well.

3. Should we change the code to differentiate between ISO domain and Data domain, and prevent vdsm fencing in the first case? Not really. Since if vdsm acquired the lease, it needs to release it, so that other host would be able acquire it. But if something wrong happened with vdsm and it cannot reach the leases file, it would not be able releasing it as well. However, we should consider not rebooting the host, if succeeded to reboot vdsmd.

Comment 12 Marina Kalinin 2014-09-23 19:47:59 UTC
Created attachment 940592 [details]
leases file in the correct format

Comment 13 Marina Kalinin 2014-09-23 19:58:40 UTC
Update: cannot remove leases file and use an empty one.
The file should be in a specific format.

Comment 14 Dan Kenigsberg 2014-09-29 13:20:10 UTC
clearing my needinfo, as Federico has explained the matter.

Comment 15 Tal Nisan 2014-10-07 12:49:45 UTC
Federico, given the input from Marina, do we have any news?

Comment 16 Marina Kalinin 2014-10-07 13:59:58 UTC
Just to let you know, customer didn't try attaching the ISO domain again yet, since he didn't want to endanger his production environment.
Next step we agreed would be trying to attach the same ISO domain to a different Data Center.

Comment 17 Federico Simoncelli 2014-10-07 15:43:06 UTC
(In reply to Tal Nisan from comment #15)
> Federico, given the input from Marina, do we have any news?

Status update is in comment 16, clearing needinfo.