Bug 699976

Summary: blocked on qemu-img: SPM fences itself during creation of several templates

Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Haim <hateya>
Assignee: Dan Kenigsberg <danken>
QA Contact: yeylon <yeylon>
CC: abaron, bazulay, cww, dron, egerman, fsimonce, iheim, jlibosva, kwolf, mgoldboi, pstehlik, rvaknin, sgrinber, smehrotr, srevivo, tdosek, yeylon, ykaul
Target Milestone: rc
Target Release: ---
Whiteboard: storage
Fixed In Version: vdsm-4.9-87
Doc Type: Bug Fix
Last Closed: 2011-12-06 07:14:54 UTC
Bug Depends On: 713743

Attachments:
  - vdsm log. (flags: none)
  - vdsm log - dafna (flags: none)

Description Haim 2011-04-27 07:17:17 UTC
Created attachment 495122 [details]
vdsm log.

Description of problem:

The SPM host fences itself during creation of several templates.
The flow goes as follows:

1) the SPM is running
2) create 2 templates at once
3) as template creation starts, the following error starts to appear:

[2011-04-27 09:31:45] failed renewing lease
[2011-04-27 09:31:58] failed renewing lease
[2011-04-27 09:32:24] failed renewing lease
[2011-04-27 09:32:41] failed renewing lease
[2011-04-27 09:32:52] failed renewing lease
[2011-04-27 09:33:02] failed renewing lease

4) then, after approximately 2 minutes, vdsm fences itself:

[2011-04-27 09:33:09] Fencing spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
/bin/kill -USR1 18675
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
[2011-04-27 09:33:12] failed renewing lease
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
[2011-04-27 09:33:16] Trying to stop vdsm for spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
sudo /sbin/service vdsmd restart
Shutting down vdsm daemon: 
vdsm watchdog stop[  OK  ]
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
User defined signal 1
[2011-04-27 09:33:16] releasing lease spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
vdsm stop[  OK  ]
Starting iscsid: 
Starting up vdsm daemon: 
vdsm start[  OK  ]


Reproduction steps:

1) create 2 or more templates in parallel
2) the SPM host will fence itself
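
To clarify the timing above, here is a toy sketch (assumed behavior, not vdsm's actual spmprotect.sh logic) of a renew-or-fence watchdog: while lease renewals keep failing the protect loop waits, and after roughly two minutes without a successful renewal it fences the host by restarting vdsm. Interval and window values are illustrative.

import subprocess
import time

RENEW_INTERVAL = 20    # seconds between renewal attempts (illustrative)
FAILURE_WINDOW = 120   # fence after ~2 minutes without a successful renewal (illustrative)

def renew_lease():
    # Placeholder for the real lease-renewal I/O against the leases volume;
    # it should return False when the write fails (e.g. storage unreachable).
    return False

def protect_loop():
    last_success = time.monotonic()
    while True:
        if renew_lease():
            last_success = time.monotonic()
        if time.monotonic() - last_success > FAILURE_WINDOW:
            # Self-fence: restart vdsm so another host can safely become SPM.
            subprocess.call(["sudo", "/sbin/service", "vdsmd", "restart"])
            return
        time.sleep(RENEW_INTERVAL)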

Comment 2 Sanjay Mehrotra 2011-04-27 14:31:01 UTC
This bug appears to have the same footprint as BZ 690520. Just a heads up; more updates will follow with further investigation.

Comment 3 Sanjay Mehrotra 2011-05-03 04:59:02 UTC
Assigning it to myself for investigation.

Comment 5 Sanjay Mehrotra 2011-05-03 21:47:01 UTC
Based on the vdsm log files, the host lost its connection to the storage from timestamp [2011-04-27 09:31:45] onward. Hence, the master filesystem and lease volume were not available for renewing the lease. vdsm continued to do the right thing and kept checking the volume, as shown below (output from the vdsm log file):

Fencing is expected in this scenario. Please try to reproduce this defect without losing the network connection and with the fix for bug 690520 in place.


Thread-19::DEBUG::2011-04-27 09:54:10,452::lvm::358::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended      _devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/1IET_00130001|/dev/mapper/1IET_00130002|/dev/mapper/1IET_00140001|/dev/mapper/1IET_00140002|/dev/mapper/1IET_00150001|/dev/mapper/1      IET_00150002|/dev/mapper/1IET_00160001|/dev/mapper/1IET_00160002|/dev/mapper/1IET_00290001|/dev/mapper/1IET_00370001|/dev/mapper/1IET_00370002%\\", \\"r%.*%\\" ] }  global {  locking_type=1        prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " 1d9687bb-7b75-40ee-a1d6-afc280f0a008' (cwd None)
56020 Thread-19::DEBUG::2011-04-27 09:54:10,576::lvm::358::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 80530571264: Input/output error\      n  /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 80530628608: Input/output error\n  /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper      /1IET_00130001: read failed after 0 of 4096 at 4096: Input/output error\n  /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 80530571264: Input/output error\n  /dev/mapper/1IET_00130      002: read failed after 0 of 4096 at 80530628608: Input/output error\n  /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/1IET_00130002: read fail      ed after 0 of 4096 at 4096: Input/output error\n'; <rc> = 0

Comment 6 Dafna Ron 2011-05-16 12:36:35 UTC
Created attachment 499131 [details]
vdsm log - dafna

The same thing happened to me when importing a faulty template.
vdsm gives this error:

Thread-974::ERROR::2011-05-16 15:13:15,329::task::855::TaskManager.Task::(_setError) Unexpected error
Thread-974::ERROR::2011-05-16 15:13:15,337::dispatcher::103::Storage.Dispatcher.Protect::(run) {'status': {'message': "Image does not exist in domain: 'image=0b121d4e-7350-40d9-b5dc-2825bb2e9587, domain=504932e1-3437-4d7d-a07c-b695afb59d88'", 'code': 258}, 'args': ['image=0b121d4e-7350-40d9-b5dc-2825bb2e9587, domain=504932e1-3437-4d7d-a07c-b695afb59d88']}


Then the export domain becomes inactive for 2 seconds.

The vdsm log shows that it receives spmStop:

Thread-978::INFO::2011-05-16 15:13:15,395::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: spmStop, args: ( spUUID=1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8)
Thread-978::DEBUG::2011-05-16 15:13:15,396::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: moving from state init -> state preparing
Thread-978::DEBUG::2011-05-16 15:13:15,396::spm::605::Storage.SPM::(public_spmStop) spUUID=1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8
Thread-978::DEBUG::2011-05-16 15:13:15,397::taskManager::105::TaskManager::(getAllTasks) Entry.
Thread-978::DEBUG::2011-05-16 15:13:15,397::taskManager::110::TaskManager::(getAllTasks) Return: {}
Thread-978::DEBUG::2011-05-16 15:13:15,398::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`32fbed79-464e-4108-860e-0e1544d5e30a`::Request was made in '/usr/share/vdsm/storage/spm.py' line '628' at 'public_spmStop'
Thread-978::DEBUG::2011-05-16 15:13:15,398::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' for lock type 'exclusive'
Thread-978::DEBUG::2011-05-16 15:13:15,399::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free. Now locking as 'exclusive' (1 active user)
Thread-978::DEBUG::2011-05-16 15:13:15,400::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`32fbed79-464e-4108-860e-0e1544d5e30a`::Granted request
Thread-978::DEBUG::2011-05-16 15:13:15,400::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: _resourcesAcquired: Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8 (exclusive)
Thread-978::DEBUG::2011-05-16 15:13:15,401::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: ref 1 aborting False
Thread-978::DEBUG::2011-05-16 15:13:15,402::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`9c2c106c-ff5a-42fa-889c-64dc5e9c88c2`::Request was made in '/usr/share/vdsm/storage/spm.py' line '639' at '_stop'
Thread-978::DEBUG::2011-05-16 15:13:15,402::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' for lock type 'exclusive'
Thread-978::DEBUG::2011-05-16 15:13:15,403::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free. Now locking as 'exclusive' (1 active user)
Thread-978::DEBUG::2011-05-16 15:13:15,403::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`9c2c106c-ff5a-42fa-889c-64dc5e9c88c2`::Granted request
Thread-978::DEBUG::2011-05-16 15:13:15,404::resourceManager::517::ResourceManager::(releaseResource) Trying to release resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8'
Thread-978::DEBUG::2011-05-16 15:13:15,405::resourceManager::532::ResourceManager::(releaseResource) Released resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' (0 active users)
Thread-978::DEBUG::2011-05-16 15:13:15,405::resourceManager::537::ResourceManager::(releaseResource) Resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free, finding out if anyone is waiting for it.
Thread-978::DEBUG::2011-05-16 15:13:15,406::resourceManager::544::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8', Clearing records.
Thread-978::DEBUG::2011-05-16 15:13:15,406::taskManager::118::TaskManager::(unloadTasks) Entry.
Thread-978::DEBUG::2011-05-16 15:13:15,407::taskManager::122::TaskManager::(unloadTasks) Return
Thread-978::DEBUG::2011-05-16 15:13:15,407::spm::201::Storage.SPM::(__cleanupMasterMount) unmounting /rhev/data-center/mnt/blockSD/5bd17ddd-1cdc-4b25-bbbf-330b7c150d01/master
Thread-978::DEBUG::2011-05-16 15:13:15,408::fileUtils::109::Storage.Misc.excCmd::(umount) '/usr/bin/sudo -n /bin/umount -f /rhev/data-center/mnt/blockSD/5bd17ddd-1cdc-4b25-bbbf-330b7c150d01/master' (cwd None)


I attached the full log.

Comment 7 yeylon@redhat.com 2011-05-23 14:08:20 UTC
I'm getting the same issue when trying to create a single new VM (clone) from the user portal!

Adding a test blocker flag to this one and raising the priority to urgent.

Comment 10 Jakub Libosvar 2011-06-03 14:28:20 UTC
I couldn't reproduce this in vdsm-72. I tried two scenarios, both performed with the machine under load (4 cores, 6 processes running 'dd if=/dev/urandom > /dev/null'):

1) Created 7 templates in parallel as described in the description. All tasks ended successfully.

2) Created 3 clone VMs in the User Portal. Successful.

Comment 11 Ayal Baron 2011-06-05 06:33:42 UTC
This has been on needinfo for over a month and Jakub couldn't reproduce it, so closing as WORKSFORME.

Comment 12 Evgeniy German 2011-06-05 09:48:56 UTC
Unfortunately, the issue is real and was reproduced with vdsm-4.9-73. It might be specific to a certain storage workload here in TLV, but it is still alive.

Comment 21 Dafna Ron 2011-06-14 09:17:29 UTC
I encountered the same issue again when creating a VM from a template with copy+preallocated;
the host fenced itself during creation of the VM.

Federico issued a fix which, after testing with this scenario, solved the issue.

Comment 22 Federico Simoncelli 2011-06-14 09:30:41 UTC
For future reference: the fix was a patched qemu-img that uses O_DIRECT for writing the images. This keeps large copies from filling the host page cache with dirty buffers, whose flushing would otherwise block safelease from renewing the lease. The patch will be submitted to the appropriate mailing list.
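
Below is a minimal, illustrative sketch of the mechanism described above (not the actual qemu-img patch): on Linux, writing through O_DIRECT bypasses the host page cache, so a large image copy does not accumulate dirty buffers whose flush would starve the lease-renewal I/O. The file name, block size and data are placeholders.

import os
import mmap

BLOCK = 4096  # O_DIRECT requires block-aligned, block-sized buffers and offsets

# mmap(-1, ...) returns an anonymous, page-aligned buffer, which satisfies the
# alignment requirement of O_DIRECT on Linux.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"\0" * BLOCK)

# Open the destination with O_DIRECT so writes go straight to storage instead
# of piling up in the page cache.
fd = os.open("/tmp/odirect-demo.img", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)
finally:
    os.close(fd)
    buf.close()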

Comment 24 Federico Simoncelli 2011-06-14 12:52:46 UTC
Patch posted upstream:

http://marc.info/?l=kvm&m=130805442315023

Comment 25 Rami Vaknin 2011-06-16 10:31:47 UTC
In my scenario the host rebooted every time (4 out of 4) I imported a template from the NFS export domain. My environment contains one RHEL 6.1 host in a 3.0 data center with one iSCSI storage domain and one NFS export domain, running qemu-img-0.12.1.2-2.164.el6.x86_64 and vdsm-4.9-75.el6.x86_64.

I used fsimonce's patched qemu-img (it doesn't have a version yet), which fixed the reboot issue; the template import now passes successfully.

Comment 26 Rami Vaknin 2011-06-16 10:34:10 UTC
My comment 25 applies to the needinfo request.

Comment 28 Dan Kenigsberg 2011-06-18 18:47:13 UTC
Hopefully, qemu-img bug 713743 will be resolved in such a way that we would not need to do a thing here. In case qemu-img requires a special command line to avoid trashing the host cache, we could use something like:

http://gerrit.usersys.redhat.com/589
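
As a hedged illustration of that approach (not the actual gerrit change), a vdsm-side wrapper might pass a cache-bypass switch to qemu-img. The "-t none" option is assumed to be what the bug 713743 RFE would provide for selecting the destination cache mode; it is not confirmed for the qemu-img build discussed here, and the paths are illustrative.

import subprocess

QEMU_IMG = "/usr/bin/qemu-img"

def convert_bypassing_host_cache(src, dst, dst_format="raw"):
    # "-t none" is the assumed cache-mode switch from the bug 713743 RFE; with
    # it the destination would be written without going through the host page
    # cache, so safelease I/O is not starved by dirty-buffer flushes.
    cmd = [QEMU_IMG, "convert", "-t", "none", "-O", dst_format, src, dst]
    subprocess.check_call(cmd)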

Comment 29 Kevin Wolf 2011-06-21 07:58:31 UTC
With the RFE in bug 713743 we would have fixed the symptom, but not the cause. It would stay a time bomb, so I believe we still need a proper fix even if we add the stopgap in qemu-img.

Comment 31 Federico Simoncelli 2011-06-21 09:20:40 UTC
We are aware that the current fix is not optimal.
In fact, we are already working on other solutions (cgroups), but they are less likely to get into 6.2.
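
To illustrate the cgroup direction mentioned above (purely a sketch, not vdsm's actual implementation), the cgroup v1 blkio controller can throttle a copy process's writes so lease renewal keeps getting bandwidth. The cgroup path, device numbers and bandwidth limit below are assumptions for illustration.

import os

CGROUP = "/cgroup/blkio/vdsm-copy"   # RHEL 6 mounts the blkio controller under /cgroup
DEVICE = "253:2"                     # major:minor of the destination device (example)
LIMIT = 10 * 1024 * 1024             # cap writes at roughly 10 MB/s (example)

os.makedirs(CGROUP, exist_ok=True)
with open(os.path.join(CGROUP, "blkio.throttle.write_bps_device"), "w") as f:
    f.write("%s %d\n" % (DEVICE, LIMIT))

# A qemu-img child whose PID is then written to CGROUP/tasks would have its
# writes to that device throttled, leaving headroom for lease-renewal I/O.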

Comment 32 Tomas Dosek 2011-08-08 10:43:14 UTC
Verified with vdsm-4.9-91 - could not reproduce any of the above-mentioned issues, therefore moving to VERIFIED.

Comment 33 errata-xmlrpc 2011-12-06 07:14:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html