Created attachment 495122 [details]
vdsm log

Description of problem:
The SPM host fences itself during creation of several templates.

The flow goes as follows:
1) The SPM is running.
2) Create 2 templates at once.
3) Once template creation starts, the following error starts to appear:

[2011-04-27 09:31:45] failed renewing lease
[2011-04-27 09:31:58] failed renewing lease
[2011-04-27 09:32:24] failed renewing lease
[2011-04-27 09:32:41] failed renewing lease
[2011-04-27 09:32:52] failed renewing lease
[2011-04-27 09:33:02] failed renewing lease

4) Then, after approximately 2 minutes, vdsm fences itself:

[2011-04-27 09:33:09] Fencing spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
/bin/kill -USR1 18675
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
[2011-04-27 09:33:12] failed renewing lease
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
[2011-04-27 09:33:16] Trying to stop vdsm for spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
sudo /sbin/service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop [ OK ]
ls: cannot access /var/run/vdsm/spmprotect/20094: No such file or directory
/usr/libexec/vdsm/spmprotect.sh: line 138: pushd: /var/run/vdsm/spmprotect/20094: No such file or directory
User defined signal 1
[2011-04-27 09:33:16] releasing lease spUUID=679ff66e-73ee-4e1f-a647-553a3e656512 id=1 lease_path=/dev/679ff66e-73ee-4e1f-a647-553a3e656512/leases
vdsm stop [ OK ]
Starting iscsid:
Starting up vdsm daemon:
vdsm start [ OK ]

Repro steps:
1) Create 2 or more templates in parallel.
2) The SPM host will fence itself.
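For readers unfamiliar with the mechanism: the behaviour in the log corresponds roughly to the sketch below. This is only an illustration of the lease-watchdog idea, not the actual spmprotect.sh logic; the renew_lease/fence_self callbacks, the renewal interval, and the 2-minute window are assumptions inferred from the timestamps above.

import time

# Assumed values, inferred from the log: renewal is retried every few seconds
# and fencing kicks in roughly 2 minutes after the last successful renewal.
LEASE_TIMEOUT = 120
RENEW_INTERVAL = 10


def protect_lease(renew_lease, fence_self):
    """Illustrative lease watchdog; not the real spmprotect.sh logic."""
    last_good = time.time()
    while True:
        if renew_lease():            # hypothetical callback: touch the lease volume
            last_good = time.time()
        else:
            print("failed renewing lease")
        if time.time() - last_good > LEASE_TIMEOUT:
            fence_self()             # hypothetical callback: stop/restart vdsm
            return
        time.sleep(RENEW_INTERVAL)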
This bug appears to have the same footprint as BZ 690520. Just a heads up; more updates will follow with further investigation.
Assigning it to myself for investigation.
Based upon the vdsm log files, the host lost its connection to the storage during the timestamp period [2011-04-27 09:31:45] and beyond. Hence, the master filesystem and lease volume were not available for renewing the lease. vdsm continued to do the right thing and kept checking the volume, as shown below (output from the vdsm log file). Fencing is expected in this scenario. Please reproduce this defect without losing the network connection and with the fix for bug 690520.

Thread-19::DEBUG::2011-04-27 09:54:10,452::lvm::358::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/1IET_00130001|/dev/mapper/1IET_00130002|/dev/mapper/1IET_00140001|/dev/mapper/1IET_00140002|/dev/mapper/1IET_00150001|/dev/mapper/1IET_00150002|/dev/mapper/1IET_00160001|/dev/mapper/1IET_00160002|/dev/mapper/1IET_00290001|/dev/mapper/1IET_00370001|/dev/mapper/1IET_00370002%\\", \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " 1d9687bb-7b75-40ee-a1d6-afc280f0a008' (cwd None)
Thread-19::DEBUG::2011-04-27 09:54:10,576::lvm::358::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 80530571264: Input/output error\n /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 80530628608: Input/output error\n /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 0: Input/output error\n /dev/mapper/1IET_00130001: read failed after 0 of 4096 at 4096: Input/output error\n /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 80530571264: Input/output error\n /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 80530628608: Input/output error\n /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 0: Input/output error\n /dev/mapper/1IET_00130002: read failed after 0 of 4096 at 4096: Input/output error\n'; <rc> = 0
Created attachment 499131 [details]
vdsm log - dafna

I had the same happen when importing a faulty template. vdsm gives this error:

Thread-974::ERROR::2011-05-16 15:13:15,329::task::855::TaskManager.Task::(_setError) Unexpected error
Thread-974::ERROR::2011-05-16 15:13:15,337::dispatcher::103::Storage.Dispatcher.Protect::(run) {'status': {'message': "Image does not exist in domain: 'image=0b121d4e-7350-40d9-b5dc-2825bb2e9587, domain=504932e1-3437-4d7d-a07c-b695afb59d88'", 'code': 258}, 'args': ['image=0b121d4e-7350-40d9-b5dc-2825bb2e9587, domain=504932e1-3437-4d7d-a07c-b695afb59d88']}

Then the export domain becomes inactive for 2 seconds. The vdsm log shows that it is getting spmStop:

Thread-978::INFO::2011-05-16 15:13:15,395::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: spmStop, args: ( spUUID=1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8)
Thread-978::DEBUG::2011-05-16 15:13:15,396::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: moving from state init -> state preparing
Thread-978::DEBUG::2011-05-16 15:13:15,396::spm::605::Storage.SPM::(public_spmStop) spUUID=1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8
Thread-978::DEBUG::2011-05-16 15:13:15,397::taskManager::105::TaskManager::(getAllTasks) Entry.
Thread-978::DEBUG::2011-05-16 15:13:15,397::taskManager::110::TaskManager::(getAllTasks) Return: {}
Thread-978::DEBUG::2011-05-16 15:13:15,398::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`32fbed79-464e-4108-860e-0e1544d5e30a`::Request was made in '/usr/share/vdsm/storage/spm.py' line '628' at 'public_spmStop'
Thread-978::DEBUG::2011-05-16 15:13:15,398::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' for lock type 'exclusive'
Thread-978::DEBUG::2011-05-16 15:13:15,399::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free. Now locking as 'exclusive' (1 active user)
Thread-978::DEBUG::2011-05-16 15:13:15,400::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`32fbed79-464e-4108-860e-0e1544d5e30a`::Granted request
Thread-978::DEBUG::2011-05-16 15:13:15,400::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: _resourcesAcquired: Storage.1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8 (exclusive)
Thread-978::DEBUG::2011-05-16 15:13:15,401::task::492::TaskManager.Task::(_debug) Task 47e6a719-8253-4dd5-afff-e5697add3af0: ref 1 aborting False
Thread-978::DEBUG::2011-05-16 15:13:15,402::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`9c2c106c-ff5a-42fa-889c-64dc5e9c88c2`::Request was made in '/usr/share/vdsm/storage/spm.py' line '639' at '_stop'
Thread-978::DEBUG::2011-05-16 15:13:15,402::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' for lock type 'exclusive'
Thread-978::DEBUG::2011-05-16 15:13:15,403::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free. Now locking as 'exclusive' (1 active user)
Thread-978::DEBUG::2011-05-16 15:13:15,403::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8`ReqID=`9c2c106c-ff5a-42fa-889c-64dc5e9c88c2`::Granted request
Thread-978::DEBUG::2011-05-16 15:13:15,404::resourceManager::517::ResourceManager::(releaseResource) Trying to release resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8'
Thread-978::DEBUG::2011-05-16 15:13:15,405::resourceManager::532::ResourceManager::(releaseResource) Released resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' (0 active users)
Thread-978::DEBUG::2011-05-16 15:13:15,405::resourceManager::537::ResourceManager::(releaseResource) Resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8' is free, finding out if anyone is waiting for it.
Thread-978::DEBUG::2011-05-16 15:13:15,406::resourceManager::544::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.upgrade_1d2f1160-1021-4f1e-ba3b-bd1d0c2afca8', Clearing records.
Thread-978::DEBUG::2011-05-16 15:13:15,406::taskManager::118::TaskManager::(unloadTasks) Entry.
Thread-978::DEBUG::2011-05-16 15:13:15,407::taskManager::122::TaskManager::(unloadTasks) Return
Thread-978::DEBUG::2011-05-16 15:13:15,407::spm::201::Storage.SPM::(__cleanupMasterMount) unmounting /rhev/data-center/mnt/blockSD/5bd17ddd-1cdc-4b25-bbbf-330b7c150d01/master
Thread-978::DEBUG::2011-05-16 15:13:15,408::fileUtils::109::Storage.Misc.excCmd::(umount) '/usr/bin/sudo -n /bin/umount -f /rhev/data-center/mnt/blockSD/5bd17ddd-1cdc-4b25-bbbf-330b7c150d01/master' (cwd None)

I attached the full log.
I'm getting the same issue when trying to create a new single VM (clone) from the User Portal. Adding a test blocker flag to this one and raising priority to urgent.
I couldn't reproduce this in vdsm-72. I tried two scenarios, both performed under load on the machine (4 cores, 6 processes running 'dd if=/dev/urandom > /dev/null'):
1) Created 7 templates in parallel as written in the description. All tasks ended successfully.
2) Created 3 clone VMs in the User Portal. Successful.
This has been on needinfo for over a month and Jakub couldn't reproduce it; closing as WORKSFORME.
Unfortunately, the issue is real and was reproduced with vdsm-4.9-73. It might be specific to a certain storage workload here in TLV, but it is still alive.
I encountered the same issue again when creating a VM from a template with copy + preallocated: the host fenced itself during creation of the VM. Federico issued a fix which, after testing with this scenario, solved the issue.
For future reference: the fix was a patched qemu-img that uses O_DIRECT for writing the images. This prevents the page cache from filling up with dirty buffers whose flushing starves safelease and keeps it from renewing the lease. The patch will be submitted to the appropriate mailing list.
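To illustrate the technique the patched qemu-img relies on (not its actual code), here is a minimal Python sketch of writing with O_DIRECT so the data bypasses the page cache. The path, block size, and payload are hypothetical; O_DIRECT requires the buffer and write size to be block-aligned, which is why an anonymous (page-aligned) mmap buffer is used.

import mmap
import os

BLOCK = 4096  # assumed logical block size; O_DIRECT needs block-aligned, block-sized I/O


def write_direct(path, data):
    """Write data to path bypassing the page cache (Linux O_DIRECT)."""
    padded = len(data) + (-len(data)) % BLOCK      # pad up to a block multiple
    buf = mmap.mmap(-1, padded)                    # anonymous mmap => page-aligned buffer
    buf[:len(data)] = data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)                          # goes straight to storage, not the cache
    finally:
        os.close(fd)
        buf.close()

# Hypothetical usage (the target must be on a filesystem that supports O_DIRECT):
# write_direct("/data/test.img", b"some image data")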
Patch posted upstream: http://marc.info/?l=kvm&m=130805442315023
In my scenario the host rebooted every time (4 out of 4) I imported a template from the NFS export domain. My environment contains one RHEL 6.1 host in a 3.0 Data Center with one iSCSI storage domain and one NFS export domain, qemu-img-0.12.1.2-2.164.el6.x86_64, and vdsm-4.9-75.el6.x86_64. I used fsimonce's patched qemu-img (it doesn't have a version yet), which fixed the reboot issue; the template import now passes successfully.
My comment 25 applies to the needinfo.
Hopefully, qemu-img bug 713743 will be resolved in such a way that we will not need to do anything here. In case qemu-img requires a special command line to avoid trashing the host cache, we could use something like: http://gerrit.usersys.redhat.com/589
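As a rough illustration of the "special command line" idea (not the content of the gerrit change above), a caller-side wrapper could look like the sketch below. It assumes a qemu-img build that supports the '-t' cache-mode option for convert ('none' meaning O_DIRECT on the destination); older 0.12 builds may not have it, and the paths and format are hypothetical.

import subprocess


def convert_image(src, dst, fmt="raw"):
    """Sketch: run qemu-img convert without going through the host page cache.

    Assumes qemu-img understands '-t none'; src/dst/fmt are hypothetical.
    """
    cmd = ["/usr/bin/qemu-img", "convert",
           "-t", "none",
           "-O", fmt,
           src, dst]
    subprocess.check_call(cmd)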
With the RFE in bug 713743 we would have fixed the symptom, but not the cause. It would stay a time bomb, so I believe we still need a proper fix even if we add the stopgap in qemu-img.
We are aware that the current fix is not optimal. In fact we are already working on other solutions (cgroup), but they are less likely to get into 6.2.
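For reference, the cgroup direction mentioned here would look roughly like the sketch below: cap the memory of the copying process with the v1 memory controller (whose limit also covers page cache), so dirty pages cannot pile up and starve lease renewal. The cgroup path, limit, and child command are assumptions, and this is only a sketch, not the planned implementation.

import os
import subprocess

CGROUP = "/sys/fs/cgroup/memory/vdsm-copy"  # hypothetical memory cgroup (v1 layout)
LIMIT = str(256 * 1024 * 1024)              # assumed cap on anon + page cache memory


def run_capped(cmd):
    """Run cmd with its memory (including page cache) capped; requires root."""
    if not os.path.isdir(CGROUP):
        os.makedirs(CGROUP)
    with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(LIMIT)
    proc = subprocess.Popen(cmd)
    # Account the child's pages to the capped cgroup so dirty cache is reclaimed
    # instead of piling up (its earliest I/O may escape the cgroup in this sketch).
    with open(os.path.join(CGROUP, "tasks"), "w") as f:
        f.write(str(proc.pid))
    return proc.wait()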
Verified with vdsm-4.9-91 - could not reproduce any of the above-mentioned issues; therefore moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2011-1782.html