Description of problem: During I/O to disk on guest start LSM. After a minute the operation hangs and an error is reported that connection has been lost with the qemu process Version-Release number of selected component (if applicable): rhevm-3.6.0-0.12.master.el6.noarch vdsm-4.17.3-1.el7ev.noarch How reproducible: Ran this once Steps to Reproduce: 1. Create a Vm with a 14g nfs thin disk, run the VM and install OS 2. Start writing to the disk with DD 3. During the I/O on the guest start the LSM >>>>> After a minute the operation hangs and an error is reported indicating that connection to the qemu process has been lost Actual results: Connection to the qemu process is lost during I/O to the disk and LSM of the disk Expected results: The LSM and should suceed and the LSM should complete Additional info: Engine Log: -------------------------- 2015-09-10 13:54:48,905 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] FINISH, DestroyVDSCommand, log id: 4174c7cc 2015-09-10 13:54:49,004 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM VM18 is down with error. Exit mess age: Lost connection with qemu process. 2015-09-10 13:54:49,006 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '082836ca-4a6d-418e-b46f-82e36e6906af(VM18) is running in db and not running in VDS 'level1_vm01' 2015-09-10 13:54:49,034 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-7-thread-5) [55d6e382] Running command: ProcessDownVmCommand internal: true. 2015-09-10 14:11:46,582 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] SPMAsyncTask::PollTask: Polling task 'f642551d-3ab9-4108-ad15-7468cfe9179b' (Parent Command 'LiveMigrateDisk', Para meters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'. 2015-09-10 14:11:46,595 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] BaseAsyncTask::onTaskEndSuccess: Task 'f642551d-3ab9-4108-ad15-7468cfe9179b' (Parent Command 'LiveMigrateDisk', Par ameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully. 2015-09-10 14:11:46,599 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] CommandAsyncTask::endActionIfNecessary: All tasks of command 'ed2ec1ed-eafa-485e-934d-711fdffe17fd' has ended - > executing 'endAction' 2015-09-10 14:11:46,599 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: 'ed2ec1ed-eafa-485e-934d-711fdffe17fd'): calling endAction '. 2015-09-10 14:11:46,599 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-30) [6f96321] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'LiveMigrateDisk', executi onIndex: '1' 2015-09-10 14:11:46,863 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Ending command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' successfully. 2015-09-10 14:11:46,864 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Running command: LiveMigrateDiskCommand Task handler: VmReplicateDiskFinishTaskHandler internal: false. E ntities affected : ID: 3242921b-9642-4137-9e36-df7c2802e887 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER, ID: ca046d48-3cf1-4d80-a6e0-2bf076224033 Type: StorageAction group CREATE_DISK with role type USER 2015-09-10 14:11:46,890 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Cannot get vdsManager for vdsid='00000000-0000-0000-0000-000000000000'. 2015-09-10 14:11:46,891 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] CreateCommand failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004) 2015-09-10 14:11:46,892 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Exception: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [rt.jar:1.8.0_51] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [rt.jar:1.8.0_51] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.8.0_51] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) [rt.jar:1.8.0_51] at org.ovirt.engine.core.vdsbroker.ResourceManager.instantiateInjectedCommand(ResourceManager.java:432) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:416) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:464) [vdsbroker.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:] at org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler.runVdsCommand(AbstractSPMAsyncTaskHandler.java:48) [bll.jar:] at org.ovirt.engine.core.bll.lsm.VmReplicateDiskFinishTaskHandler.beforeTask(VmReplicateDiskFinishTaskHandler.java:60) [bll.jar:] at org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler.execute(AbstractSPMAsyncTaskHandler.java:38) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1209) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1355) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1979) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:] at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1392) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:644) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:591) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1981) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:] : VDSM log ------------------ Thread-30::DEBUG::2015-09-10 13:54:50,110::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with FileMetadataRW backend Thread-28214::DEBUG::2015-09-10 13:54:49,972::fileSD::169::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/10.35.160.108:_Storage_kev2/b0ae1c01-b71d-4ad4-a9a8-84ce1dc536f8/dom_md/metadata iflag=direct of=/dev/nul l bs=4096 count=1 (cwd None) Thread-28214::DEBUG::2015-09-10 13:54:50,160::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n329 bytes (329 B) copied, 0.015572 s, 21.1 kB/s\n'; <rc> = 0 Thread-30::DEBUG::2015-09-10 13:54:50,198::persistentDict::234::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=nfs1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIMESEC=60', 'LOCKPOLICY=', 'LO CKRENEWALINTERVALSEC=5', 'MASTER_VERSION=1', 'POOL_DESCRIPTION=dc1', 'POOL_DOMAINS=7e76fac6-3777-48e1-8533-1dce9ddff4e1:Active', 'POOL_SPM_ID=-1', 'POOL_SPM_LVER=-1', 'POOL_UUID=8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2', 'REMOTE_PATH=10.35.1 60.108:/Storage/kev1', 'ROLE=Master', 'SDUUID=7e76fac6-3777-48e1-8533-1dce9ddff4e1', 'TYPE=NFS', 'VERSION=3', '_SHA_CKSUM=23f0390c1e39e94d4dd39b5d9ba4dc953dae3a11'] Thread-30::DEBUG::2015-09-10 13:54:50,210::fileSD::643::Storage.StorageDomain::(imageGarbageCollector) Removing remnants of deleted images [] Thread-30::INFO::2015-09-10 13:54:50,211::sd::442::Storage.StorageDomain::(_registerResourceNamespaces) Resource namespace 7e76fac6-3777-48e1-8533-1dce9ddff4e1_imageNS already registered Thread-30::INFO::2015-09-10 13:54:50,211::sd::450::Storage.StorageDomain::(_registerResourceNamespaces) Resource namespace 7e76fac6-3777-48e1-8533-1dce9ddff4e1_volumeNS already registered Thread-30::DEBUG::2015-09-10 13:54:50,225::fileSD::169::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/10.35.160.108:_Storage_kev1/7e76fac6-3777-48e1-8533-1dce9ddff4e1/dom_md/metadata iflag=direct of=/dev/null b s=4096 count=1 (cwd None) Thread-30::DEBUG::2015-09-10 13:54:50,248::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n455 bytes (455 B) copied, 0.00454271 s, 100 kB/s\n'; <rc> = 0 periodic/141::WARNING::2015-09-10 13:54:50,346::periodic::253::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'082836ca-4a6d-418e-b46f-82e36e6906af'] Thread-28228::DEBUG::2015-09-10 13:54:50,479::monitor::288::Storage.Monitor::(_refreshDomain) Refreshing domain eba7102c-4651-44d4-9285-2b45056710e9 Thread-28228::DEBUG::2015-09-10 13:54:50,480::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with LvMetadataRW backend Thread-28228::DEBUG::2015-09-10 13:54:50,480::blockSD::337::Storage.Misc.excCmd::(readlines) /usr/bin/dd iflag=direct skip=0 bs=2048 if=/dev/eba7102c-4651-44d4-9285-2b45056710e9/metadata count=1 (cwd None) VM Channels Listener::DEBUG::2015-09-10 13:54:50,661::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 15 VM Channels Listener::DEBUG::2015-09-10 13:54:50,662::guestagent::218::virt.vm::(_connect) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Attempting connection to /var/lib/libvirt/qemu/channels/082836ca-4a6d-418e-b46f-82e36e6906af.com.redh at.rhevm.vdsm VM Channels Listener::DEBUG::2015-09-10 13:54:50,662::guestagent::232::virt.vm::(_connect) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Failed to connect to /var/lib/libvirt/qemu/channels/082836ca-4a6d-418e-b46f-82e36e6906af.com.redhat.r hevm.vdsm with 2 periodic/137::ERROR::2015-09-10 13:54:50,663::vm::960::virt.vm::(_getExtendCandidates) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer periodic/137::DEBUG::2015-09-10 13:54:50,663::executor::178::Executor::(_run) Worker was discarded Thread-28228::DEBUG::2015-09-10 13:54:50,684::blockSD::337::Storage.Misc.excCmd::(readlines) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2048 bytes (2.0 kB) copied, 0.0147833 s, 139 kB/s\n'; <rc> = 0 Thread-28228::DEBUG::2015-09-10 13:54:50,685::misc::260::Storage.Misc::(validateDDBytes) err: ['1+0 records in', '1+0 records out', '2048 bytes (2.0 kB) copied, 0.0147833 s, 139 kB/s'], size: 2048 Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::234::Storage.PersistentDict::(refresh) read lines (LvMetadataRW)=[] Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::252::Storage.PersistentDict::(refresh) Empty metadata Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with VGTagMetadataRW backend Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::504::Storage.OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::506::Storage.OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::514::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex Thread-28228::DEBUG::2015-09-10 13:54:50,687::lvm::526::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex Thread-28228::DEBUG::2015-09-10 13:54:50,687::lvm::371::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex Thread-28228::DEBUG::2015-09-10 13:54:50,688::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_e rror_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a03000798dd6205b34e511|/dev/mapper/360060160f4a030007b8dd6205b34e511|/dev/mapper/360060160f4a030007d8dd6205b34e511|/dev/mapper/360060160f4a03000979e521de bece411|/dev/mapper/360060160f4a03000999e521debece411|/dev/mapper/360060160f4a030009b9e521debece411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name eba7102c-4651-44d4-9285 -2b45056710e9 (cwd None)
Created attachment 1072250 [details] engine, vdsm, server logs Added logs
I REFINED THE TEST AND HERE ARE THE RESULTS: 1. Ran the scenario again and the error occured again - so this SPECIFICALLY happens when running dd BEFORE starting the LSM and AGAIN a second session during the SLM 2. Ran the same scenario as above running the dd session BEFORE the lsm BUT did NOT start a second dd write during the LSM and it worked FINE 3. Ran the same scenario with only installing the OS and no dd and it worked FINE! ERROR in engine: 2015-09-10 19:36:04,935 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm20 is down with error. Exit message: Lost connection with qemu process. Error in vdsm: Thread-69696::DEBUG::2015-09-10 19:36:04,422::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033' for lock type 'shared' Thread-69696::DEBUG::2015-09-10 19:36:04,422::resourceManager::566::Storage.ResourceManager::(registerResource) Resource 'Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033' found in shared state and queue is empty, Joining current shared lock (2 active users) Thread-69696::DEBUG::2015-09-10 19:36:04,423::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033`ReqID=`ce5f0cdf-a7e1-4a68-9286-6611ec296cc9`::Granted request Thread-69696::DEBUG::2015-09-10 19:36:04,423::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`004a38fa-7909-4371-872b-115ea7d845a4`::_resourcesAcquired: Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033 (shared) Thread-69696::DEBUG::2015-09-10 19:36:04,423::task::993::Storage.TaskManager.Task::(_decref) Task=`004a38fa-7909-4371-872b-115ea7d845a4`::ref 1 aborting False Thread-69696::DEBUG::2015-09-10 19:36:04,424::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`ca046d48-3cf1-4d80-a6e0-2bf076224033_imageNS.ec548166-bfec-4fd1-9eeb-d81e3fae2348`ReqID=`27d27c49-5ac6-43cc-aa04-3bd 1ea5cd526`::Request was made in '/usr/share/vdsm/storage/sp.py' line '1915' at 'setVolumeDescription' Thread-69696::DEBUG::2015-09-10 19:36:04,424::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'ca046d48-3cf1-4d80-a6e0-2bf076224033_imageNS.ec548166-bfec-4fd1-9eeb-d81e3fae2348' for lock type 'exclusive' Thread-69696::DEBUG::2015-09-10 19:36:04,425::lvm::428::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' got the operation mutex Thread-69696::DEBUG::2015-09-10 19:36:04,425::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_e rror_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a03000798dd6205b34e511|/dev/mapper/360060160f4a030007b8dd6205b34e511|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_fo r_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags ca046d48-3cf1-4d80-a6e0-2bf07622 4033 (cwd None) periodic/255::WARNING::2015-09-10 19:36:04,905::periodic::253::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'169380d1-5def-4942-8a31-f42ffa9c9ba3'] mailbox.SPMMonitor::DEBUG::2015-09-10 19:36:06,192::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 ( cwd None) mailbox.SPMMonitor::DEBUG::2015-09-10 19:36:06,244::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0368906 s, 27.8 MB/s\n'; <rc> = 0 VM Channels Listener::DEBUG::2015-09-10 19:36:06,677::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 15 VM Channels Listener::DEBUG::2015-09-10 19:36:06,677::guestagent::218::virt.vm::(_connect) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Attempting connection to /var/lib/libvirt/qemu/channels/169380d1-5def-4942-8a31-f42ffa9c9ba3.com.redh at.rhevm.vdsm VM Channels Listener::DEBUG::2015-09-10 19:36:06,678::guestagent::232::virt.vm::(_connect) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Failed to connect to /var/lib/libvirt/qemu/channels/169380d1-5def-4942-8a31-f42ffa9c9ba3.com.redhat.r hevm.vdsm with 111 periodic/251::ERROR::2015-09-10 19:36:06,892::vm::960::virt.vm::(_getExtendCandidates) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer periodic/251::DEBUG::2015-09-10 19:36:06,892::executor::178::Executor::(_run) Worker was discarded libvirtEventLoop::DEBUG::2015-09-10 19:36:06,894::vm::4415::virt.vm::(onLibvirtLifecycleEvent) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::event Stopped detail 5 opaque None libvirtEventLoop::INFO::2015-09-10 19:36:06,894::vm::807::virt.vm::(_onQemuDeath) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::underlying process disconnected libvirtEventLoop::INFO::2015-09-10 19:36:06,895::vm::3687::virt.vm::(releaseVm) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Release VM resources libvirtEventLoop::DEBUG::2015-09-10 19:36:06,897::vmchannels::214::vds::(unregister) Delete fileno 15 from listener. libvirtEventLoop::DEBUG::2015-09-10 19:36:06,897::task::595::Storage.TaskManager.Task::(_updateState) Task=`cedaa805-48d0-42e3-b98d-b36d3ce66c11`::moving from state init -> state preparing libvirtEventLoop::INFO::2015-09-10 19:36:06,898::logUtils::48::dispatcher::(wrapper) Run and protect: teardownImage(sdUUID=u'7e76fac6-3777-48e1-8533-1dce9ddff4e1', spUUID=u'8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2', imgUUID=u'1f7c93ee-a26a-4a24-8607-d9c36b58ecb9', volUUID=None)
Daniel - please take a look. Kevin - what versions of qmeu and libvirt are you using?
Maybe related: We saw similar behaviour during Live Merge. QEMU got unresponsive round about 30 seconds after start of action because of uninterruptible bitmap task. More details in BZ1262251
Thanks Markus! Kevin, can this be another issue related to the qemu patch Markus mention in https://bugzilla.redhat.com/show_bug.cgi?id=1262251#c13 ?
I think that's unlikely. What Markus reported stalls guest I/O while starting the mirror block job (when it checks which parts of the image are allocated and must be copied), but not when it's active and mirroring new data. In contrast, here the problem seems to only occur when you write new data while it's active (as Kevin described in comment 2). This is after the point where that patch could have an effect.
(In reply to Allon Mureinik from comment #3) > Daniel - please take a look. > > Kevin - what versions of qmeu and libvirt are you using? qemu ---------------------- qemu-kvm-rhev-2.3.0-19.el7.x86_64 libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64 qemu-kvm-tools-rhev-2.3.0-19.el7.x86_64 qemu-img-rhev-2.3.0-19.el7.x86_64 qemu-kvm-common-rhev-2.3.0-19.el7.x86_64 ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch libvirt --------------------- libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64 libvirt-daemon-1.2.17-5.el7.x86_64 libvirt-daemon-driver-secret-1.2.17-5.el7.x86_64 libvirt-daemon-driver-network-1.2.17-5.el7.x86_64 libvirt-daemon-kvm-1.2.17-5.el7.x86_64 libvirt-client-1.2.17-5.el7.x86_64 libvirt-daemon-driver-nwfilter-1.2.17-5.el7.x86_64 libvirt-daemon-config-nwfilter-1.2.17-5.el7.x86_64 libvirt-daemon-driver-nodedev-1.2.17-5.el7.x86_64 libvirt-daemon-driver-interface-1.2.17-5.el7.x86_64 libvirt-daemon-driver-storage-1.2.17-5.el7.x86_64 libvirt-python-1.2.17-2.el7.x86_64 libvirt-lock-sanlock-1.2.17-5.el7.x86_64
Aharon - what is the frequency of this issue? Additional repro (comment 2) seems to suggest it is happening in more specific scenarios - so why is it blocking testing the above BZs? Let's try to find more issues in the other features and not hold up because of this one, if possible.
Hi Kevin, According to the described scenario, connection to qemu process is lost while dd to the guest during LSM [1]. What additional data is needed to determine whether there's an issue in qemu side (rather than vdsm)? Thanks!, Daniel [1] https://bugzilla.redhat.com/show_bug.cgi?id=1261980#c2 ("Message: VM vm20 is down with error. Exit message: Lost connection with qemu process.")
If it's an easily reproducible scenario: 1. What about the qemu log? 2. Please have Virt-QE reproduce it. 3. Attach GDB to the QEMU process (need to be quick - or VDSM will think the VM is not responsive!) and get a nice stack when it blows up.
(In reply to Yaniv Kaul from comment #8) > Aharon - what is the frequency of this issue? Additional repro (comment 2) > seems to suggest it is happening in more specific scenarios - so why is it > blocking testing the above BZs? > > Let's try to find more issues in the other features and not hold up because > of this one, if possible. It happens every time when running steps from comment #2. I blocked the tracker bugs because it looks very basic to me - we can discuss it and decide if it is not. Kevin already completed running all other cases so other than that we are good. Kevin will answer comment #10 asap.
Yaniv already answered for me in comment 10, and I have nothing to add at this point, so I'm just unsetting my needinfo.
(In reply to Yaniv Kaul from comment #10) > If it's an easily reproducible scenario: > 1. What about the qemu log? > 2. Please have Virt-QE reproduce it. > 3. Attach GDB to the QEMU process (need to be quick - or VDSM will think the > VM is not responsive!) and get a nice stack when it blows up. Ran this scenario again. This time connection to the VM was lost during the first write and LSM as follows: Steps to Reproduce: 1. Create a Vm with a 14g nfs thin disk, run the VM and install OS 2. Start writing to the disk with DD 3. During the I/O on the guest start the LSM >>>>> After a minute the operation hangs and an error is reported indicating that connection to the qemu process has been lost Actual results: Connection to the qemu process is lost during I/O to the disk and LSM of the disk Expected results: The LSM should suceed and the LSM should complete
Engine log from previous run: 2015-10-05 22:09:41,935 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-60) [4724dd39] VM 'f21b318b-2118-4980-95eb-13f68e689eb2'(vm24) moved from 'Up' --> 'NotResponding' 2015-10-05 22:09:41,981 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4724dd39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm24 is not responding. vdsm log: ll bs=4096 count=1 (cwd None) Thread-34::DEBUG::2015-10-05 19:10:01,980::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n455 bytes (455 B) copied, 0.000604022 s, 753 kB/s\n'; <rc> = 0 Thread-344277::ERROR::2015-10-05 19:10:02,171::vm::3127::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to take snapshot Traceback (most recent call last): File "/usr/share/vdsm/virt/vm.py", line 3125, in snapshot self._dom.snapshotCreateXML(snapxml, snapFlags)
Created attachment 1080078 [details] new vdsm, engine. gdb and libvirt logs added Added logs
Kevin, any insights given the new information?
Following my discussion with Allon and Yaniv, As this doesn't happen only on mixed LSM and as it is blocker bug to the version, Removing it from blocking the mixed LSM features as those were verified and working.
(In reply to Tal Nisan from comment #16) > Kevin, any insights given the new information? The uploadde archive contains a file qemu_trace_05_10_2015 that has the output of a gdb session, however the only command used there was 'quit', so it's not very helpful. 'thread apply all bt full' is what I would need at the very least to get any clue about what qemu is doing. Also, I don't really know yet what the "connection lost" message in VDSM actually means in detail. Can someone explain? Is this about a monitor socket being closed, or a command timing out or anything like that? And is libvirt involved or is VDSM talking directly to qemu there?
(In reply to Kevin Wolf from comment #18) > Also, I don't really know yet what the "connection lost" message in VDSM > actually means in detail. Can someone explain? Is this about a monitor socket > being closed, or a command timing out or anything like that? And is libvirt > involved or is VDSM talking directly to qemu there? We use libvirt to preform virDomainBlockCopy (or virDomainBlockRebase if block copy is not supported, but it should be in libvirt 1.2.17). The "connection lost" message is this: 1. We fail to get watermarks for drive vda (using virDomainBlockInfo, need to improve this log message): periodic/137::ERROR::2015-09-10 13:54:50,663::vm::960::virt.vm::(_getExtendCandidates) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer 2. Immediately after that: libvirtEventLoop::DEBUG::2015-09-10 13:54:50,859::vm::4415::virt.vm::(onLibvirtLifecycleEvent) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::event Stopped detail 5 opaque None libvirtEventLoop::INFO::2015-09-10 13:54:50,859::vm::807::virt.vm::(_onQemuDeath) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::underlying process disconnected This can be seen in the first vdsm log (attachment 1072250 [details]) I suggest to open a separate qemu bug for this issue and continue there. Hopefully we can create a simple reproducer using virsh. In second log the operation fails when trying to create a snapshot, and it does not seems related to qemu, looks like libvirt issue: 1. We try to take a snapshot: Thread-344277::INFO::2015-10-05 19:09:02,161::vm::3112::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::<domainsnapshot> <disks> <disk name="vda" snapshot="external" type="file"> <source file="/rhev/data-center/8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2/7e76fac6-3777-48e1-8533-1dce9ddff4e1/images/e0f320d0-ae7a-47cd-a027-c6b297f78535/07 f25c30-4a01-49ae-9e4a-6daa79c48171" type="file"/> </disk> </disks> </domainsnapshot> 2. We freeze the guest file systems Thread-344277::INFO::2015-10-05 19:09:02,161::vm::2889::virt.vm::(freeze) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Freezing guest filesystems 3. We failed to freeze (timeout) Thread-344277::WARNING::2015-10-05 19:09:32,167::vm::2894::virt.vm::(freeze) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to freeze guest filesystems: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats) Since generally libvirt error reporting is not specific enough, we assume that this failure means there is no guest agent on the host, so we continue without freezing. 4. We try to take a snapshot (again, fails with timeout) Thread-344277::ERROR::2015-10-05 19:10:02,171::vm::3127::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to take snapshot Traceback (most recent call last): File "/usr/share/vdsm/virt/vm.py", line 3125, in snapshot self._dom.snapshotCreateXML(snapxml, snapFlags) File "/usr/share/vdsm/virt/vm.py", line 214, in f raise toe TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats) 5. We thaw guest file systems (always done if we tried to freeze, so we never leave guest in freezed state) Thread-344277::INFO::2015-10-05 19:10:02,277::vm::2912::virt.vm::(thaw) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Thawing guest filesystems 6. thaw fail also with timeout Thread-344277::WARNING::2015-10-05 19:10:32,278::vm::2917::virt.vm::(thaw) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to thaw guest filesystems: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats) This may be an issue with libvirt but it seems like regression in our sampling code introduced during 3.6 development. Francesco, would you open a virt bug for this and work with kgoldbla on this issue?
Created attachment 1081998 [details] vdsm.log, engine.log, gdb I believe I have the same problem when running LSM. Attached: vdsm.log engine.log and gdb with "thread apply all bt full" - which I'm not sure if it's any good. Versions: vdsm-4.17.8-1.el7ev.noarch rhevm-3.6.0-0.18.el6.noarch qemu: qemu-kvm-common-rhev-2.3.0-29.el7.x86_64 qemu-kvm-rhev-2.3.0-29.el7.x86_64 libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64 qemu-img-rhev-2.3.0-29.el7.x86_64 qemu-kvm-tools-rhev-2.3.0-29.el7.x86_64 ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch libvirt: libvirt-daemon-kvm-1.2.17-5.el7.x86_64 libvirt-daemon-driver-nwfilter-1.2.17-5.el7.x86_64 libvirt-daemon-driver-lxc-1.2.17-5.el7.x86_64 libvirt-daemon-driver-storage-1.2.17-5.el7.x86_64 libvirt-lock-sanlock-1.2.17-5.el7.x86_64 libvirt-daemon-driver-network-1.2.17-5.el7.x86_64 libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64 libvirt-daemon-driver-interface-1.2.17-5.el7.x86_64 libvirt-daemon-1.2.17-5.el7.x86_64 libvirt-daemon-config-network-1.2.17-5.el7.x86_64 libvirt-daemon-driver-secret-1.2.17-5.el7.x86_64 libvirt-python-1.2.17-2.el7.x86_64 libvirt-client-1.2.17-5.el7.x86_64 libvirt-daemon-config-nwfilter-1.2.17-5.el7.x86_64 libvirt-daemon-driver-nodedev-1.2.17-5.el7.x86_64 libvirt-1.2.17-5.el7.x86_64 Relevant time-line in logs, starting @ 13:24
(In reply to Kevin Wolf from comment #18) > (In reply to Tal Nisan from comment #16) > > Kevin, any insights given the new information? > > The uploadde archive contains a file qemu_trace_05_10_2015 that has the > output > of a gdb session, however the only command used there was 'quit', so it's not > very helpful. 'thread apply all bt full' is what I would need at the very > least > to get any clue about what qemu is doing. > > Also, I don't really know yet what the "connection lost" message in VDSM > actually means in detail. Can someone explain? Is this about a monitor socket > being closed, or a command timing out or anything like that? And is libvirt > involved or is VDSM talking directly to qemu there? See comment 20 for the gdb trace.
This qemu process looks alive. Can you try whether it's possible to attach a new monitor connection to the running process? I guess libvirt can't do that, but you should be able to manually connect to the socket.
(In reply to Nir Soffer from comment #19) > (In reply to Kevin Wolf from comment #18) > > Also, I don't really know yet what the "connection lost" message in VDSM > > actually means in detail. Can someone explain? Is this about a monitor socket > > being closed, or a command timing out or anything like that? And is libvirt > > involved or is VDSM talking directly to qemu there? > Francesco, would you open a virt bug for this and work with kgoldbla on > this issue? I'll be more than happy to work on this (and to find a simple reproducer) So far, it seems to share the same root cause of https://bugzilla.redhat.com/show_bug.cgi?id=1250839 Will now properly review the logs, and file a new bug if this is not the case.
(In reply to Francesco Romani from comment #23) > (In reply to Nir Soffer from comment #19) > > (In reply to Kevin Wolf from comment #18) > > > Also, I don't really know yet what the "connection lost" message in VDSM > > > actually means in detail. Can someone explain? Is this about a monitor socket > > > being closed, or a command timing out or anything like that? And is libvirt > > > involved or is VDSM talking directly to qemu there? > > > Francesco, would you open a virt bug for this and work with kgoldbla on > > this issue? > > I'll be more than happy to work on this (and to find a simple reproducer) > > So far, it seems to share the same root cause of > https://bugzilla.redhat.com/show_bug.cgi?id=1250839 > > Will now properly review the logs, and file a new bug if this is not the > case. Thanks Francesco! Please let us know if there's any new outcome. Adding needinfo for visibility..
Francesco - thanks! Reassigning so it's easier to track this. Please send it back Daniel's way if it's more of a storage issue than a virt one. And once again - thanks!
I finally managed to find some time to play with this. TL;DR: Kevin (Goldblatt), can you please check if this still fails with qemu-kvm-rhev >= 2.3.0-28? +++ First, I wanted to reproduce locally the qemu-kvm-rhev being stuck. I had a RHEL7.2 box with quite recent packages, and run the reproducer much like description: - 3.6.0-ish (but Engine is irrelevant here, VDSM not yet into play) environment with NFS and ISCSI storage domains - created VM on NFS storage, thin provisioned, 14 GB disk - booted VM using live image, started brutal dd if=/dev/urandom of=/dev/vda oflag=direct. Please note that this *is* different from description, but at qemu side should not make any difference, it is just I/O on disk - after something like 20-30s, started Live Storage Migration - qemu was responsive all along (inspected using ps, virsh, vdsClient, through spice console) after few minutes, stopped dd inside guest, stopped vm, removed snapshot and disk, all succeeded. But I noticed that I was running a much more recent qemu-kvm-*EV* build. Checking the differences, I found some interesting patches to fix bugs in the drive-mirror area, like https://bugzilla.redhat.com/show_bug.cgi?id=1251487 https://bugzilla.redhat.com/show_bug.cgi?id=1259229 Some reported behaviour matches our findings here. Hence, let's see if the original issue is still present
(In reply to Francesco Romani from comment #26) > I finally managed to find some time to play with this. > > TL;DR: Kevin (Goldblatt), can you please check if this still fails with > qemu-kvm-rhev >= 2.3.0-28? > > +++ > > First, I wanted to reproduce locally the qemu-kvm-rhev being stuck. [...] > But I noticed that I was running a much more recent qemu-kvm-*EV* build. Which is identical to the -rhev builds minus the branding, so the above is still significant, even more the question: > Hence, let's see if the original issue is still present
Francesco, what about comment #20, that has 2.3.0-29 (which is not latest, but more recent than you pointed out in comment #27)? Goldblatt, does it reproduce on latest build?
(In reply to Michal Skrivanek from comment #28) > Francesco, what about comment #20, that has 2.3.0-29 (which is not latest, > but more recent than you pointed out in comment #27)? Right, I missed that. So it seems it is still happening with recent qemu-kvm-rhev, will try again to reproduce locally; if reproduces, will attach new monitor and inspect what's going on. Natalie, are you doing LSM from NFS to block storage (ISCSI maybe), right? Are the images on NFS thin provisioned? Please share all the details about the storage migration being performed, to make sure I'm reproducing the same environment.
After additional investigation (and some failed attempts to reproduce), we should move this bug down to QEMU(-kvm). On VDSM side the issue as per https://bugzilla.redhat.com/show_bug.cgi?id=1250839 is confirmed, but we con continue here.
cloned this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1279777 and moved the clone to qemu-kvm-rhev. Keeping this bug open to consume the fix from 1279777. As far as we know right now, no further action is expected on RHEV side.
(In reply to Francesco Romani from comment #26) > I finally managed to find some time to play with this. > > TL;DR: Kevin (Goldblatt), can you please check if this still fails with > qemu-kvm-rhev >= 2.3.0-28? > > +++ > > First, I wanted to reproduce locally the qemu-kvm-rhev being stuck. > > I had a RHEL7.2 box with quite recent packages, and run the reproducer much > like description: > - 3.6.0-ish (but Engine is irrelevant here, VDSM not yet into play) > environment with NFS and ISCSI storage domains > - created VM on NFS storage, thin provisioned, 14 GB disk > - booted VM using live image, started brutal dd if=/dev/urandom of=/dev/vda > oflag=direct. Please note that this *is* different from description, but at > qemu side should not make any difference, it is just I/O on disk > - after something like 20-30s, started Live Storage Migration > - qemu was responsive all along (inspected using ps, virsh, vdsClient, > through spice console) > > after few minutes, stopped dd inside guest, stopped vm, removed snapshot and > disk, all succeeded. > > But I noticed that I was running a much more recent qemu-kvm-*EV* build. > Checking the differences, I found some interesting patches to fix bugs in > the drive-mirror area, like > https://bugzilla.redhat.com/show_bug.cgi?id=1251487 > https://bugzilla.redhat.com/show_bug.cgi?id=1259229 > > Some reported behaviour matches our findings here. > > Hence, let's see if the original issue is still present I ran the scenario again with the following code: -------------------------------------------------------- rhevm-3.6.0.3-0.1.el6.noarch - engine vdsm-4.17.10.1-0.el7ev.noarch - vdsm libvirt-------------------------------- libvirt-client-1.2.17-13.el7.x86_64 libvirt-python-1.2.17-2.el7.x86_64 libvirt-daemon-driver-interface-1.2.17-13.el7.x86_64 libvirt-daemon-driver-storage-1.2.17-13.el7.x86_64 libvirt-daemon-1.2.17-13.el7.x86_64 libvirt-daemon-driver-nwfilter-1.2.17-13.el7.x86_64 libvirt-daemon-config-nwfilter-1.2.17-13.el7.x86_64 libvirt-daemon-driver-nodedev-1.2.17-13.el7.x86_64 libvirt-daemon-driver-secret-1.2.17-13.el7.x86_64 libvirt-daemon-driver-qemu-1.2.17-13.el7.x86_64 libvirt-daemon-kvm-1.2.17-13.el7.x86_64 libvirt-daemon-driver-network-1.2.17-13.el7.x86_64 libvirt-lock-sanlock-1.2.17-13.el7.x86_64 qemu--------------------------------- ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch qemu-kvm-tools-rhev-2.3.0-31.el7.x86_64 qemu-kvm-common-rhev-2.3.0-31.el7.x86_64 libvirt-daemon-driver-qemu-1.2.17-13.el7.x86_64 qemu-img-rhev-2.3.0-31.el7.x86_64 qemu-kvm-rhev-2.3.0-31.el7.x86_64 I ran the following scenario: -------------------------------------------------------- 1. Created VM with 1 nfs thin disk of 14g 2. Installed Rhel7.1 on the VM 3. Started a dd (dd if=/dev/zero of/aaa/aa bs=512k count=5000) on the OS 4. After about 30 seconds started LSM to a block SD 5. Ran a 2nd dd on the vm OS when the first finished during the LSM and a 3 dd after that. 6. 5 minutes later the Mixed LSM failed 7. The qemu process was still alive Engine log -------------------------------------------------------- 2015-11-10 16:34:30,803 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Failed in 'VmReplicateDiskFinishVDS' method 2015-11-10 16:34:30,809 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-35) [3e229702] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM blond-vdsf com mand failed: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo) 2015-11-10 16:34:30,809 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=-32603, message=Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo)]]' 2015-11-10 16:34:30,809 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] HostName = blond-vdsf 2015-11-10 16:34:30,809 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Command 'VmReplicateDiskFinishVDSCommand(HostName = blond-vdsf, VmReplicateDiskParameters:{runAsync='true', hostId='e338a2b6-7a21-4a1a-b305-895682b0b7d2', vmId='827992ea-9404-4ce1-b438-679e3c83ed6a'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo), code = -32603 2015-11-10 16:34:30,809 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] FINISH, VmReplicateDiskFinishVDSCommand, log id: 1824df3e 2015-11-10 16:34:30,841 ERROR [org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler] (org.ovirt.thread.pool-7-thread-35) [3e229702] Failed VmReplicateDiskFinish (Disk 'c2b79f7a-a587-443f-885f-93bf1edfbe2a', VM '827992ea-9404-4ce1-b438-679e3c83ed6a') 2015-11-10 16:34:30,841 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo), code = -32603 (Failed with error unexpected and code 16) 2015-11-10 16:34:30,845 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Reverting task 'unknown', handler 'org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler' 2015-11-10 16:34:30,846 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] START, VmReplicateDiskFinishVDSCommand(HostName = blond-vdsf, VmReplicateDiskParameters:{runAsync='true', hostId='e338a2b6-7a21-4a1a-b305-895682b0b7d2', vmId='827992ea-9404-4ce1-b438-679e3c83ed6a'}), log id: 7d83513c 2015-11-10 16:35:00,866 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Unexpected return value: StatusForXmlRpc [code=-32603, message=Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo)] 2015-11-10 16:35:00,866 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Failed in 'VmReplicateDiskFinishVDS' method 2015-11-10 16:35:00,873 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-35) [3e229702] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM blond-vdsf command failed: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo) 2015-11-10 16:35:00,873 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=-32603, message=Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo)]]' 2015-11-10 16:35:00,873 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] HostName = blond-vdsf 2015-11-10 16:35:00,873 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] Command 'VmReplicateDiskFinishVDSCommand(HostName = blond-vdsf, VmReplicateDiskParameters:{runAsync='true', hostId='e338a2b6-7a21-4a1a-b305-895682b0b7d2', vmId='827992ea-9404-4ce1-b438-679e3c83ed6a'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo), code = -32603 2015-11-10 16:35:00,874 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-7-thread-35) [3e229702] FINISH, VmReplicateDiskFinishVDSCommand, log id: 7d83513c 2015-11-10 16:35:00,878 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-35) [3e229702] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'LiveMigrateDisk' completed, handling the result. 2015-11-10 16:35:00,879 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-35) [3e229702] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'LiveMigrateDisk' hasn't succeeded, not clearing tasks, will attempt again next polling. 2015-11-10 16:35:10,587 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-93) [6d91d3ce] BaseAsyncTask::onTaskEndSuccess: Task 'd61d55c7-f1da-45b9-ba4f-c665b66d1d46' (Parent Command 'LiveMigrateDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully. 2015-11-10 16:35:10,591 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-93) [6d91d3ce] CommandAsyncTask::endActionIfNecessary: All tasks of command '7fa5af6d-de94-4ed6-b995-2cb0cc7ea2ce' has ended -> executing 'endAction' 2015-11-10 16:35:10,591 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-93) [6d91d3ce] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: '7fa5af6d-de94-4ed6-b995-2cb0cc7ea2ce'): calling endAction '. 2015-11-10 16:35:10,591 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-17) [6d91d3ce] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'LiveMigrateDisk', executionIndex: '0' 2015-11-10 16:35:10,637 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-17) [3e229702] Ending command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' with failure. 2015-11-10 16:35:10,641 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-17) [3e229702] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[827992ea-9404-4ce1-b438-679e3c83ed6a=<VM, ACTION_TYPE_FAILED_DISK_IS_BEING_MIGRATED$DiskName vmtest2_Disk1>]'}' 2015-11-10 16:35:10,671 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-17) [3e229702] Correlation ID: 3e229702, Job ID: 0cfc1adc-c038-40d4-9602-f4b4c6497ede, Call Stack: null, Custom Event ID: -1, Message: User admin@internal have failed to move disk vmtest2_Disk1 to domain block1. vdsm logs: ---------------------------------------------------------------------------- Thread-33734::DEBUG::2015-11-10 16:34:30,362::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n461 bytes (461 B) copied, 0.000403639 s, 1.1 MB/s\n'; <rc> = 0 periodic/15::WARNING::2015-11-10 16:34:30,834::periodic::253::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.NumaInfoMonitor'> on [u'827992ea-9404-4ce1-b438-679e3c83ed6a'] Thread-69668::DEBUG::2015-11-10 16:34:31,187::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.getSpmStatus' in bridge with {u'storagepoolID': u'87566c2f-6c56-4b30-9669-0231fa11e957'} Thread-69668::DEBUG::2015-11-10 16:34:31,189::task::595::Storage.TaskManager.Task::(_updateState) Task=`69ec61ea-a2df-4adb-a2d2-4916362a72b5`::moving from state init -> state preparing Thread-69668::INFO::2015-11-10 16:34:31,189::logUtils::48::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID=u'87566c2f-6c56-4b30-9669-0231fa11e957', options=None) Thread-69668::INFO::2015-11-10 16:34:31,202::logUtils::51::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 5L}} Thread-69668::DEBUG::2015-11-10 16:34:31,202::task::1191::Storage.TaskManager.Task::(prepare) Task=`69ec61ea-a2df-4adb-a2d2-4916362a72b5`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 5L}} Thread-69668::DEBUG::2015-11-10 16:34:31,203::task::595::Storage.TaskManager.Task::(_updateState) Task=`69ec61ea-a2df-4adb-a2d2-4916362a72b5`::moving from state preparing -> state finished Thread-69668::DEBUG::2015-11-10 16:34:31,203::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {} Thread-69668::DEBUG::2015-11-10 16:34:31,204::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {} Thread-69668::DEBUG::2015-11-10 16:34:31,205::task::993::Storage.TaskManager.Task::(_decref) Task=`69ec61ea-a2df-4adb-a2d2-4916362a72b5`::ref 0 aborting False Thread-69668::DEBUG::2015-11-10 16:34:31,205::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.getSpmStatus' in bridge with {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 5L} Thread-69657::ERROR::2015-11-10 16:34:31,097::__init__::526::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 521, in _serveRequest res = method(**params) File "/usr/share/vdsm/rpc/Bridge.py", line 277, in _dynamicMethod result = fn(*methodArgs) File "/usr/share/vdsm/API.py", line 615, in diskReplicateFinish return v.diskReplicateFinish(srcDisk, dstDisk) File "/usr/share/vdsm/virt/vm.py", line 3271, in diskReplicateFinish blkJobInfo = self._dom.blockJobInfo(drive.name, 0) File "/usr/share/vdsm/virt/virdomain.py", line 76, in f raise toe TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockInfo) Thread-69669::DEBUG::2015-11-10 16:34:31,246::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.getInfo' in bridge with {u'storagepoolID': u'87566c2f-6c56-4b30-9669-0231fa11e957'} Thread-69669::DEBUG::2015-11-10 16:34:31,247::task::595::Storage.TaskManager.Task::(_updateState) Task=`4729dcdf-b4a1-4bf5-9ec9-fa28ade0365a`::moving from state init -> state preparing Thread-69669::INFO::2015-11-10 16:34:31,247::logUtils::48::dispatcher::(wrapper) Run and protect: getStoragePoolInfo(spUUID=u'87566c2f-6c56-4b30-9669-0231fa11e957', options=None) Thread-69669::DEBUG::2015-11-10 16:34:31,248::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.87566c2f-6c56-4b30-9669-0231fa11e957`ReqID=`60b459cb-b406-458c-927d-87da642bfd22`::Request was made in '/usr /share/vdsm/storage/hsm.py' line '2634' at 'getStoragePoolInfo' Thread-69669::DEBUG::2015-11-10 16:34:31,248::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.87566c2f-6c56-4b30-9669-0231fa11e957' for lock type 'shared' Thread-69669::DEBUG::2015-11-10 16:34:31,249::resourceManager::601::Storage.ResourceManager::(registerResource) Resource 'Storage.87566c2f-6c56-4b30-9669-0231fa11e957' is free. Now locking as 'shared' (1 active user) Thread-69669::DEBUG::2015-11-10 16:34:31,249::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.87566c2f-6c56-4b30-9669-0231fa11e957`ReqID=`60b459cb-b406-458c-927d-87da642bfd22`::Granted request Thread-69669::DEBUG::2015-11-10 16:34:31,250::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`4729dcdf-b4a1-4bf5-9ec9-fa28ade0365a`::_resourcesAcquired: Storage.87566c2f-6c56-4b30-9669-0231fa11e957 (shared) Thread-69669::DEBUG::2015-11-10 16:34:31,250::task::993::Storage.TaskManager.Task::(_decref) Task=`4729dcdf-b4a1-4bf5-9ec9-fa28ade0365a`::ref 1 aborting False gdb of the qemu process: ---------------------------------------------------------------------------- GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Attaching to process 30618 Reading symbols from /usr/libexec/qemu-kvm...Reading symbols from /usr/libexec/qemu-kvm...(no debugging symbols found)...done. (no debugging symbols found)...done. Reading symbols from /lib64/libssl3.so...Reading symbols from /lib64/libssl3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libssl3.so Reading symbols from /lib64/libsmime3.so...Reading symbols from /lib64/libsmime3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libsmime3.so Reading symbols from /lib64/libnss3.so...Reading symbols from /lib64/libnss3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnss3.so Reading symbols from /lib64/libnssutil3.so...Reading symbols from /lib64/libnssutil3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnssutil3.so Reading symbols from /lib64/libplds4.so...Reading symbols from /lib64/libplds4.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libplds4.so Reading symbols from /lib64/libplc4.so...Reading symbols from /lib64/libplc4.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libplc4.so Reading symbols from /lib64/libnspr4.so...Reading symbols from /lib64/libnspr4.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnspr4.so Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [New LWP 4560] [New LWP 4559] [New LWP 4558] [New LWP 4557] [New LWP 4432] [New LWP 4149] [New LWP 4148] [New LWP 4147] [New LWP 4146] [New LWP 4051] [New LWP 4047] [New LWP 4045] [New LWP 4024] [New LWP 4023] [New LWP 3818] [New LWP 3817] [New LWP 3816] [New LWP 3810] [New LWP 30625] [New LWP 30622] [New LWP 30619] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libgthread-2.0.so.0...Reading symbols from /lib64/libgthread-2.0.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgthread-2.0.so.0 Reading symbols from /lib64/libglib-2.0.so.0...Reading symbols from /lib64/libglib-2.0.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libglib-2.0.so.0 Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /lib64/libaio.so.1...Reading symbols from /lib64/libaio.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libaio.so.1 Reading symbols from /usr/lib64/iscsi/libiscsi.so.2...Reading symbols from /usr/lib64/iscsi/libiscsi.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/iscsi/libiscsi.so.2 Reading symbols from /lib64/libcurl.so.4...Reading symbols from /lib64/libcurl.so.4...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libcurl.so.4 Reading symbols from /lib64/librbd.so.1...Reading symbols from /lib64/librbd.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/librbd.so.1 Reading symbols from /lib64/librados.so.2...Reading symbols from /lib64/librados.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/librados.so.2 Reading symbols from /lib64/libgfapi.so.0...Reading symbols from /lib64/libgfapi.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgfapi.so.0 Reading symbols from /lib64/libglusterfs.so.0...Reading symbols from /lib64/libglusterfs.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libglusterfs.so.0 Reading symbols from /lib64/libgfrpc.so.0...Reading symbols from /lib64/libgfrpc.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgfrpc.so.0 Reading symbols from /lib64/libgfxdr.so.0...Reading symbols from /lib64/libgfxdr.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgfxdr.so.0 Reading symbols from /lib64/libssh2.so.1...Reading symbols from /lib64/libssh2.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libssh2.so.1 Reading symbols from /lib64/libtcmalloc.so.4...Reading symbols from /lib64/libtcmalloc.so.4...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libtcmalloc.so.4 Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/librt.so.1 Reading symbols from /lib64/libpixman-1.so.0...Reading symbols from /lib64/libpixman-1.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libpixman-1.so.0 Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libutil.so.1 Reading symbols from /lib64/libnuma.so.1...Reading symbols from /lib64/libnuma.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnuma.so.1 Reading symbols from /lib64/libasound.so.2...Reading symbols from /lib64/libasound.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libasound.so.2 Reading symbols from /lib64/libpulse.so.0...Reading symbols from /lib64/libpulse.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libpulse.so.0 Reading symbols from /lib64/libuuid.so.1...Reading symbols from /lib64/libuuid.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libuuid.so.1 Reading symbols from /lib64/libpng15.so.15...Reading symbols from /lib64/libpng15.so.15...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libpng15.so.15 Reading symbols from /lib64/libsasl2.so.3...Reading symbols from /lib64/libsasl2.so.3...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libsasl2.so.3 Reading symbols from /lib64/libgnutls.so.28...Reading symbols from /lib64/libgnutls.so.28...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgnutls.so.28 Reading symbols from /lib64/liblzo2.so.2...Reading symbols from /lib64/liblzo2.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/liblzo2.so.2 Reading symbols from /lib64/libsnappy.so.1...Reading symbols from /lib64/libsnappy.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libsnappy.so.1 Reading symbols from /lib64/libseccomp.so.2...Reading symbols from /lib64/libseccomp.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libseccomp.so.2 Reading symbols from /lib64/librdmacm.so.1...Reading symbols from /lib64/librdmacm.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/librdmacm.so.1 Reading symbols from /lib64/libibverbs.so.1...Reading symbols from /lib64/libibverbs.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libibverbs.so.1 Reading symbols from /lib64/libspice-server.so.1...Reading symbols from /lib64/libspice-server.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libspice-server.so.1 Reading symbols from /lib64/libusb-1.0.so.0...Reading symbols from /lib64/libusb-1.0.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libusb-1.0.so.0 Reading symbols from /lib64/libusbredirparser.so.1...Reading symbols from /lib64/libusbredirparser.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libusbredirparser.so.1 Reading symbols from /lib64/libstdc++.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libstdc++.so.6 Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libgcc_s.so.1 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libgcrypt.so.11...Reading symbols from /lib64/libgcrypt.so.11...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgcrypt.so.11 Reading symbols from /lib64/libidn.so.11...Reading symbols from /lib64/libidn.so.11...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libidn.so.11 Reading symbols from /lib64/libgssapi_krb5.so.2...Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgssapi_krb5.so.2 Reading symbols from /lib64/libkrb5.so.3...Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5.so.3 Reading symbols from /lib64/libk5crypto.so.3...Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libk5crypto.so.3 Reading symbols from /lib64/libcom_err.so.2...Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/liblber-2.4.so.2...Reading symbols from /lib64/liblber-2.4.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/liblber-2.4.so.2 Reading symbols from /lib64/libldap-2.4.so.2...Reading symbols from /lib64/libldap-2.4.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libldap-2.4.so.2 Reading symbols from /lib64/libboost_thread-mt.so.1.53.0...Reading symbols from /lib64/libboost_thread-mt.so.1.53.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libboost_thread-mt.so.1.53.0 Reading symbols from /lib64/libboost_system-mt.so.1.53.0...Reading symbols from /lib64/libboost_system-mt.so.1.53.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libboost_system-mt.so.1.53.0 Reading symbols from /lib64/libacl.so.1...Reading symbols from /lib64/libacl.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libacl.so.1 Reading symbols from /lib64/libcrypto.so.10...Reading symbols from /lib64/libcrypto.so.10...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libcrypto.so.10 Reading symbols from /lib64/libssl.so.10...Reading symbols from /lib64/libssl.so.10...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libssl.so.10 Reading symbols from /lib64/libunwind.so.8...Reading symbols from /lib64/libunwind.so.8...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libunwind.so.8 Reading symbols from /lib64/libjson-c.so.2...Reading symbols from /lib64/libjson-c.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libjson-c.so.2 Reading symbols from /usr/lib64/pulseaudio/libpulsecommon-6.0.so...Reading symbols from /usr/lib64/pulseaudio/libpulsecommon-6.0.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/pulseaudio/libpulsecommon-6.0.so Reading symbols from /lib64/libdbus-1.so.3...Reading symbols from /lib64/libdbus-1.so.3...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libdbus-1.so.3 Reading symbols from /lib64/libcap.so.2...Reading symbols from /lib64/libcap.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libcap.so.2 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libcrypt.so.1 Reading symbols from /lib64/libp11-kit.so.0...Reading symbols from /lib64/libp11-kit.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libp11-kit.so.0 Reading symbols from /lib64/libtspi.so.1...Reading symbols from /lib64/libtspi.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libtspi.so.1 Reading symbols from /lib64/libtasn1.so.6...Reading symbols from /lib64/libtasn1.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libtasn1.so.6 Reading symbols from /lib64/libnettle.so.4...Reading symbols from /lib64/libnettle.so.4...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnettle.so.4 Reading symbols from /lib64/libhogweed.so.2...Reading symbols from /lib64/libhogweed.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libhogweed.so.2 Reading symbols from /lib64/libgmp.so.10...Reading symbols from /lib64/libgmp.so.10...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgmp.so.10 Reading symbols from /lib64/libnl-route-3.so.200...Reading symbols from /lib64/libnl-route-3.so.200...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnl-route-3.so.200 Reading symbols from /lib64/libnl-3.so.200...Reading symbols from /lib64/libnl-3.so.200...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libnl-3.so.200 Reading symbols from /lib64/libcelt051.so.0...Reading symbols from /lib64/libcelt051.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libcelt051.so.0 Reading symbols from /lib64/libjpeg.so.62...Reading symbols from /lib64/libjpeg.so.62...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libjpeg.so.62 Reading symbols from /lib64/libgpg-error.so.0...Reading symbols from /lib64/libgpg-error.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgpg-error.so.0 Reading symbols from /lib64/libkrb5support.so.0...Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5support.so.0 Reading symbols from /lib64/libkeyutils.so.1...Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libattr.so.1...Reading symbols from /lib64/libattr.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libattr.so.1 Reading symbols from /lib64/libX11-xcb.so.1...Reading symbols from /lib64/libX11-xcb.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libX11-xcb.so.1 Reading symbols from /lib64/libX11.so.6...Reading symbols from /lib64/libX11.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libX11.so.6 Reading symbols from /lib64/libxcb.so.1...Reading symbols from /lib64/libxcb.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libxcb.so.1 Reading symbols from /lib64/libICE.so.6...Reading symbols from /lib64/libICE.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libICE.so.6 Reading symbols from /lib64/libSM.so.6...Reading symbols from /lib64/libSM.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libSM.so.6 Reading symbols from /lib64/libXtst.so.6...Reading symbols from /lib64/libXtst.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libXtst.so.6 Reading symbols from /lib64/libsystemd.so.0...Reading symbols from /lib64/libsystemd.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libsystemd.so.0 Reading symbols from /lib64/libwrap.so.0...Reading symbols from /lib64/libwrap.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libwrap.so.0 Reading symbols from /lib64/libsndfile.so.1...Reading symbols from /lib64/libsndfile.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libsndfile.so.1 Reading symbols from /lib64/libasyncns.so.0...Reading symbols from /lib64/libasyncns.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libasyncns.so.0 Reading symbols from /lib64/libfreebl3.so...Reading symbols from /lib64/libfreebl3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libfreebl3.so Reading symbols from /lib64/libffi.so.6...Reading symbols from /lib64/libffi.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libffi.so.6 Reading symbols from /lib64/libselinux.so.1...Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libXau.so.6...Reading symbols from /lib64/libXau.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libXau.so.6 Reading symbols from /lib64/libXext.so.6...Reading symbols from /lib64/libXext.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libXext.so.6 Reading symbols from /lib64/libXi.so.6...Reading symbols from /lib64/libXi.so.6...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libXi.so.6 Reading symbols from /lib64/liblzma.so.5...Reading symbols from /lib64/liblzma.so.5...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/liblzma.so.5 Reading symbols from /lib64/libdw.so.1...Reading symbols from /lib64/libdw.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libdw.so.1 Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libnsl.so.1 Reading symbols from /lib64/libgsm.so.1...Reading symbols from /lib64/libgsm.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libgsm.so.1 Reading symbols from /lib64/libFLAC.so.8...Reading symbols from /lib64/libFLAC.so.8...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libFLAC.so.8 Reading symbols from /lib64/libvorbisenc.so.2...Reading symbols from /lib64/libvorbisenc.so.2...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libvorbisenc.so.2 Reading symbols from /lib64/libvorbis.so.0...Reading symbols from /lib64/libvorbis.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libvorbis.so.0 Reading symbols from /lib64/libogg.so.0...Reading symbols from /lib64/libogg.so.0...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libogg.so.0 Reading symbols from /lib64/libpcre.so.1...Reading symbols from /lib64/libpcre.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libpcre.so.1 Reading symbols from /lib64/libelf.so.1...Reading symbols from /lib64/libelf.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libelf.so.1 Reading symbols from /lib64/libbz2.so.1...Reading symbols from /lib64/libbz2.so.1...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libbz2.so.1 Reading symbols from /usr/lib64/sasl2/libplain.so...Reading symbols from /usr/lib64/sasl2/libplain.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/libplain.so Reading symbols from /usr/lib64/sasl2/libsasldb.so...Reading symbols from /usr/lib64/sasl2/libsasldb.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/libsasldb.so Reading symbols from /lib64/libdb-5.3.so...Reading symbols from /lib64/libdb-5.3.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /lib64/libdb-5.3.so Reading symbols from /usr/lib64/sasl2/liblogin.so...Reading symbols from /usr/lib64/sasl2/liblogin.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/liblogin.so Reading symbols from /usr/lib64/sasl2/libdigestmd5.so...Reading symbols from /usr/lib64/sasl2/libdigestmd5.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/libdigestmd5.so Reading symbols from /usr/lib64/sasl2/libanonymous.so...Reading symbols from /usr/lib64/sasl2/libanonymous.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/libanonymous.so Reading symbols from /usr/lib64/sasl2/libcrammd5.so...Reading symbols from /usr/lib64/sasl2/libcrammd5.so...(no debugging symbols found)...done. (no debugging symbols found)...done. Loaded symbols for /usr/lib64/sasl2/libcrammd5.so 0x00007f1319ba2c3f in ppoll () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install qemu-kvm-rhev-2.3.0-31.el7.x86_64 (gdb) thread apply all bt full Thread 22 (Thread 0x7f131134c700 (LWP 30619)): #0 0x00007f1319ba7949 in syscall () from /lib64/libc.so.6 No symbol table info available. #1 0x00007f13216ff432 in qemu_event_wait () No symbol table info available. #2 0x00007f132170d6e6 in call_rcu_thread () No symbol table info available. #3 0x00007f131ffc3dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00007f1319bad1cd in clone () from /lib64/libc.so.6 No symbol table info available. Thread 21 (Thread 0x7f130fb63700 (LWP 30622)): #0 0x00007f1319ba4257 in ioctl () from /lib64/libc.so.6 No symbol table info available. #1 0x00007f13214ca075 in kvm_vcpu_ioctl () No symbol table info available. #2 0x00007f13214ca12e in kvm_cpu_exec () No symbol table info available. #3 0x00007f13214b7dca in qemu_kvm_cpu_thread_fn () No symbol table info available. #4 0x00007f131ffc3dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00007f1319bad1cd in clone () from /lib64/libc.so.6 No symbol table info available. Thread 20 (Thread 0x7f12c81ff700 (LWP 30625)): #0 0x00007f1319ba2b7d in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x00007f131aafa1e7 in red_worker_main () from /lib64/libspice-server.so.1 No symbol table info available. #2 0x00007f131ffc3dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007f1319bad1cd in clone () from /lib64/libc.so.6 No symbol table info available. Thread 19 (Thread 0x7f12c71ff700 (LWP 3810)): #0 0x00007f131ffc9870 in sem_timedwait () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007f13216ff287 in qemu_sem_timedwait () No symbol table info available. #2 0x00007f1321682a6c in worker_thread () No symbol table info available. #3 0x00007f131ffc3dc5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00007f1319bad1cd in clone () from /lib64/libc.so.6 No symbol table info available. Thread 18 (Thread 0x7f12c31f7700 (LWP 3816)): #0 0x00007f131ffc9870 in sem_timedwait () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00007f13216ff287 in qemu_sem_timedwait () No symbol table info available. #2 0x00007f1321682a6c in worker_thread () No symbol table info available. #3 0x00007f131ffc3dc5 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit---
Created attachment 1092296 [details] engine, vdsm, server logs Added logs
Thanks for the updates. I see some thread leaking from the VDSM logs. This is a proof that you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1250839 and then you need patch https://gerrit.ovirt.org/#/c/48348/ - this can makes things worse if storage is already blocked. But let me stress that bz1250839 is triggered by *blocked* storage, so I don't think patch 48348 fixes this bug. I think there is a common yet unknown cause that triggers both this bug *and* the leak. The evidence points in the lower levels of the stack: maybe libvirt, maybe qemu itself. The error message in turn points toward QEMU, so I cloned this issue to QEMU on https://bugzilla.redhat.com/show_bug.cgi?id=1279777 As per https://bugzilla.redhat.com/show_bug.cgi?id=1261980#c31 , until (and if) new evidence comes out, I keep this bug open only to consume the fix -if any, on the lower level of the stack. Please adopt bz1279777 (or move it to the appropriate persone) to continue the investigation on the low level and to learn why QEMU is unresponsive.