1279777 – qemu becomes unresponsive when mirroring a drive

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1279777 - qemu becomes unresponsive when mirroring a drive

Summary: qemu becomes unresponsive when mirroring a drive

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Kevin Wolf
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:	virt
Duplicates (1):	1261980 (view as bug list)
Depends On:
Blocks:	1154205 1170712 1261980
TreeView+	depends on / blocked

Reported:	2015-11-10 09:42 UTC by Francesco Romani
Modified:	2016-01-15 12:18 UTC (History)
CC List:	32 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1261980
Environment:
Last Closed:	2016-01-15 12:18:50 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Francesco Romani 2015-11-10 09:42:45 UTC

+++ This bug was initially created as a clone of Bug #1261980 +++

Description of problem:
During I/O to disk on guest start LSM. After a minute the operation hangs and an error is reported that connection has been lost with the qemu process

Version-Release number of selected component (if applicable):
rhevm-3.6.0-0.12.master.el6.noarch
vdsm-4.17.3-1.el7ev.noarch

How reproducible:
Ran this once

Steps to Reproduce:
1. Create a Vm with a 14g nfs thin disk, run the VM and install OS
2. Start writing to the disk with DD 
3. During the I/O on the guest start the LSM >>>>> After a minute the operation hangs and an error is reported indicating that connection to the qemu process has been lost

Actual results:
Connection to the qemu process is lost during I/O to the disk and LSM of the disk


Expected results:
The LSM and should suceed and the LSM should complete

Additional info:
Engine Log:
--------------------------
2015-09-10 13:54:48,905 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] FINISH, DestroyVDSCommand, log id: 4174c7cc
2015-09-10 13:54:49,004 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM VM18 is down with error. Exit mess
age: Lost connection with qemu process.
2015-09-10 13:54:49,006 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '082836ca-4a6d-418e-b46f-82e36e6906af(VM18) is running in db and not running in VDS 'level1_vm01'
2015-09-10 13:54:49,034 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-7-thread-5) [55d6e382] Running command: ProcessDownVmCommand internal: true.
2015-09-10 14:11:46,582 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] SPMAsyncTask::PollTask: Polling task 'f642551d-3ab9-4108-ad15-7468cfe9179b' (Parent Command 'LiveMigrateDisk', Para
meters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.
2015-09-10 14:11:46,595 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] BaseAsyncTask::onTaskEndSuccess: Task 'f642551d-3ab9-4108-ad15-7468cfe9179b' (Parent Command 'LiveMigrateDisk', Par
ameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.
2015-09-10 14:11:46,599 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] CommandAsyncTask::endActionIfNecessary: All tasks of command 'ed2ec1ed-eafa-485e-934d-711fdffe17fd' has ended -
> executing 'endAction'
2015-09-10 14:11:46,599 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler_Worker-29) [6f96321] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: 'ed2ec1ed-eafa-485e-934d-711fdffe17fd'): 
calling endAction '.
2015-09-10 14:11:46,599 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-30) [6f96321] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'LiveMigrateDisk', executi
onIndex: '1'
2015-09-10 14:11:46,863 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Ending command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' successfully.
2015-09-10 14:11:46,864 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Running command: LiveMigrateDiskCommand Task handler: VmReplicateDiskFinishTaskHandler internal: false. E
ntities affected :  ID: 3242921b-9642-4137-9e36-df7c2802e887 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: ca046d48-3cf1-4d80-a6e0-2bf076224033 Type: StorageAction group CREATE_DISK with role type USER
2015-09-10 14:11:46,890 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Cannot get vdsManager for vdsid='00000000-0000-0000-0000-000000000000'.
2015-09-10 14:11:46,891 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] CreateCommand failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed
 with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2015-09-10 14:11:46,892 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (org.ovirt.thread.pool-7-thread-30) [3b06388e] Exception: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [rt.jar:1.8.0_51]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [rt.jar:1.8.0_51]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.8.0_51]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422) [rt.jar:1.8.0_51]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.instantiateInjectedCommand(ResourceManager.java:432) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:416) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:464) [vdsbroker.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler.runVdsCommand(AbstractSPMAsyncTaskHandler.java:48) [bll.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskFinishTaskHandler.beforeTask(VmReplicateDiskFinishTaskHandler.java:60) [bll.jar:]
        at org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler.execute(AbstractSPMAsyncTaskHandler.java:38) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1209) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1355) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1979) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1392) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:644) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:591) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1981) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
:



VDSM log
------------------
Thread-30::DEBUG::2015-09-10 13:54:50,110::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with FileMetadataRW backend
Thread-28214::DEBUG::2015-09-10 13:54:49,972::fileSD::169::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/10.35.160.108:_Storage_kev2/b0ae1c01-b71d-4ad4-a9a8-84ce1dc536f8/dom_md/metadata iflag=direct of=/dev/nul
l bs=4096 count=1 (cwd None)
Thread-28214::DEBUG::2015-09-10 13:54:50,160::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n329 bytes (329 B) copied, 0.015572 s, 21.1 kB/s\n'; <rc> = 0
Thread-30::DEBUG::2015-09-10 13:54:50,198::persistentDict::234::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=nfs1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIMESEC=60', 'LOCKPOLICY=', 'LO
CKRENEWALINTERVALSEC=5', 'MASTER_VERSION=1', 'POOL_DESCRIPTION=dc1', 'POOL_DOMAINS=7e76fac6-3777-48e1-8533-1dce9ddff4e1:Active', 'POOL_SPM_ID=-1', 'POOL_SPM_LVER=-1', 'POOL_UUID=8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2', 'REMOTE_PATH=10.35.1
60.108:/Storage/kev1', 'ROLE=Master', 'SDUUID=7e76fac6-3777-48e1-8533-1dce9ddff4e1', 'TYPE=NFS', 'VERSION=3', '_SHA_CKSUM=23f0390c1e39e94d4dd39b5d9ba4dc953dae3a11']
Thread-30::DEBUG::2015-09-10 13:54:50,210::fileSD::643::Storage.StorageDomain::(imageGarbageCollector) Removing remnants of deleted images []
Thread-30::INFO::2015-09-10 13:54:50,211::sd::442::Storage.StorageDomain::(_registerResourceNamespaces) Resource namespace 7e76fac6-3777-48e1-8533-1dce9ddff4e1_imageNS already registered
Thread-30::INFO::2015-09-10 13:54:50,211::sd::450::Storage.StorageDomain::(_registerResourceNamespaces) Resource namespace 7e76fac6-3777-48e1-8533-1dce9ddff4e1_volumeNS already registered
Thread-30::DEBUG::2015-09-10 13:54:50,225::fileSD::169::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/10.35.160.108:_Storage_kev1/7e76fac6-3777-48e1-8533-1dce9ddff4e1/dom_md/metadata iflag=direct of=/dev/null b
s=4096 count=1 (cwd None)
Thread-30::DEBUG::2015-09-10 13:54:50,248::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n455 bytes (455 B) copied, 0.00454271 s, 100 kB/s\n'; <rc> = 0
periodic/141::WARNING::2015-09-10 13:54:50,346::periodic::253::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'082836ca-4a6d-418e-b46f-82e36e6906af']
Thread-28228::DEBUG::2015-09-10 13:54:50,479::monitor::288::Storage.Monitor::(_refreshDomain) Refreshing domain eba7102c-4651-44d4-9285-2b45056710e9
Thread-28228::DEBUG::2015-09-10 13:54:50,480::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with LvMetadataRW backend
Thread-28228::DEBUG::2015-09-10 13:54:50,480::blockSD::337::Storage.Misc.excCmd::(readlines) /usr/bin/dd iflag=direct skip=0 bs=2048 if=/dev/eba7102c-4651-44d4-9285-2b45056710e9/metadata count=1 (cwd None)
VM Channels Listener::DEBUG::2015-09-10 13:54:50,661::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 15
VM Channels Listener::DEBUG::2015-09-10 13:54:50,662::guestagent::218::virt.vm::(_connect) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Attempting connection to /var/lib/libvirt/qemu/channels/082836ca-4a6d-418e-b46f-82e36e6906af.com.redh
at.rhevm.vdsm
VM Channels Listener::DEBUG::2015-09-10 13:54:50,662::guestagent::232::virt.vm::(_connect) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Failed to connect to /var/lib/libvirt/qemu/channels/082836ca-4a6d-418e-b46f-82e36e6906af.com.redhat.r
hevm.vdsm with 2
periodic/137::ERROR::2015-09-10 13:54:50,663::vm::960::virt.vm::(_getExtendCandidates) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer
periodic/137::DEBUG::2015-09-10 13:54:50,663::executor::178::Executor::(_run) Worker was discarded
Thread-28228::DEBUG::2015-09-10 13:54:50,684::blockSD::337::Storage.Misc.excCmd::(readlines) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2048 bytes (2.0 kB) copied, 0.0147833 s, 139 kB/s\n'; <rc> = 0
Thread-28228::DEBUG::2015-09-10 13:54:50,685::misc::260::Storage.Misc::(validateDDBytes) err: ['1+0 records in', '1+0 records out', '2048 bytes (2.0 kB) copied, 0.0147833 s, 139 kB/s'], size: 2048
Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::234::Storage.PersistentDict::(refresh) read lines (LvMetadataRW)=[]
Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::252::Storage.PersistentDict::(refresh) Empty metadata
Thread-28228::DEBUG::2015-09-10 13:54:50,685::persistentDict::192::Storage.PersistentDict::(__init__) Created a persistent dict with VGTagMetadataRW backend
Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::504::Storage.OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::506::Storage.OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-28228::DEBUG::2015-09-10 13:54:50,686::lvm::514::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-28228::DEBUG::2015-09-10 13:54:50,687::lvm::526::Storage.OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-28228::DEBUG::2015-09-10 13:54:50,687::lvm::371::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-28228::DEBUG::2015-09-10 13:54:50,688::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_e
rror_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a03000798dd6205b34e511|/dev/mapper/360060160f4a030007b8dd6205b34e511|/dev/mapper/360060160f4a030007d8dd6205b34e511|/dev/mapper/360060160f4a03000979e521de
bece411|/dev/mapper/360060160f4a03000999e521debece411|/dev/mapper/360060160f4a030009b9e521debece411|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50
  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name eba7102c-4651-44d4-9285
-2b45056710e9 (cwd None)

--- Additional comment from Kevin Alon Goldblatt on 2015-09-10 11:12:31 EDT ---

Added logs

--- Additional comment from Kevin Alon Goldblatt on 2015-09-10 17:55:40 EDT ---

I REFINED THE TEST AND HERE ARE THE RESULTS:

1. Ran the scenario again and the error occured again - so this SPECIFICALLY happens when running dd BEFORE starting the LSM and AGAIN a second session during the SLM

2. Ran the same scenario as above running the dd session BEFORE the lsm BUT did NOT start a second dd write during the LSM and it worked FINE

3. Ran the same scenario with only installing the OS and no dd and it worked FINE!

ERROR in engine:
2015-09-10 19:36:04,935 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm20 is down with error. Exit message: Lost connection with qemu process.


Error in vdsm:
Thread-69696::DEBUG::2015-09-10 19:36:04,422::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033' for lock type 'shared'
Thread-69696::DEBUG::2015-09-10 19:36:04,422::resourceManager::566::Storage.ResourceManager::(registerResource) Resource 'Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033' found in shared state and queue is empty, Joining current shared lock
 (2 active users)
Thread-69696::DEBUG::2015-09-10 19:36:04,423::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033`ReqID=`ce5f0cdf-a7e1-4a68-9286-6611ec296cc9`::Granted request
Thread-69696::DEBUG::2015-09-10 19:36:04,423::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`004a38fa-7909-4371-872b-115ea7d845a4`::_resourcesAcquired: Storage.ca046d48-3cf1-4d80-a6e0-2bf076224033 (shared)
Thread-69696::DEBUG::2015-09-10 19:36:04,423::task::993::Storage.TaskManager.Task::(_decref) Task=`004a38fa-7909-4371-872b-115ea7d845a4`::ref 1 aborting False
Thread-69696::DEBUG::2015-09-10 19:36:04,424::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`ca046d48-3cf1-4d80-a6e0-2bf076224033_imageNS.ec548166-bfec-4fd1-9eeb-d81e3fae2348`ReqID=`27d27c49-5ac6-43cc-aa04-3bd
1ea5cd526`::Request was made in '/usr/share/vdsm/storage/sp.py' line '1915' at 'setVolumeDescription'
Thread-69696::DEBUG::2015-09-10 19:36:04,424::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'ca046d48-3cf1-4d80-a6e0-2bf076224033_imageNS.ec548166-bfec-4fd1-9eeb-d81e3fae2348' for lock type
 'exclusive'
Thread-69696::DEBUG::2015-09-10 19:36:04,425::lvm::428::Storage.OperationMutex::(_reloadlvs) Operation 'lvm reload operation' got the operation mutex
Thread-69696::DEBUG::2015-09-10 19:36:04,425::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_e
rror_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a03000798dd6205b34e511|/dev/mapper/360060160f4a030007b8dd6205b34e511|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_fo
r_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags ca046d48-3cf1-4d80-a6e0-2bf07622
4033 (cwd None)
periodic/255::WARNING::2015-09-10 19:36:04,905::periodic::253::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'169380d1-5def-4942-8a31-f42ffa9c9ba3']
mailbox.SPMMonitor::DEBUG::2015-09-10 19:36:06,192::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (
cwd None)
mailbox.SPMMonitor::DEBUG::2015-09-10 19:36:06,244::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0368906 s, 27.8 MB/s\n'; <rc> = 0
VM Channels Listener::DEBUG::2015-09-10 19:36:06,677::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 15
VM Channels Listener::DEBUG::2015-09-10 19:36:06,677::guestagent::218::virt.vm::(_connect) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Attempting connection to /var/lib/libvirt/qemu/channels/169380d1-5def-4942-8a31-f42ffa9c9ba3.com.redh
at.rhevm.vdsm
VM Channels Listener::DEBUG::2015-09-10 19:36:06,678::guestagent::232::virt.vm::(_connect) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Failed to connect to /var/lib/libvirt/qemu/channels/169380d1-5def-4942-8a31-f42ffa9c9ba3.com.redhat.r
hevm.vdsm with 111
periodic/251::ERROR::2015-09-10 19:36:06,892::vm::960::virt.vm::(_getExtendCandidates) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer
periodic/251::DEBUG::2015-09-10 19:36:06,892::executor::178::Executor::(_run) Worker was discarded
libvirtEventLoop::DEBUG::2015-09-10 19:36:06,894::vm::4415::virt.vm::(onLibvirtLifecycleEvent) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::event Stopped detail 5 opaque None
libvirtEventLoop::INFO::2015-09-10 19:36:06,894::vm::807::virt.vm::(_onQemuDeath) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::underlying process disconnected
libvirtEventLoop::INFO::2015-09-10 19:36:06,895::vm::3687::virt.vm::(releaseVm) vmId=`169380d1-5def-4942-8a31-f42ffa9c9ba3`::Release VM resources
libvirtEventLoop::DEBUG::2015-09-10 19:36:06,897::vmchannels::214::vds::(unregister) Delete fileno 15 from listener.
libvirtEventLoop::DEBUG::2015-09-10 19:36:06,897::task::595::Storage.TaskManager.Task::(_updateState) Task=`cedaa805-48d0-42e3-b98d-b36d3ce66c11`::moving from state init -> state preparing
libvirtEventLoop::INFO::2015-09-10 19:36:06,898::logUtils::48::dispatcher::(wrapper) Run and protect: teardownImage(sdUUID=u'7e76fac6-3777-48e1-8533-1dce9ddff4e1', spUUID=u'8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2', imgUUID=u'1f7c93ee-a26a-4a24-8607-d9c36b58ecb9', volUUID=None)

--- Additional comment from Allon Mureinik on 2015-09-13 04:05:51 EDT ---

Daniel - please take a look.

Kevin - what versions of qmeu and libvirt are you using?

--- Additional comment from Markus Stockhausen on 2015-09-13 13:48:32 EDT ---

Maybe related: We saw similar behaviour during Live Merge. QEMU got unresponsive round about 30 seconds after start of action because of uninterruptible bitmap task. More details in BZ1262251

--- Additional comment from Nir Soffer on 2015-09-15 17:19:54 EDT ---

Thanks Markus!

Kevin, can this be another issue related to the qemu patch Markus mention 
in https://bugzilla.redhat.com/show_bug.cgi?id=1262251#c13 ?

--- Additional comment from Kevin Wolf on 2015-09-16 03:57:07 EDT ---

I think that's unlikely. What Markus reported stalls guest I/O while starting the
mirror block job (when it checks which parts of the image are allocated and must
be copied), but not when it's active and mirroring new data. In contrast, here the
problem seems to only occur when you write new data while it's active (as Kevin
described in comment 2). This is after the point where that patch could have an
effect.

--- Additional comment from Kevin Alon Goldblatt on 2015-09-16 04:59:39 EDT ---

(In reply to Allon Mureinik from comment #3)
> Daniel - please take a look.
> 
> Kevin - what versions of qmeu and libvirt are you using?

qemu 
----------------------
qemu-kvm-rhev-2.3.0-19.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64
qemu-kvm-tools-rhev-2.3.0-19.el7.x86_64
qemu-img-rhev-2.3.0-19.el7.x86_64
qemu-kvm-common-rhev-2.3.0-19.el7.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

libvirt
---------------------
libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64
libvirt-daemon-1.2.17-5.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-5.el7.x86_64
libvirt-daemon-driver-network-1.2.17-5.el7.x86_64
libvirt-daemon-kvm-1.2.17-5.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-5.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.17-5.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.17-5.el7.x86_64
libvirt-daemon-driver-interface-1.2.17-5.el7.x86_64
libvirt-daemon-driver-storage-1.2.17-5.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-lock-sanlock-1.2.17-5.el7.x86_64

--- Additional comment from Yaniv Kaul on 2015-09-20 09:27:19 EDT ---

Aharon - what is the frequency of this issue? Additional repro (comment 2) seems to suggest it is happening in more specific scenarios - so why is it blocking testing the above BZs?

Let's try to find more issues in the other features and not hold up because of this one, if possible.

--- Additional comment from Daniel Erez on 2015-09-20 10:07:05 EDT ---

Hi Kevin,

According to the described scenario, connection to qemu process is lost while dd to the guest during LSM [1]. What additional data is needed to determine whether there's an issue in qemu side (rather than vdsm)?

Thanks!,
Daniel

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1261980#c2
("Message: VM vm20 is down with error. Exit message: Lost connection with qemu process.")

--- Additional comment from Yaniv Kaul on 2015-09-20 10:09:56 EDT ---

If it's an easily reproducible scenario:
1. What about the qemu log?
2. Please have Virt-QE reproduce it.
3. Attach GDB to the QEMU process (need to be quick - or VDSM will think the VM is not responsive!) and get a nice stack when it blows up.

--- Additional comment from Aharon Canan on 2015-09-21 08:45:31 EDT ---

(In reply to Yaniv Kaul from comment #8)
> Aharon - what is the frequency of this issue? Additional repro (comment 2)
> seems to suggest it is happening in more specific scenarios - so why is it
> blocking testing the above BZs?
> 
> Let's try to find more issues in the other features and not hold up because
> of this one, if possible.

It happens every time when running steps from comment #2.
I blocked the tracker bugs because it looks very basic to me - we can discuss it and decide if it is not.

Kevin already completed running all other cases so other than that we are good.

Kevin will answer comment #10 asap.

--- Additional comment from Kevin Wolf on 2015-09-28 07:16:18 EDT ---

Yaniv already answered for me in comment 10, and I have nothing to add at this
point, so I'm just unsetting my needinfo.

--- Additional comment from Kevin Alon Goldblatt on 2015-10-05 16:33:35 EDT ---

(In reply to Yaniv Kaul from comment #10)
> If it's an easily reproducible scenario:
> 1. What about the qemu log?
> 2. Please have Virt-QE reproduce it.
> 3. Attach GDB to the QEMU process (need to be quick - or VDSM will think the
> VM is not responsive!) and get a nice stack when it blows up.

Ran this scenario again. This time connection to the VM was lost during the first write and LSM as follows:


Steps to Reproduce:
1. Create a Vm with a 14g nfs thin disk, run the VM and install OS
2. Start writing to the disk with DD 
3. During the I/O on the guest start the LSM >>>>> After a minute the operation hangs and an error is reported indicating that connection to the qemu process has been lost

Actual results:
Connection to the qemu process is lost during I/O to the disk and LSM of the disk


Expected results:
The LSM should suceed and the LSM should complete

--- Additional comment from Kevin Alon Goldblatt on 2015-10-05 16:54:22 EDT ---

Engine log from previous run:

2015-10-05 22:09:41,935 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-60) [4724dd39] VM 'f21b318b-2118-4980-95eb-13f68e689eb2'(vm24) moved from 'Up' --> 'NotResponding'
2015-10-05 22:09:41,981 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4724dd39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm24 is not responding.


vdsm log:

ll bs=4096 count=1 (cwd None)
Thread-34::DEBUG::2015-10-05 19:10:01,980::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n455 bytes (455 B) copied, 0.000604022 s, 753 kB/s\n'; <rc> = 0
Thread-344277::ERROR::2015-10-05 19:10:02,171::vm::3127::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to take snapshot
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3125, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)

--- Additional comment from Kevin Alon Goldblatt on 2015-10-05 16:56 EDT ---

Added logs

--- Additional comment from Tal Nisan on 2015-10-07 09:42:14 EDT ---

Kevin, any insights given the new information?

--- Additional comment from Aharon Canan on 2015-10-07 09:48:25 EDT ---

Following my discussion with Allon and Yaniv, 
As this doesn't happen only on mixed LSM and as it is blocker bug to the version, 

Removing it from blocking the mixed LSM features as those were verified and working.

--- Additional comment from Kevin Wolf on 2015-10-08 05:25:00 EDT ---

(In reply to Tal Nisan from comment #16)
> Kevin, any insights given the new information?

The uploadde archive contains a file qemu_trace_05_10_2015 that has the output
of a gdb session, however the only command used there was 'quit', so it's not
very helpful. 'thread apply all bt full' is what I would need at the very least
to get any clue about what qemu is doing.

Also, I don't really know yet what the "connection lost" message in VDSM
actually means in detail. Can someone explain? Is this about a monitor socket
being closed, or a command timing out or anything like that? And is libvirt
involved or is VDSM talking directly to qemu there?

--- Additional comment from Nir Soffer on 2015-10-08 07:05:34 EDT ---

(In reply to Kevin Wolf from comment #18)
> Also, I don't really know yet what the "connection lost" message in VDSM
> actually means in detail. Can someone explain? Is this about a monitor socket
> being closed, or a command timing out or anything like that? And is libvirt
> involved or is VDSM talking directly to qemu there?

We use libvirt to preform virDomainBlockCopy (or virDomainBlockRebase if block copy is not supported, but it should be in libvirt 1.2.17).

The "connection lost" message is this:

1. We fail to get watermarks for drive vda (using virDomainBlockInfo, need to improve this log message):

periodic/137::ERROR::2015-09-10 13:54:50,663::vm::960::virt.vm::(_getExtendCandidates) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::Unable to get watermarks for drive vda: Unable to read from monitor: Connection reset by peer

2. Immediately after that:

libvirtEventLoop::DEBUG::2015-09-10 13:54:50,859::vm::4415::virt.vm::(onLibvirtLifecycleEvent) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::event Stopped detail 5 opaque None
libvirtEventLoop::INFO::2015-09-10 13:54:50,859::vm::807::virt.vm::(_onQemuDeath) vmId=`082836ca-4a6d-418e-b46f-82e36e6906af`::underlying process disconnected
 
This can be seen in the first vdsm log (attachment 1072250 [details])

I suggest to open a separate qemu bug for this issue and continue there.
Hopefully we can create a simple reproducer using virsh.

In second log the operation fails when trying to create a snapshot, and it
does not seems related to qemu, looks like libvirt issue:

1. We try to take a snapshot:

Thread-344277::INFO::2015-10-05 19:09:02,161::vm::3112::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::<domainsnapshot>
        <disks>
                <disk name="vda" snapshot="external" type="file">
                        <source file="/rhev/data-center/8cf6b080-2ca3-44ce-a0cb-f57cb0f715d2/7e76fac6-3777-48e1-8533-1dce9ddff4e1/images/e0f320d0-ae7a-47cd-a027-c6b297f78535/07
f25c30-4a01-49ae-9e4a-6daa79c48171" type="file"/>
                </disk>
        </disks>
</domainsnapshot>

2. We freeze the guest file systems

Thread-344277::INFO::2015-10-05 19:09:02,161::vm::2889::virt.vm::(freeze) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Freezing guest filesystems

3. We failed to freeze (timeout)

Thread-344277::WARNING::2015-10-05 19:09:32,167::vm::2894::virt.vm::(freeze) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to freeze guest filesystems: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

Since generally libvirt error reporting is not specific enough, we assume
that this failure means there is no guest agent on the host, so we continue
without freezing.

4. We try to take a snapshot (again, fails with timeout)

Thread-344277::ERROR::2015-10-05 19:10:02,171::vm::3127::virt.vm::(snapshot) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to take snapshot
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3125, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)
  File "/usr/share/vdsm/virt/vm.py", line 214, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

5. We thaw guest file systems (always done if we tried to freeze, so we never leave guest in freezed state)

Thread-344277::INFO::2015-10-05 19:10:02,277::vm::2912::virt.vm::(thaw) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Thawing guest filesystems

6. thaw fail also with timeout

Thread-344277::WARNING::2015-10-05 19:10:32,278::vm::2917::virt.vm::(thaw) vmId=`f21b318b-2118-4980-95eb-13f68e689eb2`::Unable to thaw guest filesystems: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

This may be an issue with libvirt but it seems like regression in our
sampling code introduced during 3.6 development.

Francesco, would you open a virt bug for this and work with kgoldbla on
this issue?

--- Additional comment from Natalie Gavrielov on 2015-10-12 08:41 EDT ---

I believe I have the same problem when running LSM.

Attached:
vdsm.log
engine.log
and gdb with "thread apply all bt full" - which I'm not sure if it's any good.

Versions:
vdsm-4.17.8-1.el7ev.noarch
rhevm-3.6.0-0.18.el6.noarch

qemu:

qemu-kvm-common-rhev-2.3.0-29.el7.x86_64
qemu-kvm-rhev-2.3.0-29.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64
qemu-img-rhev-2.3.0-29.el7.x86_64
qemu-kvm-tools-rhev-2.3.0-29.el7.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch

libvirt:

libvirt-daemon-kvm-1.2.17-5.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-5.el7.x86_64
libvirt-daemon-driver-lxc-1.2.17-5.el7.x86_64
libvirt-daemon-driver-storage-1.2.17-5.el7.x86_64
libvirt-lock-sanlock-1.2.17-5.el7.x86_64
libvirt-daemon-driver-network-1.2.17-5.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-5.el7.x86_64
libvirt-daemon-driver-interface-1.2.17-5.el7.x86_64
libvirt-daemon-1.2.17-5.el7.x86_64
libvirt-daemon-config-network-1.2.17-5.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-5.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.17-5.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.17-5.el7.x86_64
libvirt-1.2.17-5.el7.x86_64

Relevant time-line in logs, starting @ 13:24

--- Additional comment from Kevin Alon Goldblatt on 2015-10-13 10:37:42 EDT ---

(In reply to Kevin Wolf from comment #18)
> (In reply to Tal Nisan from comment #16)
> > Kevin, any insights given the new information?
> 
> The uploadde archive contains a file qemu_trace_05_10_2015 that has the
> output
> of a gdb session, however the only command used there was 'quit', so it's not
> very helpful. 'thread apply all bt full' is what I would need at the very
> least
> to get any clue about what qemu is doing.
> 
> Also, I don't really know yet what the "connection lost" message in VDSM
> actually means in detail. Can someone explain? Is this about a monitor socket
> being closed, or a command timing out or anything like that? And is libvirt
> involved or is VDSM talking directly to qemu there?

See comment 20 for the gdb trace.

--- Additional comment from Kevin Wolf on 2015-10-14 05:38:28 EDT ---

This qemu process looks alive. Can you try whether it's possible to attach a
new monitor connection to the running process? I guess libvirt can't do that,
but you should be able to manually connect to the socket.

--- Additional comment from Francesco Romani on 2015-10-16 11:39:20 EDT ---

(In reply to Nir Soffer from comment #19)
> (In reply to Kevin Wolf from comment #18)
> > Also, I don't really know yet what the "connection lost" message in VDSM
> > actually means in detail. Can someone explain? Is this about a monitor socket
> > being closed, or a command timing out or anything like that? And is libvirt
> > involved or is VDSM talking directly to qemu there?

> Francesco, would you open a virt bug for this and work with kgoldbla on
> this issue?

I'll be more than happy to work on this (and to find a simple reproducer)

So far, it seems to share the same root cause of https://bugzilla.redhat.com/show_bug.cgi?id=1250839

Will now properly review the logs, and file a new bug if this is not the case.

--- Additional comment from Daniel Erez on 2015-10-26 15:18:00 EDT ---

(In reply to Francesco Romani from comment #23)
> (In reply to Nir Soffer from comment #19)
> > (In reply to Kevin Wolf from comment #18)
> > > Also, I don't really know yet what the "connection lost" message in VDSM
> > > actually means in detail. Can someone explain? Is this about a monitor socket
> > > being closed, or a command timing out or anything like that? And is libvirt
> > > involved or is VDSM talking directly to qemu there?
> 
> > Francesco, would you open a virt bug for this and work with kgoldbla on
> > this issue?
> 
> I'll be more than happy to work on this (and to find a simple reproducer)
> 
> So far, it seems to share the same root cause of
> https://bugzilla.redhat.com/show_bug.cgi?id=1250839
> 
> Will now properly review the logs, and file a new bug if this is not the
> case.

Thanks Francesco! Please let us know if there's any new outcome.
Adding needinfo for visibility..

--- Additional comment from Allon Mureinik on 2015-10-28 10:08:30 EDT ---

Francesco - thanks!
Reassigning so it's easier to track this. Please send it back Daniel's way if it's more of a storage issue than a virt one.

And once again - thanks!

--- Additional comment from Francesco Romani on 2015-11-02 13:49:00 EST ---

I finally managed to find some time to play with this.

TL;DR: Kevin (Goldblatt), can you please check if this still fails with qemu-kvm-rhev >= 2.3.0-28?

+++

First, I wanted to reproduce locally the qemu-kvm-rhev being stuck.

I had a RHEL7.2 box with quite recent packages, and run the reproducer much like description:
- 3.6.0-ish (but Engine is irrelevant here, VDSM not yet into play) environment with NFS and ISCSI storage domains
- created VM on NFS storage, thin provisioned, 14 GB disk
- booted VM using live image, started brutal dd if=/dev/urandom of=/dev/vda oflag=direct. Please note that this *is* different from description, but at qemu side should not make any difference, it is just I/O on disk
- after something like 20-30s, started Live Storage Migration
- qemu was responsive all along (inspected using ps, virsh, vdsClient, through spice console)

after few minutes, stopped dd inside guest, stopped vm, removed snapshot and disk, all succeeded.

But I noticed that I was running a much more recent qemu-kvm-*EV* build. Checking the differences, I found some interesting patches to fix bugs in the drive-mirror area, like
https://bugzilla.redhat.com/show_bug.cgi?id=1251487
https://bugzilla.redhat.com/show_bug.cgi?id=1259229

Some reported behaviour matches our findings here.

Hence, let's see if the original issue is still present

--- Additional comment from Francesco Romani on 2015-11-04 09:51:03 EST ---

(In reply to Francesco Romani from comment #26)
> I finally managed to find some time to play with this.
> 
> TL;DR: Kevin (Goldblatt), can you please check if this still fails with
> qemu-kvm-rhev >= 2.3.0-28?
> 
> +++
> 
> First, I wanted to reproduce locally the qemu-kvm-rhev being stuck.
[...]
> But I noticed that I was running a much more recent qemu-kvm-*EV* build.

Which is identical to the -rhev builds minus the branding, so the above is still significant, even more the question:

> Hence, let's see if the original issue is still present

--- Additional comment from Michal Skrivanek on 2015-11-05 13:31:14 EST ---

Francesco, what about comment #20, that has 2.3.0-29 (which is not latest, but more recent than you pointed out in comment #27)?

Goldblatt, does it reproduce on latest build?

--- Additional comment from Francesco Romani on 2015-11-06 03:08:28 EST ---

(In reply to Michal Skrivanek from comment #28)
> Francesco, what about comment #20, that has 2.3.0-29 (which is not latest,
> but more recent than you pointed out in comment #27)?

Right, I missed that. So it seems it is still happening with recent qemu-kvm-rhev, will try again to reproduce locally; if reproduces, will attach new monitor and inspect what's going on.

Natalie, are you doing LSM from NFS to block storage (ISCSI maybe), right?
Are the images on NFS thin provisioned?

Please share all the details about the storage migration being performed, to make sure I'm reproducing the same environment.

--- Additional comment from Francesco Romani on 2015-11-09 07:08:45 EST ---

After additional investigation (and some failed attempts to reproduce), we should move this bug down to QEMU(-kvm).

On VDSM side the issue as per https://bugzilla.redhat.com/show_bug.cgi?id=1250839 is confirmed, but we con continue here.

Comment 3 Kevin Wolf 2015-11-10 13:56:23 UTC

Still waiting for a description of what you think that qemu is doing wrong here.
As I said in one of the comments of the original bugs "This qemu process looks
alive", and I can't see any newer information that says otherwise.

Also my request that you try and attach a new monitor to the process is still
open.

Comment 4 Kevin Wolf 2015-11-10 13:58:32 UTC

Also, why is this marked as a regression? In which qemu-kvm-rhev version did it
work correctly according to your definition, and what is the exact difference
between the behaviour then and now?

Comment 5 Francesco Romani 2015-11-11 13:25:01 UTC

(In reply to Kevin Wolf from comment #4)
> Also, why is this marked as a regression? In which qemu-kvm-rhev version did
> it
> work correctly according to your definition, and what is the exact difference
> between the behaviour then and now?

It is marked as regression because the original bug I cloned to make this one was marked as such. I don't claim there is a regression on QEMU given the information we have so far.

Comment 6 Yaniv Kaul 2015-11-12 10:00:16 UTC

*** Bug 1261980 has been marked as a duplicate of this bug. ***

Comment 7 Yaniv Kaul 2015-11-15 15:29:11 UTC

*** Bug 1261980 has been marked as a duplicate of this bug. ***

Comment 8 Kevin Alon Goldblatt 2015-11-22 09:29:37 UTC

Tested with the following code:
-----------------------------------
rhevm-3.6.0.3-0.1.el6.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Verified with the following steps:
----------------------------------
Steps to Reproduce:
1. Create a VM with one 5G thin provisioned disk on an iSCSI SD
2. Boot the VM from a live CD such as TinyCorePlus
3. Write data to the first part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048  >>>>> During the dd to the disks the connection is lost with the QEMU process. This happens every time. We have a bz https://bugzilla.redhat.com/show_bug.cgi?id=1279777 open for the qemu issue. THIS IS A BLOCKER!

If this connection was not lost at this point I did steps 4 and connection to the qemu process was lost after step 5

4. When above command is finished create a VM snapshot
5. Write data to the second part of the disk:
    Open a terminal inside the VM and run the following command
    dd if=/dev/zero of=/dev/vda bs=1M count=2048 seek=2048

Comment 9 Kevin Wolf 2015-11-23 09:35:50 UTC

I can only repeat once more my repeated request for more information as
explained in comment 3. Unless you can break this down to misbehaviour on the
qemu level, we can't debug this. "RHEV-M lost the connection" is a bug report
as precise as "something on my system is broken".

If you can't provide more detailed information on the qemu level, can you leave
a failed qemu (does it exist any more? With the information provided here I
don't even know yet whether it's hanging, has exited or what else) around and
provide me an SSH connection so I can have at least a look at the state of the
qemu process after the problem occurred?

Comment 11 Fam Zheng 2015-11-26 01:12:35 UTC

The comment 8 scenario doesn't seem to relate with mirror. But FWIW this may relate to https://bugzilla.redhat.com/show_bug.cgi?id=1277922

Comment 12 Michal Skrivanek 2015-12-07 15:10:47 UTC

(In reply to Kevin Alon Goldblatt from comment #8)

Did you retest with the fix of bug 1283987? Especially based on last comment it may very well resolve your problem

Comment 13 Kevin Alon Goldblatt 2015-12-14 21:03:30 UTC

Retested witht the following code:
vdsm-4.17.12-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch


Retested with the scenario in comment 8 >>>>> the same result.
When using gdb on the process and running I/O the VM enters into a Paused state and the console to the vm freezez
Without gdb on the qemu process the qemu proces dies/exits when running I/O

Reproduced the scenario with kwolf.

Comment 14 Michal Skrivanek 2015-12-15 05:05:52 UTC

(In reply to Kevin Alon Goldblatt from comment #13)
Did you retest with with the fix of bug 1283987?

Comment 15 Kevin Alon Goldblatt 2015-12-15 16:31:28 UTC

(In reply to Michal Skrivanek from comment #14)
> (In reply to Kevin Alon Goldblatt from comment #13)
> Did you retest with with the fix of bug 1283987?

Yes, I used the same code as they indicated in bug 1283987 (Comment 16) which fixed the problem, namely:

From Comment 16:

===================================================
verify the issue on x86_64:
host info:
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
3.10.0-327.4.1.el7.x86_64

steps:

steps are the same as above

results:
no core dumped, when extend lv to 4GB, the guest can be installed successfully.

------------------------------------------------------

Comment 16 Kevin Alon Goldblatt 2015-12-22 12:32:22 UTC

(In reply to Kevin Alon Goldblatt from comment #15)
> (In reply to Michal Skrivanek from comment #14)
> > (In reply to Kevin Alon Goldblatt from comment #13)
> > Did you retest with with the fix of bug 1283987?
> 
> Yes, I used the same code as they indicated in bug 1283987 (Comment 16)
> which fixed the problem, namely:
> 
> From Comment 16:
> 
> ===================================================
> verify the issue on x86_64:
> host info:
> qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
> 3.10.0-327.4.1.el7.x86_64
> 
> steps:
> 
> steps are the same as above
> 
> results:
> no core dumped, when extend lv to 4GB, the guest can be installed
> successfully.
> 
> ------------------------------------------------------

To clarify my comment. 

I used the same code that was used in bug 1283987. In that bug, the problem associated with it was remedied.

In this bug the problem of either a frozen or exited qemu process remains.

Comment 17 Kevin Wolf 2015-12-22 12:46:37 UTC

In the setup that Kevin provided me and where he said the problem would
reproduce (which didn't involve any mirroring), all I could see so far is that
VMs get stopped because their LV is full. They didn't automatically extend the
VM and then resume as you would expect, but they would resume successfully if
you requested it manually. So I think this is a VDSM bug.

I haven't seen qemu processes disappearing, and considering that in contrast to
the original report no mirroring was involved, I wonder whether that VDSM bug
that Kevin was apparently reproducing is different from what was originally
reported.

Comment 18 Han Han 2015-12-28 09:03:12 UTC

I tied to reproduce it in libvirt. I started a guest without block iotune and started iozone to provide high IO rate in guest. Then do common blockcopy:
# virsh blockcopy guest vda /var/lib/libvirt/images/new.qcow2  --verbose  --wait
And blockcopy could reach 100%.
Then do pivotv job:
# virsh blockjob guest vda --pivot
It continued running and not finished while the guest was accessible. When I stopped the iozone process in guest, the pivot job would finish soon and guest was OK to use.

In RHEVM, I also prepared a guest running iozone and did storage migration. The migration run for long time and report failed at last. But the guest was accessible all the time.

When I setup block iotune in libvirt and RHEVM, such as 20MB/s for total-bytes-sec, it was ok to finish the pivot job and storage migration.

So, I tend to consider that it is NotABug because it's hard to make images sync when guest is in high IO rate. In production environment, we could use the limits of blkdeviotune to make the storage migration success.

Comment 19 Ademar Reis 2016-01-15 12:18:50 UTC

Based on comment #17 and comment #18, closing it as NOTABUG. Please reopen if you have more details. Thanks.

Note You need to log in before you can comment on or make changes to this bug.

acanan
amureini
areis
christian.grundmann
derez
dyuan
ecohen
famz
fromani
gklein
hhan
huding
jcody
juzhang
kgoldbla
knoel
kwolf
lsurette
michal.skrivanek
mst
nsoffer
pbonzini
pezhang
rbalakri
Rhev-m-bugs
tnisan
virt-maint
xfu
xuzhang
yanyang
yeylon
ylavi