Bug 1295208 - Writing to thin iscsi disk with dd pauses VM (if the storage and writing is faster than it takes VDSM to extend the LV)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.0.3
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Assignee: Maor
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-03 14:52 UTC by Kevin Alon Goldblatt
Modified: 2018-07-10 08:54 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-19 11:03:26 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.3?
ylavi: exception?
ylavi: planning_ack+
kgoldbla: devel_ack?
kgoldbla: testing_ack?


Attachments
engine, vdsm, server logs (1.30 MB, application/x-gzip)
2016-01-03 14:59 UTC, Kevin Alon Goldblatt
logs-7.1.15 (4.86 MB, application/x-gzip)
2016-01-07 08:40 UTC, Elad
engine, vdsm, server logs (960.15 KB, application/x-gzip)
2016-02-03 13:57 UTC, Kevin Alon Goldblatt
Repoplot of vdsm.log (39.76 KB, application/pdf)
2016-02-26 16:29 UTC, Nir Soffer
Repoplot of vdsm.log (fixed) (134.78 KB, application/pdf)
2016-02-26 17:15 UTC, Nir Soffer
Repoplot of vdsm.log (improved) (202.14 KB, application/pdf)
2016-03-06 19:44 UTC, Nir Soffer
vdsm server and engine logs (1.59 MB, application/x-gzip)
2016-05-23 13:24 UTC, Kevin Alon Goldblatt

Description Kevin Alon Goldblatt 2016-01-03 14:52:01 UTC
Description of problem:
Writing to thin iscsi disk with dd pauses VM for 20-40 seconds during lvextend

Version-Release number of selected component (if applicable):
rhevm-3.6.0.3-0.1.el6.noarch
vdsm-4.17.12-0.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with thin iscsi disk and install Rhel OS
2. Add a second thin iscsi disk of 10g
3. Install file system on the second disk
4. Using dd write 5G of zero's to the disk >>>>> During the write the VM is paused up to 40 minutes during the lvextend operation
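(For reference, the write in step 4 was of roughly the following shape; the mount point and file name here are illustrative, not taken from the actual run:

dd if=/dev/zero of=/mnt/disk2/zeros.bin bs=1M count=5120 oflag=direct

where oflag=direct bypasses the guest page cache so the writes hit the virtual disk immediately.)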

Actual results:
The VM is paused during the Write operation

Expected results:
The VM should not be paused

Additional info:
 From engine log:
---------------------------------------------------------

2016-01-03 16:12:30,057 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-97) [671c66c1] VM '4c041a95-81e2-4114-bbb3-dae0f9188f49(kwolf_vm) is running in db and not running in VDS 'leopard01.qa.lab.tlv.redhat.com'
2016-01-03 16:12:30,310 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-7-thread-32) [725625d5] Running command: ProcessDownVmCommand internal: true.
2016-01-03 16:17:39,007 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-62) [49f4e0c4] VM '5a795aea-0ea4-48b4-8e9b-bee0be82e399'(vm1) moved from 'Up' --> 'Paused'
2016-01-03 16:17:39,500 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) [49f4e0c4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused.
2016-01-03 16:17:39,575 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) [49f4e0c4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused due to no Storage space error.
2016-01-03 16:17:44,792 ERROR [org.ovirt.engine.core.bll.host.provider.foreman.SystemProviderFinder] (ajp-/127.0.0.1:8702-10) [] Failed to find host on any provider by host name 'kevin-rhevm.scl.lab.tlv.redhat.com' 
2016-01-03 16:18:18,391 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-97) [671c66c1] VM '5a795aea-0ea4-48b4-8e9b-bee0be82e399'(vm1) moved from 'Paused' --> 'Up'
2016-01-03 16:18:18,571 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-97) [671c66c1] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has recovered from paused back to up.
2016-01-03 16:18:19,617 ERROR [org.ovirt.engine.core.bll.host.provider.foreman.SystemProviderFinder] (ajp-/127.0.0.1:8702-19) [] Failed to find host on any provider by host name 'kevin-rhevm.scl.lab.tlv.redhat.com' 


From vdsm log:
-----------------------------
1b01b075-637e-43de-92d5-844db6e371ec::DEBUG::2016-01-03 16:17:46,531::storage_mailbox::146::Storage.SPM.Messages.Extend::(processRequest) processRequest, payload:'1xtnd2\x92\xd91\xfb,\xf1\xb7\x8dJ%\x0e\xb3\x8ap\x7f\xaf\xef\xb2\xa3\xf9T\x7f\x91\x02J\xec\xff\xe4\x97\xd8j000000000000140000000000000'
1b01b075-637e-43de-92d5-844db6e371ec::INFO::2016-01-03 16:17:46,531::storage_mailbox::161::Storage.SPM.Messages.Extend::(processRequest) processRequest: extending volume 6ad897e4-ffec-4a02-917f-54f9a3b2efaf in domain 7f708ab3-0e25-4a8d-b7f1-2cfb31d99232 (pool cce31f29-940a-487b-8e15-bffbf7870bc0) to size 5120
1b01b075-637e-43de-92d5-844db6e371ec::DEBUG::2016-01-03 16:17:46,533::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvextend --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516000c9|/dev/mapper/3514f0c5a516000ca|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --size 5120m 7f708ab3-0e25-4a8d-b7f1-2cfb31d99232/6ad897e4-ffec-4a02-917f-54f9a3b2efaf (cwd None)
1b01b075-637e-43de-92d5-844db6e371ec::DEBUG::2016-01-03 16:17:46,638::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n  WARNING: This metadata update is NOT backed up\n'; <rc> = 0
.
.
.
9f11fb41-1103-43ce-8312-9848fbcfc776::DEBUG::2016-01-03 16:17:50,660::storage_mailbox::146::Storage.SPM.Messages.Extend::(processRequest) processRequest, payload:'1xtnd2\x92\xd91\xfb,\xf1\xb7\x8dJ%\x0e\xb3\x8ap\x7f\xaf\xef\xb2\xa3\xf9T\x7f\x91\x02J\xec\xff\xe4\x97\xd8j000000000000180000000000000'
9f11fb41-1103-43ce-8312-9848fbcfc776::INFO::2016-01-03 16:17:50,660::storage_mailbox::161::Storage.SPM.Messages.Extend::(processRequest) processRequest: extending volume 6ad897e4-ffec-4a02-917f-54f9a3b2efaf in domain 7f708ab3-0e25-4a8d-b7f1-2cfb31d99232 (pool cce31f29-940a-487b-8e15-bffbf7870bc0) to size 6144
9f11fb41-1103-43ce-8312-9848fbcfc776::DEBUG::2016-01-03 16:17:50,661::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvextend --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516000c8|/dev/mapper/3514f0c5a516000c9|/dev/mapper/3514f0c5a516000ca|/dev/mapper/3514f0c5a516000cb|/dev/mapper/3514f0c5a516000cc|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --size 6144m 7f708ab3-0e25-4a8d-b7f1-2cfb31d99232/6ad897e4-ffec-4a02-917f-54f9a3b2efaf (cwd None)
9f11fb41-1103-43ce-8312-9848fbcfc776::DEBUG::2016-01-03 16:17:50,900::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!\n  WARNING: This metadata update is NOT backed up\n'; <rc> = 0

Comment 1 Kevin Alon Goldblatt 2016-01-03 14:59:07 UTC
Created attachment 1111182 [details]
engine, vdsm, server logs

Adding logs

Comment 2 Yaniv Kaul 2016-01-03 17:24:25 UTC
Kevin, is the VM paused for several seconds or minutes? The description is not clear:

"Writing to thin isci disk with dd pauses VM for 20-40 seconds during lvextend"

vs.

"4. Using dd write 5G of zero's to the disk >>>>> During the write the VM is paused up to 40 minutes during the lvextend operation"


Also, you have not provided information on the storage specs, etc.

Comment 3 Elad 2016-01-07 08:40:54 UTC
Created attachment 1112396 [details]
logs-7.1.15

Checked on my setup with:

rhevm-3.6.2-0.1.el6.noarch

vdsm-4.17.15-0.el7ev.noarch
libvirt-daemon-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
kernel-3.10.0-327.4.4.el7.x86_64
lvm2-2.02.130-5.el7_2.1.x86_64
sanlock-3.2.4-2.el7_2.x86_64
device-mapper-multipath-0.4.9-85.el7.x86_64
iscsi-initiator-utils-6.2.0.873-32.el7.x86_64


Tested over iSCSI using XtremIO as storage backend.

Writing 5G of zeros to the second sparse disk causes the VM to get paused for a short period of ~1 minute.

Attached full logs
=============================================================================


2016-01-07 08:07:07,949 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM '57df2c81-e457-49ce-9e6c-d43fa9f30aa5'(1) moved from 'Up' --> 'Paused'
2016-01-07 08:07:08,285 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM 1 has been paused.
2016-01-07 08:07:08,551 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM 1 has been paused due to no Storage space error.
2016-01-07 08:08:00,099 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-81) [7e609e66] VM '57df2c81-e457-49ce-9e6c-d43fa9f30aa5'(1) moved from 'Paused' --> 'Up'
2016-01-07 08:08:00,201 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-81) [7e609e66] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM 1 has recovered from paused back to up.



=============================================================================

lvextend:


c7ed6b56-b2e3-4f2b-b875-64afac789dba::DEBUG::2016-01-07 10:07:07,522::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/sudo -n /usr/sbin/lvm lvextend --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a516008b0|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --size 2048m 7b4e9053-46a3-4bcf-aa6f-93913c573341/00000104-86f0-48b4-834f-ca79cb0131d5 (cwd None)

61c5086c-26ac-4224-9d3a-d04be27b7680::DEBUG::2016-01-07 10:08:44,927::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/sudo -n /usr/sbin/lvm lvextend --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a51600096|/dev/mapper/3514f0c5a51600097|/dev/mapper/3514f0c5a51600098|/dev/mapper/3514f0c5a51600099|/dev/mapper/3514f0c5a5160009a|/dev/mapper/3514f0c5a516008b0|/dev/mapper/3514f0c5a516008b1|/dev/mapper/3514f0c5a516008b2|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --size 9216m 7b4e9053-46a3-4bcf-aa6f-93913c573341/00000104-86f0-48b4-834f-ca79cb0131d5 (cwd None)

Comment 4 Maor 2016-01-11 13:15:14 UTC
Nir,

It looks like the lvextend operation took only about 5 seconds, but VDSM reported that the VM was stopped for much longer (probably due to other related operations).
Is there anything we can do here?

Comment 5 Nir Soffer 2016-01-11 15:42:33 UTC
Vdsm reports about a VM being paused or resumed are in Francesco's territory, and
not related to storage.

We need to analyze the extend events, to check that we received the extend events
on time and that we handled them as expected.

Generally, I think this test is not a valid use case; when writing zeros, the storage
server can optimize the write, making the write operation much faster than we
can extend. We should test the extend flow with a real-world use case, such as
provisioning a VM, or copying data from one storage to another.

Setting the severity to medium since this is not a valid way to test extend.
Please update the severity if you can reproduce this issue during real flows.

Comment 6 Maor 2016-01-11 15:49:14 UTC
Elad & Kevin,
Can you please try to test this with a real file (maybe dd of an ISO file)? Does the VM also get paused?

Comment 7 Aharon Canan 2016-01-11 16:56:16 UTC
Writing zeros using dd doesn't create a file? 
Are you asking us to copy an existing file using dd instead of creating a new file?

I am not sure what the difference is for the system.

Comment 8 Maor 2016-01-11 19:21:28 UTC
Yes, use a dd operation with a real source file - something that the user might actually use.
XtremIO might have special optimizations for zero blocks, which make the operation finish really fast, and that could be the reason why the lvextend operation completed only after we got a no-space error from qemu.
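For example, something along these lines (the ISO path and mount point are only placeholders):

dd if=/path/to/some-image.iso of=/mnt/disk2/testfile bs=1M

i.e. real, non-zero data read from an existing file rather than from /dev/zero.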

Comment 9 Aharon Canan 2016-01-14 08:52:24 UTC
 (In reply to Maor from comment #6)
> Elad & Kevin,
> Can you please try to test this with a real file (maybe dd of ISO file),
> does the VM also gets paused?

Kevin, 

can you please try?

Comment 10 Kevin Alon Goldblatt 2016-02-02 14:52:17 UTC
(In reply to Aharon Canan from comment #9)
>  (In reply to Maor from comment #6)
> > Elad & Kevin,
> > Can you please try to test this with a real file (maybe dd of ISO file),
> > does the VM also gets paused?
> 
> Kevin, 
> 
> can you please try?

I copied an iso file to one of the disks using 'wget':

wget https://download.fedoraproject.org/pub/fedora/linux/releases/23/Workstation/x86_64/iso/Fedora-Live-Workstation-x86_64-23-10.iso

Then I copied this file to the 2nd disk using 'cp' as follows:
cp data/Fedora-Live-Workstation-x86_64-23-10.iso data/1
cp data/Fedora-Live-Workstation-x86_64-23-10.iso data/2
cp data/Fedora-Live-Workstation-x86_64-23-10.iso data/3

The VM was paused again for 50 seconds.

Comment 11 Maor 2016-02-02 19:21:01 UTC
Kevin, can you please attach the VDSM and engine logs?

Comment 12 Maor 2016-02-02 19:25:21 UTC
Also, just out of curiosity, didn't you want to do that with 'dd' instead of 'cp'?

Comment 13 Kevin Alon Goldblatt 2016-02-03 13:47:51 UTC
(In reply to Maor from comment #11)
> Kevin, can you please attach the VDSM and engine logs

vdsm.log:
------------
rts_iso/4b7cf7b1-98c0-4e60-b515-409a7daaf072/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-5188::DEBUG::2016-02-02 16:38:27,728::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n377 bytes (377 B) copied, 0.000850107 s, 443 kB/s\n'; <rc> = 0
libvirtEventLoop::INFO::2016-02-02 16:38:27,860::vm::3728::virt.vm::(onIOError) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::abnormal vm stop device virtio-disk2 error enospc
libvirtEventLoop::INFO::2016-02-02 16:38:27,860::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::CPU stopped: onIOError
libvirtEventLoop::INFO::2016-02-02 16:38:27,895::vm::1026::virt.vm::(extendDrivesIfNeeded) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::Requesting extension for volume d9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6 on domain 2deae191-8162-4b46-8135-0f390c159ded (apparent: 4294967296, capacity: 11811160064, allocated: 4302503424, physical: 4294967296)
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,895::vm::1091::virt.vm::(__extendDriveVolume) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::Requesting an extension for the volume: {'newSize': 5368709120, 'domainID': u'2deae191-8162-4b46-8135-0f390c159ded', 'name': u'vdc', 'volumeID': u'd9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6', 'imageID': u'9ffc9eac-56d0-4583-9d73-5b67a75b0f6f', 'internal': False, 'poolID': u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2'}
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,896::task::595::Storage.TaskManager.Task::(_updateState) Task=`57fa165a-27dd-4ed0-8807-48598958d36e`::moving from state init -> state preparing
libvirtEventLoop::INFO::2016-02-02 16:38:27,896::logUtils::48::dispatcher::(wrapper) Run and protect: sendExtendMsg(spUUID=u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2', volDict={'newSize': 5368709120, 'domainID': u'2deae191-8162-4b46-8135-0f390c159ded', 'name': u'vdc', 'volumeID': u'd9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6', 'imageID': u'9ffc9eac-56d0-4583-9d73-5b67a75b0f6f', 'internal': False, 'poolID': u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2'}, newSize=5368709120, callbackFunc=<bound method Vm.__afterVolumeExtension of <virt.vm.Vm object at 0x7f96ec029590>>)
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,896::storage_mailbox::123::Storage.SPM.Messages.Extend::(__init__) new extend msg created: domain: 2deae191-8162-4b46-8135-0f390c159ded, volume: d9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6
libvirtEventLoop::INFO::2016-02-02 16:38:27,897::logUtils::51::dispatcher::(wrapper) Run and protect: sendExtendMsg, Return response: None
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,897::task::1191::Storage.TaskManager.Task::(prepare) Task=`57fa165a-27dd-4ed0-8807-48598958d36e`::finished: None
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,897::task::595::Storage.TaskManager.Task::(_updateState) Task=`57fa165a-27dd-4ed0-8807-48598958d36e`::moving from state preparing -> state finished
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,897::resourceManager::943::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,897::resourceManager::980::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,897::task::993::Storage.TaskManager.Task::(_decref) Task=`57fa165a-27dd-4ed0-8807-48598958d36e`::ref 0 aborting False
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,898::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"0ee7709a-c4db-443d-ad5e-8c4cb941c62d": {"status": "Paused", "ioerror": {"alias": "virtio-disk2", "name": "vdc", "path": "/rhev/data-center/c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2/2deae191-8162-4b46-8135-0f390c159ded/images/9ffc9eac-56d0-4583-9d73-5b67a75b0f6f/d9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6"}, "pauseCode": "ENOSPC"}, "notify_time": 4464231980}, "jsonrpc": "2.0", "method": "|virt|VM_status|0ee7709a-c4db-443d-ad5e-8c4cb941c62d"}
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,899::vm::4563::virt.vm::(onLibvirtLifecycleEvent) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2016-02-02 16:38:27,899::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::CPU stopped: onSuspend
libvirtEventLoop::INFO::2016-02-02 16:38:27,899::vm::3728::virt.vm::(onIOError) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::abnormal vm stop device virtio-disk2 error enospc
libvirtEventLoop::INFO::2016-02-02 16:38:27,899::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::CPU stopped: onIOError
libvirtEventLoop::INFO::2016-02-02 16:38:27,910::vm::1026::virt.vm::(extendDrivesIfNeeded) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::Requesting extension for volume d9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6 on domain 2deae191-8162-4b46-8135-0f390c159ded (apparent: 4294967296, capacity: 11811160064, allocated: 4302503424, physical: 4294967296)
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,911::vm::1091::virt.vm::(__extendDriveVolume) vmId=`0ee7709a-c4db-443d-ad5e-8c4cb941c62d`::Requesting an extension for the volume: {'newSize': 5368709120, 'domainID': u'2deae191-8162-4b46-8135-0f390c159ded', 'name': u'vdc', 'volumeID': u'd9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6', 'imageID': u'9ffc9eac-56d0-4583-9d73-5b67a75b0f6f', 'internal': False, 'poolID': u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2'}
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,911::task::595::Storage.TaskManager.Task::(_updateState) Task=`dacd6406-591b-46fd-9255-36bac2859335`::moving from state init -> state preparing
libvirtEventLoop::INFO::2016-02-02 16:38:27,911::logUtils::48::dispatcher::(wrapper) Run and protect: sendExtendMsg(spUUID=u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2', volDict={'newSize': 5368709120, 'domainID': u'2deae191-8162-4b46-8135-0f390c159ded', 'name': u'vdc', 'volumeID': u'd9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6', 'imageID': u'9ffc9eac-56d0-4583-9d73-5b67a75b0f6f', 'internal': False, 'poolID': u'c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2'}, newSize=5368709120, callbackFunc=<bound method Vm.__afterVolumeExtension of <virt.vm.Vm object at 0x7f96ec029590>>)
libvirtEventLoop::DEBUG::2016-02-02 16:38:27,912::storage_mailbox::123::Storage.SPM.Messages.Extend::(__init__) new extend msg created: domain: 2deae191-8162-4b46-8135-0f390c159ded, volume: d9e9897b-fdb1-4c73-9ff9-a5d9b0ec2be6



engine.log:
---------------
2016-02-02 16:31:43,666 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [] Stage: Pre-termination
2016-02-02 16:31:43,692 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [] Retrieving installation logs to: '/var/log/ovirt-engine/host-deploy/ovirt-host-mgmt-20160202163143-10.35.102.15-null.log'
2016-02-02 16:31:43,863 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [] Stage: Termination
2016-02-02 16:38:26,658 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '0ee7709a-c4db-443d-ad5e-8c4cb941c62d'(vm1) moved from 'Up' --> 'Paused'
2016-02-02 16:38:26,735 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused.
2016-02-02 16:38:26,793 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused due to no Storage space error.
2016-02-02 16:39:11,971 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-22) [] VM '0ee7709a-c4db-443d-ad5e-8c4cb941c62d'(vm1) moved from 'Paused' --> 'Up'
2016-02-02 16:39:12,060 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-22) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has recovered from paused back to up.
2016-02-02 17:01:09,052 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (ajp-/127.0.0.1:8702-4) [2ae17833] Lock Acquired to object 'EngineLock:{exclusiveLocks='[vm3=<VM_NAME, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='[00000000-0000-0000-0000-000000000000=<TEMPLATE, ACTION_TYPE_FAILED_TEMPLATE_IS_USED_FOR_CREATE_VM$VmName vm3>]'}'
2016-02-02 17:01:09,299 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (ajp-/127.0.0.1:8702-4) [] Running command: AddVmCommand internal: false. Entities affected :  ID: dace548b-40bf-4bfa-9a76-a5fcaa058d77 Type: VdsGroupsAction group CREATE_VM with role type USER,  ID: 00000000-0000-0000-0000-000000000000 Type: VmTemplateAction group CREATE_VM with role type USER
2016-02-02 17:01:09,827 INFO  [org.ovirt.engine.core.bll.AddGraphicsDeviceCommand] (ajp-/127.0.0.1:8702-4) [55c5cd43] Running command: AddGraphicsDeviceCommand internal: true. Entities affected :  ID: 0e283ef3-1fd6-43ac-84da-bbc1815baff2 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2016-02-02 17:01:09,964 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-4) [55c5cd43] Correlation ID: 2ae17833, Job ID: baab7406-d79d-4166-b2ea-702d37771164, Call Stack: null, Custom Event ID: -1, Message: VM vm3 was created by admin@internal.
2016-02-02 17:01:10,025 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (ajp-/127.0.0.1:8702-4) [55c5cd43] Lock freed to object 'EngineLock:{exclusiveLocks='[vm3=<VM_NAME, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='[00000000-0000-0000-0000-000000000000=<TEMPLATE, ACTION_TYPE_FAILED_TEMPLATE_IS_USED_FOR_CREATE_VM$VmName vm3>]'}'
2016-02-02 17:01:11,130 INFO  [org.ovirt.engine.core.bll.network.vm.AddVmInterfaceCommand] (ajp-/127.0.0.1:8702-7) [20721d59] Running command: AddVmInterfaceCommand internal: false. Entities affected :  ID: 0e283ef3-1fd6-43ac-84da-bbc1815baff2 Type: VMAction group CONFIGURE_VM_NETWORK with role type USER,  ID: 14e712d8-f4e4-4812-8094-bd1ec406e214 Type: VnicProfileAction group CONFIGURE_VM_NETWORK with role type USER
2016-02-02 17:01:11,517 INFO  [org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand] (ajp-/127.0.0.1:8702-7) [6c5fbdab] Running command: ActivateDeactivateVmNicCommand internal: true. Entities affected :  ID: 0e283ef3-1fd6-43ac-84da-bbc1815baff2 Type: VM
2016-02-02 17:01:11,701 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [6c5fbdab] Correlation ID: 6c5fbdab, Call Stack: null, Custom Event ID: -1, Message: Network Interface nic1 (VirtIO) was plugged to VM vm3. (User: admin@internal)
2016-02-02 17:01:11,782 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [6c5fbdab] Correlation ID: 20721d59, Call Stack: null, Custom Event ID: -1, Message: Interface nic1 (VirtIO) was added to VM vm3. (User: admin@internal)
2016-02-02 17:01:11,966 INFO  [org.ovirt.engine.core.bll.network.vm.ReorderVmNicsCommand] (ajp-/127.0.0.1:8702-15) [16dafa38] Running command: ReorderVmNicsCommand internal: false. Entities affected :  ID: 0e283ef3-1fd6-43ac-84da-bbc1815baff2 Type: VMAction group CREATE_VM with role type USER
2016-02-02 17:01:16,237 INFO  [org.ovirt.engine.core.bll.AddDiskCommand] (ajp-/127.0.0.1:8702-7) [1c83a37a] Lock Acquired to object 'EngineLock:{exclusiveLocks='[0e283ef3-1fd6-43ac-84da-bbc1815baff2=<VM_DISK_BOOT, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='[0e283ef3-1fd6-43ac-84da-bbc1815baff2=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]'}'
2016-02-02 17:01:16,364 INFO  [org.ovirt.engine.core.bll.AddDiskCommand] (ajp-/127.0.0.1:8702-7) [1c83a37a] Running command: AddDiskCommand internal: false. Entities affected :  ID: 0e283ef3-1fd6-43ac-84da-bbc1815baff2 Type: VMAction group CONFIGURE_VM_STORAGE with role type USER,  ID: 2deae191-8162-4b46-8135-0f390c159ded Type: StorageAction group CREATE_DISK with role type USER
2016-02-02 17:01:16,399 INFO  [org.ovirt.engine.core.bll.AddImageFromScratchCommand] (ajp-/127.0.0.1:8702-7) [47b383] Running command: AddImageFromScratchCommand internal: true. Entities affected :  ID: 2deae191-8162-4b46-8135-0f390c159ded Type: Storage
2016-02-02 17:01:16,550 INFO  [org.ovirt.engine.core.bll.AddImageFromScratchCommand] (ajp-/127.0.0.1:8702-7) [47b383] Lock freed to object 'EngineLock:{exclusiveLocks='[0e283ef3-1fd6-43ac-84da-bbc1815baff2=<VM_DISK_BOOT, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='[0e283ef3-1fd6-43ac-84da-bbc1815baff2=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]'}'
2016-02-02 17:01:17,602 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (ajp-/127.0.0.1:8702-7) [47b383] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{runAsync='true', storagePoolId='c96b5ce2-9b6b-467f-85f2-ef212eb3ebc2', ignoreFailoverLimit='false', storageDomainId='2deae191-8162-4b46-8135-0f390c159ded', imageGroupId='bcab9f13-d6e0-490a-99d5-4142a177287a', imageSizeInBytes='5368709120', volumeFormat='COW', newImageId='1d14a138-21eb-4376-b12b-f6d8b6eb2c54', newImageDescription='{"DiskAlias":"vm3_Disk1","DiskDescription":""}', imageInitialSizeInBytes='0'}), log id: 2e2be3b3
2016-02-02 17:01:17,690 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (ajp-/127.0.0.1:8702-7) [47b383] -- executeIrsBrokerCommand: calling 'createVolume' with two new parameters: description and UUID
2016-02-02 17:01:18,790 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (ajp-/127.0.0.1:8702-7) [47b383] FINISH, CreateImageVDSCommand, return: 1d14a138-21eb-4376-b12b-f6d8b6eb2c54, log id: 2e2be3b3

Comment 14 Kevin Alon Goldblatt 2016-02-03 13:57:15 UTC
Created attachment 1120791 [details]
engine, vdsm, server logs

Added requested logs

Comment 15 Yaniv Kaul 2016-02-24 09:37:49 UTC
(In reply to Nir Soffer from comment #5)
> Vdsm reports about vm being paused or resume are in Francesco territory, and 
> not related to storage.
> 
> We need to analyze the extend events, to check that we did receive extend
> events
> on time, and that we did handle them on as expected.
> 
> Generally, I think this test is not valid use case; when writing zeros,
> storage
> server can optimize the write, making the write operation much faster then we
> can extend. We should test extend flow with a real world use case, such as
> provisioning a vm, or copying data from one storage to another storage.

Specifically, XtremIO indeed optimizes that and does not write anything. In fact, even metadata is not allocated (reads from unallocated blocks return zero, so it's the same thing).

> 
> Setting the severity to medium since this is not a valid way to test extend.
> Please update the severity if you can reproduce this issue during real flows.

Agreed. Setting NEEDINFO on reporter.

Comment 16 Nir Soffer 2016-02-26 16:29:54 UTC
Created attachment 1130862 [details]
Repoplot of vdsm.log

Comment 17 Nir Soffer 2016-02-26 16:53:23 UTC
Looking at the repoplot of the vdsm log (attachment 1130862 [details]) we can see slow
mailbox check-for-mail commands (1 second) during the entire log
(16:00 - 19:00).

At 18:36, we see a very slow check-for-mail command (5 seconds). At the
same time, there is also a high read delay (6 seconds). A slow SPM mailbox check
leads to a slow extend, which may lead to a VM being paused.

So a possible theory for this failure is:

1. Doing big copy, saturating the storage bandwidth
2. Mailbox check for mail commands become slow
3. Extend operation delayed
4. VM gets paused

Todo:

1. Understand why we see slow check for mail (1 second) - this should be
   much faster. On my poor storage (targetcli, 1Gb NIC), this takes about
   50 milliseconds.

2. Check the logs around 18:36, to understand if the vm paused at this point.

Aharon, can you describe the environment where this was tested?
- iscsi/fc
- nics used for storage
- storage server

Comment 18 Nir Soffer 2016-02-26 17:15:44 UTC
Created attachment 1130868 [details]
Repoplot of vdsm.log (fixed)

This version fixes the display of values around 1 second.

Comment 19 Nir Soffer 2016-02-26 17:20:48 UTC
Regarding the high mailbox check-for-mail dd commands, this was the result
of truncating timestamps to one-second resolution. Attachment 1130862 [details] shows that
the values are pretty low and should not introduce a significant delay in the
mailbox mechanism.

Comment 20 Nir Soffer 2016-02-26 17:22:00 UTC
(In reply to Nir Soffer from comment #19)
> of truncating timestamps to seconds resolution. Attachment 1130862 [details]
Correction: attachment 1130868 [details]

Comment 21 Aharon Canan 2016-03-02 08:16:22 UTC
Info needed from Kevin

Comment 22 Allon Mureinik 2016-03-03 14:44:14 UTC
Based on the discussion above there doesn't seem to be anything actionable here - this is a "by design" limitation.

Pushing out to 4.0 to reconsider.

Comment 23 Kevin Alon Goldblatt 2016-03-06 16:24:15 UTC
(In reply to Yaniv Kaul from comment #2)
> Kevin, is the VM paused for several seconds or minutes? The description is
> not clear:
> 
> "Writing to thin isci disk with dd pauses VM for 20-40 seconds during
> lvextend"
> 
> vs.
> 
> "4. Using dd write 5G of zero's to the disk >>>>> During the write the VM is
> paused up to 40 minutes during the lvextend operation"
> 
> 
> Also, you have not provided information on the storage specs, etc.

The VM is paused for 20-40 seconds, not minutes. Typo, sorry.

Comment 24 Kevin Alon Goldblatt 2016-03-06 16:27:09 UTC
(In reply to Nir Soffer from comment #17)
> Looking in repoplot of vdsm log (attachment 1130862 [details]) we can see
> slow
> mailbox check for mail commands (1 second) during the entire log
> (16:00 - 19:00).
> 
> At 18:36, we see very slow mailbox check for mail command (5 seconds). At
> the 
> same time, there is also high read delay (6 seconds). Slow spm mailbox check
> leads to slow extend, which may lead to pausing of a vm.
> 
> So a possible theory for this failure is:
> 
> 1. Doing big copy, saturating the storage bandwidth
> 2. Mailbox check for mail commands become slow
> 3. Extend operation delayed
> 4. Vm get paused
> 
> Todo:
> 
> 1. Understand why we see slow check for mail (1 second) - this should be
>    much faster. On my poor storage (targetcli, 1Gib nic), this takes about
>    50 milliseconds.
> 
> 2. Check the logs around 18:36, to understand if the vm paused at this point.
> 
> Aharon, can you describe the environment where this was tested?
> - iscsi/fc
> - nics used for storage
> - storage server

Environment is:
iscsi
1 Nic
ExtremeIO

Comment 25 Kevin Alon Goldblatt 2016-03-06 16:29:25 UTC
(In reply to Aharon Canan from comment #21)
> Info needed from Kevin


In response to Comment 17,
The environment is as follows:
iscsi
1 Nic
ExtremeIO

Comment 26 Yaniv Kaul 2016-03-06 18:18:17 UTC
(In reply to Kevin Alon Goldblatt from comment #25)
> (In reply to Aharon Canan from comment #21)
> > Info needed from Kevin
> 
> 
> In response to Comment 17,
> The environment is as follows:
> iscsi
> 1 Nic

10g or 1g? In any case, quite an unusual configuration for iSCSI - usually, multipathing is employed.

> ExtremeIO

XtremIO.

Did you check the latency on the XtremIO of that initiator? Volume?

Regardless, it should not take so long.
However, you are writing zeros. XtremIO does not write zeros; it's just a metadata operation. Therefore, it is a very fast (and unrealistic) scenario. Did it ever reproduce with any other IO pattern? I assume a 1G or 10G link - you can probably come close to saturating the link?

Comment 27 Nir Soffer 2016-03-06 19:44:28 UTC
Created attachment 1133561 [details]
Repoplot of vdsm.log (improved)

Created with latest version of repoplot, providing more information.

Comment 28 Kevin Alon Goldblatt 2016-03-13 14:22:27 UTC
(In reply to Yaniv Kaul from comment #26)
> (In reply to Kevin Alon Goldblatt from comment #25)
> > (In reply to Aharon Canan from comment #21)
> > > Info needed from Kevin
> > 
> > 
> > In response to Comment 17,
> > The environment is as follows:
> > iscsi
> > 1 Nic
> 
> 10g or 1g? In any case, quite an unusual configuration for iSCSI - usually,
> multipathing is employed.

We are using multipathing of 4 paths
We are using a 1G ethernet connection from the hosts

> 
> > ExtremeIO
> 
> XtremIO.
> 
> Did you check the latency on the XtremIO of that initiator? Volume?
> 
> Regardless, it should not take so long.
> However, you are writing zero's. XtremIO does not write zeros, it's just
> metadata operation. Therefore, it is very fast (and unrealistic) scenario.
> Did it ever reproduce with any other IO pattern? I assume 1G or 10G link,
> you can probably close to saturating the link?

I encountered the same behaviour when using 'dd' from /dev/zero AND, as mentioned in comment 10, when copying a regular file with 'cp'. So that removes the 'dd' from /dev/zero possibility from the equation.

Comment 29 Yaniv Kaul 2016-03-13 14:40:01 UTC
(In reply to Kevin Alon Goldblatt from comment #28)
> (In reply to Yaniv Kaul from comment #26)
> > (In reply to Kevin Alon Goldblatt from comment #25)
> > > (In reply to Aharon Canan from comment #21)
> > > > Info needed from Kevin
> > > 
> > > 
> > > In response to Comment 17,
> > > The environment is as follows:
> > > iscsi
> > > 1 Nic
> > 
> > 10g or 1g? In any case, quite an unusual configuration for iSCSI - usually,
> > multipathing is employed.
> 
> We are using multipathing of 4 paths
> We are using a 1G ethernet connection from the hosts

But essentially a single NIC? That's very low. It's good that you have multipathing, but it is meaningless on a 1Gb link (it's easy to saturate a single 1Gb link with a single TCP connection - on 10Gb it's better to have multiple TCP connections, which is exactly what you have with multipathing).

> 
> > 
> > > ExtremeIO
> > 
> > XtremIO.
> > 
> > Did you check the latency on the XtremIO of that initiator? Volume?
> > 
> > Regardless, it should not take so long.
> > However, you are writing zero's. XtremIO does not write zeros, it's just
> > metadata operation. Therefore, it is very fast (and unrealistic) scenario.
> > Did it ever reproduce with any other IO pattern? I assume 1G or 10G link,
> > you can probably close to saturating the link?
> 
> I encountered the same behaviour when using 'dd' from /dev/zero AND as
> mentioned in comment 10, when copying a regular file with 'cp'. So that
> removes the 'dd' from /dev/zero possibility from the equation

Where are you copying it from? From XtremIO to XtremIO? It's dedup'ing - again, no real write.
What I keep trying to understand here: could it be that you are filling the disk faster than we are extending it? For this, I need to know the effective speed at which you are writing. It may well be that you are writing faster.
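For illustration only (the path and numbers below are hypothetical, not measured on this setup), the effective rate can be read straight off dd's completion summary, e.g.:

dd if=/dev/zero of=/mnt/disk2/speedtest bs=1M count=5120 oflag=direct
5368709120 bytes (5.4 GB) copied, 48.2 s, 111 MB/s

For the 'cp' case the copy can be wrapped in 'time' and the file size divided by the elapsed time. Keep in mind that zeros may be acknowledged unrealistically fast by a deduplicating array.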

Comment 30 Kevin Alon Goldblatt 2016-05-23 13:22:47 UTC
I ran the same operation again as follows:

Using the following code:
----------------------------------------
rhevm-4.0.0-0.7.master.el7ev.noarch
vdsm-4.18.0-16.git51df339.el7.centos.x86_64


Ran the following scenario:
---------------------------------------
1. Created a VM and installed Rhel7.2
2. Added 1 iscsi thin disk and 1 nfs thin disk
3. Installed file system on both disks
4. Mounted both disks
5. Copied an iso file of 1.4GB from the fedora site to the nfs disk
6. Using 'cp' copied the file from the nfs disk to the iscsi disk >>>>> After about a minute or so of copying, the VM was paused as before in all my other attempts.
So I ruled out the 'dd from /dev/zero' issue by using 'cp' of a file.
I also ruled out the de-duplication possibility on the XtremIO.
And I ruled out the 1G NIC issue by using a host with a 10G NIC.

I am attaching the new logs

Comment 31 Kevin Alon Goldblatt 2016-05-23 13:24:35 UTC
Created attachment 1160629 [details]
vdsm server and engine logs

Added logs

Comment 32 Kevin Alon Goldblatt 2016-05-23 14:49:34 UTC

Engine.log:
---------------------
2016-05-22 18:40:00,022 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-60) [] Autorecovering 1 storage domains
2016-05-22 18:40:00,022 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-60) [] Autorecovering storage domains id: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 
2016-05-22 18:40:00,026 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 Type: Storage
2016-05-22 18:40:00,026 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] ConnectDomainToStorage. Before Connect all hosts to pool. Time: Sun May 22 18:40:00 IDT 2016
2016-05-22 18:40:00,056 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-23) [39c2c7fa] START, ConnectStorageServerVDSCommand(HostName = puma22.scl.lab.tlv.redhat.com, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='9a9847eb-8132-4215-8b93-90cb8bfb6440', storagePoolId='a7484455-20aa-4dde-87b3-f60956fc7e9d', storageType='ISCSI', connectionList='[StorageServerConnections:{id='b449f09b-0613-4af4-8f9d-80951f4741bd', connection='10.35.146.225', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='2c148526-496c-4a9c-86cd-a068aa0b2846', connection='10.35.146.193', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='9eecb527-422e-4360-af37-190800c6b502', connection='10.35.146.129', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='e1d0e6c0-5088-46ad-872e-acda578f7568', connection='10.35.146.161', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 3c00a0d1
2016-05-22 18:40:03,197 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-23) [39c2c7fa] FINISH, ConnectStorageServerVDSCommand, return: {b449f09b-0613-4af4-8f9d-80951f4741bd=0, 2c148526-496c-4a9c-86cd-a068aa0b2846=0, e1d0e6c0-5088-46ad-872e-acda578f7568=0, 9eecb527-422e-4360-af37-190800c6b502=0}, log id: 3c00a0d1
2016-05-22 18:40:03,198 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] ConnectDomainToStorage. After Connect all hosts to pool. Time: Sun May 22 18:40:03 IDT 2016
2016-05-22 18:41:28,090 WARN  [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-6) [] skipping VM '521e0a5d-737d-4382-ad47-e2d884b24786' from this monitoring cycle - the VM data has changed since fetching the data
2016-05-22 18:42:11,141 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler_Worker-23) [] VM '521e0a5d-737d-4382-ad47-e2d884b24786'(vm1) moved from 'Up' --> 'Paused'
2016-05-22 18:42:11,266 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-23) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused.
2016-05-22 18:42:11,394 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-23) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused due to no Storage space error.


VDSM.log:
--------------------
2016-05-22 18:35:00,041 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-28) [2d5fdbbe] START, ConnectStorageServerVDSCommand(HostName = puma22.scl.lab.tlv.redhat.com, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='9a9847eb-8132-4215-8b93-90cb8bfb6440', storagePoolId='a7484455-20aa-4dde-87b3-f60956fc7e9d', storageType='ISCSI', connectionList='[StorageServerConnections:{id='b449f09b-0613-4af4-8f9d-80951f4741bd', connection='10.35.146.225', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='2c148526-496c-4a9c-86cd-a068aa0b2846', connection='10.35.146.193', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='9eecb527-422e-4360-af37-190800c6b502', connection='10.35.146.129', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='e1d0e6c0-5088-46ad-872e-acda578f7568', connection='10.35.146.161', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 1cd3cb8d
2016-05-22 18:35:02,432 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-28) [2d5fdbbe] FINISH, ConnectStorageServerVDSCommand, return: {b449f09b-0613-4af4-8f9d-80951f4741bd=0, 2c148526-496c-4a9c-86cd-a068aa0b2846=0, e1d0e6c0-5088-46ad-872e-acda578f7568=0, 9eecb527-422e-4360-af37-190800c6b502=0}, log id: 1cd3cb8d
2016-05-22 18:35:02,433 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-49) [2d5fdbbe] ConnectDomainToStorage. After Connect all hosts to pool. Time: Sun May 22 18:35:02 IDT 2016
2016-05-22 18:40:00,022 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-60) [] Autorecovering 1 storage domains
2016-05-22 18:40:00,022 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-60) [] Autorecovering storage domains id: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 
2016-05-22 18:40:00,026 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 Type: Storage
2016-05-22 18:40:00,026 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] ConnectDomainToStorage. Before Connect all hosts to pool. Time: Sun May 22 18:40:00 IDT 2016
2016-05-22 18:40:00,056 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-23) [39c2c7fa] START, ConnectStorageServerVDSCommand(HostName = puma22.scl.lab.tlv.redhat.com, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='9a9847eb-8132-4215-8b93-90cb8bfb6440', storagePoolId='a7484455-20aa-4dde-87b3-f60956fc7e9d', storageType='ISCSI', connectionList='[StorageServerConnections:{id='b449f09b-0613-4af4-8f9d-80951f4741bd', connection='10.35.146.225', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='2c148526-496c-4a9c-86cd-a068aa0b2846', connection='10.35.146.193', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='9eecb527-422e-4360-af37-190800c6b502', connection='10.35.146.129', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='e1d0e6c0-5088-46ad-872e-acda578f7568', connection='10.35.146.161', iqn='iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 3c00a0d1
2016-05-22 18:40:03,197 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-23) [39c2c7fa] FINISH, ConnectStorageServerVDSCommand, return: {b449f09b-0613-4af4-8f9d-80951f4741bd=0, 2c148526-496c-4a9c-86cd-a068aa0b2846=0, e1d0e6c0-5088-46ad-872e-acda578f7568=0, 9eecb527-422e-4360-af37-190800c6b502=0}, log id: 3c00a0d1
2016-05-22 18:40:03,198 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-60) [39c2c7fa] ConnectDomainToStorage. After Connect all hosts to pool. Time: Sun May 22 18:40:03 IDT 2016
2016-05-22 18:41:28,090 WARN  [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-6) [] skipping VM '521e0a5d-737d-4382-ad47-e2d884b24786' from this monitoring cycle - the VM data has changed since fetching the data
2016-05-22 18:42:11,141 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler_Worker-23) [] VM '521e0a5d-737d-4382-ad47-e2d884b24786'(vm1) moved from 'Up' --> 'Paused'
2016-05-22 18:42:11,266 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-23) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 has been paused.
2016-05-22 18:42:11,394 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-23) [] Correlation ID: null, Call Stack: null, Custom Ev:
2016-05-22 18:45:00,023 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-40) [] Autorecovering 1 storage domains
2016-05-22 18:45:00,024 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-40) [] Autorecovering storage domains id: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 
2016-05-22 18:45:00,027 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-40) [2966a79b] Running command: ConnectDomainToStorageCommand internal: true. Entities affected :  ID: e3cf4b5c-ba13-40aa-83e5-5496f36fb036 Type: Storage
2016-05-22 18:45:00,027 INFO  [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler_Worker-40) [2966a79b] ConnectDomainToStorage. Before Connect all hosts to pool. Time: Sun May 22 18:45:00 IDT 2016

Comment 33 Yaniv Kaul 2017-01-15 15:32:30 UTC
Eyal - with the workaround suggestion, what is the latest on this?

Comment 34 Yaniv Lavi 2017-02-07 15:22:05 UTC
What is the workaround?

Comment 35 Yaniv Kaul 2017-02-07 16:43:45 UTC
(In reply to Yaniv Dary from comment #34)
> What is the workaround?

Under the [irs] section, set something like:
volume_utilization_percent=25  
volume_utilization_chunk_mb=10240

This will start increasing the disk after 25% is filled and will increase it by 10G.
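As a sketch of where these settings live (assuming a stock VDSM installation - the config file is typically /etc/vdsm/vdsm.conf on each host, and vdsmd needs to be restarted to pick up the change):

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 10240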

Comment 36 Allon Mureinik 2017-07-24 14:05:53 UTC
(In reply to Yaniv Kaul from comment #33)
> Eyal - with the workaround suggestion, what is the latest on this?

Returning the needinfo that was removed by mistake.

Comment 37 Allon Mureinik 2017-09-05 14:34:57 UTC
Pushing out. If we have actionable data, we can consider bringing this back in to 4.2.

Comment 38 Yaniv Kaul 2017-11-19 11:03:26 UTC
Closing. If QE gets to reproduce this, please feel free to re-open.

Comment 39 Raz Tamir 2018-07-10 08:07:25 UTC
Clearing needinfo

