Bug 1279052 - [ppc64le] consume fix for: "After writing to a secondary iscsi thin provision disk with data that has been resized, the qemu process dies or vm stops with not enough space"
Summary: [ppc64le] consume fix for: "After writing to a secondary iscsi thin provision...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: ppc64le
OS: Linux
high
high
Target Milestone: ovirt-3.6.2
: 3.6.0
Assignee: Daniel Erez
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1281520 1283987
Blocks: RHEV3.6PPC
TreeView+ depends on / blocked
 
Reported: 2015-11-07 13:47 UTC by Carlos Mestre González
Modified: 2016-03-10 11:58 UTC (History)
27 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-10 10:34:12 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine.log (1.84 MB, text/plain)
2015-11-07 13:49 UTC, Carlos Mestre González
no flags Details
vdsm.log (3.09 MB, text/plain)
2015-11-07 13:50 UTC, Carlos Mestre González
no flags Details
message.log when vm paused (816.04 KB, text/plain)
2015-11-07 13:52 UTC, Carlos Mestre González
no flags Details
vdsm logs (1.37 MB, application/x-gzip)
2015-11-10 10:15 UTC, Raz Tamir
no flags Details
Logs new run (3.70 MB, application/x-gzip)
2015-11-11 14:11 UTC, Carlos Mestre González
no flags Details
logs run with new qemu-kvm-rhev packages (10.49 MB, application/x-gzip)
2015-11-20 15:46 UTC, Carlos Mestre González
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 50435 0 master MERGED spec: Require newer qemu version for el7 2015-12-20 11:00:47 UTC
oVirt gerrit 50766 0 ovirt-3.6 MERGED spec: Require newer qemu version for el7 2015-12-21 02:29:58 UTC

Description Carlos Mestre González 2015-11-07 13:47:40 UTC
Description of problem:
I've been testing live resize of disks with PPC and there's some "weird" issues going on for thin provisioned disks.

There are two scenarios after a live disk resize and trying to write the disk:


1) vm just stops (engine shows qemu process was lost) or

2) the VM is paused for not enough space (though we're writing less data that is suppose to) and any action to the disks just freezes (umount the partition, trying to access the disk table).

Version-Release number of selected component (if applicable):
rhevm-3.6.0.2-0.1.el6.noarch
libvirt-daemon-driver-nodedev-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-storage-1.2.17-13.el7.ppc64le
libvirt-client-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-nwfilter-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-interface-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-qemu-1.2.17-13.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-secret-1.2.17-13.el7.ppc64le
libvirt-daemon-kvm-1.2.17-13.el7.ppc64le
libvirt-daemon-config-nwfilter-1.2.17-13.el7.ppc64le
libvirt-python-1.2.17-2.el7.ppc64le
libvirt-daemon-driver-network-1.2.17-13.el7.ppc64le
libvirt-lock-sanlock-1.2.17-13.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
qemu-kvm-tools-rhev-2.3.0-31.el7.ppc64le
qemu-kvm-common-rhev-2.3.0-31.el7.ppc64le
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-img-rhev-2.3.0-31.el7.ppc64le
vdsm-python-4.17.10.1-0.el7ev.noarch
vdsm-yajsonrpc-4.17.10.1-0.el7ev.noarch
vdsm-4.17.10.1-0.el7ev.noarch
vdsm-jsonrpc-4.17.10.1-0.el7ev.noarch
vdsm-cli-4.17.10.1-0.el7ev.noarch
vdsm-xmlrpc-4.17.10.1-0.el7ev.noarch
vdsm-infra-4.17.10.1-0.el7ev.noarch
vdsm-hook-ethtool-options-4.17.10.1-0.el7ev.noarch

How reproducible:
3/4

Steps to Reproduce:
1. Have a VM with rhel for ppc
2. Create a thin provisioned disks of 1 GB on ISCSI
3. Start the vm
4. After start increase the size of the disk by 1 Gb
5. after the operation is done, OS detects the increase in size, create a Linux partition of 2 Gb (I've used cfdisk), format it (mkfs.ext4) and mount it.
6. Start writing to the vm from the os disk for example (dd if=/dev/vdb of=/mnt/file bs=1M count=1600). This doesn't seem to happen using the oflag=direct parameter, but I've only tested it once with that parameter.

Actual results:
The two scenarios described on the Description

Expected results:
You can write ~2Gb data to the disk, there's enough space and the qemu process does not die.

Additional info:
lsblk on the vm and the vdsClient all show the correct size after the resize:
	apparentsize = '2147483648'
	truesize = '2147483648'

When the vm pauses and gets back up again the messages of the guest os show:
Nov  7 08:33:32 golden-env-vm-01 kernel: INFO: task umount:2724 blocked for more than 120 seconds.
Nov  7 08:33:32 golden-env-vm-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  7 08:33:32 golden-env-vm-01 kernel: umount          D 00003fff781f7c1c     0  2724   2611 0x00040082
Nov  7 08:33:32 golden-env-vm-01 kernel: Call Trace:
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4b960] [0000000000000062] 0x62 (unreliable)
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bb30] [c0000000000164f8] __switch_to+0x268/0x450
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bb90] [c0000000009425b4] __schedule+0x324/0xa40
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bc60] [c000000000945124] rwsem_down_write_failed+0x164/0x280
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bcf0] [c0000000009416fc] down_write+0x6c/0x74
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bd20] [c000000000300d48] deactivate_super+0x78/0xa0
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bd50] [c000000000333bb8] mntput_no_expire+0x158/0x1c0
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4bd90] [c0000000003360bc] SyS_umount+0x10c/0x530
Nov  7 08:33:32 golden-env-vm-01 kernel: [c000000008e4be30] [c00000000000a17c] system_call+0x38/0xb4

Comment 1 Carlos Mestre González 2015-11-07 13:49:04 UTC
Created attachment 1090956 [details]
engine.log

engine.log when the vm dies (vm name is vm01)

Comment 2 Carlos Mestre González 2015-11-07 13:50:53 UTC
Created attachment 1090957 [details]
vdsm.log

vdsm.log when the vm died. the time zone in the host is different so there's a 7 hour different from the engine.

Comment 3 Carlos Mestre González 2015-11-07 13:52:12 UTC
Created attachment 1090958 [details]
message.log when vm paused

Guest OS messages of the vm that paused because of not enough storage and when up again.

Comment 4 Allon Mureinik 2015-11-08 07:38:42 UTC
Daniel, as the current qa-contact, please take a look.
Thanks!

Comment 5 Yaniv Kaul 2015-11-08 07:56:08 UTC
If the qemu process dies, why is this an oVirt bug?
(The rest may be, but a guest exiting unexpectedly rarely is!)

Comment 6 Daniel Erez 2015-11-09 11:58:54 UTC
Hi Carlos,

Can you please attach the relevant qemu and libvirt logs?

@Kevin - anything else is required to determine the right component?

Thanks!
Daniel

Comment 7 Allon Mureinik 2015-11-09 15:30:24 UTC
(In reply to Carlos Mestre González from comment #0)

> Steps to Reproduce:
> 1. Have a VM with rhel for ppc
> 2. Create a thin provisioned disks of 1 GB on ISCSI
How is it connected to the VM?
(i.e., what interface does it use? SPAPR?)

Comment 8 Thomas Huth 2015-11-09 20:54:24 UTC
I think some interesting lines can be found in the OS messages from comment 3:

Nov  7 08:23:45 golden-env-vm-01 kernel: virtio_blk virtio0: new size: 4194304 512-byte logical blocks (2.14 GB/2.00 GiB)
Nov  7 08:23:45 golden-env-vm-01 kernel: vda: detected capacity change from 1073741824 to 2147483648
Nov  7 08:24:22 golden-env-vm-01 kernel: vda: vda1
Nov  7 08:24:23 golden-env-vm-01 kernel: vda: vda1
Nov  7 08:24:30 golden-env-vm-01 kernel: EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
Nov  7 08:24:52 golden-env-vm-01 kernel: virtio_blk virtio0: req.0:id 109 is not a head!
Nov  7 08:24:52 golden-env-vm-01 kernel: blk_update_request: I/O error, dev vda, sector 2506815
Nov  7 08:24:52 golden-env-vm-01 kernel: EXT4-fs warning (device vda1): ext4_end_bio:332: I/O error -5 writing to inode 12 (offset 1073741824 size 5308416 starting block 313479)
Nov  7 08:24:52 golden-env-vm-01 kernel: Buffer I/O error on device vda1, logical block 313344

Looks like the guest disk is attached with virtio-block and something got out of sync there ("id 109 is not a head!")

Comment 9 Thomas Huth 2015-11-09 21:02:04 UTC
Carlos, could you somehow get a backtrace or a core file of QEMU when it crashes?

Comment 10 Thomas Huth 2015-11-10 07:23:44 UTC
(In reply to Carlos Mestre González from comment #0)
>
> 2. Create a thin provisioned disks of 1 GB on ISCSI

That means you put the guest disk images on a filesystem that has been mounted via iSCSI, right? Or do you directly mount the iSCSI disk into the guest?

Also, can you collect the output of QEMU itself somehow, e.g. in /var/log/libvirt/qemu/guestname and see whether there is anything of interest in that text?

Comment 11 Allon Mureinik 2015-11-10 07:52:23 UTC
(In reply to Thomas Huth from comment #10)
> (In reply to Carlos Mestre González from comment #0)
> >
> > 2. Create a thin provisioned disks of 1 GB on ISCSI
> 
> That means you put the guest disk images on a filesystem that has been
> mounted via iSCSI, right? Or do you directly mount the iSCSI disk into the
> guest?
Neither.
oVirt orchestrates connecting to an iSCSI target(s) and creates a VG on top of the exposed luns. We then provision an LV from that VG, place (in this case), a QCow header on it, and connect it as a disk to the guest.

Comment 12 Raz Tamir 2015-11-10 10:15:03 UTC
Created attachment 1092148 [details]
vdsm logs

Hi guys,
This issue also happens in regular setup, not only in ppc

Comment 13 Carlos Mestre González 2015-11-10 16:40:44 UTC
(In reply to Allon Mureinik from comment #7)
> (In reply to Carlos Mestre González from comment #0)
> 
> > Steps to Reproduce:
> > 1. Have a VM with rhel for ppc
> > 2. Create a thin provisioned disks of 1 GB on ISCSI
> How is it connected to the VM?
> (i.e., what interface does it use? SPAPR?)

Allon, disk is created and then attached to the vm when it's down via REST API. Interface is VIRTIO.

Comment 14 Carlos Mestre González 2015-11-10 16:41:36 UTC
(In reply to Thomas Huth from comment #9)
> Carlos, could you somehow get a backtrace or a core file of QEMU when it
> crashes?

I'll reproduce and provide those.

Comment 15 Carlos Mestre González 2015-11-10 16:43:39 UTC
Removed the need infos, adding them again

Comment 16 Paolo Bonzini 2015-11-10 18:43:04 UTC
How was the disk resized?

It could be that you are resizing the qcow2 metadata, but the parent LV is not resized.  In this case, stopping the VM when qcow2 receives an ENOSPC is the intended behavior.  oVirt will catch it, enlarge the LV and restart the VM (it will also poll and try to enlarge in advance, but this may not succeed if you do many writes as in this case).

Of course, it is however not intended to cause something like "id 109 is not a head" or even less to crash QEMU completely.  Still, it would be nice if someone could read the oVirt logs and figure out what's happening, in order to have a reproducer running on bare RHEL.

Comment 17 Carlos Mestre González 2015-11-11 14:11:47 UTC
Created attachment 1092764 [details]
Logs new run

These are the logs of a new run (engine, vdsm, libvirtd, qemu, vm's messages after bringing it up again). All logs were trimmed to start at the beginning of the execution:

1) create test_vm from a template clone
2) add a 1 Gb Virtio/Thin Provisioning/ISCSI disk
3) start the vm
4) resize the disk via UI to 2GB
5) create a linux partition (cfdisk), format it (mkfs.ext4), mount it and dd to it:
dd if=/dev/vda of=/mnt/file bs=1M count=1448

at 15:51:21 (from engine) the vm shutdowns VM test_vm is down with error. Exit message: Lost connection with qemu process.

vdsm logs are -7 hours (starts at 08:40).

Later I'm adding the qemu core dump.

Comment 18 Carlos Mestre González 2015-11-11 14:22:13 UTC
(In reply to ratamir from comment #12)
> Created attachment 1092148 [details]
> vdsm logs
> 
> Hi guys,
> This issue also happens in regular setup, not only in ppc

Here what is happening x86_64 for 3.5 (rhevm-3.5.6.2-0.1.el6ev.noarch) the same steps leads to a "paused due to no Storage space error" (regression), but I'm not sure if it's relevant.

Comment 20 Thomas Huth 2015-11-11 14:47:10 UTC
QEMU log shows this error message:

 ERROR:qom/object.c:716:object_unref: assertion failed: (obj->ref > 0)

Thus the QEMU process exited because it hit an assert() statement - something called object_unref() with an object that was not referenced anymore.

Comment 21 Daniel Erez 2015-11-11 17:04:15 UTC
@Francesco - according to the logs [*], it seems that the VM stops with onIOError event. Perhaps the thread of DriveWatermarkMonitor (periodic.py) is stuck and therefore we don't call the extendDrivesIfNeeded? Should we add some logs in the flow to understand if there's an issue? Could it be related to bug 1261980 (issue with dd during lsm).

[*]

1. Vm stopped here because of ioerror:

libvirtEventLoop::INFO::2015-11-07 07:37:46,581::vm::3634::virt.vm::(onIOError) vmId=`bce3836e-02b6-421b-a16a-130b780e0e63`::abnormal vm stop device virtio-disk1 error enospc
libvirtEventLoop::INFO::2015-11-07 07:37:46,582::vm::4959::virt.vm::(_logGuestCpuStatus) vmId=`bce3836e-02b6-421b-a16a-130b780e0e63`::CPU stopped: onIOError


2. And we extend the disk:

libvirtEventLoop::INFO::2015-11-07 07:37:46,700::vm::1013::virt.vm::(extendDrivesIfNeeded) vmId=`bce3836e-02b6-421b-a16a-130b780e0e63`::Requesting extension for volume c7139853-ce78-4738-869c-c6916c8ea755 on domain c85b2179-3c00-4169-9679-b9e5ecde78d8 (apparent: 1073741824, capacity: 2147483648, allocated: 1110704128, physical: 1073741824)
libvirtEventLoop::DEBUG::2015-11-07 07:37:46,700::vm::1078::virt.vm::(__extendDriveVolume) vmId=`bce3836e-02b6-421b-a16a-130b780e0e63`::Requesting an extension for the volume: {'newSize': 2147483648, 'domainID': u'c85b2179-3c00-4169-9679-b9e5ecde78d8', 'name': u'vdb', 'volumeID': u'c7139853-ce78-4738-869c-c6916c8ea755', 'imageID': u'135e5f21-7abf-4045-8765-af1440a3fca0', 'internal': False, 'poolID': u'9788742b-2e4b-4b6e-a3ba-4f96900a3877'}

Comment 22 Thomas Huth 2015-11-11 19:43:32 UTC
FWIW, here's the backtrace of the QEMU core dump:

Core was generated by `/usr/libexec/qemu-kvm -name test_vm -S -machine pseries-rhel7.2.0,accel=kvm,usb'.
Program terminated with signal 6, Aborted.
#0  0x00003fff9417e578 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libdb-5.3.21-19.el7.ppc64le
(gdb) bt
#0  0x00003fff9417e578 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00003fff941806fc in __GI_abort () at abort.c:90
#2  0x00003fff94fa88c4 in g_assertion_message (domain=<optimized out>, file=0x5dfeaf20 "qom/object.c", 
    line=<optimized out>, func=<optimized out>, message=0x1001a832b50 "assertion failed: (obj->ref > 0)")
    at gtestutils.c:2292
#3  0x00003fff94fa89cc in g_assertion_message_expr (domain=0x0, file=0x5dfeaf20 "qom/object.c", 
    line=<optimized out>, func=0x5dfeae90 <__FUNCTION__.29928> "object_unref", expr=<optimized out>)
    at gtestutils.c:2307
#4  0x000000005de9ec44 in object_unref (obj=0x1001a8305d0) at qom/object.c:716
#5  0x000000005dc3064c in memory_region_unref (mr=<optimized out>) at /usr/src/debug/qemu-2.3.0/memory.c:1369
#6  0x000000005dc3315c in memory_region_del_subregion (mr=0x1001a8d4000, subregion=0x1001a8d4500)
    at /usr/src/debug/qemu-2.3.0/memory.c:1802
#7  0x000000005dc33288 in memory_region_finalize (obj=<optimized out>)
    at /usr/src/debug/qemu-2.3.0/memory.c:1327
#8  0x000000005de9ed3c in object_deinit (type=0x1001a990300, obj=<optimized out>) at qom/object.c:399
#9  object_finalize (data=0x1001a8d4000) at qom/object.c:413
#10 object_unref (obj=0x1001a8d4000) at qom/object.c:720
#11 0x000000005de9ecac in object_property_del_all (obj=0x1001a8305d0) at qom/object.c:367
#12 object_finalize (data=0x1001a8305d0) at qom/object.c:412
#13 object_unref (obj=0x1001a8305d0) at qom/object.c:720
#14 0x000000005dc3064c in memory_region_unref (mr=<optimized out>) at /usr/src/debug/qemu-2.3.0/memory.c:1369
#15 0x000000005dbd66a0 in address_space_unmap (as=<optimized out>, buffer=<optimized out>, 
    len=<optimized out>, is_write=<optimized out>, access_len=262144) at /usr/src/debug/qemu-2.3.0/exec.c:2652
#16 0x000000005dc6acf0 in virtqueue_fill (vq=0x1001ca00000, elem=0x1001e234010, len=<optimized out>, idx=0)
    at /usr/src/debug/qemu-2.3.0/hw/virtio/virtio.c:266
#17 0x000000005dc6afe4 in virtqueue_push (vq=0x1001ca00000, elem=<optimized out>, len=<optimized out>)
    at /usr/src/debug/qemu-2.3.0/hw/virtio/virtio.c:294
#18 0x000000005dc4408c in virtio_blk_complete_request (req=0x1001e234000, status=<optimized out>)
    at /usr/src/debug/qemu-2.3.0/hw/block/virtio-blk.c:58
#19 0x000000005dc44774 in virtio_blk_req_complete (status=0 '\000', req=0x1001e234000)
    at /usr/src/debug/qemu-2.3.0/hw/block/virtio-blk.c:64
#20 virtio_blk_rw_complete (opaque=<optimized out>, ret=0)
    at /usr/src/debug/qemu-2.3.0/hw/block/virtio-blk.c:119
#21 0x000000005dee12a4 in bdrv_co_em_bh (opaque=0x1001a9d90f0) at block.c:4981
#22 0x000000005ded6e64 in aio_bh_poll (ctx=0x1001aa70840) at async.c:85
#23 0x000000005deecf3c in aio_dispatch_clients (ctx=<optimized out>, client_mask=<optimized out>)
    at aio-posix.c:139
#24 0x000000005ded6b90 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, 
    user_data=<optimized out>) at async.c:219
#25 0x00003fff94f726d0 in g_main_dispatch (context=0x1001a930240) at gmain.c:3109
#26 g_main_context_dispatch (context=0x1001a930240) at gmain.c:3708
#27 0x000000005deeb02c in glib_pollfds_poll () at main-loop.c:209
#28 os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:254
#29 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:503
#30 0x000000005dbcc630 in main_loop () at vl.c:1818
#31 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4394

Comment 23 Carlos Mestre González 2015-11-12 16:34:31 UTC
Created a new BZ for qemu-kvm-rhev here:
https://bugzilla.redhat.com/show_bug.cgi?id=1281520 (Adding dependency)

This bz will be to any related problem with rhev(vdsm) on this.

Comment 24 Fam Zheng 2015-11-16 02:43:22 UTC
Could be related to bug 1277922.

Comment 25 Francesco Romani 2015-11-16 13:55:18 UTC
(In reply to Daniel Erez from comment #21)
> @Francesco - according to the logs [*], it seems that the VM stops with
> onIOError event. Perhaps the thread of DriveWatermarkMonitor (periodic.py)
> is stuck and therefore we don't call the extendDrivesIfNeeded? Should we add
> some logs in the flow to understand if there's an issue? Could it be related
> to bug 1261980 (issue with dd during lsm).

Could also be related to https://bugzilla.redhat.com/show_bug.cgi?id=1250839 . We had a bad bug which made things (much) worse in presence of unresponsive storage. But to trigger bz1250839, storage had to stop to be responsive in the first place, so I think we miss a common cause. Will review the VDSM logs to learn more.

Comment 26 Francesco Romani 2015-11-16 14:55:02 UTC
This log from VDSM logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=1279052#c17

periodic/19::INFO::2015-11-11 08:50:26,580::vm::1013::virt.vm::(extendDrivesIfNeeded) vmId=`86dad3ce-8678-403a-8e39-7c4146eb2362`::Requesting extension for volume c75254f1-3959-4947-94e3-bc194ecc0e6d on domain c85b2179-3c00-4169-9679-b9e5ecde78d8 (apparent: 1073741824, capacity: 2147483648, allocated: 588741632, physical: 1073741824)

shows that VDSM was monitoring the disks and requested extension preemptively (not after ENOSPC). I suggest to move the investigation in the lower levels of the stack

Comment 27 Daniel Erez 2015-11-18 14:27:57 UTC
(In reply to Francesco Romani from comment #26)
> This log from VDSM logs attached in
> https://bugzilla.redhat.com/show_bug.cgi?id=1279052#c17
> 
> periodic/19::INFO::2015-11-11
> 08:50:26,580::vm::1013::virt.vm::(extendDrivesIfNeeded)
> vmId=`86dad3ce-8678-403a-8e39-7c4146eb2362`::Requesting extension for volume
> c75254f1-3959-4947-94e3-bc194ecc0e6d on domain
> c85b2179-3c00-4169-9679-b9e5ecde78d8 (apparent: 1073741824, capacity:
> 2147483648, allocated: 588741632, physical: 1073741824)
> 
> shows that VDSM was monitoring the disks and requested extension
> preemptively (not after ENOSPC). I suggest to move the investigation in the
> lower levels of the stack

OK, Thanks Francesco!

@Allon - can we close this as a duplicate of bug 1281520 (as it requires lower level investigation).

Comment 28 Allon Mureinik 2015-11-18 14:40:57 UTC
(In reply to Daniel Erez from comment #27)
> (In reply to Francesco Romani from comment #26)
> > This log from VDSM logs attached in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1279052#c17
> > 
> > periodic/19::INFO::2015-11-11
> > 08:50:26,580::vm::1013::virt.vm::(extendDrivesIfNeeded)
> > vmId=`86dad3ce-8678-403a-8e39-7c4146eb2362`::Requesting extension for volume
> > c75254f1-3959-4947-94e3-bc194ecc0e6d on domain
> > c85b2179-3c00-4169-9679-b9e5ecde78d8 (apparent: 1073741824, capacity:
> > 2147483648, allocated: 588741632, physical: 1073741824)
> > 
> > shows that VDSM was monitoring the disks and requested extension
> > preemptively (not after ENOSPC). I suggest to move the investigation in the
> > lower levels of the stack
> 
> OK, Thanks Francesco!
> 
> @Allon - can we close this as a duplicate of bug 1281520 (as it requires
> lower level investigation).
We depend on it, let's wait to see what goes on there.

Comment 32 Carlos Mestre González 2015-11-20 15:46:09 UTC
Created attachment 1097253 [details]
logs run with new qemu-kvm-rhev packages

Hi,

I recreated the scenario with the new packages provided in bug 1281520 by the qemu-kvm-rhev team. The qemu doesn't die anymore, but the vm pauses because of an storage space error.

Here are the logs from the run, timestamps are a bit off between engine/vdsm, but the scenario is the same as before.

[ libvirtd => same, vdsm.log => 09:59:11...]
- 14:59:11 => VM virtual_disk_resize_iscsi disk_5063 disk was updated 

- 15:00:06 => [ start writing to the disk with the dd command as in the scenario ]

- 15:00:21 => VM virtual_disk_resize_iscsi has been paused due to no Storage space error.
- 15:00:35 => VM virtual_disk_resize_iscsi has recovered from paused back to up.

The vm as paused, though the dd command succeeds:

17:00:29,553 => 
1448+0 records in
17:00:26 1448+0 records out
17:00:26 1518338048 bytes (1.5 GB) copied, 20.0244 s, 75.8 MB/s

and as state before the vm is back up in a few seconds. 

As I see it there's no reason for the vm to stop during this scenario, so probably this is another issue with storage in this case? or maybe a different issue with qemu?

Comment 33 Fam Zheng 2015-11-23 05:50:34 UTC
I think the VM is paused because it was started with "werror=enospc" (the default), and the host storage ran out at some point while doing dd.

The documentation of the werror option from qemu-doc:

...
           werror=action,rerror=action
               Specify which action to take on write and read errors. Valid actions are: "ignore" (ignore the error and try to continue), "stop" (pause
               QEMU), "report" (report the error to the guest), "enospc" (pause QEMU only if the host disk is full; report the error to the guest
               otherwise).  The default setting is werror=enospc and rerror=report.

...

Comment 37 Carlos Mestre González 2015-12-03 15:05:53 UTC
THe issue is fixed with the qemu-kvm-rhev package provided, the vm is up and there's no space error. I only was able to test this on NFS since right now we cannot use our ISCSI devices. I'll mark this as VERIFIED when I would be able to test it in ISCSI too.

Comment 41 Allon Mureinik 2015-12-21 07:52:24 UTC
Backport is merged, moving to MODIFIED.

Comment 42 Carlos Mestre González 2016-01-07 10:56:35 UTC
Verified too on ISCSI domains with latest qemu-kvm-rhev package:

qemu-kvm-rhev-2.3.0-31.el7_2.5.ppc64le

Comment 43 Allon Mureinik 2016-03-10 10:34:12 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 44 Allon Mureinik 2016-03-10 10:36:24 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 45 Allon Mureinik 2016-03-10 10:41:59 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 46 Allon Mureinik 2016-03-10 11:58:48 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE


Note You need to log in before you can comment on or make changes to this bug.