1548448 – scsi-disk: crash when resuming from rerror (broken GlusterFS)

Bug 1548448 - scsi-disk: crash when resuming from rerror (broken GlusterFS)

Summary: scsi-disk: crash when resuming from rerror (broken GlusterFS)

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	---
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	8.0
Assignee:	Virtualization Maintenance
QA Contact:	zixchen
Docs Contact:
URL:
Whiteboard:
Depends On:	1333358
Blocks:
TreeView+	depends on / blocked

Reported:	2018-02-23 14:59 UTC by Milan Zamazal
Modified:	2021-03-03 09:46 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-15 07:35:26 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments	(Terms of Use)
libvirtd.log (400.38 KB, application/x-xz) 2018-02-23 14:59 UTC, Milan Zamazal	no flags	Details
View All

Description Milan Zamazal 2018-02-23 14:59:47 UTC

Created attachment 1399889 [details]
libvirtd.log

Description of problem:

When I run, from oVirt, a VM with a GlusterFS drive with error policy `stop' attached and then make the GlusterFS volume broken, the VM gets paused. When I attempt to resume it, while GlusterFS is still broken, QEMU crashes.

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-3.9.0-13.el7.x86_64
glusterfs-3.12.6-1.el7.x86_64
kernel-3.10.0-851.el7.x86_64

How reproducible:

100%

Steps to Reproduce:

- Run a VM with a GlusterFS drive attached. I do that from oVirt, i.e. using libvirt. My QEMU command line is

  /usr/libexec/qemu-kvm -name guest=centos,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-centos/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m size=262144k,slots=16,maxmem=1048576k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=256 -uuid 3f66e6c9-37bd-460e-96cd-bd5d03b257a6 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=7.5-6.el7,serial=19962E73-E43B-4235-9964-ED9BF0B85EDC,uuid=3f66e6c9-37bd-460e-96cd-bd5d03b257a6 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-centos/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2018-02-22T16:14:27,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=ua-2dca12a4-7fb1-44aa-a138-8569c2344aa4,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=ua-760990bc-a7da-4fc0-9fd2-43f4f70be978,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=ua-3f1a3c9e-d492-4796-ac9b-ee5137e88542,max_ports=16,bus=pci.0,addr=0x4 -drive if=none,id=drive-ide0-1-0,readonly=on,werror=report,rerror=report -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/mnt/centos-engine:_exports_data/38a10d98-d551-492f-a0eb-690885f771a7/images/fd571551-e69a-4519-b254-59e532f807b3/15abbcd0-091b-4da6-9c92-e84de3354c59,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=fd571551-e69a-4519-b254-59e532f807b3,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=ua-760990bc-a7da-4fc0-9fd2-43f4f70be978.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive file=gluster://centos-engine:24007/gv0/8210aae7-7619-4e0b-8488-79635c1c67df/images/7c4e9f02-6c45-44fe-b4f2-95dc713000a1/12bb0ba4-63cd-4881-bf2b-8c2d11baac6f,file.debug=4,format=raw,if=none,id=drive-scsi0-0-0-2,serial=7c4e9f02-6c45-44fe-b4f2-95dc713000a1,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=ua-760990bc-a7da-4fc0-9fd2-43f4f70be978.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2 -netdev tap,fd=32,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:01:04,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3f66e6c9-37bd-460e-96cd-bd5d03b257a6.ovirt-guest-agent.0,server,nowait -device virtserialport,bus=ua-3f1a3c9e-d492-4796-ac9b-ee5137e88542.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3f66e6c9-37bd-460e-96cd-bd5d03b257a6.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=ua-3f1a3c9e-d492-4796-ac9b-ee5137e88542.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-3f1a3c9e-d492-4796-ac9b-ee5137e88542.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=192.168.121.121,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,seamless-migration=on -device qxl-vga,id=ua-0426b335-30bf-45ad-b838-214bb299f99e,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=ua-a5d4eb8d-7e78-47ab-958d-0d2a9f18592e,bus=pci.0,addr=0x6 -object rng-random,id=objua-9e3926fd-b833-463c-8176-12e2bb2e8d79,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9e3926fd-b833-463c-8176-12e2bb2e8d79,id=ua-9e3926fd-b833-463c-8176-12e2bb2e8d79,bus=pci.0,addr=0x7 -msg timestamp=on

- Make the GlusterFS drive temporarily broken. I use a replicated GlusterFS volume replicated on three nodes and I make it partially broken by shutting down two of the three machines running the GlusterFS volume.

- Write something to the broken drive from inside the VM. I use the following command in the guest OS for that purpose:

  dd if=/dev/zero of=/dev/sdb count=100 bs=1k

- After about 40 seconds the command finishes and the VM gets paused.

- Run `virsh resume' on the VM. The QEMU process disappears.

Actual results:

QEMU process disappears after resume, apparently due to its crash.

Expected results:

The QEMU process survives and the VM gets paused again (as happens with NFS or iSCSI drives).

Additional info:

The following is reported in /var/log/libvirtd/qemu/centos.log when the GlusterFS drive is broken:

  [2018-02-23 14:30:54.591095] E [socket.c:2369:socket_connect_finish] 0-gv0-client-2: connection to 192.168.121.102:24007 failed (No route to host); disconnecting socket
  [2018-02-23 14:31:06.593500] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server 192.168.121.101:49152 has not responded in the last 42 seconds, disconnecting.
  [2018-02-23 14:31:06.595071] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f721f59eedb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f721f363e6e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f721f363f8e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f721f365710] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f721f366200] ))))) 0-gv0-client-1: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2018-02-23 14:30:23.825778 (xid=0x8e)
  [2018-02-23 14:31:06.595324] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f721f59eedb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f721f363e6e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f721f363f8e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f721f365710] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f721f366200] ))))) 0-gv0-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2018-02-23 14:30:23.825792 (xid=0x8f)
  [2018-02-23 14:31:06.729423] E [MSGID: 108006] [afr-common.c:5006:__afr_handle_child_down_event] 0-gv0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.

And the following is reported there on the resume attempt:

  qemu-kvm: hw/scsi/scsi-disk.c:319: scsi_read_complete: Assertion `r->req.aiocb != ((void *)0)' failed.
2018-02-23 14:37:08.110+0000: shutting down, reason=crashed

libvirtd.log is attached.

I started the VM from oVirt, but before reproducing the bug I stopped oVirt Engine and Vdsm to prevent them from messing with the matters.

Comment 3 Longxiang Lyu 2018-02-26 06:29:04 UTC

QE wants to know:
1. is the system disk also locates on same gluster volume?
2. can you reproduce with gluster packages available in brewweb?
you are using glusterfs-3.12.6-1.el7.x86_64 while the latest available in brewweb is: glusterfs-3.12.2-4.el7.

Comment 4 Milan Zamazal 2018-02-26 14:45:32 UTC

(In reply to Longxiang Lyu from comment #3)
> QE wants to know:
> 1. is the system disk also locates on same gluster volume?

No, the system disk is located on NFS.

> 2. can you reproduce with gluster packages available in brewweb?
> you are using glusterfs-3.12.6-1.el7.x86_64 while the latest available in
> brewweb is: glusterfs-3.12.2-4.el7.

Yes, I can reproduce it also with glusterfs-3.12.2-4.el7 on the host running the VM.

Comment 6 Longxiang Lyu 2018-02-27 09:29:30 UTC

I see the gluster data disk image you used is: 
gluster://centos-engine:24007/gv0/8210aae7-7619-4e0b-8488-79635c1c67df/images/7c4e9f02-6c45-44fe-b4f2-95dc713000a1/12bb0ba4-63cd-4881-bf2b-8c2d11baac6f

the system disk image is:
/rhev/data-center/mnt/centos-engine:_exports_data/38a10d98-d551-492f-a0eb-690885f771a7/images/fd571551-e69a-4519-b254-59e532f807b3/15abbcd0-091b-4da6-9c92-e84de3354c59

The machine: centos-engine hosts both as gluster and nfs server, right?
If so, in step two of turning down two servers upon three, this server is one left still running, right?

Comment 7 Milan Zamazal 2018-02-27 09:36:49 UTC

(In reply to Longxiang Lyu from comment #6)

> The machine: centos-engine hosts both as gluster and nfs server, right?
> If so, in step two of turning down two servers upon three, this server is
> one left still running, right?

Yes, it's exactly as you describe.

Comment 8 Longxiang Lyu 2018-02-27 11:02:11 UTC

Still fail to reproduce this bug. Also have a question here:
the guest connects the glusterfs server with node: centos-engine, since this node is still alive during partial-broken of glusterfs service, why there are error messages about failure to connect to the turned-off servers?

Could you please have a try with server's self-heal function off?
1.gluster volume set <volname> data-self-heal off
2.gluster volume set <volname> metadata-self-heal off
3.gluster volume set <volname> entry-self-heal off

thanks.

Comment 9 Milan Zamazal 2018-02-27 20:41:54 UTC

(In reply to Longxiang Lyu from comment #8)
> Still fail to reproduce this bug. 

Hm, thinking what could be special about my setup, the only idea is that my host running the VM is a VM itself, so it's nested virtualization if it can have any influence (sorry, I currently don't have physical machines at hand to test it elsewhere).

What happens in your setup? Does the VM get paused after disconnection of the two servers? And when you attempt to resume the paused VM, does it get paused again (rather than crash as in my case)?

> Also have a question here:
> the guest connects the glusterfs server with node: centos-engine, since this
> node is still alive during partial-broken of glusterfs service, why there
> are error messages about failure to connect to the turned-off servers?

Do you mean the messages about "no route to host" and "server has not responded"? Since the servers were running at the time the VM was started and only later turned off, aren't some attempts to connect to them, resulting in failures when they are no longer available, expected?

> Could you please have a try with server's self-heal function off?
> 1.gluster volume set <volname> data-self-heal off
> 2.gluster volume set <volname> metadata-self-heal off
> 3.gluster volume set <volname> entry-self-heal off

Tried, the same result.

Comment 13 Longxiang Lyu 2018-03-01 08:01:02 UTC

could reproduce in qemu-kvm-rhev-2.9.0-16.el7_4.14.

Comment 14 Jeff Cody 2018-04-24 18:14:51 UTC

I've been able to reproduce this as well.  Working on a fix currently.

Comment 18 Ademar Reis 2020-02-05 22:47:08 UTC

QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 23 Maxim Levitsky 2020-02-24 15:56:48 UTC

I investigated the issue, and it looks like the issue is present in upstream code as well but it doesn't trigger a crash.

It all stared with this commit
https://patchwork.kernel.org/patch/8757871/

I had read the whole discussion on this and it looks like (without me knowing the internals of the glusterfs) the gluster client library
throws away cached writes that haven't reached yet the storage when fsync fails. We consider these writes to be already committed thus,
once this happens, we can't allow any more writes to the volume as long as gluster isn't fixed.

It looks like it was fixed upstream, but we have an issue detecting if gluster have the option that
enables sane behavior supported (resync-failed-syncs-after-fsync).

Now what happens in this particular bug:

1. User stops the bricks.
2. User does some writes.
3. Write fails which triggers VM pause

4. VM pause triggers flush in the gluster driver

scsi_handle_rw_error->blk_error_action->qemu_system_vmstop_request(RUN_STATE_IO_ERROR);
vm_stop -> ... bdrv_flush_all -> .. qemu_gluster_co_flush_to_disk

5. Flush of course fails, potentially triggering the corruption, and thus code from the above commit
bails out and permanently disables the gluster block device (making it appear as if it had no medium)

6. Once user resumes the VM (user doesn't have to resume the bricks) the scsi driver restarts the pending requests,

Then we have this code:

static void scsi_read_data(SCSIRequest *req)
...
if (!blk_is_available(req->dev->conf.blk)) {
scsi_read_complete(r, -ENOMEDIUM);
return;
}
...

blk_is_available returns false for block devices with no medium inserted,
and then scsi_read_complete is called which fails on an assert.
This was fixed upstream with commit 1505421a48a75a36d2b421355d11d7c47d76c6d0
which replaced the scsi_read_complete with scsi_read_complete_noio fixing the crash.
This commit can be backported.

However this doesn't improve the situation much since now the device returns 'no medium present' errors all the time,
which make the werror=stop error policy make guest sleep again and again after each time to wake it up,
thus effectively not allowing the guest to wake up ever.
With some qmp involment it should be possible thought to unplug the problematic disk and continue the guest this way.

It is possible to make a partial workaround for this issue but I am not sure that it is worth it.
It is possible to detect that we are waiting on IO errors and fail that fsync without calling gluster library at all,
but I didn't find a clean way to do so and I am not sure that this is worth it.

For example the gluster code can check if the guest is suspended due to IO error and fail the fsync, but this
is not clean solution since IO error might be from other block device.
Other option is to make the blk_error_action inform all the block drivers instances (BlockDriverState's) attached to the BlockBackend that IO error happened.

Gluster instance can note this and then use this information to fail all the fsync calls till it is informed that guest was resumed.

However it is still possible to get the fsync in such way that the above won't catch it (or when error policy is not to stop on errors) and have the
block device be hosed due to the original problem in the gluster fsync implementation.

Best regards,
Maxim Levitsky

Comment 24 Maxim Levitsky 2020-03-30 09:48:31 UTC

For the reference, this is the upstream link to the of not beeing able
to know if gluster supports non broken fsync, which in turn forces us to use the woraround

https://github.com/gluster/glusterfs/issues/600

Comment 25 John Ferlan 2020-04-16 15:14:01 UTC

Similar to bug 1524254 at least w/r/t qemu_gluster_co_flush_to_disk calls and the inability to determine whether 'resync-failed-syncs-after-fsync' is supported. Adding this to the closed downstream tracker, lowering priority, and moving to the virt-maint backlog. Perhaps someday the issue can be revisited if/when a minimum version of gluster is required for qemu that is known to have necessary APIs.

Comment 29 RHEL Program Management 2021-02-15 07:35:26 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 30 zixchen 2021-02-22 05:26:48 UTC

As glusterfs is deprecated, QA agrees to close this bug.

Note You need to log in before you can comment on or make changes to this bug.