Bug 1269874

Summary: HMP/QMP blocked on failed NBD device (during migration and quit)
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Dr. David Alan Gilbert <dgilbert>
Component: qemu-kvm
qemu-kvm sub component: NBD
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: leidwang <leidwang>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: unspecified
Priority: medium
CC: chayang, coli, eblake, jinzhao, juzhang, knoel, pezhang, qzhang, rbalakri, stefanha, virt-maint, xfu
Version: ---
Keywords: Triaged
Target Milestone: rc
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-12-15 07:37:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dr. David Alan Gilbert 2015-10-08 11:57:06 UTC
Description of problem:
It's possible to hang the HMP in various cases if there's a non-responding block device; in this case an NBD server that's fallen off the net.


At a high level we have:
  a) Start a drive_mirror to a remote NBD server
  b) Fail the network connection to the NBD server
  c) Cancel the block-job (apparently works)
  d) Start a migration
  e) Hangs at the end of migration

or
  d2) Try and quit (i.e. q) - hangs

The problem is that the TCP connection to the NBD device blocks, and things hang in bdrv_flush_all - although I'm not sure why that causes an HMP hang for the migrate.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.3.0-29.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. You need two hosts, a source and a destination, with two network connections between them, so that you can take one down and still reach the machines. Let's assume that the 'source' can talk to the 'destination' using the name 'destinationname', and that on the destination this corresponds to ethernet device em2.
2. You need a VM image of your favourite OS
3. On both hosts run:
   /usr/libexec/qemu-kvm -nographic -machine pc,accel=kvm -m 2048 -drive file=/home/localvms/local-f20b.qcow2,if=none,id=foo -device virtio-blk-pci,drive=foo,id=food -S
4. On the destination run:
  nbd_server_start :8889
  nbd_server_add -w foo
5. On the source run:
  drive_add 0 id=remote,file=nbd:destinationname:8889:exportname=foo,if=none
  drive_mirror foo remote

6. On the destination take down the interface that has destinationname
  ifdown em2

 You will now see on the source that 'info block-jobs' shows the job not advancing
7. On the source
 migrate_set_speed 100G
 migrate -d "exec:cat > /dev/null"
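
Alternatively, to hit the 'quit' variant from the description instead of the migration hang, run this on the source (a sketch; it assumes the mirror job is identified by the drive id 'foo'):
  block_job_cancel foo
  q
The cancel appears to succeed, but 'q' then hangs rather than exiting.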

Actual results:
after a second or two the HMP blocks

Expected results:
we should never block the HMP

Additional info:
Cases where we could hit this include:
   a) Just a dead NBD server (probably also NFS etc)
   b) A migrate to a host that dies during the migrate, so you want to cancel that and then migrate somewhere else.

Comment 2 Dr. David Alan Gilbert 2015-10-09 17:10:58 UTC
The path that causes the HMP lock-up is that in migration_completion we:
  do a lock_iothread,
  call vm_stop_force_state(RUN_STATE_FINISH_MIGRATE),
  do the device state save,
  unlock the iothread.

The vm_stop_force_state does a bdrv_flush_all(), which I think is where it hangs.
One possibility is for us to do an earlier flush (I think there's a patch from Intel somewhere to do that for performance reasons); that would at least cause the migration thread to block before it locks the iothread, unless you got really unlucky and the destination failed after that point.
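
A quick way to confirm where things are stuck is to grab backtraces from the hung process; a sketch, assuming gdb is installed and qemu-kvm is the only matching process:

  # dump backtraces of all threads in the running qemu-kvm process
  gdb -p "$(pidof qemu-kvm)" -batch -ex 'thread apply all bt'

If the analysis above is right, the migration thread should appear blocked somewhere under bdrv_flush_all().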

However, even then we need a way of forcibly stopping the nbd client to be able to free the migration thread.

Comment 3 Stefan Hajnoczi 2015-10-14 14:50:14 UTC
This is a design limitation in QEMU today.  There are synchronous "wait for all I/Os" points in the code.

We need to move to a model with timeouts so these operations can fail.  That may mean that migration fails and requires the user to issue some kind of "force remove" command to take down the broken NBD device.

None of this exists yet and I've seen other BZs related to the same issue.  I suggest we keep this around and keep moving this BZ to the next release until we have time to tackle this issue.

Comment 4 Dr. David Alan Gilbert 2015-10-14 17:10:59 UTC
(In reply to Stefan Hajnoczi from comment #3)
> This is a design limitation in QEMU today.  There are synchronous "wait for
> all I/Os" points in the code.
> 
> We need to move to a model with timeouts so these operations can fail.  That
> may mean that migration fails and requires the user to issue some kind of
> "force remove" command to take down the broken NBD device.

Thinking about the 'force remove' was what initially got me thinking about this; it would solve the worst migration case (a migration to a machine that dies, after which you try a new migration), because libvirt could do that 'force remove' when it kills the block job doing the disk copy as part of the failed migration.

Dave

> None of this exists yet and I've seen other BZs related to the same issue. 
> I suggest we keep this around and keep moving this BZ to the next release
> until we have time to tackle this issue.

Comment 5 Ademar Reis 2015-12-28 14:46:27 UTC
I'm assuming this affects QMP as well, otherwise we would close this BZ as WONTFIX, because HMP is not supported.

I'm adding QMP to the summary.
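
Since QMP is the supported interface, the reproduction in the description can be driven over QMP as well; a rough sketch of the equivalent commands (argument shapes are illustrative and not verified against this exact qemu-kvm-rhev build):

On the destination:
  {"execute": "nbd-server-start", "arguments": {"addr": {"type": "inet", "data": {"host": "0.0.0.0", "port": "8889"}}}}
  {"execute": "nbd-server-add", "arguments": {"device": "foo", "writable": true}}

On the source (start the mirror first, then take the interface down as in step 6, then migrate):
  {"execute": "drive-mirror", "arguments": {"device": "foo", "target": "nbd:destinationname:8889:exportname=foo", "sync": "full", "mode": "existing"}}
  {"execute": "migrate_set_speed", "arguments": {"value": 107374182400}}
  {"execute": "migrate", "arguments": {"uri": "exec:cat > /dev/null"}}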

Comment 6 Ademar Reis 2015-12-28 14:47:27 UTC
See also: Bug 1285453

Comment 8 juzhang 2017-06-08 11:59:03 UTC
Hi Qunfang,

Feel free to update the QE contact.

Comment 9 Markus Armbruster 2018-07-23 11:50:07 UTC
Stefan, has the situation improved upstream since you posted comment #3?

Comment 10 Stefan Hajnoczi 2018-07-24 12:44:21 UTC
Vladimir Sementsov-Ogievskiy is working on NBD client reconnect for QEMU 3.1:
https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg02362.html

Perhaps Vladimir's patches will solve this BZ (i.e. migration would fail because the NBD drive is disconnected).

Deferred to the next RHEL release.

Comment 11 Markus Armbruster 2018-11-26 15:20:45 UTC
Looks like Vladimir's work missed upstream QEMU 3.1.  Stefan, any news on its status?

Comment 12 Ademar Reis 2018-12-10 21:12:57 UTC
I believe Eric is tracking this work in the context of incremental backup / NBD. Eric: please review and let us know what you think of it.

Comment 13 Eric Blake 2018-12-10 21:20:24 UTC
(In reply to Ademar Reis from comment #12)
> I believe Eric is tracking this work in the context of incremental backup /
> NBD. Eric: please review and let us know what you think of it.

Vladimir's reconnect work would at least let you get NBD up and running again if the network connection reappears, but doesn't address the root cause of a hang waiting for the reconnection in the first place. His patches did not make it in 3.1, but should make it into 4.0 if they pass review (it's actually been a few months since he wrote them, so they may require a respin to even properly apply at this point in time).

Comment 14 Eric Blake 2019-06-26 19:12:42 UTC
(In reply to Eric Blake from comment #13)
> (In reply to Ademar Reis from comment #12)
> > I believe Eric is tracking this work in the context of incremental backup /
> > NBD. Eric: please review and let us know what you think of it.
> 
> Vladimir's reconnect work would at least let you get NBD up and running
> again if the network connection reappears, but doesn't address the root
> cause of a hang waiting for the reconnection in the first place. His patches
> did not make it in 3.1, but should make it into 4.0 if they pass review
> (it's actually been a few months since he wrote them, so they may require a
> respin to even properly apply at this point in time).

NBD reconnect missed upstream 4.0; the latest revision (v7) still has enough review work required that it is also at risk of missing 4.1:

https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg03888.html
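
For reference, the series under review exposes the behaviour as a per-node NBD block option (reconnect-delay); a sketch of how that would look on the source once merged, with exact option names subject to the final review:

  -blockdev driver=nbd,node-name=remote,server.type=inet,server.host=destinationname,server.port=8889,export=foo,reconnect-delay=10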

Comment 15 Ademar Reis 2020-02-05 22:42:27 UTC
QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Comment 19 RHEL Program Management 2020-12-15 07:37:35 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release; therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.

Comment 20 leidwang@redhat.com 2021-01-06 01:56:24 UTC
Tested this BZ with RHEL 8.4 and did not hit this issue.

Since this bug has been fixed, setting the status to CURRENTRELEASE. Thanks!
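
For completeness: more recent upstream QEMU also grew a 'yank' QMP command that can forcibly shut down a stuck NBD connection, which is essentially the 'force remove' discussed in comments 3 and 4; a sketch, assuming the stuck NBD drive is exposed as a block node named 'remote':

  {"execute": "query-yank"}
  {"execute": "yank", "arguments": {"instances": [{"type": "block-node", "node-name": "remote"}]}}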