Bug 1945532
Summary: VM migration halts occasionally
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 2.5.0
Target Release: 2.6.8
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Keywords: TestOnly
Hardware: Unspecified
OS: Unspecified
Reporter: Jaspreet Kaur <jkaur>
Assignee: David Vossel <dvossel>
QA Contact: Israel Pinto <ipinto>
CC: ailan, aliang, alitke, aprajapa, berrange, chayang, cnv-qe-bugs, coli, dgilbert, dvossel, fdeutsch, guchen, ipinto, jdenemar, jsuchane, kbidarka, kseremet, mprivozn, mtessun, rdey, rjones, roxenham, sgott, timao, vromanso, xiaohli, yanghliu, zixchen
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2021-09-20 12:46:18 UTC
Bug Depends On: 1965099
Description
Jaspreet Kaur
2021-04-01 08:31:27 UTC
There is an idea about what might be causing this bug; we are currently investigating it further.

Created attachment 1771624 [details]
source virt-launcher with libvirt debug

Created attachment 1771625 [details]
source virt-launcher with libvirt errors

To enable libvirt debug logs, please refer to: https://libvirt.org/kbase/debuglogs.html

Ideally, we would have logs from both the source and the destination. Also, what are the versions of libvirt and qemu, please?

From the incredibly illegible logs that sadly only contain errors and warnings from libvirt, it seems qemuMigrationAnyCompleted reports a broken connection to the destination host just a few seconds after the migration was started (I might be wrong here, as checking the timestamps in the json logs causes my head to start burning) and then the migration gets stuck in a job somewhere forever. It's really hard to say what's happening there without debug logs from libvirt (see comment #19). And please, attach plain debug logs from libvirt without all the json mess around.

(In reply to Jiri Denemark from comment #20)
> From the incredibly illegible logs that sadly only contain errors and
> warnings from libvirt it seems qemuMigrationAnyCompleted reports a broken
> connection to the destination host just a few seconds after the migration
> was started (I might be wrong here as checking the timestamps in the json
> logs causes my head to start burning) and then the migration gets stuck in
> a job somewhere forever. It's really hard to say what's happening there
> without debug logs from libvirt (see comment #19). And please, attach plain
> debug logs from libvirt without all the json mess around.

I just want to make sure expectations are set here on the engineering side. Retrieving libvirt debug logs is non-trivial for CNV 2.5. We can do it, but it's going to require the customer to reproduce this issue again (which is random) and it will require changes to their environment that impact every VM on the system in order to catch this.

Furthermore, the result won't be "plain" libvirt debug logs as you all are typically expecting. When we get debug logs, they will be smashed together with the virt-launcher logs in json format. Since the logs are in json format, you should be able to filter by subcomponent: "libvirt" to isolate the libvirt parts from everything else.

I'll communicate how to retrieve the debug logs on the customer case. I just want to be clear in saying we don't currently have this information, and it will take some time to get it.
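(For reference: a minimal sketch of the kind of filtering described above, pulling the libvirt entries out of the mixed virt-launcher JSON logs. The field names "subcomponent", "timestamp", "pos" and "msg" are assumptions based on the log excerpts quoted later in this bug; this is an illustration, not the tooling used on the case.)

    #!/usr/bin/env python3
    # filter_libvirt.py: print only the libvirt entries from virt-launcher JSON logs.
    # Hypothetical helper; field names are assumptions taken from excerpts in this bug.
    import json
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line.startswith("{"):
            continue                      # skip anything that is not a JSON record
        try:
            entry = json.loads(line)
        except ValueError:
            continue                      # skip truncated or garbled records
        if entry.get("subcomponent") != "libvirt":
            continue
        # Rebuild something close to a plain libvirt log line.
        print(entry.get("timestamp", "-"), entry.get("pos", "-"), entry.get("msg", ""))

Usage would be something like: oc logs virt-launcher-asvc135-2zms2 | python3 filter_libvirt.py > libvirt.log (pod name taken from the logs discussed below).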
@mprivozn I've communicated how to retrieve the libvirt debug logs on the customer case now.

In the meantime, what else can we do to make progress here?

For example, is it possible for CNV to work around this issue somehow by hard resetting libvirt when the issue is encountered? Basically, I'm looking for anything we can do to address this that will allow us to restore the ability to migrate while not requiring us to destroy the qemu process and launch a new pod.

(In reply to David Vossel from comment #22)
> @mprivozn I've communicated how to retrieve the libvirt debug logs on the
> customer case now.
>
> In the meantime, what else can we do to make progress here?

Well, sharing the exact libvirt release would be a good start. And knowing the way migration was initiated (i.e., the API call and its arguments) would not hurt either. Perhaps it's already somewhere in the logs, but explicitly stating it in a comment would make it easily visible.

> For example, is it possible for CNV to work around this issue somehow by
> hard resetting libvirt when the issue is encountered? Basically, I'm looking
> for anything we can do to address this that will allow us to restore the
> ability to migrate while not requiring us to destroy the qemu process and
> launch a new pod.

Well, it depends on where the problem is. A general workaround of "restart libvirtd whenever libvirt reports a timeout acquiring a job" is a very bad idea. It can help if libvirtd itself has a bug and forgets to release a job, or is stuck somewhere else, e.g., communicating with the remote daemon during migration. So in this specific case, restarting libvirtd might help, but it's not guaranteed and there's no way CNV could automatically decide whether restarting the daemon is a good idea. It should only be done manually and after some investigation.

(In reply to David Vossel from comment #21)
> (In reply to Jiri Denemark from comment #20)
> > From the incredibly illegible logs that sadly only contain errors and
> > warnings from libvirt it seems qemuMigrationAnyCompleted reports a broken
> > connection to the destination host just a few seconds after the migration
> > was started [...]
>
> I just want to make sure expectations are set here on the engineering side.

David, would it help if libvirt printed debug logs onto stderr, syslog, or journald? These are the outputs that libvirt supports:

# The format for an output can be:
#    level:stderr
#      output goes to stderr
#    level:syslog:name
#      use syslog for the output and use the given name as the ident
#    level:file:file_path
#      output to a file, with the given filepath
#    level:journald
#      output to journald logging system

(In reply to David Vossel from comment #22)
> @mprivozn I've communicated how to retrieve the libvirt debug logs on the
> customer case now.

Awesome, thanks.

> In the meantime, what else can we do to make progress here?
>
> For example, is it possible for CNV to work around this issue somehow by
> hard resetting libvirt when the issue is encountered? [...]

So my take is that this is not easy to reproduce, correct?

(In reply to David Vossel from comment #21)
> Furthermore, the result won't be "plain" libvirt debug logs as you all are
> typically expecting. When we get debug logs, they will be smashed together
> with the virt-launcher logs in json format. Since the logs are in json
> format, you should be able to filter by subcomponent: "libvirt" to isolate
> the libvirt parts from everything else.

Oh, that's very sad. The problem is not about logs containing messages from other components. The biggest problem is the json format, which makes the logs unreadable: mainly timestamps at a random position at the end of each line and a lot of distracting stuff other than the message itself. Could you at least provide a tool to restore the original logs from the json format?

Thanks a lot David for preprocessing the logs.
Logs from the source "host" (virt-launcher-asvc135-2zms2_running):

Peer-to-peer migration is initiated by

2021-04-15T19:01:05.547000Z [thead:16482] [pos:remoteDispatchDomainMigratePerform3ParamsHelper:8680] server=0x562ba52c1160 client=0x562ba52ba790 msg=0x562ba52ec1b0 rerr=0x7f8eed6228e0 args=0x7f8edc01a9e0 ret=0x7f8edc01aa40
2021-04-15T19:01:05.548000Z [thead:16482] [pos:virDomainMigratePerform3Params:4958] dom=0x7f8edc005780, (VM: name=s-backend_asvc135, uuid=fac8031f-f539-50f4-b822-4b930d18e5a8), dconnuri=qemu+tcp://[::1]:22222/system, params=0x7f8edc014300, nparams=3, cookiein=(nil), cookieinlen=0, cookieout=0x7f8eed6227e8, cookieoutlen=0x7f8eed6227dc, flags=0x8b

and starts by creating a connection to the destination host from the source libvirtd:

2021-04-15T19:01:05.562000Z [thead:16482] [pos:virNetSocketNew:288] RPC_SOCKET_NEW: sock=0x7f8edc01b340 fd=19 errfd=-1 pid=0 localAddr=::1;36482, remoteAddr=::1;22222
2021-04-15T19:01:05.562000Z [thead:16482] [pos:virNetClientNew:323] RPC_CLIENT_NEW: client=0x7f8edc015120 sock=0x7f8edc01b340
2021-04-15T19:01:05.596000Z [thead:16482] [pos:qemuMigrationSrcPerformPeer2Peer3:4138] driver=0x7f8ea0008e00, sconn=0x7f8ed80019e0, dconn=0x7f8edc006820, dconnuri=qemu+tcp://[::1]:22222/system

Storage migration is set up and the blockmirror job is started:

2021-04-15T19:01:06.891000Z [thread:16482] [pos:qemuMonitorSend:996] {"execute":"blockdev-mirror","arguments":{"job-id":"drive-ua-cloudinitdisk","device":"libvirt-1-format","target":"migration-vdb-format","speed":9223372036853727232,"sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-7995"}

A moment later, storage migration is ready and memory migration can begin:

2021-04-15T19:01:06.965000Z [thead:16482] [pos:qemuMigrationSrcNBDStorageCopyReady:546] All disk mirrors are ready
2021-04-15T19:01:06.979000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-8000"}
2021-04-15T19:01:07.326000Z [thead:16478] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"timestamp": {"seconds": 1618513267, "microseconds": 325426}, "event": "MIGRATION", "data": {"status": "active"}}]

From this point the client is monitoring migration progress by calling virDomainGetJobInfo (virDomainGetJobStats is a newer, better variant able to provide a lot more data and it should be used instead of virDomainGetJobInfo unless compatibility with very old libvirt is required), which results in "query-migrate" QMP commands being issued.

Several keepalive messages and even timeouts may be seen in the logs as well, but they are expected. A keepalive timeout is triggered every 5 seconds on an idle connection and there's nothing to worry about as long as each keepalive request is answered with a keepalive response in a timely manner. And all answers are sent in time in our case (we only see keepalive timeouts after 5 s of inactivity and never more than that, which would mean responses did not come in time, i.e., always with countToDeath=5 idle=5).
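(Editorial aside: a minimal libvirt-python sketch of the job monitoring described above, polling virDomainGetJobStats rather than the older virDomainGetJobInfo. The connection URI and use of libvirt-python are assumptions for illustration; the VM name is the one from these logs.)

    import time
    import libvirt

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    while True:
        stats = dom.jobStats()                       # wraps virDomainGetJobStats
        if stats.get("type", libvirt.VIR_DOMAIN_JOB_NONE) == libvirt.VIR_DOMAIN_JOB_NONE:
            break                                    # no job running (finished or failed)
        print("elapsed=%s ms, data remaining=%s bytes" % (stats.get("time_elapsed"),
                                                          stats.get("data_remaining")))
        time.sleep(1)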
So nothing interesting, until the connection to the destination libvirtd suddenly breaks and reports an I/O error event:

2021-04-15T19:01:18.168000Z [pos:virNetClientIncomingEvent:1947] client=0x7f8edc015120 wantclose=0
2021-04-15T19:01:18.168000Z [pos:virNetSocketReadWire:1832] End of file while reading data: Input/output error
2021-04-15T19:01:18.168000Z [pos:virNetClientMarkClose:726] client=0x7f8edc015120, reason=0
2021-04-15T19:01:18.168000Z [pos:virNetClientCloseLocked:746] client=0x7f8edc015120, sock=0x7f8edc01b340, reason=0
2021-04-15T19:01:18.169000Z [pos:qemuMigrationAnyCompleted:1628] operation failed: Lost connection to destination host

Since there's no way migration could succeed without the connection, we just abort it and cancel the block job:

2021-04-15T19:01:18.170000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"migrate_cancel","id":"libvirt-8034"}
2021-04-15T19:01:18.587000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"job-dismiss","arguments":{"id":"drive-ua-cloudinitdisk"},"id":"libvirt-8037"}
2021-04-15T19:01:18.593000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"blockdev-del","arguments":{"node-name":"migration-vdb-format"},"id":"libvirt-8038"}
2021-04-15T19:01:18.595000Z [thead:16478] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"return": {}, "id": "libvirt-8038"}]
2021-04-15T19:01:18.596000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"blockdev-del","arguments":{"node-name":"migration-vdb-storage"},"id":"libvirt-8039"}

However, this last "blockdev-del" never finishes and we never get any reply from QEMU (no line with mon=0x7f8ea4002e30 from now on). Thus we're stuck in a migration job, and trying to run other APIs on the same domain always times out and reports

Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

Let's look at the logs from the destination "host" (virt-launcher-asvc135-fjtqf_completed):

2021-04-15T19:01:05.671000Z [thread:55] [pos:virDomainMigratePrepare3Params:4861] dconn=0x7f0d30000fd0, params=0x7f0d3c003c60, nparams=3, cookiein=0x7f0d3c0035d0, cookieinlen=1235, cookieout=0x7f0d4c4b0808, cookieoutlen=0x7f0d4c4b07fc, uri_out=0x7f0d3c005b20, flags=0x8
2021-04-15T19:01:06.741000Z [thread:55] [pos:qemuMonitorNBDServerStart:3905] server={tcp host=:: port=49153} tls_alias=<null>
2021-04-15T19:01:06.743000Z [thread:55] [pos:qemuMigrationDstRun:1954] Setting up incoming migration with URI tcp:[::]:49152
2021-04-15T19:01:06.743000Z [thread:55] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f0d3c022d60 msg={"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:49152"},"id":"libvirt-368"}
2021-04-15T19:01:06.746000Z [thread:55] [pos:virThreadJobClear:119] Thread 55 (virNetServerHandleJob) finished job remoteDispatchDomainMigratePrepare3Params with ret=0

At this point QEMU is running on the destination host and it is listening for incoming migration from the source.
Once QEMU on the source host is asked to start migrating, we can see it on the destination too:

2021-04-15T19:01:06.974000Z [thread:53] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"timestamp": {"seconds": 1618513266, "microseconds": 974594}, "event": "MIGRATION", "data": {"status": "active"}}]

There's nothing interesting going on until we get an I/O error on the connection from the source libvirtd which initiated the migration:

2021-04-15T19:01:40.148000Z [thread:53] virNetSocketReadWire:1832 End of file while reading data: Input/output error
2021-04-15T19:01:40.148000Z [thread:53] virConnectClose:1195 conn=0x7f0d30000fd0

And libvirt kills the domain on the destination, as there's no way migration could ever succeed when we lost the controlling connection:

2021-04-15T19:01:40.148000Z [thread:53] qemuProcessAutoDestroy:7607 vm=s-backend_asvc135, conn=0x7f0d30000fd0

In other words, everything looks mostly normal until the connection breaks, but that's an external event and not initiated by libvirt on either side of the migration. The only interesting part is the hanging "blockdev-del" command, but it's not the reason for the failure; it's actually just a result. That command is supposed to close the connection to the NBD server, but that connection is likely affected by the malfunction which caused the I/O error seen by libvirt.

To me it looks like the migration proxy used by CNV to tunnel connections between containers is not working as expected, as just before the source libvirtd gets the I/O error on the connection to the destination host, the following messages are reported by migration-proxy:

{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"231 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.167992Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"377047 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168241Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"1152 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.168269Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"9312 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168504Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"1821025366 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168629Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"0 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.168664Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"error","msg":"error encountered copying data to outbound proxy connection /var/run/kubevirt/migrationproxy/f56887f1-0022-4776-83e0-3bb9496b81a7-49152-source.sock","pos":"migration-proxy.go:411","reason":"write unix @-\u003e/proc/2567715/root/var/run/kubevirt/migrationproxy/f56887f1-0022-4776-83e0-3bb9496b81a7-49152-source.sock: write: broken pipe","timestamp":"2021-04-15T19:01:18.168718Z"}

(In reply to Jiri Denemark from comment #37)
> Thanks a lot David for preprocessing the logs.
> [...]
> However, this last "blockdev-del" never finishes and we never get any reply
> from QEMU (no line with mon=0x7f8ea4002e30 from now on). Thus we're stuck in
> a migration job and trying to run other APIs on the same domain always times
> out and reports
>
> Timed out during operation: cannot acquire state change lock (held by
> monitor=remoteDispatchDomainMigratePerform3Params)

Yes. This is specifically what needs to be solved. This libvirt lock is blocking CNV from retrying the migration. Who do we need to get involved to address this?

> [...]
> To me it looks like the migration proxy used by CNV to tunnel connections
> between containers is not working as expected, as just before the source
> libvirtd gets the I/O error on the connection to the destination host, the
> following messages are reported by migration-proxy:
> [...]

To clarify, the migration failure isn't the focus here. We don't actually care why the migration failed. CNV needs to be able to detect that the failure occurred and retry it. The libvirt error "cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)" is what is blocking that from occurring.

> > [...migration-proxy log excerpt quoted above...]

I see, it's a question for the QEMU folks then, as we can't really do much (we can only kill QEMU) in case we're blocked on a QMP command. But my guess is the command is blocked because it is trying to finish sending data to the NBD server running on the destination and the connection is not working as expected, so solving the original issue might actually be the only way to unblock the process. But as I said, the QEMU folks should shed more light on this.

@dgilbert can you help here?

> > To me it looks like the migration proxy used by CNV to tunnel connections
> > between containers is not working as expected, as just before the source
> > libvirtd gets the I/O error on the connection to the destination host, the
> > following messages are reported by migration-proxy: [...]
>
> To clarify, the migration failure isn't the focus here.

@dvossel I'd say this is still interesting, no? Do we have an idea why it failed?
BTW, Peter Krempa found a "yank" QMP command by accident when he was updating test data for our QMP schema tests, and it seems this command is what could help unblock QEMU. But it's not going to help us in this case because it is only supported since QEMU 6.0, which is not available even in RHEL-AV 8.4.0.

(In reply to Jiri Denemark from comment #42)
> BTW, Peter Krempa found a "yank" QMP command by accident when he was updating
> test data for our QMP schema tests and it seems this command is what could
> help unblock QEMU. But it's not going to help us in this case because it is
> supported since QEMU 6.0, which is not available even in RHEL-AV 8.4.0.

Yes, 'yank' was added as an explicit way to get out of this very problem - anything hanging on a network connection (and I can see there's code in block/nbd.c to wire to it).

If we understand why/how the network connection is failing, you might be able to solve the problem without 'yank'; if you can force the TCP connections to fail, so that the NBD client gets an I/O error, it might fix it. If it's just seeing dropped packets, though, that may take a long time (TCP timeout, 15 mins??). If the networking is under your control, then killing/firewalling the route so it gives a 'no route to host' or similar might kill it.
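(Editorial note: on QEMU >= 6.0 the 'yank' command can be driven through libvirt's QEMU monitor passthrough. A hedged sketch of what that might look like follows; the node name is taken from the hanging blockdev-del above, but this is illustrative only, it does not apply to the QEMU version discussed in this bug, and if libvirtd itself is stuck holding the domain job the passthrough call may well block on the same lock.)

    import json
    import libvirt
    import libvirt_qemu

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    # See what can be yanked (block nodes, chardevs, the migration stream).
    print(libvirt_qemu.qemuMonitorCommand(dom, json.dumps({"execute": "query-yank"}), 0))

    # Force-fail the stuck NBD connection and the migration stream.
    cmd = {"execute": "yank",
           "arguments": {"instances": [
               {"type": "block-node", "node-name": "migration-vdb-storage"},
               {"type": "migration"}]}}
    print(libvirt_qemu.qemuMonitorCommand(dom, json.dumps(cmd), 0))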
> @dvossel I'd say this is still interesting, no? Do we have an idea why it failed?

The reason the migration fails is connectivity issues. The source of the connectivity issues is currently unknown. The CNV logs don't provide enough fidelity to give a clear picture here. This will be improved in future releases.
The most immediate goal here is for our components to be resilient to failure, regardless of where the failure occurs. We know this environment can result in successful migrations, and we need to ensure that our components are eventually consistent.
Not being able to retry or even detect that a migration has failed is the core issue here.
We are not able to reproduce this issue. @dvossel is preparing some patches in order to improve the debugging situation. We are aiming to get these improvements into a zstream to get them to the customer. Once this is done, we will get some more details. However, these changes will not give us insights into _why_ there are connection issues.

Failed migration can be detected by checking the value returned by the virDomainMigrate* APIs. I don't think changing virDomainGetJob{Stats,Info} to report the failure earlier, as soon as we know about it but before the migration API finishes, is worth the effort. Specifically, you wouldn't be able to retry the migration until the original API finishes (and reports the error). And in case migration hangs because QEMU does not respond to a monitor command (which is what happened here), restarting libvirtd may not help either. In the first place we should make sure we don't get into such a situation by fixing the root cause itself and forcefully aborting the migration block jobs after a failed migration. Additionally, we could also add a new flag for virDomainAbortJob to let upper layers forcefully abort migration in case they decide it is stuck.

BTW, you can use virDomainGetControlInfo to check whether libvirt is currently waiting for a reply from QEMU and for how long, and implement some kind of policy that would guess whether the last QMP command got stuck.
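(Editorial aside: a minimal sketch of the kind of policy suggested above, using virDomainGetControlInfo to guess whether libvirt has been waiting on QEMU for too long. The threshold, the connection URI, and the decision to call virDomainAbortJob are illustrative assumptions, not a libvirt recommendation; as noted above, a plain abort may not be able to interrupt a monitor command that QEMU never answers.)

    import libvirt

    STUCK_AFTER_MS = 5 * 60 * 1000                   # illustrative threshold: 5 minutes

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    # controlInfo() wraps virDomainGetControlInfo and returns
    # [state, details, stateTime]; stateTime is milliseconds spent in that state.
    state, details, state_time_ms = dom.controlInfo()

    busy = state in (libvirt.VIR_DOMAIN_CONTROL_JOB, libvirt.VIR_DOMAIN_CONTROL_OCCUPIED)
    if busy and state_time_ms > STUCK_AFTER_MS:
        # Heuristic: libvirt has been waiting on QEMU suspiciously long;
        # an upper layer could decide to abort the job (or escalate) here.
        print("domain control interface busy for %d ms, aborting job" % state_time_ms)
        dom.abortJob()                               # wraps virDomainAbortJob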
After analyzing logs with the enhanced migration logging, I did indeed find an anomaly that may explain why the migration failures are occurring to begin with.

When a migration is first being processed, there's a hand off that occurs between virt-controller and virt-handler on the target and source nodes. That hand off is coordinated through the vmi.Status.MigrationState field. While investigating logs with the enhanced debugging enabled, I found that our migration connections are being set up, then torn down very quickly, then re-established. Those connections are coordinated by this vmi.Status.MigrationState field that is managed during the hand off. Looking at the timestamps, I can verify that the setup, tear down, and re-setup of the migration proxy connection occurs while the migration is in-flight between the two pods. It's likely that this break in the migration proxy connection is why the migration fails and why libvirt/qemu enters the deadlocked state.

Here's what I think is happening. Under high load scenarios it's possible for controller reconcile logic and the controller's informer cache to fall behind. This forms a gap in our virt-controller logic that opens the possibility of a double hand off of a migration to virt-handler. This double hand off re-initializes and clobbers the vmi.Status.MigrationState field, which in turn causes virt-handler to reset the migration proxy connections on the source node. All of this quickly gets corrected, but at that point the damage is already done because the migration connections are severed while the migration is in-flight.

--------------------

With that said, I want to set expectations accurately for the PRs I'm introducing. We have evidence that the double virt-handler hand off occurs in cases where the frozen libvirt/qemu is encountered. Since we've been unsuccessful in reproducing this issue, it's difficult to say with certainty that preventing the double hand off will serve as a mitigation to avoid the frozen libvirt/qemu state. It's also difficult to say whether there are other paths that potentially lead to a frozen libvirt/qemu.

I think it's worthwhile for us to backport the double hand off fix, but just to be clear, I can't guarantee this resolves the issue encountered here.

Backport PRs:
- CNV 2.5 - KubeVirt 0.34 - https://github.com/kubevirt/kubevirt/pull/5706
- CNV 2.6 - KubeVirt 0.36 - https://github.com/kubevirt/kubevirt/pull/5705
- CNV 4.8 - KubeVirt 0.41 - https://github.com/kubevirt/kubevirt/pull/5704

I've transitioned this BZ back to NEW. We'll track the potential mitigation fix in a separate bz [1].

To be clear, the changes I'm introducing in the new bz [1] do not fix the root cause of the issue here. The root issue is that the libvirt/qemu stack can enter into a locked state due to a connection failure. At best the new changes within KubeVirt might lessen the chances that the issue occurs. Even then, this is only a theory. Without an accurate reproducer to validate against, we can't say anything with certainty.

1. https://bugzilla.redhat.com/show_bug.cgi?id=1965099

Our current understanding of how the bug can be reproduced:

Environment:
- Nodes have no memory pressure
- virt-controller is running on worker nodes (should be the default)
- at least 2 workers
- boot disk can be OCS

Test steps:
1. Create enough VMs to bring all worker nodes to ~100% CPU utilization (i.e., run stress-ng inside)
2. Trigger a live migration of one or more VMs until the bug is hit

I don't think I'm required to provide any info here at this time.

We're closing this BZ because this issue no longer appears to be occurring. We believe other migration-related fixes might have already solved it. Please don't hesitate to re-open this BZ if there's any concern at all.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.