Bug 1945532
Summary: VM migration halts occasionally
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 2.5.0
Target Release: 2.6.8
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Keywords: TestOnly
Hardware: Unspecified
OS: Unspecified
Reporter: Jaspreet Kaur <jkaur>
Assignee: David Vossel <dvossel>
QA Contact: Israel Pinto <ipinto>
CC: ailan, aliang, alitke, aprajapa, berrange, chayang, cnv-qe-bugs, coli, dgilbert, dvossel, fdeutsch, guchen, ipinto, jdenemar, jsuchane, kbidarka, kseremet, mprivozn, mtessun, rdey, rjones, roxenham, sgott, timao, vromanso, xiaohli, yanghliu, zixchen
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2021-09-20 12:46:18 UTC
Bug Depends On: 1965099
Description
Jaspreet Kaur
2021-04-01 08:31:27 UTC
There is an idea about what might be causing this bug; we are currently investigating it further.

Created attachment 1771624 [details]
source virt-launcher with libvirt debug

Created attachment 1771625 [details]
source virt-launcher with libvirt errors

To enable libvirt debug logs, please refer to: https://libvirt.org/kbase/debuglogs.html

Ideally, we would have logs from both the source and the destination. Also, what are the versions of libvirt and qemu, please?

From the incredibly illegible logs that sadly only contain errors and warnings from libvirt, it seems qemuMigrationAnyCompleted reports a broken connection to the destination host just a few seconds after the migration was started (I might be wrong here, as checking the timestamps in the json logs causes my head to start burning) and then the migration gets stuck in a job somewhere forever. It's really hard to say what's happening there without debug logs from libvirt (see comment #19). And please, attach plain debug logs from libvirt without all the json mess around.

(In reply to Jiri Denemark from comment #20)
> From the incredibly illegible logs that sadly only contain errors and
> warnings from libvirt it seems qemuMigrationAnyCompleted reports a broken
> connection to the destination host just a few seconds after the migration
> was started (I might be wrong here as checking the timestamps in the json
> logs causes my head to start burning) and then the migration gets stuck in
> a job somewhere forever. It's really hard to say what's happening there
> without debug logs from libvirt (see comment #19). And please, attach plain
> debug logs from libvirt without all the json mess around.

I just want to make sure expectations are set here on the engineering side. Retrieving libvirt debug logs is non-trivial for CNV 2.5. We can do it, but it's going to require the customer to reproduce this issue again (which is random) and it will require changes to their environment that impact every VM on the system in order to catch this.

Furthermore, the result won't be "plain" libvirt debug logs as you all are typically expecting. When we get debug logs, they will be smashed together with the virt-launcher logs in json format. Since the logs are in json format, you should be able to filter by subcomponent: "libvirt" to isolate the libvirt parts from everything else.

I'll communicate how to retrieve the debug logs on the customer case. I just want to be clear in saying we don't currently have this information, and it will take some time to get it.
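(For reference: a minimal sketch of the kind of filtering described above, pulling the libvirt entries out of the mixed virt-launcher JSON logs. The field names "subcomponent", "timestamp", "pos" and "msg" are assumptions based on the log excerpts quoted later in this bug; this is an illustration, not the tooling used on the case.)

    #!/usr/bin/env python3
    # filter_libvirt.py: print only the libvirt entries from virt-launcher JSON logs.
    # Hypothetical helper; field names are assumptions taken from excerpts in this bug.
    import json
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line.startswith("{"):
            continue                      # skip anything that is not a JSON record
        try:
            entry = json.loads(line)
        except ValueError:
            continue                      # skip truncated or garbled records
        if entry.get("subcomponent") != "libvirt":
            continue
        # Rebuild something close to a plain libvirt log line.
        print(entry.get("timestamp", "-"), entry.get("pos", "-"), entry.get("msg", ""))

Usage would be something like: oc logs virt-launcher-asvc135-2zms2 | python3 filter_libvirt.py > libvirt.log (pod name taken from the logs discussed below).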
@mprivozn I've communicated how to retrieve the libvirt debug logs on the customer case now.

In the meantime, what else can we do to make progress here?

For example, is it possible for CNV to work around this issue somehow by hard resetting libvirt when the issue is encountered? Basically, I'm looking for anything we can do to address this that will allow us to restore the ability to migrate while not requiring us to destroy the qemu process and launch a new pod.

(In reply to David Vossel from comment #22)
> @mprivozn I've communicated how to retrieve the libvirt debug logs on the
> customer case now.
>
> In the meantime, what else can we do to make progress here?

Well, sharing the exact libvirt release would be a good start. And knowing the way migration was initiated (i.e., the API call and its arguments) would not hurt either. Perhaps it's already somewhere in the logs, but explicitly stating it in a comment would make it easily visible.

> For example, is it possible for CNV to work around this issue somehow by
> hard resetting libvirt when the issue is encountered? Basically, I'm looking
> for anything we can do to address this that will allow us to restore the
> ability to migrate while not requiring us to destroy the qemu process and
> launch a new pod.

Well, it depends on where the problem is. A general workaround of "restart libvirtd whenever libvirt reports a timeout acquiring a job" is a very bad idea. It can help if libvirtd itself has a bug and forgets to release a job, or is stuck somewhere else, e.g., communicating with the remote daemon during migration. So in this specific case, restarting libvirtd might help, but it's not guaranteed and there's no way CNV could automatically decide whether restarting the daemon is a good idea. It should only be done manually and after some investigation.

(In reply to David Vossel from comment #21)
> (In reply to Jiri Denemark from comment #20)
> > From the incredibly illegible logs that sadly only contain errors and
> > warnings from libvirt it seems qemuMigrationAnyCompleted reports a broken
> > connection to the destination host just a few seconds after the migration
> > was started [...]
>
> I just want to make sure expectations are set here on the engineering side.

David, would it help if libvirt printed debug logs onto stderr, syslog, or journald? These are the outputs that libvirt supports:

# The format for an output can be:
#    level:stderr
#      output goes to stderr
#    level:syslog:name
#      use syslog for the output and use the given name as the ident
#    level:file:file_path
#      output to a file, with the given filepath
#    level:journald
#      output to journald logging system

(In reply to David Vossel from comment #22)
> @mprivozn I've communicated how to retrieve the libvirt debug logs on the
> customer case now.

Awesome, thanks.

> In the meantime, what else can we do to make progress here?
>
> For example, is it possible for CNV to work around this issue somehow by
> hard resetting libvirt when the issue is encountered? [...]

So my take is that this is not easy to reproduce, correct?

(In reply to David Vossel from comment #21)
> Furthermore, the result won't be "plain" libvirt debug logs as you all are
> typically expecting. When we get debug logs, they will be smashed together
> with the virt-launcher logs in json format. Since the logs are in json
> format, you should be able to filter by subcomponent: "libvirt" to isolate
> the libvirt parts from everything else.

Oh, that's very sad. The problem is not about logs containing messages from other components. The biggest problem is the json format, which makes the logs unreadable: mainly timestamps at a random position at the end of each line and a lot of distracting stuff other than the message itself. Could you at least provide a tool to restore the original logs from the json format?

Thanks a lot David for preprocessing the logs.
Logs from the source "host" (virt-launcher-asvc135-2zms2_running):

Peer-to-peer migration is initiated by

2021-04-15T19:01:05.547000Z [thead:16482] [pos:remoteDispatchDomainMigratePerform3ParamsHelper:8680] server=0x562ba52c1160 client=0x562ba52ba790 msg=0x562ba52ec1b0 rerr=0x7f8eed6228e0 args=0x7f8edc01a9e0 ret=0x7f8edc01aa40
2021-04-15T19:01:05.548000Z [thead:16482] [pos:virDomainMigratePerform3Params:4958] dom=0x7f8edc005780, (VM: name=s-backend_asvc135, uuid=fac8031f-f539-50f4-b822-4b930d18e5a8), dconnuri=qemu+tcp://[::1]:22222/system, params=0x7f8edc014300, nparams=3, cookiein=(nil), cookieinlen=0, cookieout=0x7f8eed6227e8, cookieoutlen=0x7f8eed6227dc, flags=0x8b

and starts by creating a connection to the destination host from the source libvirtd:

2021-04-15T19:01:05.562000Z [thead:16482] [pos:virNetSocketNew:288] RPC_SOCKET_NEW: sock=0x7f8edc01b340 fd=19 errfd=-1 pid=0 localAddr=::1;36482, remoteAddr=::1;22222
2021-04-15T19:01:05.562000Z [thead:16482] [pos:virNetClientNew:323] RPC_CLIENT_NEW: client=0x7f8edc015120 sock=0x7f8edc01b340
2021-04-15T19:01:05.596000Z [thead:16482] [pos:qemuMigrationSrcPerformPeer2Peer3:4138] driver=0x7f8ea0008e00, sconn=0x7f8ed80019e0, dconn=0x7f8edc006820, dconnuri=qemu+tcp://[::1]:22222/system

Storage migration is set up and the blockmirror job is started:

2021-04-15T19:01:06.891000Z [thread:16482] [pos:qemuMonitorSend:996] {"execute":"blockdev-mirror","arguments":{"job-id":"drive-ua-cloudinitdisk","device":"libvirt-1-format","target":"migration-vdb-format","speed":9223372036853727232,"sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-7995"}

A moment later, storage migration is ready and memory migration can begin:

2021-04-15T19:01:06.965000Z [thead:16482] [pos:qemuMigrationSrcNBDStorageCopyReady:546] All disk mirrors are ready
2021-04-15T19:01:06.979000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-8000"}
2021-04-15T19:01:07.326000Z [thead:16478] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"timestamp": {"seconds": 1618513267, "microseconds": 325426}, "event": "MIGRATION", "data": {"status": "active"}}]

From this point the client is monitoring migration progress by calling virDomainGetJobInfo (virDomainGetJobStats is a newer, better variant able to provide a lot more data and it should be used instead of virDomainGetJobInfo unless compatibility with very old libvirt is required), which results in "query-migrate" QMP commands being issued.

Several keepalive messages and even timeouts may be seen in the logs as well, but they are expected. A keepalive timeout is triggered every 5 seconds on an idle connection and there's nothing to worry about as long as each keepalive request is answered with a keepalive response in a timely manner. And all answers are sent in time in our case (we only see keepalive timeouts after 5 s of inactivity and never more than that, which would mean responses did not come in time, i.e., always with countToDeath=5 idle=5).
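(Editorial aside: a minimal libvirt-python sketch of the job monitoring described above, polling virDomainGetJobStats rather than the older virDomainGetJobInfo. The connection URI and use of libvirt-python are assumptions for illustration; the VM name is the one from these logs.)

    import time
    import libvirt

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    while True:
        stats = dom.jobStats()                       # wraps virDomainGetJobStats
        if stats.get("type", libvirt.VIR_DOMAIN_JOB_NONE) == libvirt.VIR_DOMAIN_JOB_NONE:
            break                                    # no job running (finished or failed)
        print("elapsed=%s ms, data remaining=%s bytes" % (stats.get("time_elapsed"),
                                                          stats.get("data_remaining")))
        time.sleep(1)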
So nothing interesting, until the connection to the destination libvirtd suddenly breaks and reports an I/O error event:

2021-04-15T19:01:18.168000Z [pos:virNetClientIncomingEvent:1947] client=0x7f8edc015120 wantclose=0
2021-04-15T19:01:18.168000Z [pos:virNetSocketReadWire:1832] End of file while reading data: Input/output error
2021-04-15T19:01:18.168000Z [pos:virNetClientMarkClose:726] client=0x7f8edc015120, reason=0
2021-04-15T19:01:18.168000Z [pos:virNetClientCloseLocked:746] client=0x7f8edc015120, sock=0x7f8edc01b340, reason=0
2021-04-15T19:01:18.169000Z [pos:qemuMigrationAnyCompleted:1628] operation failed: Lost connection to destination host

Since there's no way migration could succeed without the connection, we just abort it and cancel the block job:

2021-04-15T19:01:18.170000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"migrate_cancel","id":"libvirt-8034"}
2021-04-15T19:01:18.587000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"job-dismiss","arguments":{"id":"drive-ua-cloudinitdisk"},"id":"libvirt-8037"}
2021-04-15T19:01:18.593000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"blockdev-del","arguments":{"node-name":"migration-vdb-format"},"id":"libvirt-8038"}
2021-04-15T19:01:18.595000Z [thead:16478] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"return": {}, "id": "libvirt-8038"}]
2021-04-15T19:01:18.596000Z [thead:16482] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f8ea4002e30 msg={"execute":"blockdev-del","arguments":{"node-name":"migration-vdb-storage"},"id":"libvirt-8039"}

However, this last "blockdev-del" never finishes and we never get any reply from QEMU (no line with mon=0x7f8ea4002e30 from now on). Thus we're stuck in a migration job, and trying to run other APIs on the same domain always times out and reports

Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

Let's look at the logs from the destination "host" (virt-launcher-asvc135-fjtqf_completed):

2021-04-15T19:01:05.671000Z [thread:55] [pos:virDomainMigratePrepare3Params:4861] dconn=0x7f0d30000fd0, params=0x7f0d3c003c60, nparams=3, cookiein=0x7f0d3c0035d0, cookieinlen=1235, cookieout=0x7f0d4c4b0808, cookieoutlen=0x7f0d4c4b07fc, uri_out=0x7f0d3c005b20, flags=0x8
2021-04-15T19:01:06.741000Z [thread:55] [pos:qemuMonitorNBDServerStart:3905] server={tcp host=:: port=49153} tls_alias=<null>
2021-04-15T19:01:06.743000Z [thread:55] [pos:qemuMigrationDstRun:1954] Setting up incoming migration with URI tcp:[::]:49152
2021-04-15T19:01:06.743000Z [thread:55] [pos:qemuMonitorSend:996] QEMU_MONITOR_SEND_MSG: mon=0x7f0d3c022d60 msg={"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:49152"},"id":"libvirt-368"}
2021-04-15T19:01:06.746000Z [thread:55] [pos:virThreadJobClear:119] Thread 55 (virNetServerHandleJob) finished job remoteDispatchDomainMigratePrepare3Params with ret=0

At this point QEMU is running on the destination host and it is listening for incoming migration from the source.
Once QEMU on the source host is asked to start migrating, we can see it on the destination too:

2021-04-15T19:01:06.974000Z [thread:53] [pos:qemuMonitorJSONIOProcessLine:220] Line [{"timestamp": {"seconds": 1618513266, "microseconds": 974594}, "event": "MIGRATION", "data": {"status": "active"}}]

There's nothing interesting going on until we get an I/O error on the connection from the source libvirtd which initiated the migration:

2021-04-15T19:01:40.148000Z [thread:53] virNetSocketReadWire:1832 End of file while reading data: Input/output error
2021-04-15T19:01:40.148000Z [thread:53] virConnectClose:1195 conn=0x7f0d30000fd0

And libvirt kills the domain on the destination, as there's no way migration could ever succeed when we lost the controlling connection:

2021-04-15T19:01:40.148000Z [thread:53] qemuProcessAutoDestroy:7607 vm=s-backend_asvc135, conn=0x7f0d30000fd0

In other words, everything looks mostly normal until the connection breaks, but that's an external event and not initiated by libvirt on either side of the migration. The only interesting part is the hanging "blockdev-del" command, but it's not the reason for the failure; it's actually just a result. That command is supposed to close the connection to the NBD server, but that connection is likely affected by the malfunction which caused the I/O error seen by libvirt.

To me it looks like the migration proxy used by CNV to tunnel connections between containers is not working as expected, as just before the source libvirtd gets the I/O error on the connection to the destination host, the following messages are reported by migration-proxy:

{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"231 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.167992Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"377047 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168241Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"1152 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.168269Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"9312 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168504Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"1821025366 bytes written oubound connection","pos":"migration-proxy.go:404","timestamp":"2021-04-15T19:01:18.168629Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"info","msg":"0 bytes read from oubound connection","pos":"migration-proxy.go:397","timestamp":"2021-04-15T19:01:18.168664Z"}
{"code":"virErrCode(1)","component":"virt-launcher","level":"error","msg":"error encountered copying data to outbound proxy connection /var/run/kubevirt/migrationproxy/f56887f1-0022-4776-83e0-3bb9496b81a7-49152-source.sock","pos":"migration-proxy.go:411","reason":"write unix @-\u003e/proc/2567715/root/var/run/kubevirt/migrationproxy/f56887f1-0022-4776-83e0-3bb9496b81a7-49152-source.sock: write: broken pipe","timestamp":"2021-04-15T19:01:18.168718Z"}

(In reply to Jiri Denemark from comment #37)
> Thanks a lot David for preprocessing the logs.
> [...]
> However, this last "blockdev-del" never finishes and we never get any reply
> from QEMU (no line with mon=0x7f8ea4002e30 from now on). Thus we're stuck in
> a migration job and trying to run other APIs on the same domain always times
> out and reports
>
> Timed out during operation: cannot acquire state change lock (held by
> monitor=remoteDispatchDomainMigratePerform3Params)

Yes. This is specifically what needs to be solved. This libvirt lock is blocking CNV from retrying the migration. Who do we need to get involved to address this?

> [...]
> To me it looks like the migration proxy used by CNV to tunnel connections
> between containers is not working as expected, as just before the source
> libvirtd gets the I/O error on the connection to the destination host, the
> following messages are reported by migration-proxy:
> [...]

To clarify, the migration failure isn't the focus here. We don't actually care why the migration failed. CNV needs to be able to detect that the failure occurred and retry it. The libvirt error "cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)" is what is blocking that from occurring.

> > [...migration-proxy log excerpt quoted above...]

I see, it's a question for the QEMU folks then, as we can't really do much (we can only kill QEMU) in case we're blocked on a QMP command. But my guess is the command is blocked because it is trying to finish sending data to the NBD server running on the destination and the connection is not working as expected, so solving the original issue might actually be the only way to unblock the process. But as I said, the QEMU folks should shed more light on this.

@dgilbert can you help here?

> > To me it looks like the migration proxy used by CNV to tunnel connections
> > between containers is not working as expected, as just before the source
> > libvirtd gets the I/O error on the connection to the destination host, the
> > following messages are reported by migration-proxy: [...]
>
> To clarify, the migration failure isn't the focus here.

@dvossel I'd say this is still interesting, no? Do we have an idea why it failed?
BTW, Peter Krempa found a "yank" QMP command by accident when he was updating test data for our QMP schema tests, and it seems this command is what could help unblock QEMU. But it's not going to help us in this case because it is only supported since QEMU 6.0, which is not available even in RHEL-AV 8.4.0.

(In reply to Jiri Denemark from comment #42)
> BTW, Peter Krempa found a "yank" QMP command by accident when he was updating
> test data for our QMP schema tests and it seems this command is what could
> help unblock QEMU. But it's not going to help us in this case because it is
> supported since QEMU 6.0, which is not available even in RHEL-AV 8.4.0.

Yes, 'yank' was added as an explicit way to get out of this very problem - anything hanging on a network connection (and I can see there's code in block/nbd.c to wire to it).

If we understand why/how the network connection is failing, you might be able to solve the problem without 'yank'; if you can force the TCP connections to fail, so that the NBD client gets an I/O error, it might fix it. If it's just seeing dropped packets, though, that may take a long time (TCP timeout, 15 mins??). If the networking is under your control, then killing/firewalling the route so it gives a 'no route to host' or similar might kill it.
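(Editorial note: on QEMU >= 6.0 the 'yank' command can be driven through libvirt's QEMU monitor passthrough. A hedged sketch of what that might look like follows; the node name is taken from the hanging blockdev-del above, but this is illustrative only, it does not apply to the QEMU version discussed in this bug, and if libvirtd itself is stuck holding the domain job the passthrough call may well block on the same lock.)

    import json
    import libvirt
    import libvirt_qemu

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    # See what can be yanked (block nodes, chardevs, the migration stream).
    print(libvirt_qemu.qemuMonitorCommand(dom, json.dumps({"execute": "query-yank"}), 0))

    # Force-fail the stuck NBD connection and the migration stream.
    cmd = {"execute": "yank",
           "arguments": {"instances": [
               {"type": "block-node", "node-name": "migration-vdb-storage"},
               {"type": "migration"}]}}
    print(libvirt_qemu.qemuMonitorCommand(dom, json.dumps(cmd), 0))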
> @dvossel I'd say this is still interesting, no? Do we have an idea why it failed?

The reason the migration fails is connectivity issues. The source of the connectivity issues is currently unknown. The CNV logs don't provide enough fidelity to give a clear picture here. This will be improved in future releases.
The most immediate goal here is for our components to be resilient to failure, regardless of where the failure occurs. We know this environment can result in successful migrations, and we need to ensure that our components are eventually consistent.
Not being able to retry or even detect that a migration has failed is the core issue here.
We are not able to reproduce this issue. @dvossel is preparing some patches in order to improve the debugging situation. We are aiming to get these improvements into a zstream to get them to the customer. Once this is done, we will get some more details. However, these changes will not give us insights into _why_ there are connection issues.

Failed migration can be detected by checking the value returned by the virDomainMigrate* APIs. I don't think changing virDomainGetJob{Stats,Info} to report the failure earlier, as soon as we know about it but before the migration API finishes, is worth the effort. Specifically, you wouldn't be able to retry the migration until the original API finishes (and reports the error). And in case migration hangs because QEMU does not respond to a monitor command (which is what happened here), restarting libvirtd may not help either. In the first place we should make sure we don't get into such a situation by fixing the root cause itself and forcefully aborting the migration block jobs after a failed migration. Additionally, we could also add a new flag for virDomainAbortJob to let upper layers forcefully abort migration in case they decide it is stuck.

BTW, you can use virDomainGetControlInfo to check whether libvirt is currently waiting for a reply from QEMU and for how long, and implement some kind of policy that would guess whether the last QMP command got stuck.
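(Editorial aside: a minimal sketch of the kind of policy suggested above, using virDomainGetControlInfo to guess whether libvirt has been waiting on QEMU for too long. The threshold, the connection URI, and the decision to call virDomainAbortJob are illustrative assumptions, not a libvirt recommendation; as noted above, a plain abort may not be able to interrupt a monitor command that QEMU never answers.)

    import libvirt

    STUCK_AFTER_MS = 5 * 60 * 1000                   # illustrative threshold: 5 minutes

    conn = libvirt.open("qemu:///system")            # URI is an assumption
    dom = conn.lookupByName("s-backend_asvc135")     # VM name from the logs above

    # controlInfo() wraps virDomainGetControlInfo and returns
    # [state, details, stateTime]; stateTime is milliseconds spent in that state.
    state, details, state_time_ms = dom.controlInfo()

    busy = state in (libvirt.VIR_DOMAIN_CONTROL_JOB, libvirt.VIR_DOMAIN_CONTROL_OCCUPIED)
    if busy and state_time_ms > STUCK_AFTER_MS:
        # Heuristic: libvirt has been waiting on QEMU suspiciously long;
        # an upper layer could decide to abort the job (or escalate) here.
        print("domain control interface busy for %d ms, aborting job" % state_time_ms)
        dom.abortJob()                               # wraps virDomainAbortJob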
After analyzing logs with the enhanced migration logging, I did indeed find an anomaly that may explain why the migration failures are occurring to begin with.

When a migration is first being processed, there's a hand off that occurs between virt-controller and virt-handler on the target and source nodes. That hand off is coordinated through the vmi.Status.MigrationState field. While investigating logs with the enhanced debugging enabled, I found that our migration connections are being set up, then torn down very quickly, then re-established. Those connections are coordinated by this vmi.Status.MigrationState field that is managed during the hand off. Looking at the timestamps, I can verify that the setup, tear down, and re-setup of the migration proxy connection occurs while the migration is in-flight between the two pods. It's likely that this break in the migration proxy connection is why the migration fails and why libvirt/qemu enters the deadlocked state.

Here's what I think is happening. Under high load scenarios it's possible for controller reconcile logic and the controller's informer cache to fall behind. This forms a gap in our virt-controller logic that opens the possibility of a double hand off of a migration to virt-handler. This double hand off re-initializes and clobbers the vmi.Status.MigrationState field, which in turn causes virt-handler to reset the migration proxy connections on the source node. All of this quickly gets corrected, but at that point the damage is already done because the migration connections are severed while the migration is in-flight.

--------------------

With that said, I want to set expectations accurately for the PRs I'm introducing. We have evidence that the double virt-handler hand off occurs in cases where the frozen libvirt/qemu is encountered. Since we've been unsuccessful in reproducing this issue, it's difficult to say with certainty that preventing the double hand off will serve as a mitigation to avoid the frozen libvirt/qemu state. It's also difficult to say whether there are other paths that potentially lead to a frozen libvirt/qemu.

I think it's worthwhile for us to backport the double hand off fix, but just to be clear, I can't guarantee this resolves the issue encountered here.

Backport PRs:
- CNV 2.5 - KubeVirt 0.34 - https://github.com/kubevirt/kubevirt/pull/5706
- CNV 2.6 - KubeVirt 0.36 - https://github.com/kubevirt/kubevirt/pull/5705
- CNV 4.8 - KubeVirt 0.41 - https://github.com/kubevirt/kubevirt/pull/5704

I've transitioned this BZ back to NEW. We'll track the potential mitigation fix in a separate bz [1].

To be clear, the changes I'm introducing in the new bz [1] do not fix the root cause of the issue here. The root issue is that the libvirt/qemu stack can enter into a locked state due to a connection failure. At best the new changes within KubeVirt might lessen the chances that the issue occurs. Even then, this is only a theory. Without an accurate reproducer to validate against, we can't say anything with certainty.

1. https://bugzilla.redhat.com/show_bug.cgi?id=1965099

Our current understanding of how the bug can be reproduced:

Environment:
- Nodes have no memory pressure
- virt-controller is running on worker nodes (should be the default)
- at least 2 workers
- boot disk can be OCS

Test steps:
1. Create enough VMs to bring all worker nodes to ~100% CPU utilization (i.e., run stress-ng inside)
2. Trigger a live migration of one or more VMs until the bug is hit

I don't think I'm required to provide any info here at this time.

We're closing this BZ because this issue no longer appears to be occurring. We believe other migration-related fixes might have already solved it. Please don't hesitate to re-open this BZ if there's any concern at all.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.