Description of problem:

[SR-IOV] vdsm-client doesn't report netIfaces for more than 5 minutes after migration

Our SR-IOV migration-over-bond-in-guest test reproduces a bug in vdsm-client. After the migration, vdsm-client doesn't report the VM's netIfaces for a very long time, and our test fails to get the VM's interfaces after the migration is done. This is a new issue triggered by the new vdsm, vdsm-4.40.20-1.el8ev.x86_64.

Before the migration everything is fine:

vdsm-client VM getStats vmID=b7b6d2d4-d689-4d5c-abe0-cf1d1620ecbd | grep -i netifaces -A 30
"netIfaces": [
    "hw": "00:xx:xx:xx:xx:xx",
    "inet": [
        "IPv4 address"
    ],
    "inet6": [
        "IPv6 address",
        "linklocal"
    ],
    "name": "bond1"
},

After the migration, no netIfaces are reported at all for a very long time, more than 5 minutes:

vdsm-client VM getStats vmID=b7b6d2d4-d689-4d5c-abe0-cf1d1620ecbd | grep -i netifaces -A 30

returns nothing for a long time.

Version-Release number of selected component (if applicable):
vdsm-4.40.20-1.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run the SR-IOV migration test
2. Check netIfaces using vdsm-client before migration
3. Check netIfaces using vdsm-client after migration

Actual results:
2. netIfaces reported fine
3. netIfaces not reported at all after migration; only after many minutes, more than 5

Expected results:
netIfaces must be reported as expected after VM migration.

Additional info:
This is blocking the SR-IOV migration test on all HW. Looks like a regression in the new vdsm-client.
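The check the test performs can be sketched in Python. This is a minimal illustration, assuming only the JSON shape shown in the grep output above (a `getStats` reply carrying a `netIfaces` list); the helper name is hypothetical, not part of vdsm-client:

```python
import json

def extract_net_ifaces(stats_json: str):
    """Pull the netIfaces list out of a `vdsm-client VM getStats` reply.

    Returns an empty list when the guest-agent data is missing,
    which is the symptom seen here after migration.
    """
    stats = json.loads(stats_json)
    # getStats returns a list with one entry per queried VM
    vm_stats = stats[0] if isinstance(stats, list) else stats
    return vm_stats.get("netIfaces", [])

# Sample payload shaped like the pre-migration output in this report
sample = json.dumps([{
    "netIfaces": [{
        "hw": "00:xx:xx:xx:xx:xx",
        "inet": ["IPv4 address"],
        "inet6": ["IPv6 address", "linklocal"],
        "name": "bond1",
    }]
}])

print([i["name"] for i in extract_net_ifaces(sample)])  # before migration: ['bond1']
print(extract_net_ifaces(json.dumps([{}])))             # post-migration symptom: []
```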
Looks very much similar to something we had in the past with VM start, where stats were not reported for 5 minutes: BZ 1680398. Looks like this is another scenario of that bug: after migration, netIfaces in stats are not reported for more than 5 minutes; our test's timeout limit is 300 seconds.
(In reply to Michael Burman from comment #3)
> Looks very much similar to something we had in the past with VM start and
> that stats were not reported for 5 minutes. BZ 1680398

Yes, indeed it looks similar; it could be that the frequency of sampling the guest agent at the destination is every 5 min. In that case, we can maybe increase the frequency as we do when booting the VM.

> Looks like this is another scenario of this bug after migration, netIfaces
> in stats not reported more than 5 minutes, our test limit timeout is 300
> seconds.

I guess 300 sec since starting the migration, right? If so, can we change the countdown to start when the migration ends?
(In reply to Arik from comment #4)
> (In reply to Michael Burman from comment #3)
> > Looks very much similar to something we had in the past with VM start and
> > that stats were not reported for 5 minutes. BZ 1680398
>
> Yes, indeed looks similar - could be that the frequency of sampling the
> guest-agent at the destination is every 5 min.
> In that case, we can maybe increase the frequency as we do when booting the
> VM.
>
> > Looks like this is another scenario of this bug after migration, netIfaces
> > in stats not reported more than 5 minutes, our test limit timeout is 300
> > seconds.
>
> I guess 300 sec since starting the migration, right?
> If so, can we change the countdown to start when the migration ends?

Hi Arik,
300 sec after migration ends, not on start.
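The "increase the frequency as we do when booting the VM" idea discussed above can be sketched as an adaptive polling interval. The interval values and function name below are purely illustrative assumptions, not Vdsm's actual configuration or code:

```python
import time

BOOST_PERIOD = 60      # seconds of fast sampling after a lifecycle event (illustrative)
FAST_INTERVAL = 5      # poll the guest agent every 5 s while boosted (illustrative)
SLOW_INTERVAL = 300    # the default 5-minute interval suspected in this bug

def next_poll_interval(now: float, last_lifecycle_event: float) -> int:
    """Return a short polling interval right after boot/migration, else the default."""
    if now - last_lifecycle_event < BOOST_PERIOD:
        return FAST_INTERVAL
    return SLOW_INTERVAL

migration_end = time.time()
print(next_poll_interval(migration_end + 10, migration_end))   # boosted right after migration
print(next_poll_interval(migration_end + 600, migration_end))  # back to the slow default
```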
It looks like it takes 5 minutes before we even query the agent for capabilities. But I can't tell the reason why; the logs are missing debug info. Please enable DEBUG level for the 'vds' handler.
(In reply to Michael Burman from comment #5)
> 300 sec after migration ends. Not on start.

But on the destination I see:
The migration started at 15:05:44
The migration finished at 15:05:58
Data was received from the guest agent at 15:10:46
The next call to 'getStats' was at 15:11:47 (previous call at 15:10:27)

So had the test queried the stats between 15:10:46 and 15:10:58, it should have got the guest agent data.
Can the test query the data also when the timeout occurs?
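The timeline above can be checked with a little arithmetic: the guest agent data arrived 4 minutes and 48 seconds after the migration finished, i.e. just inside a 300-second window counted from migration end. A sketch using the timestamps quoted in this comment:

```python
from datetime import datetime, timedelta

def t(s: str) -> datetime:
    """Parse an HH:MM:SS timestamp as logged on the destination host."""
    return datetime.strptime(s, "%H:%M:%S")

migration_end = t("15:05:58")
agent_data_received = t("15:10:46")

gap = agent_data_received - migration_end
print(gap)  # 0:04:48

# A test that makes one final query when its 300 s timeout fires would
# therefore have seen the guest agent data, barely.
timeout = timedelta(seconds=300)
print(gap <= timeout)  # True
```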
(In reply to Arik from comment #7)
> (In reply to Michael Burman from comment #5)
> > 300 sec after migration ends. Not on start.
>
> But on the destination I see:
> The migration started at 15:05:44
> The migration finished at 15:05:58
> Data was received from the guest-agent at 15:10:46
> The next call to 'getStats' was at 15:11:47 (previous call at 15:10:27)
>
> So had the test queried the stats between 15:10:46 to 10.10.58, it should
> have got the guest agent data.
> Can the test query the data also when the timeout occurs?

Why do we need to query the data also after 5 minutes? 5 minutes is just a threshold, and in my opinion too big. I want to get the IP right away after the migration is done from vdsm-client, as it has always worked. I'm not going to change our code, this is wrong. I proved that vdsm-client doesn't show netIfaces for a long time; the data is available but not reported by vdsm-client. This should be fixed. I will provide an env for investigation.
This is not a recent regression; this is a behavior change when moving to qemu-guest-agent in RHEL 8. RHEL 7 VMs shouldn't exhibit this behavior.
Decreasing severity accordingly. Again, for tests you can modify the interval until this is fixed.
Can you please take a look at the REST API information? Does it exhibit the same behavior? vdsm-client is only an internal tool.
I would also say this is not a new issue. Either way, this has been improved a little recently. Please retest when there is a new vdsm build.
Can you please upload the relevant patches that should fix this? I really don't understand it. We see it for the first time on vdsm-4.40.20-1.el8ev.x86_64, not before in the migration scenario. We saw it many times on VM boot, but not on migration. I know you touched this area of the code lately, so I'm not sure how this is not a new issue.
(In reply to Michal Skrivanek from comment #12)
> can you please take a look at REST API information? Does it exhibit the same
> behavior? vdsm-client is only an internal tool

REST has the same issue: no IPv4 address after migration in REST as well for more than 5 minutes. eth0 and bond1 are also missing during this time.

Before migration:

<ips>
    <ip>
        <address>x.x.x.x</address>
        <version>v4</version>
    </ip>

After migration: no IP for more than 5 minutes. After this time, all the info is available in REST and in vdsm-client. On the guest, during the whole time, the VM has an IP and all the devices exist. So this is an issue with pulling and querying this info from the guest agent, like it was on VM start.

BTW, I know that vdsm-client is an internal tool, but all QE use the vdsm-client tool in their tests, and not from today.
It looks like Michael is right, in a way. This used to sort of work because of a bug in Vdsm. When we fixed that bug with https://gerrit.ovirt.org/#/c/109496/ we "broke" the behavior after migration. But as I said in comment 13, there was another change improving the behavior: https://gerrit.ovirt.org/#/c/108935/ So please re-check after the next vdsm build.
Shouldn't it also be part of the "state" that is passed from the source to the destination? That way we can avoid a gap in which the data is not reported
(In reply to Arik from comment #17)
> Shouldn't it also be part of the "state" that is passed from the source to
> the destination?
> That way we can avoid a gap in which the data is not reported

As discussed offline, it would be better to let Engine recognize that the current guest agent stats are not valid and that the last available stats should be reused until proper stats are received. Tomáš, would any changes on the Vdsm side be required for that, or is it just needed to make a corresponding change in Engine?
(In reply to Milan Zamazal from comment #18)
> (In reply to Arik from comment #17)
> > Shouldn't it also be part of the "state" that is passed from the source to
> > the destination?
> > That way we can avoid a gap in which the data is not reported
>
> As discussed offline, it would be better to let Engine recognize that the
> current guest agent stats are not valid and the last available stats should
> be reused, until proper stats are received. Tomáš, would any changes on the
> Vdsm side be required for that or is it just needed to make a corresponding
> change in Engine?

No, I don't think it would require any changes in Vdsm. For example, comparing the content of os-info in the guest stats should serve the purpose well.
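The Engine-side idea discussed in this exchange, reusing the last valid guest-agent stats until fresh data arrives, could look roughly like this. This is only a sketch: the function and the `guestOs` key name are hypothetical stand-ins (the actual Engine code is Java, and the real validity check would compare the os-info content as suggested above):

```python
def merge_guest_stats(previous: dict, current: dict) -> dict:
    """Reuse the previous guest-agent fields while the current report
    lacks them, e.g. right after migration before the agent is polled.

    The presence of guest OS info serves as the validity signal,
    per the os-info comparison suggested in this thread.
    """
    if not current.get("guestOs"):          # hypothetical key name
        merged = dict(current)
        for key in ("guestOs", "netIfaces", "guestIPs"):
            if key in previous:
                merged[key] = previous[key]
        return merged
    return current

before = {"status": "Up", "guestOs": "RHEL 8", "netIfaces": [{"name": "bond1"}]}
after_migration = {"status": "Up"}          # agent data cleared on the destination

# The stale-but-valid interface data is carried over until real data arrives
print(merge_guest_stats(before, after_migration)["netIfaces"])
```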
Filed bz 1853897 about preventing the guest agent information from being cleared during live migration (comments 17-19). This issue should be resolved with latest build.
(In reply to Arik from comment #20)
> Filed bz 1853897 about preventing the guest agent information from being
> cleared during live migration (comments 17-19).
>
> This issue should be resolved with latest build.

Thanks Arik. Indeed, the issue seems to be resolved with the latest build, rhvm-4.4.1.7-0.3.el8ev.noarch and vdsm-4.40.22-1.el8ev. The issue didn't reproduce any more. Can we please attach the relevant patch/patches that might have resolved this issue to this bug, for a clear record of it in the future? Tomáš mentioned a patch that might have fixed it in comment 16, and I would like to have it attached to the bug as a link. Thanks,
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.