Bug 1851726 - [SR-IOV] vdsm-client doesn't reports netIfaces after migration more than 5 minutes
Summary: [SR-IOV] vdsm-client doesn't reports netIfaces after migration more than 5 mi...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: ---
Assignee: Tomáš Golembiovský
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-28 12:11 UTC by Michael Burman
Modified: 2020-07-08 08:26 UTC (History)

Fixed In Version: rhv-4.4.1-10
Clone Of:
Environment:
Last Closed: 2020-07-08 08:26:25 UTC
oVirt Team: Virt
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108935 0 None MERGED qga: move on boot checks into separate function 2020-12-06 10:25:09 UTC

Description Michael Burman 2020-06-28 12:11:32 UTC
Description of problem:
[SR-IOV] vdsm-client doesn't reports netIfaces after migration more than 5 minutes

Our test for SR-IOV migration over a bond in the guest reproduces a bug in vdsm-client. After the migration, vdsm-client doesn't report the VM's netIfaces for a very long time, and our test fails to get the VM's interfaces once the migration is done. This is a new issue triggered by the new vdsm, vdsm-4.40.20-1.el8ev.x86_64.

Before the migration everything is fine:
vdsm-client VM getStats vmID=b7b6d2d4-d689-4d5c-abe0-cf1d1620ecbd |grep -i netifaces -A 30
        "netIfaces": [
            {
                "hw": "00:xx:xx:xx:xx:xx",
                "inet": [
                    "IPv4 address"
                ],
                "inet6": [
                    "IPv6 address",
                    "linklocal"
                ],
                "name": "bond1"
            },
            
After migration, no netIfaces are reported at all for a very long time, more than 5 minutes.

vdsm-client VM getStats vmID=b7b6d2d4-d689-4d5c-abe0-cf1d1620ecbd |grep -i netifaces -A 30

Returns nothing for a long time.
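
To measure how long the gap lasts, rather than re-running the command by hand, a small polling helper can wrap the same `vdsm-client VM getStats` call. This is a minimal sketch and not part of the QE test suite: the function names are hypothetical, and it assumes the command prints a JSON list with one dictionary per VM (a bare dictionary is accepted as well).

```python
import json
import subprocess
import time


def fetch_stats(vm_id):
    """Run the getStats command from this report and parse its JSON output."""
    out = subprocess.check_output(
        ["vdsm-client", "VM", "getStats", "vmID=" + vm_id],
        universal_newlines=True,
    )
    return json.loads(out)


def has_net_ifaces(stats):
    """Return True if any VM entry carries a non-empty netIfaces list."""
    # Assumption: the reply is a list of per-VM dicts; tolerate a bare dict.
    entries = stats if isinstance(stats, list) else [stats]
    return any(isinstance(e, dict) and e.get("netIfaces") for e in entries)


def wait_for_net_ifaces(vm_id, timeout=300, interval=5,
                        fetch=fetch_stats, clock=time.monotonic,
                        sleep=time.sleep):
    """Poll until netIfaces is reported.

    Returns the number of seconds waited, or None if the timeout expired.
    """
    start = clock()
    while True:
        if has_net_ifaces(fetch(vm_id)):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Calling `wait_for_net_ifaces("b7b6d2d4-d689-4d5c-abe0-cf1d1620ecbd")` on the destination host right after the migration would then show whether the data arrives within the 300-second limit. The `fetch`, `clock` and `sleep` parameters exist only so the loop can be exercised without a running host.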

Version-Release number of selected component (if applicable):
vdsm-4.40.20-1.el8ev.x86_64

How reproducible:
100% 

Steps to Reproduce:
1. Run SR-IOV migration test
2. Check netIfaces using vdsm-client before migration
3. Check netIfaces using vdsm-client after migration

Actual results:
2. netIfaces reported fine
3. netIfaces isn't reported at all after migration. It appears only after many minutes, more than 5.

Expected results:
netIfaces must be reported as expected after VM migration.

Additional info:
This is blocking the SR-IOV migration test on all HW. Looks like a regression in the new vdsm-client.

Comment 3 Michael Burman 2020-06-28 15:17:55 UTC
This looks very similar to something we had in the past with VM start, where stats were not reported for 5 minutes: BZ 1680398.
It looks like another scenario of that bug, this time after migration: netIfaces in stats is not reported for more than 5 minutes, and our test timeout is 300 seconds.

Comment 4 Arik 2020-06-28 17:20:07 UTC

(In reply to Michael Burman from comment #3)
> Looks very much similar to something we had in the past with VM start and
> that stats were not reported for 5 minutes. BZ 1680398

Yes, indeed looks similar - could be that the frequency of sampling the guest-agent at the destination is every 5 min.
In that case, we can maybe increase the frequency as we do when booting the VM.

> Looks like this is another scenario of this bug after migration, netIfaces
> in stats not reported more than 5 minutes, our test limit timeout is 300
> seconds.

I guess 300 sec since starting the migration, right?
If so, can we change the countdown to start when the migration ends?
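
The idea of sampling the guest agent more often for a while, as is done at boot, can be illustrated with a small scheduling function. This is only a sketch of the concept and not vdsm's implementation: the function names are hypothetical, the 300-second slow interval reflects the 5-minute guess above, and the fast interval and window length are made-up values.

```python
def next_poll_delay(seconds_since_event, fast_interval=5,
                    fast_window=300, slow_interval=300):
    """Return how long to wait before querying the guest agent again.

    seconds_since_event is the time since the VM booted or since an
    incoming migration finished. Inside the fast window the agent is
    polled often; afterwards the regular slow interval applies.
    """
    if seconds_since_event < fast_window:
        return fast_interval
    return slow_interval


def poll_times(horizon, **kwargs):
    """List the offsets, in seconds after the event, at which polls happen."""
    times, now = [], 0
    while now <= horizon:
        times.append(now)
        now += next_poll_delay(now, **kwargs)
    return times
```

With these placeholder values, `poll_times(20)` yields polls at 0, 5, 10, 15 and 20 seconds, so guest data would show up within seconds of the migration finishing instead of at the next 5-minute sample.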

Comment 5 Michael Burman 2020-06-29 05:47:19 UTC
(In reply to Arik from comment #4)
> 
> (In reply to Michael Burman from comment #3)
> > Looks very much similar to something we had in the past with VM start and
> > that stats were not reported for 5 minutes. BZ 1680398
> 
> Yes, indeed looks similar - could be that the frequency of sampling the
> guest-agent at the destination is every 5 min.
> In that case, we can maybe increase the frequency as we do when booting the
> VM.
> 
> > Looks like this is another scenario of this bug after migration, netIfaces
> > in stats not reported more than 5 minutes, our test limit timeout is 300
> > seconds.
> 
> I guess 300 sec since starting the migration, right?
> If so, can we change the countdown to start when the migration ends?

Hi Arik.
300 sec after migration ends. Not on start.

Comment 6 Tomáš Golembiovský 2020-06-29 08:46:07 UTC
It looks like it takes 5 minutes before we even query the agent for capabilities, but I can't tell why because the logs are missing debug info. Please enable the DEBUG level for the 'vds' handler.

Comment 7 Arik 2020-06-29 08:59:08 UTC
(In reply to Michael Burman from comment #5)
> 300 sec after migration ends. Not on start.

But on the destination I see:
The migration started at 15:05:44
The migration finished at 15:05:58
Data was received from the guest-agent at 15:10:46
The next call to 'getStats' was at 15:11:47 (previous call at 15:10:27)

So had the test queried the stats between 15:10:46 and 15:10:58, it should have got the guest agent data.
Can the test query the data also when the timeout occurs?
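
The timing above can be checked with a few lines of arithmetic. The timestamps are the ones listed in this comment; the 300-second timeout counted from the end of the migration is taken from comment 5. Variable names are illustrative only.

```python
from datetime import datetime, timedelta


def at(hms):
    """Parse an HH:MM:SS timestamp from the destination host log."""
    return datetime.strptime(hms, "%H:%M:%S")


migration_finished = at("15:05:58")
agent_data_received = at("15:10:46")
prev_get_stats = at("15:10:27")
next_get_stats = at("15:11:47")
test_timeout = timedelta(seconds=300)

# The test gives up 300 seconds after the migration finished.
deadline = migration_finished + test_timeout
# How long the destination had no guest agent data.
gap_without_data = (agent_data_received - migration_finished).total_seconds()
# How long the data was available before the test deadline.
window_before_deadline = (deadline - agent_data_received).total_seconds()
# Distance between the two getStats calls around the deadline.
polling_gap = (next_get_stats - prev_get_stats).total_seconds()

print(deadline.strftime("%H:%M:%S"))  # 15:10:58
print(gap_without_data)               # 288.0
print(window_before_deadline)         # 12.0
print(polling_gap)                    # 80.0
```

So the data arrived 288 seconds after the migration finished, leaving a 12-second window before the deadline, while the two surrounding `getStats` calls were 80 seconds apart and both fell outside that window.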

Comment 8 Michael Burman 2020-06-29 09:05:47 UTC
(In reply to Arik from comment #7)
> (In reply to Michael Burman from comment #5)
> > 300 sec after migration ends. Not on start.
> 
> But on the destination I see:
> The migration started at 15:05:44
> The migration finished at 15:05:58
> Data was received from the guest-agent at 15:10:46
> The next call to 'getStats' was at 15:11:47 (previous call at 15:10:27)
> 
> So had the test queried the stats between 15:10:46 and 15:10:58, it should
> have got the guest agent data.
> Can the test query the data also when the timeout occurs?

Why do we need to query the data again after 5 minutes? 5 minutes is just a threshold, and in my opinion too big. I want to get the IP from vdsm-client right after the migration is done, as it always worked.
I'm not going to change our code; that would be wrong. I proved that vdsm-client doesn't show netIfaces for a long time: the data is available but not reported in vdsm-client.
This should be fixed.
I will provide env for investigation.

Comment 9 Michal Skrivanek 2020-06-29 09:47:22 UTC
This is not a recent regression; it is a behavior change from the move to qemu-guest-agent in RHEL 8. RHEL 7 VMs shouldn't exhibit this behavior.

Comment 10 Michal Skrivanek 2020-06-29 10:00:50 UTC
Decreasing severity accordingly. Again, for tests you can modify the interval until this is fixed.

Comment 12 Michal Skrivanek 2020-06-29 11:34:16 UTC
Can you please take a look at the REST API information? Does it exhibit the same behavior? vdsm-client is only an internal tool.

Comment 13 Tomáš Golembiovský 2020-06-29 11:50:53 UTC
I would also say this is not a new issue. Either way, this has been improved a little recently. Please retest when there is a new vdsm build.

Comment 14 Michael Burman 2020-06-29 11:56:56 UTC
Can you please upload the relevant patches that should fix this? I really don't understand it. We see it for the first time on vdsm-4.40.20-1.el8ev.x86_64, and never before in the migration scenario. We saw it many times on VM boot, but not on migration. I know you touched this area of the code lately, so I'm not sure how this is not a new issue.

Comment 15 Michael Burman 2020-06-29 12:38:35 UTC
(In reply to Michal Skrivanek from comment #12)
> can you please take a look at REST API information? Does it exhibit the same
> behavior? vdsm-client is only an internal tool

REST has the same issue: no IPv4 address is reported after migration for more than 5 minutes. eth0 and bond1 are also missing during this time.

before migration:
 <ips>
                    <ip>
                        <address>x.x.x.x</address>
                        <version>v4</version>
                    </ip>

after migration
No IP for more than 5 minutes. 

After this time, all info is available in REST and in vdsm-client.
On the guest, the VM has an IP and all devices exist during the whole time.
So this is an issue with pulling and querying this info from the guest agent, as it was on VM start.

BTW, I know that vdsm-client is an internal tool, but all of QE has been using vdsm-client in their tests for a long time.

Comment 16 Tomáš Golembiovský 2020-06-29 12:47:22 UTC
It looks like Michael is right in a way. This used to sort of work because of a bug in vdsm. When we fixed the bug with https://gerrit.ovirt.org/#/c/109496/ we "broke" the behavior after migration. But as I said in comment 13, there was another change improving the behavior: https://gerrit.ovirt.org/#/c/108935/

So please re-check after next vdsm build.

Comment 17 Arik 2020-06-29 13:32:54 UTC
Shouldn't it also be part of the "state" that is passed from the source to the destination?
That way we can avoid a gap in which the data is not reported

Comment 18 Milan Zamazal 2020-06-30 19:49:12 UTC
(In reply to Arik from comment #17)
> Shouldn't it also be part of the "state" that is passed from the source to
> the destination?
> That way we can avoid a gap in which the data is not reported

As discussed offline, it would be better to let Engine recognize that the current guest agent stats are not valid and the last available stats should be reused, until proper stats are received. Tomáš, would any changes on the Vdsm side be required for that or is it just needed to make a corresponding change in Engine?

Comment 19 Tomáš Golembiovský 2020-07-03 07:54:47 UTC
(In reply to Milan Zamazal from comment #18)
> (In reply to Arik from comment #17)
> > Shouldn't it also be part of the "state" that is passed from the source to
> > the destination?
> > That way we can avoid a gap in which the data is not reported
> 
> As discussed offline, it would be better to let Engine recognize that the
> current guest agent stats are not valid and the last available stats should
> be reused, until proper stats are received. Tomáš, would any changes on the
> Vdsm side be required for that or is it just needed to make a corresponding
> change in Engine?

No, I don't think it would require any changes in vdsm. For example, comparing the content of os-info in guest stats should serve the purpose well.
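
A rough illustration of the idea from comments 18 and 19, reusing the last valid guest agent data while the incoming stats carry none. This is a sketch only, written in Python rather than as Engine code: the function name is hypothetical, `guestOsInfo` is an assumed key name for the os-info content, and the real check in Engine may compare os-info differently.

```python
# Assumed key names for guest agent data in the stats dictionary.
GUEST_FIELDS = ("guestOsInfo", "netIfaces")


def merge_guest_stats(last_valid, incoming, os_info_key="guestOsInfo"):
    """Return stats to report, reusing old guest data while new data is missing.

    If the incoming stats contain os-info, the guest agent has been queried
    on this host and the incoming stats are used as they are. Otherwise the
    guest fields from the last valid stats are carried over.
    """
    if incoming.get(os_info_key):
        return dict(incoming)
    merged = dict(incoming)
    for key in GUEST_FIELDS:
        if key in last_valid and not merged.get(key):
            merged[key] = last_valid[key]
    return merged
```

For example, if the stats before migration contained `netIfaces` with `bond1` and the first stats from the destination have an empty `netIfaces` and no os-info, the merged result still reports `bond1` until the destination delivers its own guest data.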

Comment 20 Arik 2020-07-05 10:33:26 UTC
Filed bz 1853897 about preventing the guest agent information from being cleared during live migration (comments 17-19).

This issue should be resolved with latest build.

Comment 21 Michael Burman 2020-07-06 05:46:07 UTC
(In reply to Arik from comment #20)
> Filed bz 1853897 about preventing the guest agent information from being
> cleared during live migration (comments 17-19).
> 
> This issue should be resolved with latest build.

Thanks Arik. Indeed the issue seems to be resolved with the latest build, rhvm-4.4.1.7-0.3.el8ev.noarch and vdsm-4.40.22-1.el8ev.
The issue no longer reproduces.
Can we please attach the relevant patch or patches that resolved this issue to this bug, for future reference? Tomáš mentioned a patch that might have fixed it in comment 16, and I would like to have it attached to the bug as a link.
Thanks,

Comment 22 Sandro Bonazzola 2020-07-08 08:26:25 UTC
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

