Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1600248

Summary: [RFE] Host should check if local MTU is bigger than switch's
Product: Red Hat Enterprise Virtualization Manager
Reporter: Siddhant Rao <sirao>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED DEFERRED
QA Contact: Meni Yakove <myakove>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.4
CC: dholler, germano, lsurette, mburman, michal.skrivanek, mkalinin, mperina, mzamazal, rbarry, srevivo, ycui
Target Milestone: ---
Keywords: FutureFeature, Reopened
Target Release: ---
Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified
OS: Unspecified
Whiteboard: sync-to-jira
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-07 13:49:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Siddhant Rao 2018-07-11 20:27:19 UTC
Description of problem:
Before a migration starts, the source host should ping the destination host using the MTU set for the migration network. If the ping hangs, the migration should fail.

Version-Release number of selected component (if applicable):


How reproducible:
I am unable to reproduce this, as I do not have a switch setup. The customer, however, has reproduced it.

Steps to Reproduce:
The customer used the following steps:

1. Set up a switch that is configured for jumbo frames but fails to forward them.
2. Ping the destination host with a packet size matching the jumbo-frame MTU.

Actual results:
The ping hangs, and the migration is stuck at 0%.

Expected results:
The ping should time out if it hangs, and the migration should fail.


Additional info:

When a migration is initiated, the source host should ping the destination host using the MTU set for the migration network. If the ping hangs, the migration should fail.

I believe we do ping, but that seems to be a plain Host.ping2 call (please correct me if I am wrong). I see the message below on the source host when a migration is initiated:

~~~~~~

2018-06-27 18:00:04,537+0530 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)

~~~~~~

I had a case where the migration network was set to use an MTU of 9000. The switch did support it; however, we soon found that it was failing to forward packets at that MTU. The migration was stuck at 0%.

When we tried to ping the destination host at the MTU set for the migration network, using the command below, it did not return anything and simply hung.

# ping -M do -s 9000 <IP of Destination Host>

I believe normal pings were working.

This showed that the switch was not forwarding frames of that size, even though it should have.

Once we reduced the MTU on the switch back to 1500 (the default), the migration worked as expected.

My question is: why do we not ping using the MTU set for the network before the migration starts?
Or does Host.ping2 already use the MTU set for that network?

If the ping does hang like this, it should time out and be logged, and the migration should fail rather than getting stuck at 0%.

For example:

If the MTU for the migration network is set to 9000, then before the migration is initiated, the source host should ping the destination host using the MTU set for that network, probably with something like:

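# ping -M do -s <MTU for the network minus 28> <IP of Destination Host>

(-s sets the ICMP payload size, not the packet size, so the 20-byte IPv4 header and 8-byte ICMP header have to be subtracted from the MTU; for a 9000 MTU that is -s 8972.)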

Based on the above, if the ping times out, the migration should fail rather than hanging at 0%.
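A pre-migration check along the lines proposed above could look like the following Python sketch. It is not vdsm code. The function names (icmp_payload_for_mtu, build_mtu_ping_cmd, path_supports_mtu) are hypothetical, and the sketch assumes Linux iputils ping, IPv4, and no IP options. Because -s sets the ICMP payload rather than the packet size, the payload is the MTU minus 28 bytes. The -c and -W options bound how long the probe can run, so a path that silently drops jumbo frames produces a failure instead of a hang.

```python
import subprocess
from typing import List

# IPv4 header without options (20 bytes) + ICMP echo header (8 bytes).
IPV4_ICMP_OVERHEAD = 28


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_ICMP_OVERHEAD


def build_mtu_ping_cmd(dest: str, mtu: int, count: int = 3,
                       timeout_s: int = 2) -> List[str]:
    """Hypothetical helper: build a full-MTU, Don't-Fragment ping command."""
    # -M do: set Don't Fragment, so an oversized packet is not fragmented
    # -s:    ICMP payload size in bytes
    # -c:    stop after `count` echo requests (ping never runs forever)
    # -W:    seconds to wait for each reply
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)),
            "-c", str(count), "-W", str(timeout_s), dest]


def path_supports_mtu(dest: str, mtu: int) -> bool:
    """Hypothetical pre-migration check: True if a full-MTU DF ping is answered."""
    cmd = build_mtu_ping_cmd(dest, mtu)
    try:
        # Outer timeout as a safety net in case ping itself misbehaves.
        result = subprocess.run(cmd, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # ping missing/not executable, or the probe did not finish in time.
        return False
    # With -c and no -w deadline, iputils ping exits 0 only if
    # at least one reply was received.
    return result.returncode == 0
```

For example, the source host could refuse to start the migration and log the reason when path_supports_mtu("<IP of Destination Host>", 9000) returns False. For IPv6 the overhead would be 48 bytes (40-byte IPv6 header plus 8-byte ICMPv6 header) rather than 28.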

Let me know if anything is missing or needed.

Regards,
Siddhant Rao

Comment 3 Dan Kenigsberg 2018-07-13 23:03:54 UTC
I don't think the problem is limited to migration. If the host has an MTU that is bigger than that of the switch, we'd see problems in any other network role as well.

It makes sense to alert users if they set an MTU that is bigger than that of the switch. This can be done with a ping as suggested here, or by trusting the MTU value reported by the switch over LLDP.
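The LLDP approach mentioned above could be sketched in Python as follows. The sketch assumes the switch's limit has already been obtained from an LLDP agent, for example from the IEEE 802.3 Maximum Frame Size TLV. How that value is collected is not shown, and vendors differ in whether their limit counts the Ethernet header and FCS. The helper names are hypothetical. The conversion assumes the reported frame size includes the 14-byte Ethernet header and 4-byte FCS, so a 1518-byte frame corresponds to a 1500-byte MTU. The host side reads the standard Linux sysfs attribute /sys/class/net/<iface>/mtu.

```python
from pathlib import Path
from typing import Optional

# Untagged Ethernet: 14-byte header + 4-byte FCS around the L3 payload.
ETH_OVERHEAD = 18
# An 802.1Q VLAN tag adds 4 more bytes to the frame.
VLAN_TAG = 4


def read_local_mtu(iface: str, sysfs: Path = Path("/sys/class/net")) -> int:
    """Read the configured MTU of a Linux network interface from sysfs."""
    return int((sysfs / iface / "mtu").read_text().strip())


def switch_mtu_from_max_frame(max_frame_size: int,
                              vlan_tagged: bool = False) -> int:
    """Hypothetical helper: convert a switch's max frame size to an L3 MTU.

    Assumes the frame size includes Ethernet header and FCS.
    """
    overhead = ETH_OVERHEAD + (VLAN_TAG if vlan_tagged else 0)
    return max_frame_size - overhead


def mtu_warning(iface: str, host_mtu: int,
                switch_mtu: Optional[int]) -> Optional[str]:
    """Hypothetical check: warning text if host MTU exceeds the switch's."""
    if switch_mtu is None:
        # The switch advertised nothing, so there is nothing to compare.
        return None
    if host_mtu > switch_mtu:
        return (f"{iface}: host MTU {host_mtu} exceeds switch MTU "
                f"{switch_mtu}; larger frames will likely be dropped")
    return None
```

A host agent could call mtu_warning(iface, read_local_mtu(iface), switch_mtu) for each network interface and surface any non-None result to the user. Treating a missing LLDP value as "unknown" rather than as an error keeps the check harmless on switches that do not advertise a frame size.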

Comment 6 Dan Kenigsberg 2018-12-14 19:36:12 UTC
https://gerrit.ovirt.org/#/q/I0dc1f80d4e4f704d6be6af2cbaeaf1fce6d24343 already did that.

*** This bug has been marked as a duplicate of bug 1515877 ***

Comment 7 Dan Kenigsberg 2018-12-14 19:37:09 UTC
wrong tab!

Comment 11 Michal Skrivanek 2020-03-18 15:43:34 UTC
This bug didn't get any attention for a while; we didn't have the capacity to make any progress. If you care deeply about it or want to work on it, please assign/target accordingly.

Comment 12 Michal Skrivanek 2020-03-18 15:46:48 UTC
This bug didn't get any attention for a while; we didn't have the capacity to make any progress. If you care deeply about it or want to work on it, please assign/target accordingly.

Comment 13 Dominik Holler 2020-03-24 14:14:19 UTC
Is the user notified in a reasonable way, within a short time after starting a migration that will time out?

Comment 14 Ryan Barry 2020-03-24 15:19:43 UTC
Well, we try to scale migrations with postcopy in order to get a sliding window. If it never gets going at all, I'm not sure we report anything. Milan?

Comment 15 Milan Zamazal 2020-03-24 16:38:33 UTC
A qualified answer would have to come from libvirt/QEMU developers, but a little experiment doesn't hurt. I started a migration and blocked incoming packets on the destination. The migration failed after about 2 minutes with "unable to connect to server at '...:49152': Connection timed out" or after 30 seconds with "Lost connection to destination host" in vdsm.log, depending on whether I blocked the connection before starting the migration or while it was in progress. So I'd say the answer to Dominik's question above is "yes", although Engine just reports a migration failure without further explanation.

Comment 17 Michal Skrivanek 2020-04-01 14:41:13 UTC
Does anyone want to formally test it? Otherwise I'd close it as CURRENTRELEASE.

Comment 18 Michal Skrivanek 2020-06-23 12:34:11 UTC
This request is not currently committed to 4.4.z, moving it to 4.5

Comment 19 Dominik Holler 2020-07-07 13:49:56 UTC
Please re-open, if this is still required.