Bug 1564590

Summary: Cannot add host to clean install of oVirt (when ovirtmgmt interface has MTU of 9000)
Product: [oVirt] ovirt-engine
Reporter: james.mclaren.open
Component: BLL.Network
Assignee: Martin Perina <mperina>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Meni Yakove <myakove>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.2.2.3
CC: bugs, james.mclaren.open, mburman, mperina, myakove
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-06 11:58:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: Logs etc (flags: none)

Description james.mclaren.open 2018-04-06 17:23:53 UTC
Created attachment 1418254 [details]
Logs etc

Description of problem:

(1) Clean install of Centos on host hardware.
(2) yum -y update
(3) yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm
(4) mkdir /root/.ssh
(5) vi /root/.ssh/authorized_keys # Insert the public key from the oVirt Engine GUI > Hosts > New
(6) Automated installation on host commences ...
(7) After a long time (~30 mins), the web interface for the host install shows the status as "Host Unresponsive" or "Activating"
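For reference, steps 4-5 can be scripted roughly as below (a sketch only; the key string is a placeholder for the public key copied from the Engine GUI under Hosts > New):

    mkdir -p /root/.ssh && chmod 700 /root/.ssh
    # Paste the engine's public key shown in the GUI (Hosts > New); the value below is a placeholder
    echo 'ssh-rsa AAAA...engine-public-key' >> /root/.ssh/authorized_keys
    chmod 600 /root/.ssh/authorized_keys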

Version-Release number of selected component (if applicable):
See attached logs

How reproducible:
100%


Steps to Reproduce:
See 1-7 above

Actual results:
The web interface shows the host install status as "Host Unresponsive" or "Activating"

Expected results:
Host install succeeds


Additional info:

Comment 1 Martin Perina 2018-04-09 08:59:04 UTC
Could you please attach the engine logs?

Comment 2 james.mclaren.open 2018-04-09 19:37:56 UTC
Hello Martin,

The engine.log file is in the original attachment: https://bugzilla.redhat.com/attachment.cgi?id=1418254

Comment 3 Yaniv Kaul 2018-04-10 10:42:34 UTC
Are you sure ovirt-host1.localdomain is really DNS resolvable by the Engine?

Comment 4 Martin Perina 2018-04-10 11:17:19 UTC
(In reply to james.mclaren.open from comment #2)
> Hello Martin,
> 
> The engine.log file is in the original attachment:
> https://bugzilla.redhat.com/attachment.cgi?id=1418254

The logs are not complete. I can see the 1st installation failure at 2018-04-05 20:22:49, for which we don't have logs and which was strangely interrupted by a "No route to host" exception.
Then I can see the 2nd attempt at 2018-04-05 20:37:45, which again failed quite soon due to a "No route to host" exception.
The 3rd one started at 2018-04-05 20:58:15,574+01 and failed due to an SSH timeout.
The 4th one started at 2018-04-06 11:26:22,926+01 and again failed due to an SSH timeout error.
And there are a bunch of others ...
So what's your current status? Is it possible to remove the host from the engine, install the OS on it from scratch, try to add it to the engine again, and in case of failure attach complete SOS reports from both the engine and the host?
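For reference, a minimal sketch of how such reports are usually gathered, assuming the standard sos and ovirt-log-collector packages are installed; exact options can vary by version:

    # On the failing host (as root): generate an SOS report
    yum install -y sos
    sosreport            # archive is written under /var/tmp/ by default

    # On the engine machine (as root): collect engine-side logs
    yum install -y ovirt-log-collector
    ovirt-log-collector  # prompts for engine credentials and bundles the logs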

Comment 5 Martin Perina 2018-04-16 06:53:59 UTC
Were you able to resolve the issue with adding a completely clean host? If not could you please provide logs requested in Comment 4?

Comment 6 james.mclaren.open 2018-04-16 19:26:11 UTC
After hours of testing I have found what triggers the problem. If the NIC eventually used for the ovirtmanagement network (p2p1 in this case) has an MTU set high (e.g. 9000), the host installation fails. If the MTU is left at the default value (1500), the host installation succeeds.

If you install with a 1500 MTU and later try to increase the MTU to 9000 for the ovirtmanagement network in the Engine web GUI, it fails again.

So the workaround is to leave the NIC that will become the ovirtmanagement network at the default MTU, i.e. 1500.
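For reference, a minimal sketch of the workaround on the host side, assuming the NIC is p2p1 as in this report and an ifcfg-style network configuration; adjust names for your environment:

    # Check the current MTU of the NIC that will carry ovirtmgmt
    ip link show p2p1
    # Reset it to the default before adding the host to the engine
    ip link set dev p2p1 mtu 1500
    # Persist the setting across reboots (ifcfg-style configuration assumed)
    sed -i 's/^MTU=.*/MTU=1500/' /etc/sysconfig/network-scripts/ifcfg-p2p1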

Comment 7 Yaniv Kaul 2018-04-17 06:49:40 UTC
(In reply to james.mclaren.open from comment #6)
> After hours of testing I have found what triggers the problem. If the NIC
> eventually used for the ovirtmanagement network (p2p1 in this case) has a
> MTU set high (e.g. 9000) the host installation fails. If the MTU is left at
> the default value (1500) the host installation succeeds.
> 
> If you install with a 1500 MTU and later try to increase the MTU to 9000 for
> the ovirtmanagement network in the Engine web GUI it fails again.
> 
> So the workaround is to leave the NIC that will become the ovirtmanagement
> network with a default MTU i.e. 1500.

I must admit that I've seen that happening as well. Let me try to reproduce (in OST).

Comment 8 Yaniv Kaul 2018-04-22 10:49:34 UTC
(In reply to Yaniv Kaul from comment #7)
> (In reply to james.mclaren.open from comment #6)
> > After hours of testing I have found what triggers the problem. If the NIC
> > eventually used for the ovirtmanagement network (p2p1 in this case) has a
> > MTU set high (e.g. 9000) the host installation fails. If the MTU is left at
> > the default value (1500) the host installation succeeds.
> > 
> > If you install with a 1500 MTU and later try to increase the MTU to 9000 for
> > the ovirtmanagement network in the Engine web GUI it fails again.
> > 
> > So the workaround is to leave the NIC that will become the ovirtmanagement
> > network with a default MTU i.e. 1500.
> 
> I must admit that I've seen that happening as well. Let me to try to
> reproduce (in OST).

Works fine for me (in OST): I tested setting ovirtmgmt to 9000 and the relevant interfaces to 9000, and the host installed fine, keeping the value.
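For reference, besides the web GUI, the logical network MTU can also be updated through the engine's REST API; a sketch, with hypothetical engine hostname, credentials and network ID:

    # PUT an updated MTU onto the ovirtmgmt logical network (host/credentials/ID are placeholders)
    curl -s -k -u 'admin@internal:PASSWORD' \
         -X PUT -H 'Content-Type: application/xml' \
         -d '<network><mtu>9000</mtu></network>' \
         'https://engine.example.com/ovirt-engine/api/networks/NETWORK_ID'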

Comment 9 Martin Perina 2018-04-27 09:50:49 UTC
James, is it still reproducible in your environment? If so, could you please provide complete logs from the engine and the installed host (using the sos log collector tool) so we can investigate?

Comment 10 james.mclaren.open 2018-04-29 08:39:40 UTC
The installation is now in use (with MTU 1500 on the management NIC and 9000 on the other 3 NICs), so I can't generate new logs at the moment.

I'm not sure why the logs are regarded as incomplete. The engine.log covers about 10 attempts to get the host installed. The last attempt appears in engine.log from approx. 2018-04-06 17:29:58,352+01 onwards, which corresponds to the failed installation log left on the host: ovirt-host-deploy-ansible-20180406174155-ovirt-host1.localdomain-4315862e.log.

It was completely reproducible over ~20+ install attempts, with minor networking tweaks to try to work around the issue.

Comment 11 Martin Perina 2018-05-30 08:48:12 UTC
Meni, could you please try to reproduce this issue?

Comment 12 Martin Perina 2018-07-31 05:23:27 UTC
(In reply to Martin Perina from comment #11)
> Meni, could you please try to reproduce this issue?

Meni, any progress with reproducing this issue?

Comment 13 Martin Perina 2018-08-06 11:58:28 UTC
Closing as insufficient data; feel free to reopen and provide the requested information.

Comment 14 Red Hat Bugzilla 2023-09-14 04:26:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.