Bug 1404120 - Add host to engine failed via static bond(mode 4) configured by cockpit
Summary: Add host to engine failed via static bond(mode 4) configured by cockpit
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: ---
Assignee: Edward Haas
QA Contact: dguo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-13 06:34 UTC by dguo
Modified: 2017-02-28 01:50 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-08 15:37:05 UTC
oVirt Team: Network
Embargoed:
ylavi: ovirt-4.1+


Attachments
ifcfg files/engine log/host var logs/deploy log (799.49 KB, application/x-gzip)
2016-12-13 06:34 UTC, dguo
ifcfg files,vdsm,supervdsm log before and after add host to engine (50.24 KB, application/x-bzip)
2016-12-13 11:38 UTC, Yihui Zhao
new issue attachment (380.79 KB, application/x-gzip)
2016-12-22 03:16 UTC, dguo


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1403717 0 medium CLOSED The NetworkManager status is still active(running) after HostedEngine deployed successfully. 2021-02-22 00:41:40 UTC

Internal Links: 1403717

Description dguo 2016-12-13 06:34:28 UTC
Created attachment 1231071 [details]
ifcfg files/engine log/host var logs/deploy log

Description of problem:
Adding an RHVH host to the engine over a static bond (mode 4) fails. The bond was configured through Cockpit.

Version-Release number of selected component (if applicable):
redhat-virtualization-host-4.0-20161206.0.x86_64
imgbased-0.8.11-0.1.el7ev.noarch
vdsm-4.18.18-1.el7ev.x86_64
rhvm Version: 4.0.6.2-0.1.el7ev

How reproducible:
100%

Steps to Reproduce:
1. Install RHVH
2. Reboot RHVH, log in to Cockpit, and open the Networking page.
3. Check the network info: em1 has an IP address (10.73.131.65/23).
4. Set up bond0 with a static IP (10.73.128.238/24) over two NICs (p7p1 and p7p2) in mode 4 (802.3ad).
5. Run ifdown on em1 so that multiple default routes do not make bond0 unreachable.
6. Add the host to the engine using the bond0 IP (10.73.128.238).
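
For reference, the static bond from step 4 could be written as ifcfg files roughly like the ones this Python sketch renders. The key names are the usual initscripts keys, but the miimon value and the exact set of keys that Cockpit/NetworkManager writes are assumptions here, not taken from the attached files; the function names are hypothetical.

```python
def render_bond_ifcfg(device, ipaddr, prefix, mode="802.3ad", miimon=100):
    """Return the text of an ifcfg file for a statically addressed bond.

    miimon=100 is an assumed value, not read from the attached ifcfg files.
    """
    lines = [
        f"DEVICE={device}",
        "TYPE=Bond",
        "BONDING_MASTER=yes",
        f'BONDING_OPTS="mode={mode} miimon={miimon}"',
        "BOOTPROTO=none",
        f"IPADDR={ipaddr}",
        f"PREFIX={prefix}",
        "ONBOOT=yes",
    ]
    return "\n".join(lines) + "\n"


def render_slave_ifcfg(device, master):
    """Return the text of an ifcfg file that enslaves a NIC to a bond."""
    lines = [
        f"DEVICE={device}",
        f"MASTER={master}",
        "SLAVE=yes",
        "ONBOOT=yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values from the steps to reproduce.
    print(render_bond_ifcfg("bond0", "10.73.128.238", 24))
    for nic in ("p7p1", "p7p2"):
        print(render_slave_ifcfg(nic, "bond0"))
```

Comparing this against the real ifcfg-bond0 before and after the add-host attempt may show which keys were rewritten.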

Actual results:
1. After step 6, adding the host to the engine fails after a long wait. RHVM reports the error "Host dell-per730-35 installation failed. SSH session timeout host 'root.128.238'"
2. After step 6, the network info on the host shows no ovirtmgmt network; bond0 still has its IP (10.73.128.238), and the IP address (10.73.131.65) on em1 has been restored.

Expected results:
1. After step 6, RHVH is added to the engine successfully.

Additional info:
1. With the same test steps, the issue does not occur on a DHCP bond (mode 4).

Comment 1 Fabian Deutsch 2016-12-13 09:35:20 UTC
Was the switch configured properly?
Could you try to reproduce with a different network on em1 or a disabled em1?

Comment 2 Edward Haas 2016-12-13 10:33:23 UTC
If it passed the deployment step and a setup network command has been sent from Engine to VDSM, please attach the list of ifcfg files before and after, plus the vdsm/supervdsm logs.

Comment 3 Yihui Zhao 2016-12-13 11:38:56 UTC
Created attachment 1231168 [details]
ifcfg files,vdsm,supervdsm log before and after add host to engine

Comment 4 Yihui Zhao 2016-12-13 11:41:32 UTC
(In reply to Edward Haas from comment #2)
> If it passed the deployment step and a setup network has been sent from
> Engine to VDSM, please add the list of ifcfg files before and after +
> vdsm/supervdsm logs.

Hi Edward,
The ifcfg files and the vdsm and supervdsm logs from before and after adding the host to the engine are in the attachment.

logs:

https://bugzilla.redhat.com/attachment.cgi?id=1231168

Thanks,
Yihui

Comment 5 Edward Haas 2016-12-13 14:00:44 UTC
It did not pass the deployment step; no setup network command ever arrived at VDSM.

Comment 6 dguo 2016-12-14 03:17:51 UTC
(In reply to Fabian Deutsch from comment #1)
> Was the switch configured properly?
> Could you try to reproduce with a different network on em1 or a disabled em1?

a. The switch was configured properly.

b. My description of em1 may have been confusing: em1 is unrelated to bond0, which is configured only over p7p1 and p7p2.

em1 was only used to reach Cockpit. Since em1 and bond0 both have public IP addresses, we disabled em1 to keep it from interfering with bond0.

Comment 7 Edward Haas 2016-12-19 08:16:19 UTC
It looks like connectivity was lost during the ovirt-host-deploy run.
We saw something similar when NetworkManager was disabled and the bond and VLANs were lost (it wiped out these logical interfaces).
This should have been fixed in RHEL 7.3.1.

Could you please check whether this is indeed NM related?
You can try the following:
Before step #6 (adding the host to Engine), confirm connectivity through bond0, stop the NM service, and check connectivity again.
If connectivity fails, try to find out why. (Does the bond0 device exist? Does it have an address? Is it up?)
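
A minimal Python sketch of the device checks, assuming the standard Linux sysfs layout (/sys/class/net/<dev>/operstate and /sys/class/net/<dev>/bonding/slaves). The function name and the sysfs_root parameter are hypothetical and exist only to make the check easy to try out.

```python
import os


def bond_state(name="bond0", sysfs_root="/sys/class/net"):
    """Report whether the device exists, its operstate, and its slaves."""
    dev = os.path.join(sysfs_root, name)
    if not os.path.isdir(dev):
        return {"exists": False, "operstate": None, "slaves": []}

    def read(relpath):
        # Return the stripped file content, or None if it cannot be read.
        try:
            with open(os.path.join(dev, relpath)) as f:
                return f.read().strip()
        except OSError:
            return None

    slaves = read("bonding/slaves")
    return {
        "exists": True,
        "operstate": read("operstate"),
        "slaves": slaves.split() if slaves else [],
    }


if __name__ == "__main__":
    print(bond_state("bond0"))
```

sysfs does not expose IP addresses, so the "does it have an address" question still needs something like `ip addr show bond0`.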

Comment 8 Edward Haas 2016-12-19 08:31:03 UTC
Didi just mentioned that NM was not successfully stopped: it came back up immediately afterwards.

Try to stop NM before the host deploy and verify that it was indeed stopped (and not started again).
Cockpit, or any other application that uses NM, should not be running when NM is disabled.
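
One way to verify that NM stays down is to poll the service for a while after stopping it. This is only a sketch: it relies on `systemctl is-active --quiet NetworkManager` returning 0 when the unit is active; the function names, the 60 s window, and the 5 s interval are arbitrary assumptions.

```python
import subprocess
import time


def nm_is_active():
    """True if systemd reports the NetworkManager unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "NetworkManager"]
    )
    return result.returncode == 0


def stays_stopped(duration=60.0, interval=5.0,
                  is_active=nm_is_active, sleep=time.sleep):
    """Poll for `duration` seconds; return False as soon as the service is seen active."""
    checks = int(duration // interval) + 1
    for i in range(checks):
        if is_active():
            return False
        if i < checks - 1:
            sleep(interval)
    return True


if __name__ == "__main__":
    # Run this right after stopping NM and before adding the host.
    print("NM stayed stopped:", stays_stopped())
```

If this reports False, something (Cockpit, for example) is probably restarting the service.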

Comment 9 dguo 2016-12-22 03:15:45 UTC
(In reply to Edward Haas from comment #7)
> It looks like the connectivity has been lost during the ovirt-host-deploy
> run.
> We had something similar when NetworkManager was disabled, losing the bond
> and vlans. (it wiped out these logical interfaces)
> This should have been fixed in RHEL 7.3.1.
> 
> Could you please try to check if this is indeed NM related or not?
> You can try to do this:
> Before step #6 (adding the host to Engine), confirm connectivity through
> bond0 and stop NM service, checking connectivity again.

In this test, bond0 was pingable both before and after NM was stopped.

NM was indeed stopped. After that, adding the host to the engine still failed, and bond0 was then no longer pingable; its state was DOWN.


> If the connectivity fails, try to check why. (Does the bond0 device exist?
> Does it have an address? Is it up?)

Comment 10 dguo 2016-12-22 03:16:33 UTC
Created attachment 1234598 [details]
new issue attachment

Comment 11 Edward Haas 2017-01-18 09:08:20 UTC
From the supervdsm logs, all steps seem to pass successfully up to the connectivity check.
After bridge networking was set up on the host (a bridge over bond0 with two slaves), no engine pings were detected for 120 sec and a rollback was issued.

The only thing I can think of is that when VDSM reconfigured bond0 to mode 4, something changed that prevented the connection from being restored.
Perhaps the bond protocol failed to renegotiate with the switch, or the MAC address of the bond was swapped (between the slaves).

Please try to observe the host during the 120 sec after the setupNetwork has been issued, and check whether anything changed, such as the MAC address.

I would also suggest testing this on the latest 4.1, since Engine connectivity issues have been handled there; perhaps it will help.
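
A rough Python sketch for the MAC comparison: take one snapshot before adding the host and another inside the 120 sec window, then diff them. It assumes the standard /sys/class/net/<dev>/address files; the function names are hypothetical, and the interface names are the ones from this report.

```python
import os


def read_macs(ifaces, sysfs_root="/sys/class/net"):
    """Map each interface name to its current MAC, or None if unreadable."""
    macs = {}
    for iface in ifaces:
        try:
            with open(os.path.join(sysfs_root, iface, "address")) as f:
                macs[iface] = f.read().strip().lower()
        except OSError:
            macs[iface] = None
    return macs


def changed_macs(before, after):
    """Return {iface: (old, new)} for every interface whose MAC differs."""
    return {
        iface: (before.get(iface), after.get(iface))
        for iface in sorted(set(before) | set(after))
        if before.get(iface) != after.get(iface)
    }


if __name__ == "__main__":
    ifaces = ["bond0", "p7p1", "p7p2"]
    before = read_macs(ifaces)
    input("Add the host in Engine, then press Enter within the 120 sec window... ")
    print(changed_macs(before, read_macs(ifaces)))
```

An empty result means the MAC addresses did not change, which would point more towards the negotiation with the switch.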

Comment 12 dguo 2017-01-18 10:03:49 UTC
(In reply to Edward Haas from comment #11)
> From the supervdsm logs, all the steps seem to pass successfully up until
> the connectivity check.
> After the bridge networking has been set on the host (bridge over bond0 with
> two slaves) no engine pings have been detected for 120sec and a rollback has
> been issued.
> 
> The only thing that I can think of is that when bond0 has been reconfigured
> by VDSM to mode 4, something changed that could not restore the connection.
> Perhaps the bond protocol has failed to re-negotiate with the switch or the
> mac address of the bond has been swapped (between the slaves).
> 
> Please try to catch this 120sec after the setupNetwork has been issued, and
> check if anything changed, like the mac address.
> 
> I would also suggest testing this on latest 4.1, so Engine connectivity
> issues have been handled there, perhaps it will help.

Okay. The test environment is not ready at the moment; I will try later and report back.

Comment 13 Dan Kenigsberg 2017-02-08 15:37:05 UTC
Please reopen if this reproduces, and provide the information requested in comment 11

