Created attachment 1231071 [details]
ifcfg files/engine log/host var logs/deploy log

Description of problem:
Adding RHVH to the engine over a static bond (mode 4) failed. The bond was configured via cockpit.

Version-Release number of selected component (if applicable):
redhat-virtualization-host-4.0-20161206.0.x86_64
imgbased-0.8.11-0.1.el7ev.noarch
vdsm-4.18.18-1.el7ev.x86_64
rhvm Version: 4.0.6.2-0.1.el7ev

How reproducible:
100%

Steps to Reproduce:
1. Install RHVH
2. Reboot RHVH, log in to cockpit, and open the Networking page
3. Check the network info; em1 has an IP address (10.73.131.65/23)
4. Set up bond0 with a static IP (10.73.128.238/24) over two NICs (p7p1 and p7p2) with mode 4 (802.3ad)
5. ifdown em1 so that multiple default routes do not make bond0 unpingable
6. Add the host to the engine via the bond0 IP (10.73.128.238)

Actual results:
1. After step #6, after a long wait, adding the host to the engine failed. The error on rhvm is "Host dell-per730-35 installation failed. SSH session timeout host 'root.128.238'"
2. After step #6, the network info on the host shows no ovirtmgmt, bond0 still has its IP (10.73.128.238), and the IP address (10.73.131.65) on em1 is restored.

Expected results:
1. After step #6, RHVH is added to the engine successfully

Additional info:
1. With the same test steps, there is no such issue with a DHCP bond (mode 4)
Was the switch configured properly?
Could you try to reproduce with a different network on em1 or a disabled em1?
If it passed the deployment step and a setup network has been sent from Engine to VDSM, please add the list of ifcfg files before and after + vdsm/supervdsm logs.
Created attachment 1231168 [details] ifcfg files,vdsm,supervdsm log before and after add host to engine
(In reply to Edward Haas from comment #2)
> If it passed the deployment step and a setup network has been sent from
> Engine to VDSM, please add the list of ifcfg files before and after +
> vdsm/supervdsm logs.

Hi Edward,

The ifcfg files and the vdsm and supervdsm logs from before and after adding the host to the engine are attached.

logs: https://bugzilla.redhat.com/attachment.cgi?id=1231168

Thanks,
Yihui
It did not pass the deployment step; no setup network ever arrived at VDSM.
(In reply to Fabian Deutsch from comment #1)
> Was the switch configured properly?
> Could you try to reproduce with a different network on em1 or a disabled em1?

a. The switch was configured properly.
b. My description of em1 may have been confusing. em1 is unrelated to bond0, which is configured only over p7p1 and p7p2. em1 was only used to reach cockpit. Since em1 and bond0 both have public IP addresses, we disabled em1 so that it would not interfere with bond0.
It looks like the connectivity has been lost during the ovirt-host-deploy run.
We had something similar when NetworkManager was disabled, losing the bond and vlans (it wiped out these logical interfaces).
This should have been fixed in RHEL 7.3.1.

Could you please try to check if this is indeed NM related or not?
You can try to do this:
Before step #6 (adding the host to Engine), confirm connectivity through bond0, then stop the NM service and check connectivity again.
If the connectivity fails, try to check why. (Does the bond0 device exist? Does it have an address? Is it up?)
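The bond0 questions above (does it exist, is it up, which slaves does it have) can be captured as one snapshot taken before and after stopping NM. This is only a minimal sketch, assuming the standard Linux sysfs layout under /sys/class/net; the helper name `bond_status` is hypothetical. sysfs does not expose IP addresses, so the address still has to be checked separately with `ip addr show bond0`.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def bond_status(name="bond0", sysfs_root="/sys/class/net"):
    """Snapshot basic bond state from sysfs (hypothetical helper)."""
    dev = Path(sysfs_root) / name
    if not dev.is_dir():
        # The device is gone, e.g. wiped out when NM was stopped.
        return {"exists": False}
    slaves = _read(dev / "bonding" / "slaves") or ""
    return {
        "exists": True,
        "operstate": _read(dev / "operstate"),    # "up", "down", ...
        "mac": _read(dev / "address"),
        "mode": _read(dev / "bonding" / "mode"),  # e.g. "802.3ad 4"
        "slaves": slaves.split(),
    }


if __name__ == "__main__":
    print(bond_status())
```

Running it once before `systemctl stop NetworkManager` and once after, then comparing the two dictionaries, should show whether the device, its state, or its slaves changed.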
Didi just mentioned that NM has not been successfully stopped. It came up again immediately after.

Try to stop NM before the host deploy and check that it was indeed stopped (and not started again).
Cockpit or any other app that uses NM should not be up when disabling NM.
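To confirm that NM stays stopped rather than coming straight back, its state can be polled a few times in a row. A minimal sketch, assuming a systemd host where `systemctl is-active NetworkManager` reports the unit state; the function names, the number of checks and the interval are arbitrary illustrative choices.

```python
import subprocess
import time


def nm_state(run=subprocess.run):
    """Return systemd's view of NetworkManager, e.g. 'active' or 'inactive'."""
    result = run(
        ["systemctl", "is-active", "NetworkManager"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()


def nm_stays_stopped(checks=6, interval=5.0, state=nm_state, sleep=time.sleep):
    """Poll several times; return True only if NM was never seen running."""
    for i in range(checks):
        if state() in ("active", "activating"):
            return False
        if i < checks - 1:
            sleep(interval)
    return True


if __name__ == "__main__":
    print("NM stayed stopped:", nm_stays_stopped())
```

A `False` result means something restarted the service during the polling window, and that needs to be dealt with before retrying the host deploy.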
(In reply to Edward Haas from comment #7)
> It looks like the connectivity has been lost during the ovirt-host-deploy
> run.
> We had something similar when NetworkManager was disabled, losing the bond
> and vlans. (it wiped out these logical interfaces)
> This should have been fixed in RHEL 7.3.1.
>
> Could you please try to check if this is indeed NM related or not?
> You can try to do this:
> Before step #6 (adding the host to Engine), confirm connectivity through
> bond0 and stop NM service, checking connectivity again.

In this test, bond0 was pingable both before and after NM was stopped, and NM was indeed stopped. After that, adding the host to the engine failed, and bond0 was no longer pingable; its state was DOWN.

> If the connectivity fails, try to check why. (Does the bond0 device exist?
> Does it have an address? Is it up?)
Created attachment 1234598 [details] new issue attachment
From the supervdsm logs, all the steps seem to pass successfully up until the connectivity check.
After the bridge networking was set up on the host (bridge over bond0 with two slaves), no engine pings were detected for 120 sec and a rollback was issued.

The only thing that I can think of is that when bond0 was reconfigured by VDSM to mode 4, something changed that prevented the connection from being restored.
Perhaps the bond protocol failed to re-negotiate with the switch, or the MAC address of the bond was swapped (between the slaves).

Please try to catch this during the 120 sec after the setupNetwork has been issued, and check whether anything changed, like the MAC address.

I would also suggest testing this on the latest 4.1, as Engine connectivity issues have been handled there; perhaps it will help.
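One way to catch what changes during the 120 sec window is to poll the bond's sysfs attributes and log every difference. A rough sketch, assuming the usual bonding sysfs files (`address`, `operstate`, `bonding/slaves`, `bonding/ad_partner_mac`); some of these may need root to read and are reported as None when unreadable. The function names and the 1 sec interval are hypothetical.

```python
import time
from pathlib import Path


def sample(name="bond0", sysfs_root="/sys/class/net"):
    """Read the fields of interest; unreadable files are reported as None."""
    dev = Path(sysfs_root) / name
    files = {
        "mac": dev / "address",
        "operstate": dev / "operstate",
        "slaves": dev / "bonding" / "slaves",
        # LACP partner (switch) MAC; all zeros usually means no negotiation.
        "ad_partner_mac": dev / "bonding" / "ad_partner_mac",
    }
    out = {}
    for key, path in files.items():
        try:
            out[key] = path.read_text().strip()
        except OSError:
            out[key] = None
    return out


def diff(prev, cur):
    """Return {field: (old, new)} for every field that changed."""
    return {k: (prev.get(k), cur[k]) for k in cur if prev.get(k) != cur[k]}


def watch(duration=120, interval=1.0, read=sample,
          sleep=time.sleep, clock=time.monotonic):
    """Poll for `duration` seconds and collect (elapsed, changes) tuples."""
    start = clock()
    prev = read()
    events = []
    while clock() - start < duration:
        sleep(interval)
        cur = read()
        changes = diff(prev, cur)
        if changes:
            events.append((round(clock() - start, 1), changes))
        prev = cur
    return events


if __name__ == "__main__":
    for elapsed, changes in watch():
        print(elapsed, changes)
```

Start it on the host just before adding the host in Engine; any MAC or slave change during the rollback window will show up with the elapsed time at which it was seen.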
(In reply to Edward Haas from comment #11)
> From the supervdsm logs, all the steps seem to pass successfully up until
> the connectivity check.
> After the bridge networking has been set on the host (bridge over bond0 with
> two slaves) no engine pings have been detected for 120sec and a rollback has
> been issued.
>
> The only thing that I can think of is that when bond0 has been reconfigured
> by VDSM to mode 4, something changed that could not restore the connection.
> Perhaps the bond protocol has failed to re-negotiate with the switch or the
> mac address of the bond has been swapped (between the slaves).
>
> Please try to catch this 120sec after the setupNetwork has been issued, and
> check if anything changed, like the mac address.
>
> I would also suggest testing this on latest 4.1, so Engine connectivity
> issues have been handled there, perhaps it will help.

Okay. The test environment is not ready at the moment, so I will try later and give feedback.
Please reopen if this reproduces, and provide the information requested in comment 11