Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1707086

Summary: The management bridge "doesn't work" after a reboot of the host
Product: [oVirt] ovirt-ansible-collection
Component: hosted-engine-setup
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Version: unspecified
Reporter: elv1313
Assignee: Asaf Rachmani <arachman>
QA Contact: meital avital <mavital>
Docs Contact: Tahlia Richardson <trichard>
CC: bugs, dholler, stirabos
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 08:51:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  agent.log (flags: none)
  broker.log (flags: none)

Description elv1313 2019-05-06 19:09:32 UTC
Description of problem:

A new, temporarily single-host install won't reboot cleanly. Some network settings added by "ovirt-hosted-engine-setup" (like IP forwarding) are not persistent.

How reproducible:


Steps to Reproduce:
1. Install CentOS
2. Run the ovirt setup
3. Reboot

Actual results:

vdsmd goes up, but the HA agent can't start. Pinging the default gateway doesn't work. There are errors in the logs:

RequestError: Failed to start monitor ping, options {'addr': 'g.a.t.e'}: [Errno 2] No such file or directory

The reason for this is that there's a new network bridge that doesn't forward the traffic properly and breaks the host's Internet access.

There's a second variant of this problem where the IP gets hardcoded in the network settings. This is problematic when first installing in a DHCP environment. Getting a static IP sometimes requires asking the people who manage the network, and it can take them a while to grant one. This adds a "window of opportunity" for catastrophic failures during the initial installation. Not hardcoding the IP in /etc/sysconfig/network-scripts/ is probably a good idea.

Expected results:

Being able to reboot a single host "datacenter" with a fresh install of oVirt without playing with /sys/ networking settings and the `ip` command for 10 minutes.

Additional info:

There are many open bugs, including https://bugzilla.redhat.com/show_bug.cgi?id=1639997, about variants of this problem. None are really helpful. Note that this can be mitigated with enough time, but it may leave a sour taste when it is your first oVirt install. The old 3.x versions didn't have this issue. I report this mainly so you can work on making oVirt "just work out of the box".

Comment 1 Simone Tiraboschi 2019-05-08 10:02:54 UTC
(In reply to elv1313 from comment #0)
> vdsmd goes up, but the HA agent can't start. Pinging the default
> gateway doesn't work. There are errors in the logs:
> 
> RequestError: Failed to start monitor ping, options {'addr': 'g.a.t.e'}:
> [Errno 2] No such file or directory

This is absolutely not related to the network, and manually re-configuring it under vdsm can only make it worse.
 
I know that it can sound a bit ambiguous, but "RequestError: Failed to start monitor ping, options {'addr': 'g.a.t.e'}: [Errno 2] No such file or directory"
simply means that ovirt-ha-agent failed to ask ovirt-ha-broker to send a ping and, in that case, "[Errno 2] No such file or directory" means that the unix domain socket of the broker is closed, and so the broker is down.
99% this is a storage issue, not a network one.

Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log ?

This is probably just a duplicate of:
https://bugzilla.redhat.com/1609029

elv1313, how did you reboot your hosts?
If you didn't explicitly stop the SPM before powering down/rebooting, then when you power the host on again it will have to wait about 5 minutes to be sure that the lock on the shared storage has really expired (even if you kept the host down for a few hours). In that time frame ovirt-ha-broker is not able to acquire another lock, so it will fail and systemd will restart it.
If ovirt-ha-agent tries to send a ping while ovirt-ha-broker is down, it will fail with "[Errno 2] No such file or directory" as you reported.
After 5 minutes everything should converge by itself without the need of any manual recovery procedure.

Normally, in order to cleanly power off a managed host, the user is required to set it into maintenance mode from the engine; this will properly shut down sanlock before disconnecting the shared storage.

On the last host of a hosted-engine environment this is a bit more complex: the host is still connected to the shared storage to run the engine VM, and once you shut down the engine VM you have no engine anymore to trigger maintenance mode on the host, so you have to do it manually to avoid the issue you just reported.
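That manual procedure can be sketched with the `hosted-engine` CLI's global maintenance mode. This is a hedged sketch, not a verified runbook; the DRY_RUN guard is my own addition so that, by default, the commands are only printed rather than executed.

```shell
#!/bin/sh
# Sketch of manually powering off the last hosted-engine host.
# DRY_RUN=1 (the default here) only prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run hosted-engine --set-maintenance --mode=global  # stop HA agents from restarting the engine VM
run hosted-engine --vm-shutdown                    # cleanly shut down the engine VM
run shutdown -h now                                # finally power off the host itself
```

On a real host you would run this as root with DRY_RUN=0.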

In order to avoid that, we developed an Ansible role to properly shut down the whole oVirt environment without side effects:
https://github.com/oVirt/ovirt-ansible-shutdown-env

Comment 2 elv1313 2019-05-08 18:14:31 UTC
> elv1313, how did you reboot your hosts?

I pulled the power cord on purpose. I should have mentioned that, sorry. Given that one of the VMs powers the SIP server for the phones and security cameras, time-to-recovery is important. The original iteration of the cluster has five Dell PowerEdge servers (it will grow), enough UPS capacity for an hour and two Internet providers. A script capable of a "better shutdown" when the UPS is nearly empty is possible, but not currently implemented. Backup generators are not planned for this project.

I think the power-loss scenario and power-loss recovery time are important features.

> This is absolutely not related to the network, and manually re-configuring it under vdsm can only make it worse.

I have doubts about this statement. It was in fact not possible to ping the gateway because the bridge did not forward the traffic and the network script had a hardcoded IP with a now-expired lease. It could not fix itself. In fact, there's already a closed (WONTFIX) bug report somewhere that even `ovirt-hosted-engine-cleanup` will not delete the bridge and will leave the host with broken Internet access. I don't mind if you leave "not deleting the bridge on cleanup" as WONTFIX; however, having a broken bridge after hard-rebooting a clean install is much more severe. Maybe this will work better in RHEL/CentOS 8 now that NetworkManager handles more cases.

> it will have to wait for about 5 minutes to be sure that the lock on the shared storage is really expired

It did not come back online after 30 minutes.

> Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?

Sorry, I am afraid it has already rotated away. The problem is solved. This bug report is about steps that I think should have been handled automatically. This is mainly to make the "first time" user experience better. I think the steps to reproduce mentioned above should reproduce the problem.

Comment 3 Simone Tiraboschi 2019-05-09 08:09:35 UTC
(In reply to elv1313 from comment #2)
> I pulled the power cord on purpose. 

> I think powerloss scenario and powerloss recovery time are important
> features.

Yes, I agree with you on this.
Currently, after a sudden power loss of the whole data center, you can expect something less than 10 minutes before you have a running engine again, and then the engine, after a 5-minute grace period, will start HA-enabled VMs for you.

But this is the optimistic case.
Then, depending on what was going on on the storage side at the time you pulled the power cable, there are a lot of possible issues that require manual intervention:
- hosts that don't boot requiring a manual FS repair with fsck
- VMs that don't boot requiring a manual FS repair with fsck
- corruption on DB side requiring a manual restore from a backup
and so on...

> A script capable of "better shutdown" when the UPS is nearly empty is possible, but not currently implemented.

I'd strongly suggest focusing on this if you want to be on the safe side.
You can start from/consume: https://github.com/oVirt/ovirt-ansible-shutdown-env

It will cleanly shut down all the VMs and all the hosts.
It's designed to also work with hosted-engine, including in a Gluster hyperconverged environment.


> > This is absolutely not related to the network and manually re-configuring it under vdsm can only make it worst.
> 
> I have doubts about this statement. It was in fact not possible to ping the
> gateway because the bridge did not forward the traffic and the network script
> had a hardcoded IP with a now-expired lease.

Sorry, but this is not clear to me:
during hosted-engine deployment you choose a network interface to create the management bridge on.
If that interface was configured with DHCP (and having a reservation here is strongly recommended, if not mandatory), the management bridge will be configured with DHCP using the same MAC address.
If that interface was configured with a static IP (IPv4 or IPv6), the same static IP will be copied to the management bridge.
If the default gateway was configured on that interface, it will be configured on the management bridge.
Static routes are not supported.

Can you please attach the deployment logs, plus vdsm.log and supervdsm.log from deployment time, so we can dig into that?

> It could not fix itself. In
> fact, there's already a closed (WONTFIX) bug report somewhere that even
> `ovirt-hosted-engine-cleanup` will not delete the bridge and will leave the
> host with broken Internet access. I don't mind if you leave "not deleting
> the bridge on cleanup" as WONTFIX; however, having a broken bridge after
> hard-rebooting a clean install is much more severe. Maybe this will work
> better in RHEL/CentOS 8 now that NetworkManager handles more cases.

oVirt 4.3 is not going to support RHEL/CentOS 8 hosts (VMs are fine), they are going to be supported in 4.4.
But yes, the management bridge should work after a reboot and I want to understand why it didn't for you.

Comment 4 elv1313 2019-05-13 19:25:14 UTC
> But yes, the management bridge should work after a reboot and I want to understand why it didn't for you.

Given the oVirt cluster in question is now in production, I cannot easily try to replicate and the logs are expired. I will schedule some lab time and try to reproduce on another install. Given it requires only one host and I got the answer file for ansible, it should not take that much time. I am not sure when this will get done, I will report back.

Comment 5 Simone Tiraboschi 2019-05-14 13:34:07 UTC
(In reply to elv1313 from comment #4)
> > But yes, the management bridge should work after a reboot and I want to understand why it didn't for you.
> 
> Given the oVirt cluster in question is now in production, I cannot easily
> try to replicate and the logs are expired. I will schedule some lab time and
> try to reproduce on another install. Given it requires only one host and I
> got the answer file for ansible, it should not take that much time. I am not
> sure when this will get done, I will report back.

Maybe this https://bugzilla.redhat.com/1588052 ?

If your management bridge is configured to get an address via DHCP but the DHCP server doesn't provide a valid address within something like 20 minutes, VDSM will try to roll back to the previous working configuration.
If you just created the management bridge and never edited it, the bridge will be destroyed and the host will fail starting the engine VM.
Is this your case?
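One quick way to check whether such a rollback destroyed the bridge after a reboot is to look at it directly. This is only a sketch; `ovirtmgmt` is oVirt's default management bridge name and is an assumption here.

```shell
#!/bin/sh
# Check whether a bridge interface exists and carries an IPv4 address.
# 'ovirtmgmt' is the oVirt default management bridge name (assumed).
bridge_ok() {
    # succeeds only if the interface exists and shows an IPv4 address
    ip -4 -br addr show "$1" 2>/dev/null | grep -q '[0-9][0-9]*\.[0-9]'
}

if bridge_ok ovirtmgmt; then
    echo "ovirtmgmt is up and has an IPv4 address"
else
    echo "ovirtmgmt is missing or has no address (possible DHCP rollback)"
fi
```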

Comment 6 elv1313 2019-05-24 21:10:07 UTC
Created attachment 1573049 [details]
agent.log

Comment 7 elv1313 2019-05-24 21:10:19 UTC
Created attachment 1573050 [details]
broker.log

Comment 8 elv1313 2019-05-24 21:11:45 UTC
> Maybe this https://bugzilla.redhat.com/1588052 ?

Possible, but I'm not so sure. I just rebooted the only host of the cluster, which had been working, and I got the problem again. The DHCP IP and gateway are unchanged and working.

And for the log: yes, mounting the NFS share works locally.

Comment 9 elv1313 2019-05-24 21:26:39 UTC
Note that restarting the broker, then the agent, then vdsmd, then the agent again fixed the problem.
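The restart sequence in this comment can be sketched as a small script. The service names are the oVirt daemons discussed in this bug; the DRY_RUN guard is an editorial addition so the default run only prints the commands.

```shell
#!/bin/sh
# Recovery sequence from this comment: restart the broker first so its unix
# socket exists, then the agent, then vdsmd, then the agent once more.
# DRY_RUN=1 (the default here) only prints the commands.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for svc in ovirt-ha-broker ovirt-ha-agent vdsmd ovirt-ha-agent; do
    run systemctl restart "$svc"
done
```

With DRY_RUN=0 it would execute for real; run it as root on the affected host.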

Comment 10 Simone Tiraboschi 2019-05-27 09:08:35 UTC
Can you please also attach vdsm.log for the relevant time frame?

Comment 11 Simone Tiraboschi 2019-11-06 08:34:12 UTC
Dominik,
this looks exactly like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1588052: when the external DHCP server is not working properly at host boot time, after 20 minutes vdsm tries to roll back the network configuration to the last known working state, which can imply the destruction of the DHCP-configured bridge.

Dominik, do we have any plan to fix/change this?

Comment 12 Dominik Holler 2019-11-06 14:40:51 UTC
(In reply to Simone Tiraboschi from comment #11)
> Dominik,
> this looks exactly like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1588052: when the external DHCP
> server is not working properly at host boot time, after 20 minutes vdsm
> tries to roll back the network configuration to the last known working
> state, which can imply the destruction of the DHCP-configured bridge.
> 

Might be, but I do not understand how the restart is related, do you?

> Dominik, do we have any plan to fix/change this?

No; even though the implementation will change in 4.4, the behavior is not planned to change.

Comment 13 elv1313 2019-11-25 07:58:29 UTC
Just to mention: the customer gave up on oVirt and went to VMware. He could not keep his cluster alive due to the various papercuts, most of which are reported and known, and he got fed up paying consultant bills to bring it back online every month. Variants of this issue caused problems many times, the last time when the network was modified to add redundant switches. The bug that really ended the project was the Gluster storage collapsing after unknown "manual interventions" (I don't know the details).

I think the post-mortem is that oVirt is still not mature enough for people with this level of IT knowledge compared to "the competition". I personally used it on my personal cluster (mostly Jenkins slave VMs and Docker hosts) until recently, when I moved on to a contract where owning and operating a rack makes zero sense. I never had serious issues I could not solve, but my use cases were limited.

Still, I am happy with the progress you guys have made from the 3.x days until today. It is much better than it used to be. Unfortunately, as shown by this project, it apparently goes through QA with very specific scenarios that don't reflect the "real world" of medium-size installations on less-than-perfect networks. I think fixing those papercuts is really necessary to compete in the "too small for OpenStack" market.

Comment 15 Dominik Holler 2020-09-15 13:49:34 UTC
Can you confirm that this bug is fixed in oVirt 4.4 ?

Comment 16 elv1313 2020-09-21 03:45:40 UTC
> Can you confirm that this bug is fixed in oVirt 4.4 ?

I am sorry; as mentioned above, the customer I did the project for abandoned oVirt. This being said, this bug was really, really a problem that contributed to the failure of the project. Since minor changes to the network settings can cause a total and unsalvageable collapse of a cluster, you should really keep this bug open until it is proven to be fixed. The reproducer is simple enough.

Comment 17 Asaf Rachmani 2021-10-18 08:51:20 UTC
Based on comment 15, this bug is most likely fixed in 4.4.
In case this issue pops up in 4.4, feel free to reopen it.