Bug 1903152

Summary: Baremetal: Rebooting any host that uses OVN Kubernetes leaves it unable to access the network
Product: OpenShift Container Platform Reporter: Antoni Segura Puimedon <asegurap>
Component: Machine Config Operator Assignee: Antoni Segura Puimedon <asegurap>
Status: CLOSED DUPLICATE QA Contact: Victor Voronkov <vvoronko>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.7   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Last Closed: 2020-12-01 14:57:11 UTC Type: Bug

Description Antoni Segura Puimedon 2020-12-01 13:42:26 UTC
Description of problem:
OCP 4.7 uses OverlayFS to make sure that any changes made to the NetworkManager system connections at runtime stay only at runtime. To do that:
* It mounts an OverlayFS at a new directory, /etc/NetworkManager/system-connections-merged, and tells NetworkManager to use that directory as its source of system connection configuration. The mount options are:

    lowerdir=/etc/NetworkManager/system-connections,upperdir=/run/nm-system-connections,workdir=/run/nm-system-connections-work

The mount happens before NetworkManager runs, because NetworkManager has to be started already pointing to /etc/NetworkManager/system-connections-merged. The overlay is therefore set up just after systemd finishes setting up the temporary directories.
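For reference, the overlay described above corresponds roughly to the mount command printed by the sketch below. The three directories and the mount point are taken from this description; the exact unit or script that performs the mount in 4.7 is not shown here, and the function names are hypothetical. How NetworkManager is pointed at the merged directory is also an assumption: the `path` key in the `[keyfile]` section of NetworkManager.conf is one way to do it.

```shell
#!/bin/sh
# Sketch only: paths come from the bug description; function names are made up.
LOWER=/etc/NetworkManager/system-connections
UPPER=/run/nm-system-connections
WORK=/run/nm-system-connections-work
MERGED=/etc/NetworkManager/system-connections-merged

# Build the OverlayFS option string from the three directories.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s' "$LOWER" "$UPPER" "$WORK"
}

# Print the mount command instead of running it (running it requires root
# and the directories to exist).
print_overlay_mount() {
    printf 'mount -t overlay overlay -o %s %s\n' "$(overlay_opts)" "$MERGED"
}

print_overlay_mount
```

Because upperdir and workdir live under /run, which is a tmpfs, everything written through the merged directory is lost on reboot.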

Another part of the networking setup, done by ovs-configuration.service, is in charge of setting up NetworkManager and Open vSwitch for OVN Kubernetes. It does so by checking which NetworkManager connection is the one used for the default gateway and converting it into a bridged connection.
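The "which connection is used for the default gateway" lookup can be approximated by hand as sketched below. This is not the code from ovs-configuration.service; the helper name default_gw_dev is hypothetical, and the sample route line uses a documentation address and an example interface name.

```shell
#!/bin/sh
# Sketch only: extract the interface name from `ip route show default`
# output read on stdin. Not the actual ovs-configuration.service logic.
default_gw_dev() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Sample input (example values), so the sketch runs anywhere:
printf 'default via 192.0.2.1 dev ens3 proto dhcp metric 100\n' | default_gw_dev
# prints "ens3"

# On a node, the same helper would be fed from the live routing table, and
# nmcli can then report the connection active on that device:
#   dev=$(ip route show default | default_gw_dev)
#   nmcli -g GENERAL.CONNECTION device show "$dev"
```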


How reproducible: 100%


Steps to Reproduce:
1. Deploy OCP 4.7 with OVN Kubernetes
2. oc debug node/mynodename
3. chroot /host
4. systemctl reboot

Actual results:
The node boots up and is unable to set up its networking, so it appears as NotReady in `oc get nodes`. It also can't be accessed via `oc debug`.

Expected results:
After a short time, mynodename shows up as Ready in `oc get nodes` and can be accessed with `oc debug node/mynodename`.

Additional info:

The reason for this is that the NetworkManager configuration that ovs-configuration.service writes ends up being ephemeral due to OverlayFS, whereas the ovsdb configuration that comes from the same service is persistent. After a reboot the two no longer match, and this inconsistency prevents the node from bringing up its network.
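To see which connection profiles exist only in the ephemeral layer on a working node, something like the following sketch can be used. The function name ephemeral_profiles is hypothetical. It only detects profiles that are missing from the lower directory; profiles that exist in both places but were modified at runtime are not reported.

```shell
#!/bin/sh
# Sketch only: list profile files present in the merged directory but
# absent from the persistent (lower) directory.
ephemeral_profiles() {
    merged=$1
    lower=$2
    for f in "$merged"/*; do
        [ -e "$f" ] || continue          # empty or missing directory
        name=${f##*/}
        [ -e "$lower/$name" ] || printf '%s\n' "$name"
    done
}

# Only meaningful on a node that has the overlay from this report mounted.
if [ -d /etc/NetworkManager/system-connections-merged ]; then
    ephemeral_profiles /etc/NetworkManager/system-connections-merged \
                       /etc/NetworkManager/system-connections
fi
```

The persistent side can be compared against this list with `ovs-vsctl list-br`, which shows the bridges still recorded in ovsdb.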

Workarounds:

While the bug is being worked on, the following makes it possible to reboot the nodes:

1. `oc debug` into each node once it appears as Ready
2. Copy the contents of /etc/NetworkManager/system-connections-merged into /etc/NetworkManager/system-connections
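A minimal sketch of the copy step, to be run as root inside the node's host filesystem. The function name persist_connections is hypothetical; `cp -a` is used on the assumption that file modes and ownership of the profiles should be preserved.

```shell
#!/bin/sh
# Sketch only: copy every connection profile from SRC into DST,
# preserving mode, ownership and timestamps.
persist_connections() {
    src=$1
    dst=$2
    cp -a "$src"/. "$dst"/
}

# On the node (not executed here):
#   persist_connections /etc/NetworkManager/system-connections-merged \
#                       /etc/NetworkManager/system-connections
```

Since the merged directory is an overlay on top of the destination, the copy effectively moves the runtime-only profiles from /run into the persistent lower directory.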

Comment 1 Antoni Segura Puimedon 2020-12-01 14:57:11 UTC

*** This bug has been marked as a duplicate of bug 1898036 ***