Bug 1902584

Summary: RHCOS fails to activate static VLAN IP when first booting from disk during installation
Product: OpenShift Container Platform Reporter: Ondrej Faměra <ofamera>
Component: RHCOS Assignee: Dusty Mabe <dustymabe>
Status: CLOSED CURRENTRELEASE QA Contact: Michael Nguyen <mnguyen>
Severity: low Docs Contact:
Priority: medium    
Version: 4.5 CC: bbreard, imcleod, jligon, miabbott, nstielau, pchavan
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Failure to properly tear down network interfaces in the initrd before switching to the real root.
Consequence: Static IP assignment to a VLAN interface may not be successfully activated in the real root.
Fix: Change how network interfaces are torn down in the initrd.
Result: Static IP assignments to VLAN interfaces are successfully activated in the real root.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-10 20:15:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ondrej Faměra 2020-11-30 06:30:58 UTC
### Description of problem:
After the RHCOS image is written to disk and the system reboots into RHCOS for the first time, the static IP on the VLAN interface is never activated and RHCOS cannot continue the installation. After a manual reboot, RHCOS brings up the static VLAN IP as expected.

### Version-Release number of selected component (if applicable):
4.5.6 on both baremetal and VM
(4.6.1 seems to be NOT affected based on testing on VM, but we were not able to test this on baremetal)

### How reproducible:
100%

### Steps to Reproduce (VM):
1. Create a VM with at least two network cards (ens2, ens7).
2. Disconnect the network cable of the first interface (ens2). In libvirt, add `<link state="down" />` to the interface definition of ens2.
3. Boot the VM into the RHCOS 4.5.6 installer and provide it with the following network configuration parameters, in addition to the location of the RHCOS 4.5.6 image and the ignition file: 'rd.neednet=1 ip=10.0.0.2::10.0.0.1:255.255.255.0:testhost:ens7.2012:none nameserver=8.8.8.8 vlan=ens7.2012:ens7'
4. Wait for the RHCOS installer to write the RHCOS image and the downloaded ignition file to disk.
5. Observe that the VM reboots automatically and starts booting from disk.

### Actual results:
The VM reaches the login prompt and shows the following network configuration:
~~~
...
ens2: 
ens7: fe80::...
ens7.2012: fe80::...
...
~~~
The installer doesn't progress because there is no IPv4 address to continue with. The static IP for VLAN 2012 on interface ens7 is never activated.
At this point, manually rebooting the system produces the expected behaviour, but this manual step should not be needed.

### Expected results:
The VM reaches the login prompt and shows the following network configuration:
~~~
...
ens2: 
ens7: fe80::...
ens7.2012: 10.0.0.2 fe80::...
...
~~~
The VM has successfully activated the static IP configuration on VLAN 2012, and the installation progresses automatically as expected.
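
The difference between the actual and expected listings can also be checked mechanically. The sketch below is only an illustration: it assumes listing lines of the form `name: addr addr ...` as shown in the two console excerpts, and the IPv6 addresses in the sample data are made-up placeholders because the report elides the real ones. The function names are hypothetical.

```python
import ipaddress
from typing import Dict, List


def ipv4_by_interface(listing: str) -> Dict[str, List[str]]:
    """Map each 'name: addr ...' line to the IPv4 addresses it lists."""
    result: Dict[str, List[str]] = {}
    for line in listing.splitlines():
        name, sep, rest = line.partition(":")
        name = name.strip()
        if not sep or not name or " " in name:
            continue  # not an interface line
        addrs = []
        for token in rest.split():
            try:
                if ipaddress.ip_address(token).version == 4:
                    addrs.append(token)
            except ValueError:
                pass  # ignore anything that is not a plain IP address
        result[name] = addrs
    return result


def has_ipv4(listing: str, interface: str) -> bool:
    """True when the interface appears with at least one IPv4 address."""
    return bool(ipv4_by_interface(listing).get(interface))


# Placeholder link-local addresses; the report shows only "fe80::...".
ACTUAL = "ens2:\nens7: fe80::1\nens7.2012: fe80::2\n"
EXPECTED = "ens2:\nens7: fe80::1\nens7.2012: 10.0.0.2 fe80::2\n"

print(has_ipv4(ACTUAL, "ens7.2012"))
print(has_ipv4(EXPECTED, "ens7.2012"))
```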

### Additional info:
This was initially observed on a baremetal installation, but we later managed to reproduce it reliably on a VM once we knew that the "disconnected network card" was the culprit. As the system has moved toward production, testing on baremetal is no longer possible, but because the issue can be reproduced on demand on VMs, we can test any proposed solutions there to confirm whether the issue is resolved.
Testing with RHCOS 4.6.1 using the same command-line parameters (step 3) as for 4.5.6 does not lead to problems, and the static IP on the VLAN is activated on the first attempt.

By modifying the ignition file to allow console login, we were further able to confirm that, in the case of 4.5.6, if we log in and issue `nmcli c up ens7.2012`, the network is brought up and the installation continues. When using multiple network interfaces, we can also see from the terminal that NetworkManager repeatedly tries to 'activate' all connections except ens7.2012. After a manual reboot, we can again confirm from the console that NetworkManager tries to bring up 'ens7.2012', which results in the static IP on the VLAN being activated successfully.
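
The manual activation described above could be scripted roughly as follows. This is a sketch only, not something used in this report: it wraps the same command in its long form (`nmcli connection up <name>`), the helper names are hypothetical, and the `run` parameter exists purely so that the command can be swapped out when nmcli is not installed.

```python
import subprocess
from typing import Callable, List


def build_activate_command(connection: str) -> List[str]:
    """Return the argv for `nmcli c up <connection>` in long form."""
    return ["nmcli", "connection", "up", connection]


def activate(connection: str, run: Callable = subprocess.run) -> bool:
    """Run the activation command; True when it exits with status 0."""
    completed = run(build_activate_command(connection), check=False)
    return completed.returncode == 0
```

On an affected 4.5.6 node, calling `activate("ens7.2012")` from a console session would be equivalent to typing the command by hand. It does not address why NetworkManager skips the connection on first boot.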

If additional data are needed please let me know. Thank you.

Comment 3 Micah Abbott 2020-12-01 21:34:35 UTC
Targeting for 4.7 with medium priority; if a fix is needed in 4.6, we will need to clone the BZ accordingly.

Comment 4 Dusty Mabe 2020-12-01 23:32:14 UTC
I believe this is a duplicate of BZ#1860060 (fixed in 4.6). See comment https://bugzilla.redhat.com/show_bug.cgi?id=1860060#c3 for details.

Comment 5 Ondrej Faměra 2020-12-02 01:47:32 UTC
Hi Dusty,

Checking on BZ#1860060 it really feels like that addresses it - as mentioned in description the 4.6 works fine, the 4.5 is affected.
In short:

1. Is there consideration to bring the fix from 4.6 into 4.5? (yes/no)
  (if yes, we will wait with support on updates here)
2. If not, then can we treat this as documentation BUG to improve docs mentioning this behaviour as "known limitation that was improved in 4.6 release" ideally mentioning the workaround (additional manual reboot needed for RHCOS to pick up the address)?

Thank you.

Comment 6 Micah Abbott 2020-12-02 16:00:58 UTC
(In reply to Ondrej Faměra from comment #5)
> Hi Dusty,
> 
> Checking on BZ#1860060 it really feels like that addresses it - as mentioned
> in description the 4.6 works fine, the 4.5 is affected.
> In short:
> 
> 1. Is there consideration to bring the fix from 4.6 into 4.5? (yes/no)
>   (if yes, we will wait with support on updates here)

There were significant changes to how networking was handled in the initrd as part of 4.6, so I don't believe a simple backport is possible.  Additionally, changing how the initrd operates in 4.5 would require rebuilding all of the RHCOS boot media as part of a 4.5.z release, which is something we avoid unless absolutely necessary (i.e. to mitigate a CVE).

Therefore, we are not considering backporting this fix into 4.5 without additional justification.

> 2. If not, then can we treat this as documentation BUG to improve docs
> mentioning this behaviour as "known limitation that was improved in 4.6
> release" ideally mentioning the workaround (additional manual reboot needed
> for RHCOS to pick up the address)?

We can pursue updating the docs for this issue; if you could identify where in the docs we could make an update, that would be useful.

> 
> Thank you.

Comment 7 Dusty Mabe 2020-12-04 22:44:14 UTC
This bug needs more information. It is not scheduled to be worked on in the current sprint.

Comment 8 Ondrej Faměra 2020-12-08 13:02:36 UTC
Hi Dusty,

Thank you for the answer, and sorry for the delay.

I think that adding a 'Note' at the end of the 'Configure advanced networking' section (https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-static-network_installing-bare-metal) would make the most sense to me.

Text could be something like:

~~~
Note: When using some of the advanced networking options, such as `vlan=`, you may encounter an issue where the statically configured address is not present or not activated properly on the first RHCOS boot. In that case, you can try manually rebooting the machine (use ctrl+alt+delete or send a reset signal to the machine, depending on your environment). In RHCOS 4.6 the network code was significantly overhauled, so these kinds of issues should be resolved there.
~~~

(The above is just a suggestion; feel free to edit.)

Thank you

Comment 9 Micah Abbott 2020-12-10 20:15:44 UTC
@Ondrej, thank you for the suggestion.

I've made a PR to the docs for OCP 4.5 to suggest the workaround - https://github.com/openshift/openshift-docs/pull/28036

I'm going to close this as CURRENTRELEASE based on comment #5