Created attachment 1934243 [details]
sosreport-worker-0

Description of problem:

The device bond0.409 (a VLAN over bonded interfaces) does not activate after issuing `nmcli connection reload`. The problem can be inspected in the attached sosreport and in journalctl:

~~~~~~~~~~~~~~
Dec 19 22:36:12 worker-0 configure-ovs.sh[6116]: + local 'connected_state=10 (unmanaged)'
Dec 19 22:36:12 worker-0 configure-ovs.sh[6116]: + [[ 10 (unmanaged) =~ disconnected ]]
Dec 19 22:36:12 worker-0 configure-ovs.sh[6116]: + echo 'Waiting for interface bond0.409 to activate...'
Dec 19 22:36:12 worker-0 configure-ovs.sh[6116]: Waiting for interface bond0.409 to activate...
Dec 19 22:36:12 worker-0 configure-ovs.sh[6116]: + timeout 60 bash -c 'while ! nmcli -g DEVICE,STATE c | grep "bond0.409:activated"; do sleep 5; done'
...
Dec 19 22:37:12 worker-0 configure-ovs.sh[6116]: + echo 'Warning: bond0.409 did not activate'
~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):
8.6

How reproducible:
The bug is likely to reproduce with NM LOGLEVEL=info, rarely with LOGLEVEL=debug, and never with LOGLEVEL=trace.

Steps to Reproduce:
1.
2.
3.

Actual results:
The bond0.409 connection does not activate.

Expected results:
The bond0.409 connection should activate, as other connections (e.g. bond0) do.

Additional info:
- The error appeared in OCP 4.12 rc5
- It works correctly in OCP 4.10, which is based on RHEL 8.4
- The error occurs during boot, in configure-ovs.sh [1]

[1] https://github.com/openshift/machine-config-operator/blob/release-4.12/templates/common/_base/files/configure-ovs-network.yaml#L366
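For reference, a minimal sketch of the activation check reconstructed from the journal output above. The device name and the 60-second polling loop come straight from the log; the surrounding reload and state query are assumptions about how to exercise the same code path manually:

~~~~~~~~~~~~~~
#!/bin/bash
# Minimal reproduction sketch of the check performed by configure-ovs.sh.
iface=bond0.409

# Reload all connection profiles from disk, as in the failure scenario.
nmcli connection reload

# Show the current device state; the log above reports "10 (unmanaged)".
nmcli -g GENERAL.STATE device show "$iface"

# Poll for up to 60 seconds until the connection reaches "activated",
# mirroring the loop visible in the journal output.
if ! timeout 60 bash -c \
    "while ! nmcli -g DEVICE,STATE connection | grep -q '$iface:activated'; do sleep 5; done"; then
    echo "Warning: $iface did not activate"
fi
~~~~~~~~~~~~~~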
If this is the root cause of the referenced OCPBUGS-3612 [1], it represents a regression for all OCP clusters upgrading from 4.10 to 4.11 or 4.12. Since 4.11 has already shipped and 4.12 is shipping imminently, we need this to be looked into urgently.

[1] https://issues.redhat.com/browse/OCPBUGS-3612
Hi Andrea, I'm working on a patch to fix this bug. Unfortunately, I can't reproduce the problem locally; would you test (or ask the customer to test) a scratch build once there is a proposed fix?
Hi @bgalvani, thanks for the update. @manrodri Do you think it's possible to test a scratch build (presumably an RPM) on the distributed CI you used for https://issues.redhat.com/browse/OCPBUGS-3612? We should test the configure-ovs.sh script without the `touch /run/configure-ovs-boot-done` workaround.
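As a quick sanity check before testing (a hypothetical one-liner, assuming shell access to the node): since the workaround consists of pre-creating /run/configure-ovs-boot-done, that file must be absent for the test to exercise the real code path.

~~~~~~~~~~~~~~
# On the node: verify the workaround is not in effect before rebooting.
test -e /run/configure-ovs-boot-done && echo "workaround active" || echo "workaround not applied"
~~~~~~~~~~~~~~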
Hi @apanatto, In our Distributed CI we can install clusters from several releases (nightly builds, EC, RC) of OCP. If I'm understanding correctly, we'll need an OCP cluster running 4.12 without the workaround, then upgrade an RPM package and, I guess, reboot the nodes to test? Please let me know if that's the case and I can prepare a cluster. Thanks,
It would be better to install the RPM package before the OpenShift installation, but I suppose that's not easy since it's all automated. If we install the package later in the process, we need to be sure a simple reboot does not solve the problem on its own. So the steps are slightly different (see the sketch after this list):

1. Set up an OCP 4.12 cluster
2. Apply the MachineConfig that makes configure-ovs.sh fail
3. Reboot the node and check that it still fails
4. Install the RPM provided by Beniamino
5. Reboot the node
6. Check that it comes up without errors

@manrodri do you think this is feasible? @bgalvani any feedback on the above process?
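To make steps 4-6 concrete, here is a sketch of what might be run on the node, assuming the scratch build is a NetworkManager RPM copied onto the node. The package file name is illustrative; on RHCOS, `rpm-ostree override replace` is the usual way to swap a base package, and related subpackages (e.g. NetworkManager-libnm) may need to be included in the same command:

~~~~~~~~~~~~~~
# Step 4: on the worker node, as root, replace the base package with the
# scratch build (file name is illustrative).
rpm-ostree override replace ./NetworkManager-*.el8.x86_64.rpm

# Step 5: reboot so the new deployment takes effect.
systemctl reboot

# Step 6: after the reboot, check the configure-ovs.sh output for the
# warning seen in the original report.
journalctl -b -t configure-ovs.sh | grep -i 'did not activate' \
    && echo "still failing" || echo "no activation failures this boot"
~~~~~~~~~~~~~~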
The test procedure above looks ok.
@apanatto thanks for the details; that looks good to me. I've never installed an RPM on an OCP node, but I'm up for testing, so please let me know when you have an RPM available and I'll run the procedure.
@manrodri did you have a chance to run more tests on this? Although this problem is not fully deterministic, can we say the fix improves the overall stability of the startup process?
*** Bug 2159758 has been marked as a duplicate of this bug. ***