Bug 2095264 - ovs-configuration.service fails with Error: Failed to modify connection 'ovs-if-br-ex': failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 2028003 2095415 (view as bug list)
Depends On:
Blocks: 2098097
 
Reported: 2022-06-09 11:59 UTC by Marius Cornea
Modified: 2022-07-25 12:32 UTC (History)
10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2098097 (view as bug list)
Environment:
Last Closed: 2022-06-23 08:15:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
journal.log (2.06 MB, text/plain)
2022-06-09 11:59 UTC, Marius Cornea
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3188 0 None Merged [release-4.9] Bug 2098099: configure-ovs: clone connection to avoid selinux problems 2022-06-22 19:18:04 UTC

Description Marius Cornea 2022-06-09 11:59:37 UTC
Created attachment 1888319 [details]
journal.log

Description of problem:

ovs-configuration.service on a single-node OpenShift cluster with static single-stack IPv6 addressing fails with the error below:

Jun 09 11:40:38 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: Error: Failed to modify connection 'ovs-if-br-ex': failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.JSHHN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied


Version-Release number of selected component (if applicable):
4.9.38

How reproducible:
100%

Steps to Reproduce:
1. Deploy a single-node OpenShift cluster with static IPv6 addressing via the ZTP procedure

2. oc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network             False       True          True       12m     The network is starting up

3. SSH to the node and check ovs-configuration.service status

Actual results:

[core@sno ~]$ systemctl status ovs-configuration.service
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-06-09 11:40:46 UTC; 14min ago
 Main PID: 2147 (code=exited, status=1/FAILURE)
      CPU: 668ms

Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: + ip -6 route show
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: ::1 dev lo proto kernel metric 256 pref medium
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: 2620:52:0:198::/64 dev ens2f2 proto kernel metric 102 pref medium
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: fe80::/64 dev ens2f3 proto kernel metric 101 pref medium
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: fe80::/64 dev ens2f2 proto kernel metric 102 pref medium
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: default via 2620:52:0:198::fe dev ens2f2 proto static metric 102 pref medium
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com configure-ovs.sh[2147]: + exit 1
Jun 09 11:40:46 sno.kni-qe-1.lab.eng.rdu2.redhat.com systemd[1]: ovs-configuration.service: Consumed 668ms CPU time


Expected results:
No failures

Additional info:
Attaching system journal.

Comment 1 Marius Cornea 2022-06-09 12:04:44 UTC
It appears SELinux is preventing the operation:

grep denied /var/log/audit/audit.log 
type=AVC msg=audit(1654774836.017:51): avc:  denied  { unlink } for  pid=2023 comm="NetworkManager" name="ovs-if-br-ex.nmconnection" dev="overlay" ino=57610 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:NetworkManager_etc_t:s0 tclass=file permissive=0


Note that after switching SELinux to permissive mode, ovs-configuration.service started without issues.
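A quick way to confirm a denial like this is to pull the denied operation and the SELinux source/target contexts out of the AVC record. A small sketch, using the audit line above as sample input (on a live node you would grep /var/log/audit/audit.log directly, as shown in the comment):

```shell
# Sample input: the AVC denial captured in this bug. On a live node:
#   grep denied /var/log/audit/audit.log
avc='type=AVC msg=audit(1654774836.017:51): avc:  denied  { unlink } for  pid=2023 comm="NetworkManager" name="ovs-if-br-ex.nmconnection" dev="overlay" ino=57610 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:NetworkManager_etc_t:s0 tclass=file permissive=0'

# Denied operation and the source/target contexts involved:
echo "$avc" | grep -o '{ [a-z_]* }'       # -> { unlink }
echo "$avc" | grep -o 'scontext=[^ ]*'    # the process (NetworkManager) context
echo "$avc" | grep -o 'tcontext=[^ ]*'    # the file's context
```

Here the target context NetworkManager_etc_t on the keyfile is what NetworkManager's domain is not allowed to unlink, matching the "Permission denied" on rename in the service log.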

Comment 2 Jaime Caamaño Ruiz 2022-06-14 12:56:35 UTC
So it looks like the overlay /etc/NetworkManager/systemConnectionsMerged is not registered with SELinux as a valid location for NetworkManager to manage its connection profiles. If we manually copy a profile there and run restorecon, which we do when copying a static IP configuration, the file ends up with context NetworkManager_etc_t instead of the expected NetworkManager_var_run_t or NetworkManager_etc_rw_t.

Recently we introduced a change, https://github.com/openshift/machine-config-operator/pull/3160, that attempts to configure something through nmcli after this copy, and that nmcli call fails.

Trying to work around it in https://github.com/openshift/machine-config-operator/pull/3188 by using `nmcli clone` instead of a manual copy.
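In other words, NetworkManager may manage keyfiles labeled NetworkManager_etc_rw_t or NetworkManager_var_run_t, but not the NetworkManager_etc_t label the manually copied file ends up with. A minimal sketch of that distinction (`check_label` is a hypothetical helper, not part of configure-ovs; on a real node the label comes from `ls -Z` on the .nmconnection file):

```shell
# Classify a keyfile's SELinux context: NetworkManager can rewrite/unlink
# profiles labeled with its own rw/run types, but not NetworkManager_etc_t.
check_label() {
    case "$1" in
        *:NetworkManager_etc_rw_t:*|*:NetworkManager_var_run_t:*) echo manageable ;;
        *) echo not-manageable ;;
    esac
}

check_label 'system_u:object_r:NetworkManager_etc_t:s0'      # -> not-manageable (the failing label)
check_label 'system_u:object_r:NetworkManager_etc_rw_t:s0'   # -> manageable (expected label)
```

This is why letting NetworkManager create the profile itself (via `nmcli clone`) avoids the denial: the daemon writes the file with a label its own policy permits it to modify later.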

Could you guys give it a shot?

Comment 3 Jaime Caamaño Ruiz 2022-06-14 13:16:48 UTC
*** Bug 2095415 has been marked as a duplicate of this bug. ***

Comment 5 Ross Brattain 2022-06-14 22:29:08 UTC

https://github.com/openshift/machine-config-operator/pull/3188 seems to work with static IP.

-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  292 Jun 14 14:40 /etc/NetworkManager/system-connections/br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  292 Jun 14 14:40 /etc/NetworkManager/systemConnectionsMerged/br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_var_run_t:s0 527 Jun 14 21:33 /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_var_run_t:s0 324 Jun 14 21:33 /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  186 Jun 14 14:40 /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  187 Jun 14 14:40 /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  527 Jun 14 21:33 /etc/NetworkManager/system-connections/ovs-if-br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  324 Jun 14 21:33 /etc/NetworkManager/system-connections/ovs-if-phys0.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  186 Jun 14 14:40 /etc/NetworkManager/system-connections/ovs-port-br-ex.nmconnection
-rw-------. 1 root root system_u:object_r:NetworkManager_etc_rw_t:s0  187 Jun 14 14:40 /etc/NetworkManager/system-connections/ovs-port-phys0.nmconnection

303-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + break
304-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + '[' 0 -eq 0 ']'
305-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + echo 'Brought up connection ovs-if-br-ex successfully'
306-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: Brought up connection ovs-if-br-ex successfully
307:Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
308-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + '[' -f /etc/ovnk/extra_bridge ']'
309-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + persist_nm_conn_files
310-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + update_nm_conn_files br-ex phys0
311-Jun 14 14:40:56 compute-1 configure-ovs.sh[1336]: + bridge_name=br-ex
--
599-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + local active_state=activated
600-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + '[' activated '!=' activated ']'
601-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + echo 'Connection ovs-if-br-ex already activated'
602-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: Connection ovs-if-br-ex already activated
603:Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
604-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + '[' -f /etc/ovnk/extra_bridge ']'
605-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + persist_nm_conn_files
606-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + update_nm_conn_files br-ex phys0
607-Jun 14 21:33:45 compute-1 configure-ovs.sh[1563]: + bridge_name=br-ex

Comment 6 Jaime Caamaño Ruiz 2022-06-15 10:22:36 UTC
@rbrattai could you share the ovs-configuration logs to cross-check? Thank you.

Comment 8 Jaime Caamaño Ruiz 2022-06-16 17:51:26 UTC
We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

    reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
    example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
    example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?

    reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
    example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
    example: Up to 90 seconds of API downtime. Check with curl ....
    example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?

    reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
    example: Issue resolves itself after five minutes.
    example: Admin can run a single: oc ....
    example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?

    reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
    example: No, it has always been like this we just never noticed.
    example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1.

Comment 9 Jaime Caamaño Ruiz 2022-06-16 18:27:48 UTC
Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

4.9.z to 4.9.38: all ovn-kubernetes clusters
4.9.z to 4.9.39: all ovn-kubernetes with static IP configurations

What is the impact? Is it serious enough to warrant removing update recommendations?

In 4.9.38, ovn-kubernetes clusters with static IP configurations will fail to deploy, and networking on all other ovn-kubernetes clusters will break after the first reboot.
In 4.9.39, ovn-kubernetes clusters with static IP configurations will fail to deploy.

How involved is remediation?

For deployments with no static IP configuration, access the nodes through the provisioning network or a console and:
- run `systemctl start ovs-configuration`
- run `nmcli -g name c show --active | egrep "(ovs-if-|-slave-ovs-clone)" | xargs -I % nmcli c mod % connection.autoconnect yes`
- reboot

For deployments with static IP configuration, access the nodes through the provisioning network or a console and:
- Set selinux to permissive mode (as documented in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_selinux/changing-selinux-states-and-modes_using-selinux)
- run `systemctl start ovs-configuration`
- run `nmcli -g name c show --active | egrep "(ovs-if-|-slave-ovs-clone)" | xargs -I % nmcli c mod % connection.autoconnect yes`
- reboot
- Set selinux to enforcing mode (as documented in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_selinux/changing-selinux-states-and-modes_using-selinux)

These steps may have to be performed again if the network configuration in the node resets to its original state, although this would be unexpected.
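The remediation above boils down to restarting the service, re-enabling autoconnect on the OVS-related profiles, and rebooting, with a temporary SELinux mode switch for the static-IP case. A sketch with the live-node commands as comments (they need a real node; `setenforce` is one way to toggle modes temporarily, and the nmcli property name is `connection.autoconnect`, as the configure-ovs logs in comment 5 show), plus a runnable demonstration of the profile-matching pattern on sample connection names:

```shell
# On an affected node (root shell via provisioning network or console):
#
#   setenforce 0                        # static-IP case only: permissive mode
#   systemctl start ovs-configuration
#   nmcli -g name c show --active \
#     | grep -E "(ovs-if-|-slave-ovs-clone)" \
#     | xargs -I % nmcli c mod % connection.autoconnect yes
#   reboot
#   setenforce 1                        # static-IP case only: back to enforcing
#
# The pattern selects only the OVS-related profiles, demonstrated here on
# sample connection names (prints the three ovs-related names and filters
# out "Wired connection 1"):
printf '%s\n' 'ovs-if-br-ex' 'ovs-if-phys0' 'ens2f2-slave-ovs-clone' 'Wired connection 1' \
  | grep -E "(ovs-if-|-slave-ovs-clone)"
```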

Is this a regression?

Yes.

Comment 11 Tim Rozet 2022-06-17 15:14:08 UTC
This problem only occurs in 4.9; no fix is needed for 4.10 or 4.11. QE, please verify.

Comment 13 W. Trevor King 2022-06-22 23:16:56 UTC
Update graph-data response is being discussed in bug 2098099 (the 4.9.z-targeted bug in this series), e.g. here [1].  We only need the UpgradeBlocker metadata on a single bug in the series, so dropping it from this bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2098099#c6

Comment 14 Michael Filanov 2022-07-25 12:32:35 UTC
*** Bug 2028003 has been marked as a duplicate of this bug. ***

