Bug 2095415
Summary: | OCP on Z 4.9.38 and 4.9.39 builds hang on network operator during zVM environment install
---|---
Product: | OpenShift Container Platform
Component: | Machine Config Operator
Machine Config Operator sub component: | Machine Config Operator
Reporter: | krmoser
Assignee: | Jaime Caamaño Ruiz <jcaamano>
QA Contact: | Ross Brattain <rbrattai>
Status: | CLOSED DUPLICATE
Severity: | high
Priority: | unspecified
CC: | chanphil, christian.lapolt, danili, dmistry, fleber, Holger.Wolf, jcaamano, mkrejci, psundara, skumari
Version: | 4.9
Target Release: | 4.9.z
Hardware: | s390x
OS: | Linux
Type: | Bug
Last Closed: | 2022-06-14 13:16:48 UTC
Description
krmoser
2022-06-09 17:11:12 UTC
Created attachment 1888403
partial must-gather for OCP 4.9.38 zVM install issue
Partial "oc adm must-gather" for OCP 4.9.38 zVM install issue.
Thank you.
ovnkube-node pods fail with:

I0609 16:55:39.356410 37236 gateway_localnet.go:173] Node local addresses initialized to: map[10.129.0.2:{10.129.0.0 fffffe00} 10.20.116.12:{10.20.116.0 ffffff00} 127.0.0.1:{127.0.0.0 ff000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::8808:c4ff:fed3:412:{fe80:: ffffffffffffffff0000000000000000} fe80::943d:51ff:fe27:b2fc:{fe80:: ffffffffffffffff0000000000000000}]
I0609 16:55:39.356500 37236 helper_linux.go:73] Found default gateway interface enc2e0 10.20.116.1
F0609 16:55:39.356532 37236 ovnkube.go:130] could not find IP addresses: failed to lookup link br-ex: Link not found

Kyle,
Which was the last 4.9 build that worked for you?
Prashanth

Prashanth,
1. The last OCP 4.9 on Z build that installs properly in a zVM environment is the predecessor to this build, 4.9.37.
2. The OCP on Z 4.9.37 build was released on June 3, 2022.
Thank you,
Kyle

Thanks Kyle.
The difference between 4.9.37 and 4.9.38 seems to be https://github.com/openshift/machine-config-operator/pull/3160 in the machine-config-operator, and it looks related to what we are seeing.
@jcaamano - could the issue we are hitting be related to the above change?

Yes, it could be. It has an issue with a tentative fix here: https://github.com/openshift/machine-config-operator/pull/3183

Assigning this bug to Networking team.

Hello Jaime,
Can you please set the blocker flag to '+' or '-' based on your assessment?

Can't be completely sure if it is the same thing without a node journal. @krmoser.com would you be able to provide one?

Prashanth,
Please let us know where you would like the node journal collected from and the commands to do so.
Thank you,
Kyle

Hi Kyle,
A "journalctl" on the master nodes should help.
Thanks
Prashanth

Created attachment 1889463
master-0 journalctl logs
master-0 journalctl logs
Created attachment 1889464
master-1 journalctl logs
master-1 journalctl logs
Created attachment 1889465
master-2 journalctl logs
master-2 journalctl logs
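
Journals like the three attached above are long, so a small filter can help narrow them to the units involved in bringing up br-ex. The following is a minimal sketch, assuming the journal was saved as plain text (for example, by redirecting journalctl output to a file). The helper name `find_failures`, the unit names, and the failure keywords are illustrative choices, not part of any OpenShift tooling, and the sample lines are abbreviated from log lines quoted in this report.

```python
import re

# Hypothetical triage helper: unit names and keywords are illustrative
# assumptions, chosen to match the services discussed in this bug.
UNIT_RE = re.compile(r"\b(?:configure-ovs\.sh|NetworkManager)\[\d+\]:")
FAILURE_RE = re.compile(r"error|fail|permission denied", re.IGNORECASE)


def find_failures(journal_text):
    """Return journal lines from the units of interest that mention a failure."""
    hits = []
    for line in journal_text.splitlines():
        if UNIT_RE.search(line) and FAILURE_RE.search(line):
            hits.append(line.strip())
    return hits


# Sample lines abbreviated from this report (hostnames and message tails shortened).
SAMPLE = """\
Jun 13 15:41:58 master-0 configure-ovs.sh[1390]: Brought up connection ovs-if-br-ex successfully
Jun 13 15:41:56 master-1 configure-ovs.sh[1404]: Error: Failed to modify connection 'ovs-if-br-ex'
"""

for hit in find_failures(SAMPLE):
    print(hit)
```

Only the second sample line is printed, since the first one reports success. For a real journal, the file contents would be read and passed to `find_failures` in place of `SAMPLE`.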
Thanks @krmoser.com

Looks like something different:

Jun 13 15:41:58 master-0.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1390]: Brought up connection ovs-if-br-ex successfully
Jun 13 15:41:58 master-0.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1390]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: ((libnm-core/nm-connection.c:186)): assertion '<dropped>' failed
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: <warn> [1655134916.9590] keyfile: commit: failure to write 13d22672-3c1f-4735-8db5-60cd18e60b8d ((null)) to "/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection": error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: <info> [1655134916.9590] audit: op="connection-update" uuid="13d22672-3c1f-4735-8db5-60cd18e60b8d" name="ovs-if-br-ex" args="connection.autoconnect,connection.timestamp" pid=1785 uid=0 result="fail" reason="failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied"
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1404]: Error: Failed to modify connection 'ovs-if-br-ex': failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied

Could you please provide the ownership and permissions of /etc/NetworkManager/systemConnectionsMerged and /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection, as well as /var/log/audit/audit.log, on any of the nodes?

It looks like there are no write permissions on /etc/NetworkManager/systemConnectionsMerged, but there should be, as that dir is created via /etc/tmpfiles.d/nm.conf containing:

d /etc/NetworkManager/systemConnectionsMerged 0755 root root - -

Jaime,
Thanks for the assistance. Here's the requested information. There is no /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection file.

1. /etc/NetworkManager/systemConnectionsMerged :
======================================================================
[root@master-2 ~]# ls -al /etc/NetworkManager/systemConnectionsMerged
total 4
drwxr-xr-x. 1 root root 140 Jun 13 15:41 .
drwxr-xr-x. 8 root root 165 Jun 13 15:41 ..
-rw-------. 1 root root 406 Jun 13 15:40 default_connection.nmconnection
[root@master-2 ~]#

2. /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection :
======================================================================
[root@master-2 ~]# ls -al /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection
ls: cannot access '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': No such file or directory
[root@master-2 ~]#

3. /var/log/audit/audit.log :
======================================================================
[root@master-2 ~]# ls -al /var/log/audit/audit.log
-rw-------. 1 root root 92813 Jun 13 20:15 /var/log/audit/audit.log
[root@master-2 ~]#

Thank you,
Kyle

So it looks like the overlay /etc/NetworkManager/systemConnectionsMerged is not registered with SELinux as a valid location for NM to manage its connection profiles. If we manually copy a profile there and run restorecon, which we do when trying to copy a static IP configuration, then it will have NetworkManager_t as its scontext instead of the expected NetworkManager_var_run_t or NetworkManager_etc_rw_t.

Recently we introduced a change with https://github.com/openshift/machine-config-operator/pull/3160 that attempts to configure something through nmcli after this copy, and it fails.

Trying to work around it with https://github.com/openshift/machine-config-operator/pull/3188 by using `nmcli clone` instead of a manual copy. Could you guys give it a shot?

Marking as dup of 2095264.

*** This bug has been marked as a duplicate of bug 2095264 ***

Folks,
Please let us know when there is an OCP 4.9.38 on Z successor build available to test with the proposed fix.
Thank you,
Kyle

Folks,
It appears that the same issue exists for this week's OCP 4.9 post-GA publicly available build: 4.9.39.
Thank you,
Kyle

Folks,
The OCP on Z Solution Test team has successfully tested the following OCP 4.9 on Z builds for both connected and disconnected installs:
1. OCP 4.9.40
2. OCP 4.9.41
Thank you,
Kyle
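
For reference, the ownership and mode check requested earlier in this thread can be expressed as a comparison between the tmpfiles.d entry and the directory on disk. The following is a minimal sketch, assuming the standard tmpfiles.d field order (type, path, mode, user, group, age, argument); the helper names `parse_tmpfiles_line` and `mode_matches` are hypothetical. In this report the mode and ownership turned out to match the entry, and the failure was attributed to SELinux labeling, which this sketch does not inspect (running `ls -Z` on the node shows the labels).

```python
import os
import stat


def parse_tmpfiles_line(line):
    """Split a tmpfiles.d entry into its first five fields."""
    entry_type, path, mode, user, group = line.split()[:5]
    return {
        "type": entry_type,
        "path": path,
        "mode": int(mode, 8),  # the mode column is octal, e.g. 0755
        "user": user,
        "group": group,
    }


def mode_matches(path, expected_mode):
    """Compare the permission bits of `path` with the expected mode."""
    actual = stat.S_IMODE(os.stat(path).st_mode)
    return actual == expected_mode


# The entry quoted in this report from /etc/tmpfiles.d/nm.conf.
entry = parse_tmpfiles_line(
    "d /etc/NetworkManager/systemConnectionsMerged 0755 root root - -"
)
print(entry["path"], oct(entry["mode"]))

# Only meaningful on a node where the directory exists.
if os.path.isdir(entry["path"]):
    print("mode matches:", mode_matches(entry["path"], entry["mode"]))
```

A matching mode only rules out plain file permissions as the cause; a "Permission denied" on rename can still come from SELinux, as it did here.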