Description of problem:

1. The OCP 4.9.38 on Z build, using the RHCOS 4.9 GA version, fails to install in zVM environments, with the install not progressing past the network cluster operator.
2. The OCP 4.9.38 on Z build, using the same RHCOS 4.9 GA version, installs successfully in multiple KVM environments.

Here is the output of the OCP CLI commands "oc get clusterversion", "oc get nodes", and "oc get co" when attempting an install in a zVM environment:

[root@ospamgr1 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          51m     Unable to apply 4.9.38: an unknown error has occurred: MultipleErrors

[root@ospamgr1 ~]# oc get nodes
NAME                                          STATUS     ROLES    AGE   VERSION
master-0.pok-25.ocptest.pok.stglabs.ibm.com   NotReady   master   50m   v1.22.8+f34b40c
master-1.pok-25.ocptest.pok.stglabs.ibm.com   NotReady   master   50m   v1.22.8+f34b40c
master-2.pok-25.ocptest.pok.stglabs.ibm.com   NotReady   master   50m   v1.22.8+f34b40c

[root@ospamgr1 ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager                   4.9.38    True        False         False      50m
cloud-credential                                     True        False         False      50m
cluster-autoscaler
config-operator
console
csi-snapshot-controller
dns
etcd
image-registry
ingress
insights
kube-apiserver
kube-controller-manager
kube-scheduler
kube-storage-version-migrator
machine-api
machine-approver
machine-config
marketplace
monitoring
network                                              False       True          True       50m     The network is starting up
node-tuning
openshift-apiserver
openshift-controller-manager
openshift-samples
operator-lifecycle-manager
operator-lifecycle-manager-catalog
operator-lifecycle-manager-packageserver
service-ca
storage
[root@ospamgr1 ~]#

Version-Release number of selected component (if applicable):
1. OCP 4.9.38 at https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.9.38
2. RHCOS 4.9.0 at https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.9/4.9.0

How reproducible:
1. Consistently reproducible in an OCP on Z zVM environment.
2. Consistently NOT reproducible in an OCP on Z KVM environment.

Steps to Reproduce:
1. Attempt to install the OCP 4.9.38 on Z build in a zVM environment.

Actual results:
The installation does not progress past the OCP 4.9.38 on Z network operator.

Expected results:
The installation should complete successfully.
Additional info:

Here is the console output of an attempted "oc adm must-gather":

[root@ospamgr1 ~]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 7fd28f1d-3ee8-4815-b2a6-ddf935e06199
ClusterVersion: Installing "4.9.38" for 34 minutes: Unable to apply 4.9.38: an unknown error has occurred: MultipleErrors
ClusterOperators:
    clusteroperator/authentication is not available (<missing>) because <missing>
    clusteroperator/baremetal is not available (<missing>) because <missing>
    clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
    clusteroperator/config-operator is not available (<missing>) because <missing>
    clusteroperator/console is not available (<missing>) because <missing>
    clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
    clusteroperator/dns is not available (<missing>) because <missing>
    clusteroperator/etcd is not available (<missing>) because <missing>
    clusteroperator/image-registry is not available (<missing>) because <missing>
    clusteroperator/ingress is not available (<missing>) because <missing>
    clusteroperator/insights is not available (<missing>) because <missing>
    clusteroperator/kube-apiserver is not available (<missing>) because <missing>
    clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
    clusteroperator/kube-scheduler is not available (<missing>) because <missing>
    clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
    clusteroperator/machine-api is not available (<missing>) because <missing>
    clusteroperator/machine-approver is not available (<missing>) because <missing>
    clusteroperator/machine-config is not available (<missing>) because <missing>
    clusteroperator/marketplace is not available (<missing>) because <missing>
    clusteroperator/monitoring is not available (<missing>) because <missing>
    clusteroperator/network is not available (The network is starting up) because DaemonSet "openshift-ovn-kubernetes/ovn-ipsec" rollout is not making progress - last change 2022-06-09T16:16:46Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-9sn8d is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-dhx2p is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-wzc67 is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-09T16:16:46Z
    clusteroperator/node-tuning is not available (<missing>) because <missing>
    clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
    clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
    clusteroperator/openshift-samples is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
    clusteroperator/service-ca is not available (<missing>) because <missing>
    clusteroperator/storage is not available (<missing>) because <missing>

[must-gather      ] OUT namespace/openshift-must-gather-6qp27 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-wtxkg created
[must-gather      ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created
[must-gather-l5jvv] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-wtxkg deleted
[must-gather      ] OUT namespace/openshift-must-gather-6qp27 deleted

Error running must-gather collection:
    gather did not start for pod must-gather-l5jvv: timed out waiting for the condition

Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.
Gathering data for ns/openshift-cloud-controller-manager-operator...
Gathering data for ns/openshift-cloud-controller-manager...
Gathering data for ns/openshift-cloud-credential-operator...
Gathering data for ns/openshift-machine-api...
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-etcd-operator...
Gathering data for ns/openshift-etcd...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Gathering data for ns/openshift-kube-scheduler-operator...
Gathering data for ns/openshift-kube-scheduler...
Gathering data for ns/openshift-kube-storage-version-migrator-operator...
Gathering data for ns/openshift-cluster-machine-approver...
Gathering data for ns/openshift-machine-config-operator...
Gathering data for ns/openshift-multus...
Gathering data for ns/openshift-ovn-kubernetes...
Gathering data for ns/openshift-host-network...
Gathering data for ns/openshift-network-diagnostics...
Gathering data for ns/openshift-network-operator...
Gathering data for ns/openshift-cluster-samples-operator...
Wrote inspect data to must-gather.local.685000384472291690/inspect.local.2710222056658389937.

error running backup collection: errors ocurred while gathering data: [
    skipping gathering securitycontextconstraints.security.openshift.io due to error: the server doesn't have a resource type "securitycontextconstraints",
    skipping gathering podnetworkconnectivitychecks.controlplane.operator.openshift.io due to error: the server doesn't have a resource type "podnetworkconnectivitychecks",
    skipping gathering apirequestcounts.apiserver.openshift.io due to error: the server doesn't have a resource type "apirequestcounts",
    skipping gathering namespaces/openshift-kube-storage-version-migrator due to error: namespaces "openshift-kube-storage-version-migrator" not found,
    skipping gathering controllerconfigs.machineconfiguration.openshift.io due to error: the server doesn't have a resource type "controllerconfigs",
    skipping gathering namespaces/openshift-multus due to error: one or more errors ocurred while gathering pod-specific data for namespace: openshift-multus [
        one or more errors ocurred while gathering container data for pod network-metrics-daemon-gpp62: [previous terminated container "network-metrics-daemon" in pod "network-metrics-daemon-gpp62" not found, container "network-metrics-daemon" in pod "network-metrics-daemon-gpp62" is waiting to start: ContainerCreating, container "kube-rbac-proxy" in pod "network-metrics-daemon-gpp62" is waiting to start: ContainerCreating, previous terminated container "kube-rbac-proxy" in pod "network-metrics-daemon-gpp62" not found],
        one or more errors ocurred while gathering container data for pod network-metrics-daemon-lnptv: [container "network-metrics-daemon" in pod "network-metrics-daemon-lnptv" is waiting to start: ContainerCreating, previous terminated container "network-metrics-daemon" in pod "network-metrics-daemon-lnptv" not found, previous terminated container "kube-rbac-proxy" in pod "network-metrics-daemon-lnptv" not found, container "kube-rbac-proxy" in pod "network-metrics-daemon-lnptv" is waiting to start: ContainerCreating],
        one or more errors ocurred while gathering container data for pod network-metrics-daemon-sh2c5: [previous terminated container "network-metrics-daemon" in pod "network-metrics-daemon-sh2c5" not found, container "network-metrics-daemon" in pod "network-metrics-daemon-sh2c5" is waiting to start: ContainerCreating, previous terminated container "kube-rbac-proxy" in pod "network-metrics-daemon-sh2c5" not found, container "kube-rbac-proxy" in pod "network-metrics-daemon-sh2c5" is waiting to start: ContainerCreating]],
    skipping gathering namespaces/openshift-ovn-kubernetes due to error: one or more errors ocurred while gathering pod-specific data for namespace: openshift-ovn-kubernetes [
        one or more errors ocurred while gathering container data for pod ovn-ipsec-8bkvt: [container "ovn-ipsec" in pod "ovn-ipsec-8bkvt" is waiting to start: PodInitializing, previous terminated container "ovn-ipsec" in pod "ovn-ipsec-8bkvt" not found],
        one or more errors ocurred while gathering container data for pod ovn-ipsec-dwv2r: [previous terminated container "ovn-ipsec" in pod "ovn-ipsec-dwv2r" not found, container "ovn-ipsec" in pod "ovn-ipsec-dwv2r" is waiting to start: PodInitializing],
        one or more errors ocurred while gathering container data for pod ovn-ipsec-lwrz2: [container "ovn-ipsec" in pod "ovn-ipsec-lwrz2" is waiting to start: PodInitializing, previous terminated container "ovn-ipsec" in pod "ovn-ipsec-lwrz2" not found]],
    skipping gathering namespaces/openshift-network-diagnostics due to error: one or more errors ocurred while gathering pod-specific data for namespace: openshift-network-diagnostics [
        one or more errors ocurred while gathering container data for pod network-check-target-9hmqn: [container "network-check-target-container" in pod "network-check-target-9hmqn" is waiting to start: ContainerCreating, previous terminated container "network-check-target-container" in pod "network-check-target-9hmqn" not found],
        one or more errors ocurred while gathering container data for pod network-check-target-dsmpc: [container "network-check-target-container" in pod "network-check-target-dsmpc" is waiting to start: ContainerCreating, previous terminated container "network-check-target-container" in pod "network-check-target-dsmpc" not found],
        one or more errors ocurred while gathering container data for pod network-check-target-mcsl4: [previous terminated container "network-check-target-container" in pod "network-check-target-mcsl4" not found, container "network-check-target-container" in pod "network-check-target-mcsl4" is waiting to start: ContainerCreating]],
    skipping gathering configs.samples.operator.openshift.io/cluster due to error: configs.samples.operator.openshift.io "cluster" not found,
    skipping gathering templates.template.openshift.io due to error: the server doesn't have a resource type "templates",
    skipping gathering imagestreams.image.openshift.io due to error: the server doesn't have a resource type "imagestreams"]

Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 7fd28f1d-3ee8-4815-b2a6-ddf935e06199
ClusterVersion: Installing "4.9.38" for 45 minutes: Working towards 4.9.38: 592 of 738 done (80% complete)
ClusterOperators:
    clusteroperator/authentication is not available (<missing>) because <missing>
    clusteroperator/baremetal is not available (<missing>) because <missing>
    clusteroperator/cluster-autoscaler is not available (<missing>) because <missing>
    clusteroperator/config-operator is not available (<missing>) because <missing>
    clusteroperator/console is not available (<missing>) because <missing>
    clusteroperator/csi-snapshot-controller is not available (<missing>) because <missing>
    clusteroperator/dns is not available (<missing>) because <missing>
    clusteroperator/etcd is not available (<missing>) because <missing>
    clusteroperator/image-registry is not available (<missing>) because <missing>
    clusteroperator/ingress is not available (<missing>) because <missing>
    clusteroperator/insights is not available (<missing>) because <missing>
    clusteroperator/kube-apiserver is not available (<missing>) because <missing>
    clusteroperator/kube-controller-manager is not available (<missing>) because <missing>
    clusteroperator/kube-scheduler is not available (<missing>) because <missing>
    clusteroperator/kube-storage-version-migrator is not available (<missing>) because <missing>
    clusteroperator/machine-api is not available (<missing>) because <missing>
    clusteroperator/machine-approver is not available (<missing>) because <missing>
    clusteroperator/machine-config is not available (<missing>) because <missing>
    clusteroperator/marketplace is not available (<missing>) because <missing>
    clusteroperator/monitoring is not available (<missing>) because <missing>
    clusteroperator/network is not available (The network is starting up) because DaemonSet "openshift-ovn-kubernetes/ovn-ipsec" rollout is not making progress - last change 2022-06-09T16:16:46Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-dhx2p is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-wzc67 is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-9sn8d is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-09T16:16:46Z
    clusteroperator/node-tuning is not available (<missing>) because <missing>
    clusteroperator/openshift-apiserver is not available (<missing>) because <missing>
    clusteroperator/openshift-controller-manager is not available (<missing>) because <missing>
    clusteroperator/openshift-samples is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager-catalog is not available (<missing>) because <missing>
    clusteroperator/operator-lifecycle-manager-packageserver is not available (<missing>) because <missing>
    clusteroperator/service-ca is not available (<missing>) because <missing>
    clusteroperator/storage is not available (<missing>) because <missing>

error: gather did not start for pod must-gather-l5jvv: timed out waiting for the condition

Thank you.
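Since the must-gather pod itself cannot start (the pod network is what is broken), the crashlooping network pods can still be inspected directly through the API server. A minimal sketch, assuming the pod and container names shown in the output above:

    oc adm inspect clusteroperator/network
    oc -n openshift-ovn-kubernetes get pods -o wide
    oc -n openshift-ovn-kubernetes logs ovnkube-node-9sn8d -c ovnkube-node --previous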
Created attachment 1888403 [details]
partial must-gather for OCP 4.9.38 zVM install issue

Partial "oc adm must-gather" for OCP 4.9.38 zVM install issue.

Thank you.
ovnkube-node pods fail with:

I0609 16:55:39.356410   37236 gateway_localnet.go:173] Node local addresses initialized to: map[10.129.0.2:{10.129.0.0 fffffe00} 10.20.116.12:{10.20.116.0 ffffff00} 127.0.0.1:{127.0.0.0 ff000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::8808:c4ff:fed3:412:{fe80:: ffffffffffffffff0000000000000000} fe80::943d:51ff:fe27:b2fc:{fe80:: ffffffffffffffff0000000000000000}]
I0609 16:55:39.356500   37236 helper_linux.go:73] Found default gateway interface enc2e0 10.20.116.1
F0609 16:55:39.356532   37236 ovnkube.go:130] could not find IP addresses: failed to lookup link br-ex: Link not found

Kyle,

Which was the last 4.9 build that worked for you?

Prashanth
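For reference, one way to confirm the missing br-ex link directly on an affected node; a sketch, assuming the node names from the cluster above and a reachable kubelet for oc debug:

    oc debug node/master-0.pok-25.ocptest.pok.stglabs.ibm.com
    # inside the debug shell:
    chroot /host
    ip link show br-ex      # fails with 'Device "br-ex" does not exist.' if configure-ovs never created the bridge
    nmcli connection show   # check whether the br-ex / ovs-if-br-ex profiles were created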
Prashanth,

1. The last OCP 4.9 on Z build that installs properly in a zVM environment is this build's predecessor, 4.9.37.
2. The OCP on Z 4.9.37 build was released on June 3, 2022.

Thank you,
Kyle
Thanks, Kyle.

The difference between 4.9.37 and 4.9.38 seems to be https://github.com/openshift/machine-config-operator/pull/3160 in the machine-config-operator, which looks related to what we are seeing.

@jcaamano - could the issue we are hitting be caused by the above change?
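For reference, the component-level difference between two z-stream payloads can be checked with `oc adm release info --commits`; a sketch, assuming the standard s390x release pullspecs for these builds:

    oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.37-s390x | grep machine-config-operator
    oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.38-s390x | grep machine-config-operator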
Yes, it could be. That change has a known issue, with a tentative fix here: https://github.com/openshift/machine-config-operator/pull/3183
Assigning this bug to the Networking team.

Hello Jaime,

Can you please set the blocker flag to '+' or '-' based on your assessment?
I can't be completely sure it is the same issue without a node journal. @krmoser.com, would you be able to provide one?
Setting the target release to 4.9.z and the blocker flag to '+' based on input from comment#5 and comment#7.
Prashanth,

Please let us know where you would like the node journal collected from, and the commands to do so.

Thank you,
Kyle
Hi Kyle,

A "journalctl" on the master nodes should help.

Thanks,
Prashanth
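For example, to capture a full journal per master (a sketch; it assumes SSH access as the core user, and the node names are taken from this report):

    ssh core@master-0.pok-25.ocptest.pok.stglabs.ibm.com 'journalctl --no-pager' > master-0-journal.log
    # or, from a host with oc access, via the kubelet:
    oc adm node-logs master-0.pok-25.ocptest.pok.stglabs.ibm.com > master-0-journal.log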
Created attachment 1889463 [details]
master-0 journalctl logs
Created attachment 1889464 [details]
master-1 journalctl logs
Created attachment 1889465 [details]
master-2 journalctl logs
Thanks @krmoser.com

Looks like something different:

Jun 13 15:41:58 master-0.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1390]: Brought up connection ovs-if-br-ex successfully
Jun 13 15:41:58 master-0.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1390]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes

Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: ((libnm-core/nm-connection.c:186)): assertion '<dropped>' failed
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: <warn>  [1655134916.9590] keyfile: commit: failure to write 13d22672-3c1f-4735-8db5-60cd18e60b8d ((null)) to "/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection": error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com NetworkManager[1370]: <info>  [1655134916.9590] audit: op="connection-update" uuid="13d22672-3c1f-4735-8db5-60cd18e60b8d" name="ovs-if-br-ex" args="connection.autoconnect,connection.timestamp" pid=1785 uid=0 result="fail" reason="failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied"
Jun 13 15:41:56 master-1.pok-96.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1404]: Error: Failed to modify connection 'ovs-if-br-ex': failed to update connection: error writing to file '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': failed rename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection.1E9SN1 to /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection: Permission denied

Could you please provide the ownership and permissions of /etc/NetworkManager/systemConnectionsMerged and /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection, as well as /var/log/audit/audit.log, on any of the nodes?

It looks like there are no write permissions on /etc/NetworkManager/systemConnectionsMerged, but there should be, as that dir is created via /etc/tmpfiles.d/nm.conf containing:

d /etc/NetworkManager/systemConnectionsMerged 0755 root root - -
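For example, the requested details, plus the SELinux labels that turn out to matter here, could be collected on any master with something like (a sketch):

    ls -al  /etc/NetworkManager/systemConnectionsMerged
    ls -alZ /etc/NetworkManager/systemConnectionsMerged   # -Z adds the SELinux security contexts
    ausearch -m AVC -ts recent                            # recent SELinux denials from /var/log/audit/audit.log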
Jaime,

Thanks for the assistance. Here's the requested information. There is no /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection file.

1. /etc/NetworkManager/systemConnectionsMerged :
======================================================================================================================================
[root@master-2 ~]# ls -al /etc/NetworkManager/systemConnectionsMerged
total 4
drwxr-xr-x. 1 root root 140 Jun 13 15:41 .
drwxr-xr-x. 8 root root 165 Jun 13 15:41 ..
-rw-------. 1 root root 406 Jun 13 15:40 default_connection.nmconnection
[root@master-2 ~]#

2. /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection :
======================================================================================================================================
[root@master-2 ~]# ls -al /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection
ls: cannot access '/etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection': No such file or directory
[root@master-2 ~]#

3. /var/log/audit/audit.log :
======================================================================================================================================
[root@master-2 ~]# ls -al /var/log/audit/audit.log
-rw-------. 1 root root 92813 Jun 13 20:15 /var/log/audit/audit.log
[root@master-2 ~]#

Thank you,
Kyle
So it looks like the overlay /etc/NetworkManager/systemConnectionsMerged is not registered with SELinux as a valid location for NetworkManager to manage its connection profiles. If we manually copy a profile there and run restorecon, which we do when trying to copy a static IP configuration, it ends up with scontext NetworkManager_t instead of the expected NetworkManager_var_run_t or NetworkManager_etc_rw_t.

We recently introduced a change with https://github.com/openshift/machine-config-operator/pull/3160 that attempts to configure something after this copy through nmcli, and that step fails.

Trying to work around it with https://github.com/openshift/machine-config-operator/pull/3188 by using `nmcli clone` instead of a manual copy. Could you give it a shot?

Marking as a duplicate of 2095264.

*** This bug has been marked as a duplicate of bug 2095264 ***
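For illustration, the difference between the two approaches (a simplified sketch, not the literal MCO change; paths and profile names are examples):

    # manual copy: the file is created by cp/restorecon rather than by
    # NetworkManager, so it carries an unexpected SELinux context and
    # NetworkManager's later updates fail with Permission denied
    cp /etc/NetworkManager/system-connections/static-ip.nmconnection \
       /etc/NetworkManager/systemConnectionsMerged/
    restorecon /etc/NetworkManager/systemConnectionsMerged/static-ip.nmconnection

    # nmcli clone: NetworkManager itself writes the new profile, so it gets
    # the label NetworkManager expects and subsequent 'nmcli c mod' calls succeed
    nmcli connection clone static-ip static-ip-cloned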
Folks,

Please let us know when an OCP 4.9.38 on Z successor build containing the proposed fix is available to test.

Thank you,
Kyle
Folks,

It appears that the same issue exists in this week's publicly available OCP 4.9 post-GA build, 4.9.39.

Thank you,
Kyle
Folks,

The OCP on Z Solution Test team has successfully tested the following OCP 4.9 on Z builds for both connected and disconnected installs:

1. OCP 4.9.40
2. OCP 4.9.41

Thank you,
Kyle