Note:
1. If you are dealing with Machines or MachineSet objects, please select the component as "Cloud Compute" under the same product.
2. If you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under the same product.

Description of problem:

After deploying OCP on a ppc64le environment, when I run:

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              2                   2                     0                      18h

I expect READYMACHINECOUNT to be 3, but it is 2.

Then I stopped kubelet.service on one of the worker nodes. The count became:

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              1                   2                     0                      18h

READYMACHINECOUNT became 1. When I started the service back, the count went back to 2.

Version-Release number of MCO (Machine Config Operator) (if applicable): 4.11

Platform (AWS, VSphere, Metal, etc.): IBM Power

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Yes, as I am looking at MachineConfigPool.

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:
1. As described above.
2.
3.

Actual results:

Expected results:

Additional info:
1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.
2. If a must-gather is unavailable, please provide the output of:
   $ oc get co machine-config -o yaml
   $ oc get mcp (and oc describe mcp/${degraded_pool} if pools are degraded)
   $ oc get mc
   $ oc get pod -n openshift-machine-config-operator
   $ oc get node -o wide
3. If a node is not accessible via the API, please provide console/journal/kubelet logs of the problematic node.
4. Are there RHEL nodes on the cluster? If yes, please upload the whole Ansible logs or Jenkins job.
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-06-20T18:26:39Z"
  generation: 1
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: f588cffc-ca69-471a-b3e3-ae10d9389af9
  resourceVersion: "195134"
  uid: 192f78c1-88a5-431a-9fb8-37588daaece0
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-06-20T18:33:05Z"
    message: Cluster version is 4.11.0-0.nightly-ppc64le-2022-06-16-003709
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-06-20T18:50:36Z"
    message: One or more machine config pools are updating, please see `oc get mcp` for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-06-21T03:36:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-06-21T03:36:16Z"
    message: Cluster has deployed [{operator 4.11.0-0.nightly-ppc64le-2022-06-16-003709}]
    status: "True"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-627087f67ace055b92ca4c26488fdeb7
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  - group: ""
    name: ""
    resource: nodes
  - group: ""
    name: openshift-kni-infra
    resource: namespaces
  - group: ""
    name: openshift-openstack-infra
    resource: namespaces
  - group: ""
    name: openshift-ovirt-infra
    resource: namespaces
  - group: ""
    name: openshift-vsphere-infra
    resource: namespaces
  - group: ""
    name: openshift-nutanix-infra
    resource: namespaces
  versions:
  - name: operator
    version: 4.11.0-0.nightly-ppc64le-2022-06-16-003709
[root@rdr-sri-7cfc-tok04-bastion-0 ~]#

Nothing is degraded.
[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85   True      False      False      3              3                   3                     0                      19h
worker   rendered-worker-627087f67ace055b92ca4c26488fdeb7   False     True       False      3              2                   2                     0                      19h

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
00-worker                                          0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-master-container-runtime                        0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-master-kubelet                                  0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-worker-container-runtime                        0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
01-worker-kubelet                                  0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-master-chrony-configuration                                                                2.2.0             19h
99-master-generated-registries                     0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-master-ssh                                                                                 3.2.0             19h
99-worker-chrony-configuration                                                                2.2.0             19h
99-worker-generated-registries                     0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
99-worker-ssh                                                                                 3.2.0             19h
rendered-master-c87dd4684c54c74ee43bf6cb07a4ff85   0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
rendered-worker-627087f67ace055b92ca4c26488fdeb7   0bba861a74209e78177612789983183cdc77f1a5   3.2.0             19h
[root@rdr-sri-7cfc-tok04-bastion-0 ~]#

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS             RESTARTS          AGE   IP               NODE                                  NOMINATED NODE   READINESS GATES
machine-config-controller-6c66f4d5f5-ncm5f   2/2     Running            1 (19h ago)       19h   10.129.0.11      tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-52nvt                  2/2     Running            0                 19h   192.168.36.167   tok04-worker-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-l2f9r                  1/2     CrashLoopBackOff   230 (2m47s ago)   19h   192.168.36.100   tok04-worker-2.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-l66fl                  2/2     Running            0                 19h   192.168.36.10    tok04-master-2.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-p4dct                  2/2     Running            0                 19h   192.168.36.52    tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-rhtxr                  2/2     Running            0                 19h   192.168.36.203   tok04-worker-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-daemon-x699v                  2/2     Running            0                 19h   192.168.36.78    tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-operator-8475fdd878-prkwj     1/1     Running            1 (19h ago)       19h   10.128.0.11      tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-d5wpl                  1/1     Running            0                 19h   192.168.36.78    tok04-master-0.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-jvp6d                  1/1     Running            0                 19h   192.168.36.52    tok04-master-1.rdr-sri-7cfc.ibm.com   <none>           <none>
machine-config-server-mqzqq                  1/1     Running            0                 19h   192.168.36.10    tok04-master-2.rdr-sri-7cfc.ibm.com   <none>           <none>
[root@rdr-sri-7cfc-tok04-bastion-0 ~]#

Logs from the failing pod:

[root@rdr-sri-7cfc-tok04-bastion-0 ~]# oc logs machine-config-daemon-l2f9r -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0621 14:21:32.442851 1404020 start.go:112] Version: v4.11.0-202206142048.p0.g0bba861.assembly.stream-dirty (0bba861a74209e78177612789983183cdc77f1a5)
I0621 14:21:32.444651 1404020 start.go:125] Calling chroot("/rootfs")
I0621 14:21:32.445279 1404020 update.go:1962] Running: systemctl start rpm-ostreed
I0621 14:21:37.496328 1404020 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0621 14:21:37.542192 1404020 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0621 14:21:37.587780 1404020 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0621 14:21:37.598870 1404020 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0621 14:21:37.600049 1404020 start.go:189] overriding kubernetes api to https://api-int.rdr-sri-7cfc.ibm.com:6443
I0621 14:21:37.600638 1404020 metrics.go:106] Registering Prometheus metrics
I0621 14:21:37.600731 1404020 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0621 14:21:37.605236 1404020 daemon.go:971] Detected a new login session: New session 1 of user core.
I0621 14:21:37.605248 1404020 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 66 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc0002ac400?)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc
[root@rdr-sri-7cfc-tok04-bastion-0 ~]#
There are a few things to unpack here, so let's go through them one at a time:

1. The "platform-none" sub-component is specifically for vSphere/bare-metal non-platformed deployments (and the corresponding templates), and I think this is core MCO, so let's move that sub-component. I'm not sure what differences exist for Power in terms of the MCO.

2. In your initial assessment it sounds like one of your three worker nodes is not ready for some reason, and then you SSH'ed into a node. Which node did you SSH into? Was it tok04-worker-2.rdr-sri-7cfc.ibm.com? If so, given that READYMACHINECOUNT dropped when you stopped kubelet, I think you SSH'ed into one of the good nodes.

3. For the failing pod, what is happening is that the MCD is trying to apply an SSH annotation because the node has been accessed, but the daemon does not yet have an initialized node writer, so it panicked while doing so. This is the first time I've ever seen this happen. Is this node the one you SSH'ed into earlier to stop kubelet? If so, this should be a different issue from the one you initially reported, and it was only triggered because you SSH'ed into the node. We still don't know what the original issue is based on the existing information.

4. If the above assessment is correct, there are two issues with this cluster: the panic due to the SSH annotation failing to be written, and the original node being NotReady. I would probably need a full must-gather from this cluster as well as other context (what this cluster is used for, what customizations are applied, what is special about ppc64le, etc.).

Marking this as not a blocker since it should be very rare and generally not a scenario you would get into. I tested with a regular 4.11 nightly on GCP and SSH seems fine (both the regular annotation write and a forced restart into a boot annotation write as above).
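To make point 3 concrete, here is a minimal, self-contained Go sketch of the failure mode: a login-monitor goroutine fires before the daemon's node writer has been initialized and dereferences a nil pointer. The type and method names (daemon, nodeWriter, SetAnnotation) are illustrative assumptions, not the real MCO identifiers; running it produces a nil-pointer panic analogous to the one in the MCD log.

// Minimal sketch (not the actual MCO source) of a login monitor goroutine
// hitting a nil pointer dereference because it fires before the daemon has
// finished initializing its node writer.
package main

import (
	"fmt"
	"time"
)

// nodeWriter stands in for the component that writes node annotations.
type nodeWriter struct {
	annotations map[string]string
}

func (w *nodeWriter) SetAnnotation(key, value string) {
	// Dereferences w; panics with a nil pointer dereference if w is nil.
	w.annotations[key] = value
}

// daemon stands in for the MCD; writer stays nil until initialization completes.
type daemon struct {
	writer *nodeWriter
}

// runLoginMonitor reacts to login sessions; the "session detected" event is
// simulated with a channel in this sketch.
func (d *daemon) runLoginMonitor(sessions <-chan string) {
	go func() {
		for s := range sessions {
			fmt.Println("Detected a new login session:", s)
			// d.writer is still nil here, so this call crashes the process,
			// just like the reported stack trace.
			d.writer.SetAnnotation("machineconfiguration.openshift.io/ssh", "accessed")
		}
	}()
}

func main() {
	d := &daemon{} // writer not initialized yet

	sessions := make(chan string, 1)
	d.runLoginMonitor(sessions)

	// An SSH login is observed before initialization finishes...
	sessions <- "New session 1 of user core."

	// ...and the monitor goroutine panics before the daemon ever sets d.writer.
	time.Sleep(100 * time.Millisecond)
}

Because the window between the monitor starting and the writer being initialized is short, only a node that happens to record a login session inside exactly that window would crash, which would be consistent with only one of the MCD pods hitting the panic.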
After deploying OCP 4.11.0-0.nightly-ppc64le-2022-06-21-133338 on a ppc64le environment, the issue is seen with the MCP.

[root@rdr-shw4-1385-bastion-0 ~]# oc version
Client Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
Kubernetes Version: v1.24.0+284d62a

[root@rdr-shw4-1385-bastion-0 ~]# oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                           NAME                          READY   STATUS             RESTARTS          AGE
openshift-kube-apiserver            installer-8-master-1          0/1     Error              0                 18h
openshift-machine-config-operator   machine-config-daemon-bttgz   1/2     CrashLoopBackOff   215 (109s ago)    18h
openshift-machine-config-operator   machine-config-daemon-j6sbw   1/2     CrashLoopBackOff   215 (3m31s ago)   18h

[root@rdr-shw4-1385-bastion-0 ~]# oc describe pod machine-config-daemon-bttgz -n openshift-machine-config-operator
Name:                 machine-config-daemon-bttgz
Namespace:            openshift-machine-config-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 worker-0/9.114.99.175
Start Time:           Thu, 23 Jun 2022 07:02:03 -0400
Labels:               controller-revision-hash=864967879d
                      k8s-app=machine-config-daemon
                      pod-template-generation=1
Annotations:          openshift.io/scc: privileged
Status:               Running
IP:                   9.114.99.175
IPs:
  IP:           9.114.99.175
Controlled By:  DaemonSet/machine-config-daemon
Containers:
  machine-config-daemon:
    Container ID:  cri-o://3e33796ae57621dc1c501bc1f73fd7f28e8c9b7263dcc7c226f787df80c1d244
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/machine-config-daemon
    Args:
      start
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0624 05:18:36.510514 1321011 start.go:112] Version: v4.11.0-202206211257.p0.g37b7416.assembly.stream-dirty (37b741601f9b7ff9b2e1870102cc2970b24e1835)
I0624 05:18:36.513320 1321011 start.go:125] Calling chroot("/rootfs")
I0624 05:18:36.513388 1321011 update.go:1962] Running: systemctl start rpm-ostreed
I0624 05:18:42.277171 1321011 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:18:42.323875 1321011 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:18:42.370453 1321011 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0624 05:18:42.384526 1321011 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0624 05:18:42.386124 1321011 start.go:189] overriding kubernetes api to https://api-int.rdr-shw4-1385.redhat.com:6443
I0624 05:18:42.386924 1321011 metrics.go:106] Registering Prometheus metrics
I0624 05:18:42.387025 1321011 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0624 05:18:42.392218 1321011 daemon.go:971] Detected a new login session: New session 1 of user core.
I0624 05:18:42.392234 1321011 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 20 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc000250400?)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc

      Exit Code:    2
      Started:      Fri, 24 Jun 2022 01:18:36 -0400
      Finished:     Fri, 24 Jun 2022 01:18:42 -0400
    Ready:          False
    Restart Count:  215
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /rootfs from rootfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dxvbr (ro)
  oauth-proxy:
    Container ID:  cri-o://9e75d0f269b86c2e985a7260f3c670d0f7bf9ac4b9d7b6be46cd1399c3647458
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:78f76006f0a7ed69f40c52398ea7f2a5a86544c7fcefd4c86f8d77aa6182f879
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:78f76006f0a7ed69f40c52398ea7f2a5a86544c7fcefd4c86f8d77aa6182f879
    Port:          9001/TCP
    Host Port:     9001/TCP
    Args:
      --https-address=:9001
      --provider=openshift
      --openshift-service-account=machine-config-daemon
      --upstream=http://127.0.0.1:8797
      --tls-cert=/etc/tls/private/tls.crt
      --tls-key=/etc/tls/private/tls.key
      --cookie-secret-file=/etc/tls/cookie-secret/cookie-secret
      --openshift-sar={"resource": "namespaces", "verb": "get"}
      --openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
    State:          Running
      Started:      Thu, 23 Jun 2022 07:03:23 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:  <none>
    Mounts:
      /etc/tls/cookie-secret from cookie-secret (rw)
      /etc/tls/private from proxy-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dxvbr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rootfs:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  proxy-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  proxy-tls
    Optional:    false
  cookie-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cookie-secret
    Optional:    false
  kube-api-access-dxvbr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   14m (x214 over 18h)     kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb6f3cef8897c23632e1070264910d1b7371b0c2e5ad91483ff07b13d40ac27" already present on machine
  Warning  BackOff  4m28s (x4968 over 18h)  kubelet  Back-off restarting failed container

[root@rdr-shw4-1385-bastion-0 ~]# oc logs machine-config-daemon-bttgz -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0624 05:34:15.467779 1339582 start.go:112] Version: v4.11.0-202206211257.p0.g37b7416.assembly.stream-dirty (37b741601f9b7ff9b2e1870102cc2970b24e1835)
I0624 05:34:15.470217 1339582 start.go:125] Calling chroot("/rootfs")
I0624 05:34:15.470311 1339582 update.go:1962] Running: systemctl start rpm-ostreed
I0624 05:34:20.899845 1339582 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:34:20.956738 1339582 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0624 05:34:21.011356 1339582 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c42269031936a32eb36410d501b1e7e87b6ff72047f1209bc0b9544c46cfd841 (411.86.202206141021-0)
I0624 05:34:21.025216 1339582 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0624 05:34:21.027056 1339582 start.go:189] overriding kubernetes api to https://api-int.rdr-shw4-1385.redhat.com:6443
I0624 05:34:21.027965 1339582 metrics.go:106] Registering Prometheus metrics
I0624 05:34:21.028061 1339582 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0624 05:34:21.033329 1339582 daemon.go:971] Detected a new login session: New session 1 of user core.
I0624 05:34:21.033345 1339582 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cc30]

goroutine 37 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc000740400?)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc
(In reply to shweta from comment #3)
> After deploying OCP 4.11.0-0.nightly-ppc64le-2022-06-21-133338 on a ppc64le
> environment, the issue is seen with the MCP.
>
> [root@rdr-shw4-1385-bastion-0 ~]# oc version
> Client Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
> Kustomize Version: v4.5.4
> Server Version: 4.11.0-0.nightly-ppc64le-2022-06-21-133338
> Kubernetes Version: v1.24.0+284d62a

Must-gather log link: https://drive.google.com/file/d/1UXhTEufxwVATmNJo9MxrHkx7rbaHtzTE/view?usp=sharing
@sbiragda Thanks for providing the logs. This problem is blocking us from deploying ODF on OCP.
Issue with MCP is seen on the 4.11.0-fc.3 build as well.

[root@rdr-shw4-e184-bastion-0 ~]# oc version
Client Version: 4.11.0-fc.3
Kustomize Version: v4.5.4
Server Version: 4.11.0-fc.3
Kubernetes Version: v1.24.0+284d62a

[root@rdr-shw4-e184-bastion-0 ~]# oc get pods -A | grep -v "Running\|Completed"
NAMESPACE                           NAME                          READY   STATUS             RESTARTS         AGE
openshift-machine-config-operator   machine-config-daemon-4ztc5   1/2     Error              57 (5m20s ago)   4h32m
openshift-machine-config-operator   machine-config-daemon-9fnh5   1/2     CrashLoopBackOff   57 (21s ago)     4h32m

[root@rdr-shw4-e184-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0f88e229ebe0dcd9249a3f56216ef41d   True      False      False      3              3                   3                     0                      4h46m
worker   rendered-worker-58ade7fb1aa863df1fe5ee1680df0ef0   False     True       False      2              0                   0                     0                      4h46m
>> I0624 05:34:21.033329 1339582 daemon.go:971] Detected a new login session: New session 1 of user core.
>> I0624 05:34:21.033345 1339582 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh

I don't understand why there is a need to ssh into a machine while the cluster is being created. Jerry has added some initial thoughts in comment #2. The MCO team doesn't have access to ppc64le hardware to create a cluster and reproduce this issue, so I will let the multi-arch team assess this problem first to understand the root cause.
Like Sinny said, generally you shouldn't be SSH'ing into nodes in OCP clusters. Where is the SSH coming from (if there is an SSH at all), and could it be avoided?
The SSH mentioned in this test is only to demonstrate that the node is accessible. We do not SSH into the nodes until the OCP deployment is complete. And after the deployment is over, we find the MachineConfigPool not reporting the right number in the READYMACHINECOUNT column. This is causing problems for deploying ODF (the prevalidation step of the automated ODF deployment looks for the right number in the READY column and quits otherwise).
I just deployed a cluster. With no SSH into any nodes, I saw this:

[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-4a60f72cc2342ef860c5c7bdb0d07de1   False     True       False      3              2                   2                     0                      27m

READYMACHINECOUNT is 2 instead of 3.

[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                        READY   STATUS             RESTARTS        AGE   IP              NODE                                    NOMINATED NODE   READINESS GATES
machine-config-controller-76975949b-vq6tl   2/2     Running            1 (23m ago)     27m   10.128.0.8      sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-2vz9d                 2/2     Running            0               11m   192.168.0.56    sao01-worker-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-4c6l6                 2/2     Running            0               11m   192.168.0.224   sao01-worker-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-6xq4m                 2/2     Running            0               28m   192.168.0.31    sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-7rts8                 1/2     CrashLoopBackOff   7 (2m54s ago)   14m   192.168.0.159   sao01-worker-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-cggl2                 2/2     Running            0               28m   192.168.0.64    sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-daemon-wm8jt                 2/2     Running            0               28m   192.168.0.226   sao01-master-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-operator-647c8d6c55-hbdvq    1/1     Running            1 (22m ago)     32m   10.130.0.9      sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-8s8gv                 1/1     Running            0               27m   192.168.0.64    sao01-master-1.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-p2wn6                 1/1     Running            0               27m   192.168.0.226   sao01-master-2.nx124-411-4f09.ibm.com   <none>           <none>
machine-config-server-zrlfk                 1/1     Running            0               27m   192.168.0.31    sao01-master-0.nx124-411-4f09.ibm.com   <none>           <none>
[root@nx124-411-4f09-sao01-bastion-0 ~]#

[root@nx124-411-4f09-sao01-bastion-0 ~]# oc logs machine-config-daemon-7rts8 -n openshift-machine-config-operator
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0627 16:10:25.488671   18800 start.go:112] Version: v4.11.0-202206230535.p0.ge595ae9.assembly.stream-dirty (e595ae9c763e14213450d77f557bc01643db9fe3)
I0627 16:10:25.490672   18800 start.go:125] Calling chroot("/rootfs")
I0627 16:10:25.490768   18800 update.go:1962] Running: systemctl start rpm-ostreed
I0627 16:10:28.638650   18800 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0627 16:10:28.685424   18800 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0627 16:10:28.735944   18800 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:adb4ae15df5e062e1e7ff35324b98ddf3be2372ab88511464900887ef8f74343 (411.86.202206231243-0)
I0627 16:10:28.746099   18800 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0627 16:10:28.747021   18800 start.go:189] overriding kubernetes api to https://api-int.nx124-411-4f09.ibm.com:6443
I0627 16:10:28.747642   18800 metrics.go:106] Registering Prometheus metrics
I0627 16:10:28.747754   18800 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0627 16:10:28.752686   18800 daemon.go:971] Detected a new login session: New session 1 of user core.
I0627 16:10:28.752699   18800 daemon.go:972] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x125cd60]

goroutine 84 [running]:
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).applySSHAccessedAnnotation(0xc00048a400?)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:986 +0x30
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor.func1()
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:973 +0x18c
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).runLoginMonitor
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:954 +0x1fc
[root@nx124-411-4f09-sao01-bastion-0 ~]#
[root@nx124-411-4f09-sao01-bastion-0 ~]# oc get nodes
NAME                                    STATUS   ROLES    AGE   VERSION
sao01-master-0.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-master-1.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-master-2.nx124-411-4f09.ibm.com   Ready    master   31m   v1.24.0+284d62a
sao01-worker-0.nx124-411-4f09.ibm.com   Ready    worker   13m   v1.24.0+284d62a
sao01-worker-1.nx124-411-4f09.ibm.com   Ready    worker   12m   v1.24.0+284d62a
sao01-worker-2.nx124-411-4f09.ibm.com   Ready    worker   15m   v1.24.0+284d62a
[root@nx124-411-4f09-sao01-bastion-0 ~]#
Sridhar,

Does the deployment automation ssh into the nodes by any chance?

Prashanth
We use ocp4-upi-powervs to deploy the infrastructure and OCP. I am asking the owner of ocp4-upi-powervs whether there is any SSH operation on the nodes; I doubt it, but I will update this bug with an answer. Note that not all machine-config-daemon pods are in CrashLoopBackOff state. Only one of them crashed.
(In reply to Sridhar Venkat (IBM) from comment #6)
> @sbiragda Thanks for providing the logs. This problem is blocking us from
> deploying ODF on OCP.

This is a blocking bug for the Power QE team too. We are hitting this issue with all recent builds of 4.11, including fc.3. It is blocking the completion of a few regression tests for 4.11.
> Note that not all machine-config-daemon pods are in CrashLoopBackOff state. Only one of them crashed.

That is the weirdest part for me. If this were a purely infra/MCD issue, all 3 MCDs should be crashing with the same error, as they are doing the exact same thing. So far this bug points to only a single node having some sort of difference.

The check that the MCD applies is based on the output of:

journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66

If you can run that command on all 3 nodes and check for diffs, that would also help us debug the issue.

Again, I cannot reproduce this on a normal cluster. I have only seen this error before when running the MCO tests OUTSIDE of a cluster environment, since there is no node. For the MCD to have gotten to the point of that error, the node should have been initialized by then.
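For context, a daemon can watch systemd-logind for new sessions roughly like the sketch below, which simply follows the journalctl invocation quoted above and reacts to each new line. This is an illustrative approximation under stated assumptions (the watchLoginSessions helper and its structure are made up for demonstration), not the MCO's actual implementation.

// Illustrative sketch (not the actual MCO code) of following systemd-logind
// journal entries for new login sessions.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
)

// watchLoginSessions follows the systemd-logind journal for the "new session"
// message ID and invokes onSession for every line that appears.
func watchLoginSessions(onSession func(line string)) error {
	cmd := exec.Command("journalctl", "-b", "-f", "-o", "cat",
		"-u", "systemd-logind.service",
		"MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66")

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		onSession(scanner.Text())
	}
	return cmd.Wait()
}

func main() {
	err := watchLoginSessions(func(line string) {
		// In the real daemon, this is roughly the point at which the
		// SSH-accessed annotation would be applied to the node object.
		fmt.Println("Detected a new login session:", line)
	})
	if err != nil {
		log.Fatal(err)
	}
}

Because -f keeps the journal stream open, the process reacts to logins as they happen; an existing "New session 1 of user core." entry from the current boot is replayed immediately at startup, which matches the MCD log lines quoted earlier.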
ocp4-upi-powervs calls ocp4-playbooks to perform the OCP installation. During the installation, ocp4-playbooks will try to ssh to all cluster nodes to do some configuration. If the playbook cannot ssh to the cluster nodes, it will time out on ssh and exit the installation with an error. If the playbook completed without error, that means it is not the playbook's problem; it is OCP's problem.
Thanks, ChongShi, for your comment. The deployment completed successfully without any problems. And as you can see above, I could also manually SSH into one of the nodes successfully.
I reproduced the problem, and the pod is crashing on worker-0. I tried to run

journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c6

on all three nodes; it hangs forever. No output on any of the worker nodes even after 5 minutes.
BTW, on the same environment, I deployed another cluster and am not able to reproduce the problem. On a healthy node, I tried the journalctl command:

[core@sao01-worker-0 ~]$ journalctl -b -f -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66
New session 1 of user core.

and then it hangs forever. In this environment, at least it is printing 'New session 1 of user core.'. In the unhealthy cluster above, this line is not present (the command does not print anything and hangs).
Sorry, I put an additional -f in there, which means it is following (it's not hanging, it's listening for new ssh's). The actual command should be:

journalctl -b -o cat -u systemd-logind.service MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66

In the case of it "hanging", it just means there were no additional ssh's. The "New session 1 of user core." indicates that, for that specific node, an ssh did happen, likely prior to you running the command.

> ocp4-playbook will try to ssh to all cluster nodes to do some configuration

May I know what this configuration is? And after the nodes have applied the configuration, do the nodes get rebooted, etc.? Generally in OCP 4 we point people not to ssh into nodes.

I checked our code and we introduced a race condition in 4.11 for this code, so I will fix that ASAP.
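To make the race concrete, the crash can be avoided by making sure the login monitor never dereferences a node writer that has not been set up yet. The sketch below shows one way to do that: waiting on an initialization signal before reacting to login events. This is only an illustration of the general fix pattern, with made-up names; it is not the actual patch that landed in the MCO (see the backport BZ referenced later for that).

// Illustrative sketch of the general fix pattern: block the login monitor
// until the daemon's node writer is initialized. Names are assumptions;
// this is not the actual MCO patch.
package main

import (
	"fmt"
	"time"
)

type nodeWriter struct {
	annotations map[string]string
}

func (w *nodeWriter) SetAnnotation(key, value string) {
	w.annotations[key] = value
	fmt.Printf("annotated node: %s=%s\n", key, value)
}

type daemon struct {
	writer      *nodeWriter
	initialized chan struct{} // closed once writer is ready
}

func (d *daemon) runLoginMonitor(sessions <-chan string) {
	go func() {
		for s := range sessions {
			fmt.Println("Detected a new login session:", s)
			// Block until initialization is done instead of racing it.
			<-d.initialized
			if d.writer == nil {
				fmt.Println("node writer unavailable, skipping annotation")
				continue
			}
			d.writer.SetAnnotation("machineconfiguration.openshift.io/ssh", "accessed")
		}
	}()
}

func main() {
	d := &daemon{initialized: make(chan struct{})}

	sessions := make(chan string, 1)
	d.runLoginMonitor(sessions)

	// A login arrives before initialization finishes; the monitor now waits
	// instead of dereferencing a nil writer.
	sessions <- "New session 1 of user core."

	// Finish initialization, then signal the monitor.
	d.writer = &nodeWriter{annotations: map[string]string{}}
	close(d.initialized)

	time.Sleep(100 * time.Millisecond) // give the goroutine time to run
}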
@jerzhang I am not sure if the nodes are restarted during the deployment. @chonzhan can answer this question. Let me know when the race condition is fixed, and we will attempt another deployment.
The ocp4-playbooks will ssh to the cluster nodes to set the MTU, to work around an ibmveth network adapter bug; no reboot is performed.
FYI: two subsequent deployments since early this morning have been good so far.
Thanks for the responses.

> FYI: two subsequent deployments since early this morning have been good so far.

This is a race condition, so hopefully it won't happen too often. As a workaround when it does happen, you can try removing the ssh journal entry.

> Let me know when the race condition is fixed, and we will attempt another deployment.

Currently we are blocked on CI issues due to branching. We will fix this in 4.12 and backport to 4.11, since code freeze for 4.11.0 is in effect.
Thanks, @jerzhang. We will wait. We saw this problem last night.
@jerzhang Is this problem fixed? I do not see new builds in https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp-dev-preview/latest-4.11/. It has install images as of June 23rd. We saw the problem this morning.
(In reply to Sridhar Venkat (IBM) from comment #28) > @jerzhang Is this problem fixed? I do not see new builds in > https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp-dev- > preview/latest-4.11/. It has install images as of June 23rd. We saw the > problem today morning. Just fYI: @jerzhang @svenkat OCP QE on Power team was able to reproduce the issue with 4.11 rc.0 build dated June 30-> https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp/4.11.0-rc.0/ Waiting for the next rc build, for re-testing and verifying the fix.
This has not been fixed for 4.11. The fix version is in the BZ field above; it is now fixed in the 4.12 branches. I will begin the backport-to-4.11 process.
@jerzhang Do you have an outlook for completing the backport, and is there a different BZ (or PR) where I can see when it completes? Do you know if it is targeted for OCP 4.11 GA or a future OCP 4.11 z-stream post-GA?
The latest feature candidate has the fix: https://ppc64le.ocp.releases.ci.openshift.org/releasestream/4-stable-ppc64le/release/4.11.0-rc.2
@lmcfadde backport BZ https://bugzilla.redhat.com/show_bug.cgi?id=2103749
I see only 4.11.0-rc.1 under https://mirror.openshift.com/pub/openshift-v4/ppc64le/clients/ocp/. Will rc.2 be available here?
Until it is available on the mirror, you can test with the image I pointed to above: quay.io/openshift-release-dev/ocp-release:4.11.0-rc.2-ppc64le
Issue with MCP is not seen on the 4.11.0-rc.2 build.

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc version
Client Version: 4.11.0-rc.2
Kustomize Version: v4.5.4
Server Version: 4.11.0-rc.2
Kubernetes Version: v1.24.0+9546431

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-44d47954f24666349149b34be4fdcbf9   True      False      False      3              3                   3                     0                      55m
worker   rendered-worker-495bb191d4d14a21b89733276315bf3b   True      False      False      2              2                   2                     0                      55m

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get pod -n openshift-machine-config-operator -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP                NODE                                            NOMINATED NODE   READINESS GATES
machine-config-controller-dbcc464ff-n7fjv   2/2     Running   0          68m   10.128.0.10       tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-57wbb                 2/2     Running   0          51m   193.168.200.80    tor01-worker-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-bm22f                 2/2     Running   0          69m   193.168.200.51    tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-fwt7b                 2/2     Running   0          64m   193.168.200.221   tor01-master-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-hhx2b                 2/2     Running   0          50m   193.168.200.230   tor01-worker-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-daemon-v678k                 2/2     Running   0          69m   193.168.200.3     tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-operator-86c57dd94c-cmnwb    1/1     Running   0          73m   10.129.0.2        tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-jmm6d                 1/1     Running   0          64m   193.168.200.221   tor01-master-0.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-lxkps                 1/1     Running   0          68m   193.168.200.3     tor01-master-1.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>
machine-config-server-xhqdd                 1/1     Running   0          68m   193.168.200.51    tor01-master-2.rdr-ci-ocp-upi-a7bf.redhat.com   <none>           <none>

[root@rdr-ci-ocp-upi-a7bf-tor01-bastion-0 ~]# oc get pods -A | grep -v "Running\|Completed"
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE
We have not seen this problem for the past four days.
Issue is NOT reproduced on an OCP 4.12 cluster on the IBM Power platform. All nodes, COs, and pods are in a good state.

Build used: 4.12.0-0.nightly-ppc64le-2022-07-07-144302

# oc version
Client Version: 4.12.0-0.nightly-ppc64le-2022-07-07-144302
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-ppc64le-2022-07-07-144302
Kubernetes Version: v1.24.0+bd7662a

# oc get node
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   46m   v1.24.0+2dd8bb1
master-1   Ready    master   46m   v1.24.0+2dd8bb1
master-2   Ready    master   46m   v1.24.0+2dd8bb1
worker-0   Ready    worker   32m   v1.24.0+2dd8bb1
worker-1   Ready    worker   32m   v1.24.0+2dd8bb1

[root@rdr-shw-412-ba82-bastion-0 ~]# oc get co
NAME                                       VERSION                                      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      23m
cloud-controller-manager                   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      46m
cloud-credential                           4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      48m
cluster-autoscaler                         4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
config-operator                            4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      45m
console                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      28m
csi-snapshot-controller                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
dns                                        4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
etcd                                       4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      41m
image-registry                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      20m
ingress                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      31m
insights                                   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      38m
kube-apiserver                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      39m
kube-controller-manager                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      42m
kube-scheduler                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      41m
kube-storage-version-migrator              4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
machine-api                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
machine-approver                           4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
machine-config                             4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      42m
marketplace                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
monitoring                                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      28m
network                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      45m
node-tuning                                4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
openshift-apiserver                        4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
openshift-controller-manager               4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
openshift-samples                          4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
operator-lifecycle-manager                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      43m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      36m
service-ca                                 4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m
storage                                    4.12.0-0.nightly-ppc64le-2022-07-07-144302   True        False         False      44m

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-3b0ed0ab0a7b81be3e884db08c4062c5   True      False      False      3              3                   3                     0                      48m
worker   rendered-worker-69e5eb440d1ea81cf0fb56566091c996   True      False      False      2              2                   2                     0                      48m

# oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE
Moving this BZ to VERIFIED.
Was on PTO. I see that the request has been answered and the BZ is verified. Let me know if anything else is needed from me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399