Description of problem:

After installing a cluster on Gen10 Plus hardware, the customer (Nokia-NOM) finds that the baremetal cluster operator is sporadically stuck in the Progressing state. From our analysis, the installation is actually complete: the nodes and the rest of the cluster operators are Available and not Degraded.

- The cluster-baremetal-operator (top-level) logs no errors as such, only some warnings (worth looking at, as they might have caused the status to NOT be updated). [1]
- The metal3-baremetal-operator (2nd-level) pod also logs no errors once info-level messages are filtered out.
- I checked the other resources managed by the bare metal operator (machines, BMH, and the provisioning CR) and do not see any problem there. [2]
- It is possible that the cluster-baremetal-operator has synced the resources but failed to report that back to the cluster-version-operator.
- Logs from the cluster-version-operator indicate that the manifest resource sync is complete, so I believe it is unlikely that the issue is within the CVO. [3]

Since nothing in particular stood out from the logs, we asked the customer to roll out a new cluster-baremetal-operator pod:

# oc delete pod cluster-baremetal-operator-xxxxxx -n openshift-machine-api

This pushed out the stale condition, and the baremetal operator is now in the Progressing=False state.
----
Logs:

[1]
```
2022-05-17T06:35:23.879134688Z E0517 06:35:23.879105 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.Proxy: the server could not find the requested resource (get proxies.config.openshift.io)
2022-05-17T06:35:23.890165361Z E0517 06:35:23.890113 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: Failed to watch *v1alpha1.Provisioning: the server could not find the requested resource (get provisionings.metal3.io)
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825142 1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: watch of *v1alpha1.Provisioning ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825171 1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: watch of *v1.DaemonSet ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825175 1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
```

[2]
```
[simore@supportshell-1 03221561]$ omg get machine
NAME                         PHASE    TYPE  REGION  ZONE  AGE
sl732-fw8cs-master-0         Running                      6h17m
sl732-fw8cs-master-1         Running                      6h17m
sl732-fw8cs-master-2         Running                      6h17m
sl732-fw8cs-worker-0-ctsgj   Running                      5h38m
sl732-fw8cs-worker-0-j5shn   Running                      5h38m
...

[simore@supportshell-1 03221561]$ omg get bmh
NAME            AGE
sl732compute1   6h17m
sl732compute2   6h17m
sl732control1   6h17m
sl732control2   6h17m
sl732control3   6h17m
...

[simore@supportshell-1 03221561]$ omg get provisioning -oyaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  creationTimestamp: '2022-05-17T05:53:51Z'
  finalizers:
  - provisioning.metal3.io
  generation: 2
  name: provisioning-configuration
  resourceVersion: '16277'
  uid: 7efe44a3-1e28-468e-afa9-9bbef0fbdf02
spec:
  provisioningNetwork: Disabled
  provisioningOSDownloadURL: http://10.92.117.51:8080/rhcos-openstack.x86_64.qcow2.gz?sha256=e0a1d8a99c5869150a56b8de475ea7952ca2fa3aacad7ca48533d1176df503ab
status:
  generations:
  - group: apps
    hash: ''
    lastGeneration: 1
    name: metal3
    namespace: openshift-machine-api
    resource: deployments
  - group: apps
    hash: ''
    lastGeneration: 1
    name: metal3-image-cache
    namespace: openshift-machine-api
    resource: daemonsets
  observedGeneration: 2
  readyReplicas: 0
```

[3]
```
2022-05-17T12:11:03.852498096Z I0517 12:11:03.852421 1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (136.141µs)
2022-05-17T12:11:05.346829912Z I0517 12:11:05.346694 1 sync_worker.go:283] Wait finished
2022-05-17T12:11:05.346829912Z I0517 12:11:05.346721 1 sync_worker.go:337] Previous sync status: &cvo.SyncWorkerStatus{Generation:1, Step:"", Failure:error(nil), Done:681, Total:681, Completed:87, Reconciling:true, Initial:false, VersionHash:"PPyjn2NyQf4=", LastProgress:time.Time{wall:0xc09901eb9736078c, ext:17469120453094, loc:(*time.Location)(0x2ad3e80)}, Actual:v1.Release{Version:"4.8.29-assembly.art3875", Image:"quay.io/openshift-release-dev/ocp-release@sha256:5719c6cdcac7f8517d00d11fe07a571b357d88a9c0462565cc61adcf50cd2a66", URL:"", Channels:[]string(nil)}, Verified:false}
```

Version-Release number of selected component (if applicable):
- The hotfix version of OCP being installed: 4.8.29-assembly.art3875

How reproducible:
- The issue is not consistently reproduced on each new installation.
- The customer mentions it has recurred a few times.

Steps to Reproduce:
1. Install RHOCP on Gen10 Plus hardware.
2. Hope the issue occurs.

Actual results:
- The baremetal operator is sporadically stuck in the Progressing state after a fresh install.

Expected results:
- After the install, the operator is supposed to NOT be in a Progressing state.

Additional info:
- The customer is unsure whether to trust the status, given that all the resources do appear to be synced according to the operator conditions.
- The customer is requesting an RCA as to why this happens. Support understands that, due to the inconsistent nature of the reproducer, it is difficult to find the root cause; it may also depend on various underlying factors, such as a network glitch. We need engineering assistance in understanding whether there is a way to increase the log level of this operator in a way that could point further at the root cause. Additionally, if such an issue is seen again, what info/logs should be captured to help find the root cause?
Hi,

From the messages in the logs, and the CBO log entries mentioning "client connection lost", we have a few assumptions.

```
2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector ironic.common.exception.ImageDownloadFailed: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bd6208>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector
2022-05-17T06:42:13.479335725Z 2022-05-17 06:42:13.478 1 ERROR ironic.conductor.task_manager [req-40b05d56-7302-4e0e-b066-3ebce9bc1ede ironic-user - - - -] Node d736a30a-56e2-4142-9a59-b7b7066eb833 moved to provision state "inspect failed" from state "inspecting"; target provision state is "manageable": ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bca048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
```

Ironic was unable to download the image; this could be due to a proxy misconfiguration, or the image cache being down. Maybe there is also some network glitch happening? Are they using multipath? (Just wondering, since they are using the hotfix 4.8.29-assembly.art3875, which contains a kernel fix if I recall correctly.)
It looks like the CBO was successful in deploying all of its resources. However, the code that checks whether the metal3 Deployment is complete simply loops over the conditions in the Deployment and returns the first one whose status is true:

https://github.com/openshift/cluster-baremetal-operator/blob/1146017627352b72e0f2399f9f7e3423f0327eb2/provisioning/baremetal_pod.go#L916-L920

For reasons that are completely unfathomable to me, when a Deployment is finished and no longer progressing, it leaves the condition "Progressing: True" in place (with reason NewReplicaSetAvailable):

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

So the result here depends on the order of the conditions, and indeed we see that all of the Deployments have the "Progressing" condition before the "Available" condition:

```
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-05-17T06:35:32Z"
    lastUpdateTime: "2022-05-17T06:41:26Z"
    message: ReplicaSet "metal3-85d9dc5d56" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-17T06:44:00Z"
    lastUpdateTime: "2022-05-17T06:44:00Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```

Thus the CBO decides that the Deployment is still progressing, and continues to report that it, too, is still progressing. It's not clear why the conditions are showing up in a different order to usual, but it is clear that the fault lies with the CBO logic. Nevertheless, this does not represent a real problem with the cluster.
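The condition-ordering trap can be demonstrated in a few lines of Go. This is only an illustrative sketch, not the actual CBO source: the types below are minimal local stand-ins for the Kubernetes appsv1 API, and the sample condition list mirrors the status dump from the customer's cluster (Progressing first, Available second, both True).

```go
package main

import "fmt"

// Minimal local stand-ins for the Kubernetes appsv1 condition types.
type DeploymentConditionType string
type ConditionStatus string

const (
	DeploymentProgressing DeploymentConditionType = "Progressing"
	DeploymentAvailable   DeploymentConditionType = "Available"
	ConditionTrue         ConditionStatus         = "True"
)

type DeploymentCondition struct {
	Type   DeploymentConditionType
	Status ConditionStatus
	Reason string
}

// sampleConditions reproduces the order seen on the customer's cluster:
// Progressing sorts before Available, and both are True.
func sampleConditions() []DeploymentCondition {
	return []DeploymentCondition{
		{Type: DeploymentProgressing, Status: ConditionTrue, Reason: "NewReplicaSetAvailable"},
		{Type: DeploymentAvailable, Status: ConditionTrue, Reason: "MinimumReplicasAvailable"},
	}
}

// buggyState mimics the order-dependent check: it returns the first condition
// whose status is True, so with the order above it reports "Progressing" even
// though reason NewReplicaSetAvailable means the rollout is complete.
func buggyState(conds []DeploymentCondition) DeploymentConditionType {
	for _, c := range conds {
		if c.Status == ConditionTrue {
			return c.Type
		}
	}
	return ""
}

// fixedState looks for the specific condition it cares about instead of
// relying on ordering: a Deployment with Available=True is not progressing.
func fixedState(conds []DeploymentCondition) DeploymentConditionType {
	for _, c := range conds {
		if c.Type == DeploymentAvailable && c.Status == ConditionTrue {
			return DeploymentAvailable
		}
	}
	return DeploymentProgressing
}

func main() {
	conds := sampleConditions()
	fmt.Println("order-dependent check reports:", buggyState(conds)) // Progressing
	fmt.Println("targeted check reports:", fixedState(conds))        // Available
}
```

The point of the sketch is that any fix needs to select conditions by type (and, for Progressing, interpret the reason), rather than take the first true condition in iteration order.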
As we discussed, since the problem is not reproducible on our setups, I am verifying this one as OtherQA after reviewing the code changes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days