Bug 2088428

Summary: clusteroperator/baremetal stays in progressing: Applying metal3 resources state on a fresh install
Product: OpenShift Container Platform Reporter: Siddhant More <simore>
Component: Bare Metal Hardware ProvisioningAssignee: sdasu
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: augol, cgaynor, ealcaniz, eglottma, imelofer, janders, kurathod, lshilin, omichael, openshift-bugs-escalate, pibanezr, pmannidi, rbartal, sdasu, tsedovic, vkochuku, wking, zbitter
Version: 4.8Keywords: FastFix, OtherQA, Triaged
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:12:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2091738    

Description Siddhant More 2022-05-19 12:45:51 UTC
Description of problem:

After installing a cluster on GEN 10 plus hardware, the customer(Nokia-NOM) finds that the baremetal operator is sporadically stuck in 'progressing' state. 

From our analysis, the installation is actually complete. The nodes and rest of the cluster operators are Available and not Degraded. 

- The cluster-baremetal-opeator(top-level) doesn't have any errors as such except for some warnings.(worth looking at as they might have caused the status to NOT be updated) [1]
- The metal3-baremetal-operator(2nd-level) pod also doesn't have any errors if I grep out info logs.  
- I checked other resources like machines, BMH, and provisioning CR that are managed by the bare metal operator and do not see any problem there. [2]
- It is possible that the cluster-baremetal-opeator has synced the resources but failed to report it back to the cluster-version-operator. 
- Logs from the cluster version operator indicate that the manifest resource sync is complete. I believe it is unlikely that an issue is here within CVO. [3]

Due to nothing in particular standing out from the logs, asked them to roll out a new bare-metal operator pod. 
  
  # oc delete pod cluster-baremetal-operator-xxxxxx -n openshift-machine-api
  
This got the stale condition to be pushed out and the baremetal operator was in progressing=false state now. 




----
Logs:

[1]
```
2022-05-17T06:35:23.879134688Z E0517 06:35:23.879105       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.Proxy: the server could not find the requested resource (get proxies.config.openshift.io)
2022-05-17T06:35:23.890165361Z E0517 06:35:23.890113       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: Failed to watch *v1alpha1.Provisioning: the server could not find the requested resource (get provisionings.metal3.io)
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825142       1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: watch of *v1alpha1.Provisioning ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825171       1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: watch of *v1.DaemonSet ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2022-05-17T06:39:41.825418177Z W0517 06:39:41.825175       1 reflector.go:436] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2
```

[2] 
```
[simore@supportshell-1 03221561]$ omg get machine 
NAME                        PHASE    TYPE  REGION  ZONE  AGE
sl732-fw8cs-master-0        Running                      6h17m
sl732-fw8cs-master-1        Running                      6h17m
sl732-fw8cs-master-2        Running                      6h17m
sl732-fw8cs-worker-0-ctsgj  Running                      5h38m
sl732-fw8cs-worker-0-j5shn  Running                      5h38m

...
[simore@supportshell-1 03221561]$ omg get bmh
NAME           AGE
sl732compute1  6h17m
sl732compute2  6h17m
sl732control1  6h17m
sl732control2  6h17m
sl732control3  6h17m

...
[simore@supportshell-1 03221561]$ omg get provisioning -oyaml 
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  creationTimestamp: '2022-05-17T05:53:51Z'
  finalizers:
  - provisioning.metal3.io
  generation: 2
  name: provisioning-configuration
  resourceVersion: '16277'
  uid: 7efe44a3-1e28-468e-afa9-9bbef0fbdf02
spec:
  provisioningNetwork: Disabled
  provisioningOSDownloadURL: http://10.92.117.51:8080/rhcos-openstack.x86_64.qcow2.gz?sha256=e0a1d8a99c5869150a56b8de475ea7952ca2fa3aacad7ca48533d1176df503ab
status:
  generations:
  - group: apps
    hash: ''
    lastGeneration: 1
    name: metal3
    namespace: openshift-machine-api
    resource: deployments
  - group: apps
    hash: ''
    lastGeneration: 1
    name: metal3-image-cache
    namespace: openshift-machine-api
    resource: daemonsets
  observedGeneration: 2
  readyReplicas: 0
```

[3]
```
2022-05-17T12:11:03.852498096Z I0517 12:11:03.852421       1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (136.141┬Ás)
2022-05-17T12:11:05.346829912Z I0517 12:11:05.346694       1 sync_worker.go:283] Wait finished
2022-05-17T12:11:05.346829912Z I0517 12:11:05.346721       1 sync_worker.go:337] Previous sync status: &cvo.SyncWorkerStatus{Generation:1, Step:"", Failure:error(nil), Done:681, Total:681, Completed:87, Reconcil
ing:true, Initial:false, VersionHash:"PPyjn2NyQf4=", LastProgress:time.Time{wall:0xc09901eb9736078c, ext:17469120453094, loc:(*time.Location)(0x2ad3e80)}, Actual:v1.Release{Version:"4.8.29-assembly.art3875", Ima
ge:"quay.io/openshift-release-dev/ocp-release@sha256:5719c6cdcac7f8517d00d11fe07a571b357d88a9c0462565cc61adcf50cd2a66", URL:"", Channels:[]string(nil)}, Verified:false}
```


Version-Release number of selected component (if applicable):
- This is the hotfix version of OCP being installed 4.8.29-assembly.art3875


How reproducible:
- The issue is not consistently reproduced on each new installation. 
- Customer mentions it has reoccurred a few times.


Steps to Reproduce:
1. Install RHOCP on GEN 10 plus hardware.
2. Hope the issue occurs 


Actual results:
- The baremetal operator is sporadically stuck in progressing state after a fresh install. 


Expected results:
- After the install the operator is supposed to NOT be in a progressing state. 


Additional info:
- Customer is unsure whether to trust the status if all the resources are in fact synced from the operator conditions. 
- Customer is requesting an RCA as to why this happens. 

Support understands due to the inconsistent nature of the reproducer, it is difficult to find the root cause of this issue. It might also depend on various underlying factors like network gitch, etc. We need engineering assistance in understanding if there is a way to increase log level of this operator in a way that could point further at the root cause. Additionally, if such an issue is seen again, what info/logs should be captured that could help in finding the root cause.

Comment 4 Iury Gregory Melo Ferreira 2022-05-25 18:09:17 UTC
Hi,

From the messages we see on the logs and the CBO logs mentioning "client connection lost"  we have a few assumptions.


2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector ironic.common.exception.ImageDownloadFailed: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bd6208>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector 
2022-05-17T06:42:13.479335725Z 2022-05-17 06:42:13.478 1 ERROR ironic.conductor.task_manager [req-40b05d56-7302-4e0e-b066-3ebce9bc1ede ironic-user - - - -] Node d736a30a-56e2-4142-9a59-b7b7066eb833 moved to provision state "inspect failed" from state "inspecting"; target provision state is "manageable": ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bca048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))


Ironic was unable to download the image and this can be due to a proxy miss configuration or the image cache is down.
Maybe there is also some network glitch happening ?

Are they using multipath ? (Just wondering since they are using the HotFix  4.8.29-assembly.art3875 - contains a kernel fix if I recall)

Comment 5 Zane Bitter 2022-05-26 03:25:33 UTC
It looks like the CBO was successful in deploying all of its resources. However, the code that checks for the metal3 Deployment to be complete simply loops over the conditions in the Deployment and returns the first one that is true:

https://github.com/openshift/cluster-baremetal-operator/blob/1146017627352b72e0f2399f9f7e3423f0327eb2/provisioning/baremetal_pod.go#L916-L920

For reasons that are completely unfathomable to me, when a Deployment is finished and no longer progressing, it sets the condition "Progressing: True":

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

So the result here will depend on the order of conditions, and indeed we see that all of the Deployments have the "Progressing" condition before the "Available" condition:

  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2022-05-17T06:35:32Z"
      lastUpdateTime: "2022-05-17T06:41:26Z"
      message: ReplicaSet "metal3-85d9dc5d56" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    - lastTransitionTime: "2022-05-17T06:44:00Z"
      lastUpdateTime: "2022-05-17T06:44:00Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1

thus CBO decides that the Deployment is still progressing, and continues to report that it too is still progressing. It's not clear why the conditions are showing up in a different order to usual, but it is clear that the fault lies with the CBO logic. Nevertheless, this does not represent a real problem with the cluster.

Comment 22 Lubov 2022-06-01 13:36:01 UTC
as we discussed, since the problem is not reproduced on our setups, verifying this one as OtherQA after reviewing code changes.

Comment 24 errata-xmlrpc 2022-08-10 11:12:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 25 Red Hat Bugzilla 2023-09-18 04:37:25 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days