Description of the problem:

Deploying the assisted service operator bundle on a 4.11 hub cluster triggered the two already provisioned worker nodes to go through deprovisioning and provisioning, and the two workers eventually got stuck in the provisioning state. This was due to BMAC setting the customDeploy method to "start_assisted_install" on the two BMHs (when the converged flow is enabled).

  oc get bmh -n openshift-machine-api cnfdt08-worker-0 -ojsonpath={'.spec.customDeploy}' | jq
  {
    "method": "start_assisted_install"
  }

  oc get bmh -n openshift-machine-api cnfdt08-worker-1 -ojsonpath={'.spec.customDeploy}' | jq
  {
    "method": "start_assisted_install"
  }

  oc get bmh -A
  NAMESPACE               NAME               STATE                    CONSUMER                       ONLINE   ERROR   AGE
  openshift-machine-api   cnfdt08-master-0   externally provisioned   cnfdt08-2qtb2-master-0         true             15h
  openshift-machine-api   cnfdt08-master-1   externally provisioned   cnfdt08-2qtb2-master-1         true             15h
  openshift-machine-api   cnfdt08-master-2   externally provisioned   cnfdt08-2qtb2-master-2         true             15h
  openshift-machine-api   cnfdt08-worker-0   provisioning             cnfdt08-2qtb2-worker-0-z8cdr   true             15h
  openshift-machine-api   cnfdt08-worker-1   provisioning             cnfdt08-2qtb2-worker-0-tc6rc   true             15h

Release version:
- Latest upstream assisted-service-operator (with converged flow enabled)
- OCP 4.11 on hub (4.11.0-0.nightly-2022-06-06-201913)

Steps to reproduce:
1. Deploy a 4.11 hub cluster with 3 masters and 2 workers
2. Deploy the assisted service operator bundle with converged flow enabled

Actual results:
The hub cluster is in a bad state and many pods (including the assisted-service pods) cannot find an available host.

Expected results:
The hub cluster is in a healthy state that allows deploying a spoke cluster.

Additional info:
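For context, the fragment of the worker BMH spec that BMAC writes when the converged flow is enabled looks roughly like this. This is a sketch reconstructed from the jsonpath output above; everything other than customDeploy is omitted:

  apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    name: cnfdt08-worker-0
    namespace: openshift-machine-api
  spec:
    # Set by BMAC when the converged flow is enabled; this is what
    # kicked the already-provisioned workers back into provisioning.
    customDeploy:
      method: start_assisted_install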
I'm not sure if the start_assisted_install customDeploy is what triggered the deprovisioning, or whether it's something else we do (perhaps we updated the PreprovisioningImage with a new ISO URL). BMAC (the BMH controller in assisted-service) is reconciling BMH CRs that belong to an InfraEnv.

Can you attach:
1. The full worker BMH YAMLs?
2. All the PreprovisioningImages you see in the cluster?
3. The assisted-service and metal3 logs?
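For reference, a minimal sketch of the commands that would collect these artifacts. The assisted-installer namespace is taken from the log command later in this report; the deployment names are assumptions and may differ in your environment:

  # 1. Full worker BMH YAMLs
  oc get bmh -n openshift-machine-api cnfdt08-worker-0 cnfdt08-worker-1 -o yaml > worker-bmhs.yaml

  # 2. All PreprovisioningImages in the cluster
  oc get preprovisioningimage -A -o yaml > preprovisioningimages.yaml

  # 3. assisted-service and metal3 logs (deployment names are assumptions)
  oc logs -n assisted-installer deployment/assisted-service -c assisted-service > assisted-service.log
  oc logs -n openshift-machine-api deployment/metal3 --all-containers > metal3.log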
I have attached the full worker BMH YAMLs, the PreprovisioningImages, and the BMO logs. The assisted-service pods were running on the worker nodes, so their logs were not available:

  oc get PreprovisioningImage -A
  NAMESPACE               NAME               READY   REASON
  openshift-machine-api   cnfdt08-worker-0   True    ImageSuccess
  openshift-machine-api   cnfdt08-worker-1   True    ImageSuccess

  oc logs -n assisted-installer assisted-service-684bd8cd48-brpbb -c assisted-service
  Error from server: Get "https://10.46.55.22:10250/containerLogs/assisted-installer/assisted-service-684bd8cd48-brpbb/assisted-service": dial tcp 10.46.55.22:10250: connect: no route to host
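Since one hypothesis above is that the PreprovisioningImage was updated with a new ISO URL, one quick check would be to look at the image URL reported in its status. A sketch, assuming the metal3 PreprovisioningImage status exposes imageUrl:

  oc get preprovisioningimage -n openshift-machine-api cnfdt08-worker-0 \
    -o jsonpath='{.status.imageUrl}{"\n"}'
  oc get preprovisioningimage -n openshift-machine-api cnfdt08-worker-1 \
    -o jsonpath='{.status.imageUrl}{"\n"}'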
I re-tested with the latest upstream operator and this issue has been resolved.
Verified per the previous entry (QE also does not see this in CI with recent builds).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6370