Created attachment 1717570 [details]
openshift_install.log

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-09-28-212756
built from commit 94c4a3afe8492ddf69026b0297fc6b341b575243
release image registry.svc.ci.openshift.org/ocp/release@sha256:3e44bc1f1f031e649f92d89da96d44cc512c5d492a7c0c5fa40b35bef196ae3e

Platform: baremetal IPI

What happened?
In install-config.yaml, rootDeviceHints for the workers was set to a nonexistent disk (e.g. deviceName: /dev/sdc). The deployment was expected to fail with relevant problem reporting.

1. The deployment did not stop when the problem was detected. The BMH and machine objects remain in state "provisioned". The problem is reported in the metal3-ironic-conductor log:

ERROR ironic.drivers.modules.agent [req-ade76e71-cad7-4726-a24b-676f70a032de ironic-user - - - -] node 441d1756-b123-4394-8149-f45c3383990e command status errored: {'type': 'DeviceNotFound', 'code': 404, 'message': 'Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}"}

2. The deployment fails about 30 minutes after the error above, with a report unrelated to the real problem (see openshift_install.log):

failed to initialize the cluster: Some cluster operators are still updating: authentication, console, ingress, kube-storage-version-migrator, monitoring

What did you expect to happen?
The deployment was expected to fail shortly after the disk problem was detected, with a message like the one produced for masters:

Error: could not inspect: could not inspect node, node is currently 'inspect failed', last error was 'ironic-inspector inspection failed: No disks satisfied root device hints'

How to reproduce it (as minimally and precisely as possible)?
1. In install-config.yaml, set rootDeviceHints for the workers to a nonexistent disk.
2. Deploy the cluster.

As mentioned, the BMH and machine objects are in state "provisioned" (the BMH was expected to be reported as having failed inspection):

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/efae7100-2315-4610-88e8-e7763bd174d9                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/2d9a977f-b447-4b43-ab25-ff9a785c0091                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/8cbf3590-980b-4660-9222-7e70fa78f006                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-92c8g-worker-0-4dk7s   redfish://192.168.123.1:8000/redfish/v1/Systems/bd43e21b-fe10-4b1a-9742-5647fa39244d   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-92c8g-worker-0-qvsww   redfish://192.168.123.1:8000/redfish/v1/Systems/773c0a24-9367-4e13-9372-fe814dc668c1   unknown

[kni@provisionhost-0-0 ~]$ oc get machine -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-0         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-1         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-2         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-worker-0-4dk7s   Provisioned                          47m
openshift-machine-api   ocp-edge-cluster-0-92c8g-worker-0-qvsww   Provisioned                          47m

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE   VERSION
master-0-0   Ready    master   71m   v1.19.0+e465e66
master-0-1   Ready    master   71m   v1.19.0+e465e66
master-0-2   Ready    master   71m   v1.19.0+e465e66
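For reference, a minimal sketch of the kind of install-config.yaml host entry that triggers this: rootDeviceHints is set per host under platform.baremetal.hosts, and deviceName points at a disk that does not exist on the worker. The host name and BMC address below are illustrative placeholders, not taken from the affected environment.

```yaml
platform:
  baremetal:
    hosts:
      - name: openshift-worker-0-0        # illustrative host name
        role: worker
        bmc:
          address: redfish://<bmc-host>/redfish/v1/Systems/<id>   # placeholder
          username: admin                                         # placeholder
          password: password                                      # placeholder
        rootDeviceHints:
          deviceName: /dev/sdc   # nonexistent on the host; provisioning should fail
```

With this hint, ironic-python-agent reports DeviceNotFound (as in the conductor log above), but the BMH still ends up "provisioned" instead of a failed state.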
must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1883564_must-gather.tar.gz
*** This bug has been marked as a duplicate of bug 1816904 ***
If the misconfigured values are only on the workers, then the control plane hosts will provision properly, the cluster will try to come up, and the workers should fail to provision preventing the cluster from forming. The installer does not pay attention to worker provisioning status (see #1826904), so the reason for the failure is going to be something like what was seen here. If the workers are showing up as provisioned, that may still be an error. Is it possible that those hosts had old images on their existing disks, so they booted RHCOS and joined the cluster again even though provisioning should have failed?
To further add to Doug's comment: I don't think this is a duplicate of 1826904. The error you are seeing is expected behavior, so I don't think it's a bug. Going back to 1826904, the openshift installer doesn't really care about workers being provisioned on time; it is OK with workers being added to the cluster even after the installer exits successfully. So I see why this was marked as a duplicate.
(In reply to Doug Hellmann from comment #4)
> If the workers are showing up as provisioned, that may still be an error. Is
> it possible that those hosts had old images on their existing disks, so they
> booted RHCOS and joined the cluster again even though provisioning should
> have failed?

Since we run the tests on a virtual emulator of bare metal, the VMs are re-created before every deployment, so there is no chance of an old RHCOS image being on the disks.

I still believe this is a bug.