Bug 1883564 - Deployment of workers with rootDeviceHints for a non-existent disk fails with an unrelated problem report about 30 minutes after the real problem is detected
Keywords:
Status: CLOSED DUPLICATE of bug 1816904
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Beth White
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-29 15:11 UTC by Lubov
Modified: 2020-10-06 06:46 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 16:20:45 UTC
Target Upstream Version:
Embargoed:
yboaron: needinfo-


Attachments
openshift_install.log (139.56 KB, text/plain)
2020-09-29 15:11 UTC, Lubov
no flags

Description Lubov 2020-09-29 15:11:09 UTC
Created attachment 1717570 [details]
openshift_install.log

Version:

$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-09-28-212756
built from commit 94c4a3afe8492ddf69026b0297fc6b341b575243
release image registry.svc.ci.openshift.org/ocp/release@sha256:3e44bc1f1f031e649f92d89da96d44cc512c5d492a7c0c5fa40b35bef196ae3e

Platform:

baremetal IPI


What happened?
In install-config.yaml, rootDeviceHints for the workers is set to a non-existent disk (e.g. deviceName: /dev/sdc). The deployment is expected to fail with a relevant problem report.
1. The deployment did not stop when the problem was detected. The BMH and machine objects are in state provisioned. The problem is reported in the metal3-ironic-conductor log (see the retrieval sketch after this list):
ERROR ironic.drivers.modules.agent [req-ade76e71-cad7-4726-a24b-676f70a032de ironic-user - - - -] node 441d1756-b123-4394-8149-f45c3383990e command status errored: {'type': 'DeviceNotFound', 'code': 404, 'message': 'Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}"}

2. The deployment fails about 30 minutes after the error above, with a report unrelated to the real problem (see openshift_install.log):
failed to initialize the cluster: Some cluster operators are still updating: authentication, console, ingress, kube-storage-version-migrator, monitoring
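
For completeness, a rough sketch of how the conductor error above can be pulled from the running cluster; the metal3 pod and container names here are assumptions taken from this report and may differ between releases:

oc -n openshift-machine-api get pods -o name | grep metal3
oc -n openshift-machine-api logs <metal3-pod> -c metal3-ironic-conductor | grep -i 'no suitable device'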


What did you expect to happen?
The deployment is expected to fail shortly after the disk problem is detected, with a message like the one provided for masters:
Error: could not inspect: could not inspect node, node is currently 'inspect failed', last error was 'ironic-inspector inspection failed: No disks satisfied root device hints'

How to reproduce it (as minimally and precisely as possible)?
1. In install-config.yaml, set rootDeviceHints for the workers to a non-existent disk (see the illustrative excerpt after these steps)
2. Deploy the cluster
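
For reference, a minimal illustrative excerpt of a worker host entry under platform.baremetal.hosts in install-config.yaml; the BMC address, credentials and MAC below are placeholders, not the values used in this cluster:

platform:
  baremetal:
    hosts:
    - name: openshift-worker-0-0
      role: worker
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/<system-id>
        username: <user>
        password: <password>
      bootMACAddress: <worker-mac>
      rootDeviceHints:
        deviceName: /dev/sdc   # disk that does not exist on the host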

As mentioned above, the BMHs and machines are in state provisioned (the BMHs were expected to be reported with a failed inspection):

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/efae7100-2315-4610-88e8-e7763bd174d9                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/2d9a977f-b447-4b43-ab25-ff9a785c0091                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-92c8g-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/8cbf3590-980b-4660-9222-7e70fa78f006                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-92c8g-worker-0-4dk7s   redfish://192.168.123.1:8000/redfish/v1/Systems/bd43e21b-fe10-4b1a-9742-5647fa39244d   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-92c8g-worker-0-qvsww   redfish://192.168.123.1:8000/redfish/v1/Systems/773c0a24-9367-4e13-9372-fe814dc668c1   unknown

[kni@provisionhost-0-0 ~]$ oc get machine -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-0         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-1         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-master-2         Running                              82m
openshift-machine-api   ocp-edge-cluster-0-92c8g-worker-0-4dk7s   Provisioned                          47m
openshift-machine-api   ocp-edge-cluster-0-92c8g-worker-0-qvsww   Provisioned                          47m

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE   VERSION
master-0-0   Ready    master   71m   v1.19.0+e465e66
master-0-1   Ready    master   71m   v1.19.0+e465e66
master-0-2   Ready    master   71m   v1.19.0+e465e66
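
To double-check whether any error was actually recorded on the hosts, the relevant BareMetalHost status fields can be dumped directly. This is only a sketch; the field paths follow the metal3 BareMetalHost API and may vary by version:

oc -n openshift-machine-api get bmh -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.provisioning.state}{"\t"}{.status.errorMessage}{"\n"}{end}'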

Comment 2 Yossi Boaron 2020-09-29 16:20:45 UTC

*** This bug has been marked as a duplicate of bug 1816904 ***

Comment 4 Doug Hellmann 2020-10-01 17:38:30 UTC
If the misconfigured values are only on the workers, then the control plane hosts will provision properly, the cluster will try to come up, and the workers should fail to provision, preventing the cluster from forming. The installer does not pay attention to worker provisioning status (see #1816904), so the reason for the failure is going to be something like what was seen here.

If the workers are showing up as provisioned, that may still be an error. Is it possible that those hosts had old images on their existing disks, so they booted RHCOS and joined the cluster again even though provisioning should have failed?

Comment 5 Kiran Thyagaraja 2020-10-02 01:56:32 UTC
To further add to Doug's comment, I don't think this is a duplicate of 1816904. You are seeing the expected behavior with this error, so I don't think it's a bug. Going back to 1816904, the openshift installer doesn't really care about workers being provisioned on time; they are OK with workers being added to the cluster even after the installer exits successfully. So I see why this was marked as a duplicate.

Comment 7 Lubov 2020-10-06 06:46:08 UTC
(In reply to Doug Hellmann from comment #4)
> If the workers are showing up as provisioned, that may still be an error. Is
> it possible that those hosts had old images on their existing disks, so they
> booted RHCOS and joined the cluster again even though provisioning should
> have failed?
Since we run the tests on a virtual emulation of bare metal, the VMs are re-created before every deployment, so there is no chance of an old RHCOS image being present.
I still believe it is a bug.

