Bug 1972374 - Adopt failure can trigger deprovisioning
Summary: Adopt failure can trigger deprovisioning
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Zane Bitter
QA Contact: Ori Michaeli
Docs Contact: Padraig O'Grady
URL:
Whiteboard:
Depends On:
Blocks: 1972426
 
Reported: 2021-06-15 18:23 UTC by Steven Hardy
Modified: 2021-10-18 17:34 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An error (other than a registration error) in the provisioned state would cause the Host to be deprovisioned.
Consequence: After a restart of the metal3 pod (such as during an upgrade), if the image provisioned to a BareMetalHost was no longer accessible, the Host would enter the deprovisioning state and be stuck there until the image became available (at which time it would be deprovisioned).
Fix: An error in the provisioned state is now reported without triggering deprovisioning.
Result: If the image becomes unavailable, the error will be reported but deprovisioning will not be initiated.
Clone Of:
Cloned to: 1972426
Environment:
Last Closed: 2021-10-18 17:34:18 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/baremetal-operator pull 157 (closed): Bug 1972374: Don't deprovision provisioned host due to error - last updated 2021-06-16 11:40:09 UTC
Red Hat Product Errata RHSA-2021:3759 - last updated 2021-10-18 17:34:45 UTC

Description Steven Hardy 2021-06-15 18:23:22 UTC
Description of problem:

https://bugzilla.redhat.com/show_bug.cgi?id=1971602 caused us to look at some failed-upgrade cases which resulted in a broken image cache; this results in an adopt failure of the BMH resources in Ironic.

This should be a transient error: fixing the image cache should restore the BMH resources to fully managed status without any interruption to existing provisioned hosts.

However, instead we see the BMH resources move to deprovisioning, which is not expected.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Disable the image cache, e.g. by ssh-ing to each master and moving the cached image/checksum files (under /var/lib/metal3/images) to temporary different filenames; a condensed command sketch follows this list
2. Restart the metal3 pod
3. Wait some time, and observe the worker BMH resources move to deprovisioning
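
A condensed command sketch of these steps, using an example RHCOS image name and placeholder master hostnames (the real names and cached filenames will differ per cluster; the cache lives under /var/lib/metal3/images):

# Example image name; adjust to match the cluster under test.
IMG=rhcos-48.84.202105190318-0-openstack.x86_64.qcow2
CACHE=/var/lib/metal3/images/${IMG}

# 1. Disable the cache on each master (hostnames below are placeholders)
for host in master-0 master-1 master-2; do
  ssh core@${host} "sudo mv ${CACHE}/cached-${IMG} ${CACHE}/cached-${IMG}.bak; \
                    sudo mv ${CACHE}/cached-${IMG}.md5sum ${CACHE}/cached-${IMG}.md5sum.bak"
done

# 2. Restart the metal3 pod (not the metal3-image-cache pods)
oc -n openshift-machine-api delete \
  $(oc -n openshift-machine-api get pods -o name | grep metal3- | grep -v image-cache)

# 3. Watch the worker BMH resources for unexpected state changes
oc -n openshift-machine-api get bmh -w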

Actual results:

$ oc get bmh
NAME              STATE                    CONSUMER                      ONLINE   ERROR
ostest-master-0   externally provisioned   ostest-g6bq8-master-0         true     provisioned registration error
ostest-master-1   externally provisioned   ostest-g6bq8-master-1         true     provisioned registration error
ostest-master-2   externally provisioned   ostest-g6bq8-master-2         true     provisioned registration error
ostest-worker-0   deprovisioning           ostest-g6bq8-worker-0-xh2wn   true     provisioned registration error
ostest-worker-1   deprovisioning           ostest-g6bq8-worker-0-c9mtq   true     provisioned registration error

Expected results:

The BMH resources should not change state from provisioned, only the error should be reflected.

Additional info:

In CI we encountered this scenario due to a master host running out of disk space; the reproduction steps above emulate such a scenario by making the images referenced by the BMH temporarily unavailable via the second-level cache.

Comment 2 Zane Bitter 2021-06-15 18:36:31 UTC
Adoption is triggered by losing the Ironic DB, i.e. by a rescheduling of the metal3 Pod (common during an upgrade). So a combination of that and the image becoming unavailable would trigger the bug. When it is triggered, unless the failure is very brief, there is a decent chance that all worker nodes will be deprovisioned.

Comment 4 Steven Hardy 2021-06-16 10:48:06 UTC
Some more detailed reproducer notes to assist with QE verification when the fix lands:

First, get the URL for the image/checksum from one of the worker BMH resources, e.g.:

oc get bmh ostest-worker-0 -n openshift-machine-api -o json | jq .spec.image
{
  "checksum": "http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum",
  "url": "http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2"
}

fd00:1101::3 is the API VIP address; we need to temporarily disable the image cache on the master host that the VIP currently refers to, e.g.:

$ ssh core@fd00:1101::3
$ sudo mv /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2 /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.bak
$ sudo mv /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum.bak

Note: this only disables the cache on one master. If anything happens to the cluster that causes the API VIP to fail over to a different master, the process will need to be repeated (or move the files on all 3 masters).
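
To check which master currently holds the API VIP (and therefore serves the cache), something along these lines works; the hostnames are placeholders and the VIP is the address from the image URL above:

for host in master-0 master-1 master-2; do
  echo "== ${host}"
  # A non-empty result means this master currently holds the API VIP
  ssh core@${host} "ip -6 addr show | grep 'fd00:1101::3' || true"
done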

We can now confirm the image cache is broken, since the checksum is no longer found:

$ curl -I http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum
HTTP/1.1 404 Not Found
Date: Wed, 16 Jun 2021 10:41:11 GMT
Server: Apache
Content-Type: text/html; charset=iso-8859-1


Now get the metal3 pod name (*not* the image-cache pods) and delete it

$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-7968d794cb-jfbjd   2/2     Running   0          25h
cluster-baremetal-operator-8674588c96-st98w    2/2     Running   0          25h
machine-api-controllers-f77f88478-dhchs        7/7     Running   0          24h
machine-api-operator-6cc478d96f-8tdj9          2/2     Running   1          25h
metal3-58f4bf4d48-q7kfp                        10/10   Running   0          24h
metal3-image-cache-8xqmm                       1/1     Running   0          24h
metal3-image-cache-959rt                       1/1     Running   0          24h
metal3-image-cache-gl4ht                       1/1     Running   0          24h

$ oc delete pod metal3-58f4bf4d48-q7kfp -n openshift-machine-api

After some time, a new metal3 pod will be started, causing existing BMH resources to be adopted.
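
One simple way to watch for the replacement pod coming up (a sketch; any equivalent check works):

$ oc -n openshift-machine-api get pods -w | grep metal3-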

Using e.g. oc get bmh, we can now observe that the worker hosts remain in the provisioned state, but that the adopt error is reflected via the ERROR field. The BMH resource STATE field should not change after this process (e.g. to deprovisioning, as in the initial report).
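
Optionally, to confirm that the error clears once the cache is repaired (as the description expects), the moved files can be restored; a sketch using the same example filenames as above:

IMG=rhcos-48.84.202105190318-0-openstack.x86_64.qcow2
CACHE=/var/lib/metal3/images/${IMG}

# Restore the renamed image and checksum files on the master holding the VIP
ssh core@fd00:1101::3 "sudo mv ${CACHE}/cached-${IMG}.bak ${CACHE}/cached-${IMG}; \
                       sudo mv ${CACHE}/cached-${IMG}.md5sum.bak ${CACHE}/cached-${IMG}.md5sum"

# The registration error on the BMH resources should then clear without any
# further state transition
oc -n openshift-machine-api get bmh -w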

Comment 7 Ori Michaeli 2021-06-17 12:41:59 UTC
Verified with 4.9.0-0.nightly-2021-06-17-021644:

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-tgbth-master-0         true     provisioned registration error
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-tgbth-master-1         true     provisioned registration error
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-tgbth-master-2         true     provisioned registration error
openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-tgbth-worker-0-knh5g   true     provisioned registration error
openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-tgbth-worker-0-pvpcn   true     provisioned registration error

Comment 10 errata-xmlrpc 2021-10-18 17:34:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

