Bug 1972374
| Field | Value |
|---|---|
| Summary: | Adopt failure can trigger deprovisioning |
| Product: | OpenShift Container Platform |
| Component: | Bare Metal Hardware Provisioning |
| Sub component: | baremetal-operator |
| Reporter: | Steven Hardy <shardy> |
| Assignee: | Zane Bitter <zbitter> |
| QA Contact: | Ori Michaeli <omichael> |
| Docs Contact: | Padraig O'Grady <pogrady> |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | pogrady, rbartal |
| Version: | 4.8 |
| Keywords: | Triaged, UpcomingSprint |
| Target Milestone: | --- |
| Target Release: | 4.9.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| : | 1972426 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-10-18 17:34:18 UTC |
| Type: | Bug |
| Regression: | --- |
| Documentation: | --- |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1972426 |

Doc Text:
Cause: An error (other than a registration error) in the provisioned state would cause the Host to be deprovisioned.
Consequence: After a restart of the metal3 pod (such as during an upgrade), if the image provisioned to a BareMetalHost was no longer accessible, the Host would enter the deprovisioning state and be stuck there until the image became available again (at which point it would be deprovisioned).
Fix: An error in the provisioned state is now reported without triggering deprovisioning.
Result: If the image becomes unavailable, the error is reported but deprovisioning is not initiated.
Description
Steven Hardy
2021-06-15 18:23:22 UTC
https://github.com/metal3-io/baremetal-operator/issues/915
https://github.com/metal3-io/baremetal-operator/pull/916

Adoption is triggered by losing the Ironic DB, i.e. by a rescheduling of the metal3 Pod (common during an upgrade), so a combination of that and the image becoming unavailable triggers the bug. When it is triggered, unless the failure is very brief, there is a decent chance that all worker nodes will be deprovisioned.

Some more detailed reproducer notes to assist with QE verification when the fix lands:

First, get the URL of the image and its checksum from one of the worker BMH resources, e.g.:
oc get bmh ostest-worker-0 -n openshift-machine-api -o json | jq .spec.image
{
"checksum": "http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum",
"url": "http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2"
}
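For convenience in the later steps, the same values can be captured into shell variables. This is only a sketch; the IMAGE_URL/CHECKSUM_URL variable names are illustrative helpers rather than anything from the original notes, and the jsonpath expressions simply follow the .spec.image layout shown above.
$ # Capture the image and checksum URLs from the BMH spec
$ IMAGE_URL=$(oc get bmh ostest-worker-0 -n openshift-machine-api -o jsonpath='{.spec.image.url}')
$ CHECKSUM_URL=$(oc get bmh ostest-worker-0 -n openshift-machine-api -o jsonpath='{.spec.image.checksum}')
$ echo "$IMAGE_URL" "$CHECKSUM_URL"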
fd00:1101::3 is the API VIP; we need to temporarily disable the image cache on the master host that the VIP currently points to, e.g.:
$ ssh core@fd00:1101::3
$ sudo mv /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2 /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.bak
$ sudo mv /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum /var/lib/metal3/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum.bak
Note: this only disables the cache on one master. If anything happens to the cluster that causes the API VIP to fail over to a different master, the process will need to be repeated (or move the files on all three masters; see the sketch below).
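To disable the cache on all masters in one go, a loop along these lines should work. This is only a sketch: it assumes the image file name shown above, that the masters carry the node-role.kubernetes.io/master label, and that they are reachable over SSH as the core user.
IMG=rhcos-48.84.202105190318-0-openstack.x86_64.qcow2
for ip in $(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  # Move the cached image and its checksum out of the way on this master
  ssh core@"${ip}" "sudo mv /var/lib/metal3/images/${IMG}/cached-${IMG} /var/lib/metal3/images/${IMG}/cached-${IMG}.bak && sudo mv /var/lib/metal3/images/${IMG}/cached-${IMG}.md5sum /var/lib/metal3/images/${IMG}/cached-${IMG}.md5sum.bak"
done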
We can now confirm that the image cache is broken; the checksum is no longer found:
$ curl -I http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.md5sum
HTTP/1.1 404 Not Found
Date: Wed, 16 Jun 2021 10:41:11 GMT
Server: Apache
Content-Type: text/html; charset=iso-8859-1
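The image URL itself can be checked the same way if desired (this simply reuses the image URL from the BMH spec shown earlier):
$ curl -I http://[fd00:1101::3]:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/cached-rhcos-48.84.202105190318-0-openstack.x86_64.qcow2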
Now get the name of the metal3 pod (*not* one of the image-cache pods) and delete it:
$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-7968d794cb-jfbjd 2/2 Running 0 25h
cluster-baremetal-operator-8674588c96-st98w 2/2 Running 0 25h
machine-api-controllers-f77f88478-dhchs 7/7 Running 0 24h
machine-api-operator-6cc478d96f-8tdj9 2/2 Running 1 25h
metal3-58f4bf4d48-q7kfp 10/10 Running 0 24h
metal3-image-cache-8xqmm 1/1 Running 0 24h
metal3-image-cache-959rt 1/1 Running 0 24h
metal3-image-cache-gl4ht 1/1 Running 0 24h
$ oc delete pod metal3-58f4bf4d48-q7kfp -n openshift-machine-api
After some time, a new metal3 pod will be started, causing existing BMH resources to be adopted.
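To see when the replacement pod is up (the exact name will differ, since the Deployment generates a new suffix), something like the following can be used:
$ oc get pods -n openshift-machine-api | grep ^metal3- | grep -v image-cache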
Using e.g. oc get bmh, we can now observe that the worker hosts remain in the provisioned state, but that the adopt error is reflected via the ERROR field. The BMH resource STATE field should not change after this process (e.g. to deprovisioning, as in the initial report).
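For a more explicit check of the state and error fields, custom columns can be used. This is a sketch: the status field paths below (.status.provisioning.state, .status.errorType, .status.errorMessage) are the standard BareMetalHost status fields rather than anything quoted in the original notes.
$ oc get bmh -n openshift-machine-api -o custom-columns=NAME:.metadata.name,STATE:.status.provisioning.state,ERROR:.status.errorType,MESSAGE:.status.errorMessage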
Verified with 4.9.0-0.nightly-2021-06-17-021644:

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-tgbth-master-0         true     provisioned registration error
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-tgbth-master-1         true     provisioned registration error
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-tgbth-master-2         true     provisioned registration error
openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-tgbth-worker-0-knh5g   true     provisioned registration error
openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-tgbth-worker-0-pvpcn   true     provisioned registration error

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759