Description of problem:

4.8 has an image caching mechanism that 4.7 does not. As a result, the two versions use different image paths even though they ultimately serve the same image. During an upgrade, the ironic rhcos downloader does not know that the image already exists, so it always re-downloads the same image. This breaks the symlinks and also delays completion of the upgrade.

Version-Release number of selected component (if applicable):

4.8+

How reproducible:

Run e2e-metal-ipi-upgrade from 4.7 to 4.8 and check the metal3-machine-os-downloader container logs.

Actual results:

The image symlinks are broken.

Expected results:

The symlinks point at the correct images.

Additional info:
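To make the "does not know that the image already exists" part concrete, here is a minimal shell sketch of the kind of existence check the downloader would need before re-downloading. This is illustrative only, not the actual ironic-rhcos-downloader code; the variable names, the placeholder URL, and the cached- filename pattern are assumptions based on the paths quoted later in this bug.

  # Hypothetical sketch of an idempotent download step (not the real script)
  IMAGE_URL="https://example.com/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2"   # placeholder URL
  IMAGE_NAME=$(basename "$IMAGE_URL")
  CACHE_DIR="/shared/html/images/$IMAGE_NAME"

  # If a previous run (4.7- or 4.8-style filename) already produced the image,
  # skip the multi-GB download instead of clobbering the existing layout.
  if [ -s "$CACHE_DIR/$IMAGE_NAME" ] || [ -s "$CACHE_DIR/cached-$IMAGE_NAME" ]; then
      echo "Image already present in $CACHE_DIR, skipping download"
      exit 0
  fi

  # ...otherwise download to a temporary directory and publish it atomically.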
It is definitely the case that the symlinks are broken, and remain broken (the deprovisioning of the hosts caused by bug 1972374 does not actually succeed, because we still can't adopt even after all the pods are up again):

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1404984966030299136/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-api_metal3-5cd564855-c7tfz_metal3-httpd.log

(Note that this is the log for httpd in the metal3 pod - we don't actually use the second-level cache's httpd unless the provisioning network is disabled at install time, although we do still run the downloader. Here is a PR to fix that: https://github.com/openshift/installer/pull/5008.)

It is also definitely the case that we are downloading the image again every time the downloader is run, and storing it forever (the tmpdir gets copied into the directory we are serving the image from, instead of being renamed to *be* that directory). This increases disk usage by almost 3.5GB every time it happens; the node that's actually in use has two of these pods, so every time we bounce all the pods we throw away another ~7GB of space.

What is not clear to me is why the symlinks are broken. It looks to me like they are set up correctly at the end of each downloader run, overwriting the previous link, which should also have been correct:

metal3 pod: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1404984966030299136/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-api_metal3-5cd564855-c7tfz_metal3-machine-os-downloader.log

metal3-image-cache pod on the same node: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1404984966030299136/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-api_metal3-image-cache-wbfnq_metal3-machine-os-downloader.log

So either we are *still* hitting the increased disk space limit (though there's nothing in the logs to suggest this?), in which case the patch will fix it, or there's something else happening that I don't yet understand, in which case I can't say whether the patch will fix it.
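For anyone reproducing this on a live cluster, both symptoms above (dangling symlinks and leaked tmp directories) can be checked directly on the shared volume inside the metal3 pod. These commands are only a diagnostic aid, not part of any fix; the mount path is the one from the paths quoted in the next comment.

  # Broken (dangling) symlinks under the image cache; -xtype l matches
  # symlinks whose target does not exist.
  find /shared/html/images -xtype l -exec ls -l {} \;

  # Leaked tmp.* directories from previous downloader runs and the space they waste.
  du -sh /shared/html/images/*/tmp.* 2>/dev/null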
Oh, I think I worked it out:

Initially (prior to upgrade) we have the file we want:

/shared/html/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2

Then we run a downloader with the new image. It downloads again to a tmpdir, then moves the tmpdir instead of renaming it, so we end up with the following files that are leaked and never seen again:

/shared/html/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/tmp.Gmh4P9y3FH/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2
/shared/html/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/tmp.Gmh4P9y3FH/cached-rhcos-47.83.202103251640-0-openstack.x86_64.qcow2

Finally, in the /shared/html/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/ directory we stomp on the good image we have by replacing it with a symlink to the 'cached-' image that we thought we'd just copied in, which is in fact elsewhere.

I think this could potentially be considered a blocker for 4.8, because even with bug 1972374 fixed it will leave BareMetalHosts in an error state where we can't do power management.

The current proposed patch is sufficient to prevent the issue occurring, but *not* to resolve it once it has occurred. So if this is not a blocker for 4.8 we will need an additional fix.
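The leak comes down to ordinary mv(1) semantics: moving a directory onto a path that already exists as a directory nests it inside, while moving it onto a path that does not exist renames it. A tiny demonstration with hypothetical names (nothing to do with the real script):

  mkdir -p dest
  mkdir tmp.ABC123 && touch tmp.ABC123/image.qcow2
  mv tmp.ABC123 dest      # dest already exists -> result is dest/tmp.ABC123/image.qcow2 (the leak)

  rm -rf dest
  mkdir tmp.DEF456 && touch tmp.DEF456/image.qcow2
  mv tmp.DEF456 dest      # dest does not exist -> tmp.DEF456 is renamed, result is dest/image.qcow2

In the upgrade case the per-image directory already exists from 4.7, so the first case applies: the freshly downloaded files land under a tmp.* subdirectory, and the symlink created afterwards in the parent directory points at a 'cached-' file that was never actually placed there.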
Note that the leaking of the TMPDIR is handled via https://github.com/openshift/ironic-rhcos-downloader/pull/45 - but we still need to deal with the fact that the filename changed between 4.7 and 4.8
(In reply to Steven Hardy from comment #3)
> Note that the leaking of the TMPDIR is handled via
> https://github.com/openshift/ironic-rhcos-downloader/pull/45 - but we still
> need to deal with the fact that the filename changed between 4.7 and 4.8

Correction: that only fixes an exit-path leak. Zane identified another problem; see https://github.com/openshift/ironic-rhcos-downloader/pull/48#pullrequestreview-685605683
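As an illustration of the 4.7-to-4.8 filename change that still needs handling, here is a hedged sketch of the kind of compatibility link that could bridge the two names without another download. The filenames below are taken from the paths quoted earlier in this bug; whether the eventual fix works this way is an assumption.

  CACHE_DIR=/shared/html/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2
  OLD_NAME=rhcos-47.83.202103251640-0-compressed.x86_64.qcow2          # filename present on a 4.7 cluster (per comment above)
  NEW_NAME=cached-rhcos-47.83.202103251640-0-openstack.x86_64.qcow2    # "cached-" filename used by the 4.8 downloader (assumed)

  # If only the 4.7-era file is present, expose it under the 4.8 name as well;
  # -f replaces any stale link left behind by a previous run.
  if [ -s "$CACHE_DIR/$OLD_NAME" ] && [ ! -e "$CACHE_DIR/$NEW_NAME" ]; then
      ln -sf "$OLD_NAME" "$CACHE_DIR/$NEW_NAME"
  fi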
(In reply to Zane Bitter from comment #2)
> The current proposed patch is sufficient to
> prevent the issue occurring, but *not* to resolve it once it has occurred.
> So if this is not a blocker for 4.8 we will need an additional fix.

For the sake of posterity, noting that the PR that merged will both prevent the issue from occurring and fix it on a system where it has already occurred. Nevertheless, we're still going to backport urgently to 4.8 (bug 1973018).
This verification failed while upgrading from 4.7.12 -> 4.8.0-rc.0 -> 4.9.0-0.nightly-2021-06-24-073147:

4.7 -> 4.8: the upgrade passed; the BMHs show errors and are deprovisioning as expected.
4.8 -> 4.9: the upgrade is stuck at 75% with multiple errors, and both worker nodes show as NotReady.

[kni@provisionhost-0-0 ~]$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-0.nightly-2021-06-24-073147   False   False   True   41h   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-edge-cluster-0.qe.lab.redhat.com/healthz": dial tcp 192.168.123.10:443: connect: connection refused ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 2 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
baremetal   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
cloud-credential   4.9.0-0.nightly-2021-06-24-073147   True   False   False   2d
cluster-autoscaler   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
config-operator   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
console   4.9.0-0.nightly-2021-06-24-073147   False   False   False   41h   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health): Get "https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health": dial tcp 192.168.123.10:443: connect: connection refused
csi-snapshot-controller   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
dns   4.8.0-rc.0   True   True   False   44h   DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 5."
etcd   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
image-registry   4.9.0-0.nightly-2021-06-24-073147   False   True   True   41h   Available: The deployment does not have available replicas NodeCADaemonAvailable: The daemon set node-ca has available replicas ImagePrunerAvailable: Pruner CronJob has been created
ingress   4.9.0-0.nightly-2021-06-24-073147   False   True   True   41h   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
kube-apiserver   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
kube-controller-manager   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
kube-scheduler   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
kube-storage-version-migrator   4.9.0-0.nightly-2021-06-24-073147   True   False   False   41h
machine-api   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
machine-approver   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
machine-config   4.8.0-rc.0   False   False   True   41h   Cluster not available for 4.8.0-rc.0
marketplace   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
monitoring   4.9.0-0.nightly-2021-06-24-073147   False   True   True   41h   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network   4.8.0-rc.0   True   True   True   47h   DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-06-28T22:08:29Z DaemonSet "openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2021-06-28T22:08:30Z DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-06-28T22:08:30Z
node-tuning   4.9.0-0.nightly-2021-06-24-073147   True   False   False   45h
openshift-apiserver   4.9.0-0.nightly-2021-06-24-073147   True   False   False   41h
openshift-controller-manager   4.9.0-0.nightly-2021-06-24-073147   True   False   False   24h
openshift-samples   4.9.0-0.nightly-2021-06-24-073147   True   False   False   41h
operator-lifecycle-manager   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
operator-lifecycle-manager-catalog   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-06-24-073147   True   False   False   44h
service-ca   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h
storage   4.9.0-0.nightly-2021-06-24-073147   True   False   False   47h

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-0-0   Ready   master   47h   v1.21.0-rc.0+120883f
master-0-1   Ready   master   47h   v1.21.0-rc.0+120883f
master-0-2   Ready   master   47h   v1.21.0-rc.0+120883f
worker-0-0   NotReady   worker   46h   v1.21.0-rc.0+120883f
worker-0-1   NotReady   worker   46h   v1.21.0-rc.0+120883f

I opened bug 1977884, as I won't be able to verify this one until that is fixed.
Ori - is it possible to close this now as VERIFIED?
Verified with upgrade from 4.7.24 -> 4.8.6 -> 4.9.0-fc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759