Bug 1835112
Summary: | s390x/ppc64le: Failed to upgrade Cluster from 4.2.29 to 4.3.18: unable to sync: open /opt/openshift/operator/ocp-s390x: no such file or directory | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | jschinta |
Component: | Samples | Assignee: | Gabe Montero <gmontero> |
Status: | CLOSED ERRATA | QA Contact: | Barry Donahue <bdonahue> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.3.z | CC: | alklein, aos-bugs, bdonahue, cbaus, cfillekes, christian.lapolt, dgilmore, dslavens, gmontero, jokerman, jpoulin, nbziouec, wking, ygaponen |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.5.0 | ||
Hardware: | s390x | ||
OS: | Other | ||
Whiteboard: | multi-arch | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: earlier versions (4.2.x) of samples operator on s390x/ppc64le did not bootstrap as removed, given samples content has not been made available on those architectures yet
Consequence: an upgrade to later versions would go degraded as the later versions assumed samples operator was already marked removed
Fix: newer versions of samples operator will now force samples to removed, if needed, during upgrade on s390x/ppc64le
Result: upgrade of samples operator on s390x/ppc64le will succeed
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-07-13 17:38:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1765215, 1835995 |
Description
jschinta
2020-05-13 06:42:57 UTC
Yep there is an upgrade specific (vs. initial install) error path with s390/ppc64le that I now see based on the data provided with the bug, which stems from bootstrapping as removed for those platforms, but having the payload imagestreams like tests and must-gather coming into the samples operator.

Now, a couple of notes:
1) a process reminder, per the OCP process, I would need to fix this in 4.5 first, then 4.4.z, then 4.3.z; so it will take a bit
2) Jeremy Poulin and Renin Jose are in the process of trying to validate the s390 image in early 4.5 payloads to see if we can include some samples finally ... not sure yet if/when that would move back to 4.4 and 4.3, but if it did, it would obviate the need for the fix.

I've cc:ed Jeremy on this bug and will send a needinfo to him for comment. But in the interim, I'll start on 1)

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.
Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?
What is the impact? Is it serious enough to warrant blocking edges?
example: Samples sticks on arch-specific bug, CVO sticks on samples, update hangs.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
example: Clearing the attempted update resolves the issue. There is no other remediation procedure.

(In reply to W. Trevor King from comment #2)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context and the UpgradeBlocker flag has been added to this bug. It
> will be removed if the assessment indicates that this should not block
> upgrade edges.
>
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?
> example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

I believe this is 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

> Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?
> What is the impact? Is it serious enough to warrant blocking edges?
> example: Samples sticks on arch-specific bug, CVO sticks on samples,
> update hangs.

I would expect these results on 4.2 -> 4.2, 4.3 -> 4.4. We are currently working with the multiarch team to get samples vetted on s390 and perhaps ppc64le for 4.5. But most likely that is several weeks out for ppc64le and maybe a few days to a week for s390x. So I would not expect this for 4.4 -> 4.5. Then, a discussion on backporting content to 4.4 or 4.3 could occur, though it is not a given that would be agreed upon.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?
> example: Clearing the attempted update resolves the issue. There is no
> other remediation procedure.

running `oc delete configs.samples cluster` should reset the samples operator; when it comes back up, it will treat things like an initial install and should bootstrap as removed, without misguided attempts to read non-existent content.
Also, on my needinfo to Jeremy (though if anybody on cc: knows please feel free to chime in) - are there 4.3 -> 4.4 upgrade tests on s390x coming anytime soon?

Lastly, to the originator and QA contact, can either of you reproduce the upgrade issue, and then run `oc delete configs.samples cluster` and observe the result?

Ultimately, an `oc get clusteroperator openshift-samples -o yaml` should confirm the reset worked and samples is available==true progressing==false degraded==false version set, like you would get on an initial install.

This would only affect s390. 4.2 was not released on ppc64.

Given a) the 4.5 payloads are just starting out, and b) we are ultimately bringing in s390x samples so the upgrade to 4.5 won't hit the error path, I'm moving this to verified to accelerate backport to 4.3.z, where the upgrade hiccup while removed/no s390x samples was observed.

(In reply to Gabe Montero from comment #5)
> Lastly, to the originator and QA contact, can either of you reproduce the
> upgrade issue, and
> then run `oc delete configs.samples cluster` and observe the result.
>
> Ultimately, an `oc get clusteroperator openshift-samples -o yaml` should
> confirm the reset worked and samples is available==true progressing==false
> degraded==false version set,
> like you would get on an initial install.

Unfortunately I needed the cluster and had to reinstall it with 4.3.18. Since I don't have enough machines for a second cluster, I can't reproduce the issue.

I'm holding on the doc update for now. If we provide s390 samples in 4.5, this bug/change will be rendered inert.

Per last go around, manually verifying this since s390 4.5 is not available for an upgrade test, and there is no viable x86 approximation. At least this time, we were able to vet these changes with the multi arch team on a manually built 4.3 payload, using a 4.2 to 4.3 upgrade.

*** Bug 1766287 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409

The errata is only for:
Red Hat OpenShift Container Platform 4.5 for RHEL 8 x86_64
Red Hat OpenShift Container Platform 4.5 for RHEL 7 x86_64
but the bug is against OCP 4.2, 4.3 on Z.

The catalog sources appear to install some correct images in 4.5 on Z, and some of them appear to be s390x images, but they don't seem to start the expected workloads; they error out in strange ways; see https://bugzilla.redhat.com/show_bug.cgi?id=1766364

Also, the fact that ERRATA https://access.redhat.com/errata/RHBA-2020:2409 only applies to 4.5 on x86 -- is that only because 4.5 has not GA'd yet on Z, i.e. should we be testing the upgrade paths from 4.4 to 4.5 in looking for this fix, or should that appear in a nightly, such as 4.5.0-0.nightly-s390x-2020-07-03-213659 -- and, do we close the bug if the samples appear even if none of the samples we've tried on Z seem to work, or do we open a separate bug for each sample that does not work?

Directing this as a needinfo to dgilmore because it's a policies & procedures question as to what we should be doing with this class of bug in general. There seem to be a lot of them.

removing needinfo

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475 |
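The post-reset check discussed in the comments (run `oc delete configs.samples cluster`, then inspect `oc get clusteroperator openshift-samples`) could be scripted roughly as follows. This is only a sketch, not something taken from the bug: it assumes the `ClusterOperator` JSON carries `status.conditions` entries with `type` and `status` fields and a `status.versions` list, and the helper names `samples_reset_ok` and `fetch_samples_operator` are hypothetical.

```python
import json
import subprocess

# Condition values the comments expect after a successful reset
# (available==true, progressing==false, degraded==false).
EXPECTED = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def samples_reset_ok(cluster_operator: dict) -> bool:
    """Return True if the ClusterOperator looks like a fresh install.

    Assumes the usual ClusterOperator layout: status.conditions is a list
    of {"type": ..., "status": ...} and status.versions is a non-empty list.
    """
    status = cluster_operator.get("status", {})
    conditions = {
        c.get("type"): c.get("status") for c in status.get("conditions", [])
    }
    version_set = bool(status.get("versions"))
    return version_set and all(
        conditions.get(name) == value for name, value in EXPECTED.items()
    )


def fetch_samples_operator() -> dict:
    """Fetch the openshift-samples ClusterOperator as parsed JSON.

    Hypothetical helper: requires `oc` on PATH and a logged-in session.
    """
    result = subprocess.run(
        ["oc", "get", "clusteroperator", "openshift-samples", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On a live cluster, calling `samples_reset_ok(fetch_samples_operator())` some time after the delete would indicate whether the operator has settled; the operator may need a short while to come back up, so a single immediate check can be misleading.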