Bug 2068601
| Summary: | Potential etcd inconsistent revision and data occurs | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> | |
| Component: | Etcd | Assignee: | Dean West <dwest> | |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.9 | CC: | aalsaadi, agogala, Alexandros.Phinikarides, alray, asalvati, astafeye, bernd.malmqvist, brian.otte, ccornejo, david.karlsen, dmunneor, dpathak, dwest, echen, Holger.Wolf, igreen, iheim, jdee, jiewu, jkho, jshivers, kahara, lmohanty, mifiedle, mkarnik, moddi, mrobson, musman, niklas.friberg, nsu, oarribas, palonsor, palshure, pawankum, pmuller, qguo, rdiazgav, rgertzbe, rh-container, rvicente, sbelmasg, sburke, seunlee, s.heijmans, shzhou, skrenger, sreber, travi, vnema, wking, wlewis, yuokada | |
| Target Milestone: | --- | Keywords: | Regression, UpgradeBlocker | |
| Target Release: | 4.11.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | UpdateRecommendationsBlocked | |||
| Fixed In Version: | | Doc Type: | No Doc Update | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 2069825 2071114 | Environment: | ||
| Last Closed: | 2022-08-10 11:02:19 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2069825 | |||
Description
David Eads
2022-03-25 18:57:27 UTC
> DO NOT downgrade from 3.5 to 3.4. There appears to be a problem with 3.5 to 3.4 back to 3.5

I don't think you even need to involve the 3.4 -> 3.5 leg, because 3.5 -> 3.4 will fail on the backwards-incompatible disk-schema change [1]:

{"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...

which is why we set up pre-minor-bump etcd snapshots for 4.8 -> 4.9, starting in 4.8.12 [2]. And 4.8.14 is the oldest 4.8.z with currently recommended updates to 4.9 [3]. But that's all quibbling with "why?". I'm +100 on "don't try to roll back to etcd 3.4 / OCP 4.8" as "what?".

[1]: https://github.com/openshift/release/pull/22287#issue-1008767920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1999777#c19
[3]: https://github.com/openshift/cincinnati-graph-data/blob/d8c05513732b51a2a49735497609bcd0c945c1a2/build-suggestions/4.9.yaml#L5

This issue [1] results in etcd data inconsistency, which makes the cluster unusable and etcd data recovery very difficult. We are going ahead with blocking update edges to inhibit additional updates from etcd 3.4 to etcd 3.5 (i.e. all update edges from OCP 4.8 to 4.9) while we investigate this issue. We estimate the chance of hitting this issue to be zero on 4.8 and earlier, and small on 4.9 and later.

[1] https://github.com/etcd-io/etcd/issues/13766

Upstream announcement: https://etcd.io/docs/v3.5/op-guide/data_corruption/

Public errata URI is https://access.redhat.com/errata/RHBA-2022:1086 (associated with the related bug 2069085)

[1] removed 4.8 -> 4.9 update recommendations while we work through this, so adding UpgradeBlocker and, per [2], UpdateRecommendationsBlocked.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1663
[2]: https://github.com/openshift/enhancements/tree/master/enhancements/update/update-blocker-lifecycle

Scott dropped #774 off this series back at 13:51 UTC^.

I've cloned off bug 2071114 to track that alerting proposal.

General bug-process reminder: this is the 4.11.0 bug. It's useful for generic bug-series discussion like comment 0's impact statement or comment 11's mention of the cloned alerting series. But unless you are waiting for 4.11 to GA, you are probably going to be more interested in bug 2069825 (ON_QA for 4.10.z) or bug 2069830 (ON_QA for 4.9.z). Adding yourself to the CC list on either or both of those bugs will get you notifications of progress in each z stream, and you'll get another notification when the associated public errata goes out for the patch release with the fix.

Hello Dean,

Hope you are doing well! My customer is unable to scale up their pods and is impatient for the fix to be released. Is there a workaround we can suggest to him? He is also pushing to upgrade his cluster, which I don't think is a good idea given the data inconsistency issue with etcd 3.5. Could you please share your thoughts on this?

Their ask is:
1. Are we seeing these etcd error messages because of the bug in version 3.5 that you describe? >> I think it is because of this bug in etcd 3.5.0.
2. Does that mean we can never set the resource limit for one of the namespaces to be 50% of the total cluster resources? >> Please help me answer this.
3. Will updating to the latest version of OCP help resolve the issue?

SFDC Case Reference: #03264188 https://gss--c.visualforce.com/apex/Case_View?id=5006R00001mgPEQ&sfdc.override=1

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.