Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2068601

Summary: Potential etcd inconsistent revision and data occurs
Product: OpenShift Container Platform
Reporter: David Eads <deads>
Component: Etcd
Assignee: Dean West <dwest>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: urgent
Priority: unspecified
Version: 4.9
CC: aalsaadi, agogala, Alexandros.Phinikarides, alray, asalvati, astafeye, bernd.malmqvist, brian.otte, ccornejo, david.karlsen, dmunneor, dpathak, dwest, echen, Holger.Wolf, igreen, iheim, jdee, jiewu, jkho, jshivers, kahara, lmohanty, mifiedle, mkarnik, moddi, mrobson, musman, niklas.friberg, nsu, oarribas, palonsor, palshure, pawankum, pmuller, qguo, rdiazgav, rgertzbe, rh-container, rvicente, sbelmasg, sburke, seunlee, s.heijmans, shzhou, skrenger, sreber, travi, vnema, wking, wlewis, yuokada
Target Milestone: ---
Keywords: Regression, UpgradeBlocker
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: UpdateRecommendationsBlocked
Doc Type: No Doc Update
Clones: 2069825 2071114 (view as bug list)
Bug Blocks: 2069825
Last Closed: 2022-08-10 11:02:19 UTC
Type: Bug

Description David Eads 2022-03-25 18:57:27 UTC
Data corruption of etcd under load combined with uncontrolled process death.
Etcd 3.5 has a data corruption problem: https://github.com/etcd-io/etcd/issues/13766.  It is triggered by moderate-to-high etcd load combined with uncontrolled etcd process kills.  This could be a kill -9, an OOM kill, power loss, or something similar.

Once the etcd data is corrupted, the OpenShift cluster will appear to be running three healthy members, but will actually be split-brained, with inconsistent API results.

Once etcd data is corrupted by this bug, the only ways to recover are to restore from backup or to choose one of the corrupted members as the source of truth and take manual steps to restore etcd from it.
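One rough way to spot the divergence from outside (a sketch, not the documented support procedure; the endpoints, certificate paths, and revision numbers below are invented placeholders) is to compare the revision each member reports and flag members that have drifted apart:

```shell
# Ask every member for its status; on a healthy cluster the reported
# revisions are close together and converging, while a split-brained
# cluster drifts apart permanently.  Endpoints and cert paths are placeholders:
#
#   etcdctl endpoint status \
#     --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
#     --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
#     -w table
#
# A quick drift check over the reported revisions (sample numbers shown):
revisions="105234 105233 98761"
max=0; min=
for r in $revisions; do
  if [ "$r" -gt "$max" ]; then max="$r"; fi
  if [ -z "$min" ] || [ "$r" -lt "$min" ]; then min="$r"; fi
done
echo "revision spread: $((max - min))"
if [ "$((max - min))" -gt 1000 ]; then
  echo "WARNING: possible revision divergence"
fi
```

Etcd 3.5 also ships experimental corruption-check options (e.g. --experimental-initial-corrupt-check), which are the more supported detection path than eyeballing revisions.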

DO NOT downgrade from 3.5 to 3.4.  There appears to be a problem with the 3.5 -> 3.4 -> 3.5 path: https://github.com/etcd-io/etcd/issues/13514.  As I understand it, there's a new 3.5 field called term that is left dirty by 3.4 and leads to data consistency problems.



Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* Customers running etcd 3.5 (OCP 4.9 and 4.10) under moderate-to-high master load.
* We would need to block 4.8 to 4.9.  Blocking 4.9 to 4.10 to avoid the load surge on rolling upgrade of masters may also be advisable.
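For admins wondering whether their own cluster still offers the blocked edges, the standard update client lists the currently recommended targets (a sketch; the exact output format varies by OCP version):

```shell
# Show the cluster's current version and the update targets Cincinnati
# still recommends for it; once the 4.8 -> 4.9 edges are pulled, 4.9
# targets no longer appear in the recommended-updates list.
oc adm upgrade
```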
 
What is the impact? Is it serious enough to warrant blocking edges?
* Etcd goes split-brained, resulting in inconsistent API responses.  On our self-hosted platform, this results in operators that cannot function reliably and inconsistent customer workload behavior as pods appear and disappear randomly and leader election doesn’t function.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* We can either restore from backup or choose one of the split brains as the winner, kill the rest, and restore from the winner.
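The "choose a winner" path is roughly the standard etcd disaster-recovery flow (a sketch only; OpenShift wraps this in its own restore procedure, and the member name, peer URL, and paths below are placeholders):

```shell
# On the member chosen as the source of truth, capture its current data:
etcdctl snapshot save /backup/snap.db

# Rebuild a fresh single-member cluster from that snapshot; the remaining
# (losing) members are wiped and re-added to the new cluster afterwards.
etcdctl snapshot restore /backup/snap.db \
  --name member-0 \
  --initial-cluster member-0=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --data-dir /var/lib/etcd-new
```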

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, etcd 3.4 does not suffer from this bug.  The reproducer only works on 3.5.0, 3.5.1, and 3.5.2.

Comment 1 W. Trevor King 2022-03-25 20:39:04 UTC
> DO NOT downgrade from 3.5 to 3.4.  There appears to be a problem with 3.5 to 3.4 back to 3.5

I don't think you even need to involve the 3.4 -> 3.5 leg, because 3.5 -> 3.4 will fail on the backwards-incompatible disk-schema change [1]:

  {"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...

which is why we set up pre-minor-bump etcd snapshots for 4.8 -> 4.9, starting in 4.8.12 [2].  And 4.8.14 is the oldest 4.8.z with currently recommended updates to 4.9 [3].

But that's all quibbling with "why?".  I'm +100 on "don't try to roll back to etcd 3.4 / OCP 4.8" as "what?".

[1]: https://github.com/openshift/release/pull/22287#issue-1008767920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1999777#c19
[3]: https://github.com/openshift/cincinnati-graph-data/blob/d8c05513732b51a2a49735497609bcd0c945c1a2/build-suggestions/4.9.yaml#L5

Comment 2 Lalatendu Mohanty 2022-03-25 20:57:00 UTC
This issue [1] results in etcd data inconsistency, which makes the cluster unusable and etcd data recovery very difficult. We are going ahead with blocking update edges to inhibit additional updates from etcd 3.4 to etcd 3.5 (i.e. all update edges from OCP 4.8 to 4.9) while we investigate this issue. We estimate the chance of hitting this issue to be zero on 4.8 and earlier, and small on 4.9 and later.

[1] https://github.com/etcd-io/etcd/issues/13766

Comment 3 W. Trevor King 2022-03-29 17:14:42 UTC
Upstream announcement: https://etcd.io/docs/v3.5/op-guide/data_corruption/

Comment 5 W. Trevor King 2022-03-29 17:45:53 UTC
Public errata URI is https://access.redhat.com/errata/RHBA-2022:1086 (associated with the related bug 2069085)

Comment 6 W. Trevor King 2022-03-29 18:20:10 UTC
[1] removed 4.8 -> 4.9 update recommendations while we work through this, so adding UpgradeBlocker and, per [2], UpdateRecommendationsBlocked.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1663
[2]: https://github.com/openshift/enhancements/tree/master/enhancements/update/update-blocker-lifecycle

Comment 11 W. Trevor King 2022-04-01 20:57:02 UTC
Scott dropped #774 off this series back at 13:51 UTC^.  I've cloned off bug 2071114 to track that alerting proposal.

Comment 13 W. Trevor King 2022-04-05 03:57:37 UTC
General bug-process reminder: this is the 4.11.0 bug.  It's useful for generic bug-series discussion like comment 0's impact statement or comment 11's mention of the cloned alerting series.  But unless you are waiting for 4.11 to GA, you are probably going to be more interested in bug 2069825 (ON_QA for 4.10.z) or bug 2069830 (ON_QA for 4.9.z).  Adding yourself to the CC list on either or both of those bugs will get you notifications of progress in each z stream, and you'll get another notification when the associated public errata goes out for the patch release with the fix.

Comment 28 Vinya Nema 2022-07-27 04:56:55 UTC
Hello Dean,

Hope you are doing well!

My customer is failing to scale up pods and does not want to wait for the fix to be released.
Is there any workaround we can suggest to them?
They are also pushing to upgrade their cluster, which I think would not be safe given the data inconsistency issue with etcd 3.5.

Could you please share your thoughts on this?

Their asks are:

1. Are we seeing these etcd error messages because of the bug in version 3.5 that you describe? >> I think it is because of this bug on etcd 3.5.0.
2. Does that mean we can never set the resource limit for one of the namespaces to be 50% of the total cluster resources? >> Help me in answering this?
3. Would updating to the latest version of OCP help in resolving the issue?


SFDC Case Reference- #03264188
https://gss--c.visualforce.com/apex/Case_View?id=5006R00001mgPEQ&sfdc.override=1

Comment 30 errata-xmlrpc 2022-08-10 11:02:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 31 Red Hat Bugzilla 2023-09-18 04:34:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.