Bug 2068601 - Potential etcd inconsistent revision and data occurs
Summary: Potential etcd inconsistent revision and data occurs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Dean West
QA Contact: ge liu
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On:
Blocks: 2069825
 
Reported: 2022-03-25 18:57 UTC by David Eads
Modified: 2023-09-18 04:34 UTC
CC: 52 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2069825 2071114
Environment:
Last Closed: 2022-08-10 11:02:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 770 0 None Merged Bug 2068601: turn on initial corruption check 2022-03-31 06:51:37 UTC
Red Hat Knowledge Base (Solution) 6844331 0 None None None 2022-03-28 19:37:45 UTC
Red Hat Knowledge Base (Solution) 6849521 0 None None None 2022-03-28 19:37:45 UTC
Red Hat Product Errata RHBA-2022:1086 0 None None None 2022-03-29 17:45:53 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:02:47 UTC
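
PR 770 above enables etcd's initial corruption check, which verifies a member's KV hash against its peers before the member starts serving client traffic; the operator's change amounts to passing the equivalent --experimental-initial-corrupt-check=true flag to etcd. As a rough illustration only (not the operator's actual code), the same setting on an embedded etcd 3.5 server would look roughly like this; the data directory is a placeholder:

  package main

  import (
      "log"

      "go.etcd.io/etcd/server/v3/embed"
  )

  func main() {
      cfg := embed.NewConfig()
      cfg.Dir = "/var/lib/etcd" // placeholder data directory

      // Equivalent of the --experimental-initial-corrupt-check flag that
      // PR 770 turns on: verify this member's KV hash against its peers
      // before serving any client traffic.
      cfg.ExperimentalInitialCorruptCheck = true

      e, err := embed.StartEtcd(cfg)
      if err != nil {
          log.Fatal(err)
      }
      defer e.Close()

      <-e.Server.ReadyNotify()
      log.Println("etcd is ready; initial corruption check passed")
      <-e.Err() // block until the server exits
  }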

Description David Eads 2022-03-25 18:57:27 UTC
Data corruption of etcd under load combined with uncontrolled process death.
Etcd 3.5 has a data corruption problem: https://github.com/etcd-io/etcd/issues/13766.  It is triggered by moderate-to-high etcd load combined with uncontrolled etcd process kills.  That could be a kill -9, an OOM kill, power loss, or something similar.

Once the etcd data is corrupted, the OpenShift cluster will appear to be running three healthy members, but will actually be split-brained, with inconsistent API results.

Once etcd data is corrupted by this bug, the only way to recover is to restore from backup, or to choose one of the corrupted members as the new leader and take manual steps to restore etcd from it.

DO NOT downgrade from 3.5 to 3.4.  There appears to be a problem with going 3.5 -> 3.4 -> 3.5: https://github.com/etcd-io/etcd/issues/13514.  As I understand it, there is a new 3.5 field called term that is left dirty by 3.4 and leads to data consistency problems.
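
A minimal sketch of checking whether members have already diverged, by comparing each endpoint's revision and KV hash with the etcd Go client. The endpoints below are placeholders and this is only an illustration, not an official diagnostic; on OCP the client would also need the etcd serving CA and client certificates, which are omitted here:

  package main

  import (
      "context"
      "fmt"
      "log"
      "time"

      clientv3 "go.etcd.io/etcd/client/v3"
  )

  func main() {
      // Placeholder endpoints; substitute the three control-plane members.
      endpoints := []string{
          "https://10.0.0.1:2379",
          "https://10.0.0.2:2379",
          "https://10.0.0.3:2379",
      }

      cli, err := clientv3.New(clientv3.Config{
          Endpoints:   endpoints,
          DialTimeout: 5 * time.Second,
          // TLS config for the cluster's etcd certs would go here.
      })
      if err != nil {
          log.Fatal(err)
      }
      defer cli.Close()

      ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
      defer cancel()

      // Collect each member's current revision, then hash every member at the
      // lowest common revision so the hashes are directly comparable.
      revisions := make(map[string]int64)
      var minRev int64
      for _, ep := range endpoints {
          st, err := cli.Status(ctx, ep)
          if err != nil {
              log.Fatalf("status %s: %v", ep, err)
          }
          revisions[ep] = st.Header.Revision
          if minRev == 0 || st.Header.Revision < minRev {
              minRev = st.Header.Revision
          }
      }

      for _, ep := range endpoints {
          h, err := cli.HashKV(ctx, ep, minRev)
          if err != nil {
              log.Fatalf("hashkv %s: %v", ep, err)
          }
          // Healthy members report the same hash at the same revision;
          // members hit by this bug do not.
          fmt.Printf("%s current-revision=%d hash@%d=%d\n",
              ep, revisions[ep], minRev, h.Hash)
      }
  }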



Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* Customers running etcd 3.5 (OCP 4.9 and 4.10) under moderate-to-high master load.
* We would need to block 4.8 to 4.9.  Blocking 4.9 to 4.10, to avoid the load surge from the rolling upgrade of masters, may also be advisable.
 
What is the impact? Is it serious enough to warrant blocking edges?
* Etcd goes split-brained, resulting in inconsistent API responses.  On our self-hosted platform, this results in operators that cannot function reliably and inconsistent customer workload behavior, as pods appear and disappear randomly and leader election does not function.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* We can either restore from backup, or choose one of the divergent members as the winner, kill the rest, and restore from the winner.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, etcd 3.4 does not suffer from this bug.  The reproducer only works on 3.5.0, 3.5.1, and 3.5.2.

Comment 1 W. Trevor King 2022-03-25 20:39:04 UTC
> DO NOT downgrade from 3.5 to 3.4.  There appears to be a problem with 3.5 to 3.4 back to 3.5

I don't think you even need to involve the 3.4 -> 3.5 leg, because 3.5 -> 3.4 will fail on the backwards-incompatible disk-schema change [1]:

  {"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...

which is why we set up pre-minor-bump etcd snapshots for 4.8 -> 4.9, starting in 4.8.12 [2].  And 4.8.14 is the oldest 4.8.z with currently recommended updates to 4.9 [3].

But that's all quibbling with "why?".  I'm +100 on "don't try to roll back to etcd 3.4 / OCP 4.8" as "what?".

[1]: https://github.com/openshift/release/pull/22287#issue-1008767920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1999777#c19
[3]: https://github.com/openshift/cincinnati-graph-data/blob/d8c05513732b51a2a49735497609bcd0c945c1a2/build-suggestions/4.9.yaml#L5

Comment 2 Lalatendu Mohanty 2022-03-25 20:57:00 UTC
This issue [1] results in etcd data inconsistency, which makes the cluster unusable and etcd data recovery very difficult. We are going ahead with blocking update edges to inhibit additional updates from etcd 3.4 to etcd 3.5 (i.e. all update edges from OCP 4.8 to 4.9) while we investigate this issue. We estimate the chance of hitting this issue to be zero on 4.8 and earlier, and small on 4.9 and later.

[1] https://github.com/etcd-io/etcd/issues/13766

Comment 3 W. Trevor King 2022-03-29 17:14:42 UTC
Upstream announcement: https://etcd.io/docs/v3.5/op-guide/data_corruption/
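
The upstream guidance leans on etcd's corruption checks and alarm mechanism: as I understand it, when a periodic corruption check finds a hash mismatch, the affected member raises a CORRUPT alarm. A small sketch (placeholder endpoint, not an official tool) of listing alarms with the Go client:

  package main

  import (
      "context"
      "fmt"
      "log"
      "time"

      pb "go.etcd.io/etcd/api/v3/etcdserverpb"
      clientv3 "go.etcd.io/etcd/client/v3"
  )

  func main() {
      cli, err := clientv3.New(clientv3.Config{
          Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder
          DialTimeout: 5 * time.Second,
      })
      if err != nil {
          log.Fatal(err)
      }
      defer cli.Close()

      ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
      defer cancel()

      // List cluster-wide alarms and flag any member reporting corruption.
      resp, err := cli.AlarmList(ctx)
      if err != nil {
          log.Fatal(err)
      }
      for _, a := range resp.Alarms {
          if a.Alarm == pb.AlarmType_CORRUPT {
              fmt.Printf("member %x has raised a CORRUPT alarm\n", a.MemberID)
          }
      }
      if len(resp.Alarms) == 0 {
          fmt.Println("no alarms raised")
      }
  }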

Comment 5 W. Trevor King 2022-03-29 17:45:53 UTC
Public errata URI is https://access.redhat.com/errata/RHBA-2022:1086 (associated with the related bug 2069085)

Comment 6 W. Trevor King 2022-03-29 18:20:10 UTC
[1] removed 4.8 -> 4.9 update recommendations while we work through this, so adding UpgradeBlocker and, per [2], UpdateRecommendationsBlocked.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1663
[2]: https://github.com/openshift/enhancements/tree/master/enhancements/update/update-blocker-lifecycle

Comment 11 W. Trevor King 2022-04-01 20:57:02 UTC
Scott dropped #774 off this series back at 13:51 UTC.  I've cloned off bug 2071114 to track that alerting proposal.

Comment 13 W. Trevor King 2022-04-05 03:57:37 UTC
General bug-process reminder: this is the 4.11.0 bug.  It's useful for generic bug-series discussion like comment 0's impact statement or comment 11's mention of the cloned alerting series.  But unless you are waiting for 4.11 to GA, you are probably going to be more interested in bug 2069825 (ON_QA for 4.10.z) or bug 2069830 (ON_QA for 4.9.z).  Adding yourself to the CC list on either or both of those bugs will get you notifications of progress in each z stream, and you'll get another notification when the associated public errata goes out for the patch release with the fix.

Comment 28 Vinya Nema 2022-07-27 04:56:55 UTC
Hello Dean,

Hope you are doing well!

My customer is failing to scale up pods and is quite impatient waiting for the fix to be released.
Is there any workaround we can suggest to him?
Also, he is pushing to upgrade his cluster, which I think is not advisable given the data inconsistency issue with etcd 3.5.

Could you please share your thoughts on this?

Their asks are:

1. Are we seeing these etcd error messages because of the bug in version 3.5 that you describe? >> I think it is because of this bug in etcd 3.5.0.
2. Does that mean we can never set the resource limit for one namespace to 50% of the total cluster resources? >> Help me in answering this?
3. Will updating to the latest version of OCP help in resolving the issue?


SFDC Case Reference: #03264188
https://gss--c.visualforce.com/apex/Case_View?id=5006R00001mgPEQ&sfdc.override=1

Comment 30 errata-xmlrpc 2022-08-10 11:02:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 31 Red Hat Bugzilla 2023-09-18 04:34:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

