Bug 2024005 - KCM pods are crashlooping [NEEDINFO]
Summary: KCM pods are crashlooping
Keywords:
Status: CLOSED DUPLICATE of bug 2077497
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dean West
QA Contact: ge liu
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks:
 
Reported: 2021-11-17 01:05 UTC by Hongkai Liu
Modified: 2022-04-27 10:12 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-27 10:12:13 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?



Comment 1 Michal Fojtik 2021-11-17 01:35:15 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please be prepared to discuss a reasonable justification, and ensure that your own management and engineering management are aware of and agree that this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximum management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message from must-gather.
2. Give the output of "oc get clusteroperators -o yaml" (a collection sketch follows this list).
3. In the case of degraded/unavailable operators, have all of their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is most likely the cause of the others' failure, with the root cause at the top.
6. Explain why the order in (5) is likely correct and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.
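
For reference, the data for (1) and (2) can usually be collected along these lines (a minimal sketch; the output file name is illustrative):

```
# Collect the standard support archive for (1); it is written to ./must-gather.local.* by default.
oc adm must-gather

# Capture the cluster operator status for (2).
oc get clusteroperators -o yaml > clusteroperators.yaml
```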

Comment 2 Petr Muller 2021-11-17 01:49:55 UTC
1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message from must-gather.
yes

2. Give the output of "oc get clusteroperators -o yaml".
attached

3. In the case of degraded/unavailable operators, have all of their logs and the logs of the operands been analyzed? [yes/no]
no

4. List the top 5 relevant errors from the logs of the operators and operands in (3).
KCM log:

I1116 17:11:04.559106       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1116 17:11:04.559246       1 controllermanager.go:323] leaderelection lost

5. Order the list of degraded/unavailable operators according to which is most likely the cause of the others' failure, with the root cause at the top.
KCM

6. Explain why the order in (5) is likely correct and list the information used for that assessment.
KCM is crashlooping, and we are not experienced enough as OCP troubleshooters to judge why.

7. Explain why Engineering is necessary to make progress.
This problem brought down 2 OpenShift CI clusters and severely impacts CI.

Comment 6 Sam Batschelet 2021-11-18 17:07:53 UTC
During a deeper review, I noticed that the etcd container had unexpectedly exited with a non-zero code. The kubelet reported:

```
Nov 16 15:30:25.817157 build0-gstfj-m-0.c.openshift-ci-build-farm.internal hyperkube[1887]: I1116 15:30:25.817117    1887 generic.go:296] "Generic (PLEG): container finished" podID=88ce32c5-2121-49ea-addd-18b8d1a4e5ee containerID="29f662eeb26c01194b942b9cb2af9095bbd571cb4ce28c0f23fefec04312860c" exitCode=2
Nov 16 15:30:25.817255 build0-gstfj-m-0.c.openshift-ci-build-farm.internal hyperkube[1887]: I1116 15:30:25.817208    1887 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-etcd/etcd-build0-gstfj-m-0.c.openshift-ci-build-farm.internal" event=&{ID:88ce32c5-2121-49ea-addd-18b8d1a4e5ee Type:ContainerDied Data:29f662eeb26c01194b942b9cb2af9095bbd571cb4ce28c0f23fefec04312860c}

```

We were lucky enough to have the previous container log, which showed an etcd panic:

```
{"level":"panic","ts":"2021-11-16T15:30:24.774Z","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue

goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc013779bc0, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0000aefa0, 0x122a229, 0x23, 0xc06da42a80, 0x1, 0x1)
        /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.uber.org/zap.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0000aefa0, 0xc00047aa40, 0xc02176a000, 0x2f9e, 0x32aa, 0xc02051a000, 0x2f9e, 0x32aa, 0x12, 0x7fffffffffffffff, ...)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00047aa00, 0x0)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00047aa00)
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
        /remote-source/cachito-gomod-with-deps/app/server/mvcc/watchable_store.go:95 +0x3a7

```

The etcd team is exploring the situation further to understand the full extent of the problem and whether the result was indeed corruption of etcd state. This bug has also been reported upstream by ByteDance [1]. We will continue to work with the ByteDance team and other maintainers to ensure resolution. A more complete RCA report is pending investigation.

[1] https://github.com/etcd-io/etcd/pull/13426
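
For anyone triaging a similar crashloop, a quick way to check for this panic (a minimal sketch reusing the pod name from this cluster; substitute the affected node's etcd pod) is to pull the previous etcd container log:

```
# Fetch the log of the previously crashed etcd container on the affected node
# and look for the unmarshal panic.
oc -n openshift-etcd logs etcd-build0-gstfj-m-0.c.openshift-ci-build-farm.internal \
  -c etcd --previous | grep -B2 -A2 'failed to unmarshal mvccpb.KeyValue'
```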

Comment 7 ffernand 2021-11-22 15:20:42 UTC
Based on comment 6, adjusting the component to Etcd. Please reassign to the networking team if this is not an etcd issue.

Comment 8 Michal Fojtik 2021-11-22 15:39:27 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please be prepared to discuss a reasonable justification, and ensure that your own management and engineering management are aware of and agree that this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximum management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message from must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In the case of degraded/unavailable operators, have all of their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is most likely the cause of the others' failure, with the root cause at the top.
6. Explain why the order in (5) is likely correct and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 10 Sam Batschelet 2021-11-23 21:37:13 UTC
After further review and conversations with other upstream etcd teams, it appears that the member that panics believes its log is valid (it passes bolt integrity checks), and the leader also assumes the peer is consistent, so it does not replace the log with a snapshot. At the same time, kube-controller-manager was not able to update leases, and defragmentation appeared to be failing on the affected member as well, which may point to an inability to access the WAL and/or DB.

Based on other teams' observations that the affected state was isolated to the peer that panicked, I chose to replace the member, forcing a resync of the raft log via snapshot on scale-up. This corrected the situation and the cluster made progress. From speaking with other teams, I believe this bug is low-probability but critical. As such, the etcd team will continue to work with upstream teams on mitigations and will perform a detailed post mortem to help support understand and resolve the issue in other clusters.
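
For context, that member replacement roughly corresponds to the following (a hedged sketch with hypothetical placeholders; on a real cluster, follow the documented OpenShift procedure for replacing an unhealthy etcd member):

```
# 1. From a healthy etcd pod, list the members and note the ID of the panicking peer
#    (<healthy-node> and <member-id> are placeholders).
oc -n openshift-etcd rsh etcd-<healthy-node> etcdctl member list -w table

# 2. Remove the affected member so its local state is no longer trusted.
oc -n openshift-etcd rsh etcd-<healthy-node> etcdctl member remove <member-id>

# 3. When the member is added back (cluster-etcd-operator handles the scale-up),
#    it rejoins empty and is brought up to date via a snapshot from the leader,
#    which is the forced raft log resync described above.
```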

While the upstream PR [1] is open, it really only focuses on mitigations; the actual root cause of the failure is unknown. But once we harden the observability signal for this condition, we can attempt to aggregate CI tests and possibly telemetry to better understand the probability, and thus the priority.

Today we have a potential workaround, but given the severity, I will leave the bug at high while triage is ongoing.

[1] https://github.com/etcd-io/etcd/pull/13426

Comment 12 W. Trevor King 2022-01-13 06:06:51 UTC
[1] was closed in favor of [2], which was backported to 3.5 in [3], and is expected to go out in 3.5.2 [4].

[1]: https://github.com/etcd-io/etcd/pull/13426
[2]: https://github.com/etcd-io/etcd/pull/13505
[3]: https://github.com/etcd-io/etcd/pull/13541
[4]: https://github.com/etcd-io/etcd/issues/13518

Comment 13 W. Trevor King 2022-01-13 18:05:47 UTC
Also some CI-search/Sippy data. The specific error does show up in CI JUnit results, where it's rare:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=panic:+failed+to+unmarshal+mvccpb.KeyValue' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 88 runs, 57% failed, 2% of failures match = 1% impact

And in jobs like [1], the failing test-case is:

  : [sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]	2s
  fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:52]: Unexpected error:
    <*errors.errorString | 0xc000a36d60>: {
        s: "Observed 1.0050724637681159 leader changes (expected 0) in 1h32m28s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
    }
    Observed 1.0050724637681159 leader changes (expected 0) in 1h32m28s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics
  occurred

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1481416866843856896

Comment 14 W. Trevor King 2022-02-03 04:52:17 UTC
Expanding on comment 6, a compact diagnostic, if your Kubernetes API is still up and able to return relatively current results, is:

  $ oc -n openshift-etcd get -o yaml pods | grep -1 mvccpb.KeyValue

If the output includes "failed to unmarshal mvccpb.KeyValue", it's probably this issue. If it doesn't, your cluster might have something else going on, or maybe it is this issue and the results the Kube-API server sent back to you were stale. So positive matches are fairly conclusive; negatives are less conclusive.
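
If the API server is not returning current data at all, a node-level variant of the same check (a rough sketch, assuming the default kubelet log layout under /var/log/pods; <master-node> is a placeholder) would be:

```
# Open a debug shell on a control-plane node...
oc debug node/<master-node>
# ...then, inside the debug pod, switch into the host filesystem
chroot /host
# and search the on-disk etcd pod logs for the panic message.
grep -r 'failed to unmarshal mvccpb.KeyValue' /var/log/pods/openshift-etcd_etcd-*/etcd/
```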

Comment 15 Michal Fojtik 2022-03-09 17:13:51 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 16 Thomas Jungblut 2022-04-27 10:12:13 UTC
The upstream fix should be coming through the rebase tracked in bug 2077497.

*** This bug has been marked as a duplicate of bug 2077497 ***

