Bug 1890724
Summary: | openshift-apiserver and kube-apiserver flapping degraded on bare metal compact cluster during QE automation
---|---
Product: | OpenShift Container Platform
Component: | Etcd
Version: | 4.6
Reporter: | Mike Fiedler <mifiedle>
Assignee: | Sam Batschelet <sbatsche>
QA Contact: | ge liu <geliu>
Status: | CLOSED INSUFFICIENT_DATA
Severity: | medium
Priority: | low
CC: | aos-bugs, dmace, lxia, mfojtik, sbatsche, sttts
Flags: | mfojtik: needinfo?
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | LifecycleStale
Doc Type: | If docs needed, set a value
Story Points: | ---
Cloned To: | 1891740 (view as bug list)
Last Closed: | 2021-01-23 12:04:10 UTC
Type: | Bug
Bug Blocks: | 1891013
Description (Mike Fiedler, 2020-10-22 19:30:26 UTC)
About the must-gather from comment 2: aside from a lot of noise in the OperatorStatusChanged condition message (see attached PR to improve that), I only see one time where cluster-kube-apiserver-operator is degraded, namely at 14:40:28:

```
Status for clusteroperator/kube-apiserver changed: Degraded changed from False to True
NodeInstallerDegraded: 1 nodes are failing on revision 2:
NodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver" is not ready: unknown reason
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver" is terminated: Error: iled to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:05.337823 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:05.825955 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:09.406065 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:09.763972 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:16.443803 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:17.417228 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: Error: context deadline exceeded
StaticPodsDegraded: I1022 06:40:19.703813 1 main.go:198] Termination finished with exit code 1
StaticPodsDegraded: I1022 06:40:19.703851 1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"
StaticPodsDegraded:
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver-check-endpoints" is running for 18.097539475s but not ready: unknown reason
StaticPodsDegraded: pods "kube-apiserver-control-plane-1" not found
StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
```

with the error: Error while dialing dial tcp [::1]:2379: connect: connection refused

At the same time in the etcd namespace I see:

```
Status for clusteroperator/etcd changed: Degraded changed from False to True
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcd" is not ready: PodInitializing:
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcd-metrics" is not ready: PodInitializing:
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcdctl" is not ready: PodInitializing:
StaticPodsDegraded: pods "etcd-control-plane-1" not found
EtcdMembersDegraded: 1 of 2 members are available, NAME-PENDING-136.144.52.218 has not started
```

Shortly after, etcd-control-plane-0 reports etcd pod startup. A little while later its operator notices:

```
Status for clusteroperator/etcd changed: Available changed from False to True
StaticPodsAvailable: 1 nodes are active; 2 nodes are at revision 0; 1 nodes are at revision 2
EtcdMembersAvailable: 2 members are available
```

Passing to the etcd team for further investigation. The hiccup seems to be short, though long enough to be visible on the kube-apiserver side.

> Passing to the etcd team for further investigation. The hiccup seems to be short, though long enough to be visible on the kube-apiserver side.

The logs referenced in https://bugzilla.redhat.com/show_bug.cgi?id=1890724#c6 describe etcd scaling up from the bootstrap node onto master-0. "NAME-PENDING-136.144.52.218 has not started" means that etcd membership has been scaled up by the operator but the new member has not yet started. Reviewing the timing, it appears to take a minute between the time we scale up etcd and the time the container starts. This feels like a long time for this process to complete; I would expect it to be closer to a few seconds.

```
14:39:41 adding new peer https://136.144.52.218:2380 (scale up etcd)
14:40:43 kubelet reports container etcd has started
```

So far I'm struggling to identify any evidence of an etcd bug. Here are some things I did notice which could contribute to apiserver etcd connection instability:

* Multiple instances of nodes becoming unready and the etcd process on those nodes being terminated, causing leader elections and client disconnects (see timestamps around 2020-10-22T07:14, 2020-10-22T07:43, 2020-10-22T07:04, 2020-10-22T07:30)
* etcd performed 44 leader elections over the life of the cluster, each time causing client reconnects

To me this indicates there is node instability, and possibly performance issues, impacting etcd, which causes downstream effects on, for instance, the apiserver. To really root cause these issues, I think we need a reproducer.

* Is there a reproducer?
If so, what are the exact steps to reproduce?
* Has this been observed in CI? If so, where?
* What are the specific machine characteristics hosting this cluster? It's possible they're insufficient for whatever workload is being tested. This information should be included in the reproducer instructions.

(In reply to Dan Mace from comment #9)
> So far I'm struggling to identify any evidence of an etcd bug. Here are some things I did notice which could contribute to apiserver etcd connection instability:
>
> * Multiple instances of nodes becoming unready and the etcd process on those nodes being terminated, causing leader elections and client disconnects (see timestamps around 2020-10-22T07:14, 2020-10-22T07:43, 2020-10-22T07:04, 2020-10-22T07:30)
> * etcd performed 44 leader elections over the life of the cluster, each time causing client reconnects
>
> To me this indicates there's node instability and possibly performance issues impacting etcd which cause downstream effects on, for instance, the apiserver.
>
> To really root cause these issues, I think we need a reproducer.
>
> * Is there a reproducer? If so, what are the exact steps to reproduce?
> * Has this been observed in CI? If so, where?
> * What are the specific machine characteristics hosting this cluster? It's possible they're insufficient for whatever workload is being tested. This information should be included in the reproducer instructions.

And to add a little more detail here, it's apparent from the machine config logs and node journals that the nodes underwent multiple reboots. It appears that a new machineconfig was applied that resulted in reboots, but this is after the instability observed by Stefan in https://bugzilla.redhat.com/show_bug.cgi?id=1890724#c6.

> 14:39:41 adding new peer https://136.144.52.218:2380 (scale up etcd)
> 14:40:43 kubelet reports container etcd has started.
> 15:03:01 Node control-plane-1 now has machineconfiguration.openshift.io/desiredConfig=rendered-master-47cb7e1f069b0f762585bf8ea7083967
> 15:04:13 Node will reboot into config rendered-master-47cb7e1f069b0f762585bf8ea7083967

A review of CI should be done to exclude this as a normal pattern. Below is a random review of three 4.7 nightlies to understand whether scaling is having issues. The test took events from the operator regarding the MemberAdd scale-up step. MemberAdd means that the operator has informed the etcd cluster API of a new member. For scaling to succeed, etcd will need to start with the proper runtime config. We see kubelet reporting when the etcd container starts, but when that container starts, logic exists which waits for the member to be observed in the etcd member list. Once that condition is met, etcd has all of the runtime environment variables required and can start. The problem is that the container can start before the operator scales up membership, and if there is trouble (for example, the operator loses its leader election), a fair amount of time can pass between when the container starts and when the operator performs the scale-up. Each time etcd fails to find itself in the membership, the container will restart, and these continued restarts can lead to a crashloop backoff. The backoff can lead to extended time required for scaling to complete. This is the concern that I wanted to validate.
```
# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1333462642060169216
# 4.7.0-0.nightly-2020-11-30-172451
17:55:34 Started container etcd
17:56:03 adding new peer https://10.0.132.231:2380
17:56:46.584125 ip-10-0-132-231 Readiness probe "etcd-quorum-guard-78d96cd8f5-xh7n7_openshift-etcd(fcc8e83a-3d30-4a0a-8f1a-67768c95d4a7):guard" succeeded
17:56:49 Started container etcd
17:56:50 adding new peer https://10.0.234.154:2380
17:56:51.819233 ip-10-0-234-154 Readiness probe "etcd-quorum-guard-78d96cd8f5-br998_openshift-etcd(25504b4b-4837-428c-a228-599d07b64628):guard" succeeded
17:57:22 Started container etcd
17:57:23 adding new peer https://10.0.189.100:2380
17:57:25.600388 ip-10-0-189-100 Readiness probe "etcd-quorum-guard-78d96cd8f5-96vjr_openshift-etcd(8145555a-c299-450f-aefd-ae9f6d3a2a39):guard" succeeded

# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1332416259496611840
# 4.7.0-0.nightly-2020-11-27-154307
20:25:36 Started container etcd
20:25:51 adding new peer https://10.0.134.148:2380
20:27:26.337246 ip-10-0-134-148 Readiness probe "etcd-quorum-guard-78d96cd8f5-4825z_openshift-etcd(b1433fc2-3c35-4a89-9664-de5a49c23c51):guard" succeeded
20:26:53 Started container etcd
20:26:54 adding new peer https://10.0.182.143:2380
20:27:25.042238 ip-10-0-182-143 Readiness probe "etcd-quorum-guard-78d96cd8f5-4fmsm_openshift-etcd(ddc3d2a3-4409-4f74-8105-570e186fc205):guard" succeeded
20:27:24 Started container etcd
20:27:25 adding new peer https://10.0.227.171:2380
20:27:29.111541 ip-10-0-227-171 Readiness probe "etcd-quorum-guard-78d96cd8f5-85h7r_openshift-etcd(027a11a8-8c4c-4854-9713-9fdd4b75841c):guard" succeeded

# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1331496637603778560
# 4.7.0-0.nightly-2020-11-25-055236
07:30:51 Started container etcd
07:31:01 adding new peer https://10.0.202.71:2380
07:31:17.980262 ip-10-0-202-71 Readiness probe "etcd-quorum-guard-78bf866559-kdk27_openshift-etcd(a35c45b6-95c9-407f-87b6-f840e6c54d63):guard" succeeded
07:31:31 Started container etcd
07:31:32 adding new peer https://10.0.132.235:2380
07:31:35.286824 ip-10-0-132-235 Readiness probe "etcd-quorum-guard-78bf866559-7ql8q_openshift-etcd(647fc209-23c6-480a-a961-b3b25806bc6b):guard" succeeded
07:32:06 Started container etcd
07:32:07 adding new peer https://10.0.136.92:2380
07:32:10.214308 ip-10-0-136-92 Readiness probe "etcd-quorum-guard-78bf866559-xjl9s_openshift-etcd(f0db0d7e-3b3e-4967-b425-6e096ed0411d):guard" succeeded
```

In general the above seems sane. 4.7.0-0.nightly-2020-11-27-154307 was strange in that quorum-guard appears to report health checks much later than in the other cases. It might make sense to have a scaling metric exposed by the operator, just so that this timing is easier to observe and understand. Lowering severity as we have no proof this issue is systemic. Please provide additional proof for review.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.