Bug 1890724
Summary: | openshift-apiserver and kube-apiserver flapping degraded on bare metal compact cluster during QE automation
---|---
Product: | OpenShift Container Platform
Component: | Etcd
Version: | 4.6
Reporter: | Mike Fiedler <mifiedle>
Assignee: | Sam Batschelet <sbatsche>
QA Contact: | ge liu <geliu>
Status: | CLOSED INSUFFICIENT_DATA
Severity: | medium
Priority: | low
CC: | aos-bugs, dmace, lxia, mfojtik, sbatsche, sttts
Flags: | mfojtik: needinfo?
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | LifecycleStale
Doc Type: | If docs needed, set a value
Story Points: | ---
Cloned To: | 1891740 (view as bug list)
Last Closed: | 2021-01-23 12:04:10 UTC
Type: | Bug
Bug Blocks: | 1891013
Description (Mike Fiedler, 2020-10-22 19:30:26 UTC)
About the must-gather from comment 2: aside from a lot of noise in the OperatorStatusChanged condition message (see attached PR to improve that), I only see one time where cluster-kube-apiserver-operator is degraded, namely at 14:40:28:

```
Status for clusteroperator/kube-apiserver changed: Degraded changed from False to True
NodeInstallerDegraded: 1 nodes are failing on revision 2:
NodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver" is not ready: unknown reason
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver" is terminated: Error: iled to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:05.337823 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:05.825955 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:09.406065 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:09.763972 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:16.443803 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: W1022 06:40:17.417228 16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
StaticPodsDegraded: Error: context deadline exceeded
StaticPodsDegraded: I1022 06:40:19.703813 1 main.go:198] Termination finished with exit code 1
StaticPodsDegraded: I1022 06:40:19.703851 1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"
StaticPodsDegraded:
StaticPodsDegraded: pod/kube-apiserver-control-plane-0 container "kube-apiserver-check-endpoints" is running for 18.097539475s but not ready: unknown reason
StaticPodsDegraded: pods "kube-apiserver-control-plane-1" not found
StaticPodsDegraded: pods "kube-apiserver-control-plane-2" not found
```

with the error: Error while dialing dial tcp [::1]:2379: connect: connection refused

At the same time in the etcd namespace I see:

```
Status for clusteroperator/etcd changed: Degraded changed from False to True
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcd" is not ready: PodInitializing:
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcd-metrics" is not ready: PodInitializing:
StaticPodsDegraded: pod/etcd-control-plane-0 container "etcdctl" is not ready: PodInitializing:
StaticPodsDegraded: pods "etcd-control-plane-1" not found
EtcdMembersDegraded: 1 of 2 members are available, NAME-PENDING-136.144.52.218 has not started
```

Shortly after, etcd-control-plane-0 reports etcd pod startup. A little while later its operator notices:

```
Status for clusteroperator/etcd changed: Available changed from False to True
StaticPodsAvailable: 1 nodes are active; 2 nodes are at revision 0; 1 nodes are at revision 2
EtcdMembersAvailable: 2 members are available
```

Passing to the etcd team for further investigation. The hiccup seems to be short, though long enough to be visible on the kube-apiserver side.

> Passing to the etcd team for further investigation. The hiccup seems to be short, though long enough to be visible on the kube-apiserver side.

The logs referenced in https://bugzilla.redhat.com/show_bug.cgi?id=1890724#c6 describe etcd scaling up from the bootstrap node onto master-0. "NAME-PENDING-136.144.52.218 has not started" means that etcd membership has been scaled up by the operator but the new member has not yet started. Reviewing the timing, it appears to take a minute between the time we scale up etcd and the time the container starts. This feels like a long time for this process to complete; I would expect it to be closer to a few seconds.

```
14:39:41 adding new peer https://136.144.52.218:2380 (scale up etcd)
14:40:43 kubelet reports container etcd has started
```

So far I'm struggling to identify any evidence of an etcd bug. Here are some things I did notice which could contribute to apiserver etcd connection instability:

* Multiple instances of nodes becoming unready and the etcd process on those nodes being terminated, causing leader elections and client disconnects (see timestamps around 2020-10-22T07:14, 2020-10-22T07:43, 2020-10-22T07:04, 2020-10-22T07:30)
* etcd performed 44 leader elections over the life of the cluster, each time causing client reconnects

To me this indicates there is node instability, and possibly performance issues, impacting etcd, which causes downstream effects on, for instance, the apiserver. To really root cause these issues, I think we need a reproducer.

* Is there a reproducer?
If so, what are the exact steps to reproduce?
* Has this been observed in CI? If so, where?
* What are the specific machine characteristics hosting this cluster? It's possible they're insufficient for whatever workload is being tested. This information should be included in the reproducer instructions.

(In reply to Dan Mace from comment #9)
> So far I'm struggling to identify any evidence of an etcd bug. Here are some things I did notice which could contribute to apiserver etcd connection instability:
>
> * Multiple instances of nodes becoming unready and the etcd process on those nodes being terminated, causing leader elections and client disconnects (see timestamps around 2020-10-22T07:14, 2020-10-22T07:43, 2020-10-22T07:04, 2020-10-22T07:30)
> * etcd performed 44 leader elections over the life of the cluster, each time causing client reconnects
>
> To me this indicates there's node instability and possibly performance issues impacting etcd which cause downstream effects on, for instance, the apiserver.
>
> To really root cause these issues, I think we need a reproducer.
>
> * Is there a reproducer? If so, what are the exact steps to reproduce?
> * Has this been observed in CI? If so, where?
> * What are the specific machine characteristics hosting this cluster? It's possible they're insufficient for whatever workload is being tested. This information should be included in the reproducer instructions.

And to add a little more detail here, it's apparent from the machine config logs and node journals that the nodes underwent multiple reboots. It appears that a new machineconfig was applied that resulted in reboots, but this is after the instability observed by Stefan in https://bugzilla.redhat.com/show_bug.cgi?id=1890724#c6.

> 14:39:41 adding new peer https://136.144.52.218:2380 (scale up etcd)
> 14:40:43 kubelet reports container etcd has started.
> 15:03:01 Node control-plane-1 now has machineconfiguration.openshift.io/desiredConfig=rendered-master-47cb7e1f069b0f762585bf8ea7083967
> 15:04:13 Node will reboot into config rendered-master-47cb7e1f069b0f762585bf8ea7083967

A review of CI should be done to exclude this as a normal pattern. Below is a random review of three 4.7 nightlies to understand whether scaling is having issues. The test took events from the operator regarding the MemberAdd scale-up step. MemberAdd means that the operator has informed the etcd cluster API of a new member. For scaling to succeed, etcd will need to start with the proper runtime config. We see kubelet reporting when the etcd container starts, but when that container starts, logic exists which waits for the member to be observed in the etcd member list. Once that condition is met, etcd has all of the runtime environment variables required and can start. The problem is that the container can start before the operator scales up membership, and if there is trouble (for example, the operator loses its leader election), a fair amount of time can pass between when the container starts and when the operator performs the scale-up. Each time etcd fails to find itself in the membership, the container will restart, and these continued restarts can lead to a crashloop backoff. The backoff can lead to extended time required for scaling to complete. This is the concern that I wanted to validate.
```
# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1333462642060169216
# 4.7.0-0.nightly-2020-11-30-172451
17:55:34 Started container etcd
17:56:03 adding new peer https://10.0.132.231:2380
17:56:46.584125 ip-10-0-132-231 Readiness probe "etcd-quorum-guard-78d96cd8f5-xh7n7_openshift-etcd(fcc8e83a-3d30-4a0a-8f1a-67768c95d4a7):guard" succeeded
17:56:49 Started container etcd
17:56:50 adding new peer https://10.0.234.154:2380
17:56:51.819233 ip-10-0-234-154 Readiness probe "etcd-quorum-guard-78d96cd8f5-br998_openshift-etcd(25504b4b-4837-428c-a228-599d07b64628):guard" succeeded
17:57:22 Started container etcd
17:57:23 adding new peer https://10.0.189.100:2380
17:57:25.600388 ip-10-0-189-100 Readiness probe "etcd-quorum-guard-78d96cd8f5-96vjr_openshift-etcd(8145555a-c299-450f-aefd-ae9f6d3a2a39):guard" succeeded

# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1332416259496611840
# 4.7.0-0.nightly-2020-11-27-154307
20:25:36 Started container etcd
20:25:51 adding new peer https://10.0.134.148:2380
20:27:26.337246 ip-10-0-134-148 Readiness probe "etcd-quorum-guard-78d96cd8f5-4825z_openshift-etcd(b1433fc2-3c35-4a89-9664-de5a49c23c51):guard" succeeded
20:26:53 Started container etcd
20:26:54 adding new peer https://10.0.182.143:2380
20:27:25.042238 ip-10-0-182-143 Readiness probe "etcd-quorum-guard-78d96cd8f5-4fmsm_openshift-etcd(ddc3d2a3-4409-4f74-8105-570e186fc205):guard" succeeded
20:27:24 Started container etcd
20:27:25 adding new peer https://10.0.227.171:2380
20:27:29.111541 ip-10-0-227-171 Readiness probe "etcd-quorum-guard-78d96cd8f5-85h7r_openshift-etcd(027a11a8-8c4c-4854-9713-9fdd4b75841c):guard" succeeded

# https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1331496637603778560
# 4.7.0-0.nightly-2020-11-25-055236
07:30:51 Started container etcd
07:31:01 adding new peer https://10.0.202.71:2380
07:31:17.980262 ip-10-0-202-71 Readiness probe "etcd-quorum-guard-78bf866559-kdk27_openshift-etcd(a35c45b6-95c9-407f-87b6-f840e6c54d63):guard" succeeded
07:31:31 Started container etcd
07:31:32 adding new peer https://10.0.132.235:2380
07:31:35.286824 ip-10-0-132-235 Readiness probe "etcd-quorum-guard-78bf866559-7ql8q_openshift-etcd(647fc209-23c6-480a-a961-b3b25806bc6b):guard" succeeded
07:32:06 Started container etcd
07:32:07 adding new peer https://10.0.136.92:2380
07:32:10.214308 ip-10-0-136-92 Readiness probe "etcd-quorum-guard-78bf866559-xjl9s_openshift-etcd(f0db0d7e-3b3e-4967-b425-6e096ed0411d):guard" succeeded
```

In general the above seems sane. 4.7.0-0.nightly-2020-11-27-154307 was strange in that quorum-guard appears to report health checks much later than in the other cases. It might make sense to have a scaling metric exposed by the operator, just so that this timing is easier to observe and understand. Lowering severity as we have no proof this issue is systemic. Please provide additional proof for review.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.