Bug 1694169

Summary: etcd certs container is exit-looping which causes load on the cluster
Product: OpenShift Container Platform
Component: Etcd
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Type: Bug
Reporter: Clayton Coleman <ccoleman>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: gblomqui, wking
Last Closed: 2019-04-10 05:14:50 UTC

Attachments:
Occurrences of this error in CI from 2019-04-02T21:37 to 2019-04-03T21:26 UTC

Description Clayton Coleman 2019-03-29 17:05:33 UTC
fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod kube-system/etcd-member-ip-10-0-141-160.ec2.internal is not healthy: container certs has restarted more than 5 times",
    ]
to be empty


Causes e2e failures in roughly 1 of every 4-5 runs.  Generally we don't allow infra components to "exit loop"; they should stay running and exit only in exceptional cases.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297

Must fix for 4.1.
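
For reference, the check behind the failure message above presumably walks the pods in the core namespaces and flags any container whose restart count crosses a threshold. The helper below is only a minimal sketch of that kind of check, with a hypothetical name and signature; the actual implementation lives in test/extended/operators/cluster.go and may differ.

  package sketch

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
  )

  // restartingContainers is a hypothetical helper that mirrors the shape of
  // the check behind the failure above: walk the container statuses of each
  // pod and report any container that has restarted more than maxRestarts
  // times. The real test in cluster.go may differ in detail.
  func restartingContainers(pods []corev1.Pod, maxRestarts int32) []string {
      var failures []string
      for _, pod := range pods {
          for _, cs := range pod.Status.ContainerStatuses {
              if cs.RestartCount > maxRestarts {
                  failures = append(failures, fmt.Sprintf(
                      "Pod %s/%s is not healthy: container %s has restarted more than %d times",
                      pod.Namespace, pod.Name, cs.Name, maxRestarts))
              }
          }
      }
      return failures
  }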

Comment 1 Scott Dodson 2019-04-01 19:27:43 UTC
This has happened 6 times over the past 24 hours.

Comment 2 W. Trevor King 2019-04-03 06:11:01 UTC
I'm not clear whether this is related, since all of the kube-system_etcd-member-ip-*.ec2.internal_certs.log.gz files in [1] are empty, but Jim Rigsbee, running the 0.16.1 installer today, had one of his control-plane nodes fail to join the cluster after its certs pod logged:

  Error executing kube-client-agent: error requesting certificate: error sending CSR to signer: Post https://api.rht-16-1.dev.nextcle.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 10.0.134.197:6443: connect: connection refused

It's not clear to me exactly when that error was logged.  The bootkube.service logs had:

  Starting etcd certificate signer...
  Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: f024f639ae9c712e8f5ad9076a429989af6c358d80f02628932eeb0cd7f5b4e3
  Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: Waiting for etcd cluster...
  Apr 02 22:16:32 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 1.350605ms
  Apr 02 22:17:01 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.048892ms
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-2.rht-16-1.dev.nextcle.com:2379 is unhealthy: failed to connect: dial tcp 10.0.142.5:2379: connect: connection refused
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: Error: unhealthy cluster
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: etcdctl failed. Retrying in 5 seconds...
  Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.443168ms
  Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.918426ms

Restarting the kubelet on the broken control-plane node seems to have unstuck things, and the cluster subsequently came up without further trouble.  Our current guess is that the wedged kubelet and/or the certs container led to the single CSR request, which failed, after which the certs container was either not restarted or for some other reason did not retry.  The error message seems to come from [2] (somehow), and the certs container is defined in [3]; a rough sketch of that single-shot request follows the links below.

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297/artifacts/e2e-aws/pods/
[2]: https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121
[3]: https://github.com/openshift/machine-config-operator/blob/4826cc4a96ff97fddcd59101af5b6621b6d5036d/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L27-L69
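
To make the suspected failure mode concrete, here is a minimal, illustrative sketch of a single-shot CSR submission with no retry, assuming a plain HTTP client and a hypothetical submitCSR helper; the real kube-client-agent ([2]) builds a CertificateSigningRequest object, handles TLS, and may structure the request differently.

  package sketch

  import (
      "bytes"
      "fmt"
      "net/http"
  )

  // submitCSR is a hypothetical stand-in for the agent's single CSR
  // submission: one POST to the API server's certificatesigningrequests
  // endpoint, no retry. If the API server is unreachable (for example
  // "connect: connection refused" while the control plane is still coming
  // up), the error is returned, the process exits non-zero, and the kubelet
  // restarts the certs container, which is the crash loop described above.
  // The body parameter is an opaque placeholder payload.
  func submitCSR(client *http.Client, apiURL string, body []byte) error {
      resp, err := client.Post(
          apiURL+"/apis/certificates.k8s.io/v1beta1/certificatesigningrequests",
          "application/json",
          bytes.NewReader(body),
      )
      if err != nil {
          return fmt.Errorf("error sending CSR to signer: %v", err)
      }
      defer resp.Body.Close()
      if resp.StatusCode >= 300 {
          return fmt.Errorf("unexpected status from signer: %s", resp.Status)
      }
      return nil
  }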

Comment 3 Sam Batschelet 2019-04-03 11:12:47 UTC
I agree with this, and the certs container now logs more details [1]. What I see in the failure cases is that the client-agent begins to fail. This failure appears to coincide with bootkube.service also failing and restarting. The result is that the kube-etcd-cert-signer container is terminated, and until it comes back up we have a crash-looping client-agent.

The proposal is that instead of exiting 1 [2] we will use a retry mechanism that retries forever; a rough sketch follows the links below.

[1] https://github.com/openshift/machine-config-operator/commit/399053149d17b943260da5fd9c65260c69ff1b65
[2] https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121
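
A minimal sketch of what that retry-forever behavior could look like, assuming a hypothetical requestCertificate stand-in for the agent's CSR submission; the actual kubecsr change may use different backoff handling:

  package sketch

  import (
      "log"
      "time"
  )

  // requestCertificateForever sketches the proposed behavior: keep retrying
  // the certificate request with a capped backoff instead of exiting 1 on
  // the first failure. requestCertificate is a hypothetical stand-in for the
  // agent's CSR submission.
  func requestCertificateForever(requestCertificate func() error) {
      backoff := 5 * time.Second
      const maxBackoff = time.Minute
      for {
          err := requestCertificate()
          if err == nil {
              return
          }
          log.Printf("error requesting certificate, retrying in %s: %v", backoff, err)
          time.Sleep(backoff)
          if backoff *= 2; backoff > maxBackoff {
              backoff = maxBackoff
          }
      }
  }

Retrying with a capped backoff keeps the container running while the signer or API server is still coming up, instead of handing the restart loop to the kubelet.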

Comment 5 W. Trevor King 2019-04-04 04:27:16 UTC
Created attachment 1551633 [details]
Occurrences of this error in CI from 2019-04-02T21:37 to 2019-04-03T21:26 UTC

This occurred in 19 of our 334 failures (5%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours.  Generated with [1]:

  $ deck-build-log-plot 'container certs has restarted more than 5 times'

Seems to be more frequent since ~14:00Z for some reason.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 W. Trevor King 2019-04-05 06:47:40 UTC
kubecsr#25 landed 2019-04-04T22:53Z [1], but here's a failing run that started 2019-04-05T03:57Z [2]:

Flaky tests:

[Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]
[k8s.io] Probing container should be restarted with a /healthz http liveness probe [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failing tests:

[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]

Double-checking that the fix had landed:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1534/pull-ci-openshift-installer-master-e2e-aws/4964/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "kube-client-agent").annotations'
{
  "io.openshift.build.commit.id": "9d3e068b3fc7de3df1cb8000492e43efc8200d4b",
  "io.openshift.build.commit.ref": "openshift-4.0",
  "io.openshift.build.source-location": "https://github.com/openshift/kubecsr"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -2 origin/master | cat
2019-04-04 15:53:03 -0700 531aaeb  (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry
2019-03-25 14:48:08 -0700 f4aef20  Merge pull request #24 from hexfusion/metrics
$ git log --graph --topo-order --oneline --decorate --first-parent -11 origin/master 9d3e068b3fc7
* 531aaeb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry
* f4aef20 Merge pull request #24 from hexfusion/metrics
* ac34e6b Merge pull request #22 from wking/client-usage-for-server-cert
* 52278c2 Merge pull request #21 from csrwng/kubeconfig_client
* 36c62cc Merge pull request #18 from abhinavdahiya/kubeconfig
| * 9d3e068 (origin/openshift-4.1, origin/openshift-4.0) Merge pull request #6 from openshift/cp_metrics_signer
| * ae8f7b5 Merge pull request #5 from openshift/rhel_images
| * 7c3e736 Merge pull request #4 from openshift/cp
| * 1b39e01 Merge pull request #1 from openshift/add_docker_4.0
| * 6c3776b Merge pull request #2 from openshift/add_owners
|/  
* 678cc8e Merge pull request #17 from yifan-gu/allow_empty_ipaddrs

So I'm not sure how branching is intended to work in that repository, but it looks like some cherry-picks might be in order.

[1]: https://github.com/coreos/kubecsr/pull/25#event-2254399904
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_origin-aggregated-logging/1586/pull-ci-openshift-origin-aggregated-logging-master-e2e-aws/385/build-log.txt

Comment 7 W. Trevor King 2019-04-05 06:50:36 UTC
Oh, here's the cherry-pick [1], sorry for the noise.

Comment 8 W. Trevor King 2019-04-05 23:19:54 UTC
kubecsr#8 landed.

Comment 10 W. Trevor King 2019-04-08 19:23:01 UTC
None of these in the past 24 hours, so I'm pretty confident that kubecsr#8 fixed us.