Bug 1694169 - etcd certs container is exit-looping which causes load on the cluster
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-29 17:05 UTC by Clayton Coleman
Modified: 2019-05-06 13:00 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-10 05:14:50 UTC
Target Upstream Version:


Attachments
Occurrences of this error in CI from 2019-04-02T21:37 to 2019-04-03T21:26 UTC (228.62 KB, image/svg+xml)
2019-04-04 04:27 UTC, W. Trevor King


Links
System ID Priority Status Summary Last Updated
Github coreos kubecsr pull 25 None closed *: add --max-retry flag to kube-client-agent 2019-12-05 12:53:23 UTC
Github openshift kubecsr pull 7 None closed cherry-pick: *: add --max-retry flag to kube-client-agent 2019-12-05 12:53:23 UTC
Github openshift kubecsr pull 8 None closed cherry-pick: pkg/certagent: wait forever for CSR request to reply 2019-12-05 12:53:23 UTC

Description Clayton Coleman 2019-03-29 17:05:33 UTC
fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod kube-system/etcd-member-ip-10-0-141-160.ec2.internal is not healthy: container certs has restarted more than 5 times",
    ]
to be empty


Causes e2e failures in roughly 1 in 4 to 1 in 5 runs.  Generally we don't allow infra components to exit-loop; they should stay running and exit only in exceptional cases.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297

Must fix for 4.1.

Comment 1 Scott Dodson 2019-04-01 19:27:43 UTC
This has happened 6 times over the past 24hr

Comment 2 W. Trevor King 2019-04-03 06:11:01 UTC
I'm not sure whether this is related, since all of the kube-system_etcd-member-ip-*.ec2.internal_certs.log.gz files are empty in [1], but Jim Rigsbee, running the 0.16.1 installer today, had one of his control-plane nodes fail to join the cluster after its certs pod logged:

  Error executing kube-client-agent: error requesting certificate: error sending CSR to signer: Post https://api.rht-16-1.dev.nextcle.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 10.0.134.197:6443: connect: connection refused

It's not clear to me exactly when that error was logged.  The bootkube.service logs had:

  Starting etcd certificate signer...
  Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: f024f639ae9c712e8f5ad9076a429989af6c358d80f02628932eeb0cd7f5b4e3
  Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: Waiting for etcd cluster...
  Apr 02 22:16:32 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 1.350605ms
  Apr 02 22:17:01 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.048892ms
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-2.rht-16-1.dev.nextcle.com:2379 is unhealthy: failed to connect: dial tcp 10.0.142.5:2379: connect: connection refused
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: Error: unhealthy cluster
  Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: etcdctl failed. Retrying in 5 seconds...
  Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.443168ms
  Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.918426ms

Restarting the kubelet on the broken control-plane node seems to have unstuck things, and the cluster subsequently came up without further trouble.  Our current guess is that the certs container and/or wedged kubelet led to the single CSR request, which failed, after which the certs container was either not restarted or for some other reason did not retry.  The error message seems to be from [2] (somehow), and the certs container is defined in [3].

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297/artifacts/e2e-aws/pods/
[2]: https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121
[3]: https://github.com/openshift/machine-config-operator/blob/4826cc4a96ff97fddcd59101af5b6621b6d5036d/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L27-L69

Comment 3 Sam Batschelet 2019-04-03 11:12:47 UTC
I agree with this, and the certs container now logs more details [1]. What I see in the failure cases is that the client-agent begins to fail. This failure appears to coincide with bootkube.service also failing and restarting. As a result, the kube-etcd-cert-signer container is terminated, and until it comes back up we have a crash-looping client-agent.

The proposal is that, instead of exiting 1 [2], we will use a retry mechanism and retry forever.

[1] https://github.com/openshift/machine-config-operator/commit/399053149d17b943260da5fd9c65260c69ff1b65
[2] https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121
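The retry-forever idea can be sketched as follows (a minimal shell illustration, not the actual Go implementation in kubecsr; `request_csr` is a hypothetical stand-in for the agent's CSR submission, rigged here to fail until the signer "comes back" on the fourth try):

```shell
attempt=0
request_csr() {               # hypothetical stand-in for the CSR submission;
  attempt=$((attempt + 1))    # fails until the signer is reachable on try 4
  [ "$attempt" -ge 4 ]
}
# Instead of exiting 1 on the first failure, keep retrying with a fixed delay
# so a transient signer outage no longer crash-loops the certs container:
until request_csr; do
  echo "attempt $attempt failed: signer unreachable; retrying in 5s" >&2
  sleep 0.01   # 5s in the real agent; shortened for the demo
done
echo "CSR signed after $attempt attempts"
```

The key point is that the loop never gives up: the kubelet no longer counts restarts because the container never exits, and the agent simply picks up as soon as the signer is reachable again.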

Comment 5 W. Trevor King 2019-04-04 04:27:16 UTC
Created attachment 1551633
Occurrences of this error in CI from 2019-04-02T21:37 to 2019-04-03T21:26 UTC

This occurred in 19 of our 334 failures (5%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours.  Generated with [1]:

  $ deck-build-log-plot 'container certs has restarted more than 5 times'

Seems to be more frequent since ~14:00Z for some reason.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 W. Trevor King 2019-04-05 06:47:40 UTC
kubecsr#25 landed 2019-04-04T22:53Z [1], but here's a failing run that started 2019-04-05T03:57Z [2]:

Flaky tests:

[Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]
[k8s.io] Probing container should be restarted with a /healthz http liveness probe [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failing tests:

[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]

Double checking the fix had landed:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1534/pull-ci-openshift-installer-master-e2e-aws/4964/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "kube-client-agent").annotations'
{
  "io.openshift.build.commit.id": "9d3e068b3fc7de3df1cb8000492e43efc8200d4b",
  "io.openshift.build.commit.ref": "openshift-4.0",
  "io.openshift.build.source-location": "https://github.com/openshift/kubecsr"
}
$ git log --first-parent --format='%ad %h %d %s' --date=iso -2 origin/master | cat
2019-04-04 15:53:03 -0700 531aaeb  (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry
2019-03-25 14:48:08 -0700 f4aef20  Merge pull request #24 from hexfusion/metrics
$ git log --graph --topo-order --oneline --decorate --first-parent -11 origin/master 9d3e068b3fc7
* 531aaeb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry
* f4aef20 Merge pull request #24 from hexfusion/metrics
* ac34e6b Merge pull request #22 from wking/client-usage-for-server-cert
* 52278c2 Merge pull request #21 from csrwng/kubeconfig_client
* 36c62cc Merge pull request #18 from abhinavdahiya/kubeconfig
| * 9d3e068 (origin/openshift-4.1, origin/openshift-4.0) Merge pull request #6 from openshift/cp_metrics_signer
| * ae8f7b5 Merge pull request #5 from openshift/rhel_images
| * 7c3e736 Merge pull request #4 from openshift/cp
| * 1b39e01 Merge pull request #1 from openshift/add_docker_4.0
| * 6c3776b Merge pull request #2 from openshift/add_owners
|/  
* 678cc8e Merge pull request #17 from yifan-gu/allow_empty_ipaddrs

So I'm not sure how branching is intended to work in that repository, but it looks like some cherry-picks might be in order.

[1]: https://github.com/coreos/kubecsr/pull/25#event-2254399904
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_origin-aggregated-logging/1586/pull-ci-openshift-origin-aggregated-logging-master-e2e-aws/385/build-log.txt
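For reference, ancestry checks like the `git log --graph` inspection above can also be done directly with `git merge-base --is-ancestor`. This self-contained sketch builds a throwaway repo with the same shape (a fix merged on the default branch but absent from the release branch) rather than using the real kubecsr commits:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "test@example.com"
git config user.name "test"
main=$(git symbolic-ref --short HEAD)
git commit -q --allow-empty -m "base"
git checkout -q -b release
git commit -q --allow-empty -m "shipped commit"   # stands in for 9d3e068
shipped=$(git rev-parse HEAD)
git checkout -q "$main"
git commit -q --allow-empty -m "fix merge"        # stands in for 531aaeb (PR #25)
fix=$(git rev-parse HEAD)
# Exit 0 means the fix is an ancestor of what shipped; nonzero means a
# cherry-pick to the release branch is needed:
if git merge-base --is-ancestor "$fix" "$shipped"; then
  echo "fix included"
else
  echo "cherry-pick needed"
fi
```

In the real case, running the same check with the PR #25 merge commit and the commit from the release payload's image-references would print "cherry-pick needed", matching the conclusion above.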

Comment 7 W. Trevor King 2019-04-05 06:50:36 UTC
Oh, here's the cherry-pick [1], sorry for the noise.

Comment 8 W. Trevor King 2019-04-05 23:19:54 UTC
kubecsr#8 landed.

Comment 10 W. Trevor King 2019-04-08 19:23:01 UTC
None of these in the past 24 hours, so I'm pretty confident that kubecsr#8 fixed us.

