Hide Forgot
fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected <[]string | len:1, cap:1>: [ "Pod kube-system/etcd-member-ip-10-0-141-160.ec2.internal is not healthy: container certs has restarted more than 5 times", ] to be empty Causes e2e failures 1/4-1/5 times. Generally we don't allow infra components to "exit loop" and they should stay running and only exit on exceptional cases. https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297 Must fix for 4.1.
This has happened 6 times over the past 24hr
I'm not clear if this is related or not, since all of the kube-system_etcd-member-ip-*.ec2.internal_certs.log.gz are empty in [1], but Jim Rigsbee running the 0.16.1 installer today had one of his control-plane nodes fail to join the cluster after its certs pod logged: Error executing kube-client-agent: error requesting certificate: error sending CSR to signer: Post https://api.rht-16-1.dev.nextcle.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 10.0.134.197:6443: connect: connection refused It's not clear to me exactly when that error was logged. The bootkube.service logs had: Starting etcd certificate signer... Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: f024f639ae9c712e8f5ad9076a429989af6c358d80f02628932eeb0cd7f5b4e3 Apr 02 22:11:38 ip-10-0-13-180 bootkube.sh[1398]: Waiting for etcd cluster... Apr 02 22:16:32 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 1.350605ms Apr 02 22:17:01 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.048892ms Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-2.rht-16-1.dev.nextcle.com:2379 is unhealthy: failed to connect: dial tcp 10.0.142.5:2379: connect: connection refused Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: Error: unhealthy cluster Apr 02 22:21:42 ip-10-0-13-180 bootkube.sh[1398]: etcdctl failed. Retrying in 5 seconds... Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-0.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.443168ms Apr 02 22:21:48 ip-10-0-13-180 bootkube.sh[1398]: https://etcd-1.rht-16-1.dev.nextcle.com:2379 is healthy: successfully committed proposal: took = 2.918426ms Restarting the kubelet on the broken control-plane node seems to have unstuck things, and the cluster subsequently came up without further trouble. Our current guess is that the the certs container and/or wedged kubelet led to the single CSR request, which failed, after which the certs container was either not restarted or for some other reason did not retry. The error message seems to be from [2] (somehow), and the certs container is defined in [3]. [1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6297/artifacts/e2e-aws/pods/ [2]: https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121 [3]: https://github.com/openshift/machine-config-operator/blob/4826cc4a96ff97fddcd59101af5b6621b6d5036d/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L27-L69
I agree with this and certs now log more details [1]. What I see in the failure cases is client-agent begins to fail. This failure appears to coincide with bootkube.service also failing and restarting. The result is kube-etcd-cert-signer container is terminated and until that comes back up we have a crashlooping client-agent. The proposal is instead of exiting 1 [2] we will use a retry mechanism to try forever. [1] https://github.com/openshift/machine-config-operator/commit/399053149d17b943260da5fd9c65260c69ff1b65 [2] https://github.com/openshift/kubecsr/blob/f4aef20ab314adeea7e0e2f74e59dcb63c960448/pkg/certagent/agent.go#L121
Created attachment 1551633 [details] Occurrences of this error in CI from 2019-04-02T21:37 to 2019-04-03T21:26 UTC This occurred in 19 of our 334 failures (5%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with [1]: $ deck-build-log-plot 'container certs has restarted more than 5 times' Seems to be more frequent since ~14:00Z for some reason. [1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
kubecsr#25 landed 2019-04-04T22:53Z [1], but here's a failing run that started 2019-04-05T03:57Z [2]: Flaky tests: [Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel] [k8s.io] Probing container should be restarted with a /healthz http liveness probe [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] Failing tests: [Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel] Double checking the fix had landed: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1534/pull-ci-openshift-installer-master-e2e-aws/4964/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "kube-client-agent").annotations' { "io.openshift.build.commit.id": "9d3e068b3fc7de3df1cb8000492e43efc8200d4b", "io.openshift.build.commit.ref": "openshift-4.0", "io.openshift.build.source-location": "https://github.com/openshift/kubecsr" } $ git log --first-parent --format='%ad %h %d %s' --date=iso -2 origin/master | cat 2019-04-04 15:53:03 -0700 531aaeb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry 2019-03-25 14:48:08 -0700 f4aef20 Merge pull request #24 from hexfusion/metrics $ git log --graph --topo-order --oneline --decorate --first-parent -11 origin/master 9d3e068b3fc7 * 531aaeb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #25 from hexfusion/add_retry * f4aef20 Merge pull request #24 from hexfusion/metrics * ac34e6b Merge pull request #22 from wking/client-usage-for-server-cert * 52278c2 Merge pull request #21 from csrwng/kubeconfig_client * 36c62cc Merge pull request #18 from abhinavdahiya/kubeconfig | * 9d3e068 (origin/openshift-4.1, origin/openshift-4.0) Merge pull request #6 from openshift/cp_metrics_signer | * ae8f7b5 Merge pull request #5 from openshift/rhel_images | * 7c3e736 Merge pull request #4 from openshift/cp | * 1b39e01 Merge pull request #1 from openshift/add_docker_4.0 | * 6c3776b Merge pull request #2 from openshift/add_owners |/ * 678cc8e Merge pull request #17 from yifan-gu/allow_empty_ipaddrs So I'm not sure how branching is intended to work in that repository, but it looks like some cherry-picks might be in order. [1]: https://github.com/coreos/kubecsr/pull/25#event-2254399904 [2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_origin-aggregated-logging/1586/pull-ci-openshift-origin-aggregated-logging-master-e2e-aws/385/build-log.txt
Oh, here's the cherry-pick [1], sorry for the noise.
kubecsr#8 landed.
None of these in the past 24 hours, so I'm pretty confident that kubecsr#8 fixed us.