Bug 1950114
Summary: Significant spike in "tls: internal errors" when contacting apiserver via exec (roughly same timeframe as rebase)
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Status: CLOSED WORKSFORME
Severity: urgent
Priority: urgent
Reporter: Elana Hashman <ehashman>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, aos-bugs, ccoleman, dgrisonn, erooth, kakkoyun, lcosic, mfojtik, pkrupa, schoudha, spasquie, surbania, wking, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 1949288
Bug Depends On: 1940972, 1949288, 1951213
Last Closed: 2021-04-26 07:57:37 UTC
Description (Elana Hashman, 2021-04-15 19:08:19 UTC)
There was a change to the CI around a month ago to require no firing alerts during upgrades: https://github.com/openshift/origin/pull/25904. As far as I can tell, this change seems to be the main reason why we are seeing this test fail more often. Before that, the test was already failing quite often, as alerts are a very powerful means to detect bugs and regressions. But now, from what I've seen with the recent Bugzillas, the test has been detecting resiliency and availability issues during upgrades, which is to be expected since some operators/operands may not be highly available or might not implement the HA conventions: https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability. I also had a look at the implementation of the test and nothing really struck me as having possibly been broken by the 1.21 rebase. Let me know if you think there might be something that I missed, otherwise I think we can safely close this bug.

Agree with Damien. My understanding is that the test fails due to something failing between the API and the kubelet. Looking at one specific CI failure, the pod started by the e2e suite which checks for alerts is scheduled on node ip-10-0-253-175.ec2.internal and runs from 09:50:57 to 09:52:01. The logs from this node [1] show that it tries to rotate certificates at 09:37:34 but it times out at 09:52:34.

Apr 16 09:37:28.230740 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:28.230712 1413 certificate_manager.go:282] Certificate rotation is enabled.
Apr 16 09:37:28.231028 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:28.230994 1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:37:29.592548 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:29.592471 1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 02:52:32.076574345 +0000 UTC
Apr 16 09:37:29.592548 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:29.592529 1413 certificate_manager.go:288] Waiting 17h15m2.484051627s for next certificate rotation
Apr 16 09:37:33.354568 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:33.354535 1413 certificate_manager.go:282] Certificate rotation is enabled.
Apr 16 09:37:34.355906 ip-10-0-253-175 hyperkube[1413]: I0416 09:37:34.355848 1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:52:34.519284 ip-10-0-253-175 hyperkube[1413]: E0416 09:52:34.519281 1413 certificate_manager.go:454] certificate request was not signed: timed out waiting for the condition
Apr 16 09:52:36.716896 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:36.716831 1413 certificate_manager.go:412] Rotating certificates
Apr 16 09:52:37.745923 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:37.745875 1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 05:51:42.525959018 +0000 UTC
Apr 16 09:52:37.745923 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:37.745917 1413 certificate_manager.go:288] Waiting 19h59m4.780049227s for next certificate rotation
Apr 16 09:52:38.747149 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:38.747084 1413 certificate_manager.go:556] Certificate expiration is 2021-04-17 09:18:25 +0000 UTC, rotation deadline is 2021-04-17 06:26:01.796651049 +0000 UTC
Apr 16 09:52:38.747149 ip-10-0-253-175 hyperkube[1413]: I0416 09:52:38.747133 1413 certificate_manager.go:288] Waiting 20h33m23.049523342s for next certificate rotation

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26054/pull-ci-openshift-origin-master-e2e-aws-disruptive/1382983481180557312
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26054/pull-ci-openshift-origin-master-e2e-aws-disruptive/1382983481180557312/artifacts/e2e-aws-disruptive/gather-extra/artifacts/nodes/ip-10-0-253-175.ec2.internal/journal

Created attachment 1772448 [details]
kubelet_server_expiration_renew_errors
Some kubelets report a non-zero value for their kubelet_server_expiration_renew_errors metric.
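For reference, one way to check that signal is to evaluate the metric with the upstream Prometheus Go client. The sketch below is illustrative only: the endpoint address and access method are assumptions, not what the e2e suite actually does.

package promcheck

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// CheckRenewErrors lists kubelets whose serving-certificate renewals have
// failed at least once. The address is a placeholder; in a real cluster you
// would reach Prometheus through an authenticated service or route.
func CheckRenewErrors(ctx context.Context, address string) error {
	client, err := api.NewClient(api.Config{Address: address})
	if err != nil {
		return err
	}
	result, warnings, err := v1.NewAPI(client).Query(ctx, `kubelet_server_expiration_renew_errors > 0`, time.Now())
	if err != nil {
		return err
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// An empty result means no kubelet is currently reporting renewal errors.
	fmt.Println(result)
	return nil
}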
One possible thing that comes to mind: right now, this test fetches alerts by (I think) execing to call kubectl. This seems a little convoluted and is more likely to pick up spurious failures. Is there another way we could pull the alert data that might be less failure-prone?

While there may be other mechanisms to make PromQL calls to list alerts, we want 'exec' to be reliable, right? We could probably cover "we don't want alerts firing" and "we want exec to be reliably available" separately, but I don't think we want to address only the former without having something covering the latter.

Yes, the issue with the kubelet being marked as Ready when it can't yet exec is being tracked separately in https://bugzilla.redhat.com/show_bug.cgi?id=1949288.

Ok. And there are unit tests in the kubelet PRs proposed for bug 1949288. But currently this "shouldn't report any alerts in firing state" test-case is your integration coverage, right? So "stop using 'exec' for the alerts-are-firing test-case, and use something more reliable" makes sense to me. But you'd want to couple it with a "make sure 'exec' is reliable" integration test (possibly under the bug 1949288 umbrella), right? Or are we confident enough in bug 1949288's new unit tests that we don't think we need exec-is-reliable integration coverage? Generally I don't think integration coverage should be a side effect of another test's implementation; it's better to be direct.

There's already integration coverage for exec working, see e.g. "[sig-cli] Kubectl client Simple pod should support exec using resource/name [Suite:openshift/conformance/parallel] [Suite:k8s]". The timing of this test seems to be exposing a race condition at initial kubelet boot which has gotten more frequent recently, due to what I suspect is a number of OpenShift-specific startup timing changes lining up. I've never seen an instance of this failure in upstream Kubernetes (yet!). In a production environment, I suspect the likelihood and impact of this happening would be low/rare. I think it's also coincidence that an exec exposes this issue, as the impact would not be limited to exec; anything that tries to connect to a kubelet's server would be affected (e.g. oc logs).

I agree that direct tests are better. Currently this test covers both (a) alerts not firing and (b) 'exec' works. Moving it to just cover (a) is great, but we want something on top of (b) too. Can we poll the kubelet APIs on each node from the test suite to get an uptime monitor, like we already do for the Kube and OpenShift APIs? Or something like that? Because I expect we will continue to see OpenShift-specific startup timing issues, and we want to notice when we do introduce them, even if they don't have much impact outside of node reboots. Node reboots are already painful enough due to rescheduling users' legacy workloads; the smoother our parts are, the better.

Exec MAY not be unreliable. Every exec call to OpenShift should pass except those that take >2m to complete, and this one does not. Alerts may not fire during upgrade; that was definite. I don't think this is a bug at all (please update the description though). Let me describe expected behavior:

1. exec during a normal serial or parallel conformance run has no disruption for exec calls <1m30s
2. apiserver rotation should be transparent to end users of exec
3. tests that fail on exec are only at fault if they take >1m30s to run AND don't retry their operation if the exec session is closed gracefully
4. it should be possible to see a graceful exec session close by the error code returned by exec closure
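To make items 3 and 4 concrete, a retry wrapper along the following lines would satisfy that contract. This is a minimal sketch; runQuery, isGracefulClose, and errGracefulClose are hypothetical stand-ins, not helpers that exist in openshift/origin.

package e2eretry

import (
	"context"
	"errors"
	"time"
)

// errGracefulClose stands in for whatever distinguishable error (or exit
// code) the exec wrapper surfaces when the session is closed gracefully,
// per item 4 above.
var errGracefulClose = errors.New("exec session closed gracefully")

func isGracefulClose(err error) bool {
	return errors.Is(err, errGracefulClose)
}

// queryWithRetry re-runs an exec-backed operation when the session is closed
// gracefully (e.g. during apiserver rotation), and gives up once the context
// deadline, kept well under the 1m30s guidance, is exhausted.
func queryWithRetry(ctx context.Context, runQuery func(context.Context) (string, error)) (string, error) {
	var lastErr error
	for {
		out, err := runQuery(ctx)
		if err == nil {
			return out, nil
		}
		if !isGracefulClose(err) {
			// Real failure: report it instead of retrying.
			return "", err
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return "", lastErr
		case <-time.After(5 * time.Second):
			// Graceful close: back off briefly, then retry.
		}
	}
}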
Is there anything that the monitoring team needs to do regarding this bug? As far as I understand, we don't want to increase the retry period of the "kubectl exec" command above 2 minutes in the end-to-end tests.

Per [1], I thought this bug was about the implementation of the "Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured" test case, and specifically the use of exec to evaluate PromQL [2]. Bug 1949288 is about actually fixing the unreliable exec behavior. Per comment 8 here, Elana doesn't like exec availability getting covered indirectly via exec-based PromQL, and that's fair. I'm not sure anyone's stepped up yet to own something that directly covers exec availability. It's possible that could go in under bug 1949288 as well, or via a third bug series. If for whatever reason the monitoring folks can think of a more canonical approach to executing PromQL within e2e test-cases, this bug would track that work. But with the hypothetical direct guard on exec availability in place, it's not obvious to me what the motivation would be for moving to another, exec-free PromQL approach. In which case we can close this NOTABUG or WORKSFORME or some such. Am I missing something?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1949288#c10
[2]: https://github.com/openshift/origin/blob/e945cb88da780e21c021b6c8b430454bcfb881cf/test/extended/util/prometheus/helpers.go#L155-L156

That's an accurate summary.

Thanks Trevor for the detailed summary! The only other option I can think of is to go through the public prometheus-k8s route. This is already implemented by the "leader changes are not excessive [Late]" e2e test [1][2]: instead of depending on "oc exec", that test depends on the availability of the route. Closing as WORKSFORME for now.

[1] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/test/extended/etcd/leader_changes.go#L20
[2] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/test/extended/prometheus/client/via_route.go#L27
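For completeness, a route-based query in the spirit of via_route.go could look roughly like the sketch below. The namespace/route names, token handling, and the omitted TLS trust setup are assumptions for illustration; the real helper referenced in [2] differs.

package promviaroute

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"

	routeclient "github.com/openshift/client-go/route/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// queryFiringAlerts evaluates a PromQL query through the public
// prometheus-k8s route instead of exec'ing into a pod.
func queryFiringAlerts(ctx context.Context, cfg *rest.Config, token string) (string, error) {
	rc, err := routeclient.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	// Look up the externally reachable host of the prometheus-k8s route.
	route, err := rc.RouteV1().Routes("openshift-monitoring").Get(ctx, "prometheus-k8s", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	q := url.Values{"query": []string{`ALERTS{alertstate="firing"}`}}
	u := fmt.Sprintf("https://%s/api/v1/query?%s", route.Spec.Host, q.Encode())

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}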