Bug 1750890
| Summary: | failed to acquire a resource (boskos.ci, "no such host" or EOF) | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joseph Callen <jcallen> |
| Component: | Test Infrastructure | Assignee: | Steve Kuznetsov <skuznets> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.2.z | CC: | aos-bugs, bparees, calfonso, jokerman, rgudimet, wking |
| Target Milestone: | --- | ||
| Target Release: | 4.3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | buildcop | ||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:05:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Joseph Callen
2019-09-10 16:43:24 UTC
This issue is generally thought to happen when the api.ci cluster is overloaded and the leasing server gets moved between nodes, which takes time because it runs with a PV. Steps I will take to mitigate this:

- run the server with a high priority class
- run the server as a deployment without persistent volumes for rapid redeployment
- run the server horizontally scaled, if possible, for HA
- increase the maximum node pool size for api.ci to reduce contention

(A rough sketch of the priority-class and PV-free deployment changes follows at the end of this thread.)

Can we periodically evict some of the existing pods on the cluster to make room for the leasing server? I am curious whether the leasing server pods get preempted by the scheduler or evicted by some other controller.

> I am curious whether the leasing server pods get preempted by the scheduler or evicted by some other controller.

They shouldn't get preempted since [1]. My impression is that in most cases this is "the node didn't check in because it was too swamped, and the scheduler decided it was unavailable".

[1]: https://github.com/openshift/release/pull/4972

If a node is getting swamped, increasing the priority might only help ensure that the leasing server pod gets scheduled quickly; nothing stops the node from getting swamped or the leasing server pod from getting evicted. Am I right in assuming that resource consumption on nodes should not reach a level where they get swamped and can no longer communicate with the apiserver properly? I am not sure about the type of pods running on the CI cluster (I assume they are build or run-once pods, which are not tolerant to eviction); if the pods are tolerant to eviction, perhaps we can run https://github.com/kubernetes-incubator/descheduler to keep node utilization within a target range (for example, run the descheduler once every 2 hours or so to keep node utilization between 50% and 70%).

Another failure noticed today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6939

We've done everything but running the server in an HA manner, which is technically challenging. This has mitigated the issue, and I would consider this closed unless we see otherwise.

Sometimes we get EOFs reported by clients that connect to Boskos when the Boskos pod dies before completing its response. These seem to be point-in-time events impacting a few CI jobs, occurring every ~5 days [1]. For example [2]:

    heartbeat sent for resource "96d3f5a8-c408-4762-91de-56d9add422a4"
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: EOF
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host

So that was an EOF from a request interrupted as the old Boskos pod went down, followed by some "no route to host" errors as the lease container attempted to connect to a Service that no longer had a backing Pod (see the heartbeat sketch below).

[1]: https://search.svc.ci.openshift.org/?search=boskos.ci.*EOF
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2708/pull-ci-openshift-installer-master-e2e-aws/8664
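For context on how those errors surface to jobs, here is a minimal sketch of a heartbeat loop in the spirit of the lease container, assuming a plain HTTP client against the http://boskos.ci/update endpoint seen in the logs above. The function names, interval, and failure threshold are illustrative assumptions, not the actual ci-operator or Boskos client code; the point is that a brief Boskos redeployment shows up as one EOF plus a few connection errors, which a tolerant loop can ride out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// sendHeartbeat POSTs a single lease heartbeat to the Boskos /update
// endpoint, mirroring the request shape visible in the error messages
// above. The endpoint and query parameters come from the logs; the
// helper itself is illustrative.
func sendHeartbeat(base, name, owner, state string) error {
	q := url.Values{}
	q.Set("name", name)
	q.Set("owner", owner)
	q.Set("state", state)

	resp, err := http.Post(base+"/update?"+q.Encode(), "", nil)
	if err != nil {
		// A pod dying mid-response surfaces here as EOF; a Service with
		// no backing Pod surfaces as "no route to host" or similar.
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("heartbeat returned status %d", resp.StatusCode)
	}
	return nil
}

// heartbeatLoop sends heartbeats on a fixed interval and only gives up
// after several consecutive failures, so a short Boskos outage does not
// immediately cost the job its lease.
func heartbeatLoop(base, name, owner string, interval time.Duration, maxFailures int) error {
	failures := 0
	for range time.Tick(interval) {
		if err := sendHeartbeat(base, name, owner, "leased"); err != nil {
			failures++
			fmt.Printf("failed to send heartbeat for resource %q: %v\n", name, err)
			if failures >= maxFailures {
				return fmt.Errorf("lost lease %q after %d consecutive failures", name, failures)
			}
			continue
		}
		failures = 0
		fmt.Printf("heartbeat sent for resource %q\n", name)
	}
	return nil
}

func main() {
	// Values taken from the example log lines above.
	err := heartbeatLoop("http://boskos.ci", "96d3f5a8-c408-4762-91de-56d9add422a4",
		"ci-op-rk1j1xqq-1d3f3", 30*time.Second, 10)
	fmt.Println(err)
}
```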
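The mitigation list at the top of this thread mentions a high priority class and a PV-free deployment. As a rough illustration only (these are not the manifests from openshift/release; the names, namespace, image, and priority value are made up), the sketch below constructs the two objects with the upstream Kubernetes Go types and prints them as JSON:

```go
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A high, non-default priority class so the leasing server wins
	// scheduling contention on a busy cluster. Name and value are
	// illustrative, not the ones used on api.ci.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:    metav1.ObjectMeta{Name: "boskos-high-priority"},
		Value:         1000000,
		GlobalDefault: false,
		Description:   "Keeps the leasing server schedulable when the cluster is overloaded.",
	}

	// A Deployment with no persistent volumes, so a replacement pod can
	// come up quickly instead of waiting on volume attachment.
	replicas := int32(1)
	labels := map[string]string{"app": "boskos"}
	deploy := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "boskos", Namespace: "ci"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					PriorityClassName: pc.Name,
					Containers: []corev1.Container{{
						Name:  "boskos",
						Image: "example.com/boskos:latest", // placeholder image
					}},
				},
			},
		},
	}

	for _, obj := range []interface{}{pc, deploy} {
		out, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(out))
	}
}
```

Dropping the persistent volume is what makes the "rapid redeployment" point work: a replacement pod no longer has to wait for a volume to detach from the old node and reattach to the new one.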
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062