Description of problem:

[INFO] Acquiring a lease ...
failed to acquire a resource: Post http://boskos.ci/acquire?dest=leased&owner=ci-op-msik75td-0350a&request_id=1985695446706558308&state=free&type=aws-quota-slice: dial tcp: lookup boskos.ci on 10.142.15.247:53: no such host

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This issue is generally thought to happen when the api.ci cluster is overloaded and the leasing server gets moved between nodes, which takes time because it runs with a PV. Steps I will take to mitigate this:

- run the server with a high priority class
- run the server as a deployment without persistent volumes, for rapid redeployment
- run the server horizontally scaled, if possible, for HA
- increase the maximum size of the node pool for api.ci to reduce contention
Can we evict some of the existing pods on the cluster periodically to make room for the leasing server? I am curious whether the leasing server pod gets preempted by the scheduler or evicted by some other controller.
> I am curious whether the leasing server pod gets preempted by the scheduler or evicted by some other controller?

Shouldn't get preempted since [1]. My impression is that in most cases this is "the node didn't check in because it was too swamped, and the scheduler decided it was unavailable".

[1]: https://github.com/openshift/release/pull/4972
If the node is getting swamped, increasing the priority might only ensure that the leasing server pod gets scheduled quickly; nothing stops the node from getting swamped or the leasing server pod from being evicted. Am I right in assuming that resource consumption on nodes should not reach a level where they get swamped and cannot communicate with the apiserver properly? I am not sure about the type of pods running on the CI cluster (I assume they are build or run-once pods, which are not tolerant of eviction). If the pods are tolerant of eviction, perhaps we can run https://github.com/kubernetes-incubator/descheduler to keep node utilization within a range we want (for example, run the descheduler once every 2 hours or so to keep node utilization in the 50-70% range).
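For illustration, the 50-70% band above amounts to the kind of threshold check the descheduler's low-node-utilization strategy performs: nodes below the low threshold are candidates to receive pods, nodes above the high threshold are candidates to have evictable pods removed. This sketch only classifies a node; the function name and millicore inputs are assumptions, not descheduler API.

```go
package main

import "fmt"

// utilizationBand classifies a node's CPU utilization against a
// low/high threshold pair like the hypothetical 50-70% band from the
// comment above. Inputs are total requested and allocatable millicores.
func utilizationBand(requestedMilli, allocatableMilli int64, low, high float64) string {
	pct := 100 * float64(requestedMilli) / float64(allocatableMilli)
	switch {
	case pct < low:
		return "underutilized" // candidate target for rescheduled pods
	case pct > high:
		return "overutilized" // candidate source: evict tolerant pods
	default:
		return "balanced" // leave alone
	}
}

func main() {
	fmt.Println(utilizationBand(3000, 4000, 50, 70)) // 75% of allocatable
	fmt.Println(utilizationBand(2400, 4000, 50, 70)) // 60% of allocatable
}
```

The caveat in the comment stands: this only helps if the pods on the node tolerate eviction, which build/run-once CI pods generally do not.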
Another failure noticed today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6939
We've done everything except running the server in an HA manner, which is technically challenging. This has mitigated the issue, and I would consider this closed unless we see otherwise.
Sometimes we get EOFs reported by clients that connect to Boskos, where the Boskos pod subsequently dies before completing its response. This seems to happen as point-in-time events impacting a few CI jobs, with those events occurring every ~5 days [1]. For example [2]:

heartbeat sent for resource "96d3f5a8-c408-4762-91de-56d9add422a4"
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: EOF
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host

So that was an EOF from a request interrupted as the old Boskos pod went down, followed by some "no route to host" errors as the lease container attempted to connect to a Service that no longer had a backing Pod.

[1]: https://search.svc.ci.openshift.org/?search=boskos.ci.*EOF
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2708/pull-ci-openshift-installer-master-e2e-aws/8664
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062