Description of problem:

[INFO] Acquiring a lease ...
failed to acquire a resource: Post http://boskos.ci/acquire?dest=leased&owner=ci-op-msik75td-0350a&request_id=1985695446706558308&state=free&type=aws-quota-slice: dial tcp: lookup boskos.ci on 10.142.15.247:53: no such host

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This issue is generally thought to happen when the api.ci cluster is overloaded and the leasing server gets moved between nodes, which takes time because it runs with a PV. Steps I will take to mitigate this:

- run the server with a high priority class
- run the server as a deployment without persistent volumes, for rapid redeployment
- run the server horizontally scaled, if possible, for HA
- increase the maximum size of the node pool for api.ci to reduce contention
Can we evict some of the existing pods on the cluster periodically to make room for the leasing server? I am curious whether the leasing server pod gets preempted by the scheduler or evicted by some other controller.
> I am curious whether the leasing server pod gets preempted by the scheduler or evicted by some other controller?

Shouldn't get preempted since [1]. My impression is that in most cases this is "the node didn't check in because it was too swamped, and the scheduler decided it was unavailable".

[1]: https://github.com/openshift/release/pull/4972
If the node is getting swamped, increasing the priority might only ensure that the leasing server pod gets scheduled quickly; nothing stops the node from getting swamped or the leasing server pod from being evicted. Am I right in assuming that resource consumption on nodes should not reach a level where they get swamped and cannot communicate with the apiserver properly? I am not sure about the type of pods running on the CI cluster (I assume they are build or run-once pods, which are not tolerant of eviction). If the pods are tolerant of eviction, perhaps we can run https://github.com/kubernetes-incubator/descheduler to keep node utilization within a range we want (for example, run the descheduler once every 2 hours or so to keep node utilization in the 50-70% range).
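For illustration, the 50-70% band above amounts to the kind of threshold check the descheduler's low-node-utilization strategy performs: nodes below the low threshold are candidates to receive pods, nodes above the high threshold are candidates to have evictable pods removed. This sketch only classifies a node; the function name and millicore inputs are assumptions, not descheduler API.

```go
package main

import "fmt"

// utilizationBand classifies a node's CPU utilization against a
// low/high threshold pair like the hypothetical 50-70% band from the
// comment above. Inputs are total requested and allocatable millicores.
func utilizationBand(requestedMilli, allocatableMilli int64, low, high float64) string {
	pct := 100 * float64(requestedMilli) / float64(allocatableMilli)
	switch {
	case pct < low:
		return "underutilized" // candidate target for rescheduled pods
	case pct > high:
		return "overutilized" // candidate source: evict tolerant pods
	default:
		return "balanced" // leave alone
	}
}

func main() {
	fmt.Println(utilizationBand(3000, 4000, 50, 70)) // 75% of allocatable
	fmt.Println(utilizationBand(2400, 4000, 50, 70)) // 60% of allocatable
}
```

The caveat in the comment stands: this only helps if the pods on the node tolerate eviction, which build/run-once CI pods generally do not.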
Another failure noticed today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6939
We've done everything except running the server in an HA manner, which is technically challenging. This has mitigated the issue, and I would consider this closed unless we see otherwise.
Sometimes we get EOFs reported by clients that connect to Boskos, where the Boskos pod subsequently dies before completing its response. This seems to happen as point-in-time events impacting a few CI jobs, with those events occurring every ~5 days [1]. For example [2]:

heartbeat sent for resource "96d3f5a8-c408-4762-91de-56d9add422a4"
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: EOF
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host
failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host

So that was an EOF from a request interrupted as the old Boskos pod went down, followed by some "no route to host" errors as the lease container attempted to connect to a Service that no longer had a backing Pod.

[1]: https://search.svc.ci.openshift.org/?search=boskos.ci.*EOF
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2708/pull-ci-openshift-installer-master-e2e-aws/8664
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062