Bug 1750890
| Summary: | failed to acquire a resource (boskos.ci, "no such host" or EOF) | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joseph Callen <jcallen> |
| Component: | Test Infrastructure | Assignee: | Steve Kuznetsov <skuznets> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.2.z | CC: | aos-bugs, bparees, calfonso, jokerman, rgudimet, wking |
| Target Milestone: | --- | ||
| Target Release: | 4.3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | buildcop | ||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:05:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Joseph Callen
2019-09-10 16:43:24 UTC
This issue is generally thought to happen when the api.ci cluster is overloaded and the leasing server gets moved between nodes, which takes time because it runs with a PV. Steps I will take to mitigate this:

- run the server with a high priority class
- run the server as a deployment without persistent volumes for rapid redeployment
- run the server horizontally scaled, if possible, for HA
- increase the maximum node pool size for api.ci to reduce contention

(A rough sketch of the priority-class and PV-free deployment changes follows at the end of this thread.)

Can we periodically evict some of the existing pods on the cluster to make room for the leasing server? I am curious whether the leasing server pods get preempted by the scheduler or evicted by some other controller.

> I am curious whether the leasing server pods get preempted by the scheduler or evicted by some other controller.

They shouldn't get preempted since [1]. My impression is that in most cases this is "the node didn't check in because it was too swamped, and the scheduler decided it was unavailable".

[1]: https://github.com/openshift/release/pull/4972

If a node is getting swamped, increasing the priority might only help ensure that the leasing server pod gets scheduled quickly; nothing stops the node from getting swamped or the leasing server pod from getting evicted. Am I right in assuming that resource consumption on nodes should not reach a level where they get swamped and can no longer communicate with the apiserver properly? I am not sure about the type of pods running on the CI cluster (I assume they are build or run-once pods, which are not tolerant to eviction); if the pods are tolerant to eviction, perhaps we can run https://github.com/kubernetes-incubator/descheduler to keep node utilization within a target range (for example, run the descheduler once every 2 hours or so to keep node utilization between 50% and 70%).

Another failure noticed today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6939

We've done everything but running the server in an HA manner, which is technically challenging. This has mitigated the issue, and I would consider this closed unless we see otherwise.

Sometimes we get EOFs reported by clients that connect to Boskos when the Boskos pod dies before completing its response. These seem to be point-in-time events impacting a few CI jobs, occurring every ~5 days [1]. For example [2]:

    heartbeat sent for resource "96d3f5a8-c408-4762-91de-56d9add422a4"
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: EOF
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host
    failed to send heartbeat for resource "96d3f5a8-c408-4762-91de-56d9add422a4": Post http://boskos.ci/update?name=96d3f5a8-c408-4762-91de-56d9add422a4&owner=ci-op-rk1j1xqq-1d3f3&state=leased: dial tcp 172.30.131.17:80: connect: no route to host

So that was an EOF from a request interrupted as the old Boskos pod went down, followed by some "no route to host" errors as the lease container attempted to connect to a Service that no longer had a backing Pod (see the heartbeat sketch below).

[1]: https://search.svc.ci.openshift.org/?search=boskos.ci.*EOF
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2708/pull-ci-openshift-installer-master-e2e-aws/8664
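For context on how those errors surface to jobs, here is a minimal sketch of a heartbeat loop in the spirit of the lease container, assuming a plain HTTP client against the http://boskos.ci/update endpoint seen in the logs above. The function names, interval, and failure threshold are illustrative assumptions, not the actual ci-operator or Boskos client code; the point is that a brief Boskos redeployment shows up as one EOF plus a few connection errors, which a tolerant loop can ride out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// sendHeartbeat POSTs a single lease heartbeat to the Boskos /update
// endpoint, mirroring the request shape visible in the error messages
// above. The endpoint and query parameters come from the logs; the
// helper itself is illustrative.
func sendHeartbeat(base, name, owner, state string) error {
	q := url.Values{}
	q.Set("name", name)
	q.Set("owner", owner)
	q.Set("state", state)

	resp, err := http.Post(base+"/update?"+q.Encode(), "", nil)
	if err != nil {
		// A pod dying mid-response surfaces here as EOF; a Service with
		// no backing Pod surfaces as "no route to host" or similar.
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("heartbeat returned status %d", resp.StatusCode)
	}
	return nil
}

// heartbeatLoop sends heartbeats on a fixed interval and only gives up
// after several consecutive failures, so a short Boskos outage does not
// immediately cost the job its lease.
func heartbeatLoop(base, name, owner string, interval time.Duration, maxFailures int) error {
	failures := 0
	for range time.Tick(interval) {
		if err := sendHeartbeat(base, name, owner, "leased"); err != nil {
			failures++
			fmt.Printf("failed to send heartbeat for resource %q: %v\n", name, err)
			if failures >= maxFailures {
				return fmt.Errorf("lost lease %q after %d consecutive failures", name, failures)
			}
			continue
		}
		failures = 0
		fmt.Printf("heartbeat sent for resource %q\n", name)
	}
	return nil
}

func main() {
	// Values taken from the example log lines above.
	err := heartbeatLoop("http://boskos.ci", "96d3f5a8-c408-4762-91de-56d9add422a4",
		"ci-op-rk1j1xqq-1d3f3", 30*time.Second, 10)
	fmt.Println(err)
}
```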
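The mitigation list at the top of this thread mentions a high priority class and a PV-free deployment. As a rough illustration only (these are not the manifests from openshift/release; the names, namespace, image, and priority value are made up), the sketch below constructs the two objects with the upstream Kubernetes Go types and prints them as JSON:

```go
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A high, non-default priority class so the leasing server wins
	// scheduling contention on a busy cluster. Name and value are
	// illustrative, not the ones used on api.ci.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:    metav1.ObjectMeta{Name: "boskos-high-priority"},
		Value:         1000000,
		GlobalDefault: false,
		Description:   "Keeps the leasing server schedulable when the cluster is overloaded.",
	}

	// A Deployment with no persistent volumes, so a replacement pod can
	// come up quickly instead of waiting on volume attachment.
	replicas := int32(1)
	labels := map[string]string{"app": "boskos"}
	deploy := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "boskos", Namespace: "ci"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					PriorityClassName: pc.Name,
					Containers: []corev1.Container{{
						Name:  "boskos",
						Image: "example.com/boskos:latest", // placeholder image
					}},
				},
			},
		},
	}

	for _, obj := range []interface{}{pc, deploy} {
		out, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(out))
	}
}
```

Dropping the persistent volume is what makes the "rapid redeployment" point work: a replacement pod no longer has to wait for a volume to detach from the old node and reattach to the new one.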
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062