Created attachment 1540647
Description of problem:
An OpenShift API server pod becomes unavailable and rejects traffic while it is still listed as an endpoint. This can occur when the node the pod runs on begins to drain due to a cluster config change.
Version-Release number of selected component (if applicable): 4.0.0
How reproducible: Always
Steps to Reproduce:
1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML).
2. Start a build (e.g. via oc new-app centos/ruby-25-centos7~https://github.com/sclorg/ruby-ex.git)
3. Attempt to list the build logs.
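The attachment is not reproduced here, but a blocked-registry entry in the cluster image configuration would look roughly like the sketch below. This is an illustrative example, not the attached YAML: the registry host is made up, and the field names follow the images.config.openshift.io/v1 API.

```yaml
# Illustrative sketch only -- blocks image pulls from an example registry
# cluster-wide. Applying this change triggers a machine-config rollout,
# which drains nodes one by one (the trigger for this bug).
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    blockedRegistries:
    - registry.example.com   # hypothetical host; per comment #5, do not block quay.io
```

Applying a config like this (e.g. with oc apply -f) starts the node drain described in the actual results.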
Actual results: The OpenShift API server is unavailable and returns 503 errors while the nodes drain due to the config change.
Expected results: The OpenShift API server remains available while the nodes drain.
Additional Info: See CI failure https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/48/pull-ci-openshift-builder-master-e2e-aws-builds/92
This is probably a combination of https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 and occasionally unexpected lag in endpoint convergence (we saw >30 sec in very bad cases).
With https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 merged, let's try to re-test this.
Today I tried this bug's step "1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML)", blocking quay.io. My env's worker nodes then became SchedulingDisabled; after I uncordoned them, they later turned NotReady and the env was broken.
Will build another env and try the comment 0 steps again (this time not blocking quay.io).
(In reply to Xingxing Xia from comment #5)
> Today tried bug's "1. Add a blocked registry to the cluster image
> configuration (see attachment for sample YAML)" which blocked quay.io, then
> my env worker nodes become scheduledisabled, after I uncordon, later they
> turned to not ready, the env is broken.
Found https://bugzilla.redhat.com/show_bug.cgi?id=1686509, which reports the same issue.
Tested the latest payload, 4.0.0-0.nightly-2019-03-22-002648, which already includes the above fix https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 .
In terminal T1:
ssh to master, run:
$ tail -f /var/log/openshift-apiserver/audit.log # continuously outputs a flow of many requests every second
In terminal T2:
$ watch -n 1 oc get ep,po -n openshift-apiserver
In terminal T3:
$ oc delete po --all -n openshift-apiserver
After issuing T3's command, watching T1 and T2 showed:
In T2, the endpoints and pods disappeared immediately, and at the same moment
the output flow in T1 paused, resuming only once T2's endpoints and pods came back.
From this perspective, the issue is not hit again.
BTW, there is https://github.com/openshift/cluster-kube-apiserver-operator/pull/352 for the case of a slowly converging SDN environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.