Bug 1685185 - API servers reject traffic before being removed as an endpoint
Summary: API servers reject traffic before being removed as an endpoint
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1686509
Blocks:
 
Reported: 2019-03-04 14:55 UTC by Adam Kaplan
Modified: 2019-06-04 10:45 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:55 UTC
Target Upstream Version:


Attachments (Terms of Use)
image-blacklist yaml (157 bytes, text/plain)
2019-03-04 14:55 UTC, Adam Kaplan


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:45:44 UTC

Description Adam Kaplan 2019-03-04 14:55:21 UTC
Created attachment 1540647 [details]
image-blacklist yaml

Description of problem:
An OpenShift API server pod becomes unavailable and rejects traffic while it is still listed as an endpoint. This can occur when the node the pod runs on begins to drain due to a cluster configuration change.
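(For context: the generic Kubernetes pattern for avoiding this failure mode is to keep the process serving during termination long enough for the endpoint controller to observe the pod as not ready and remove it from the Endpoints object before it stops accepting traffic. A minimal sketch of such a pod spec follows; the port, path, and timings are illustrative assumptions, not the actual openshift-apiserver configuration or the fix for this bug.)

```yaml
spec:
  # Allow enough time for in-flight requests and endpoint convergence.
  terminationGracePeriodSeconds: 90
  containers:
  - name: openshift-apiserver
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 5
      failureThreshold: 1
    lifecycle:
      preStop:
        exec:
          # Keep serving while endpoint controllers and load balancers converge.
          command: ["sleep", "30"]
```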

Version-Release number of selected component (if applicable): 4.0.0


How reproducible: Always


Steps to Reproduce:
1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML).
2. Start a build (e.g. via oc new-app centos/ruby-25-centos7~https://github.com/sclorg/ruby-ex.git)
3. Attempt to list the build logs.
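For reference, a blocked-registry image config in OpenShift 4 looks roughly like the following. This is a sketch based on the images.config.openshift.io/v1 Image API, not the literal attachment, and the registry name is illustrative:

```yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    blockedRegistries:
    - registry.example.com
```

Applying such a change to the cluster-wide image config is what triggers the machine config rollout and the node drains described below.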

Actual results: The OpenShift API server is unavailable and returns 503 errors while nodes are draining due to the config change.


Expected results: The OpenShift API server remains available while the nodes drain.

Additional Info: See CI failure https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/48/pull-ci-openshift-builder-master-e2e-aws-builds/92

Comment 1 Stefan Schimanski 2019-03-05 09:31:37 UTC
This is probably a combination of https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 and the sometimes unexpected lag of endpoint convergence (we saw >30 sec in very bad cases).

Comment 2 Michal Fojtik 2019-03-07 10:05:55 UTC
With https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 merged, let's try to re-test this.

Comment 5 Xingxing Xia 2019-03-21 10:07:45 UTC
Today I tried the bug's step "1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML)", blocking quay.io. My env's worker nodes then became SchedulingDisabled; after I uncordoned them, they later turned NotReady and the env was broken.
Will build another env and try the comment 0 steps again (this time without blocking quay.io).

Comment 6 Xingxing Xia 2019-03-22 02:27:32 UTC
(In reply to Xingxing Xia from comment #5)
> Today I tried the bug's step "1. Add a blocked registry to the cluster
> image configuration (see attachment for sample YAML)", blocking quay.io.
> My env's worker nodes then became SchedulingDisabled; after I uncordoned
> them, they later turned NotReady and the env was broken.
Found that https://bugzilla.redhat.com/show_bug.cgi?id=1686509 reports the same issue.

Comment 7 Xingxing Xia 2019-03-22 11:21:27 UTC
Tested with the latest payload 4.0.0-0.nightly-2019-03-22-002648, which already includes the fix above, https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 .
In terminal T1:
ssh to master, run:
$ tail -f /var/log/openshift-apiserver/audit.log # It outputs a constant flow of many requests every second

In terminal T2:
$ watch -n 1 oc get ep,po -n openshift-apiserver

In terminal T3:
$ oc delete po --all -n openshift-apiserver

After issuing T3's command and watching T1 and T2:
In T2, the endpoints and pods disappeared immediately, and at the same moment
the output flow in T1 stopped, resuming only once T2's endpoints and pods came back.

From this perspective, the issue is no longer hit.
BTW, there is https://github.com/openshift/cluster-kube-apiserver-operator/pull/352 for the case of a slowly converging SDN environment.

Comment 9 errata-xmlrpc 2019-06-04 10:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

