Bug 1685185

Summary: API servers reject traffic before being removed as an endpoint
Product: OpenShift Container Platform
Reporter: Adam Kaplan <adam.kaplan>
Component: Master
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: high
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, jokerman, mfojtik, mmccomas, sttts, yinzhou
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:44:55 UTC
Type: Bug
Bug Depends On: 1686509    
Attachments: image-blacklist yaml

Description Adam Kaplan 2019-03-04 14:55:21 UTC
Created attachment 1540647
image-blacklist yaml

Description of problem:
An OpenShift API server pod becomes unavailable and rejects traffic while it is still listed as an endpoint. This can occur when the node the pod runs on begins to drain due to a cluster config change.

Version-Release number of selected component (if applicable): 4.0.0


How reproducible: Always


Steps to Reproduce:
1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML; a sketch of such a config follows these steps).
2. Start a build (e.g. via oc new-app centos/ruby-25-centos7~https://github.com/sclorg/ruby-ex.git).
3. Attempt to list the build logs.
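
The attachment itself is not reproduced here; the following is only a minimal sketch of what such an image config typically looks like, assuming the standard cluster-scoped Image resource (the registry name is a placeholder, not taken from the attachment):

$ oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    blockedRegistries:
    - blocked.example.com   # placeholder registry; the attachment's entry may differ
EOF

Changing this config is what triggers the machine-config rollout that drains nodes and exposes the bug.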

Actual results: The OpenShift API server is unavailable and returns 503 errors while the nodes are draining due to the config change.


Expected results: The OpenShift API server remains available while the nodes drain.

Additional Info: See CI failure https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/48/pull-ci-openshift-builder-master-e2e-aws-builds/92

Comment 1 Stefan Schimanski 2019-03-05 09:31:37 UTC
This is probably a combination of https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 and the sometimes unexpectedly long lag of endpoint convergence (we saw >30 sec in very bad cases).
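
For context, the usual mitigation for this class of problem (a sketch only; not necessarily how PR #154 implements it) is to keep the server answering through termination until the endpoint controller has had time to remove it, e.g. a readiness probe plus a preStop delay sized above the worst-case convergence lag:

spec:
  terminationGracePeriodSeconds: 70    # assumption: sized above the >30s lag noted above
  containers:
  - name: openshift-apiserver
    readinessProbe:                    # starts failing once shutdown begins,
      httpGet:                         # which pulls the pod out of the endpoints list
        path: /healthz
        port: 8443
        scheme: HTTPS
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "30"]     # keep serving while endpoints converge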

Comment 2 Michal Fojtik 2019-03-07 10:05:55 UTC
With https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 merged, let's re-test this.

Comment 5 Xingxing Xia 2019-03-21 10:07:45 UTC
Today I tried this bug's step 1 ("Add a blocked registry to the cluster image configuration (see attachment for sample YAML)") with quay.io blocked. My env's worker nodes became SchedulingDisabled; after I uncordoned them they later turned NotReady, leaving the env broken.
Will build another env and try the comment 0 steps again (this time not blocking quay.io).

Comment 6 Xingxing Xia 2019-03-22 02:27:32 UTC
(In reply to Xingxing Xia from comment #5)
> Today I tried this bug's step 1 ... with quay.io blocked. My env's worker
> nodes became SchedulingDisabled; after I uncordoned them they later turned
> NotReady, leaving the env broken.
Found https://bugzilla.redhat.com/show_bug.cgi?id=1686509 reporting the same issue.

Comment 7 Xingxing Xia 2019-03-22 11:21:27 UTC
Verified with the latest payload 4.0.0-0.nightly-2019-03-22-002648, which already includes the above fix https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154.
In terminal T1:
ssh to a master and run:
$ tail -f /var/log/openshift-apiserver/audit.log # outputs a constant stream of requests, many per second

In terminal T2:
$ watch -n 1 oc get ep,po -n openshift-apiserver

In terminal T3:
$ oc delete po --all -n openshift-apiserver

After issuing T3's command, watching T1 and T2 showed:
in T2 the endpoints and pods disappeared immediately, and at the same moment the
output stream in T1 paused, resuming only once T2's endpoints and pods came back.

From this perspective, the issue is not hit anymore.
BTW, there is https://github.com/openshift/cluster-kube-apiserver-operator/pull/352 for the case of a slowly converging SDN environment.
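
For completeness, the same check can be scripted instead of watched across three terminals (a sketch; it assumes a logged-in oc session with cluster-admin and an arbitrary 60-second observation window):

$ oc delete po --all -n openshift-apiserver --wait=false
$ for i in $(seq 1 60); do
    # Any failure here would mean a deleted pod was still receiving traffic.
    oc get --raw /apis/build.openshift.io/v1 >/dev/null 2>&1 || echo "API request failed at t=${i}s"
    sleep 1
  done
$ oc get ep,po -n openshift-apiserver # endpoints and pods should be back by now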

Comment 9 errata-xmlrpc 2019-06-04 10:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758