Created attachment 1540647
Description of problem:
An OpenShift API server pod becomes unavailable and rejects traffic while it is still listed as an endpoint. This can occur when the node the pod runs on begins to drain due to a cluster config change.
Version-Release number of selected component (if applicable): 4.0.0
How reproducible: Always
Steps to Reproduce:
1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML).
2. Start a build (e.g. via oc new-app centos/ruby-25-centos7~https://github.com/sclorg/ruby-ex.git)
3. Attempt to list the build logs.
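The attachment is not reproduced here, but a blocked-registry entry in the cluster image configuration would look roughly like the sketch below. This is an illustrative example, not the attached YAML: the registry host is made up, and the field names follow the images.config.openshift.io/v1 API.

```yaml
# Illustrative sketch only -- blocks image pulls from an example registry
# cluster-wide. Applying this change triggers a machine-config rollout,
# which drains nodes one by one (the trigger for this bug).
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    blockedRegistries:
    - registry.example.com   # hypothetical host; per comment #5, do not block quay.io
```

Applying a config like this (e.g. with oc apply -f) starts the node drain described in the actual results.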
Actual results: The OpenShift API server is unavailable and returns 503 errors while the nodes drain due to the config change.
Expected results: The OpenShift API server remains available while the nodes drain.
Additional Info: See CI failure https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_builder/48/pull-ci-openshift-builder-master-e2e-aws-builds/92
This is probably a combination of https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 and occasionally unexpected lag in endpoint convergence (we saw >30 sec in very bad cases).
With https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 merged, let's try to re-test this.
Today I tried this bug's step "1. Add a blocked registry to the cluster image configuration (see attachment for sample YAML)", blocking quay.io. My env's worker nodes then became SchedulingDisabled; after I uncordoned them, they later turned NotReady and the env was broken.
Will build another env and try the comment 0 steps again (this time not blocking quay.io).
(In reply to Xingxing Xia from comment #5)
> Today tried bug's "1. Add a blocked registry to the cluster image
> configuration (see attachment for sample YAML)" which blocked quay.io, then
> my env worker nodes become scheduledisabled, after I uncordon, later they
> turned to not ready, the env is broken.
Found https://bugzilla.redhat.com/show_bug.cgi?id=1686509, which reports the same issue.
Tested the latest payload, 4.0.0-0.nightly-2019-03-22-002648, which already includes the above fix https://github.com/openshift/cluster-openshift-apiserver-operator/pull/154 .
In terminal T1:
ssh to master, run:
$ tail -f /var/log/openshift-apiserver/audit.log # continuously outputs a flow of many requests every second
In terminal T2:
$ watch -n 1 oc get ep,po -n openshift-apiserver
In terminal T3:
$ oc delete po --all -n openshift-apiserver
After issuing T3's command, watching T1 and T2 showed:
In T2, the endpoints and pods disappeared immediately, and at the same moment
the output flow in T1 paused, resuming only once T2's endpoints and pods came back.
From this perspective, the issue is not hit again.
BTW, there is https://github.com/openshift/cluster-kube-apiserver-operator/pull/352 for the case of a slowly converging SDN environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.