Bug 1893386

Summary: false-positive ReadyIngressNodes_NoReadyIngressNodes: Auth operator makes risky "worker" assumption when guessing about ingress availability
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: apiserver-auth
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 4.6
CC: aos-bugs, lmohanty, mfojtik, mifiedle, pmali, sttts, xxia
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The authentication operator only watched config resources named "cluster", but the "ingress" config resource is named "default". Consequence: The authentication operator ignored changes to the "ingress" config, which led to the wrong assumption that there were no schedulable "worker" nodes when "ingress" was configured with a custom node selector. Fix: Do not filter config resources by the name "cluster"; instead, watch all config resources regardless of name. Result: The operator properly observes the ingress config change and reconciles worker node availability.
Story Points: ---
Clone Of:
: 1893803 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:29:19 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1893803    

Description W. Trevor King 2020-10-30 23:49:24 UTC
Description of problem:

In 4.6, the auth operator grew logic to check whether router pods can be scheduled, but that logic assumes the router will land on "worker"-labeled nodes [1].
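
A quick way to see whether a given cluster trips this assumption (a sketch, assuming a logged-in 'oc' session with cluster-reader access) is to list nodes by the vanilla worker role label; no nodes listed means the operator will count zero "worker" nodes:

$ oc get nodes -l node-role.kubernetes.io/worker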

Version-Release number of selected component (if applicable):

4.6 and later.

How reproducible:

100%, when you have no vanilla 'worker' nodes.

Steps to Reproduce:

1. Have no vanilla 'worker' nodes, but have a bunch of custom compute pools [2] (one concrete way is sketched after these steps)
2. Try to survive on 4.6+
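
One concrete way to reach step 1 on an existing cluster (a sketch, assuming the existing compute nodes can be relabeled to a custom role and the default IngressController is then pointed at that role):

$ NODES=$(oc get node -l node-role.kubernetes.io/worker -o name)
$ oc label $NODES node-role.kubernetes.io/infra=
$ oc label $NODES node-role.kubernetes.io/worker-
$ oc -n openshift-ingress-operator patch ingresscontroller default --type json -p '[{"op": "add", "path": "/spec/nodePlacement", "value": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}}}}]'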

Actual results:

Watch the authentication operator complain: Available=False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication require functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).

Expected results:

Auth operator minds its own business, and the ingress operator complains when it is unscheduled (bug 1881155, [3]) ;).

Additional info:

Auth operator going Available=False on this can hang updates, e.g. updates from 4.5 into 4.6 for folks without vanilla compute nodes.

Workaround: scale up at least one node carrying the 'node-role.kubernetes.io/worker' label with an empty value.
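
If the cluster already has a Ready compute node under some custom role, re-adding the empty-valued worker label to it should be enough (a sketch; <node-name> is a placeholder for a real node name):

$ oc label node <node-name> node-role.kubernetes.io/worker=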

[1]: https://github.com/openshift/cluster-authentication-operator/pull/344/files#diff-74035431d399f5431916d8624ce3080db323d3b4762cb875651311d703168425R66
[2]: https://github.com/openshift/machine-config-operator/blob/0170e082a8b8228373bd841d17555fff2cfb51b7/docs/custom-pools.md#custom-pools
[3]: https://github.com/openshift/cluster-ingress-operator/pull/465

Comment 1 W. Trevor King 2020-10-31 02:14:13 UTC
The check for "am I impacted by this?" looks like a ReadyIngressNodes_NoReadyIngressNodes Available=False authentication operator:

$ oc get -o json clusteroperator authentication | jq -r '.status.conditions[] | select(.type == "Available") | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-10-29T12:49:29Z Available False ReadyIngressNodes_NoReadyIngressNodes ReadyIngressNodesAvailable: Authentication require functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).

combined with a lack of 'worker' nodes:

$ oc get -o json nodes | jq -r '.items[] | [.status.conditions[] | select(.type == "Ready")][0] as $ready | $ready.lastTransitionTime + " " + $ready.status + " " + .metadata.name + " " + (.metadata.labels | to_entries[] | select(.key | startswith("node-role.kubernetes.io/")).key| tostring)' | sort
2020-08-28T08:41:54Z True worker-0...local node-role.kubernetes.io/app
2020-08-28T09:45:56Z True worker-2...local node-role.kubernetes.io/app
2020-10-29T13:54:51Z True worker-1...local node-role.kubernetes.io/app
2020-10-30T09:59:20Z True infra-2...local node-role.kubernetes.io/app
2020-10-30T09:59:20Z True infra-2...local node-role.kubernetes.io/infra
2020-10-30T10:01:38Z True master-1...local node-role.kubernetes.io/master
2020-10-30T10:04:24Z True master-0...local node-role.kubernetes.io/master
2020-10-30T10:07:02Z True infra-1...local node-role.kubernetes.io/app
2020-10-30T10:07:02Z True infra-1...local node-role.kubernetes.io/infra
2020-10-30T10:07:27Z True master-2...local node-role.kubernetes.io/master
2020-10-30T10:10:10Z True infra-0...local node-role.kubernetes.io/app
2020-10-30T10:10:10Z True infra-0...local node-role.kubernetes.io/infra

This example cluster has nodes with "worker-..." names, but the roles are all app, infra, or master.

Comment 5 Lalatendu Mohanty 2020-11-02 11:50:46 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 7 Mike Fiedler 2020-11-03 15:04:46 UTC
Is this a reliable reproducer?

Steps to Reproduce:

1. Have no vanilla 'worker' nodes, but have a bunch of custom compute pools [2]
2. Try to survive on 4.6+

@mfojtik mentioned:

...If I understand the scenario right, there must be something tweaked in ingress config to make this work, right? something that put nodeSelector for router pods to "node-role.kubernetes.io/infra": "".....so to repro, we need to tweak the config and setup the nodes the way it can succeed ?

Comment 8 W. Trevor King 2020-11-03 15:20:09 UTC
No need for an update.  Steps to reproduce in [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1881155#c5

Comment 11 Xingxing Xia 2020-11-04 07:50:35 UTC
Launched a fresh 4.7.0-0.nightly-2020-11-04-013819 env; this payload includes the fix PR. But after following the steps in bug 1881155#c5, the issue is still reproduced:
$ oc get node
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-0-131-120.ap-southeast-2.compute.internal   Ready    master   4h45m   v1.19.2+6bd0f34
ip-10-0-158-57.ap-southeast-2.compute.internal    Ready    infra    4h34m   v1.19.2+6bd0f34
ip-10-0-163-149.ap-southeast-2.compute.internal   Ready    infra    4h34m   v1.19.2+6bd0f34
ip-10-0-173-64.ap-southeast-2.compute.internal    Ready    master   4h45m   v1.19.2+6bd0f34
ip-10-0-193-68.ap-southeast-2.compute.internal    Ready    master   4h45m   v1.19.2+6bd0f34
ip-10-0-221-91.ap-southeast-2.compute.internal    Ready    infra    4h34m   v1.19.2+6bd0f34
$ oc get ingresscontroller default -o yaml -n openshift-ingress-operator
...
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
  replicas: 2
...
$ oc -n openshift-ingress get po -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                              NOMINATED NODE   READINESS GATES
router-default-564fcd4d9-5xxbc   1/1     Running   0          6m    10.128.2.88   ip-10-0-163-149.ap-southeast-2.compute.internal   <none>           <none>
router-default-564fcd4d9-nbsb8   1/1     Running   0          6m    10.129.2.10   ip-10-0-221-91.ap-southeast-2.compute.internal    <none>           <none>
$ oc get co | grep -v "4.7.0-0.nightly-2020-11-04-013819.*T.*F.*F"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2020-11-04-013819   False       False         False      8m9s
$ oc describe co authentication
Name:         authentication
...
    Last Transition Time:  2020-11-04T07:37:17Z
    Message:               ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
    Reason:                ReadyIngressNodes_NoReadyIngressNodes
    Status:                False
    Type:                  Available
...
I guess the PR has some problem in the function numberOfCustomIngressTargets, which wrongly counted 0 custom target nodes.

Comment 12 Xingxing Xia 2020-11-04 10:49:43 UTC
Per Dev's request, I have done pre-merge verification by launching a cluster using the open PR cluster-authentication-operator/pull/373; the issue is not reproduced now.

Comment 14 Xingxing Xia 2020-11-05 03:34:51 UTC
Verified in 4.7.0-0.nightly-2020-11-05-010603. Everything is fine; the complaining co/authentication is not reproduced:
$ NODES=`oc get node | grep worker | grep -o "^[^ ]*"`
$ echo $NODES
$ oc label node $NODES node-role.kubernetes.io/infra=; oc label node $NODES node-role.kubernetes.io/worker-
$ oc -n openshift-ingress-operator patch ingresscontroller default --type json -p '[{"op": "add", "path": "/spec/nodePlacement", "value": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}}}}]'

$ oc get no
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-155-98.ap-southeast-2.compute.internal    Ready    infra    60m   v1.19.2+6bd0f34
ip-10-0-158-63.ap-southeast-2.compute.internal    Ready    master   73m   v1.19.2+6bd0f34
ip-10-0-166-26.ap-southeast-2.compute.internal    Ready    infra    63m   v1.19.2+6bd0f34
ip-10-0-180-210.ap-southeast-2.compute.internal   Ready    master   73m   v1.19.2+6bd0f34
ip-10-0-206-75.ap-southeast-2.compute.internal    Ready    master   73m   v1.19.2+6bd0f34
ip-10-0-222-131.ap-southeast-2.compute.internal   Ready    infra    60m   v1.19.2+6bd0f34
$ oc -n openshift-ingress get po
NAME                             READY   STATUS    RESTARTS   AGE
router-default-bbb78bc68-6nvw5   1/1     Running   0          2m5s
router-default-bbb78bc68-schff   1/1     Running   0          2m5s
$ oc get co | grep -v "4.7.*T.*F.*F"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
$ oc get co authentication ingress 
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.7.0-0.nightly-2020-11-05-010603   True        False         False      2m7s
ingress          4.7.0-0.nightly-2020-11-05-010603   True        False         False      59m

Comment 18 errata-xmlrpc 2021-02-24 15:29:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 20 Red Hat Bugzilla 2023-09-15 00:50:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days