Bug 1893803
| Summary: | false-positive ReadyIngressNodes_NoReadyIngressNodes: Auth operator makes risky "worker" assumption when guessing about ingress availability | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | apiserver-auth | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.6 | CC: | aos-bugs, lmohanty, mfojtik, mifiedle, pmali, sttts, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1893386 | Environment: | |
| Last Closed: | 2020-11-16 14:37:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1893386 | | |
| Bug Blocks: | | | |
Description
Scott Dodson
2020-11-02 16:29:59 UTC
This is a blocker for 4.6.2.

This failed to verify on 4.6.0-0.nightly-2020-11-03-172112.
Summary: upgrading a healthy 4.5.16 cluster with ingress running on nodes labeled node-role.kubernetes.io/infra resulted in an upgrade stuck on the authentication operator with this status:
Status:
  Conditions:
    Last Transition Time:  2020-11-03T23:50:44Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-11-04T00:30:43Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-11-04T00:29:32Z
    Message:               ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
    Reason:                ReadyIngressNodes_NoReadyIngressNodes
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-11-03T22:43:43Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
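This Available=False message is the false positive named in the bug summary: the nodes actually hosting the ingress pods are schedulable and ready, which can be confirmed with a check along these lines (illustrative, not part of the original report; the label matches step 3 below):
# oc get nodes -l node-role.kubernetes.io/infra=
All three infra nodes report Ready, yet the operator counts 0 custom target nodes.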
I will get oc adm must-gather.
Details:
Verification steps based on https://bugzilla.redhat.com/show_bug.cgi?id=1881155#c5 (thanks to @wking).
1. Started with a basic AWS 4.5.16 cluster: 3 nodes labeled master and 3 labeled worker.
2. Before the test starts, auth and ingress are happy:
2020-11-03T22:43:43Z authentication Upgradeable True AsExpected -
2020-11-03T23:00:14Z authentication Available True AsExpected -
2020-11-03T23:01:08Z authentication Progressing False AsExpected -
2020-11-03T23:49:56Z ingress Degraded False NoIngressControllersDegraded -
2020-11-03T23:50:44Z authentication Degraded False AsExpected -
2020-11-03T23:53:40Z ingress Available True - desired and current number of IngressControllers are equal
2020-11-03T23:53:40Z ingress Progressing False - desired and current number of IngressControllers are equal
3. Label all worker nodes with node-role.kubernetes.io/infra= and remove node-role.kubernetes.io/worker-
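A sketch of the relabeling, looping over the current workers (the loop form is an assumption; only the two labels come from the step above):
# for n in $(oc get nodes -l node-role.kubernetes.io/worker= -o name); do oc label "$n" node-role.kubernetes.io/infra= node-role.kubernetes.io/worker-; done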
# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-131-118.us-east-2.compute.internal   Ready    infra    65m   v1.18.3+2fbd7c7
ip-10-0-134-15.us-east-2.compute.internal    Ready    master   74m   v1.18.3+2fbd7c7
ip-10-0-168-147.us-east-2.compute.internal   Ready    master   74m   v1.18.3+2fbd7c7
ip-10-0-187-74.us-east-2.compute.internal    Ready    infra    65m   v1.18.3+2fbd7c7
ip-10-0-201-177.us-east-2.compute.internal   Ready    master   73m   v1.18.3+2fbd7c7
ip-10-0-222-88.us-east-2.compute.internal    Ready    infra    65m   v1.18.3+2fbd7c7
Now patch the default IngressController so that ingress pods schedule onto nodes matching the infra label:
oc -n openshift-ingress-operator patch ingresscontroller default --type json -p '[{"op": "add", "path": "/spec/nodePlacement", "value": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}}}}]'
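To confirm the placement took effect, the spec can be read back (an illustrative check, not part of the original steps):
oc -n openshift-ingress-operator get ingresscontroller default -o jsonpath='{.spec.nodePlacement}'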
4. Kill all openshift-ingress pods and let them reschedule on the infra nodes via the matchLabels selector.
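One way to force the reschedule (the exact command is an assumption; any mass delete of the router pods works):
# oc -n openshift-ingress delete pods --all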
The ingress pods are now running OK on the infra nodes:
# oc get pods -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-76c5ff76db-l7wh6   1/1     Running   0          71s
router-default-76c5ff76db-wq8wz   1/1     Running   0          71s
5. On 4.5.16, both auth and ingress are happy at this point, with ingress running on the infra nodes via the patched IngressController. Unlike 4.6+, auth does not go Available=False:
2020-11-03T22:43:43Z authentication Upgradeable True AsExpected -
2020-11-03T23:00:14Z authentication Available True AsExpected -
2020-11-03T23:01:08Z authentication Progressing False AsExpected -
2020-11-03T23:49:56Z ingress Degraded False NoIngressControllersDegraded -
2020-11-03T23:50:44Z authentication Degraded False AsExpected -
2020-11-03T23:53:40Z ingress Available True - desired and current number of IngressControllers are equal
2020-11-03T23:53:40Z ingress Progressing False - desired and current number of IngressControllers are equal
oc get co also shows everybody happy.
6. Now start the upgrade to release:4.6.0-0.nightly-2020-11-03-172112.
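One way to start it (the release registry path and the flags are assumptions, not recorded in the report):
# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-11-03-172112 --allow-explicit-upgrade --force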
The upgrade goes smoothly for a while but gets stuck on auth with the status listed above:
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.16    True        True          107m    Unable to apply 4.6.0-0.nightly-2020-11-03-172112: the cluster operator authentication has not yet successfully rolled out
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-11-03-172112   False       False         False      52m
cloud-credential                           4.6.0-0.nightly-2020-11-03-172112   True        False         False      167m
cluster-autoscaler                         4.6.0-0.nightly-2020-11-03-172112   True        False         False      153m
config-operator                            4.6.0-0.nightly-2020-11-03-172112   True        False         False      154m
console                                    4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
csi-snapshot-controller                    4.6.0-0.nightly-2020-11-03-172112   True        False         False      154m
dns                                        4.5.16                              True        False         False      157m
etcd                                       4.6.0-0.nightly-2020-11-03-172112   True        False         False      157m
image-registry                             4.6.0-0.nightly-2020-11-03-172112   True        False         False      150m
ingress                                    4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
insights                                   4.6.0-0.nightly-2020-11-03-172112   True        False         False      154m
kube-apiserver                             4.6.0-0.nightly-2020-11-03-172112   True        False         False      156m
kube-controller-manager                    4.6.0-0.nightly-2020-11-03-172112   True        False         False      156m
kube-scheduler                             4.6.0-0.nightly-2020-11-03-172112   True        False         False      153m
kube-storage-version-migrator              4.6.0-0.nightly-2020-11-03-172112   True        False         False      150m
machine-api                                4.6.0-0.nightly-2020-11-03-172112   True        False         False      151m
machine-approver                           4.6.0-0.nightly-2020-11-03-172112   True        False         False      155m
machine-config                             4.5.16                              True        False         False      157m
marketplace                                4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
monitoring                                 4.6.0-0.nightly-2020-11-03-172112   True        False         False      144m
network                                    4.5.16                              True        False         False      158m
node-tuning                                4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
openshift-apiserver                        4.6.0-0.nightly-2020-11-03-172112   True        False         False      153m
openshift-controller-manager               4.6.0-0.nightly-2020-11-03-172112   True        False         False      154m
openshift-samples                          4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-11-03-172112   True        False         False      157m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-11-03-172112   True        False         False      157m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-11-03-172112   True        False         False      51m
service-ca                                 4.6.0-0.nightly-2020-11-03-172112   True        False         False      158m
storage                                    4.6.0-0.nightly-2020-11-03-172112   True        False         False      52m
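A quick way to spot the one unhappy operator in a listing like this (illustrative filter, same idea as the grep used in the fresh-cluster reproduction further down):
# oc get co | grep -v 'True.*False.*False'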
root@ip-172-31-64-58: ~ # oc describe co auth
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-11-03T22:35:11Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-11-03T22:35:11Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         authentication-operator
    Operation:       Update
    Time:            2020-11-04T00:31:13Z
  Resource Version:  61547
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:               71de293d-2939-4b84-af17-f941dbaf82f8
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-11-03T23:50:44Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-11-04T00:30:43Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-11-04T00:29:32Z
    Message:               ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
    Reason:                ReadyIngressNodes_NoReadyIngressNodes
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-11-03T22:43:43Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  infrastructures
    Group:     config.openshift.io
    Name:      cluster
    Resource:  oauths
    Group:      route.openshift.io
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   routes
    Group:
    Name:       oauth-openshift
    Namespace:  openshift-authentication
    Resource:   services
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-authentication
    Resource:  namespaces
    Group:
    Name:      openshift-authentication-operator
    Resource:  namespaces
    Group:
    Name:      openshift-ingress
    Resource:  namespaces
    Group:
    Name:      openshift-oauth-apiserver
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.6.0-0.nightly-2020-11-03-172112
    Name:     oauth-openshift
    Version:  4.6.0-0.nightly-2020-11-03-172112_openshift
    Name:     oauth-apiserver
    Version:  4.6.0-0.nightly-2020-11-03-172112
Events:  <none>
# oc get pods -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5995d6bbbf-2tqxk   1/1     Running   0          89m
router-default-5995d6bbbf-dwfwr   1/1     Running   0          89m
Thanks Mike. You used an upgrade to test it; from bug 1881155#c5, the reproducer can also be seen in a fresh env without doing an upgrade. So I launched a fresh 4.6.0-0.nightly-2020-11-03-172112 env, followed the bug 1881155#c5 steps, and reproduced Mike's error:

$ oc get node
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-0-129-211.ap-southeast-1.compute.internal   Ready    master   4h24m   v1.19.0+9f84db3
ip-10-0-135-171.ap-southeast-1.compute.internal   Ready    infra    4h13m   v1.19.0+9f84db3
ip-10-0-184-212.ap-southeast-1.compute.internal   Ready    infra    4h13m   v1.19.0+9f84db3
ip-10-0-186-226.ap-southeast-1.compute.internal   Ready    master   4h24m   v1.19.0+9f84db3
ip-10-0-206-99.ap-southeast-1.compute.internal    Ready    infra    4h13m   v1.19.0+9f84db3
ip-10-0-207-229.ap-southeast-1.compute.internal   Ready    master   4h25m   v1.19.0+9f84db3

$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                                              NOMINATED NODE   READINESS GATES
router-default-7d6df7cf44-f9ffg   1/1     Running   0          10m   10.129.2.79   ip-10-0-206-99.ap-southeast-1.compute.internal    <none>           <none>
router-default-7d6df7cf44-gf8f9   1/1     Running   0          10m   10.128.2.17   ip-10-0-184-212.ap-southeast-1.compute.internal   <none>           <none>

$ oc get co | grep -v "4.6.0-0.nightly-2020-11-03-172112.*T.*F.*F"
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.6.0-0.nightly-2020-11-03-172112   False       False         False      13m

$ oc describe co authentication
Name:         authentication
...
    Message:               ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
    Reason:                ReadyIngressNodes_NoReadyIngressNodes
    Status:                False
    Type:                  Available
...

I guess the PR has some problem in the function numberOfCustomIngressTargets, which wrongly got 0 custom target nodes.

I'm dropping the blocker flag because this continues to fail in CI. We have numerous fixes already in 4.6.2 that we do not wish to delay any longer. We'll just have to deal with this failing upgrades until 4.6.3.

Verified in 4.6.0-0.nightly-2020-11-05-024238 using the steps of bug 1893386#c14. Everything is fine; the bad co/authentication state is not reproduced.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.4 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4987

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475