Bug 2021446
| Summary: | openshift-ingress-canary is not reporting DEGRADED state, even though the canary route is not available and accessible | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber> |
| Component: | Networking | Assignee: | Grant Spence <gspence> |
| Networking sub component: | router | QA Contact: | Melvin Joseph <mjoseph> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, gspence, hongli, mmasters |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The ingress operator status did not show as degraded if the canary route was never admitted by an ingress controller. Consequence: Users could be misled into believing the canary route is valid when the operator should in fact be degraded, since the route is not admitted. Fix: The cluster-ingress-operator code was updated to set the ingress cluster operator as degraded (status Unknown) if the canary route was not admitted. Result: The ingress operator status more accurately reflects the status of the canary controller and better describes issues to end users. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:39:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Simon Reber 2021-11-09 09:21:47 UTC
Setting blocker- and low priority because this is more of a documentation issue and not causing immediate problems. We need to look into this to understand the problem and possibly improve the documentation when time allows.

The purpose of the canary controller is to verify end-to-end connectivity for the default ingresscontroller. To this end, the canary controller creates a test application, service, and canary route, and once the canary route is admitted by the default ingresscontroller, the canary controller periodically sends a request to the canary route and verifies that it gets a response.

In comment 0, you said that the default ingresscontroller has sharding enabled. In the must-gather archives referenced in comment 1, one cluster's default ingresscontroller is not reporting the CanaryChecksSucceeding status condition. For this cluster, I can see that the default ingresscontroller has sharding configured, the canary route's namespace is not matched by the default ingresscontroller's namespace selector, and so the canary route's status indicates that it has not been admitted by the default ingresscontroller, which means that the canary controller does not check connectivity.

For the other cluster, the default ingresscontroller is reporting CanaryChecksSucceeding=False; this cluster has sharding configured too, but the canary route has nevertheless been admitted, and so the canary controller is checking the route. The fact that the canary route was admitted by the default ingresscontroller in one cluster but not in the other explains the discrepancy in the reporting of the CanaryChecksSucceeding status condition.
Is it possible that in the first cluster, sharding was configured before the ingress operator started, while in the second cluster (the one that reports CanaryChecksSucceeding=False), sharding was configured after the operator started, giving the canary route the opportunity to be admitted? That would explain the discrepancy in the admission status of the canary routes.

(In reply to Miciah Dashiel Butler Masters from comment #4)
> The purpose of the canary controller is to verify end-to-end connectivity
> for the default ingresscontroller. To this end, the canary controller
> creates a test application, service, and canary route, and once the canary
> route is admitted by the default ingresscontroller, the canary controller
> periodically sends a request to the canary route and verifies that the
> controller gets a response.

OK, but that still indicates that we have a couple of problems here:
- We don't highlight the need for the canary route in the documentation, and in restricted environments this could expose problems. We should try to solve this.
- The canary check/controller does not seem to be aware of router sharding if the default IngressController is sharded. We should fix that, taking sharding into consideration and applying the respective labels on the `route`, as otherwise we might get false-positive alerts.

> Is it possible that in the first cluster, sharding was configured before
> the ingress operator started and in the second cluster (the one that reports
> CanaryChecksSucceeding=False), sharding was configured after the operator
> started, giving the canary route the opportunity to be admitted? That would
> explain the discrepancy in the admission status of the canary routes.

So this appears to be a timing issue and thus depends mostly on luck.
Meaning that if timing is not on our side, we might have a good number of OpenShift Container Platform clusters reporting a degraded state for the ingress cluster operator. But this again would be a false positive, as things may actually still work; it's more that the canary controller is not properly honoring restricted environments with DNS limitations and router sharding in use.

I've been reviewing this bug and would like to add:
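The admission discrepancy comes down to label/selector matching: a sharded default ingresscontroller only admits routes matched by its routeSelector, and the canary route does not carry the shard label. The following is a minimal sketch of that matching logic, not the operator's actual code; the real implementation uses Kubernetes label selectors (which also support expressions), and the canary label shown is illustrative.

```go
package main

import "fmt"

// routeMatchesShard mimics, in simplified form, how a sharded ingress
// controller decides whether to admit a route: every key/value pair in the
// controller's routeSelector must be present in the route's labels.
// (Sketch only; the real operator uses k8s.io/apimachinery label selectors.)
func routeMatchesShard(routeLabels, routeSelector map[string]string) bool {
	for k, v := range routeSelector {
		if routeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical labels on the canary route: it carries its own canary
	// label but not the shard label...
	canaryLabels := map[string]string{"ingress.openshift.io/canary": "canary_controller"}
	// ...so after patching the default controller with
	// routeSelector {"type": "default"} (as in the verification steps in
	// this bug), the canary route no longer matches and is not admitted.
	shardSelector := map[string]string{"type": "default"}
	fmt.Println(routeMatchesShard(canaryLabels, shardSelector))
	fmt.Println(routeMatchesShard(map[string]string{"type": "default"}, shardSelector))
}
```

This is why adding the shard label to the canary route (or excluding the canary namespace from sharding) would keep the checks working on a sharded default controller.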
1. The canary route takes 5 minutes to fail. I didn't wait long enough and was beginning to get confused; just an FYI for those who are looking at this.
2. I verified that what Miciah said is true: if you shard your default ingress controller and cause the canary route to be not admitted, there will be no error messages IF sharding was configured before the ingress operator starts AND your canary route has never been admitted before (there is no previous status).
- If ingress-operator is already running and you shard, then you WILL get an error for canary failing.
- If ingress-operator is restarted after you shard, but your canary route has been admitted previously, then you WILL get an error for canary failing.
- This is because we have a "stale" status issue with routes (see BZ1944851).
- If ingress-operator is restarted after you shard AND you clear the status of the canary route manually, then you will NOT get an error for the canary failing.
- This is the situation in this bug: a fresh cluster with sharding.
- Run `oc delete -n openshift-ingress-canary route canary` to manually clear the status of the canary route.
Long story short, I am going to add a new "Unknown" status to the canary controller. "Unknown" will mean the canary route is not admitted and therefore the canary controller isn't operating correctly. This will help with the situation in this bug, in which the ingress operator was not showing as degraded even though the canary controller was not working.
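The new "Unknown" status can be sketched as a small condition-computing function. The function name and signature below are illustrative, not the actual cluster-ingress-operator API; the not-admitted reason string mirrors the CanaryRouteNotAdmitted message that appears in the verification transcript, while the other two messages are hypothetical.

```go
package main

import "fmt"

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// computeCanaryCondition sketches the fix: if the canary route has never
// been admitted, report Unknown (which the operator then treats as degraded)
// instead of silently reporting nothing; otherwise report True/False based
// on whether the periodic probes are passing.
func computeCanaryCondition(routeAdmitted, checksPassing bool) (ConditionStatus, string) {
	switch {
	case !routeAdmitted:
		return ConditionUnknown, "CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller"
	case checksPassing:
		return ConditionTrue, "Canary route checks for the default ingress controller are successful"
	default:
		return ConditionFalse, "Canary route checks for the default ingress controller are failing"
	}
}

func main() {
	// A freshly sharded cluster: route never admitted -> Unknown -> degraded.
	status, reason := computeCanaryCondition(false, false)
	fmt.Printf("CanaryChecksSucceeding=%s (%s)\n", status, reason)
}
```

The key design point is that Unknown is distinct from False: False means the probes ran and failed, while Unknown means the checks could not run at all, and both now surface as Degraded on the ingress cluster operator.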
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False 136m Cluster version is 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest
melvinjoseph@mjoseph-mac Downloads % oc get co ingress
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
ingress 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 89m
melvinjoseph@mjoseph-mac Downloads % oc get pod -n openshift-ingress
NAME READY STATUS RESTARTS AGE
router-default-76699c5f9c-2jrr8 1/1 Running 0 96m
router-default-76699c5f9c-xhpfn 1/1 Running 0 96m
melvinjoseph@mjoseph-mac Downloads % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 136m
baremetal 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
cloud-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 153m
cloud-credential 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 154m
cluster-autoscaler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
config-operator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
console 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 140m
csi-snapshot-controller 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
dns 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
etcd 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 150m
image-registry 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
ingress 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
insights 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
kube-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 147m
kube-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
kube-scheduler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
kube-storage-version-migrator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
machine-api 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 147m
machine-approver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
machine-config 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 150m
marketplace 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
monitoring 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 140m
network 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
node-tuning 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
openshift-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 143m
openshift-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 148m
openshift-samples 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
operator-lifecycle-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
operator-lifecycle-manager-catalog 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
operator-lifecycle-manager-packageserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
service-ca 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
storage 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
melvinjoseph@mjoseph-mac Downloads % oc patch -n openshift-ingress-operator ingresscontroller/default --patch='{"spec":{"routeSelector":{"matchLabels":{"type": "default"}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads %
melvinjoseph@mjoseph-mac Downloads % oc delete -n openshift-ingress-canary route canary
route.route.openshift.io "canary" deleted
melvinjoseph@mjoseph-mac Downloads % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest False False False 21s OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com/healthz" returned "503 Service Unavailable"
baremetal 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
cloud-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 154m
cloud-credential 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 154m
cluster-autoscaler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
config-operator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 153m
console 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest False False False 6s RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com returns '503 Service Unavailable'
csi-snapshot-controller 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
dns 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
etcd 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
image-registry 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
ingress 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 146m
insights 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 146m
kube-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 147m
kube-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
kube-scheduler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
kube-storage-version-migrator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 153m
machine-api 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 148m
machine-approver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
machine-config 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
marketplace 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
monitoring 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 141m
network 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
node-tuning 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
openshift-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 143m
openshift-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
openshift-samples 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 145m
operator-lifecycle-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
operator-lifecycle-manager-catalog 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
operator-lifecycle-manager-packageserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 146m
service-ca 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 153m
storage 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
melvinjoseph@mjoseph-mac Downloads % oc get co ingress
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
ingress 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 146m
melvinjoseph@mjoseph-mac Downloads % oc get route -n openshift-ingress-canary
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
canary canary-openshift-ingress-canary.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com ingress-canary 8080 edge/Redirect None
melvinjoseph@mjoseph-mac Downloads % oc get all -n openshift-ingress-canary
NAME READY STATUS RESTARTS AGE
pod/ingress-canary-9qng2 1/1 Running 0 148m
pod/ingress-canary-dddcc 1/1 Running 0 148m
pod/ingress-canary-zg786 1/1 Running 0 145m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ingress-canary ClusterIP 172.30.187.227 <none> 8080/TCP,8888/TCP 148m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/ingress-canary 3 3 3 3 3 kubernetes.io/os=linux 148m
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/canary canary-openshift-ingress-canary.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com ingress-canary 8080 edge/Redirect None
melvinjoseph@mjoseph-mac Downloads %
melvinjoseph@mjoseph-mac Downloads %
After 5 minutes.....
melvinjoseph@mjoseph-mac Downloads % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest False False True 6m25s OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com/healthz" returned "503 Service Unavailable"
baremetal 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
cloud-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 160m
cloud-credential 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 160m
cluster-autoscaler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
config-operator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 159m
console 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest False False False 6m10s RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com returns '503 Service Unavailable'
csi-snapshot-controller 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
dns 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
etcd 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 157m
image-registry 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
ingress 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False True 152m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
insights 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
kube-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 154m
kube-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 155m
kube-scheduler 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 155m
kube-storage-version-migrator 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 159m
machine-api 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 154m
machine-approver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
machine-config 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 157m
marketplace 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
monitoring 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 147m
network 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 159m
node-tuning 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
openshift-apiserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 149m
openshift-controller-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 155m
openshift-samples 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 151m
operator-lifecycle-manager 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
operator-lifecycle-manager-catalog 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
operator-lifecycle-manager-packageserver 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 152m
service-ca 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 159m
storage 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest True False False 158m
melvinjoseph@mjoseph-mac Downloads %
Hence verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069