Description of problem:
See MGMT-10403 for more information.

OpenShift release version: 4.11

Cluster Platform: None

How reproducible: 100%

Steps to Reproduce (in detail):
1. Install a 4.10 cluster
2. Upgrade to 4.11
3. Add worker nodes

Actual results:
Ingress pods may float to worker nodes.

Expected results:
Ingress pods should stay on control-plane nodes.

Impact of the problem:
See MGMT-10403.

Additional info:
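One quick way to see where the router pods actually land after adding workers (standard oc usage; router-default is the deployment created for the default IngressController, and pod names will differ per cluster):

$ oc get pods -n openshift-ingress -o wide
$ oc get deployment/router-default -n openshift-ingress -o jsonpath='{.spec.template.spec.nodeSelector}'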
upgrade from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-15-222801, but the ingress.status.defaultPlacement is still blank.

$ oc get ingress.config cluster -oyaml
<---snip--->
status:
  componentRoutes:
  - conditions:
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Degraded
    consumingUsers:
    - system:serviceaccount:oauth-openshift:authentication-operator
    currentHostnames:
    - oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    defaultHostname: oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    name: oauth-openshift
    namespace: openshift-authentication
    relatedObjects:
    - group: route.openshift.io
      name: oauth-openshift
      namespace: openshift-authentication
      resource: routes
  defaultPlacement: ""

$ oc get infrastructures.config.openshift.io cluster -oyaml
<---snip--->
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-2k9gn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None

$ oc get clusterversion/version -oyaml
<---snip--->
history:
- completionTime: "2022-06-16T07:22:05Z"
  image: registry.ci.openshift.org/ocp/release@sha256:bceac2ed723ce186c56b1db5e7b17cf0ef0a62e6bbfba5d545d419c3018498b2
  startedTime: "2022-06-16T06:29:58Z"
  state: Completed
  verified: true
  version: 4.11.0-0.nightly-2022-06-15-222801
- completionTime: "2022-06-16T03:00:37Z"
  image: registry.ci.openshift.org/ocp/release@sha256:6bb01826e3996b4b792c0eed75316cfd55fd45f87fdd08a54d4953311c6ae985
  startedTime: "2022-06-16T02:22:42Z"
  state: Completed
  verified: false
  version: 4.10.0-0.nightly-2022-06-08-150219
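For comparison: with controlPlaneTopology: SingleReplica, the 4.11 upgrade logic is expected to populate the field, so a healthy cluster should show (this is the value the verification later in this bug confirms):

status:
  defaultPlacement: ControlPlane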
Thanks for noticing this. Can you please share the ingress operator logs from that run?
Thanks,

> 2022-06-16T07:16:50.802Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "failed fetching cluster nodes: nodes is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"}

So it's a permissions issue, which also explains why I didn't encounter this when testing locally: I was using a kubeadmin kubeconfig. Will fix.
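For reference, the missing permission corresponds to an RBAC rule along these lines (a minimal sketch in standard Kubernetes RBAC syntax; the ClusterRole name and where the rule lives in the cluster-ingress-operator manifests are illustrative):

# Illustrative ClusterRole excerpt: lets the ingress-operator service
# account list cluster-scoped nodes during the single-node upgrade check.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openshift-ingress-operator   # name is illustrative
rules:
- apiGroups: [""]                    # core API group, per the error message
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]    # "list" is what the error shows; get/watch are typical companions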
checked with the latest CI build (no nightly build is available yet) but see a new error in the logs:

2022-06-21T03:32:12.813Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}

and ingress.status.defaultPlacement is still blank:

defaultPlacement: ""
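This second failure points at a separate missing grant: patch rights on the status subresource of the ingress config. A sketch of the additional rule, with the same caveats as above (subresources are granted with the resource/subresource form):

# Illustrative addition to the same ClusterRole: allows updating the
# status subresource of the cluster ingress config.
- apiGroups: ["config.openshift.io"]
  resources: ["ingresses/status"]
  verbs: ["patch", "update"]         # "patch" is what the error shows; update is a typical companion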
checked with the latest CI build: 4.11.0-0.nightly-2022-06-21-040754

upgrade from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-21-040754

$ oc get clusterversions
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         4m48s   Cluster version is 4.11.0-0.nightly-2022-06-21-040754

$ oc get ingresses.config.openshift.io cluster -ojson | jq '.status.defaultPlacement'
""

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector | keys[] | select(. | test("node"))' | cut -d'/' -f2
worker

error info:

2022-06-21T15:55:27.579Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}

2022-06-21T15:55:28.202Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "error sending canary HTTP request: DNS error: Get \"https://canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com on 172.30.0.10:53: read udp 10.128.0.105:38668->172.30.0.10:53: read: connection refused"}

2022-06-21T15:55:28.476Z ERROR operator.ingress_controller controller/controller.go:114 got retryable error; requeueing {"after": "59m59.999992937s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
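Once a build with the RBAC fix is available, both grants can be checked directly via impersonation (standard oc auth can-i usage; on a fixed build each command should print "yes"):

$ oc auth can-i list nodes \
    --as=system:serviceaccount:openshift-ingress-operator:ingress-operator
$ oc auth can-i patch ingresses.config.openshift.io --subresource=status \
    --as=system:serviceaccount:openshift-ingress-operator:ingress-operator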
upgrade from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-22-015220 and passed.

$ oc get ingress.config cluster -o=jsonpath={.status.defaultPlacement}
ControlPlane

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector'
{
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": ""
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069