Bug 2094932 - MGMT-10403 Ingress should enable single-node cluster expansion on upgraded clusters
Summary: MGMT-10403 Ingress should enable single-node cluster expansion on upgraded clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Omer Tuchfeld
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-08 16:25 UTC by Omer Tuchfeld
Modified: 2022-08-10 11:17 UTC
CC: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:17:00 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift cluster-ingress-operator pull 767 (Merged): Bug 2094932: MGMT-10403: Set the `defaultPlacement` for none-platform SNO clusters installed before 4.11 (last updated 2022-06-09 15:51:29 UTC)
- Github openshift cluster-ingress-operator pull 785 (open): Bug 2094932: Ingress operator needs permission to list nodes (last updated 2022-06-16 14:01:03 UTC)
- Github openshift cluster-ingress-operator pull 788 (open): Bug 2094932: Patching config.ingresses status requires patch permissions (last updated 2022-06-21 07:22:07 UTC)
- Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:17:22 UTC)

Description Omer Tuchfeld 2022-06-08 16:25:09 UTC
Description of problem:
See MGMT-10403 for more information

OpenShift release version:
4.11

Cluster Platform:
None

How reproducible:
100%

Steps to Reproduce (in detail):
1. Install a 4.10 cluster
2. Upgrade to 4.11
3. Add worker nodes


Actual results:
Ingress pods may float to worker nodes

Expected results:
Ingress pods should stay on control-plane nodes

Impact of the problem:
See MGMT-10403

Additional info:
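For reference, a quick way to inspect the field at the heart of this bug (the same command is used in the verification further down; an empty string means the operator never set the placement):

$ oc get ingress.config cluster -o=jsonpath={.status.defaultPlacement}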

Comment 4 Hongan Li 2022-06-16 09:41:40 UTC
Upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-15-222801, but the ingress.status.defaultPlacement is still blank.

$ oc get ingress.config cluster -oyaml
<---snip--->
status:
  componentRoutes:
  - conditions:
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Degraded
    consumingUsers:
    - system:serviceaccount:oauth-openshift:authentication-operator
    currentHostnames:
    - oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    defaultHostname: oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    name: oauth-openshift
    namespace: openshift-authentication
    relatedObjects:
    - group: route.openshift.io
      name: oauth-openshift
      namespace: openshift-authentication
      resource: routes
  defaultPlacement: ""


$ oc get infrastructures.config.openshift.io cluster -oyaml
<---snip--->
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-2k9gn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None
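
Aside: per PR 767, the new upgrade logic targets none-platform single-node clusters, i.e. exactly this combination of platform: None and controlPlaneTopology: SingleReplica. A quick way to read just those two fields:

$ oc get infrastructures.config.openshift.io cluster \
    -o=jsonpath='{.status.platform} {.status.controlPlaneTopology}'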


$ oc get clusterversion/version -oyaml
<---snip--->
  history:
  - completionTime: "2022-06-16T07:22:05Z"
    image: registry.ci.openshift.org/ocp/release@sha256:bceac2ed723ce186c56b1db5e7b17cf0ef0a62e6bbfba5d545d419c3018498b2
    startedTime: "2022-06-16T06:29:58Z"
    state: Completed
    verified: true
    version: 4.11.0-0.nightly-2022-06-15-222801
  - completionTime: "2022-06-16T03:00:37Z"
    image: registry.ci.openshift.org/ocp/release@sha256:6bb01826e3996b4b792c0eed75316cfd55fd45f87fdd08a54d4953311c6ae985
    startedTime: "2022-06-16T02:22:42Z"
    state: Completed
    verified: false
    version: 4.10.0-0.nightly-2022-06-08-150219

Comment 5 Omer Tuchfeld 2022-06-16 09:58:17 UTC
Thanks for noticing this. Can you please share the ingress operator logs from that run?

Comment 7 Omer Tuchfeld 2022-06-16 12:47:51 UTC
Thanks,

> 2022-06-16T07:16:50.802Z	ERROR	operator.init	ingress-operator/start.go:197	failed to handle single node 4.11 upgrade logic	{"error": "failed fetching cluster nodes: nodes is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"}

So it's a permissions issue, which also explains why I didn't encounter this when testing locally: I was using a kubeadmin kubeconfig. Will fix.
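For anyone hitting the same thing: the missing permission can be confirmed directly with impersonation (the service account name is taken from the error above); this should print "no" until the fix lands:

$ oc auth can-i list nodes \
    --as=system:serviceaccount:openshift-ingress-operator:ingress-operator
no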

Comment 9 Hongan Li 2022-06-21 04:03:29 UTC
Checked with the latest CI build (no nightly build is available so far), but I see new errors in the logs:

2022-06-21T03:32:12.813Z        ERROR   operator.init   ingress-operator/start.go:197   failed to handle single node 4.11 upgrade logic {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}

and the ingress.status.defaultPlacement is still blank:
    defaultPlacement: ""
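The two failures so far map onto two missing RBAC rules. A minimal sketch of what the fixes need to grant, inferred from the error messages alone (the actual manifests in PRs 785 and 788 may differ):

rules:
# list cluster nodes (core API group "")
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
# patch the status subresource of the cluster ingress config
- apiGroups: ["config.openshift.io"]
  resources: ["ingresses/status"]
  verbs: ["patch"]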

Comment 12 Wenxin Wei 2022-06-21 16:19:39 UTC
Checked with the latest CI build: 4.11.0-0.nightly-2022-06-21-040754.
Upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-21-040754.

$ oc get clusterversions
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         4m48s   Cluster version is 4.11.0-0.nightly-2022-06-21-040754

$ oc get ingresses.config.openshift.io cluster -ojson | jq '.status.defaultPlacement'
""

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector | keys[] | select(. | test("node"))' | cut -d'/' -f2
worker

Error info:

2022-06-21T15:55:27.579Z	ERROR	operator.init	ingress-operator/start.go:197	failed to handle single node 4.11 upgrade logic	{"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}

2022-06-21T15:55:28.202Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "error sending canary HTTP request: DNS error: Get \"https://canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com on 172.30.0.10:53: read udp 10.128.0.105:38668->172.30.0.10:53: read: connection refused"}

2022-06-21T15:55:28.476Z	ERROR	operator.ingress_controller	controller/controller.go:114	got retryable error; requeueing	{"after": "59m59.999992937s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}

Comment 14 Hongan Li 2022-06-22 11:33:11 UTC
Upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-22-015220 and the verification passed.

$ oc get ingress.config cluster -o=jsonpath={.status.defaultPlacement}
ControlPlane

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector'
{
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": ""
}
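
As a final sanity check (beyond the selector above), the placement of the router pods themselves can be confirmed; the NODE column should show only the control-plane node:

$ oc -n openshift-ingress get pods -o wide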

Comment 15 errata-xmlrpc 2022-08-10 11:17:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

