Bug 2094932
Summary: | MGMT-10403 Ingress should enable single-node cluster expansion on upgraded clusters | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Omer Tuchfeld <otuchfel>
Component: | Networking | Assignee: | Omer Tuchfeld <otuchfel>
Networking sub component: | router | QA Contact: | Hongan Li <hongli>
Status: | CLOSED ERRATA | Severity: | urgent
Priority: | unspecified | CC: | aos-bugs, jhou, mmasters, wking, wwei
Version: | 4.11 | Target Release: | 4.11.0
Hardware: | Unspecified | OS: | Unspecified
Doc Type: | No Doc Update | Type: | Bug
Last Closed: | 2022-08-10 11:17:00 UTC | |
Description (Omer Tuchfeld, 2022-06-08 16:25:09 UTC)
Upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-15-222801, but the ingress.status.defaultPlacement is still blank.

```console
$ oc get ingress.config cluster -oyaml
<---snip--->
status:
  componentRoutes:
  - conditions:
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-06-16T06:49:41Z"
      message: All is well
      reason: AsExpected
      status: "False"
      type: Degraded
    consumingUsers:
    - system:serviceaccount:oauth-openshift:authentication-operator
    currentHostnames:
    - oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    defaultHostname: oauth-openshift.apps.hongli-sno.qe.devcluster.openshift.com
    name: oauth-openshift
    namespace: openshift-authentication
    relatedObjects:
    - group: route.openshift.io
      name: oauth-openshift
      namespace: openshift-authentication
      resource: routes
  defaultPlacement: ""

$ oc get infrastructures.config.openshift.io cluster -oyaml
<---snip--->
status:
  apiServerInternalURI: https://api-int.hongli-sno.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.hongli-sno.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  etcdDiscoveryDomain: ""
  infrastructureName: hongli-sno-2k9gn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None

$ oc get clusterversion/version -oyaml
<---snip--->
  history:
  - completionTime: "2022-06-16T07:22:05Z"
    image: registry.ci.openshift.org/ocp/release@sha256:bceac2ed723ce186c56b1db5e7b17cf0ef0a62e6bbfba5d545d419c3018498b2
    startedTime: "2022-06-16T06:29:58Z"
    state: Completed
    verified: true
    version: 4.11.0-0.nightly-2022-06-15-222801
  - completionTime: "2022-06-16T03:00:37Z"
    image: registry.ci.openshift.org/ocp/release@sha256:6bb01826e3996b4b792c0eed75316cfd55fd45f87fdd08a54d4953311c6ae985
    startedTime: "2022-06-16T02:22:42Z"
    state: Completed
    verified: false
    version: 4.10.0-0.nightly-2022-06-08-150219
```

Thanks for noticing this.
Can you please share the ingress operator logs from that run? Thanks.
```
2022-06-16T07:16:50.802Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "failed fetching cluster nodes: nodes is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"}
```
So it's a permissions issue, which also explains why I didn't encounter this when testing locally: I used a kubeadmin kubeconfig. Will fix.
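If the fix follows the usual RBAC pattern, it would add a rule like the following to the ClusterRole bound to the ingress-operator service account. This is a sketch inferred from the error message; the exact manifest location and verb list are assumptions.

```yaml
# Hypothetical RBAC rule sketch: grant the ingress operator read access to
# cluster nodes, which the "cannot list resource nodes" error shows it lacks.
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
```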
Checked with the latest CI build (no nightly build was available yet), but see new errors in the logs:

```
2022-06-21T03:32:12.813Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}
```

And the ingress.status.defaultPlacement is still blank (defaultPlacement: "").

Checked with the latest CI build, 4.11.0-0.nightly-2022-06-21-040754, upgrading from 4.10.0-0.nightly-2022-06-08-150219:

```console
$ oc get clusterversions
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         4m48s   Cluster version is 4.11.0-0.nightly-2022-06-21-040754

$ oc get ingresses.config.openshift.io cluster -ojson | jq '.status.defaultPlacement'
""

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector | keys[] | select(. | test("node"))' | cut -d'/' -f2
worker
```

Error info:

```
2022-06-21T15:55:27.579Z ERROR operator.init ingress-operator/start.go:197 failed to handle single node 4.11 upgrade logic {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}
2022-06-21T15:55:28.202Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "error sending canary HTTP request: DNS error: Get \"https://canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com\": dial tcp: lookup canary-openshift-ingress-canary.apps.wwei-0621h.qe.devcluster.openshift.com on 172.30.0.10:53: read udp 10.128.0.105:38668->172.30.0.10:53: read: connection refused"}
2022-06-21T15:55:28.476Z ERROR operator.ingress_controller controller/controller.go:114 got retryable error; requeueing {"after": "59m59.999992937s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
```

Upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-22-015220 and the test passed:

```console
$ oc get ingress.config cluster -o=jsonpath={.status.defaultPlacement}
ControlPlane

$ oc get deployment -n openshift-ingress -ojson | jq -r '.items[].spec.template.spec.nodeSelector'
{
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": ""
}
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
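The second forbidden error seen during verification is the same class of problem: the service account also lacked patch on the ingresses/status subresource in the config.openshift.io group. A companion rule along these lines would cover it; again, this is a sketch inferred from the log, not the verified manifest from the fix.

```yaml
# Hypothetical RBAC rule sketch: allow the operator to patch the ingress
# config's status subresource, per the second forbidden error in the logs.
- apiGroups:
  - config.openshift.io
  resources:
  - ingresses/status
  verbs:
  - patch
```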