Bug 1997226
Summary: IngressController reconciliations failing but not shown in operator logs or in the IngressController status.
Product: OpenShift Container Platform
Reporter: Manish Pandey <mapandey>
Component: Networking
Assignee: Miheer Salunke <misalunk>
Networking sub component: router
QA Contact: Arvind Iyengar <aiyengar>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aiyengar, aos-bugs, apizarro, bmcelvee, hdo, imm, jko, lmohanty, misalunk, mmasters, tkondvil, vrutkovs
Version: 4.8
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:

Cause: Before OpenShift 4.8, the IngressController API did not have any subfields under the "status.endpointPublishingStrategy.hostNetwork" and "status.endpointPublishingStrategy.nodePort" fields. As a result, these fields could be null even if the "spec.endpointPublishingStrategy.type" field was set to "HostNetwork" or "NodePortService". OpenShift 4.8 added the "status.endpointPublishingStrategy.hostNetwork.protocol" and "status.endpointPublishingStrategy.nodePort.protocol" subfields, and the ingress operator now sets default values for these subfields when the operator admits or re-admits an IngressController that specifies the "HostNetwork" or "NodePortService" strategy type, respectively. However, a cluster that was upgraded from an earlier version of OpenShift could have an already admitted IngressController with null values for these status fields even when the IngressController specified the "HostNetwork" or "NodePortService" endpoint publishing strategy type. In this case, the operator ignored updates to these spec fields.

Consequence: Updating "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" to enable PROXY protocol on an existing IngressController had no effect, and it was necessary to delete and recreate the IngressController to enable PROXY protocol.

Fix: The ingress operator was changed so that it correctly updates the status fields when "status.endpointPublishingStrategy.hostNetwork" or "status.endpointPublishingStrategy.nodePort" is null and the IngressController's spec fields specify PROXY protocol with the "HostNetwork" or "NodePortService" endpoint publishing strategy type, respectively.

Result: Setting "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" now takes effect as expected on upgraded clusters.
Story Points: ---
Clone Of:
: 2084336 (view as bug list)
Environment:
Last Closed: 2022-03-12 04:37:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2084336
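The Doc Text above refers to the spec and status forms of the endpoint publishing strategy. As a minimal, illustrative sketch of the field paths involved (values are examples only, not taken from a specific cluster):

~~~
spec:
  endpointPublishingStrategy:
    type: HostNetwork
    hostNetwork:
      protocol: PROXY      # what an administrator sets to enable PROXY protocol
status:
  endpointPublishingStrategy:
    type: HostNetwork
    hostNetwork:           # on clusters upgraded from pre-4.8 releases this subfield
      protocol: PROXY      # could be null, and the operator then ignored the spec update
~~~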
Comment 1
Miheer Salunke
2021-08-30 03:08:54 UTC
Hi, I was able to reproduce this issue in my env. I upgraded from 4.7.23 -> 4.8.4 on vSphere, and then added the following in the spec section of the default ingress controller:

~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

~~~
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T07:52:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "70349"
  uid: 9a306096-627f-4335-bddc-662540940999
spec:
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
  replicas: 2
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-08-30T07:55:44Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2021-08-30T09:28:53Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T09:29:23Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T08:00:14Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-08-30T08:01:05Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

And it did not get reflected on the router.

Workaround -> I had to delete the default ingresscontroller and then add the changes as in [0] to have them reflected in the router.

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

When I directly installed 4.8.4 on vSphere and made the changes [0], they were correctly reflected on the router. So this issue happens only during an upgrade, but I am not sure why. I think this might be related to the API storing the object in etcd using the schema from 4.7, but I am not sure. The things I checked were the YAML contents of the default ingresscontroller.
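A quick way to see what the operator has actually admitted is to compare the spec and status forms of the strategy directly; a minimal sketch (output varies per cluster):

~~~
# Compare what was requested (spec) with what the operator admitted (status).
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{.spec.endpointPublishingStrategy}{"\n"}{.status.endpointPublishingStrategy}{"\n"}'
~~~

On an upgraded cluster hitting this bug, the status line would be expected to show only the strategy type, without the hostNetwork subfield.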
A) Before deletion of the ingress controller, the YAML looked as follows; adding the following section [0] to the spec did not work:

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

Note: you won't see [0] under the spec in this YAML; I added it later and it was not correctly reflected in the router.

~~~
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T07:52:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "70349"
  uid: 9a306096-627f-4335-bddc-662540940999
spec:
  replicas: 2
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-08-30T07:55:44Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2021-08-30T09:28:53Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T09:29:23Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T08:00:14Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-08-30T08:01:05Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

B) After deletion of the ingress controller, the YAML looked as follows; adding the following section [0] to the spec did work:

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

Note: this is the YAML just created after deletion, so you won't see [0] under the spec; I added it later and it was correctly reflected in the router.

~~~
[miheer@localhost cluster-ingress-operator]$ cat ing-cont-after-deletion
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T15:44:10Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "209953"
  uid: d18380e7-0c39-439b-96c7-2c56a5f7fd7e
spec:
  httpErrorCodePages:
    name: ""
  replicas: 2
  tuningOptions: {}                 # <-- this was added after deletion
  unsupportedConfigOverrides: null  # <-- this was added after deletion
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-08-30T15:44:10Z"
    reason: Valid
    status: "True"
    type: Admitted
  # (please ignore the PodsScheduled condition below; it was related to my env and was fixed later)
  - lastTransitionTime: "2021-08-30T15:45:04Z"
    message: 'Some pods are not scheduled: Pod "router-default-64f9c4985b-wjj99" cannot
      be scheduled: 0/5 nodes are available: 2 node(s) didn''t have free ports for the
      requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that
      the pod didn''t tolerate. Make sure you have sufficient worker nodes.'
    reason: PodsNotScheduled
    status: "False"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: 1/2 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T15:44:11Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T15:44:11Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    status: "False"
    type: Degraded
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    hostNetwork:
      protocol: TCP
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

The only difference I saw is that the tuningOptions: {} and unsupportedConfigOverrides: null fields were added after deleting the default ingress controller.
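A minimal sketch of the delete-and-recreate workaround described above, assuming the ingress operator recreates the default IngressController after deletion (expect a brief ingress disruption while the router pods are replaced):

~~~
# Delete the default IngressController; the ingress operator recreates it with default settings.
oc -n openshift-ingress-operator delete ingresscontroller default

# Once the recreated object is admitted, re-apply the PROXY protocol setting.
oc -n openshift-ingress-operator patch ingresscontroller default --type=merge \
  -p '{"spec":{"endpointPublishingStrategy":{"type":"HostNetwork","hostNetwork":{"protocol":"PROXY"}}}}'
~~~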
The infrastructures.config.openshift.io/cluster object also looked fine to me when I compared the 4.8.4 installed cluster with the 4.8.4 upgraded cluster. I think this might be related to the API storing the object in etcd using the schema from 4.7, but why it was not working after the upgrade I am not yet sure. Is the customer not OK with the workaround?

I am assigning this to the team which handles upgrades, as I am not sure why this issue is happening only after an upgrade. As mentioned earlier, it might be the case that the API was storing the object in etcd using the schema from 4.7.

Please test the credentials of the app used by the ingress operator:

~~~
# Get the credentials of the app used by ingress:
./oc-4.7.23 get secret cloud-credentials -n openshift-ingress-operator -o jsonpath='{.data}' > ig-creds.json
export azure_client_id=$(jq -r '.azure_client_id' ig-creds.json | base64 -d)
export azure_client_secret=$(jq -r '.azure_client_secret' ig-creds.json | base64 -d)
export azure_region=$(jq -r '.azure_region' ig-creds.json | base64 -d)
export azure_resource_prefix=$(jq -r '.azure_resource_prefix' ig-creds.json | base64 -d)
export azure_resourcegroup=$(jq -r '.azure_resourcegroup' ig-creds.json | base64 -d)
export azure_subscription_id=$(jq -r '.azure_subscription_id' ig-creds.json | base64 -d)
export azure_tenant_id=$(jq -r '.azure_tenant_id' ig-creds.json | base64 -d)

# Log in using these credentials
az login --service-principal -u $azure_client_id \
  --password $azure_client_secret --tenant $azure_tenant_id

# Check whether roleDefinitionName=Contributor for the app ID
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup | jq .[].roleDefinitionName
"Contributor"
~~~

Oh sorry, please ignore comment 4 as it was not meant for this bug.

CVO is applying the manifests specified by ingress. If these manifests are correct and something on the CVO side needs to be resolved, we will need clarification on which fields need to be changed. Moving back to Routing.

Any update on this?

*** Bug 2025949 has been marked as a duplicate of this bug. ***

Hello team! From a lab environment where I replicated the issue, the only difference I can see is that the data written in etcd is different for the key /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default.

Upgraded cluster:

~~~
..
"domain": "apps.o.rlab.sh",
"endpointPublishingStrategy": {
    "type": "HostNetwork"
},
..
~~~

Fresh 4.8.18 cluster:

~~~
...
"domain": "apps.ocp4upi2.rhlabs.local",
"endpointPublishingStrategy": {
    "hostNetwork": {
        "protocol": "PROXY"
    },
    "type": "HostNetwork"
},
...
~~~
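For reference, one way to inspect the stored object directly, as in the comparison above; a sketch that assumes shell access to an etcd member pod (the pod name is a placeholder) and that etcdctl is preconfigured there:

~~~
# Open a shell in one of the etcd member pods (substitute a real pod name).
oc -n openshift-etcd rsh <etcd-pod-name>

# Inside the pod: dump the stored IngressController. Custom resources are stored as JSON,
# so the endpointPublishingStrategy difference should be visible in the raw value.
etcdctl get /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default \
  --print-value-only
~~~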
Verified in the "4.10.0-0.nightly-2021-12-21-130047" release version. The change made to the "hostNetwork" protocol is reflected correctly on the router pods:

~~~
oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited

  domain: apps.aiyengar410vsp.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
  observedGeneration: 4
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default

oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP               NODE                                NOMINATED NODE   READINESS GATES
router-default-5865ccbfb6-25pl9   1/1     Running   0          6m42s   172.31.249.221   aiyengar410vsp-ppvps-worker-tm99z   <none>           <none>
router-default-5865ccbfb6-xfb5n   1/1     Running   0          8m      172.31.249.77    aiyengar410vsp-ppvps-worker-87jfw   <none>           <none>

oc -n openshift-ingress rsh router-default-5865ccbfb6-xfb5n
sh-4.4$ env | grep -i ROUTER_USE_PROXY_PROTOCOL
ROUTER_USE_PROXY_PROTOCOL=true
sh-4.4$ grep -ir "accept-proxy" haproxy.config
  bind :80 accept-proxy
  bind :443 accept-proxy
~~~

Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
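For completeness, the checks shown in the verification above can also be run non-interactively; a minimal sketch using the field paths and resource names from this bug:

~~~
# The protocol the operator has admitted in status (expected: PROXY once the fix is in place).
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{.status.endpointPublishingStrategy.hostNetwork.protocol}{"\n"}'

# The corresponding router deployment environment variable (expected: true when PROXY is enabled).
oc -n openshift-ingress get deployment router-default \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ROUTER_USE_PROXY_PROTOCOL")].value}{"\n"}'
~~~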