Bug 1997226
Summary: IngressController reconciliations failing but not shown in operator logs or in the IngressController status.
Product: OpenShift Container Platform
Reporter: Manish Pandey <mapandey>
Component: Networking
Assignee: Miheer Salunke <misalunk>
Networking sub component: router
QA Contact: Arvind Iyengar <aiyengar>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aiyengar, aos-bugs, apizarro, bmcelvee, hdo, imm, jko, lmohanty, misalunk, mmasters, tkondvil, vrutkovs
Version: 4.8
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:

Cause: Before OpenShift 4.8, the IngressController API did not have any subfields under the "status.endpointPublishingStrategy.hostNetwork" and "status.endpointPublishingStrategy.nodePort" fields. As a result, these fields could be null even if the "spec.endpointPublishingStrategy.type" field was set to "HostNetwork" or "NodePortService". OpenShift 4.8 added the "status.endpointPublishingStrategy.hostNetwork.protocol" and "status.endpointPublishingStrategy.nodePort.protocol" subfields, and the ingress operator now sets default values for these subfields when the operator admits or re-admits an IngressController that specifies the "HostNetwork" or "NodePortService" strategy type, respectively. However, a cluster that was upgraded from an earlier version of OpenShift could have an already admitted IngressController with null values for these status fields even when the IngressController specified the "HostNetwork" or "NodePortService" endpoint publishing strategy type. In this case, the operator ignored updates to these spec fields.

Consequence: Updating "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" to enable PROXY protocol on an existing IngressController had no effect, and it was necessary to delete and recreate the IngressController to enable PROXY protocol.

Fix: The ingress operator was changed so that it correctly updates the status fields when "status.endpointPublishingStrategy.hostNetwork" or "status.endpointPublishingStrategy.nodePort" is null and the IngressController's spec fields specify PROXY protocol with the "HostNetwork" or "NodePortService" endpoint publishing strategy type, respectively.

Result: Setting "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" now takes effect as expected on upgraded clusters.
Story Points: ---
Clone Of:
: 2084336 (view as bug list)
Environment:
Last Closed: 2022-03-12 04:37:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2084336
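The Doc Text above refers to the spec and status forms of the endpoint publishing strategy. As a minimal, illustrative sketch of the field paths involved (values are examples only, not taken from a specific cluster):

~~~
spec:
  endpointPublishingStrategy:
    type: HostNetwork
    hostNetwork:
      protocol: PROXY      # what an administrator sets to enable PROXY protocol
status:
  endpointPublishingStrategy:
    type: HostNetwork
    hostNetwork:           # on clusters upgraded from pre-4.8 releases this subfield
      protocol: PROXY      # could be null, and the operator then ignored the spec update
~~~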
Comment 1
Miheer Salunke
2021-08-30 03:08:54 UTC
Hi, I was able to reproduce this issue in my env. I upgraded from 4.7.23 -> 4.8.4 on vSphere, and then added the following in the spec section of the default ingress controller:

~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

~~~
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T07:52:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "70349"
  uid: 9a306096-627f-4335-bddc-662540940999
spec:
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
  replicas: 2
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-08-30T07:55:44Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2021-08-30T09:28:53Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T09:29:23Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T08:00:14Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-08-30T08:01:05Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

And it did not get reflected on the router.

Workaround -> I had to delete the default ingresscontroller and then add the changes as in [0] to have them reflected in the router.

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

When I directly installed 4.8.4 on vSphere and made the changes [0], they were correctly reflected on the router. So this issue happens only during an upgrade, but I am not sure why. I think this might be related to the API storing the object in etcd using the schema from 4.7, but I am not sure. The things I checked were the YAML contents of the default ingresscontroller.
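A quick way to see what the operator has actually admitted is to compare the spec and status forms of the strategy directly; a minimal sketch (output varies per cluster):

~~~
# Compare what was requested (spec) with what the operator admitted (status).
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{.spec.endpointPublishingStrategy}{"\n"}{.status.endpointPublishingStrategy}{"\n"}'
~~~

On an upgraded cluster hitting this bug, the status line would be expected to show only the strategy type, without the hostNetwork subfield.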
A) Before deletion of the ingress controller, the YAML looked as follows; adding the following section [0] to the spec did not work:

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

Note: you won't see [0] under the spec in this YAML; I added it later and it was not correctly reflected in the router.

~~~
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T07:52:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "70349"
  uid: 9a306096-627f-4335-bddc-662540940999
spec:
  replicas: 2
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2021-08-30T07:55:44Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2021-08-30T09:28:53Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T09:29:23Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T07:55:45Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T09:23:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T08:00:14Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-08-30T08:01:05Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

B) After deletion of the ingress controller, the YAML looked as follows; adding the following section [0] to the spec did work:

[0]
~~~
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
~~~

Note: this is the YAML just created after deletion, so you won't see [0] under the spec; I added it later and it was correctly reflected in the router.

~~~
[miheer@localhost cluster-ingress-operator]$ cat ing-cont-after-deletion
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2021-08-30T15:44:10Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "209953"
  uid: d18380e7-0c39-439b-96c7-2c56a5f7fd7e
spec:
  httpErrorCodePages:
    name: ""
  replicas: 2
  tuningOptions: {}                 # <-- this was added after deletion
  unsupportedConfigOverrides: null  # <-- this was added after deletion
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-08-30T15:44:10Z"
    reason: Valid
    status: "True"
    type: Admitted
  # (please ignore the PodsScheduled condition below; it was related to my env and was fixed later)
  - lastTransitionTime: "2021-08-30T15:45:04Z"
    message: 'Some pods are not scheduled: Pod "router-default-64f9c4985b-wjj99" cannot
      be scheduled: 0/5 nodes are available: 2 node(s) didn''t have free ports for the
      requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that
      the pod didn''t tolerate. Make sure you have sufficient worker nodes.'
    reason: PodsNotScheduled
    status: "False"
    type: PodsScheduled
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    message: 1/2 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2021-08-30T15:44:11Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2021-08-30T15:44:11Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2021-08-30T15:45:49Z"
    status: "False"
    type: Degraded
  domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    hostNetwork:
      protocol: TCP
    type: HostNetwork
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~

The only difference I saw is that the tuningOptions: {} and unsupportedConfigOverrides: null fields were added after deleting the default ingress controller.
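A minimal sketch of the delete-and-recreate workaround described above, assuming the ingress operator recreates the default IngressController after deletion (expect a brief ingress disruption while the router pods are replaced):

~~~
# Delete the default IngressController; the ingress operator recreates it with default settings.
oc -n openshift-ingress-operator delete ingresscontroller default

# Once the recreated object is admitted, re-apply the PROXY protocol setting.
oc -n openshift-ingress-operator patch ingresscontroller default --type=merge \
  -p '{"spec":{"endpointPublishingStrategy":{"type":"HostNetwork","hostNetwork":{"protocol":"PROXY"}}}}'
~~~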
The infrastructures.config.openshift.io/cluster object also looked fine to me when I compared the 4.8.4 installed cluster with the 4.8.4 upgraded cluster. I think this might be related to the API storing the object in etcd using the schema from 4.7, but why it was not working after the upgrade I am not yet sure. Is the customer not OK with the workaround?

I am assigning this to the team which handles upgrades, as I am not sure why this issue is happening only after an upgrade. As mentioned earlier, it might be the case that the API was storing the object in etcd using the schema from 4.7.

Please test the credentials of the app used by the ingress operator:

~~~
# Get the credentials of the app used by ingress:
./oc-4.7.23 get secret cloud-credentials -n openshift-ingress-operator -o jsonpath='{.data}' > ig-creds.json
export azure_client_id=$(jq -r '.azure_client_id' ig-creds.json | base64 -d)
export azure_client_secret=$(jq -r '.azure_client_secret' ig-creds.json | base64 -d)
export azure_region=$(jq -r '.azure_region' ig-creds.json | base64 -d)
export azure_resource_prefix=$(jq -r '.azure_resource_prefix' ig-creds.json | base64 -d)
export azure_resourcegroup=$(jq -r '.azure_resourcegroup' ig-creds.json | base64 -d)
export azure_subscription_id=$(jq -r '.azure_subscription_id' ig-creds.json | base64 -d)
export azure_tenant_id=$(jq -r '.azure_tenant_id' ig-creds.json | base64 -d)

# Log in using these credentials
az login --service-principal -u $azure_client_id \
  --password $azure_client_secret --tenant $azure_tenant_id

# Check whether roleDefinitionName=Contributor for the app ID
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup | jq .[].roleDefinitionName
"Contributor"
~~~

Oh sorry, please ignore comment 4 as it was not meant for this bug.

CVO is applying the manifests specified by ingress. If these manifests are correct and something on the CVO side needs to be resolved, we will need clarification on which fields need to be changed. Moving back to Routing.

Any update on this?

*** Bug 2025949 has been marked as a duplicate of this bug. ***

Hello team! From a lab environment where I replicated the issue, the only difference I can see is that the data written in etcd is different for the key /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default.

Upgraded cluster:

~~~
..
"domain": "apps.o.rlab.sh",
"endpointPublishingStrategy": {
    "type": "HostNetwork"
},
..
~~~

Fresh 4.8.18 cluster:

~~~
...
"domain": "apps.ocp4upi2.rhlabs.local",
"endpointPublishingStrategy": {
    "hostNetwork": {
        "protocol": "PROXY"
    },
    "type": "HostNetwork"
},
...
~~~
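For reference, one way to inspect the stored object directly, as in the comparison above; a sketch that assumes shell access to an etcd member pod (the pod name is a placeholder) and that etcdctl is preconfigured there:

~~~
# Open a shell in one of the etcd member pods (substitute a real pod name).
oc -n openshift-etcd rsh <etcd-pod-name>

# Inside the pod: dump the stored IngressController. Custom resources are stored as JSON,
# so the endpointPublishingStrategy difference should be visible in the raw value.
etcdctl get /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default \
  --print-value-only
~~~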
Verified in the "4.10.0-0.nightly-2021-12-21-130047" release version. The change made to the "hostNetwork" protocol is reflected correctly on the router pods:

~~~
oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited

  domain: apps.aiyengar410vsp.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    hostNetwork:
      protocol: PROXY
    type: HostNetwork
  observedGeneration: 4
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default

oc -n openshift-ingress get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP               NODE                                NOMINATED NODE   READINESS GATES
router-default-5865ccbfb6-25pl9   1/1     Running   0          6m42s   172.31.249.221   aiyengar410vsp-ppvps-worker-tm99z   <none>           <none>
router-default-5865ccbfb6-xfb5n   1/1     Running   0          8m      172.31.249.77    aiyengar410vsp-ppvps-worker-87jfw   <none>           <none>

oc -n openshift-ingress rsh router-default-5865ccbfb6-xfb5n
sh-4.4$ env | grep -i ROUTER_USE_PROXY_PROTOCOL
ROUTER_USE_PROXY_PROTOCOL=true
sh-4.4$ grep -ir "accept-proxy" haproxy.config
  bind :80 accept-proxy
  bind :443 accept-proxy
~~~

Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
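For completeness, the checks shown in the verification above can also be run non-interactively; a minimal sketch using the field paths and resource names from this bug:

~~~
# The protocol the operator has admitted in status (expected: PROXY once the fix is in place).
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{.status.endpointPublishingStrategy.hostNetwork.protocol}{"\n"}'

# The corresponding router deployment environment variable (expected: true when PROXY is enabled).
oc -n openshift-ingress get deployment router-default \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ROUTER_USE_PROXY_PROTOCOL")].value}{"\n"}'
~~~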