Bug 1997226
| Summary: | IngressController reconciliations failing but not shown in operator logs or status of ingresscontroller. | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Manish Pandey <mapandey> | |
| Component: | Networking | Assignee: | Miheer Salunke <misalunk> | |
| Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | aiyengar, aos-bugs, apizarro, bmcelvee, hdo, imm, jko, lmohanty, misalunk, mmasters, tkondvil, vrutkovs | |
| Version: | 4.8 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.10.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: |
Cause: Before OpenShift 4.8, the IngressController API did not have any subfields under the "status.endpointPublishingStrategy.hostNetwork" and "status.endpointPublishingStrategy.nodePort" fields. As a result, these fields could be null even if the "spec.endpointPublishingStrategy.type" field was set to "HostNetwork" or "NodePortService". OpenShift 4.8 added the "status.endpointPublishingStrategy.hostNetwork.protocol" and "status.endpointPublishingStrategy.nodePort.protocol" subfields, and the ingress operator now sets default values for these subfields when it admits or re-admits an IngressController that specifies the "HostNetwork" or "NodePortService" strategy type, respectively. However, a cluster upgraded from an earlier version of OpenShift could have an already admitted IngressController with null values for these status fields even when the IngressController specified the "HostNetwork" or "NodePortService" endpoint publishing strategy type. In this case, the operator ignored updates to these spec fields.
Consequence: Updating "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" to enable PROXY protocol on an existing IngressController had no effect, and it was necessary to delete and recreate the IngressController to enable PROXY protocol.
Fix: The ingress operator was changed so that it correctly updates the status fields when "status.endpointPublishingStrategy.hostNetwork" or "status.endpointPublishingStrategy.nodePort" is null and the IngressController's spec fields specify PROXY protocol with the "HostNetwork" or "NodePortService" endpoint publishing strategy type, respectively.
Result: Setting "spec.endpointPublishingStrategy.hostNetwork.protocol" or "spec.endpointPublishingStrategy.nodePort.protocol" to "PROXY" now takes effect on upgraded clusters. (A minimal patch sketch follows the summary table below.)
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 2084336 (view as bug list) | Environment: | ||
| Last Closed: | 2022-03-12 04:37:30 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2084336 | |||
|
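Note on the fix described in the Doc Text above: enabling PROXY protocol on an existing IngressController amounts to setting the spec field named there. A minimal sketch of one way to do that (the exact command form is illustrative and not taken from this bug):

oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
  --patch='{"spec":{"endpointPublishingStrategy":{"type":"HostNetwork","hostNetwork":{"protocol":"PROXY"}}}}'

On clusters upgraded from releases earlier than 4.8, this change only takes effect once the fix is in place (or after the delete-and-recreate workaround described in comment 1 below).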
Comment 1
Miheer Salunke
2021-08-30 03:08:54 UTC
Hi,
I was able to reproduce this issue in my env.
I upgraded from 4.7.23 -> 4.8.4 on vSphere.
Then I added the following in the spec section of the default ingress controller ->
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
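The comment does not show the exact command used to make this edit; a typical way (assumed here, not stated in the comment) would be:

oc -n openshift-ingress-operator edit ingresscontroller default
# then add the endpointPublishingStrategy stanza above under .spec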
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
creationTimestamp: "2021-08-30T07:52:20Z"
finalizers:
- ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
generation: 1
name: default
namespace: openshift-ingress-operator
resourceVersion: "70349"
uid: 9a306096-627f-4335-bddc-662540940999
spec:
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
replicas: 2
status:
availableReplicas: 2
conditions:
- lastTransitionTime: "2021-08-30T07:55:44Z"
reason: Valid
status: "True"
type: Admitted
- lastTransitionTime: "2021-08-30T09:28:53Z"
status: "True"
type: PodsScheduled
- lastTransitionTime: "2021-08-30T09:23:30Z"
message: The deployment has Available status condition set to True
reason: DeploymentAvailable
status: "True"
type: DeploymentAvailable
- lastTransitionTime: "2021-08-30T09:23:30Z"
message: Minimum replicas requirement is met
reason: DeploymentMinimumReplicasMet
status: "True"
type: DeploymentReplicasMinAvailable
- lastTransitionTime: "2021-08-30T09:29:23Z"
message: All replicas are available
reason: DeploymentReplicasAvailable
status: "True"
type: DeploymentReplicasAllAvailable
- lastTransitionTime: "2021-08-30T07:55:45Z"
message: The configured endpoint publishing strategy does not include a managed
load balancer
reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
status: "False"
type: LoadBalancerManaged
- lastTransitionTime: "2021-08-30T07:55:45Z"
message: No DNS zones are defined in the cluster dns config.
reason: NoDNSZones
status: "False"
type: DNSManaged
- lastTransitionTime: "2021-08-30T09:23:30Z"
status: "True"
type: Available
- lastTransitionTime: "2021-08-30T08:00:14Z"
status: "False"
type: Degraded
- lastTransitionTime: "2021-08-30T08:01:05Z"
message: Canary route checks for the default ingress controller are successful
reason: CanaryChecksSucceeding
status: "True"
type: CanaryChecksSucceeding
domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
endpointPublishingStrategy:
type: HostNetwork
observedGeneration: 1
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
tlsProfile:
ciphers:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- ECDHE-ECDSA-AES128-GCM-SHA256
- ECDHE-RSA-AES128-GCM-SHA256
- ECDHE-ECDSA-AES256-GCM-SHA384
- ECDHE-RSA-AES256-GCM-SHA384
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-RSA-CHACHA20-POLY1305
- DHE-RSA-AES128-GCM-SHA256
- DHE-RSA-AES256-GCM-SHA384
minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
And it did not get reflected on the router.
Workaround -> I had to delete the default ingresscontroller and then re-add the changes shown in [0] to have them reflected in the router (a command sketch follows [0] below).
[0]
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
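A sketch of the workaround steps as described above (the ingress operator recreates the default IngressController after it is deleted; the command forms are illustrative):

oc -n openshift-ingress-operator delete ingresscontroller default
# wait for the operator to recreate the default IngressController, then re-add the [0] stanza:
oc -n openshift-ingress-operator edit ingresscontroller default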
When I directly installed 4.8.4 on vSphere and made the changes in [0], they were correctly reflected on the router.
So this issue happens only during upgrade, but I am not sure why. I suspect the API server might have been storing the object in etcd using the schema from 4.7, but I am not certain.
What I checked was the YAML contents of the default ingresscontroller.
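(For reference, the YAML dumps below can be reproduced with something like: oc -n openshift-ingress-operator get ingresscontroller default -o yaml.)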
A) Before deleting the ingress controller (at this point, adding the following section [0] to the spec did not work), the YAML looked as follows -
[0]
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
~~~~~~~~~~~~~
Note: you won't see [0] under the spec in this YAML; I added it later and it was not correctly reflected in the router.
[miheer@localhost cluster-ingress-operator]$ cat default-ingress-controller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
creationTimestamp: "2021-08-30T07:52:20Z"
finalizers:
- ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
generation: 1
name: default
namespace: openshift-ingress-operator
resourceVersion: "70349"
uid: 9a306096-627f-4335-bddc-662540940999
spec:
replicas: 2
status:
availableReplicas: 2
conditions:
- lastTransitionTime: "2021-08-30T07:55:44Z"
reason: Valid
status: "True"
type: Admitted
- lastTransitionTime: "2021-08-30T09:28:53Z"
status: "True"
type: PodsScheduled
- lastTransitionTime: "2021-08-30T09:23:30Z"
message: The deployment has Available status condition set to True
reason: DeploymentAvailable
status: "True"
type: DeploymentAvailable
- lastTransitionTime: "2021-08-30T09:23:30Z"
message: Minimum replicas requirement is met
reason: DeploymentMinimumReplicasMet
status: "True"
type: DeploymentReplicasMinAvailable
- lastTransitionTime: "2021-08-30T09:29:23Z"
message: All replicas are available
reason: DeploymentReplicasAvailable
status: "True"
type: DeploymentReplicasAllAvailable
- lastTransitionTime: "2021-08-30T07:55:45Z"
message: The configured endpoint publishing strategy does not include a managed
load balancer
reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
status: "False"
type: LoadBalancerManaged
- lastTransitionTime: "2021-08-30T07:55:45Z"
message: No DNS zones are defined in the cluster dns config.
reason: NoDNSZones
status: "False"
type: DNSManaged
- lastTransitionTime: "2021-08-30T09:23:30Z"
status: "True"
type: Available
- lastTransitionTime: "2021-08-30T08:00:14Z"
status: "False"
type: Degraded
- lastTransitionTime: "2021-08-30T08:01:05Z"
message: Canary route checks for the default ingress controller are successful
reason: CanaryChecksSucceeding
status: "True"
type: CanaryChecksSucceeding
domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
endpointPublishingStrategy:
type: HostNetwork
observedGeneration: 1
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
tlsProfile:
ciphers:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- ECDHE-ECDSA-AES128-GCM-SHA256
- ECDHE-RSA-AES128-GCM-SHA256
- ECDHE-ECDSA-AES256-GCM-SHA384
- ECDHE-RSA-AES256-GCM-SHA384
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-RSA-CHACHA20-POLY1305
- DHE-RSA-AES128-GCM-SHA256
- DHE-RSA-AES256-GCM-SHA384
minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~~~~~~~~~~~~
B) After deleting the ingress controller (at this point, adding the following section [0] to the spec did work), the YAML looked as follows -
[0]
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
~~~~~~~~~~~~~~
Note: this is the YAML just after the ingresscontroller was recreated following deletion, so you won't see [0] under the spec. I added it later and it was correctly reflected in the router.
[miheer@localhost cluster-ingress-operator]$ cat ing-cont-after-deletion
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
creationTimestamp: "2021-08-30T15:44:10Z"
finalizers:
- ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
generation: 1
name: default
namespace: openshift-ingress-operator
resourceVersion: "209953"
uid: d18380e7-0c39-439b-96c7-2c56a5f7fd7e
spec:
httpErrorCodePages:
name: ""
replicas: 2
tuningOptions: {} ------------------------------------------------------------------------------This was added after deletion
unsupportedConfigOverrides: null -------------------------------------------------------------- This was added after deletion
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2021-08-30T15:44:10Z"
reason: Valid
status: "True"
type: Admitted
- lastTransitionTime: "2021-08-30T15:45:04Z"
message: 'Some pods are not scheduled: Pod "router-default-64f9c4985b-wjj99" cannot ---------------------Please ignore this as this was related to my env which was fixed later.
be scheduled: 0/5 nodes are available: 2 node(s) didn''t have free ports for
the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master:
}, that the pod didn''t tolerate. Make sure you have sufficient worker nodes.'
reason: PodsNotScheduled
status: "False"
type: PodsScheduled
- lastTransitionTime: "2021-08-30T15:45:49Z"
message: The deployment has Available status condition set to True
reason: DeploymentAvailable
status: "True"
type: DeploymentAvailable
- lastTransitionTime: "2021-08-30T15:45:49Z"
message: Minimum replicas requirement is met
reason: DeploymentMinimumReplicasMet
status: "True"
type: DeploymentReplicasMinAvailable
- lastTransitionTime: "2021-08-30T15:45:49Z"
message: 1/2 of replicas are available
reason: DeploymentReplicasNotAvailable
status: "False"
type: DeploymentReplicasAllAvailable
- lastTransitionTime: "2021-08-30T15:44:11Z"
message: The configured endpoint publishing strategy does not include a managed
load balancer
reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
status: "False"
type: LoadBalancerManaged
- lastTransitionTime: "2021-08-30T15:44:11Z"
message: No DNS zones are defined in the cluster dns config.
reason: NoDNSZones
status: "False"
type: DNSManaged
- lastTransitionTime: "2021-08-30T15:45:49Z"
status: "True"
type: Available
- lastTransitionTime: "2021-08-30T15:45:49Z"
status: "False"
type: Degraded
domain: apps.mislaunkvsphereipi.qe.devcluster.openshift.com
endpointPublishingStrategy:
hostNetwork:
protocol: TCP
type: HostNetwork
observedGeneration: 1
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
tlsProfile:
ciphers:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- ECDHE-ECDSA-AES128-GCM-SHA256
- ECDHE-RSA-AES128-GCM-SHA256
- ECDHE-ECDSA-AES256-GCM-SHA384
- ECDHE-RSA-AES256-GCM-SHA384
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-RSA-CHACHA20-POLY1305
- DHE-RSA-AES128-GCM-SHA256
- DHE-RSA-AES256-GCM-SHA384
minTLSVersion: VersionTLS12
[miheer@localhost cluster-ingress-operator]$
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The only difference I saw is that the
tuningOptions: {}
unsupportedConfigOverrides: null
fields were added after deleting the default ingress controller.
The infrastructures.config.openshift.io/cluster resource also looked fine to me when I compared the freshly installed 4.8.4 cluster with the cluster upgraded to 4.8.4.
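(That comparison can be reproduced with something like: oc get infrastructures.config.openshift.io cluster -o yaml on each cluster.)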
I think this might be related to the API server storing the object in etcd using the schema from 4.7, but again I am not yet sure why it was not working after the upgrade.
Is the customer not OK with the workaround?
I am assigning this to the team that handles upgrades, as I am not sure why this issue is happening only after upgrade. As mentioned earlier, it might be the case that the API server has been storing the object in etcd using the schema from 4.7.

Please test the credentials of the app used by the ingress operator:
- Get credentials of App used by ingress:
./oc-4.7.23 get secret cloud-credentials -n openshift-ingress-operator -o jsonpath='{.data}' > ig-creds.json
export azure_client_id=$(jq -r '.azure_client_id' ig-creds.json |base64 -d )
export azure_client_secret=$(jq -r '.azure_client_secret' ig-creds.json |base64 -d )
export azure_region=$(jq -r '.azure_region' ig-creds.json |base64 -d )
export azure_resource_prefix=$(jq -r '.azure_resource_prefix' ig-creds.json |base64 -d )
export azure_resourcegroup=$(jq -r '.azure_resourcegroup' ig-creds.json |base64 -d )
export azure_subscription_id=$(jq -r '.azure_subscription_id' ig-creds.json |base64 -d )
export azure_tenant_id=$(jq -r '.azure_tenant_id' ig-creds.json |base64 -d )
- Log in using these credentials
az login --service-principal -u $azure_client_id \
--password $azure_client_secret --tenant $azure_tenant_id
- Check if roleDefinitionName=Contributor for AppId
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup
az role assignment list --assignee $azure_client_id -g $azure_resourcegroup |jq .[].roleDefinitionName
"Contributor"
Oh sorry, please ignore comment 4 as it was not meant for this bug.

CVO is applying manifests specified by ingress. If these manifests are correct and CVO needs to be resolved, we'll need clarification on which fields need to be changed. Moving back to Routing.

Any update on this?

*** Bug 2025949 has been marked as a duplicate of this bug. ***

Hello team! From a lab environment where I replicated the issue, the only difference I can see is that the data written in etcd differs for the key /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default.
Upgraded cluster:
---
..
"domain": "apps.o.rlab.sh",
"endpointPublishingStrategy": {
"type": "HostNetwork"
},
..
---
Fresh 4.8.18 cluster:
---
...
"domain": "apps.ocp4upi2.rhlabs.local",
"endpointPublishingStrategy": {
"hostNetwork": {
"protocol": "PROXY"
},
"type": "HostNetwork"
},
...
---
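For reference, one way to inspect that etcd key directly (a sketch; the etcd pod name is environment-specific, and IngressController objects are stored as JSON since they are backed by a CRD):

oc -n openshift-etcd rsh <etcd-pod-name>
etcdctl get /kubernetes.io/operator.openshift.io/ingresscontrollers/openshift-ingress-operator/default --print-value-only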
Verified in the "4.10.0-0.nightly-2021-12-21-130047" release version. Changes made to the "hostNetwork" protocol get reflected correctly on the router pods:
--------
oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited
domain: apps.aiyengar410vsp.qe.devcluster.openshift.com
endpointPublishingStrategy:
hostNetwork:
protocol: PROXY
type: HostNetwork
observedGeneration: 4
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
oc -n openshift-ingress get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
router-default-5865ccbfb6-25pl9 1/1 Running 0 6m42s 172.31.249.221 aiyengar410vsp-ppvps-worker-tm99z <none> <none>
router-default-5865ccbfb6-xfb5n 1/1 Running 0 8m 172.31.249.77 aiyengar410vsp-ppvps-worker-87jfw <none> <none>
oc -n openshift-ingress rsh router-default-5865ccbfb6-xfb5n
sh-4.4$ env | grep -i ROUTER_USE_PROXY_PROTOCOL
ROUTER_USE_PROXY_PROTOCOL=true
sh-4.4$ grep -ir "accept-proxy" haproxy.config
bind :80 accept-proxy
bind :443 accept-proxy
--------
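In addition to the in-pod checks above, the status field that this fix populates can be checked directly (a sketch using the same IngressController as above):

oc -n openshift-ingress-operator get ingresscontroller default -o jsonpath='{.status.endpointPublishingStrategy.hostNetwork.protocol}'
# expected to print PROXY once the spec change has been reconciled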
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056