Bug 1979312
| Summary: | network clusteroperator is degraded after upgrading to OCP 4.7.18 in Azure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | oarribas <oarribas> |
| Component: | Networking | Assignee: | mcambria <mcambria> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aconstan, apurty, bbennett, christopher.obrien, dcaldwel, dramseur, dseals, jjacob, kechung, mangirdas, mcambria, mjudeiki, oarribas, rgopired |
| Version: | 4.7 | Keywords: | ServiceDeliveryImpact |
| Target Milestone: | --- | Flags: | oarribas: needinfo- |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-21 14:23:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description oarribas 2021-07-05 15:02:19 UTC
The issue is that we basically started deploying the same workaround from two locations.

Temporary fix (ARO):
1. Scale down the deployment for the aro-master operator in openshift-azure-operator.
2. Remove the openshift-azure-routefix namespace.

For non-ARO:
1. Remove the azure-routefix namespace (I'm making the assumption that this applies if https://access.redhat.com/solutions/5252831 shows the fix was applied).

ARO will adjust its operator to stop deploying our side of the "fix" from version 4.7.15. The fix is rolling out now.

---

We are hitting this issue for a 4.6.26 -> 4.7.19 upgrade in Azure, and the workaround helped.

---

We have an ARO customer that we've confirmed is running into this issue. To allow their OCP 4.6 -> 4.7 upgrade to proceed, they've applied the following workaround suggested by support:

~~~
$ oc scale -n openshift-azure-operator deployment/aro-operator-master --replicas=0
$ oc delete daemonset -n openshift-azure-routefix routefix
~~~

One question our customer has is: by deleting the routefix daemonset, are they now at risk of running into the large packet issue described in KCS 5252831 [1]? The customer has additional clusters that they planned to upgrade, but they are now not sure whether to proceed.

[1] https://access.redhat.com/solutions/5252831

---

(In reply to Kevin Chung from comment #7)
> One question our customer has is: by deleting the routefix daemonset, are
> they now at risk of running into the large packet issue described in KCS
> 5252831 [1]? The customer has additional clusters that they planned to
> upgrade, but they are now not sure whether to proceed.
>
> [1] https://access.redhat.com/solutions/5252831

No, the fix for that issue [1] is already included in 4.7.18 [2]. The problem with the fix is that, if the workaround was applied to a cluster, the new container with the fix in the sdn pod fails to start. Removing the "old" routefix daemonset (and the namespace) will let the new container start.
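For clusters where it is unclear whether the old routefix workaround is still deployed, a check along these lines can confirm before upgrading. This is a sketch only: it assumes `oc` is logged in to the target cluster, and the namespace name matches the commands above; the `workaround_status` helper is illustrative, not part of any product tooling.

```shell
#!/bin/sh
# Sketch: report whether the pre-4.7.18 routefix workaround is still present.
# Assumption: the workaround, if applied, lives in openshift-azure-routefix.

# Pure helper: turn a found/not-found answer ("yes"/"no") into an action message.
workaround_status() {
  case "$1" in
    yes) echo "remove the old routefix workaround before upgrading" ;;
    *)   echo "old routefix workaround not present" ;;
  esac
}

# Only query the cluster when oc is actually available.
if command -v oc >/dev/null 2>&1; then
  if oc get namespace openshift-azure-routefix >/dev/null 2>&1; then
    workaround_status yes
  else
    workaround_status no
  fi
fi
```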
[1] https://github.com/openshift/cluster-network-operator/pull/1119/files#diff-ec483261857ecc5887c65ea0a5eddfb3f7768ac3cef206e99599bd82e140accbR235
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967994#c7

---

I see that the bug is still in "NEW" state. What are the next steps on this? The customer is asking for an update on this issue.

---

I'm closing this. Before upgrading, the ARO workaround has to be removed, as discussed above.

---

Hi @mcambria,

Please reconsider reopening this bugzilla for the following reasons. Our customer has provided some relevant feedback: while the workaround technically resolves this issue, in practice it results in an impacting outage that requires manual intervention. Here are the possible scenarios:

1. The customer wants to upgrade an Azure or ARO cluster to 4.7.18 and applies the workaround to delete the routefix namespace prior to the upgrade. They will now encounter the TLS large packet issue described here [1] until the sdn pods are upgraded to 4.7.18. The sdn is one of the later cluster operators to be updated in the process, so they are looking at TLS large packet issues from the moment they press upgrade through most of the duration of the upgrade.

2. If instead the customer waits until after the upgrade to 4.7.18 to apply the workaround, they'll have sdn pods constantly restarting, resulting in impact to OpenShift networking. This is even worse because manual remediation is required: if the customer presses upgrade overnight and comes back the next day, they incur impact from the time the sdn pods are upgraded until the time they apply the manual remediation.

3. Thus, the only real solution for the customer is to have someone sit through the upgrade, wait for the precise moment the sdn pods are rolling out, and apply the workaround then to minimize downtime.

In practice, this isn't a good solution because the customer needs to manually intervene while upgrading every cluster, with work in prod environments that may have to be performed during evenings or weekends.

I'd propose that a fix for this bug should include deleting the openshift-azure-routefix daemonset automatically as part of the upgrade, just prior to rolling out the new sdn pods. The end result would be that the workaround is applied automatically.

In conclusion, a bug that results in customer impact and requires manual intervention should really be fixed. Since this impacts ALL Azure customers, the customer base that is currently impacted and requires manual remediation is quite large.

[1] https://access.redhat.com/solutions/5252831

Kevin
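Until something like the proposed automation exists, scenario 3 above can at least be scripted instead of watched by hand. The following is a rough sketch under stated assumptions: it assumes the sdn daemonset's container image string changes when the new sdn pods begin rolling out, that the daemonset/namespace names match those used earlier in this bug, and that a 30-second poll interval is acceptable; none of this is an official or supported remediation.

```shell
#!/bin/sh
# Sketch: poll the sdn daemonset image and, once the post-upgrade rollout
# begins, delete the old routefix daemonset (the workaround from this bug).

# Pure helper: the rollout has started once the observed image tag is
# non-empty and differs from the pre-upgrade tag.
sdn_has_rolled() {
  old_tag="$1"
  current_tag="$2"
  [ -n "$current_tag" ] && [ "$current_tag" != "$old_tag" ]
}

main() {
  # Record the image the sdn daemonset uses before the upgrade touches it.
  old=$(oc get daemonset/sdn -n openshift-sdn \
        -o jsonpath='{.spec.template.spec.containers[0].image}')
  while :; do
    cur=$(oc get daemonset/sdn -n openshift-sdn \
          -o jsonpath='{.spec.template.spec.containers[0].image}')
    if sdn_has_rolled "$old" "$cur"; then
      # New sdn pods are rolling out: remove the old workaround now.
      oc delete daemonset routefix -n openshift-azure-routefix --ignore-not-found
      break
    fi
    sleep 30
  done
}

# Only run the loop when oc is available, so the helper can be tested offline.
if command -v oc >/dev/null 2>&1; then
  main
fi
```

This mirrors what a person would do manually in scenario 3; the real fix proposed above (the operator deleting the daemonset itself) would make it unnecessary.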