Bug 1979312

Summary: network clusteroperator is degraded after upgrading to OCP 4.7.18 in Azure
Product: OpenShift Container Platform
Reporter: oarribas <oarribas>
Component: Networking
Assignee: mcambria <mcambria>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED NOTABUG
Docs Contact:
Severity: urgent
Priority: urgent
CC: aconstan, apurty, bbennett, christopher.obrien, dcaldwel, dramseur, dseals, jjacob, kechung, mangirdas, mcambria, mjudeiki, oarribas, rgopired
Version: 4.7
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Flags: oarribas: needinfo-
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-21 14:23:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description oarribas 2021-07-05 15:02:19 UTC
Description of problem:

After upgrading a cluster to OCP 4.7.18 in Azure, the `network` clusteroperator is degraded. The error message shown is: `clusteroperator/network is degraded because DaemonSet "openshift-sdn/sdn" rollout is not making progress`.

Checking the `sdn` pods, only 2 of 3 containers are running:
~~~
$ oc get pods -n openshift-sdn
NAME                  READY  STATUS   RESTARTS  AGE
sdn-xxxxx             2/3    Running  296       1d
sdn-yyyyy             2/3    Running  296       1d
sdn-xxxxx             2/3    Running  296       1d
~~~

The `drop-icmp` container is failing with the following error:
~~~
2021-07-02T17:44:10.674314413Z F0702 17:44:10.674140  907250 observe.go:436] Unable to listen on ":11251": listen tcp :11251: bind: address already in use
~~~

Before the upgrade, the customer applied the workaround from KCS 5252831 [1], creating a DaemonSet with a `drop-icmp` container for the same purpose.
In 4.7.18, a container doing that work was added to the `sdn` pod as part of BZ 1967994 [2] [3].
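
A quick way to confirm that the old workaround is still deployed alongside the new container is sketched below. The namespace and DaemonSet names are assumptions based on this bug (openshift-azure-routefix/routefix on ARO; the name may differ on self-managed Azure clusters depending on how the KCS workaround was applied), and `<sdn-pod>` is a placeholder:
~~~
# Look for a leftover routefix DaemonSet from the KCS 5252831 workaround.
$ oc get daemonset --all-namespaces | grep -i routefix

# Check why the new drop-icmp container in the sdn pod is failing
# (replace <sdn-pod> with a pod name from `oc get pods -n openshift-sdn`).
$ oc logs -n openshift-sdn <sdn-pod> -c drop-icmp --tail=20
~~~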



Version-Release number of selected component (if applicable):

4.7.18 in Azure (including ARO)



How reproducible:

Always


Steps to Reproduce:
1. Start with a cluster on a version prior to 4.7.18.
2. Apply the workaround from KCS 5252831 [1].
3. Upgrade the cluster to 4.7.18 and check the `network` clusteroperator and the `sdn` pods (example commands below).
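
These are standard checks, shown only for reference (pod names differ per cluster):
~~~
$ oc get clusteroperator network
$ oc get pods -n openshift-sdn
$ oc logs -n openshift-sdn <sdn-pod> -c drop-icmp
~~~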



Actual results:

Container `drop-icmp` in `sdn` pod fails.



Expected results:

Cluster upgrade without issues.






[1] https://access.redhat.com/solutions/5252831
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967994#c7
[3] https://github.com/openshift/cluster-network-operator/pull/1119/files#diff-ec483261857ecc5887c65ea0a5eddfb3f7768ac3cef206e99599bd82e140accbR235

Comment 3 Mangirdas Judeikis 2021-07-06 06:41:00 UTC
The issue is that we basically started deploying the same workaround from two locations.

As a temporary fix (command sketch below):
1. Scale down the ARO master operator deployment (aro-operator-master) in the openshift-azure-operator namespace.
2. Remove the openshift-azure-routefix namespace.

For non-ARO:
1. Remove the azure-routefix namespace (assuming the workaround from https://access.redhat.com/solutions/5252831 was applied).

ARO will adjust its operator to stop deploying our side of the "fix" from version 4.7.15 onwards. That fix is rolling out now.
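
A hedged command sketch of the steps above (resource names are taken from this bug; the non-ARO namespace name depends on how the KCS workaround was applied):
~~~
# ARO: stop the ARO operator from re-creating its copy of the workaround, then remove it.
$ oc scale -n openshift-azure-operator deployment/aro-operator-master --replicas=0
$ oc delete namespace openshift-azure-routefix

# Non-ARO Azure: remove the namespace created by the KCS 5252831 workaround.
$ oc delete namespace azure-routefix
~~~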

Comment 6 Jeesmon Jacob 2021-07-08 18:44:18 UTC
We are hitting this issue on a 4.6.26 -> 4.7.19 upgrade in Azure, and the workaround helped.

Comment 7 Kevin Chung 2021-07-13 15:30:27 UTC
We have an ARO customer that we've confirmed is running into this issue.  To allow their OCP 4.6 -> 4.7 upgrade to proceed, they've applied the following workaround suggested by support:

~~~
$ oc scale -n openshift-azure-operator deployment/aro-operator-master --replicas=0
$ oc delete daemonset -n openshift-azure-routefix routefix
~~~
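
A hedged follow-up check that the ARO side of the workaround is actually gone after running the commands above:
~~~
# Expect the deployment to show 0/0 replicas and no routefix DaemonSet to remain.
$ oc get deployment -n openshift-azure-operator aro-operator-master
$ oc get daemonset -n openshift-azure-routefix
~~~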

One question our customer has is, by deleting the routefix daemonset, are they now at risk for running into the large packet issue described in KCS # 5252831 [1]?  The customer has additional clusters that they planned to upgrade, but they are now not sure whether to proceed.


[1] https://access.redhat.com/solutions/5252831

Comment 8 oarribas 2021-07-14 07:38:30 UTC
(In reply to Kevin Chung from comment #7)
> 
> One question our customer has is, by deleting the routefix daemonset, are
> they now at risk for running into the large packet issue described in KCS #
> 5252831 [1]?  The customer has additional clusters that they planned to
> upgrade, but they are now not sure whether to proceed.
> 
> 
> [1] https://access.redhat.com/solutions/5252831


No, the fix for that issue [1] is already included in 4.7.18 [2].

The problem is that, if the workaround was applied to a cluster, the new container carrying the fix in the `sdn` pod fails to start.
Removing the "old" routefix daemonset (and its namespace) will let the new container start.




[1] https://github.com/openshift/cluster-network-operator/pull/1119/files#diff-ec483261857ecc5887c65ea0a5eddfb3f7768ac3cef206e99599bd82e140accbR235
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967994#c7

Comment 10 Ram Gopireddy 2021-07-21 13:58:50 UTC
I see that the bug is still in "NEW" state. What are the next steps on this? The customer is asking for an update on this issue.

Comment 11 mcambria@redhat.com 2021-07-21 14:23:16 UTC
I'm closing this.   Before upgrading, the ARO workaround has to be removed as discussed above.

Comment 12 Kevin Chung 2021-07-21 14:45:44 UTC
Hi @mcambria 

Please reconsider and reopen this bugzilla for the following reasons. Our customer has provided some relevant feedback: while the workaround technically resolves this issue, in practice it results in a customer-impacting outage that requires manual intervention. Here are the possible scenarios:

1. The customer wants to upgrade an Azure or ARO cluster to 4.7.18 and applies the workaround to delete the routefix namespace prior to the upgrade. They will now encounter the TLS large packet issue described here [1] until the sdn pods are upgraded to 4.7.18. The network operator is one of the later cluster operators to be updated, so they are exposed to TLS large packet issues from the moment the workaround is removed, before pressing upgrade, through most of the duration of the upgrade.

2. If instead the customer waits until after the upgrade to 4.7.18 to apply the workaround, they'll have sdn pods constantly restarting, resulting in impact to OpenShift networking. This is even worse because manual remediation is required: if the customer presses upgrade overnight and comes back the next day, they incur impact from the time the sdn pods are upgraded until the time they apply the manual remediation.

3. Thus, the only real solution for the customer is to have someone sit through the upgrade, wait for the precise moment the sdn pods are rolling out, and apply the workaround then to minimize downtime. In practice, this isn't a good solution because the customer needs to manually intervene while upgrading every cluster, with work in prod environments that may need to be performed during evenings or weekends.

I'd propose that a fix to this bug should include deleting the openshift-azure-routefix daemonset automatically as part of the upgrade, and just prior to rolling out the new sdn pods.  The end result would be that the workaround would be automatically applied.

In conclusion, a bug that results in customer impact and requires manual intervention should really be fixed. Since this impacts ALL Azure customers, the currently impacted customer base that requires manual remediation is quite large.


[1] https://access.redhat.com/solutions/5252831


Kevin