Bug 2040715 - post 1.23 rebase: regression in service-load balancer reliability [NEEDINFO]
Summary: post 1.23 rebase: regression in service-load balancer reliability
Keywords:
Status: POST
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Riccardo Ravaioli
QA Contact: Anurag saxena
URL:
Whiteboard: EmergencyConfirmed
Depends On:
Blocks: 2056948
 
Reported: 2022-01-14 15:37 UTC by Abu Kashem
Modified: 2022-06-09 12:38 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2056948 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:
vpickard: needinfo? (wlewis)
dgoodwin: needinfo? (anusaxen)
rravaiol: needinfo? (anusaxen)




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes pull 108149 0 None Merged fix: do not return early in the node informer when there is no change… 2022-02-25 19:06:23 UTC
Github openshift kubernetes pull 1184 0 None Merged Bug 2040715: upstream 108149: do not return early in the node informer when there is no change 2022-02-25 19:06:26 UTC
Github openshift kubernetes pull 1190 0 None Merged Bug 2040715: UPSTREAM: 108284: fix: exclude non-ready nodes from azure load balancer 2022-02-25 19:06:28 UTC
Github openshift machine-config-operator pull 2948 0 None open Bug 2040715: set "node.kubernetes.io/exclude-from-external-load-balancers" label at reboot and unset it at startup 2022-05-23 13:10:52 UTC

Description Abu Kashem 2022-01-14 15:37:37 UTC
service-load balancer reliability regressing on "periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade"

seems to be specific to:
- azure
- limited to service load balancer

(needs triage)



https://files.slack.com/files-pri/T027F3GAJ-F02UFQKGS6M/screenshot_from_2022-01-14_10-27-51.png

see slack thread: https://coreos.slack.com/archives/C01CQA76KMX/p1642174183197200

Comment 1 Michal Fojtik 2022-01-14 15:41:43 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.


Comment 6 Abu Kashem 2022-01-21 17:13:48 UTC
Based on Antonio's finding, can we have some eyes from the networking team please? This is currently marked as a blocker bug for 4.10 so some further investigation will be appreciated.

Comment 7 Surya Seetharaman 2022-01-21 18:01:32 UTC
Moving component to ovn-kubernetes.

Comment 8 Casey Callendrello 2022-01-24 15:10:08 UTC
Here's what I can gather so far:

A test, "[sig-network-edge] Application behind service load balancer with PDB remains available using new connections", started failing HARD on 2022-01-12, for e2e-azure-upgrade and upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback. Other jobs were mostly not affected.

On 2022-01-19, it started passing reliably again.

See https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=[sig-network-edge]%20Application%20behind%20service%20load%20balancer%20with%20PDB%20remains%20available%20using%20new%20connections for the raw data.

e2e-azure-upgrade does not use OVN, whereas the ovn-upgrade job certainly does.

So, not sure what the issue was. But I think the heat is off.

Comment 9 Casey Callendrello 2022-01-24 15:44:21 UTC
BTW, the way this test works is that it continually connects from the CI cluster to a service on the test cluster. So, there are a lot of things in the chain that could fail. It's not the easiest to debug.

The full traffic flow is CI -> (cloud provider NAT) -> internet -> ALB (or equivalent) -> Node -> (Possibly another Node) -> Pod.

Comment 11 David Eads 2022-01-24 20:36:26 UTC
Changing back to a blocker+. The regression remains.  See the disruption chart for https://drive.google.com/drive/folders/1HX6q4j6B6QVLrasWwN_5Ct_TdGMIB5AS?usp=sharing

Before kube update, about 5s.  After kube update about 13s on azure SDN HA.

The test moved to passing because we shifted criteria to avoid blocking the entire org while this bug was fixed because it failed so frequently.

Comment 13 Casey Callendrello 2022-01-24 20:51:00 UTC
> The test moved to passing because we shifted criteria to avoid blocking the entire org while this bug was fixed because it failed so frequently.

Aha. Good to know.

> Before kube update, about 5s.  After kube update about 13s on azure SDN HA.

So, I'll start to take a look tomorrow, but I wonder what has changed in the cloudprovider code. I wonder if it's allocating LBs differently.

Comment 14 Casey Callendrello 2022-01-25 17:10:01 UTC
I've done some analysis, and if this is related to the kube rebase, then I'm not sure how. I'm at a bit of a dead end here.

We're using the in-tree (legacy) cloud provider, so one can get the list of included changes: https://github.com/kubernetes/kubernetes/commits/v1.23.2/staging/src/k8s.io/legacy-cloud-providers/azure

I actually extracted the Azure credentials and diffed the load balancer configurations - nothing has changed between 4.9 and 4.10. So, whatever the issue is, it's not likely to be divergent cloud LB configuration.

My first guess is that there is some different behavior when nodes are removed. This is made more complicated by the test creating a Service with an ExternalTrafficPolicy=Cluster rather than Local. The recommended configuration (and what we use for the Router / apiserver) is to use Local.

I need to read up on exactly what happens when nodes go down for ETP=Cluster LoadBalancer services. It could be that we somehow aren't returning something that allows the LB to drop the node in a timely manner.
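A rough model of why ETP matters here (my simplification; the function and flag names below are illustrative, not Kubernetes code): with ETP=Cluster the service NodePort answers on every node running kube-proxy, so the cloud LB's own health check keeps passing and dropping a node relies entirely on the cloud provider removing it from the backend pool; with ETP=Local the dedicated healthCheckNodePort only reports healthy on nodes with local endpoints, so the LB drains nodes on its own.

```go
package main

import "fmt"

// healthCheckPasses models which nodes answer the cloud LB health check,
// per my understanding; the parameter names are illustrative.
func healthCheckPasses(etpLocal, hasLocalEndpoint, kubeProxyUp bool) bool {
	if etpLocal {
		// ETP=Local: healthCheckNodePort reports only local endpoints
		return kubeProxyUp && hasLocalEndpoint
	}
	// ETP=Cluster: the NodePort answers on any node running kube-proxy,
	// regardless of where the backing pods are
	return kubeProxyUp
}

func main() {
	fmt.Println("ETP=Cluster, no local pod:", healthCheckPasses(false, false, true))
	fmt.Println("ETP=Local, no local pod:  ", healthCheckPasses(true, false, true))
}
```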

Comment 25 David Eads 2022-01-27 16:58:19 UTC
The distribution I attached shows we have runs with little disruption even now, but it's not the common case.  Can you collect the disruption times from all the job runs so we can see the distribution as opposed to just 2/10 runs?

Comment 26 Yu Qi Zhang 2022-02-01 17:49:44 UTC
Not sure if this is exactly the same error being observed, but starting on Jan. 28th, the MCO e2e-agnostic-upgrade (Azure) tests have been permanently failing, blocking all of our PRs.

Looking at our most recent jobs,

disruption_tests: [sig-network-edge] Console remains available via cluster ingress using new connections
disruption_tests: [sig-network-edge] Console remains available via cluster ingress using reused connections
disruption_tests: [sig-network-edge] Application behind service load balancer with PDB remains available using new connections

All consistently fail. Example jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2899/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1488485570593165312
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2911/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1488485568248549376
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2909/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1488459393153372160
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2911/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1488430213187506176
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2899/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1488405544535330816

Link to job history

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade?buildId=

Would this be the same issue or should I open a new BZ for the other 2 test failures?

Comment 27 David Eads 2022-02-07 16:15:02 UTC
updating target release to fix the regression before we ship 4.10.0

Comment 29 Riccardo Ravaioli 2022-02-11 21:59:20 UTC
Here's an update.

I tried Casey's second idea from above: https://github.com/openshift/machine-config-operator/pull/2948, which correctly applies a special label ("node.kubernetes.io/exclude-from-external-load-balancers") to a node that is about to reboot, in reaction to which the cloud controller manager is supposed to update the backend list for the Azure Load Balancer. I saw no change in the disruption times of 20 upgrade jobs compared to the values collected with a 4.10 cluster.

So I had to dig deeper and see what actually gets triggered in cloud controller manager and what gets sent over to Azure when a node gets the label from above and when it reboots.

In short, in 1.23 this PR https://github.com/kubernetes-sigs/cloud-provider-azure/pull/856 changed the logic that decides which nodes must be deleted in AZ load balancers and it turns out that the azure-specific cloud controller code NEVER deletes any node from the AZ load balancers. As a result, the `az network lb list` command always shows the same output while I reboot nodes or apply the "exclude" label to them.

In 4.9, the output from `az` command clearly shows that the rebooting or labeled nodes are removed from the backend list. Also, the cloud controller manager shows this line in the logs, which NEVER gets printed in 4.10:

azure_loadbalancer.go:1139:
klog.V(2).Infof("reconcileLoadBalancer for service (%s)(%t): lb backendpool - found unwanted node %s, decouple it from the LB", serviceName, wantLb, nodeName)

That effectively removes a node from the AZ LB.

In 4.10 with some extra debug messages, I see that reconcileLoadBalancer from azure_loadbalancer.go correctly receives the updated list of nodes from the cloud controller manager in controller.go:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L1082
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L164-L191

In particular, here is where the list of available nodes is promptly updated based on a number of conditions, among which the label from above: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L671-L699


The above PR however changed the logic of when a node gets deleted in Azure LBs. A few lines further down in reconcileLoadBalancer, before that PR was merged, the code would just compare the node names it had from the previous call with the new list of nodes it gets as input and would delete any spurious node:

if !nodeNameInNodes(nodeName, nodes) {
	klog.V(2).Infof("reconcileLoadBalancer for service (%s)(%t): lb backendpool - found unwanted node %s, decouple it from the LB", serviceName, wantLb, nodeName)
	// construct a backendPool that only contains the IP config of the node to be deleted
	backendIPConfigurationsToBeDeleted = append(backendIPConfigurationsToBeDeleted, network.InterfaceIPConfiguration{ID: to.StringPtr(ipConfID)})
}


With that PR merged, the code makes more sophisticated decisions in a dedicated function (ShouldNodeExcludedFromLoadBalancer) that is used throughout the two commits that the PR introduces:

shouldExcludeLoadBalancer, err := az.ShouldNodeExcludedFromLoadBalancer(nodeName)
if err != nil {
        klog.Errorf("ShouldNodeExcludedFromLoadBalancer(%s) failed with error: %v", nodeName, err)
        return nil, err
}
if shouldExcludeLoadBalancer {
        klog.V(2).Infof("reconcileLoadBalancer for service (%s)(%t): lb backendpool - found unwanted node %s, decouple it from the LB", serviceName, wantLb, nodeName)
        // construct a backendPool that only contains the IP config of the node to be deleted
        backendIPConfigurationsToBeDeleted = append(backendIPConfigurationsToBeDeleted, network.InterfaceIPConfiguration{ID: to.StringPtr(ipConfID)})
}



This function does NOT make use of the node list received as input by the reconciler. It looks for the node name in az.excludeLoadBalancerNodes, which is updated by az.updateNodeCaches, which is in turn triggered by separate node informers in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure.go#L741-L777

For some reason, these last node informers trigger a cache update only if the topology label (v1.LabelTopologyZone) on them changes, so the list in az.excludeLoadBalancerNodes is only correct at startup, and then becomes stale. As a consequence, in my tests, as nodes reboot or get labeled, the code always checks for inclusion against an empty az.excludeLoadBalancerNodes list...
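My reading of that informer behavior, boiled down to a toy model (the cache and handler shapes here are simplifications for illustration, not the actual azure.go types):

```go
package main

import "fmt"

const labelTopologyZone = "topology.kubernetes.io/zone"
const labelExcludeLB = "node.kubernetes.io/exclude-from-external-load-balancers"

type node struct {
	name   string
	labels map[string]string
}

// updateExcludeCache mimics what az.updateNodeCaches does for the exclude
// list: record whether a node carries the exclude label.
func updateExcludeCache(cache map[string]bool, n node) {
	if _, excluded := n.labels[labelExcludeLB]; excluded {
		cache[n.name] = true
	} else {
		delete(cache, n.name)
	}
}

// onUpdateBuggy reproduces the early return described above: the cache is
// only refreshed when the topology zone label changes.
func onUpdateBuggy(cache map[string]bool, prev, cur node) {
	if prev.labels[labelTopologyZone] == cur.labels[labelTopologyZone] {
		return // exclude-label changes never reach the cache
	}
	updateExcludeCache(cache, cur)
}

// onUpdateFixed refreshes the cache on every update event.
func onUpdateFixed(cache map[string]bool, prev, cur node) {
	updateExcludeCache(cache, cur)
}

func main() {
	prev := node{"worker-0", map[string]string{labelTopologyZone: "z1"}}
	cur := node{"worker-0", map[string]string{labelTopologyZone: "z1", labelExcludeLB: ""}}

	buggy := map[string]bool{}
	onUpdateBuggy(buggy, prev, cur)
	fmt.Println("buggy cache sees exclusion:", buggy["worker-0"])

	fixed := map[string]bool{}
	onUpdateFixed(fixed, prev, cur)
	fmt.Println("fixed cache sees exclusion:", fixed["worker-0"])
}
```

This is what makes az.excludeLoadBalancerNodes go stale: the exclude label changes, but the zone label does not, so the handler returns before the cache is touched.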


In order to resolve this, I see the following options:

(1) re-establish the check against the /valid/ list of nodes as in 4.9, in: https://github.com/kubernetes/kubernetes/blob/f74f91d3d3116f695c5e42d6eb8fbf2f017d1ad4/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L1138  (this would make us diverge from upstream!)

(2) remove entirely the two commits introduced by the PR in question

(3) broaden the number of cases for which we update the azure-specific cache in the second informers described above.

As for the third option, what I don't like is that:
- the cloud-agnostic code in cloud controller has its own rules for deciding whether a node is to be included as a backend or to be removed
- AZ-specific code ignores those rules and uses its own...

Comment 30 Riccardo Ravaioli 2022-02-11 22:09:57 UTC
The actual PR that introduced the changes I discuss is this:
https://github.com/kubernetes/kubernetes/pull/105839

As I scroll down that page, I see that the commits were backported all the way down to ***1.20*** around a month ago, so as we rebase our z stream releases, we will carry this issue with us as well...


(small correction from above: when I say "That effectively removes a node from the AZ LB", of course I don't mean that a log message removes a backend node from the AZ LB! :) )

Comment 31 Riccardo Ravaioli 2022-02-15 16:52:49 UTC
Adding more info after comparing in more detail with the execution flow in 4.9...


We analyzed how Azure Load Balancer backends are modified as cluster nodes become available/unavailable, which in our tests can happen in two ways:
- by labeling nodes with "node.kubernetes.io/exclude-from-external-load-balancers"
- by rebooting nodes

In openshift 4.9, we notice that:

1. the cloud service controller in staging/src/k8s.io/cloud-provider/controllers/service/controller.go receives event notifications on nodes and updates accordingly the list of available nodes for load balancers (https://github.com/openshift/kubernetes/blob/b93fd35dd030519c24dd02f8bc2a7f873d2928cd/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L668-L688): a node is excluded from a load balancer backend pool if it has the "exclude" label above, or if its status is not Ready;

2. `reconcileLoadBalancer` in azure_loadbalancer.go receives the updated list of nodes from the cloud service controller
   
3. if this list differs from the current list of nodes in the backend pool for a given load balancer, it removes spurious nodes and adds any new nodes.
   
4. in particular, node deletion begins here: https://github.com/openshift/kubernetes/blob/release-4.9/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L1137-L1162 and takes place in the `EnsureBackendPoolDeleted` function.
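The inclusion rule in step 1 can be condensed into a predicate like this (a paraphrase of the linked controller.go logic, not the verbatim code):

```go
package main

import "fmt"

const excludeLBLabel = "node.kubernetes.io/exclude-from-external-load-balancers"

type node struct {
	labels map[string]string
	ready  bool
}

// includeInLoadBalancer paraphrases the cloud service controller's node
// predicate: a node is a valid LB backend unless it carries the exclude
// label or is not Ready.
func includeInLoadBalancer(n node) bool {
	if _, excluded := n.labels[excludeLBLabel]; excluded {
		return false
	}
	return n.ready
}

func main() {
	rebooting := node{labels: map[string]string{excludeLBLabel: ""}, ready: true}
	healthy := node{labels: map[string]string{}, ready: true}
	fmt.Println(includeInLoadBalancer(rebooting), includeInLoadBalancer(healthy))
}
```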

So far so good, if I label a node that I want to exclude from a load balancer, the following logs appear:

I0211 14:38:44.651083       1 azure_loadbalancer.go:1142] reconcileLoadBalancer for service (openshift-ingress/router-default)(true): lb backendpool - found unwanted node ci-ln-3qh25qk-1d09d-5m9q8-worker-centralus2-zg6hc, decouple it from
the LB
I0211 14:38:44.852854       1 azure_backoff.go:90] GetVirtualMachineWithRetry(ci-ln-3qh25qk-1d09d-5m9q8-worker-centralus2-zg6hc): backoff success
I0211 14:38:44.881964       1 azure_standard.go:1031] EnsureBackendPoolDeleted begins to CreateOrUpdate for NIC(ci-ln-3qh25qk-1d09d-5m9q8-rg, ci-ln-3qh25qk-1d09d-5m9q8-worker-centralus2-zg6hc-nic) with backendPoolID
/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-3qh25qk-1d09d-5m9q8-rg/providers/Microsoft.Network/loadBalancers/ci-ln-3qh25qk-1d09d-5m9q8/backendAddressPools/ci-ln-3qh25qk-1d09d-5m9q8


To confirm that the deletion took place on the Azure side, the output of `az network lb list -g $myRG` taken before and after adding the above label shows clearly that the affected node was removed from the backend pool of this load balancer.

Now, if we try to do the same with openshift 4.10, we see that the above deletion NEVER takes place:
- steps 1 and 2 from above are the same: the cloud service controller realizes that a node has been labeled (or rebooted) and removes it from the list of nodes it passes to the Azure reconcileLoadBalancer
- This PR (https://github.com/kubernetes/kubernetes/pull/105839, which solves https://github.com/kubernetes-sigs/cloud-provider-azure/issues/851), however, changes the logic of how nodes are removed, and the result is that deletion never happens, at least for the two cases mentioned above (a node that gets the "exclude" label, a node that is rebooting), the last of which (the reboot) is crucial for the result of our upgrade job...


From what I understand in the current code:
- deletion now happens if this call returns true: az.ShouldNodeExcludedFromLoadBalancer(nodeName)
- this function does NOT make use of the node list received as input by the reconciler. It looks for the node name in az.excludeLoadBalancerNodes, which is updated by az.updateNodeCaches, which is in turn triggered by separate node informers in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure.go#L741-L777
- for some reason, these node informers trigger a cache update only if the topology label (`v1.LabelTopologyZone`) on them changes, so the list in `az.excludeLoadBalancerNodes` is only correct at startup, and then becomes stale. As a consequence, in our tests, as nodes reboot or get labeled, the code always checks for inclusion against an empty `az.excludeLoadBalancerNodes` list...

The update logic with these node informers was not part of the PR mentioned above. From the issue linked to the PR above, this doesn't seem to be intentional...


In order to overcome this, I tried removing the condition on the topology label and triggering a cache update on every update event. https://github.com/openshift/kubernetes/pull/1174/commits/4ca6c0db6e74cece118546907bd51763c068fe15

This goes a little further in the execution, but deletion is skipped because `EnsureBackendPoolDeleted` only issues a delete if vmasName==vmSet:

I0212 12:57:03.499252       1 azure_standard.go:1005] EnsureBackendPoolDeleted: skipping the node ci-ln-87q96z2-1d09d-hnn5d-worker-westus-5x7n9 belonging to another vm set CI-LN-87Q96Z2-1D09D-HNN5D_CI-LN-87Q96Z2-1D09D-HNN5D-WORKER-WESTUS-AS


Indeed, if I add debug lines I see that:
- vmasName=CI-LN-87Q96Z2-1D09D-HNN5D_CI-LN-87Q96Z2-1D09D-HNN5D-WORKER-WESTUS-AS
- vmSet=385791c7-1c0a-4d8c-a6a5-d01104521d42

I am not familiar with these terms and what they entail. What's interesting, though, is that this code hasn't changed all that much since openshift 4.9, so I added debug lines to see what the values for vmasName and vmSet were back then and it turned out that they were both empty strings! The check for equality was then successful and deletion took place.

Now, why were they both empty?
`vmSet` is an input parameter in `EnsureBackendPoolDeleted` and is set here, in the reconciler function:

    vmSetName := az.mapLoadBalancerNameToVMSet(lbName, clusterName)
    err = az.VMSet.EnsureBackendPoolDeleted(service, lbBackendPoolID, vmSetName, backendpoolToBeDeleted)


In one of my tests, lbName=ci-ln-stbgb3b-1d09d-2xxfd
(unfortunately I did not log clusterName...)

    func (az *Cloud) mapLoadBalancerNameToVMSet(lbName string, clusterName string) (vmSetName string) {
	    vmSetName = strings.TrimSuffix(lbName, InternalLoadBalancerNameSuffix) // removing -internal suffix
	    if strings.EqualFold(clusterName, vmSetName) {
		    vmSetName = az.VMSet.GetPrimaryVMSetName()
	    }
	    return vmSetName
    }

My understanding is that `az.VMSet.GetPrimaryVMSetName()` returns an empty value in 4.9:

// GetPrimaryVMSetName returns the VM set name depending on the configured vmType.
// It returns config.PrimaryScaleSetName for vmss and config.PrimaryAvailabilitySetName for standard vmType.
func (as *availabilitySet) GetPrimaryVMSetName() string {
        return as.Config.PrimaryAvailabilitySetName
}


This code wasn't modified in the PR, so I do not see how now with 1.23 we get a proper value for vmSetName...
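To illustrate why both values can come out empty, here is a standalone copy of the quoted mapping logic (the "-internal" suffix constant and the parameterized primary set name are my assumptions for illustration; upstream reads the latter from the Azure config):

```go
package main

import (
	"fmt"
	"strings"
)

// mapLoadBalancerNameToVMSet is a standalone copy of the quoted function:
// strip the internal-LB suffix, and if the result equals the cluster name,
// fall back to the configured primary availability set name.
func mapLoadBalancerNameToVMSet(lbName, clusterName, primaryAvailabilitySetName string) string {
	vmSetName := strings.TrimSuffix(lbName, "-internal")
	if strings.EqualFold(clusterName, vmSetName) {
		vmSetName = primaryAvailabilitySetName
	}
	return vmSetName
}

func main() {
	// if clusterName matches the LB name and PrimaryAvailabilitySetName is
	// unset, vmSetName is "" and the vmasName == vmSet check passes trivially
	fmt.Printf("%q\n", mapLoadBalancerNameToVMSet(
		"ci-ln-stbgb3b-1d09d-2xxfd", "ci-ln-stbgb3b-1d09d-2xxfd", ""))
}
```

So on 4.9-style clusters where Config.PrimaryAvailabilitySetName is empty, the equality check degenerates to "" == "" and deletion proceeds, matching the empty values I logged.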

While I work on figuring this out, I will share all the above in the upstream issue that David very kindly opened here: https://github.com/kubernetes-sigs/cloud-provider-azure/issues/851



In order to recap: we can easily update the node cache in the Azure code, but then we hit another problem that prevents the delete operation from taking place. This involves some subtleties in the Azure configuration (vmasName vs vmSet, and also Config.PrimaryAvailabilitySetName), and we definitely need someone from the upstream community to take a look at any code changes we might want to introduce there...

Comment 32 Riccardo Ravaioli 2022-02-16 16:20:09 UTC
Here's today's update.

To recap, there are three places in the code base that need or might
need a fix:
1) [upstream] the way azure load balancer code updates its internal node
   cache, as discussed yesterday
2) [upstream] the way azure load balancer deletes unavailable nodes from
   the list of backend pools of a load balancer, as discussed yesterday
3) [downstream] the way Machine Config Operator treats nodes that are
   going for a reboot

So, in the same order: (1) we managed to get the Azure people from the
upstream community involved
(https://github.com/kubernetes/kubernetes/issues/108112) and they
promptly opened a PR with the fix we had suggested in the issue page:
https://github.com/kubernetes/kubernetes/pull/108149 That'll have to
be merged upstream and then backported to 1.23, 1.22 and 1.20 (1.21
weirdly doesn't have the offending if condition...)

(2) David posted a fix for this second issue
(https://github.com/openshift/kubernetes/pull/1179#issuecomment-1040632084,
including my commits for the first issue), but we see some
inconsistencies across different cluster instances. In more detail, he
allowed bypassing an equality check between `vmasName` and `vmSet`
which previously blocked the deletion of a node from a
backend pool; with a cluster created from that PR, I see that these
variables are again empty, as in 4.9. With a previous cluster created
out of https://github.com/openshift/kubernetes/pull/1174 and
https://github.com/openshift/machine-config-operator/pull/2948 , I saw
a couple of days ago that these variables held values that were
different, which impeded the node deletion from the backend pools. I am
currently testing this again with different images (with/without the MCO
PR) to see what that yields.

(3) I finalized the PR I had previously created for machine config
operator:
https://github.com/openshift/machine-config-operator/pull/2948. The
objective is to label rebooting nodes for load balancer exclusion. This
is not necessary in theory, since the cloud service controller takes
care of updating the list of available nodes according to (1) whether or
not they have that special label (2) their Ready state. So case 2
applies already, but we might gain some small fractions of a second by
labeling nodes at the very start of the reboot process. This is to be
verified with a batch of upgrade jobs.


I also ran around twenty upgrade jobs from David's PR above in order
to see if the disruption times are back to the old values
of 4.9.  Unfortunately, I do NOT see any improvements with the small
sample I have (8 values out of the 20 jobs I ran...)

 | image                              | 25th percentile | median | 75th percentile | average | stdev | min | max | # of values |
 | kubernetes#1179 (new connections)  |             8.0 |   12.5 |            43.0 |    38.7 |  52.7 |   5 | 124 |           8 |
 | kubernetes#1148 (new connections)  |             1.5 |      4 |             6.5 |    12.7 |  31.3 |   0 | 121 |          27 |

... where:
- kubernetes#1179 is David's PR, including fixes 1 and 2
- kubernetes#1148 is kubernetes 1.23 minus the three AZ-LB related
  commits, as identified by Casey a couple of weeks ago. This showed
  values similar to what we had in 4.9.
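For reference, a minimal sketch of how percentile columns like these can be computed from raw per-job disruption samples (the sample values below are made up, not the actual job data):

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile of sorted samples, using linear
// interpolation between the two nearest ranks.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 1 {
		return sorted[0]
	}
	rank := p / 100 * float64(len(sorted)-1)
	lo := int(rank)
	frac := rank - float64(lo)
	if lo+1 >= len(sorted) {
		return sorted[len(sorted)-1]
	}
	return sorted[lo] + frac*(sorted[lo+1]-sorted[lo])
}

func main() {
	// hypothetical disruption samples (seconds) from a batch of upgrade jobs
	samples := []float64{0, 1, 2, 3, 4, 5, 6, 8, 121}
	sort.Float64s(samples)
	fmt.Printf("25th=%.1f median=%.1f 75th=%.1f\n",
		percentile(samples, 25), percentile(samples, 50), percentile(samples, 75))
}
```

The long tail (a single 121s run) is why the average and stdev columns diverge so sharply from the medians in the tables.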

I am now running more upgrade jobs with an image built out of
kubernetes#1179 + the MCO PR from above, to see if that improves our
disruption times.

As a side note, I just wanted to point out that due to constraints in
clusterbot, the testing of any hypothesis (any combination of PRs from
above) takes quite some time, since we are limited to 1 cluster per
user, and when the image has to be built from a PR, it has recently
taken 2 hours to have clusters up and running. This significantly slowed
me down in the investigation.

Also, upgrade jobs are not fail proof either:
- an upgrade job might fail for other reasons before the e2e test we are
  interested in actually gets executed (this is probably to be improved on
  our side)
- a lot of upgrade jobs in Azure fail for insufficient capacity on Azure
  side, yielding this error:

  level=error msg=Error: waiting for creation of Linux Virtual Machine
  "ci-ln-0yxix32-002ac-8bfql-master-1" (Resource Group
  "ci-ln-0yxix32-002ac-8bfql-rg"): Code="ZonalAllocationFailed"
  Message="Allocation failed. We do not have sufficient capacity for the
  requested VM size in this zone. Read more about improving likelihood
  of allocation success at <http://aka.ms/allocation-guidance>"

Comment 33 Riccardo Ravaioli 2022-02-17 15:21:42 UTC
Here's today's update.

Same list of objectives as yesterday:
1) [upstream] the way azure load balancer code updates its internal node
   cache, as discussed yesterday
2) [upstream] the way azure load balancer deletes unavailable nodes from
   the list of backend pools of a load balancer, as discussed yesterday
3) [downstream] the way Machine Config Operator treats nodes that are
   going for a reboot

In the same order:

(1)
 The upstream fix was merged in main (https://github.com/kubernetes/kubernetes/pull/108149) and backported to 1.23 (https://github.com/kubernetes/kubernetes/pull/108177) and 1.22 (https://github.com/kubernetes/kubernetes/pull/108178). The backports haven't been merged yet. The original PR (causing the bug) made it to 1.20, which is not open for backports any longer: https://github.com/kubernetes/kubernetes/blob/ec1738576d8511a4cc03973d26b5606ec04e8693/staging/src/k8s.io/legacy-cloud-providers/azure/azure.go#L759. This poses a risk for openshift 4.7, but we're a few months away from being in sync with the latest 1.20.z, so we can deal with this at a later moment.

(2)
The problem here was that, in one of my cluster instances I tested when using the fix in point 1, I had `vmasName` and `vmSet` set in the code (`vmasName=CI-LN-87Q96Z2-1D09D-HNN5D_CI-LN-87Q96Z2-1D09D-HNN5D-WORKER-WESTUS-AS`, `vmSet=385791c7-1c0a-4d8c-a6a5-d01104521d42`), which prevented `EnsureBackendPoolDeleted` from deleting the node from the backend pool. David promptly posted a patch that dealt with this, but soon afterwards we realized that these variables were (again) always empty. I could not reproduce the environment that yielded non-empty variables in /any/ subsequent tests, despite testing all the combinations of the openshift/kubernetes and openshift/machine-config-operator PRs we had used until now. At this point, I suspect that this was a peculiarity of the instance I ran that time, which might occasionally happen, so we definitely need David's PR to handle this specific case.

In particular, the values of `vmasName` and `vmSet` in the end depend on this variable in the Azure LB configuration, which we keep seeing empty:

// GetPrimaryVMSetName returns the VM set name depending on the configured vmType.
// It returns config.PrimaryScaleSetName for vmss and config.PrimaryAvailabilitySetName for standard vmType.
func (as *availabilitySet) GetPrimaryVMSetName() string {
        return as.Config.PrimaryAvailabilitySetName
}


(3) 
I've been working on finalizing the MCO PR (https://github.com/openshift/machine-config-operator/pull/2948) after receiving David's and Kirsten's comments. 


As for upgrade jobs, which are the final target of this BZ:
- as pointed out yesterday, fix (1) from above unfortunately doesn't take us back to the disruption times we had in 4.9
- I wanted to see if we still get 4.9-like disruption times when removing altogether the entire commit from the upstream PR for which we got a fix in point (1) (https://github.com/kubernetes/kubernetes/pull/105839). This was mainly to verify that the Azure environment we test against does not add additional noise today, with respect to when we initially ran the tests a couple of weeks ago, and that it is still able to yield the results we are aiming for. The results show 4.9-like disruption times, which is great, and which means that there might still be something we missed in the upstream PR linked above. My next step is to compare attentively the original PR in kubernetes-sigs (https://github.com/kubernetes-sigs/cloud-provider-azure/pull/856) with what actually got merged upstream (https://github.com/kubernetes/kubernetes/pull/105839) to see if there are any more conflicts like the one for which we got the upstream fix, which I had pointed out here: https://github.com/kubernetes/kubernetes/issues/108112#issuecomment-1040629963
- all upgrade jobs I ran against images built from the MCO PR take 3-4 hours to finish, but they /never/ show results from the e2e tests under focus. This is to be investigated.


Updating the table I posted yesterday:

 | image                  | 25th percentile | median | 75th percentile | average | stdev | min | max | # of values |
 | kubernetes#1179        |            8.25 |   10.0 |            13.0 |    26.1 |  41.6 |   5 | 124 |          14 |
 | kubernetes#1148        |             1.5 |      4 |             6.5 |    12.7 |  31.3 |   0 | 121 |          27 |
 | kubernetes#1149        |             2.0 |      3 |             4.0 |     3.4 |   2.4 |   0 |   8 |           9 |
 | 4.10                   |               9 |     10 |            13.5 |    17.5 |  26.6 |   4 | 113 |          15 |

... where:
- I got more results for kubernetes#1179, which is the PR by David, including fixes 1 and 2
- kubernetes#1148 is kubernetes 1.23 minus the three AZ-LB related commits, as identified by Casey a couple of weeks ago.
- kubernetes#1149 is kubernetes 1.23 minus the upstream commit on azure load balancers described above

This shows that #1148 and #1149 are roughly aligned, and so are #1179 and 4.10.
The next step is to identify what other problems are brought by the upstream PR above...

Comment 37 Riccardo Ravaioli 2022-02-22 14:20:46 UTC
Here's today's update on the bug.

End of last week I finalized the MCO PR, thanks to the involvement of David and the MCO team (Kirsten, Zack), who promptly provided feedback: https://github.com/openshift/machine-config-operator/pull/2948
This PR works as expected in manual tests; however, it has a negative impact on MCO during upgrades, which we did not expect at all: this is currently under investigation.

Taking a step back and looking at the global picture, the upstream patch that we pursued and obtained last week for the azure load balancer code did not yield the disruption times we were aiming at.
A few more manual tests showed why: the upstream fix allows nodes that are labeled with `node.kubernetes.io/exclude-from-external-load-balancers` to be removed from the Azure load balancer. Before this fix, nothing ever got deleted from Azure load balancers in kubernetes 1.23. However, with this patch, nodes that are no longer ready (e.g. due to a reboot) still do NOT get removed from the load balancer. This is why all the upgrade jobs I ran against this patch do not show the disruption times we had in 4.9: the Azure code does not delete any node from the cloud load balancer; it only reacts to the node.kubernetes.io/exclude-from-external-load-balancers label (and to another label that I mention in my comment on github), which we do not set.

The original change to the logic of what gets deleted and when is tracked here: https://github.com/kubernetes-sigs/cloud-provider-azure/issues/851. From what I understand, it was done on purpose (!) to limit the number of API calls to Azure:
[quote from the github issue above]
"the [existing] logic increases VMSS API calls and may potentially introduce VMScaleSetActiveModelsCountLimitReached (limit is 10) issue."

I asked today for more details about this (https://github.com/kubernetes/kubernetes/issues/108112#issuecomment-1047674752), as it clearly has a negative impact on us.

I see two ways forward:

(1)
As I suggest on the github issue, we can extend the current upstream code to take into account changes in the node status. This depends on the feedback we will get from Azure engineers.

(2)
We use a combination of the existing upstream patch, already merged in 4.11, and the MCO PR: the downside is that, as mentioned before, the patch interferes severely with MCO during upgrades (currently under investigation).

Comment 38 Riccardo Ravaioli 2022-02-24 18:17:58 UTC
Following our call on Tuesday 22nd, we decided to pursue another upstream fix that, along with the one that was merged last week, fully resolves the problem and gives us back the disruption times we had in 4.9.

In short, the fix from last week allowed the internal node cache in azure code to be updated according to node events.
This new fix allows the internal node cache to take into account node status, and not only node labels, when deciding whether a node should be excluded from an azure load balancer. Ignoring node status was a regression with respect to 4.9, where the Azure-specific code took into account the updated list of nodes from the (cloud-agnostic) cloud provider controller, which looks at node status as well as node labels.

Since we are tight on time, we decided to continue with two parallel actions:
- discuss the changes with Azure folks in the kubernetes-sigs/cloud-provider-azure repo, possibly leading to a more significant rework of the code, with the timeline dependent on our exchanges with them;
- have the same changes in a PR in kubernetes/kubernetes, test it both manually and with upgrade jobs, and exceptionally merge it downstream asap to openshift/kubernetes master and release-4.10 branches, before it gets merged upstream.

With David's help, I also reworked the logic for the handling of the aforementioned node cache, which seemed flawed in some edge cases:
- it is now correctly updated according to the current state of the node we are processing;
- we make sure that nodes that are removed from the cluster are also excluded from azure load balancers, which didn't seem to be the case before.

Here are all relevant links:
- kubernetes-sigs/cloud-provider-azure
  + https://github.com/kubernetes-sigs/cloud-provider-azure/issues/1194
  + https://github.com/kubernetes-sigs/cloud-provider-azure/pull/1195

- upstream kubernetes
  + https://github.com/kubernetes/kubernetes/issues/108283
  + https://github.com/kubernetes/kubernetes/pull/108284

- openshift/kubernetes
  + https://github.com/openshift/kubernetes/pull/1190/
  + https://github.com/openshift/kubernetes/pull/1195

I'm now testing these new changes (manually and with 23 upgrade jobs) and also waiting for CI to pass.

Finally, last week's patch made it to openshift/kubernetes master, but not yet to release-4.10 (https://github.com/openshift/kubernetes/pull/1183).  Waiting for CI to pass there too.

Comment 39 Riccardo Ravaioli 2022-02-25 14:15:22 UTC
The regression has been fixed downstream; all PRs in openshift/kubernetes have been merged down to 4.10:
- First fix from last week:
https://github.com/openshift/kubernetes/pull/1184 (master)
https://github.com/openshift/kubernetes/pull/1183 (4.10)

- Second fix from this week:
https://github.com/openshift/kubernetes/pull/1190 (master)
https://github.com/openshift/kubernetes/pull/1195 (4.10)

I will actively continue to discuss these changes upstream both in kubernetes/kubernetes and in kubernetes-sigs/cloud-provider-azure in order to have them approved and merged.

Comment 46 Riccardo Ravaioli 2022-06-09 12:38:36 UTC
Hi Anurag,

I see this bug was never verified. Back in February I wrote down the exact steps to verify the bug here:
https://github.com/ricky-rav/cloud_lb/blob/main/steps.txt

Or more easily, we should look for regressions in the "periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade" CI job.

Thanks!

