Bug 2010083

Summary: Service - Type: LoadBalancer swaps subnets causing network outage
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Cloud Controller Manager
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Type: Bug
Reporter: David <dsquirre>
Assignee: Joel Speed <jspeed>
QA Contact: sunzhaohua <zhsun>
CC: aos-bugs
Target Milestone: ---
Target Release: ---
Last Closed: 2021-11-26 17:23:51 UTC
Attachments: Cloud Trail Logs

Description David 2021-10-03 15:18:36 UTC
Created attachment 1828622 [details]
Cloud Trail Logs

Description of problem:
The customer has a ServiceMesh Service of type LoadBalancer. This service loses connectivity intermittently.

The customer resolves the issue via the AWS console by swapping the subnet attached to the AWS ELB back to the expected value.

This appears to be a bug, but could just be a misconfiguration between AWS and OCP.

This appears to have similar symptoms to Bug 1978396
---
The customer deployed OCP 4.7 on AWS (UPI) and was notified by their AWS rep that OCP tried to delete all three target groups from the classic load balancer in front of the OCP ingress controller (AWS logs show OCP issuing DetachLoadBalancerFromSubnets), but AWS prevents the last target group from being removed. However, the removal of two target groups impacted the customer's cluster, which became inaccessible because neither router pod could be reached for authentication.
---

Note: While I don't have it documented, our SRE engineers did mention:
---
When I look at the CloudTrail events for one of those users, i.e. "i-04ac58167e03a223a", I see it trying to change other load balancers as well. For instance, trying to detach all subnets from load balancer aeefd93635a194b9a9a160cf67e3edea (and getting an error, because AWS doesn't allow this).
---
This is similar to what is noted in the other bug, but in this case AWS is blocking it.
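
For reference, a way to pull those CloudTrail events from the CLI (a sketch; the region, time window and event-name filter are assumptions based on what AWS reported):
~~~
# Pull recent DetachLoadBalancerFromSubnets calls (adjust region and time window as needed)
aws cloudtrail lookup-events \
  --region us-east-1 \
  --start-time 2021-09-20T00:00:00Z --end-time 2021-09-21T00:00:00Z \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DetachLoadBalancerFromSubnets
~~~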

Version-Release number of selected component (if applicable):
OCP 4.7.30

How reproducible:
Not reproducible on a test cluster yet; it is intermittent on the customer's cluster. On initial review it looks like the issue is triggered by a change to the deployment/ReplicaSet.

Steps to Reproduce:
1.
2.
3.

Actual results:
Once the issue is triggered, the subnet associated with a particular AWS ELB is swapped by a service account, which causes a network outage.

Expected results:
Triggering events should not change the AWS ELB subnets.

Additional info:
Looking through the attached CloudTrail logs, we can see a number of potential service accounts making changes to the ELB just prior to the ELB subnet changing to a subnet that causes network issues. This event is then followed by an AWS account holder flipping the subnet back to the correct one.

Some of these network events appear to be associated with changes to the deployment/ReplicaSet:
---
$ oc get replicasets -n istio-system -o jsonpath='{range .items[*]}{@.metadata.creationTimestamp}{" "}{@.metadata.name}{" "}{"\n"}{end}' | column -t | grep "private-ingressgateway" | sort -rd
2021-09-20T17:46:47Z  private-ingressgateway-5666c9cfc7
2021-09-03T03:49:01Z  private-ingressgateway-7dfc88c876
2021-08-19T17:04:13Z  private-ingressgateway-69b9bb9c74
...
---
These dates match up with events made by a service account in CloudTrail.


Since this is a Managed Service we have full access to the cluster and enough access to view most of the AWS account information where the cluster resides.

Please let us know what information you would like to see.

Comment 1 Joel Speed 2021-10-08 11:35:42 UTC
If this is reproducible or still ongoing, a must-gather with the logs from the cluster would be particularly useful. It would also be good to know exactly which Service is causing the issues and to see the details of any configuration/annotations on the Service object itself.
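
For example, something along these lines would capture that (a sketch; the namespace and service name are guesses based on the ReplicaSets in the description and may differ):
~~~
# Identify the LoadBalancer-type services, then dump the full object,
# including any service.beta.kubernetes.io/aws-load-balancer-* annotations
oc get svc -n istio-system
oc get svc private-ingressgateway -n istio-system -o yaml
~~~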

Comment 2 David 2021-10-15 10:07:57 UTC
Hi Joel

It seems to happen intermittently; yeah, the best kind of bug, right? It has only happened a handful of times over the lifetime of the cluster (6 months).

It seems to be triggered at the same time the replica set is changed:
~~~
# Look for the replicaset that manages the ELB
$ oc get replicasets -n istio-system -o jsonpath='{range .items[*]}{@.metadata.creationTimestamp}{" "}{@.metadata.name}{" "}{"\n"}{end}' | column -t | grep "private-ingressgateway" | sort -rd
2021-09-20T17:46:47Z  private-ingressgateway-5666c9cfc7
2021-09-03T03:49:01Z  private-ingressgateway-7dfc88c876
2021-08-19T17:04:13Z  private-ingressgateway-69b9bb9c74
2021-07-15T17:04:10Z  private-ingressgateway-65b5575ccd
2021-07-08T20:35:28Z  private-ingressgateway-6d8f9468d5
2021-06-23T17:58:09Z  private-ingressgateway-668cb56787
2021-05-20T20:38:21Z  private-ingressgateway-54949bcddb
~~~

These dates match up with the CloudTrail dates that show the AWS service accounts
 - "i-04ac58167e03a223a",
 - "i-07d0a1f5ab6601b61",
 - "i-0eee5e4d55badd5bf"
breaking the cluster by modifying the subnets on the impacted ELB.



Questions
---------
I can certainly grab the must-gather, but which projects / namespaces / operators should be included?
I assume istio-system, but which others have the ability to modify AWS resources?
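
For reference, this is what I was planning to run (a sketch; the namespace list is only my guess at what's relevant):
~~~
# Default must-gather plus a targeted inspect of the mesh namespace
oc adm must-gather --dest-dir=./must-gather
oc adm inspect ns/istio-system --dest-dir=./inspect-istio-system
~~~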

The AWS service accounts above do not map directly to OCP service accounts. I am wondering how I would work out which pods / operators are actually making AWS calls using the above AWS service accounts?
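
One thing I can do is map those EC2 instance IDs back to cluster nodes via the node providerID (a sketch; assumes the standard AWS providerID format of aws:///<az>/<instance-id>):
~~~
# Map EC2 instance IDs to node names
oc get nodes -o jsonpath='{range .items[*]}{@.spec.providerID}{" "}{@.metadata.name}{"\n"}{end}' \
  | grep -E 'i-04ac58167e03a223a|i-07d0a1f5ab6601b61|i-0eee5e4d55badd5bf'
~~~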

I would like to see if I can work out a way to trigger this issue on demand. Is there a way I might be able to simulate an istio operator change?
As mentioned above, it appears to coincide with a new ReplicaSet. Running:
~~~
# Look for all replicasets
$ oc get replicasets -n istio-system -o jsonpath='{range .items[*]}{@.metadata.creationTimestamp}{" "}{@.metadata.name}{" "}{"\n"}{end}' | column -t | sort -rd
2021-09-20T17:46:48Z  grafana-7776848dbd
2021-09-20T17:46:47Z  public-ingressgateway-64dfc6d74f
2021-09-20T17:46:47Z  private-ingressgateway-5666c9cfc7
2021-09-20T17:46:47Z  partner-ingressgateway-68db65c4fd
2021-09-20T17:46:47Z  istio-ingressgateway-57f8b48f4f
2021-09-20T17:46:47Z  istio-egressgateway-784c57fd99
2021-09-20T17:46:20Z  prometheus-dc5b86859
2021-09-20T17:45:58Z  istiod-nonprod-service-mesh-smcp-794b89b4b6
2021-09-03T03:49:02Z  public-ingressgateway-7dc9bff7b6
2021-09-03T03:49:02Z  partner-ingressgateway-655b69d96c
2021-09-03T03:49:02Z  istio-egressgateway-57d958bbfd
2021-09-03T03:49:01Z  private-ingressgateway-7dfc88c876
2021-09-03T03:49:01Z  istio-ingressgateway-85ff4c9b5b
2021-09-03T03:48:26Z  prometheus-76f756cbc
~~~

I can see that the change in ReplicaSet happens for all deployments at the same time, so it might be safe to assume that this change is triggered by a version change to the operator. Is there a way I might be able to simulate this on the cluster to see if that will trigger the issue? I have already tried deleting the pod associated with this, but that did not work.
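
One generic way to force a new ReplicaSet for every deployment at once, without an actual operator upgrade (untested against this issue; assumes the deployments tolerate a rolling restart):
~~~
# Trigger a rolling restart of every deployment in the namespace; each gets a fresh ReplicaSet
for deploy in $(oc get deployments -n istio-system -o name); do
  oc rollout restart -n istio-system "$deploy"
done
~~~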


I know I have allocated this bug to Cloud Compute, even though istio is involved, but I have assumed that istio is just an operator running on the cluster and would not be given AWS access. And it is OpenShift/Kubernetes that is creating the ELB based on the OCP Service object that is being turned into a LoadBalancer.

Comment 3 Joel Speed 2021-10-25 10:55:13 UTC
The component that is likely to be making these changes is the KubeControllerManager, as this is the component currently responsible for updates to the load balancer. So it would be good to grab anything KubeControllerManager related.
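
For example (a sketch; the namespace follows the standard OCP layout, and the grep patterns are guesses at the relevant cloud-provider log lines):
~~~
# Capture the kube-controller-manager namespace and scan its logs for load balancer activity
oc adm inspect ns/openshift-kube-controller-manager --dest-dir=./kcm-inspect
for pod in $(oc get pods -n openshift-kube-controller-manager -o name); do
  oc logs -n openshift-kube-controller-manager --all-containers --timestamps "$pod" \
    | grep -iE 'EnsureLoadBalancer|DetachLoadBalancerFromSubnets|subnet' || true
done
~~~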

Additionally, I would gather `Service`, `Endpoints`, `Machines` and `Node` objects from the cluster so that we can try to work out where the pods are being assigned (node and machine wise) and then which pods are expected to be in service based on the endpoints and finally any configuration added to the Service objects.
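
Something like the following would cover that (a sketch; output formats and file names are just suggestions):
~~~
# Dump the objects needed to correlate pods, nodes, machines and the LB-backed service
oc get svc,endpoints -n istio-system -o yaml > istio-svc-endpoints.yaml
oc get machines -n openshift-machine-api -o yaml > machines.yaml
oc get nodes -o yaml > nodes.yaml
~~~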

As for simulating a rollout, I don't have any suggestions off the top of my head, but I wonder if this issue is reproducible with a different workload that isn't istio? That might allow us to simplify a reproduction environment.
I wouldn't expect Istio to be responsible for managing anything on the AWS side, but I could be wrong; I'm not very familiar with the Istio project.

Comment 4 Joel Speed 2021-11-26 17:23:51 UTC
There has been no activity on this bug, nor on the connected customer case, for over a month. At this point there isn't much we can do without additional details.

Please reopen this bug if the issue occurs again