Bug 2010083
| Summary: | Service - Type: Load Balancer swaps subnets causing network outage | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David <dsquirre> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Cloud Controller Manager | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-26 17:23:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description David 2021-10-03 15:18:36 UTC
If this is reproducible or still ongoing, a must-gather with the logs from the cluster would be particularly useful. It would also be good to know exactly which service is causing the issues and to see the details of any configuration/annotations on the Service object itself.

Hi Joe,
It seems to happen intermittently, yeah, the best type of bugs, right? It has only happened a handful of times over the lifetime of the cluster (6 months).
It seems to be triggered at the same time the replica set is changed:
~~~
# Look for the replicaset that manages the ELB
$ oc get replicasets -n istio-system -o jsonpath='{range .items[*]}{@.metadata.creationTimestamp}{" "}{@.metadata.name}{" "}{"\n"}{end}' | column -t | grep "private-ingressgateway" | sort -rd
2021-09-20T17:46:47Z private-ingressgateway-5666c9cfc7
2021-09-03T03:49:01Z private-ingressgateway-7dfc88c876
2021-08-19T17:04:13Z private-ingressgateway-69b9bb9c74
2021-07-15T17:04:10Z private-ingressgateway-65b5575ccd
2021-07-08T20:35:28Z private-ingressgateway-6d8f9468d5
2021-06-23T17:58:09Z private-ingressgateway-668cb56787
2021-05-20T20:38:21Z private-ingressgateway-54949bcddb
~~~
These dates match up with the CloudTrail dates that show the AWS service accounts below breaking the cluster by modifying the subnets on the impacted ELB:
- i-04ac58167e03a223a
- i-07d0a1f5ab6601b61
- i-0eee5e4d55badd5bf
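To confirm what changed on the AWS side, something like the following should show the subnets currently attached to the ELB and the annotations on the Service that owns it (the load balancer name is a placeholder, and I'm assuming the Service is named private-ingressgateway):
~~~
# Subnets and AZs currently attached to the classic ELB
aws elb describe-load-balancers \
  --load-balancer-names <elb-name> \
  --query 'LoadBalancerDescriptions[].{Name:LoadBalancerName,Subnets:Subnets,AZs:AvailabilityZones}'

# Annotations and spec on the Service backing that ELB
oc get svc private-ingressgateway -n istio-system -o yaml
~~~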
Questions
---------
I can certainly grab the must-gather, but which projects / namespaces / operators should be included?
I assume istio-system, but which others have the ability to modify AWS resources?
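In case it helps, once the namespace list is confirmed I would expect to run something like this (the namespaces below are just my guess):
~~~
# Full must-gather plus a targeted inspect of the namespaces I suspect are involved
oc adm must-gather --dest-dir=./must-gather
oc adm inspect ns/istio-system ns/openshift-kube-controller-manager --dest-dir=./inspect
~~~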
The AWS service accounts above do not map directly to OCP service accounts. I am wondering how I would work out which pods / operators are actually making AWS calls using the above AWS service accounts.
I would like to see if I can work out a way to trigger this issue on demand. Is there a way I might be able to simulate an istio operator change?
As mentioned above it appears to coincide with a new replicaset. Running:
~~~
# Look for all replicasets
$ oc get replicasets -n istio-system -o jsonpath='{range .items[*]}{@.metadata.creationTimestamp}{" "}{@.metadata.name}{" "}{"\n"}{end}' | column -t | sort -rd
2021-09-20T17:46:48Z grafana-7776848dbd
2021-09-20T17:46:47Z public-ingressgateway-64dfc6d74f
2021-09-20T17:46:47Z private-ingressgateway-5666c9cfc7
2021-09-20T17:46:47Z partner-ingressgateway-68db65c4fd
2021-09-20T17:46:47Z istio-ingressgateway-57f8b48f4f
2021-09-20T17:46:47Z istio-egressgateway-784c57fd99
2021-09-20T17:46:20Z prometheus-dc5b86859
2021-09-20T17:45:58Z istiod-nonprod-service-mesh-smcp-794b89b4b6
2021-09-03T03:49:02Z public-ingressgateway-7dc9bff7b6
2021-09-03T03:49:02Z partner-ingressgateway-655b69d96c
2021-09-03T03:49:02Z istio-egressgateway-57d958bbfd
2021-09-03T03:49:01Z private-ingressgateway-7dfc88c876
2021-09-03T03:49:01Z istio-ingressgateway-85ff4c9b5b
2021-09-03T03:48:26Z prometheus-76f756cbc
~~~
I can see that the replicaset change happens for all deployments at the same time, so it might be safe to assume that this change is triggered by a version change to the operator. Is there a way I might be able to simulate this on the cluster to see if that will trigger the issue? I have already tried deleting the pod associated with this, but that did not work.
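In case it is useful as a reproduction step, a rollout restart of the gateway Deployment might approximate an operator-driven rollout without changing the operator version (the Deployment name is assumed from the replicaset names above, and the Istio operator may simply revert this):
~~~
# Force a new replicaset by restarting the rollout (adds a restart annotation to the pod template)
oc rollout restart deployment/private-ingressgateway -n istio-system
oc rollout status deployment/private-ingressgateway -n istio-system
~~~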
I know I have allocated this bug to Cloud Compute even though Istio is involved, but I have assumed that Istio is just an operator running on the cluster and would not be given AWS access, and that it is OpenShift/Kubernetes that creates the ELB based on the OCP Service object that is being turned into a load balancer.
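To illustrate what I mean, a generic internal LoadBalancer Service on AWS looks roughly like the sketch below; the name, selector, and ports are made up and are not the actual object on this cluster, but the internal annotation is what asks the AWS cloud provider to create an internal ELB:
~~~
# Illustrative only - not the actual customer Service
cat <<'EOF' | oc apply -n istio-system -f -
apiVersion: v1
kind: Service
metadata:
  name: example-internal-lb
  annotations:
    # Ask the AWS cloud provider for an internal-facing classic ELB
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - name: https
    port: 443
    targetPort: 8443
EOF
~~~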
The component that is likely to be making these changes is the KubeControllerManager, as this is the component that is currently responsible for the updates of the load balancer, so it would be good to grab anything kube-controller-manager related. Additionally, I would gather the `Service`, `Endpoints`, `Machines` and `Node` objects from the cluster so that we can try to work out where the pods are being assigned (node and machine wise), then which pods are expected to be in service based on the endpoints, and finally any configuration added to the Service objects (a rough set of commands for this is sketched at the end of this report).

As for simulating a rollout, I don't have any suggestions off the top of my head, but I wonder if this issue is reproducible with a different workload that isn't Istio? That might allow us to simplify a reproduction environment. I wouldn't expect Istio to be responsible for managing anything on the AWS side, but I could be wrong; I'm not very familiar with the Istio project.

There has been no activity on this bug, nor on the connected customer case, for over a month. At this point there isn't much we can do without additional details. Please reopen this bug if the issue occurs again.
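For reference, a rough set of commands covering the objects mentioned above could look like this (output paths are illustrative, and Machine objects are assumed to live in the default openshift-machine-api namespace):
~~~
# Collect the Service, Endpoints, Node and Machine objects alongside the must-gather
oc get service,endpoints -n istio-system -o yaml > istio-svc-endpoints.yaml
oc get nodes -o wide > nodes.txt
oc get machines -n openshift-machine-api -o yaml > machines.yaml
oc adm inspect ns/openshift-kube-controller-manager --dest-dir=./kcm-inspect
~~~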