Bug 1978396 - Cannot trace cloud provider calls; it looks like OCP is executing "DetachLoadBalancerFromSubnets" in AWS, causing an outage
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-01 17:40 UTC by Anand Paladugu
Modified: 2023-09-15 01:10 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-09 12:53:35 UTC
Target Upstream Version:
Embargoed:



Description Anand Paladugu 2021-07-01 17:40:27 UTC
Description of problem:

It looks like OCP is executing "DetachLoadBalancerFromSubnets" calls in AWS, causing an outage. We (and the customer) cannot see these calls in the OCP logs.

Version-Release number of selected component (if applicable):

4.7

How reproducible:

Once

Steps to Reproduce:
1. Remove the kubernetes.io/cluster/<infra-id> tags from the target groups

Actual results:

Logs don't show the cloud provider calls.

Expected results:

Logs should show the cloud provider calls.

Additional info:

The customer deployed OCP 4.7 on AWS (UPI) and was notified by their AWS rep that OCP had tried to delete all three target groups from the classic load balancer in front of the OCP ingress controller (AWS logs show OCP issuing DetachLoadBalancerFromSubnets), but AWS prevents the last target group from being removed. The removal of the other two target groups still impacted the customer's cluster: it became inaccessible because neither router pod could be reached for authentication.

Upon investigation, they believe the label on the target groups (kubernetes.io/cluster/ocp-prod-5j6wg = shared) had been removed, and that this could have caused the cloud provider to issue the DetachLoadBalancerFromSubnets call. After they fixed the label, the issue did not recur, but they are interested in knowing how to trace the cloud provider calls in the future. Interestingly, when I remove the tags in our local setups, I don't see the same issue. The customer's request is: how can we trace the cloud provider calls, so they can see what is being done and when?

Comment 1 Michael McCune 2021-07-01 20:19:39 UTC
Any chance we could get a must-gather, or the logs from the machine-api controllers?

Comment 2 Joel Speed 2021-08-16 12:39:43 UTC
This bug seems to be a request for information regarding tracing AWS calls.

At the moment, the load balancer attachment for ingress is handled by the kube-controller-manager (KCM).
If there are issues, or the customer wants to know more about what's happening there, they should review the kube-controller-manager logs (a minimal client-go sketch for pulling those logs follows).
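As a point of reference, a minimal client-go sketch for pulling those logs, assuming the standard openshift-kube-controller-manager namespace and container name; the app=kube-controller-manager label selector is an assumption and may need adjusting on a given cluster:

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (e.g. the one written by `oc login`).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "openshift-kube-controller-manager"
	// Assumed label selector for the KCM static pods; adjust if it doesn't match.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app=kube-controller-manager",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		req := client.CoreV1().Pods(ns).GetLogs(pod.Name, &corev1.PodLogOptions{
			Container: "kube-controller-manager",
		})
		stream, err := req.Stream(context.TODO())
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- logs from %s ---\n", pod.Name)
		_, _ = io.Copy(os.Stdout, stream)
		stream.Close()
	}
}

The equivalent with the CLI is `oc logs -n openshift-kube-controller-manager <pod> -c kube-controller-manager`.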

Eventually this will move into a dedicated cloud controller manager (approx 4.11), which might make it easier to trace in the longer term.

Anand, is that sufficient for the customer?

Comment 3 Anand Paladugu 2021-08-20 17:24:07 UTC
@jspeed We could not find any info in the default controller manager logs. Do you think trace logs will contain the calls we are issuing to AWS?

Thanks

Anand

Comment 4 Joel Speed 2021-08-22 18:37:11 UTC
Yes, I think they may. At around `-v=8` on the KCM logging, it should log every single network request and response that it makes. You could then filter the logs to determine which calls to AWS are being made, for example with a small filter like the one sketched below.
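A rough sketch of that kind of filtering, assuming the ELB operation names appear verbatim in the verbose output (not guaranteed); pipe the saved KCM logs through it and it prints only the lines mentioning the calls of interest:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// ELB operations relevant to the classic load balancer in front of the routers.
	ops := regexp.MustCompile(`DetachLoadBalancerFromSubnets|AttachLoadBalancerToSubnets|DescribeLoadBalancers`)

	scanner := bufio.NewScanner(os.Stdin)
	// Allow long lines, since full request/response dumps can be large.
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if line := scanner.Text(); ops.MatchString(line) {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scan error:", err)
		os.Exit(1)
	}
}

For example: oc logs -n openshift-kube-controller-manager <pod> -c kube-controller-manager | go run filter.go (the file name is arbitrary).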

Comment 5 Anand Paladugu 2021-08-23 11:50:59 UTC
OK. Let me try checking the debug logs then. For the future controller, is there a plan to expose the calls at a high level in the normal logs, so customers need not turn on debug logging to trace the calls?

Comment 6 Joel Speed 2021-08-23 16:33:35 UTC
I'm aware that it does log some of the calls it makes, or at least it seems to, but I don't know if it explicitly logs all calls. Something we can look into, though.

Comment 8 Joel Speed 2021-10-29 11:33:59 UTC
Some notes on what's happening here:

- The detach call is here [1]; there is only one log message for it, at level 2, and it doesn't give specific details
- The detach call is intended to remove any subnets that are currently attached to the load balancer but are no longer desired [2]
- The subnet IDs come from getLoadBalancerSubnets [3], which either reads the subnet IDs from an annotation [4] or works them out from the subnets available in the VPC [5]
- Based on the original description, and looking at the code in [5], I can confirm that the lack of the cluster label on the subnets is what caused them to be removed [6] (a simplified sketch of this tag check follows the links below)

In terms of increasing visibility into this issue, the relevant log message is emitted at level 2 (-v=2), so raising the verbosity beyond that doesn't add any more detail here.
Otherwise, I don't think there's much I can recommend. Kubernetes resources rely on being appropriately tagged across many components; admins should not remove these tags, as doing so interferes with ownership tracking.

I'm not sure if there's anything we can explicitly recommend to the customer here.

[1]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1049
[2]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1042
[3]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L4404
[4]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3801-L3803
[5]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3683
[6]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3655-L3656
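To make [5] and [6] concrete, a simplified, self-contained sketch of the tag check (this is not the actual cloud-provider-aws code): a subnet that loses its kubernetes.io/cluster/<cluster-id> tag stops being considered a cluster subnet, so the provider tries to detach the load balancer from it on the next reconcile.

package main

import "fmt"

// Subnet is a minimal stand-in for the AWS SDK subnet type used by the real code.
type Subnet struct {
	ID   string
	Tags map[string]string
}

const clusterTagPrefix = "kubernetes.io/cluster/"

// subnetsForCluster keeps only the subnets tagged for the given cluster ID,
// mirroring the kind of filtering the cloud provider does when deciding which
// subnets the load balancer should remain attached to.
func subnetsForCluster(subnets []Subnet, clusterID string) []Subnet {
	var out []Subnet
	for _, s := range subnets {
		if _, ok := s.Tags[clusterTagPrefix+clusterID]; ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	clusterID := "ocp-prod-5j6wg" // infra ID from the original description
	subnets := []Subnet{
		{ID: "subnet-a", Tags: map[string]string{clusterTagPrefix + clusterID: "shared"}},
		{ID: "subnet-b", Tags: map[string]string{}}, // cluster tag removed by mistake
	}
	for _, s := range subnetsForCluster(subnets, clusterID) {
		fmt.Println("still treated as a cluster subnet:", s.ID)
	}
	// subnet-b is no longer returned, which is what ultimately leads to a
	// DetachLoadBalancerFromSubnets call for it.
}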

Comment 10 Joel Speed 2022-03-09 12:53:35 UTC
Customer case was closed

Comment 11 Red Hat Bugzilla 2023-09-15 01:10:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

