Bug 1978396 - Cannot trace cloud provider calls; it looks like OCP is executing "DetachLoadBalancerFromSubnets" in AWS, causing an outage
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-01 17:40 UTC by Anand Paladugu
Modified: 2023-09-15 01:10 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-09 12:53:35 UTC
Target Upstream Version:
Embargoed:



Description Anand Paladugu 2021-07-01 17:40:27 UTC
Description of problem:

It looks like OCP is executing "DetachLoadBalancerFromSubnets" calls in AWS, causing an outage. We (and the customer) cannot see these calls in the OCP logs.

Version-Release number of selected component (if applicable):

4.7

How reproducible:

Once

Steps to Reproduce:
1. Remove the kubernetes.io/cluster/<infra-id> tags from the target groups

Actual results:

Logs don't show the cloud provider calls.

Expected results:

Logs should show the cloud provider calls.

Additional info:

The customer deployed OCP 4.7 on AWS (UPI) and was notified by their AWS rep that OCP had tried to delete all three target groups from the classic load balancer in front of the OCP ingress controller (AWS logs show OCP issuing DetachLoadBalancerFromSubnets), but AWS prevents the last target group from being removed. The removal of the other two target groups still impacted the customer's cluster: it became inaccessible because neither router pod could be reached for authentication.

Upon investigation, they believe the label on the target groups (kubernetes.io/cluster/ocp-prod-5j6wg = shared) had been removed, and that this could have caused the cloud provider to issue the DetachLoadBalancerFromSubnets call. After they fixed the label, the issue did not recur, but they are interested in knowing how to trace the cloud provider calls in the future. Interestingly, when I remove the tags in our local setups, I don't see the same issue. The customer's request is: how can we trace the cloud provider calls, so they can see what is being done and when?

Comment 1 Michael McCune 2021-07-01 20:19:39 UTC
Any chance we could get a must-gather, or the logs from the machine-api controllers?

Comment 2 Joel Speed 2021-08-16 12:39:43 UTC
This bug seems to be a request for information regarding tracing AWS calls.

At the moment, the load balancer attachment for ingress is handled by the kube-controller-manager (KCM).
If there are issues, or the customer wants to know more about what's happening there, they should review the kube-controller-manager logs (a minimal client-go sketch for pulling those logs follows).
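As a point of reference, a minimal client-go sketch for pulling those logs, assuming the standard openshift-kube-controller-manager namespace and container name; the app=kube-controller-manager label selector is an assumption and may need adjusting on a given cluster:

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (e.g. the one written by `oc login`).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "openshift-kube-controller-manager"
	// Assumed label selector for the KCM static pods; adjust if it doesn't match.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app=kube-controller-manager",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		req := client.CoreV1().Pods(ns).GetLogs(pod.Name, &corev1.PodLogOptions{
			Container: "kube-controller-manager",
		})
		stream, err := req.Stream(context.TODO())
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- logs from %s ---\n", pod.Name)
		_, _ = io.Copy(os.Stdout, stream)
		stream.Close()
	}
}

The equivalent with the CLI is `oc logs -n openshift-kube-controller-manager <pod> -c kube-controller-manager`.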

Eventually this will move into a dedicated cloud controller manager (approx 4.11), which might make it easier to trace in the longer term.

Anand, is that sufficient for the customer?

Comment 3 Anand Paladugu 2021-08-20 17:24:07 UTC
@jspeed We could not find any info in the default controller manager logs. Do you think trace logs will contain the calls we are issuing to AWS?

Thanks

Anand

Comment 4 Joel Speed 2021-08-22 18:37:11 UTC
Yes, I think they may. At around `-v=8` on the KCM logging, it should log every single network request and response that it makes. You could then filter the logs to determine which calls to AWS are being made, for example with a small filter like the one sketched below.
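A rough sketch of that kind of filtering, assuming the ELB operation names appear verbatim in the verbose output (not guaranteed); pipe the saved KCM logs through it and it prints only the lines mentioning the calls of interest:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// ELB operations relevant to the classic load balancer in front of the routers.
	ops := regexp.MustCompile(`DetachLoadBalancerFromSubnets|AttachLoadBalancerToSubnets|DescribeLoadBalancers`)

	scanner := bufio.NewScanner(os.Stdin)
	// Allow long lines, since full request/response dumps can be large.
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if line := scanner.Text(); ops.MatchString(line) {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scan error:", err)
		os.Exit(1)
	}
}

For example: oc logs -n openshift-kube-controller-manager <pod> -c kube-controller-manager | go run filter.go (the file name is arbitrary).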

Comment 5 Anand Paladugu 2021-08-23 11:50:59 UTC
OK. Let me try checking the debug logs then. For the future controller, is there a plan to expose the calls at a high level in the normal logs, so customers need not turn on debug logging to trace the calls?

Comment 6 Joel Speed 2021-08-23 16:33:35 UTC
I'm aware that it does log some of the calls it makes, or at least it seems to, but I don't know if it explicitly logs all calls. Something we can look into, though.

Comment 8 Joel Speed 2021-10-29 11:33:59 UTC
Some notes on what's happening here:

- The detach call is here [1]; there is only one log message for it, at level 2, and it doesn't give specific details
- The detach call is intended to remove any subnets that are currently attached to the load balancer but are no longer desired [2]
- The subnet IDs come from getLoadBalancerSubnets [3], which either reads the subnet IDs from an annotation [4] or works them out from the subnets available in the VPC [5]
- Based on the original description, and looking at the code in [5], I can confirm that the lack of the cluster label on the subnets is what caused them to be removed [6] (a simplified sketch of this tag check follows the links below)

In terms of increasing visibility into this issue, the relevant log message is emitted at level 2 (-v=2), so raising the verbosity beyond that doesn't add any more detail here.
Otherwise, I don't think there's much I can recommend. Kubernetes resources rely on being appropriately tagged across many components; admins should not remove these tags, as doing so interferes with ownership tracking.

I'm not sure if there's anything we can explicitly recommend to the customer here.

[1]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1049
[2]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1042
[3]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L4404
[4]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3801-L3803
[5]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3683
[6]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3655-L3656
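To make [5] and [6] concrete, a simplified, self-contained sketch of the tag check (this is not the actual cloud-provider-aws code): a subnet that loses its kubernetes.io/cluster/<cluster-id> tag stops being considered a cluster subnet, so the provider tries to detach the load balancer from it on the next reconcile.

package main

import "fmt"

// Subnet is a minimal stand-in for the AWS SDK subnet type used by the real code.
type Subnet struct {
	ID   string
	Tags map[string]string
}

const clusterTagPrefix = "kubernetes.io/cluster/"

// subnetsForCluster keeps only the subnets tagged for the given cluster ID,
// mirroring the kind of filtering the cloud provider does when deciding which
// subnets the load balancer should remain attached to.
func subnetsForCluster(subnets []Subnet, clusterID string) []Subnet {
	var out []Subnet
	for _, s := range subnets {
		if _, ok := s.Tags[clusterTagPrefix+clusterID]; ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	clusterID := "ocp-prod-5j6wg" // infra ID from the original description
	subnets := []Subnet{
		{ID: "subnet-a", Tags: map[string]string{clusterTagPrefix + clusterID: "shared"}},
		{ID: "subnet-b", Tags: map[string]string{}}, // cluster tag removed by mistake
	}
	for _, s := range subnetsForCluster(subnets, clusterID) {
		fmt.Println("still treated as a cluster subnet:", s.ID)
	}
	// subnet-b is no longer returned, which is what ultimately leads to a
	// DetachLoadBalancerFromSubnets call for it.
}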

Comment 10 Joel Speed 2022-03-09 12:53:35 UTC
Customer case was closed

Comment 11 Red Hat Bugzilla 2023-09-15 01:10:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

