Description of problem:
It looks like OCP is executing "DetachLoadBalancerFromSubnets" in AWS, causing an outage. We (and the customer) cannot see these calls in the OCP logs.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Once

Steps to Reproduce:
1. Remove tags on the target groups

Actual results:
Logs don't show cloud provider calls.

Expected results:
Logs should show cloud provider calls.

Additional info:
The customer deployed OCP 4.7 on AWS (UPI) and was notified by their AWS rep that OCP tried to delete all three target groups from the load balancer (classic) that is in front of the OCP ingress controller (and AWS logs show OCP issuing DetachLoadBalancerFromSubnets), but AWS code prevents the last target group from being removed. However, the removal of two target groups impacted the customer's cluster, which became inaccessible because neither router pod could be reached for authentication.

Upon investigation, they thought the label on the target groups (kubernetes.io/cluster/ocp-prod-5j6wg shared) had been removed and that this could have caused the cloud provider to issue the DetachLoadBalancerFromSubnets call. After they fixed the label issue, the problem did not recur, but they are interested in knowing how to trace the cloud provider calls in future. Interestingly, when I remove the tags in our local setups, I don't see the same issue happening.

The customer's request is: how can we trace the cloud provider calls, so they can see what is being done and when?
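For reference, the AWS-side record mentioned in the description can be searched mechanically. The following is a minimal sketch, assuming the "AWS logs" are CloudTrail log files (the report does not say which AWS log was used) and that each file is a JSON document with a top-level `Records` array. The function name `find_api_calls` is hypothetical; it is not part of any AWS or OpenShift tool.

```python
import json

# Event source that CloudTrail records for Elastic Load Balancing API calls.
ELB_EVENT_SOURCE = "elasticloadbalancing.amazonaws.com"


def find_api_calls(log_text, event_name, event_source=ELB_EVENT_SOURCE):
    """Return matching records from the JSON text of one CloudTrail log file.

    Hypothetical helper: assumes the usual CloudTrail layout, where each
    record carries eventTime, eventSource, eventName and userIdentity.
    """
    records = json.loads(log_text).get("Records", [])
    matches = []
    for rec in records:
        if rec.get("eventName") != event_name:
            continue
        if rec.get("eventSource") != event_source:
            continue
        matches.append({
            "time": rec.get("eventTime"),
            # The ARN shows which identity issued the call.
            "caller": rec.get("userIdentity", {}).get("arn"),
            # Left unparsed: the exact parameter layout is not assumed here.
            "parameters": rec.get("requestParameters"),
        })
    return matches
```

Calling `find_api_calls(text, "DetachLoadBalancerFromSubnets")` on each log file would list when the call was made and by which identity, which covers the "when" and "what" of the request even while the cluster logs stay silent.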
Any chance we could get a must-gather, or the logs from the machine-api controllers?
This bug seems to be a request for information regarding tracing AWS calls. At the moment, the load balancer attachment for ingress is handled by the Kube Controller Manager. If there are issues, or the customer wants to know more about what's happening there, they should review the Kube Controller Manager logs. Eventually this will move into a dedicated cloud controller manager (approximately 4.11), which might make it easier to trace in the longer term. Anand, is that sufficient for the customer?
@jspeed we could not find any info in the default controller manager logs. Do you think trace logs will contain the calls we are issuing to AWS? Thanks, Anand
Yes, I think they may. By the time you get to about `-v=8` on the logging within KCM, it should log every single network request and response that it makes. In that case you could filter the logs to determine which calls to AWS are being made.
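Whatever verbosity is chosen, the filtering step could be sketched roughly as follows. This assumes the KCM logs use the standard klog text header (`Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg`) and that the AWS provider code logs from source files whose names start with `aws`; both are assumptions to check against the real logs. The helper `filter_klog` is hypothetical.

```python
import re

# Standard klog text header: severity, mmdd, time, thread id, file:line, message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def filter_klog(lines, source_prefix="aws"):
    """Yield parsed log entries emitted from source files matching the prefix.

    Hypothetical helper: lines that do not match the klog header (for
    example wrapped continuation lines) are skipped.
    """
    for raw in lines:
        match = KLOG_LINE.match(raw.rstrip("\n"))
        if match is None:
            continue
        if not match.group("file").startswith(source_prefix):
            continue
        entry = match.groupdict()
        entry["line"] = int(entry["line"])
        yield entry
```

Filtering on the source file rather than on message text avoids guessing the exact wording of the provider's log messages.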
OK. Let me check the debug logs then. In the future controller, is there a plan to expose the calls at a high level in the normal logs, so customers need not turn on debug logs to trace them?
I'm aware that it does log some of the calls it makes, or at least it seems to, but I don't know if it explicitly logs all calls. Something we can look into though.
Some notes on what's happening here:
- The detach call is here [1]; there is only one log at level 2 and it doesn't give specific details.
- The detach call is intended to remove any subnets that are currently attached to the load balancer but are no longer desired [2].
- The subnet IDs come from getLoadBalancerSubnets [3], which either reads the subnet IDs from an annotation [4] or works them out from the subnets available in the VPC [5].
- Based on the original description, and looking at the code in [5], I can confirm that the lack of the cluster label on the subnets is what caused them to be removed [6].

In terms of increasing visibility into this issue, all of the relevant logging is set at level 2 (-v=2), so going any higher than this doesn't help the situation.

Otherwise, I don't think there's much I can recommend. Kubernetes resources rely on being appropriately tagged across many components; admins should not remove these tags, as doing so interferes with ownership. I'm not sure if there's anything we can explicitly recommend to the customer here.

[1]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1049
[2]: https://github.com/kubernetes/cloud-provider-aws/blob/59ae724ba8a09ca5b6266a8452e937e3e99a6953/pkg/providers/v1/aws_loadbalancer.go#L1042
[3]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L4404
[4]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3801-L3803
[5]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3683
[6]: https://github.com/kubernetes/cloud-provider-aws/blob/a1590733fac851b3a27d351c9c80e9b2bf8d6f7e/pkg/providers/v1/aws.go#L3655-L3656
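To make the reconciliation described in the notes above concrete, here is a simplified sketch in Python. It is not the provider's Go code: the function names and the dictionary layout for subnets are hypothetical, and the real implementation has further rules (for example choosing between several subnets in one availability zone) that are left out here.

```python
CLUSTER_TAG_PREFIX = "kubernetes.io/cluster/"


def desired_subnets(vpc_subnets, cluster_id, annotated_ids=None):
    """Work out which subnet IDs the load balancer should be attached to.

    Simplified assumption: subnet IDs given by annotation win; otherwise
    only VPC subnets carrying the cluster tag are considered.
    """
    if annotated_ids:
        return set(annotated_ids)
    tag_key = CLUSTER_TAG_PREFIX + cluster_id
    return {s["id"] for s in vpc_subnets if tag_key in s.get("tags", {})}


def plan_subnet_changes(attached, desired):
    """Compare attached and desired subnets, as the reconcile step does."""
    attached, desired = set(attached), set(desired)
    return {
        "attach": sorted(desired - attached),
        # Anything attached but no longer desired is detached.
        "detach": sorted(attached - desired),
    }


if __name__ == "__main__":
    # Illustrative data only: two of three subnets have lost the cluster tag.
    subnets = [
        {"id": "subnet-a", "tags": {"kubernetes.io/cluster/ocp-prod-5j6wg": "shared"}},
        {"id": "subnet-b", "tags": {}},
        {"id": "subnet-c", "tags": {}},
    ]
    wanted = desired_subnets(subnets, "ocp-prod-5j6wg")
    print(plan_subnet_changes(["subnet-a", "subnet-b", "subnet-c"], wanted))
```

Under these assumptions, a subnet that loses its cluster tag drops out of the desired set and lands in the `detach` list on the next reconcile, which matches the behaviour reported in the description.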
Customer case was closed
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days