Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2101397

Summary: [GSS] Due to an issue with the API server, Whereabouts' ability to query for the IP address was hampered.
Product: OpenShift Container Platform
Reporter: amansan <amanzane>
Component: Unknown
Assignee: Sudha Ponnaganti <sponnaga>
Status: CLOSED DEFERRED
QA Contact: Jianwei Hou <jhou>
Severity: high
Docs Contact:
Priority: high
Version: 4.9
CC: aglotov, akashem, azaky, brgardne, cldavey, eparis, mfojtik, openshift-bugs-escalate, parodrig, wlewis, xxia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:47:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 8 Blaine Gardner 2022-08-01 15:24:41 UTC
The issue this customer is experiencing is resulting in errors with ODF storage applications when upgrading, causing instability. Is anyone assigned to look at this issue?

I am also attempting to reset the assignees to default in case there have been any changes there causing this issue to get lost.

Blaine

Comment 15 Abu Kashem 2022-09-06 15:11:05 UTC
TLDR for me

> I0602 09:14:02.592387 2889416 request.go:655] Throttling request took 1.177089099s, request: GET:https://api-int.ocp.cloud.stc:6443/apis/lighthouse.submariner.io/v2alpha1?timeout=32s
> ...
> I0602 09:14:32.593139 2889416 leaderelection.go:243] attempting to acquire leader lease openshift-multus/whereabouts...

> There's a 30-second delay there before Whereabouts was able to get information from the k8s api (e.g. about the pod and its own IP address pools)

My thoughts:
I assume the log is from some component/operator (not kube-apiserver). Can you confirm which component the log is from?

> Throttling request took 1.177089099s,

This is a very misleading message; we fixed it in newer client-go versions. In most cases it indicates client-side throttling and has nothing to do with the apiserver.
The 30s gap in the log does not seem to be related to API slowness either.
The component in question is attempting to acquire a leader lease, and any delay in doing that will depend on the lease configuration. I would recommend checking the leader election logs of the component (multus).
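To illustrate the distinction above: the "Throttling request took ..." lines are emitted by client-go itself and measure client-side wait, not server latency. A minimal sketch (not part of any product tooling, and assuming the log lines follow the client-go request.go format quoted earlier) that extracts those client-side delays from a component's log:

```python
import re

# Matches client-go's request.go throttling message, e.g.
# "Throttling request took 1.177089099s, request: GET:https://..."
THROTTLE_RE = re.compile(
    r"Throttling request took (\d+(?:\.\d+)?)(m?s), request: (\S+)"
)

def throttled_requests(log_lines):
    """Yield (delay_seconds, request) for each client-side throttling event."""
    for line in log_lines:
        m = THROTTLE_RE.search(line)
        if m:
            value, unit, request = float(m.group(1)), m.group(2), m.group(3)
            yield (value / 1000 if unit == "ms" else value, request)

# Demo on the log line quoted in this comment:
log = [
    "I0602 09:14:02.592387 2889416 request.go:655] Throttling request took "
    "1.177089099s, request: GET:https://api-int.ocp.cloud.stc:6443/apis/"
    "lighthouse.submariner.io/v2alpha1?timeout=32s",
]
for delay, request in throttled_requests(log):
    print(f"client-side wait {delay:.3f}s for {request}")
```

If the summed client-side waits account for the observed gap, the apiserver was never involved in the delay.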

If you think the API is slow (as seen by multus?), there are a few things you can do to verify it:
- If the customer is running the kube-apiserver at log level 3 or higher, grep for 'httplog.go' in the kube-apiserver logs to see all requests coming to the apiserver; each request line includes the user agent, so you can narrow it down to the component in question and then check the latency
- If the log level is below 3, grep the audit logs and check the latency of the API requests coming from the component
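For the audit-log route, one way to compute per-request latency is the gap between `requestReceivedTimestamp` and the `stageTimestamp` of the `ResponseComplete` event, filtered by user agent. A rough sketch, assuming one JSON audit event per line with microsecond-precision RFC 3339 timestamps (the `whereabouts` user-agent string below is illustrative, not taken from this bug's logs):

```python
import json
from datetime import datetime

def request_latencies(audit_lines, user_agent_substr):
    """Yield (requestURI, latency_seconds) for one component's requests,
    from kube-apiserver audit log lines (one JSON event per line).
    Only ResponseComplete events carry the final stageTimestamp."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    for line in audit_lines:
        ev = json.loads(line)
        if ev.get("stage") != "ResponseComplete":
            continue
        if user_agent_substr not in ev.get("userAgent", ""):
            continue
        start = datetime.strptime(ev["requestReceivedTimestamp"], fmt)
        end = datetime.strptime(ev["stageTimestamp"], fmt)
        yield ev["requestURI"], (end - start).total_seconds()

# Demo on a synthetic audit event:
sample = json.dumps({
    "stage": "ResponseComplete",
    "userAgent": "whereabouts/v0.0.0",
    "requestURI": "/api/v1/namespaces/openshift-multus/pods",
    "requestReceivedTimestamp": "2022-06-02T09:14:02.592387Z",
    "stageTimestamp": "2022-06-02T09:14:03.592387Z",
})
for uri, latency in request_latencies([sample], "whereabouts"):
    print(f"{latency:.3f}s {uri}")
```

Consistently high latencies here would be actual evidence of API slowness, as opposed to the client-side throttling discussed above.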

Once we prove that API is slow, then we can dig deeper to find the root cause.

Comment 17 Blaine Gardner 2022-09-19 21:15:59 UTC
Has anyone taken a look at the events Alicia attached? I see a lot of these:

> error killing pod: [failed to "KillContainer" for "db" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 6efbd62f186ab05cf13839a7e707c5ace293bb3627ab0bbd7744a8f7bf0bc9cd: context deadline exceeded", failed to "KillPodSandbox" for "bd72ee75-45d9-45af-9d42-78e785861e1d" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]


When looking into issues related to failing to kill pod sandboxes, a common suggestion is that the nodes may be overloaded, making the kubelet unresponsive. If nodes are overloaded at the same time Ceph daemons are being updated (adding even more load), that could certainly cause system instability, but I'm not sure how to prove that is what's going on for these systems.

ODF is deployed with default resource requests and limits that help reduce issues with resource overuse, so something else is likely going on. Are there any processes on the node that are using a lot of resources? This could be a process on the host itself or a running pod that doesn't have resource requests/limits set.
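One way to test the "pod without requests/limits" hypothesis raised above is to scan the cluster's pod specs for containers with no resource constraints. A minimal sketch (the helper name and the input shape, the JSON output of `oc get pods -A -o json`, are assumptions for illustration):

```python
import json

def pods_without_limits(pods_json):
    """Given parsed `oc get pods -A -o json` output, return
    (namespace, pod, container) tuples where a container is missing
    CPU/memory requests or limits -- candidates for resource overuse."""
    offenders = []
    for pod in pods_json.get("items", []):
        for c in pod["spec"]["containers"]:
            res = c.get("resources", {})
            if not res.get("requests") or not res.get("limits"):
                offenders.append((pod["metadata"]["namespace"],
                                  pod["metadata"]["name"],
                                  c["name"]))
    return offenders

# Demo on a synthetic two-pod listing:
listing = {"items": [
    {"metadata": {"namespace": "openshift-storage", "name": "rook-ceph-osd-0"},
     "spec": {"containers": [{"name": "osd", "resources": {
         "requests": {"cpu": "1"}, "limits": {"cpu": "2"}}}]}},
    {"metadata": {"namespace": "default", "name": "noisy-app"},
     "spec": {"containers": [{"name": "worker", "resources": {}}]}},
]}
print(pods_without_limits(listing))
```

An empty result would rule out unconstrained pods and point the investigation back at host-level processes.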

Is this worth investigating, and how would the openshift team suggest testing that?

Comment 22 Abu Kashem 2022-12-02 13:51:53 UTC
I have explained the logs; they don't imply API slowness, and so far there is no evidence of API slowness, so I'm setting the component to Unknown. Feel free to assign it back to kube-apiserver if you find any API issues.

Comment 25 Shiftzilla 2023-03-09 01:47:38 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9852