Bug 1814804 - Pod service endpoints are getting dropped for healthy pods [NEEDINFO]
Summary: Pod service endpoints are getting dropped for healthy pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1825989
Depends On: 1802687
Blocks:
 
Reported: 2020-03-18 16:51 UTC by Luke Stanton
Modified: 2023-10-06 19:27 UTC
CC List: 29 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-28 05:44:13 UTC
Target Upstream Version:
Embargoed:
joedward: needinfo? (rphillips)




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24895 0 None closed [release-3.11] Bug 1814804: UPSTREAM: 88251: Partially fix incorrect configuration of kubepods.slice unit by kubelet 2021-02-12 10:30:21 UTC
Red Hat Knowledge Base (Solution) 5002781 0 None None None 2020-04-21 17:13:34 UTC
Red Hat Product Errata RHBA-2020:2215 0 None None None 2020-05-28 05:44:30 UTC

Description Luke Stanton 2020-03-18 16:51:43 UTC
Description of problem:

Pod service endpoints are being dropped for no apparent reason, even though the associated pods are running and responding to health checks.


Version-Release number of selected component (if applicable):

OCP 3.11.161


How reproducible:

Appears to be consistent


Actual results:

Service endpoints are being dropped when associated pods are healthy


Expected results:

Healthy pods should have their endpoints added to the corresponding service
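
One way to confirm the symptom (the names below are placeholders) is to compare the pods selected by the service with the addresses actually listed in its Endpoints object:

  oc get pods -n <namespace> -l <service-selector> -o wide
  oc get endpoints <service-name> -n <namespace> -o yaml

If the pods show up as Running and Ready in the first command but their IPs are missing from subsets[].addresses in the second, the endpoints have been dropped as described above.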

Comment 14 Venkata Tadimarri 2020-03-27 02:31:21 UTC
Created attachment 1673943 [details]
docker logs case#02613945

Comment 18 Maciej Szulik 2020-04-01 14:11:08 UTC
I looked through several pod+endpoint pairs, and every time the endpoint is not available the pod has the following condition:

- lastProbeTime: null
  lastTransitionTime: <some-date>
  status: "False"
  type: Ready

which clearly indicates the pod is not ready. This condition is one of the prerequisites for the endpoint: until the status
value changes to "True" the endpoint will not be populated. I've noticed in the logs there are a number of failed liveness and readiness
probes. I haven't found any related to the pairs I checked, but their volume is worrisome. I don't have access to events
or logs for the particular pods, so I can't identify what has failed in them.
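
For reference, a pod whose Ready condition is "False" is moved from subsets[].addresses to subsets[].notReadyAddresses in the service's Endpoints object, which is what shows up as the endpoint being "dropped". This can be confirmed with (service and namespace are placeholders):

  oc get endpoints <service-name> -n <namespace> -o yaml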

My current recommendations are (a rough sketch of the corresponding commands follows this list):
- identify a single pod+endpoint pair
- check the condition on the pod; if it is Ready=False, continue
- check the logs of the pod
- check both the liveness and readiness probes, even up to the point of rsh-ing into the pod and verifying that the probed endpoints respond.
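
A rough sketch of these checks, with pod name, namespace, probe port and probe path as placeholders:

  # Ready condition on the pod
  oc get pod <pod-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

  # pod logs and recent probe failure events
  oc logs <pod-name> -n <namespace>
  oc describe pod <pod-name> -n <namespace>

  # exercise the probed endpoint from inside the pod
  oc rsh -n <namespace> <pod-name>
  curl -v http://localhost:<probe-port><probe-path>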


Without the above I can't clearly state what's causing the pod to not become ready, but that is clearly the root cause of this problem.

Comment 21 Maciej Szulik 2020-04-03 15:48:44 UTC
A couple more things to check.

1. Kubelet sends the probe to the pod’s IP address (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/),
but in the logs I see you were trying either localhost or 127.0.0.1 when invoking curl. Can you check the pod's IP the way kubelet does? (A sketch of these checks follows this list.)

2. Can we get kubelet logs covering the hour or so after this app/pod gets restarted?

3. Can we get events for this particular pod or namespace over the same period as well?
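
A sketch of the above; names, ports and times are placeholders, and it assumes access to the node hosting the pod (on OCP 3.11 the kubelet typically runs as the atomic-openshift-node service):

  # 1. probe the pod's IP the way kubelet does, from the node hosting the pod
  POD_IP=$(oc get pod <pod-name> -n <namespace> -o jsonpath='{.status.podIP}')
  curl -v http://$POD_IP:<probe-port><probe-path>

  # 2. kubelet logs for the hour after the pod restarts
  journalctl -u atomic-openshift-node --since "<restart-time>" --until "<restart-time + 1h>"

  # 3. events from the pod's namespace over the same window
  oc get events -n <namespace> --sort-by='.lastTimestamp'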

I don't see anything else standing out in the current logs.

Comment 34 Ryan Phillips 2020-04-20 16:14:22 UTC
*** Bug 1825989 has been marked as a duplicate of this bug. ***

Comment 59 errata-xmlrpc 2020-05-28 05:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2215

