Bug 1814804 - Pod service endpoints are getting dropped for healthy pods [NEEDINFO]
Summary: Pod service endpoints are getting dropped for healthy pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1825989
Depends On: 1802687
Blocks:
 
Reported: 2020-03-18 16:51 UTC by Luke Stanton
Modified: 2023-10-06 19:27 UTC
CC List: 29 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-28 05:44:13 UTC
Target Upstream Version:
Embargoed:
joedward: needinfo? (rphillips)




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24895 0 None closed [release-3.11] Bug 1814804: UPSTREAM: 88251: Partially fix incorrect configuration of kubepods.slice unit by kubelet 2021-02-12 10:30:21 UTC
Red Hat Knowledge Base (Solution) 5002781 0 None None None 2020-04-21 17:13:34 UTC
Red Hat Product Errata RHBA-2020:2215 0 None None None 2020-05-28 05:44:30 UTC

Description Luke Stanton 2020-03-18 16:51:43 UTC
Description of problem:

Pod service endpoints are being dropped for no apparent reason, even though the associated pods are running and responding to health checks.


Version-Release number of selected component (if applicable):

OCP 3.11.161


How reproducible:

Appears to be consistent


Actual results:

Service endpoints are being dropped when associated pods are healthy


Expected results:

Healthy pods should have their endpoints added to the corresponding service
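
One way to confirm the symptom (the names below are placeholders) is to compare the pods selected by the service with the addresses actually listed in its Endpoints object:

  oc get pods -n <namespace> -l <service-selector> -o wide
  oc get endpoints <service-name> -n <namespace> -o yaml

If the pods show up as Running and Ready in the first command but their IPs are missing from subsets[].addresses in the second, the endpoints have been dropped as described above.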

Comment 14 Venkata Tadimarri 2020-03-27 02:31:21 UTC
Created attachment 1673943 [details]
docker logs case#02613945

Comment 18 Maciej Szulik 2020-04-01 14:11:08 UTC
I looked through several pod+endpoint pairs, and every time the endpoint is not available the pod has the following condition:

- lastProbeTime: null
  lastTransitionTime: <some-date>
  status: "False"
  type: Ready

which clearly indicates the pod is not ready. This condition is one of the prerequisites for the endpoint: until the status
value changes to "True" the endpoint will not be populated. I've noticed in the logs there are a number of failed liveness and readiness
probes. I haven't found any related to the pairs I checked, but their volume is worrisome. I don't have access to events
or logs for the particular pods, so I can't identify what has failed in them.
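
For reference, a pod whose Ready condition is "False" is moved from subsets[].addresses to subsets[].notReadyAddresses in the service's Endpoints object, which is what shows up as the endpoint being "dropped". This can be confirmed with (service and namespace are placeholders):

  oc get endpoints <service-name> -n <namespace> -o yaml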

My current recommendations are (a rough sketch of the corresponding commands follows this list):
- identify a single pod+endpoint pair
- check the condition on the pod; if it is Ready=False, continue
- check the logs of the pod
- check both the liveness and readiness probes, even up to the point of rsh-ing into the pod and verifying that the probed endpoints respond.
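
A rough sketch of these checks, with pod name, namespace, probe port and probe path as placeholders:

  # Ready condition on the pod
  oc get pod <pod-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

  # pod logs and recent probe failure events
  oc logs <pod-name> -n <namespace>
  oc describe pod <pod-name> -n <namespace>

  # exercise the probed endpoint from inside the pod
  oc rsh -n <namespace> <pod-name>
  curl -v http://localhost:<probe-port><probe-path>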


Without the above I can't clearly state what's causing the pod to not become ready, but that is clearly the root cause of this problem.

Comment 21 Maciej Szulik 2020-04-03 15:48:44 UTC
A couple more things to check.

1. Kubelet sends the probe to the pod’s IP address (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/),
but in the logs I see you were trying either localhost or 127.0.0.1 when invoking curl. Can you check the pod's IP the way kubelet does? (A sketch of these checks follows this list.)

2. Can we get kubelet logs covering the hour or so after this app/pod gets restarted?

3. Can we get events for this particular pod or namespace over the same period as well?
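
A sketch of the above; names, ports and times are placeholders, and it assumes access to the node hosting the pod (on OCP 3.11 the kubelet typically runs as the atomic-openshift-node service):

  # 1. probe the pod's IP the way kubelet does, from the node hosting the pod
  POD_IP=$(oc get pod <pod-name> -n <namespace> -o jsonpath='{.status.podIP}')
  curl -v http://$POD_IP:<probe-port><probe-path>

  # 2. kubelet logs for the hour after the pod restarts
  journalctl -u atomic-openshift-node --since "<restart-time>" --until "<restart-time + 1h>"

  # 3. events from the pod's namespace over the same window
  oc get events -n <namespace> --sort-by='.lastTimestamp'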

I don't see anything else standing out in the current logs.

Comment 34 Ryan Phillips 2020-04-20 16:14:22 UTC
*** Bug 1825989 has been marked as a duplicate of this bug. ***

Comment 59 errata-xmlrpc 2020-05-28 05:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2215

