Bug 1814804

Summary: Pod service endpoints are getting dropped for healthy pods
Product: OpenShift Container Platform
Reporter: Luke Stanton <lstanton>
Component: Node
Node sub component: CRI-O
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aanjarle, aarapov, anisal, aos-bugs, apurty, bmeng, clichybi, cpassare, dahernan, dkulkarn, emarquez, jkaur, joedward, jokerman, ktadimar, maszulik, mfojtik, mvardhan, pweil, rhowe, rphillips, sbhavsar, schoudha, sttts, syang, weihuang, wzheng, yaoli, zyu
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-28 05:44:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1802687
Bug Blocks:

Description Luke Stanton 2020-03-18 16:51:43 UTC
Description of problem:

Pod service endpoints are being dropped without an apparent reason, even though associated pods are running and responding to health checks.


Version-Release number of selected component (if applicable):

OCP 3.11.161


How reproducible:

Appears to be consistent


Actual results:

Service endpoints are being dropped when associated pods are healthy


Expected results:

Healthy pods should have their addresses added as endpoints of the corresponding service
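
For reference, one way to confirm the mismatch is to compare the service's registered endpoint addresses against the backing pods' IPs and readiness; the commands below are only a sketch, with placeholder service/namespace names:

  # Addresses currently registered behind the service (placeholder names)
  oc get endpoints <service> -n <namespace> -o yaml

  # Backing pods with their IPs and Ready status
  oc get pods -n <namespace> -o wide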

Comment 14 Venkata Tadimarri 2020-03-27 02:31:21 UTC
Created attachment 1673943 [details]
docker logs case#02613945

Comment 18 Maciej Szulik 2020-04-01 14:11:08 UTC
I looked through several pod+endpoint pairs, and every time the endpoint is not available the pod has the following condition:

- lastProbeTime: null
  lastTransitionTime: <some-date>
  status: "False"
  type: Ready

which clearly reflects that the pod is not ready. This condition is one of the prerequisites for the endpoint: until that status
value changes to "True" the endpoint will not be added. I've also noticed a number of failed liveness and readiness probes
in the logs. I haven't found any related to the pairs I checked, but the volume is worrisome. I don't have access to the events
or logs of the particular pods, so I cannot identify what has failed in them.

My current recommendations are:
- identify a single pod+endpoint pair
- check the condition on the pod; if it is Ready=False, continue
- check the logs of the pod
- check both the liveness and readiness probes, up to the point of rsh'ing into the pod and verifying that they respond correctly (a command sketch follows below)


Without the above I cannot clearly state what is causing the pod not to become ready, but that is clearly the root cause of this problem.
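
A minimal sketch of those checks, assuming an oc client; the pod, namespace, probe port, and path are placeholders:

  # Show the Ready condition on the pod (placeholder names)
  oc get pod <pod> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

  # Inspect the pod's logs and recent events
  oc logs <pod> -n <namespace>
  oc describe pod <pod> -n <namespace>

  # Exercise the probe endpoint from inside the pod
  # (port and path are placeholders; use the values from the pod's probe spec)
  oc rsh -n <namespace> <pod> curl -v http://localhost:<probe-port><probe-path>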

Comment 21 Maciej Szulik 2020-04-03 15:48:44 UTC
A couple more things to check.

1. The kubelet sends the probe to the pod's IP address (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/),
but in the logs I see curl being invoked against localhost or 127.0.0.1. Can you check the pod's IP the way the kubelet does? (A command sketch follows at the end of this comment.)

2. Can we get the kubelet logs covering roughly the hour after this app/pod gets restarted?

3. Can we also get the events for this particular pod or namespace from the same period?

I don't see anything else standing out in the current logs.
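
A sketch of checking the probe the way the kubelet does; the pod, namespace, port, and path are placeholders, and the node service name assumes a default OCP 3.11 node setup:

  # Find the pod IP and the node it is scheduled on (placeholder names)
  oc get pod <pod> -n <namespace> -o wide

  # From that node, hit the probe endpoint on the pod IP directly
  # (use the port/path from the pod's readinessProbe/livenessProbe spec)
  curl -v http://<pod-ip>:<probe-port><probe-path>

  # Kubelet logs on an OCP 3.11 node (the kubelet runs inside the atomic-openshift-node service)
  journalctl -u atomic-openshift-node --since "1 hour ago"

  # Events for the namespace, most recent last
  oc get events -n <namespace> --sort-by=.lastTimestamp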

Comment 34 Ryan Phillips 2020-04-20 16:14:22 UTC
*** Bug 1825989 has been marked as a duplicate of this bug. ***

Comment 59 errata-xmlrpc 2020-05-28 05:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2215

Comment 61 Red Hat Bugzilla 2025-01-30 04:25:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days