Description of problem:
Pod service endpoints are being dropped without an apparent reason, even though associated pods are running and responding to health checks.
Version-Release number of selected component (if applicable):
How reproducible:
Appears to be consistent
Actual results:
Service endpoints are being dropped when associated pods are healthy
Expected results:
Healthy pods would have endpoints added to their corresponding service
Created attachment 1673943
docker logs case#02613945
I looked through several pod+endpoint pairs, and every time the endpoint is not available the pod has the following condition:
- lastProbeTime: null
which clearly shows that the pod is not ready. This condition is one of the prerequisites for the endpoint: until that status
value changes to "True", the endpoint will not work. I've noticed in the logs that there are a bunch of failed liveness and readiness
probes. I haven't found any related to the pairs I checked, but the number of them is worrisome. I don't have access to events
or logs for the particular pods, so I can't identify what has failed in them.
My current recommendations are:
- identify a single pod+endpoint pair
- check the Ready condition on the pod; if it's Ready=False, continue
- check the logs of the pod
- check both the liveness and readiness probes, even to the point of running rsh into the pod and verifying that they work.
Without the above I can't clearly state what's preventing the pod from becoming ready, but that is clearly the root cause of this problem.
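The condition check in the steps above can be sketched in a few lines of Python. This is only an illustration, assuming the pod object has been dumped as JSON (for example with `oc get pod <name> -o json`) and loaded into a dict; the helper names and the sample pod are hypothetical, and only the `status.conditions[].type` / `status` fields come from the Kubernetes pod schema.

```python
def ready_condition(pod):
    """Return the pod's Ready condition dict, or None if it has none."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def is_ready(pod):
    """A pod counts as ready only when its Ready condition has status "True"."""
    cond = ready_condition(pod)
    return cond is not None and cond.get("status") == "True"


# Hypothetical sample, shaped like the output of `oc get pod -o json`.
sample_pod = {
    "metadata": {"name": "example-pod"},
    "status": {
        "conditions": [
            {"type": "Initialized", "status": "True", "lastProbeTime": None},
            {"type": "Ready", "status": "False", "lastProbeTime": None},
        ]
    },
}

if __name__ == "__main__":
    cond = ready_condition(sample_pod)
    print(sample_pod["metadata"]["name"], "ready:", is_ready(sample_pod))
    print("Ready condition:", cond)
```

If `is_ready` returns False for the pod behind a missing endpoint, the next step is the pod's logs and probe results, as listed above.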
A couple more things to check.
1. The kubelet sends the probe to the pod's IP address (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/),
but in the logs I see you were trying either localhost or 127.0.0.1 when invoking curl. Can you check the pod's IP like the kubelet does?
2. Can we get kubelet logs covering the hour or so after this app/pod gets restarted?
3. Can we get events from this particular pod or namespace for the same period as well?
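For item 1, a rough way to imitate an HTTP probe against the pod's IP rather than localhost is sketched below. It is not the kubelet's implementation: it only mirrors the documented rule that an HTTP probe succeeds on a status code of at least 200 and below 400, and it uses a 1 second timeout to match the default `timeoutSeconds`. The function names are hypothetical, and the host, port and path have to be taken from the pod's actual probe definition.

```python
import urllib.error
import urllib.request


def probe_succeeded(status_code):
    """HTTP probes treat any status in the range 200-399 as success."""
    return 200 <= status_code < 400


def http_probe(host, port, path="/", timeout=1.0):
    """Send one GET to http://host:port/path and report probe-style success."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return probe_succeeded(resp.status)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; judge them by status code.
        return probe_succeeded(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, unreachable host: the probe fails.
        return False


# Example call (placeholders, to be replaced with the pod's IP and probe settings):
# http_probe("<pod-ip>", 8080, "/healthz")
```

Running this from a node or another pod that can reach the pod network, and comparing the result with the same request against 127.0.0.1 from inside the pod, would show whether the application answers only on the loopback interface.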
I don't see anything else that stands out in the current logs.
*** Bug 1825989 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.