| Summary: | OpenStack snapshot of OpenShift node with running pods crashed pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Jones <erjones> |
| Component: | Node | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED NOTABUG | QA Contact: | DeShuai Ma <dma> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.2.0 | CC: | agoldste, aos-bugs, decarr, erjones, jokerman, mmccomas |
| Target Milestone: | --- | Keywords: | NeedsTestCase |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | OpenShift Enterprise 3.1.1.6 running on OpenStack 7 |
| Last Closed: | 2016-04-07 16:03:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description

Eric Jones 2016-04-01 22:05:34 UTC
I'm not sure whether I'm moving this to the correct component, but I would think the problem is in either the scheduler or the kubelet. A node becoming unresponsive for any reason shouldn't result in anything becoming permanently stuck.

As far as I can tell, the log entry `apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` only shows up when you request the logs for a container in a pod. Would it be possible to `oc describe pod/$pod` for any of the pods that are stuck Pending?

Andy, my thought is:

1. The node got paused -> the node stopped responding to the apiserver -> the apiserver had the pods rescheduled to another node.
2. Once the node got unpaused, the kubelet continued running with all its pods as before. From the Events in the pod's description [1], the pod got started, so it must be running.
3. Because the apiserver had the pod rescheduled to another node, the pending pod is actually running; the apiserver just thinks it is not.

I'm not familiar with how the scheduler deals with pods that have ceased to exist; I need to check that out. As far as I know, the scheduler will just pick another node with sufficient resources. Eric, for how long was the node paused?

The only detail I have on that is the customer indicated "several minutes".

The error message `pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` comes from the kubelet. The execution path leads to the kubelet server, where getContainerLogs is called for the given pod. Maybe, because the apiserver's request for the pod's logs results in an internal error, the pod's state stays Pending and never gets updated to Running. When running pure Kubernetes locally and starting a pod, once the event shows Reason = Started, Message = Started, the pod is in the Running state.

Eric, they can `oc describe` running pods too. If they could describe one of the pods that was stuck Pending and then eventually ran, that would be helpful.

Andy, I explained incorrectly how the pods are running now, my apologies. The pods that _were_ pending were deleted, along with the project itself. The customer then recreated the project and redeployed the pods, which deployed "successfully" until one just shifted to Pending. With that explanation in mind, would `oc describe` on one of those (now running) pods still be of assistance?

If these new pods didn't exhibit any delayed startup, then no, I don't think it will provide much value. Thanks for the clarification!

Derek, is there anything we can look at to see if this is the same as your "pods stuck terminating or pending" bug?
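The inspection steps discussed in this thread (comparing the phase the apiserver records for the pod against the kubelet events, and reproducing the log error) could be gathered with commands along these lines. This is only a sketch: it assumes access to the affected cluster, and the pod and namespace names are the ones quoted in the report.

```
# Hedged sketch; requires access to the affected cluster.
# Pod/namespace names are taken from the error message in this report.

# Events section shows whether the kubelet actually started the containers
oc describe pod/pcn-product-svc-3-6a1m6 -n pcn-prod

# status.phase as the apiserver sees it (stuck at Pending in this report)
oc get pod/pcn-product-svc-3-6a1m6 -n pcn-prod -o yaml

# Requesting logs is what triggers the
# "pod is not in 'Running', 'Succeeded' or 'Failed' state" error
oc logs pcn-product-svc-3-6a1m6 -n pcn-prod
```

If the describe output shows a Started event while `status.phase` is still Pending, that would match the hypothesis above that the kubelet resumed the pod after the unpause while the apiserver's view never caught up.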