Description of problem: Customer had a functioning OpenShift 3.1.1.6 environment and took a snapshot of a node that had several pods running on it. This "paused" the node, which moved some of the pods and crashed the others. After the node came back up, several pods were stuck in a Pending state and the following error message [0] appeared in the log for `atomic-openshift-master-api`.

[0] apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"

How reproducible: Unsure; several (four) pods are having this issue.

Additional info: Restarting atomic-openshift-master-api fixed the issue for several of the pods, but not all of them.
I'm not sure if I'm moving this to the correct component but I would think the problem is either in the scheduler or the kubelet. A node becoming unresponsive for any reason shouldn't result in anything becoming permanently stuck.
As far as I can tell, the log entry `apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` only shows up when you request the logs for a container in a pod. Would it be possible to run `oc describe pod/$pod` for any of the pods that are stuck Pending?
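For reference, the triage steps I'd suggest could look like the following sketch (the pod and namespace names are taken from the error message above; adjust to whichever pod is stuck). The `command -v` guard is just so the snippet is safe to paste on a box without `oc` installed.

```shell
# Hypothetical triage commands; pod/namespace names taken from the report above.
POD=pcn-product-svc-3-6a1m6
NS=pcn-prod

# Only invoke `oc` if it is present on this machine.
if command -v oc >/dev/null 2>&1; then
  # Full scheduling and event history for the stuck pod:
  oc describe pod/"$POD" -n "$NS"
  # Requesting container logs is what triggers the error entry
  # in the master-api log, so this reproduces it on demand:
  oc logs pod/"$POD" -n "$NS"
  # Where does the apiserver currently think the pod is bound?
  oc get pod/"$POD" -n "$NS" -o wide
fi
```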
Andy, my thought is:

1. The node got paused -> the node stopped responding to the apiserver -> the apiserver had the pods rescheduled to another node.
2. Once the node got unpaused, the kubelet continued running all its pods as before. From the pod's description [1], the Events show the pod got started => so it must be running.
3. As the apiserver had the pod rescheduled to another node, the "pending" pod is actually running; the apiserver just thinks it is not.

I'm not familiar with how the scheduler deals with pods that have ceased to exist; I need to check that out. AFAIK, the scheduler will just pick another node with sufficient resources.

Eric, for how long was the node paused?
The only detail I have on that is that the customer indicated "several minutes".
The error message `pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` comes from the kubelet. The execution path leads to the kubelet server, where `getContainerLogs` is called for the given pod. Perhaps, when the apiserver requests the pod's logs and the response leads to an internal error, the pod's state is still Pending and never gets updated to Running. When running pure Kubernetes locally and starting a pod, once the event shows Reason = Started, Message = Started, the pod is in the Running state.
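To make the phase check concrete, here is a minimal Go sketch of the kind of guard described above (the function name `checkLogAccess` and its signature are hypothetical, not the actual kubelet code): logs are only served for pods whose phase is Running, Succeeded, or Failed, and any other phase yields an error worded like the one in the master-api log.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod lifecycle phases.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// checkLogAccess is a simplified, hypothetical sketch of the guard the
// kubelet applies before serving container logs: only pods that have
// reached the Running phase or a terminal phase may be queried.
func checkLogAccess(name, namespace string, phase PodPhase) error {
	switch phase {
	case PodRunning, PodSucceeded, PodFailed:
		return nil
	default:
		return fmt.Errorf("Pod %q in namespace %q : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: %q",
			name, namespace, phase)
	}
}

func main() {
	// A Pending pod is rejected, producing the error seen in the report.
	err := checkLogAccess("pcn-product-svc-3-6a1m6", "pcn-prod", PodPending)
	fmt.Println(err)

	// A Running pod passes the check.
	fmt.Println(checkLogAccess("pcn-product-svc-3-6a1m6", "pcn-prod", PodRunning) == nil) // prints "true"
}
```

This matches the observed behavior: if the apiserver's cached phase for the pod is stale (still Pending), every log request fails this check even though the kubelet may actually be running the containers.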
Eric, they can `oc describe` running pods too. If they could describe one of the pods that was stuck Pending and then eventually ran, that would be helpful.
Andy, I explained incorrectly how the pods are running now, my apologies. The pods that _were_ pending were deleted, along with the project itself. The customer then recreated the project and redeployed the pods, which "successfully" deployed until one just shifted to Pending. With that explanation in mind, would `oc describe` on one of those pods (now running) still be of assistance?
If these new pods didn't exhibit any delayed startup, then no, I don't think it will provide much value. Thanks for the clarification!
Derek, is there anything we can look at to see if this is the same as your "pods stuck terminating or pending" bug?