Bug 1323331 - OpenStack snapshot of OpenShift node with running pods crashed pods
Summary: OpenStack snapshot of OpenShift node with running pods crashed pods
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jan Chaloupka
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-01 22:05 UTC by Eric Jones
Modified: 2019-10-10 11:45 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
OpenShift Enterprise 3.1.1.6 running on OpenStack 7
Last Closed: 2016-04-07 16:03:08 UTC
Target Upstream Version:



Description Eric Jones 2016-04-01 22:05:34 UTC
Description of problem:
Customer had a functioning OpenShift 3.1.1.6 environment and took a snapshot of a node that had several pods running on it. This caused the node to be "paused", which moved some of the pods and crashed the others. After the node came back up, several pods were stuck in a Pending state and the following error message [0] was showing up in the log for `atomic-openshift-master-api`.

[0] apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"


How reproducible:
Unsure; four pods are currently exhibiting this issue.

Additional info:
Restarting `atomic-openshift-master-api` fixed the issue for several of the pods, but not all of them.

Comment 1 Brenton Leanhardt 2016-04-04 12:17:51 UTC
I'm not sure if I'm moving this to the correct component, but I would think the problem is either in the scheduler or the kubelet. A node becoming unresponsive for any reason shouldn't result in anything becoming permanently stuck.

Comment 4 Andy Goldstein 2016-04-05 13:58:47 UTC
As far as I can tell, the log entry

  apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"

only shows up when you request the logs for a container in a pod.

Would it be possible to `oc describe pod/$pod` for any of the pods that are stuck Pending?

Comment 7 Jan Chaloupka 2016-04-06 14:38:46 UTC
Andy, my thought is:
1. Node got paused -> node stopped responding to the apiserver -> apiserver had the pods rescheduled to another node.

2. Once the node got unpaused, the kubelet continued running all its pods as before.
From the pod's description [1], the Events show the pod got started => so it must be running.

3. As the apiserver had the pod rescheduled to another node, the pending pod is actually running; the apiserver just thinks it is not.

I'm not familiar with how the scheduler deals with pods that ceased to exist; I need to check that out. AFAIK, the scheduler will just pick another node with sufficient resources.
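For reference, this timeline matches the upstream node-controller behavior: if a node stays unresponsive past the pod-eviction timeout (default 5 minutes in this era of Kubernetes; an assumption for this environment, since the flag may have been tuned), its pods are marked for deletion and rescheduled elsewhere. A minimal sketch of that threshold check:

```python
from datetime import datetime, timedelta

# Sketch only, not the actual controller code: a node paused longer than
# --pod-eviction-timeout (assumed default of 5m) gets its pods rescheduled.
POD_EVICTION_TIMEOUT = timedelta(minutes=5)

def should_evict_pods(last_heartbeat: datetime, now: datetime) -> bool:
    """Return True if the node has been unresponsive past the eviction timeout."""
    return now - last_heartbeat > POD_EVICTION_TIMEOUT

now = datetime(2016, 4, 1, 12, 10)
print(should_evict_pods(datetime(2016, 4, 1, 12, 6), now))  # paused 4m: under threshold
print(should_evict_pods(datetime(2016, 4, 1, 12, 2), now))  # paused 8m: over threshold
```

So whether "several minutes" of pause triggers rescheduling depends on which side of that threshold the pause fell.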

Eric, how long was the node paused?

Comment 8 Eric Jones 2016-04-06 14:46:00 UTC
The only detail I have on that is that the customer indicated "several minutes".

Comment 9 Jan Chaloupka 2016-04-06 15:00:30 UTC
The error message

pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"

comes from the kubelet. The execution path leads to the kubelet server, where getContainerLogs is called for the given pod.

Maybe, since the apiserver requests the pod's logs and the response leads to an internal error, the pod's state is still Pending and never gets updated to Running.

When running pure Kubernetes locally and starting a pod, once the event shows
Reason = Started, Message = Started, the pod is in the Running state.
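The shape of that check can be sketched as follows. This is a hedged approximation, not the actual kubelet source: the log-serving path refuses requests unless the pod has reached one of the three phases named in the error, which is exactly the message seen in this report.

```python
# Approximate sketch of the kubelet-side validation described above; the
# real implementation lives in the kubelet's getContainerLogs path.
SERVABLE_PHASES = {"Running", "Succeeded", "Failed"}

def validate_log_request(pod_name, namespace, phase):
    """Return an error string if logs cannot be served for this pod, else None."""
    if phase not in SERVABLE_PHASES:
        return ('Pod "%s" in namespace "%s" : pod is not in '
                "'Running', 'Succeeded' or 'Failed' state - State: \"%s\""
                % (pod_name, namespace, phase))
    return None

print(validate_log_request("pcn-product-svc-3-6a1m6", "pcn-prod", "Pending"))
```

A Pending pod therefore always fails this check, regardless of whether its containers are actually up on the node.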

Comment 10 Andy Goldstein 2016-04-06 15:03:53 UTC
Eric, they can `oc describe` running pods too. If they could describe one of the pods that was stuck Pending and then eventually ran, that would be helpful.

Comment 11 Eric Jones 2016-04-06 15:09:25 UTC
Andy,

My apologies, I explained incorrectly how the pods are running now. The pods that _were_ Pending were deleted, along with the project itself. The customer then recreated the project and redeployed the pods, which "successfully" deployed until one just shifted to Pending.

With that explanation in mind, would the `oc describe` on one of those pods (now running) still be of assistance?

Comment 12 Andy Goldstein 2016-04-06 15:10:49 UTC
If these new pods didn't exhibit any delayed startup, then no, I don't think it will provide much value. Thanks for the clarification!

Comment 13 Andy Goldstein 2016-04-06 15:31:27 UTC
Derek, is there anything we can look at to see if this is the same as your "pods stuck terminating or pending" bug?

