Description of problem: Customer had a functioning OpenShift 3.1.1.6 environment and took a snapshot of a node that had several pods running on it. This "paused" the node, which moved some of the pods and crashed the others. After the node came back up, several pods were stuck in a Pending state and the following error message [0] appeared in the log for `atomic-openshift-master-api`.

[0] apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"

How reproducible: Unsure; several (four) pods are having this issue.

Additional info: Restarting atomic-openshift-master-api fixed the issue for several of the pods, but not all of them.
I'm not sure if I'm moving this to the correct component but I would think the problem is either in the scheduler or the kubelet. A node becoming unresponsive for any reason shouldn't result in anything becoming permanently stuck.
As far as I can tell, the log entry `apiserver was unable to write a JSON response: Internal error occurred: Pod "pcn-product-svc-3-6a1m6" in namespace "pcn-prod" : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` only shows up when you request the logs for a container in a pod. Would it be possible to run `oc describe pod/$pod` for any of the pods that are stuck Pending?
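For reference, the triage steps I'd suggest could look like the following sketch (the pod and namespace names are taken from the error message above; adjust to whichever pod is stuck). The `command -v` guard is just so the snippet is safe to paste on a box without `oc` installed.

```shell
# Hypothetical triage commands; pod/namespace names taken from the report above.
POD=pcn-product-svc-3-6a1m6
NS=pcn-prod

# Only invoke `oc` if it is present on this machine.
if command -v oc >/dev/null 2>&1; then
  # Full scheduling and event history for the stuck pod:
  oc describe pod/"$POD" -n "$NS"
  # Requesting container logs is what triggers the error entry
  # in the master-api log, so this reproduces it on demand:
  oc logs pod/"$POD" -n "$NS"
  # Where does the apiserver currently think the pod is bound?
  oc get pod/"$POD" -n "$NS" -o wide
fi
```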
Andy, my thought is:

1. The node got paused -> the node stopped responding to the apiserver -> the apiserver had the pods rescheduled to another node.
2. Once the node got unpaused, the kubelet continued running all its pods as before. From the pod's description [1], the Events show the pod got started => so it must be running.
3. As the apiserver had the pod rescheduled to another node, the "pending" pod is actually running; the apiserver just thinks it is not.

I'm not familiar with how the scheduler deals with pods that have ceased to exist; I need to check that out. AFAIK, the scheduler will just pick another node with sufficient resources.

Eric, for how long was the node paused?
The only detail I have on that is that the customer indicated "several minutes".
The error message `pod is not in 'Running', 'Succeeded' or 'Failed' state - State: "Pending"` comes from the kubelet. The execution path leads to the kubelet server, where `getContainerLogs` is called for the given pod. Perhaps, when the apiserver requests the pod's logs and the response leads to an internal error, the pod's state is still Pending and never gets updated to Running. When running pure Kubernetes locally and starting a pod, once the event shows Reason = Started, Message = Started, the pod is in the Running state.
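To make the phase check concrete, here is a minimal Go sketch of the kind of guard described above (the function name `checkLogAccess` and its signature are hypothetical, not the actual kubelet code): logs are only served for pods whose phase is Running, Succeeded, or Failed, and any other phase yields an error worded like the one in the master-api log.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod lifecycle phases.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// checkLogAccess is a simplified, hypothetical sketch of the guard the
// kubelet applies before serving container logs: only pods that have
// reached the Running phase or a terminal phase may be queried.
func checkLogAccess(name, namespace string, phase PodPhase) error {
	switch phase {
	case PodRunning, PodSucceeded, PodFailed:
		return nil
	default:
		return fmt.Errorf("Pod %q in namespace %q : pod is not in 'Running', 'Succeeded' or 'Failed' state - State: %q",
			name, namespace, phase)
	}
}

func main() {
	// A Pending pod is rejected, producing the error seen in the report.
	err := checkLogAccess("pcn-product-svc-3-6a1m6", "pcn-prod", PodPending)
	fmt.Println(err)

	// A Running pod passes the check.
	fmt.Println(checkLogAccess("pcn-product-svc-3-6a1m6", "pcn-prod", PodRunning) == nil) // prints "true"
}
```

This matches the observed behavior: if the apiserver's cached phase for the pod is stale (still Pending), every log request fails this check even though the kubelet may actually be running the containers.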
Eric, they can `oc describe` running pods too. If they could describe one of the pods that was stuck Pending and then eventually ran, that would be helpful.
Andy, I explained incorrectly how the pods are running now, my apologies. The pods that _were_ pending were deleted, along with the project itself. The customer then recreated the project and redeployed the pods, which "successfully" deployed until one just shifted to Pending. With that explanation in mind, would `oc describe` on one of those pods (now running) still be of assistance?
If these new pods didn't exhibit any delayed startup, then no, I don't think it will provide much value. Thanks for the clarification!
Derek, is there anything we can look at to see if this is the same as your "pods stuck terminating or pending" bug?