Bug 1534492 - Pod phase doesn't respect allowed transitions
Summary: Pod phase doesn't respect allowed transitions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Seth Jennings
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1544172
Depends On:
Blocks:
 
Reported: 2018-01-15 11:18 UTC by Tomáš Nožička
Modified: 2019-01-30 15:10 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-30 15:10:24 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
  Github openshift/origin issue 17595 - https://github.com/openshift/origin/issues/17595 (last updated 2018-01-15 11:18:24 UTC)
  Red Hat Product Errata RHBA-2019:0098 - https://access.redhat.com/errata/RHBA-2019:0098 (last updated 2019-01-30 15:10:36 UTC)

Description Tomáš Nožička 2018-01-15 11:18:25 UTC
Description of problem:
Pods with restart policy Never can transition from phase Failed to Succeeded, breaking all the workload controller logic. This breaks DeploymentConfig invariants and results in undefined behaviour.

Version-Release number of selected component (if applicable):
Master (3.9 currently)

How reproducible:
Sometimes.

Actual results:
Pod phase violates the Pod state transition diagram.

Expected results:
Pod phase respects the allowed pod transitions, and this is enforced in the apiserver (admission) so it can never happen again.
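
A minimal sketch of the kind of apiserver-side enforcement being asked for, assuming the k8s.io/api/core/v1 types; ValidatePodPhaseTransition and isTerminalPhase are hypothetical names, not the actual validation code:

// Sketch only: reject status updates that move a pod out of a terminal phase.
package validationsketch

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// isTerminalPhase reports whether a phase may never be left again.
func isTerminalPhase(p v1.PodPhase) bool {
    return p == v1.PodSucceeded || p == v1.PodFailed
}

// ValidatePodPhaseTransition returns an error for transitions such as
// Failed -> Succeeded, so the apiserver can refuse the update.
func ValidatePodPhaseTransition(oldPod, newPod *v1.Pod) error {
    oldPhase, newPhase := oldPod.Status.Phase, newPod.Status.Phase
    if isTerminalPhase(oldPhase) && newPhase != oldPhase {
        return fmt.Errorf("pod %s/%s: illegal phase transition %s -> %s",
            newPod.Namespace, newPod.Name, oldPhase, newPhase)
    }
    return nil
}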


Additional info:
https://github.com/openshift/origin/issues/17595

Comment 3 Seth Jennings 2018-01-22 19:24:16 UTC
Origin issue:
https://github.com/openshift/origin/issues/17595

Comment 4 Seth Jennings 2018-01-22 21:25:51 UTC
I've pulled the node log, filtered it by the UUID of the test pod to reduce the noise, and added breaks between timed chunks of log lines:

http://file.rdu.redhat.com/sjenning/bz1534492.log

The logs do show the status manager updating the API server with phase:Failed followed by phase:Succeeded.  Something is messed up here.

Comment 5 Seth Jennings 2018-01-22 23:10:04 UTC
The flow is this:

14:05:11 node sees new pod to start
14:05:17 sandbox container is running according to PLEG
14:05:22 app container is running according to PLEG
14:05:26 status manager updates pod with phase Running
14:05:26 (1/5s later) API requests pod DELETE
14:05:32 both sandbox and app containers have exited according to PLEG
14:05:34 status manager updates pod with phase Failed (exit code 2 in container status)
14:05:35 app container is deleted according to PLEG (non-existent)
14:05:35 (1/2s later) status manager tries to regenerate the status from the container status, but the container is deleted and the container status is null; it somehow decides that this is "Succeeded" and updates the pod
14:05:40 API requests pod REMOVE; removal fails with "pod not found"
14:05:44 pod removed from status map
14:06:27 sandbox container is deleted according to PLEG (non-existent)

The trigger is that the app container is removed before the API calls for its removal, and while processing the API request for deletion, the status is updated with no ContainerState information.
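
For illustration, a simplified sketch of this failure mode, assuming a kubelet-style helper that derives the phase only from the container statuses it can currently see (derivePhase is hypothetical; the real kubelet logic is more involved):

// Sketch of the failure mode only, not the actual kubelet code.
package kubeletsketch

import v1 "k8s.io/api/core/v1"

// derivePhase computes a phase purely from the visible container statuses.
func derivePhase(statuses []v1.ContainerStatus) v1.PodPhase {
    running, failed := 0, 0
    for _, s := range statuses {
        switch {
        case s.State.Running != nil:
            running++
        case s.State.Terminated != nil && s.State.Terminated.ExitCode != 0:
            failed++
        }
    }
    switch {
    case running > 0:
        return v1.PodRunning
    case failed > 0:
        return v1.PodFailed
    default:
        // Once the app container has been deleted, statuses is empty, so the
        // earlier exit code 2 is invisible here and the pod is reported as
        // Succeeded even though it had already been reported as Failed.
        return v1.PodSucceeded
    }
}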

Comment 6 Seth Jennings 2018-02-12 02:07:10 UTC
*** Bug 1544172 has been marked as a duplicate of this bug. ***

Comment 7 Seth Jennings 2018-02-12 20:34:07 UTC
I've opened a PR upstream. It doesn't fix the cause (still undetermined), but it does log when the illegal transition occurs and prevents it from propagating.

https://github.com/kubernetes/kubernetes/pull/59767
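
Conceptually the guard works along these lines (hypothetical names, assuming the status manager can compare the previously reported phase with the newly generated one; see the linked PR for the real change):

// Conceptual sketch of the guard; guardPhase and isTerminal are hypothetical.
package statussketch

import (
    "log"

    v1 "k8s.io/api/core/v1"
)

// isTerminal reports whether a phase may never be left again.
func isTerminal(p v1.PodPhase) bool {
    return p == v1.PodSucceeded || p == v1.PodFailed
}

// guardPhase logs and drops an illegal phase change, returning the phase that
// should actually be sent to the API server.
func guardPhase(podName string, oldPhase, newPhase v1.PodPhase) v1.PodPhase {
    if isTerminal(oldPhase) && newPhase != oldPhase {
        log.Printf("pod %s: illegal phase transition %s -> %s suppressed, keeping %s",
            podName, oldPhase, newPhase, oldPhase)
        return oldPhase
    }
    return newPhase
}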

Comment 8 Seth Jennings 2018-02-12 23:33:09 UTC
Origin PR:
https://github.com/openshift/origin/pull/18585

Comment 9 Tomáš Nožička 2018-03-01 11:48:58 UTC
It's still missing the enforcement in the apiserver.

Also, the fix in the PR seems not to be enough - https://github.com/openshift/origin/issues/17595#issuecomment-368286922

Comment 10 Seth Jennings 2018-03-01 18:27:14 UTC
switching back to this issue

Comment 11 Seth Jennings 2018-03-01 19:42:30 UTC
Origin master:
https://github.com/openshift/origin/pull/18791

Origin 3.9:
https://github.com/openshift/origin/pull/18792

Comment 12 Seth Jennings 2018-03-05 20:30:48 UTC
3.9 PR is merged.  Master should merge soon.

Comment 15 Weinan Liu 2019-01-15 09:23:19 UTC
Verified to be fixed

# oc run always-test --image=nginx --generator=run-pod/v1  --command=true /bin/false --restart='Always'
# oc run never-test --image=nginx --generator=run-pod/v1  --command=true /bin/false --restart='Never'

[root@qe-weinliu-3951-master-etcd-1 ~]# oc get po
NAME          READY     STATUS    RESTARTS   AGE
always-test   0/1       Error     5          3m
never-test    0/1       Error     0          2m
(pod never-test does not restart)

[root@qe-weinliu-3951-master-etcd-1 ~]# oc version
oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-3951-master-etcd-1:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657
[root@qe-weinliu-3951-master-etcd-1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 17 errata-xmlrpc 2019-01-30 15:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0098

