Bug 1534492
| Summary: | Pod phase doesn't respect allowed transitions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | Node | Assignee: | Seth Jennings <sjenning> |
| Status: | CLOSED ERRATA | QA Contact: | Weinan Liu <weinliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | aos-bugs, ccoleman, jokerman, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-30 15:10:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tomáš Nožička
2018-01-15 11:18:25 UTC
Origin issue: https://github.com/openshift/origin/issues/17595

I've pulled the node log, filtered it by the UUID of the test pod to reduce the noise, and added breaks between timed chunks of log lines: http://file.rdu.redhat.com/sjenning/bz1534492.log

The logs do show the status manager updating the API server with phase:Failed followed by phase:Succeeded. Something is messed up here. The flow is this:

14:05:11 node sees the new pod to start
14:05:17 sandbox container is running according to PLEG
14:05:22 app container is running according to PLEG
14:05:26 status manager updates the pod with phase Running
14:05:26 (1/5 s later) API requests pod DELETE
14:05:32 both sandbox and app containers have exited according to PLEG
14:05:34 status manager updates the pod with phase Failed (exit code 2 in the container status)
14:05:35 app container is deleted according to PLEG (non-existent)
14:05:35 (1/2 s later) status manager tries to regenerate the status from the container status, but the container is deleted and the container status is null; it somehow decides that means "Succeeded" and updates the pod
14:05:40 API requests pod REMOVE; removal fails with "pod not found"
14:05:44 pod removed from the status map
14:06:27 sandbox container is deleted according to PLEG (non-existent)

The trigger is that the app container is removed before the API call for its removal, and while the deletion request is being processed, the status is updated with no ContainerState information.

*** Bug 1544172 has been marked as a duplicate of this bug. ***

I've opened a PR upstream. It doesn't fix the cause (still undetermined), but it does log when the illegal transition occurs and prevents it from propagating: https://github.com/kubernetes/kubernetes/pull/59767

It's still missing the enforcement in the apiserver. Also, the fix in the PR seems not to be enough - https://github.com/openshift/origin/issues/17595#issuecomment-368286922 - switching back to this issue.

Origin master: https://github.com/openshift/origin/pull/18791
Origin 3.9: https://github.com/openshift/origin/pull/18792

The 3.9 PR is merged. Master should merge soon.

Verified to be fixed:

# oc run always-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Always'
# oc run never-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Never'

[root@qe-weinliu-3951-master-etcd-1 ~]# oc get po
NAME          READY     STATUS    RESTARTS   AGE
always-test   0/1       Error     5          3m
never-test    0/1       Error     0          2m

(pod never-test does not restart)

[root@qe-weinliu-3951-master-etcd-1 ~]# oc version
oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-weinliu-3951-master-etcd-1:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657

[root@qe-weinliu-3951-master-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0098
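For reference, a minimal, self-contained Go sketch of the kind of terminal-phase guard the upstream PR and the apiserver enforcement describe. This is not the actual Kubernetes code; the PodPhase type and the allowTransition/isTerminal functions below are illustrative stand-ins defined locally so the example compiles on its own.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings; defined locally so this
// sketch does not depend on k8s.io/api/core/v1.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// isTerminal reports whether a phase is final. A pod must never leave a
// terminal phase, so Failed -> Succeeded (the sequence seen in the node log)
// is illegal.
func isTerminal(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// allowTransition rejects any status update that would move a pod out of a
// terminal phase; a real implementation would also log the rejected update.
func allowTransition(oldPhase, newPhase PodPhase) bool {
	if isTerminal(oldPhase) && newPhase != oldPhase {
		return false
	}
	return true
}

func main() {
	// The transitions from the log above: Running -> Failed is fine,
	// Failed -> Succeeded must be dropped.
	fmt.Println(allowTransition(PodRunning, PodFailed))   // true
	fmt.Println(allowTransition(PodFailed, PodSucceeded)) // false
}
```

Applying the same check on the apiserver side, as noted above, would stop a bad phase update coming from any client rather than only from the kubelet's status manager.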