Bug 1534492 - Pod phase doesn't respect allowed transitions
Summary: Pod phase doesn't respect allowed transitions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Seth Jennings
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1544172
Depends On:
Blocks:
 
Reported: 2018-01-15 11:18 UTC by Tomáš Nožička
Modified: 2019-01-30 15:10 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-30 15:10:24 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
  Github openshift/origin issue 17595 - https://github.com/openshift/origin/issues/17595 (last updated 2018-01-15 11:18:24 UTC)
  Red Hat Product Errata RHBA-2019:0098 - https://access.redhat.com/errata/RHBA-2019:0098 (last updated 2019-01-30 15:10:36 UTC)

Description Tomáš Nožička 2018-01-15 11:18:25 UTC
Description of problem:
Pods with restart policy Never can transition from phase Failed to Succeeded, breaking all the workload controller logic. This breaks DeploymentConfig invariants and results in undefined behaviour.

Version-Release number of selected component (if applicable):
Master (3.9 currently)

How reproducible:
Sometimes.

Actual results:
Pod phase violates the Pod state transition diagram.

Expected results:
Pod phase respects the allowed pod transitions, and this is enforced in the apiserver (admission) so it can never happen again.
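
A minimal sketch of the kind of apiserver-side enforcement being asked for, assuming the k8s.io/api/core/v1 types; ValidatePodPhaseTransition and isTerminalPhase are hypothetical names, not the actual validation code:

// Sketch only: reject status updates that move a pod out of a terminal phase.
package validationsketch

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// isTerminalPhase reports whether a phase may never be left again.
func isTerminalPhase(p v1.PodPhase) bool {
    return p == v1.PodSucceeded || p == v1.PodFailed
}

// ValidatePodPhaseTransition returns an error for transitions such as
// Failed -> Succeeded, so the apiserver can refuse the update.
func ValidatePodPhaseTransition(oldPod, newPod *v1.Pod) error {
    oldPhase, newPhase := oldPod.Status.Phase, newPod.Status.Phase
    if isTerminalPhase(oldPhase) && newPhase != oldPhase {
        return fmt.Errorf("pod %s/%s: illegal phase transition %s -> %s",
            newPod.Namespace, newPod.Name, oldPhase, newPhase)
    }
    return nil
}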


Additional info:
https://github.com/openshift/origin/issues/17595

Comment 3 Seth Jennings 2018-01-22 19:24:16 UTC
Origin issue:
https://github.com/openshift/origin/issues/17595

Comment 4 Seth Jennings 2018-01-22 21:25:51 UTC
I've pulled the node log, filtered it by the UUID of the test pod to reduce the noise, and added breaks between timed chunks of log lines:

http://file.rdu.redhat.com/sjenning/bz1534492.log

The logs do show the status manager updating the API server with phase:Failed followed by phase:Succeeded.  Something is messed up here.

Comment 5 Seth Jennings 2018-01-22 23:10:04 UTC
The flow is this:

14:05:11 node sees new pod to start
14:05:17 sandbox container is running according to PLEG
14:05:22 app container is running according to PLEG
14:05:26 status manager updates pod with phase Running
14:05:26 (1/5s later) API requests pod DELETE
14:05:32 both sandbox and app containers have exited according to PLEG
14:05:34 status manager updates pod with phase Failed (exit code 2 in container status)
14:05:35 app container is deleted according to PLEG (non-existent)
14:05:35 (1/2s later) status manager tries to regenerate the status from the container status, but the container is deleted and the container status is null; it somehow decides that this is "Succeeded" and updates the pod
14:05:40 API requests pod REMOVE; removal fails with "pod not found"
14:05:44 pod removed from status map
14:06:27 sandbox container is deleted according to PLEG (non-existent)

The trigger is that the app container is removed before the API calls for its removal, and while processing the API request for deletion, the status is updated with no ContainerState information.
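
For illustration, a simplified sketch of this failure mode, assuming a kubelet-style helper that derives the phase only from the container statuses it can currently see (derivePhase is hypothetical; the real kubelet logic is more involved):

// Sketch of the failure mode only, not the actual kubelet code.
package kubeletsketch

import v1 "k8s.io/api/core/v1"

// derivePhase computes a phase purely from the visible container statuses.
func derivePhase(statuses []v1.ContainerStatus) v1.PodPhase {
    running, failed := 0, 0
    for _, s := range statuses {
        switch {
        case s.State.Running != nil:
            running++
        case s.State.Terminated != nil && s.State.Terminated.ExitCode != 0:
            failed++
        }
    }
    switch {
    case running > 0:
        return v1.PodRunning
    case failed > 0:
        return v1.PodFailed
    default:
        // Once the app container has been deleted, statuses is empty, so the
        // earlier exit code 2 is invisible here and the pod is reported as
        // Succeeded even though it had already been reported as Failed.
        return v1.PodSucceeded
    }
}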

Comment 6 Seth Jennings 2018-02-12 02:07:10 UTC
*** Bug 1544172 has been marked as a duplicate of this bug. ***

Comment 7 Seth Jennings 2018-02-12 20:34:07 UTC
I've opened a PR upstream. It doesn't fix the cause (still undetermined), but it does log when the illegal transition occurs and prevents it from propagating.

https://github.com/kubernetes/kubernetes/pull/59767
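
Conceptually the guard works along these lines (hypothetical names, assuming the status manager can compare the previously reported phase with the newly generated one; see the linked PR for the real change):

// Conceptual sketch of the guard; guardPhase and isTerminal are hypothetical.
package statussketch

import (
    "log"

    v1 "k8s.io/api/core/v1"
)

// isTerminal reports whether a phase may never be left again.
func isTerminal(p v1.PodPhase) bool {
    return p == v1.PodSucceeded || p == v1.PodFailed
}

// guardPhase logs and drops an illegal phase change, returning the phase that
// should actually be sent to the API server.
func guardPhase(podName string, oldPhase, newPhase v1.PodPhase) v1.PodPhase {
    if isTerminal(oldPhase) && newPhase != oldPhase {
        log.Printf("pod %s: illegal phase transition %s -> %s suppressed, keeping %s",
            podName, oldPhase, newPhase, oldPhase)
        return oldPhase
    }
    return newPhase
}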

Comment 8 Seth Jennings 2018-02-12 23:33:09 UTC
Origin PR:
https://github.com/openshift/origin/pull/18585

Comment 9 Tomáš Nožička 2018-03-01 11:48:58 UTC
It's still missing the enforcement in the apiserver.

Also, the fix in the PR seems not to be enough - https://github.com/openshift/origin/issues/17595#issuecomment-368286922

Comment 10 Seth Jennings 2018-03-01 18:27:14 UTC
switching back to this issue

Comment 11 Seth Jennings 2018-03-01 19:42:30 UTC
Origin master:
https://github.com/openshift/origin/pull/18791

Origin 3.9:
https://github.com/openshift/origin/pull/18792

Comment 12 Seth Jennings 2018-03-05 20:30:48 UTC
3.9 PR is merged.  Master should merge soon.

Comment 15 Weinan Liu 2019-01-15 09:23:19 UTC
Verified to be fixed

# oc run always-test --image=nginx --generator=run-pod/v1  --command=true /bin/false --restart='Always'
# oc run never-test --image=nginx --generator=run-pod/v1  --command=true /bin/false --restart='Never'

[root@qe-weinliu-3951-master-etcd-1 ~]# oc get po
NAME          READY     STATUS    RESTARTS   AGE
always-test   0/1       Error     5          3m
never-test    0/1       Error     0          2m
(pod never-test does not restart)

[root@qe-weinliu-3951-master-etcd-1 ~]# oc version
oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-3951-master-etcd-1:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657
[root@qe-weinliu-3951-master-etcd-1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 17 errata-xmlrpc 2019-01-30 15:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0098

