Description of problem:
Pods with restart policy Never can transition from phase Failed to Succeeded, breaking the workload controller logic. This violates DeploymentConfig invariants and results in undefined behaviour.

Version-Release number of selected component (if applicable):
Master (3.9 currently)

How reproducible:
Sometimes.

Actual results:
Pod phase violates the pod state transition diagram.

Expected results:
Pod phase respects the allowed pod transitions, and this is enforced in the apiserver (admission) so it can never happen again.

Additional info:
https://github.com/openshift/origin/issues/17595
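For illustration only, here is a minimal Go sketch of the kind of apiserver-side enforcement requested above. The function name and wiring are assumptions, not the actual Kubernetes validation code; it simply rejects a status update that would move a pod out of a terminal phase.

// Hypothetical sketch; not the real k8s.io/kubernetes validation code.
package validation

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validatePodPhaseTransition rejects status updates that move a pod out of a
// terminal phase, e.g. Failed -> Succeeded.
func validatePodPhaseTransition(oldPhase, newPhase v1.PodPhase, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	terminal := oldPhase == v1.PodFailed || oldPhase == v1.PodSucceeded
	if terminal && newPhase != oldPhase {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("phase"), newPhase,
			"pod phase cannot leave a terminal phase (Failed or Succeeded)"))
	}
	return allErrs
}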
Origin issue: https://github.com/openshift/origin/issues/17595
I've pulled the node log, filtered it by the UID of the test pod to reduce the noise, and added breaks between timed chunks of log lines: http://file.rdu.redhat.com/sjenning/bz1534492.log

The logs do show the status manager updating the API server with phase:Failed followed by phase:Succeeded. Something is messed up here.
The flow is this:

14:05:11 node sees new pod to start
14:05:17 sandbox container is running according to PLEG
14:05:22 app container is running according to PLEG
14:05:26 status manager updates pod with phase Running
14:05:26 (1/5s later) api requests pod DELETE
14:05:32 both sandbox and app containers have exited according to PLEG
14:05:34 status manager updates pod with phase Failed (exit code 2 in container status)
14:05:35 app container is deleted according to PLEG (non-existent)
14:05:35 (1/2s later) status manager tries to regenerate the status from the container status, but the container is deleted and the container status is null; somehow it decides that is "Succeeded" and updates the pod
14:05:40 api requests pod REMOVE, removal fails with "pod not found"
14:05:44 pod removed from status map
14:06:27 sandbox container is deleted according to PLEG (non-existent)

The trigger is that the app container is removed before the API calls for its removal, and while processing the API request for deletion, the status is updated with no ContainerState information.
*** Bug 1544172 has been marked as a duplicate of this bug. ***
I've opened a PR upstream. It doesn't fix the cause (still undetermined), but it does log when the transition occurs and prevents it from propagating. https://github.com/kubernetes/kubernetes/pull/59767
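As a rough sketch of that kind of kubelet-side guard (the helper name and call site are hypothetical, not the exact code in the PR): the status manager refuses to apply, and logs, a regenerated status that would move an already-terminal pod to a different phase.

// Hypothetical helper; not the exact code from kubernetes/kubernetes#59767.
package status

import (
	"log"

	v1 "k8s.io/api/core/v1"
)

// isTerminalPhase reports whether a pod phase is terminal.
func isTerminalPhase(p v1.PodPhase) bool {
	return p == v1.PodFailed || p == v1.PodSucceeded
}

// allowPhaseChange drops (and logs) any update that would move a pod out of a
// terminal phase, e.g. Failed -> Succeeded after container statuses vanish.
func allowPhaseChange(podName string, oldPhase, newPhase v1.PodPhase) bool {
	if isTerminalPhase(oldPhase) && newPhase != oldPhase {
		log.Printf("status manager: refusing illegal phase transition for pod %s: %s -> %s",
			podName, oldPhase, newPhase)
		return false
	}
	return true
}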
Origin PR: https://github.com/openshift/origin/pull/18585
It's still missing the enforcement in the apiserver. Also, the fix in the PR does not seem to be enough - https://github.com/openshift/origin/issues/17595#issuecomment-368286922
switching back to this issue
Origin master: https://github.com/openshift/origin/pull/18791
Origin 3.9: https://github.com/openshift/origin/pull/18792
3.9 PR is merged. Master should merge soon.
Verified to be fixed

# oc run always-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Always'
# oc run never-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Never'

[root@qe-weinliu-3951-master-etcd-1 ~]# oc get po
NAME          READY     STATUS    RESTARTS   AGE
always-test   0/1       Error     5          3m
never-test    0/1       Error     0          2m
(pod never-test does not restart)

[root@qe-weinliu-3951-master-etcd-1 ~]# oc version
oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-3951-master-etcd-1:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657

[root@qe-weinliu-3951-master-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0098