Bug 1534492
| Summary: | Pod phase doesn't respect allowed transitions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | Node | Assignee: | Seth Jennings <sjenning> |
| Status: | CLOSED ERRATA | QA Contact: | Weinan Liu <weinliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | aos-bugs, ccoleman, jokerman, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-30 15:10:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tomáš Nožička
2018-01-15 11:18:25 UTC
Origin issue: https://github.com/openshift/origin/issues/17595

I've pulled the node log, filtered it by the UUID of the test pod to reduce the noise, and added breaks between timed chunks of log lines: http://file.rdu.redhat.com/sjenning/bz1534492.log

The logs do show the status manager updating the API server with phase:Failed followed by phase:Succeeded. Something is messed up here. The flow is this:

14:05:11 node sees the new pod to start
14:05:17 sandbox container is running according to PLEG
14:05:22 app container is running according to PLEG
14:05:26 status manager updates the pod with phase Running
14:05:26 (1/5 s later) API requests pod DELETE
14:05:32 both sandbox and app containers have exited according to PLEG
14:05:34 status manager updates the pod with phase Failed (exit code 2 in the container status)
14:05:35 app container is deleted according to PLEG (non-existent)
14:05:35 (1/2 s later) status manager tries to regenerate the status from the container status, but the container is deleted and the container status is null; it somehow decides that means "Succeeded" and updates the pod
14:05:40 API requests pod REMOVE; removal fails with "pod not found"
14:05:44 pod removed from the status map
14:06:27 sandbox container is deleted according to PLEG (non-existent)

The trigger is that the app container is removed before the API call for its removal, and while the deletion request is being processed, the status is updated with no ContainerState information.

*** Bug 1544172 has been marked as a duplicate of this bug. ***

I've opened a PR upstream. It doesn't fix the cause (still undetermined), but it does log when the illegal transition occurs and prevents it from propagating: https://github.com/kubernetes/kubernetes/pull/59767

It's still missing the enforcement in the apiserver. Also, the fix in the PR seems not to be enough - https://github.com/openshift/origin/issues/17595#issuecomment-368286922 - switching back to this issue.

Origin master: https://github.com/openshift/origin/pull/18791
Origin 3.9: https://github.com/openshift/origin/pull/18792

The 3.9 PR is merged. Master should merge soon.

Verified to be fixed:

# oc run always-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Always'
# oc run never-test --image=nginx --generator=run-pod/v1 --command=true /bin/false --restart='Never'

[root@qe-weinliu-3951-master-etcd-1 ~]# oc get po
NAME          READY     STATUS    RESTARTS   AGE
always-test   0/1       Error     5          3m
never-test    0/1       Error     0          2m

(pod never-test does not restart)

[root@qe-weinliu-3951-master-etcd-1 ~]# oc version
oc v3.9.51
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-weinliu-3951-master-etcd-1:8443
openshift v3.9.51
kubernetes v1.9.1+a0ce1bc657

[root@qe-weinliu-3951-master-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0098
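For reference, a minimal, self-contained Go sketch of the kind of terminal-phase guard the upstream PR and the apiserver enforcement describe. This is not the actual Kubernetes code; the PodPhase type and the allowTransition/isTerminal functions below are illustrative stand-ins defined locally so the example compiles on its own.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings; defined locally so this
// sketch does not depend on k8s.io/api/core/v1.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// isTerminal reports whether a phase is final. A pod must never leave a
// terminal phase, so Failed -> Succeeded (the sequence seen in the node log)
// is illegal.
func isTerminal(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// allowTransition rejects any status update that would move a pod out of a
// terminal phase; a real implementation would also log the rejected update.
func allowTransition(oldPhase, newPhase PodPhase) bool {
	if isTerminal(oldPhase) && newPhase != oldPhase {
		return false
	}
	return true
}

func main() {
	// The transitions from the log above: Running -> Failed is fine,
	// Failed -> Succeeded must be dropped.
	fmt.Println(allowTransition(PodRunning, PodFailed))   // true
	fmt.Println(allowTransition(PodFailed, PodSucceeded)) // false
}
```

Applying the same check on the apiserver side, as noted above, would stop a bad phase update coming from any client rather than only from the kubelet's status manager.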