Bug 1788741
| Summary: | Build hung in Running state after backing pod died with CreateContainerError | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Build | Assignee: | Gabe Montero <gmontero> |
| Status: | CLOSED DUPLICATE | QA Contact: | wewang <wewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, gmontero, wzheng |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-23 15:30:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | pod and build yaml (attachment 1667946) | | |
Description
Mike Fiedler 2020-01-08 01:10:48 UTC
I only ever saw this one time on 4.3. I've attempted multiple times to reproduce on 4.4 without success. Closing this as works for me; I will re-open or file a new bz if it is seen again.

I was able to reproduce this on 4.4.0-0.nightly-2020-03-05-104321. The cluster had OVN as the network plugin; I'm not sure if that had anything to do with it. I will attach the build and pod yaml as requested in comment 2.

Created attachment 1667946 [details]
pod and build yaml
The pod completed with ContainerError, and the build hung in the Running state.

I'm guessing the ContainerError could be caused by OVN, but it seems the build should also error out when the pod errors. The container status for this one was:
```yaml
containerStatuses:
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5387bbe2fc8239290c576925c8192cf0bddb0faf4c81446987a3ae8ddede12c3
  imageID: ""
  lastState: {}
  name: sti-build
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      message: 'the container name "k8s_sti-build_django-psql-example-6-build_svt3_50b3e76c-9f82-469c-8d5e-255e6302ffa5_0"
        is already in use by "e39e696bd259f174c77f652886936f2273fa61cd6069931a863e058089c4b662".
        You have to remove that container to be able to reuse that name.: that name
        is already in use'
      reason: CreateContainerError
```
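For reference, a status like this can also be pulled programmatically instead of from the yaml dump. Below is a minimal client-go sketch, not part of this bug or of any fix; the kubeconfig path is a placeholder, and the pod and namespace names are inferred from the attached yaml:

```go
// Minimal sketch: fetch the build pod and print each container's state.
// The kubeconfig path is a placeholder; pod/namespace come from the yaml above.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod, err := client.CoreV1().Pods("svt3").Get(context.TODO(), "django-psql-example-6-build", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cs := range pod.Status.ContainerStatuses {
		switch {
		case cs.State.Waiting != nil:
			fmt.Printf("%s waiting: %s - %s\n", cs.Name, cs.State.Waiting.Reason, cs.State.Waiting.Message)
		case cs.State.Terminated != nil:
			fmt.Printf("%s terminated: exit code %d\n", cs.Name, cs.State.Terminated.ExitCode)
		default:
			fmt.Printf("%s running or not yet reported\n", cs.Name)
		}
	}
}
```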
Notice that the container state is waiting, not terminated.

In the various spots where it checks container status, the build controller currently assumes the state will be terminated; it does not consider a waiting state with a reason like CreateContainerError.
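To make the gap concrete, here is a rough sketch of the kind of check that would cover both cases. This is not the actual build controller code; the package and function names are made up:

```go
package buildmonitor

import corev1 "k8s.io/api/core/v1"

// buildContainerDead is an illustrative helper, not the shipped controller
// logic: it flags both the terminated case the controller already assumes
// and the waiting-with-CreateContainerError case seen in this bug.
func buildContainerDead(cs corev1.ContainerStatus) bool {
	// case the controller already checks: the container ran and exited non-zero
	if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
		return true
	}
	// the permutation from this bug: the container never started and is stuck
	// in a waiting state with a fatal reason
	if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CreateContainerError" {
		return true
	}
	return false
}
```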
Kube appears to have been this way for a while, reporting this as a waiting state ... check out this cri-o issue from Derek Carr almost 3 years ago, which shows a similar status: https://github.com/cri-o/cri-o/issues/815
```yaml
containerStatuses:
- image: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.6.3
  imageID: ""
  lastState: {}
  name: kubernetes-dashboard
  ready: false
  restartCount: 0
  state:
    waiting:
      message: |
        container create failed: container_linux.go:265: starting container process caused "process_linux.go:264: applying cgroup configuration for process caused \"No such device or address\""
      reason: CreateContainerError
hostIP: 127.0.0.1
phase: Pending
podIP: 10.88.4.255
qosClass: Burstable
startTime: 2017-08-30T00:50:12Z
```
My initial scans of the k8s code vendored into openshift have not yet found where this is actually set, so I can't yet tell whether there are other waiting reason codes to worry about, but based on the message it implies to me that it is allowing for manual intervention.

Pending feedback, either here or in subsequent PRs, it looks like we'll have to expand our status checks minimally for this permutation.
Hit send too soon ... found that error code in the kubelet:
```
./vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go: ErrCreateContainer = errors.New("CreateContainerError")
```
These appear to be the error reasons with waiting state that we'll need to look for, based on what I see in the kubelet's start-container logic:
```go
// ErrCreateContainerConfig - failed to create container config
ErrCreateContainerConfig = errors.New("CreateContainerConfigError")
// ErrCreateContainer - failed to create container
ErrCreateContainer = errors.New("CreateContainerError")
// ErrPreStartHook - failed to execute PreStartHook
ErrPreStartHook = errors.New("PreStartHookError")
// ErrPostStartHook - failed to execute PostStartHook
ErrPostStartHook = errors.New("PostStartHookError")
```
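Assuming those four reasons are the complete set, collecting them into a lookup that a status check can consult might look like the sketch below; the package, variable, and function names are illustrative only, not the shipped fix:

```go
package buildmonitor

import corev1 "k8s.io/api/core/v1"

// fatalWaitingReasons mirrors the kubelet error reasons quoted above; the
// variable name and this helper are illustrative only.
var fatalWaitingReasons = map[string]struct{}{
	"CreateContainerConfigError": {},
	"CreateContainerError":       {},
	"PreStartHookError":          {},
	"PostStartHookError":         {},
}

// waitingFatally reports whether a container is waiting with one of the
// error reasons listed above.
func waitingFatally(cs corev1.ContainerStatus) bool {
	if cs.State.Waiting == nil {
		return false
	}
	_, fatal := fatalWaitingReasons[cs.State.Waiting.Reason]
	return fatal
}
```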
After a discussion between Adam, Nalin, and myself, we've concluded the root cause here was cri-o bug 1823949, which has just been verified in 4.4.0.

*** This bug has been marked as a duplicate of bug 1823949 ***

I saw no mention of 4.3 or of potential backport efforts for bug 1823949. Please work through that bug for questions around that, if needed for the e2e's that found this. Thanks.