Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1788741

Summary: Build hung in Running state after backing pod died with CreateContainerError
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Build    Assignee: Gabe Montero <gmontero>
Status: CLOSED DUPLICATE QA Contact: wewang <wewang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3.0    CC: aos-bugs, gmontero, wzheng
Target Milestone: ---    Keywords: Reopened
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-23 15:30:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
pod and build yaml (Flags: none)

Description Mike Fiedler 2020-01-08 01:10:48 UTC
Description of problem:

While running concurrent build stress tests of the cakephp example app, I hit a condition where two builds were stuck in the Running state even though the pods behind them had terminated with CreateContainerError.

- oc logs for the builds just hangs
- oc logs for the pod returns:

# oc logs cakephp-mysql-example-4-build
Error from server (BadRequest): container "sti-build" in pod "cakephp-mysql-example-4-build" is waiting to start: CreateContainerError

It seems like the build should error out as well. oc adm must-gather output will be attached.

The builds are:

# oc get builds --all-namespaces | grep Running
svt-cakephp-53   cakephp-mysql-example-4   Source   Git@133cb8b   Running                       3 hours ago   
svt-cakephp-64   cakephp-mysql-example-4   Source   Git@133cb8b   Running                       3 hours ago


Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2020-01-06-185654


How reproducible: Unknown. I'll clean up and try again.


Steps to Reproduce:
1. On a 3 master/3 worker (m5.2xlarge) cluster, run 75 concurrent builds of the cakephp application 4 times
2. The issue may be hit. I will see how reproducible it is.

Comment 5 Mike Fiedler 2020-02-06 14:10:54 UTC
I only ever saw this one time on 4.3. I've attempted multiple times to reproduce it on 4.4 without success. Closing this as WORKSFORME; I will re-open or file a new bz if it is seen again.

Comment 6 Mike Fiedler 2020-03-05 22:50:56 UTC
I was able to reproduce this on 4.4.0-0.nightly-2020-03-05-104321. The cluster had OVN as the network plugin; I'm not sure whether that had anything to do with it.

I will attach the build and pod yaml as requested in comment 2.

Comment 8 Mike Fiedler 2020-03-05 23:04:47 UTC
Created attachment 1667946 [details]
pod and build yaml

The pod completed with ContainerError; the build hung in the Running state.

Comment 9 Mike Fiedler 2020-03-05 23:05:42 UTC
I'm guessing the ContainerError could be caused by OVN, but it seems the build should also error out when the pod errors.

Comment 10 Gabe Montero 2020-04-16 18:56:42 UTC
So yeah, the container status for this one was:

 containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5387bbe2fc8239290c576925c8192cf0bddb0faf4c81446987a3ae8ddede12c3
    imageID: ""
    lastState: {}
    name: sti-build
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'the container name "k8s_sti-build_django-psql-example-6-build_svt3_50b3e76c-9f82-469c-8d5e-255e6302ffa5_0"
          is already in use by "e39e696bd259f174c77f652886936f2273fa61cd6069931a863e058089c4b662".
          You have to remove that container to be able to reuse that name.: that name
          is already in use'
        reason: CreateContainerError
 

Notice the ContainerState is waiting, not terminated.

In the various places where the build controller checks container status, it currently assumes the state will be terminated; it does not consider the waiting state with reasons like CreateContainerError.
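 
To make the gap concrete, here is a minimal Go sketch using the k8s.io/api/core/v1 types. It is not the actual build controller code, and classifyContainerFailure is an illustrative name: the point is simply that a container that fails to start surfaces as a Waiting state with a reason, so a check that only looks at Terminated never sees the failure.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// classifyContainerFailure is a hypothetical helper that illustrates the gap
// described above: only the Terminated branch is treated as a build failure,
// while a container stuck in Waiting with CreateContainerError falls elsewhere.
func classifyContainerFailure(cs corev1.ContainerStatus) string {
	switch {
	case cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0:
		// The case the existing checks assume.
		return fmt.Sprintf("terminated failure: %s", cs.State.Terminated.Reason)
	case cs.State.Waiting != nil:
		// The case from this bug: reasons like CreateContainerError land here.
		return fmt.Sprintf("waiting (possibly fatal): %s", cs.State.Waiting.Reason)
	default:
		return "no failure detected"
	}
}

func main() {
	stuck := corev1.ContainerStatus{
		Name: "sti-build",
		State: corev1.ContainerState{
			Waiting: &corev1.ContainerStateWaiting{Reason: "CreateContainerError"},
		},
	}
	fmt.Println(classifyContainerFailure(stuck)) // waiting (possibly fatal): CreateContainerError
}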

Kube appears to have reported this kind of failure as a waiting state for a while ... check out the cri-o issue from Derek Carr from almost 3 years ago: https://github.com/cri-o/cri-o/issues/815

  containerStatuses:
  - image: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.6.3
    imageID: ""
    lastState: {}
    name: kubernetes-dashboard
    ready: false
    restartCount: 0
    state:
      waiting:
        message: |
          container create failed: container_linux.go:265: starting container process caused "process_linux.go:264: applying cgroup configuration for process caused \"No such device or address\""
        reason: CreateContainerError
  hostIP: 127.0.0.1
  phase: Pending
  podIP: 10.88.4.255
  qosClass: Burstable
  startTime: 2017-08-30T00:50:12Z


My initial scans of the k8s code vendored into openshift have not yet found where this is actually set, so I can't yet tell whether there are other waiting reason codes to worry about, but based on the message, it implies to me that it is allowing for manual intervention.

Pending feedback either here or in subsequent PRs, it looks like we'll have to expand our status checks minimally for this permutation.

Comment 11 Gabe Montero 2020-04-16 19:06:02 UTC
Hit send too soon ... found that error code in the kubelet:

./vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:	ErrCreateContainer = errors.New("CreateContainerError")


These appear to be the waiting-state error reasons we'll need to look for, based on what I see in the kubelet's start container logic (see the sketch after the list):

	// ErrCreateContainerConfig - failed to create container config
	ErrCreateContainerConfig = errors.New("CreateContainerConfigError")
	// ErrCreateContainer - failed to create container
	ErrCreateContainer = errors.New("CreateContainerError")
	// ErrPreStartHook - failed to execute PreStartHook
	ErrPreStartHook = errors.New("PreStartHookError")
	// ErrPostStartHook - failed to execute PostStartHook
	ErrPostStartHook = errors.New("PostStartHookError")
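 
As a rough illustration only, assuming the four reasons listed above are the complete set, a check along these lines could flag such pods. fatalWaitingReasons and podHasFatalWaitingContainer are hypothetical names for this sketch, not the build controller's actual API.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Waiting-state reasons from the kubelet's start-container path that should
// fail the build rather than leave it Running (assumed set, per the list above).
var fatalWaitingReasons = map[string]bool{
	"CreateContainerConfigError": true,
	"CreateContainerError":       true,
	"PreStartHookError":          true,
	"PostStartHookError":         true,
}

// podHasFatalWaitingContainer reports whether any container in the build pod
// is stuck in a Waiting state with one of the reasons above, and which one.
func podHasFatalWaitingContainer(pod *corev1.Pod) (bool, string) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && fatalWaitingReasons[cs.State.Waiting.Reason] {
			return true, fmt.Sprintf("container %q: %s", cs.Name, cs.State.Waiting.Reason)
		}
	}
	return false, ""
}

func main() {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "sti-build",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{Reason: "CreateContainerError"},
				},
			}},
		},
	}
	if fatal, why := podHasFatalWaitingContainer(pod); fatal {
		fmt.Println("build should be marked failed:", why)
	}
}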

Comment 12 Gabe Montero 2020-04-23 15:30:43 UTC
After a discussion among Adam, Nalin, and myself, we've concluded that the root cause here was cri-o bug 1823949, which has just been verified in 4.4.0.

*** This bug has been marked as a duplicate of bug 1823949 ***

Comment 13 Gabe Montero 2020-04-23 15:32:58 UTC
I saw no mention of 4.3 or potential backport efforts for bug 1823949.

Please work through that bug for questions about a backport, if one is needed for the e2e's that found this.

thanks