Bug 1469572

Summary: Build pod status shows wrong value ExitCode:0
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: DeShuai Ma <dma>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, decarr, jokerman, mifiedle, mmccomas, vlaad, wmeng
Target Milestone: ---
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-12 15:58:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
pod yaml (flags: none)

Description Vikas Laad 2017-07-11 14:05:14 UTC
Created attachment 1296263 [details]
pod yaml

Description of problem:
After running many concurrent builds, some build pods show the wrong status value "ExitCode:0". In this case I saw it after around 20 rounds of 30 concurrent builds on 2 m4.xlarge compute nodes.

Version-Release number of selected component (if applicable):
openshift v3.6.114
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64


Actual results:
Build pod status shows the wrong value ("ExitCode:0").

Expected results:
Build pod status should show the correct value (Complete, Error, etc.)

Additional info:
Please see the attached pod yaml.

Comment 1 Derek Carr 2017-07-11 15:54:39 UTC
From the attached pod YAML file, the CLI is operating as expected.

If the pod.status.containerStatuses[x].state.terminated.reason value is populated, we show that value.

If the pod.status.containerStatuses[x].state.terminated.reason is empty, we show the ExitCode:%d value as reported.

The pod.status.containerStatuses[x].state.terminated.reason value is populated to "Completed" by default in dockershim's docker_container.go if the container had startedAt and finishedAt values reported from the container runtime.  Absent those values, no reason is provided.
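
For reference, a minimal sketch of that decision logic as described above; the type and function names are illustrative, not the actual kubernetes/origin code:

// Sketch: the printer prefers the terminated reason, otherwise falls back to
// "ExitCode:%d"; dockershim only defaults the reason to "Completed" when the
// runtime reported both timestamps. Illustrative names, not the real source.
package main

import (
	"fmt"
	"time"
)

type terminated struct {
	ExitCode   int
	Reason     string
	StartedAt  time.Time
	FinishedAt time.Time
}

// displayState mirrors the CLI behavior: show the reason when set, else ExitCode:%d.
func displayState(t terminated) string {
	if t.Reason != "" {
		return t.Reason
	}
	return fmt.Sprintf("ExitCode:%d", t.ExitCode)
}

// defaultReason mirrors the dockershim behavior: only set "Completed" when
// both timestamps were reported by the runtime.
func defaultReason(t *terminated) {
	if t.Reason == "" && !t.StartedAt.IsZero() && !t.FinishedAt.IsZero() {
		t.Reason = "Completed"
	}
}

func main() {
	done := terminated{ExitCode: 0, StartedAt: time.Now(), FinishedAt: time.Now()}
	defaultReason(&done)
	fmt.Println(displayState(done)) // Completed

	broken := terminated{ExitCode: 0} // timestamps missing, as in the attached YAML
	defaultReason(&broken)
	fmt.Println(displayState(broken)) // ExitCode:0
}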

From the attached pod YAML file, we can see that finishedAt and startedAt are actually null:

    state:
      terminated:
        containerID: docker://da46510645225c5d3cd86cc59e08ff232c35b08c15edff18673de27c003b5157
        exitCode: 0
        finishedAt: null
        startedAt: null

The timestamps are taken from the output of a docker inspect call on the container.

Are there any node logs that report:

"failed to parse timestamp for container: %q: %v"

Comment 2 Vikas Laad 2017-07-11 16:13:41 UTC
Hi Derek,

I do not see this string in the logs on any of the nodes.

Comment 3 Seth Jennings 2017-07-11 21:15:37 UTC
I suspect that this is the docker inspect call timing out and not filling in the container status.  This results in everything except the ContainerID being empty:

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go#L1265-L1273
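
A rough sketch of that fallback, not the vendored source: only the container ID is carried over, so exitCode, reason and the timestamps stay at their zero values and the CLI ends up printing ExitCode:0.

// Sketch: when the kubelet cannot derive a full status from the runtime
// (e.g. the inspect call timed out), the terminated state carries little more
// than the container ID. Illustrative types, not the actual kubelet code.
package main

import "fmt"

type containerStatus struct {
	ContainerID string
	ExitCode    int
	Reason      string
}

func fallbackStatus(dockerID string) containerStatus {
	// Everything except the ID stays at its zero value.
	return containerStatus{ContainerID: "docker://" + dockerID}
}

func main() {
	s := fallbackStatus("da46510645225c5d3cd86cc59e08ff232c35b08c15edff18673de27c003b5157")
	fmt.Printf("%+v\n", s) // ExitCode:0, Reason:"" -> CLI prints ExitCode:0
}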

Without the log entry to prove it, the only thing I can think of is to look at the kubelet latency metrics for the docker inspect calls and see if the 99th percentile is up against the timeout.
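
One way to eyeball those metrics, as a sketch: scrape the kubelet metrics endpoint and keep the docker-operation latency series for inspect_container. The read-only port 10255 and the metric/label names (kubelet_docker_operations_latency_microseconds, operation_type="inspect_container") are assumptions about this 3.6 / kube 1.6 setup and may differ.

// Sketch: fetch the kubelet metrics page and print the docker inspect latency
// samples; check the quantile="0.99" values against the runtime timeout.
// Port and metric names are assumptions for this environment.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:10255/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "docker_operations_latency") &&
			strings.Contains(line, "inspect_container") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}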

Comment 4 Derek Carr 2017-07-12 15:58:58 UTC
I do not think this should block a release as it had no impact on the actual workload running to completion.  Instead, the kubelet was unable to derive the state of the container from the inspect call.  Absent a log, I do not know how to proceed.  If we encounter this bug again, can we re-open with log detail so we can verify what we believe happened?

Comment 5 Vikas Laad 2017-07-12 17:09:43 UTC
I still have the environment; I just did not find that string in the logs. Please let me know if someone wants to look at it.

Comment 6 Seth Jennings 2017-07-12 17:52:56 UTC
All the pods that have this issue are over a day old.  

# oc get pods --all-namespaces | grep ExitCode:0
proj13      cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d
proj20      cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d
proj22      cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d
proj26      cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d
proj3       cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d
proj8       cakephp-mysql-example-33-build   0/1       ExitCode:0             0          1d

I doubt you will find anything in the logs on any of them at the current log verbosity.

We need a fresh pod that exhibits the behavior, and to capture the node log with all the context around it, if we are to look into this.