Bug 1449820

Summary: Pod STATUS field is showing actual error message
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED ERRATA
QA Contact: Vikas Laad <vlaad>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: amurdaca, aos-bugs, bparees, decarr, dma, dwalsh, jhonce, jokerman, jupierce, mifiedle, mmccomas, mpatel, trankin, vlaad, wmeng, xtian
Target Milestone: ---
Keywords: Regression
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 21:54:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (flags: none on all):
    oc describe pod
    oc describe build
    describe pod
    pod yaml where I see ExitCode:0 status

Description Vikas Laad 2017-05-10 19:46:37 UTC
Description of problem:
oc get pods shows the actual error message in the STATUS field instead of just showing "Error".

root@ip-172-31-4-211: ~ # oc get pods --all-namespaces -o wide| grep Error
proj6       cakephp-mysql-example-17-build   0/1       rpc error: code = 2 desc = failed to start container "f4b8445c1ef6564ff2e6641a98c59f2cdc83cf714854bf1924d41076a7894b62": Error response from daemon: {"message":"linux runtime spec devices: lstat /dev/block/253:19: no such file or directory"}   0          2m        172.20.2.193    ip-172-31-17-182.us-west-2.compute.internal
proj6       cakephp-mysql-example-3-build    0/1       Error                                                                                                                                                                                                                                               0          48m       172.20.2.45     ip-172-31-17-182.us-west-2.compute.internal

Version-Release number of selected component (if applicable):
root@ip-172-31-4-211: ~ # oc version 
oc v3.6.74
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-4-211.us-west-2.compute.internal:8443
openshift v3.6.74
kubernetes v1.6.1+5115d708d7

How reproducible:
Only a few failed builds show this, not all of them.

Steps to Reproduce:
1. Start cakephp builds; I am running 30 concurrent builds.
2. Check the STATUS field for the failed builds.

Actual results:
The STATUS field shows the actual error message.

Expected results:
The STATUS field should show "Error", not the actual error message.


Additional info:

Comment 1 Vikas Laad 2017-05-10 20:22:14 UTC
Created attachment 1277696 [details]
oc describe pod

Comment 2 Vikas Laad 2017-05-10 20:22:39 UTC
Created attachment 1277697 [details]
oc describe build

Comment 3 Ben Parees 2017-05-10 20:47:50 UTC
I don't think that's right, but I also don't think it's a build issue. We don't set that field; this seems like a pure pod issue.

Comment 4 Derek Carr 2017-05-11 18:34:49 UTC
The kubelet is reporting the error returned by the docker daemon when the container fails to start. The reason is "ContainerCannotRun", and the message seems useful enough not to omit in this case: "linux runtime spec devices: lstat /dev/block/253:19: no such file or directory".

I am moving this to the containers team to understand why or when this message would come back from docker.

Comment 5 Ben Parees 2017-05-11 18:52:31 UTC
I agree it's a useful message, but should it really be reported in the status column like that, rather than in an event or something?

I'd expect the status column to contain well-defined enumerated values like "new", "pending", "running", "complete", "error", and "failed".

Comment 6 Vikas Laad 2017-05-11 18:56:48 UTC
This started happening recently; in 3.5 it used to show ContainerCannotRun in the Status column.

Comment 7 Ben Parees 2017-05-12 01:54:10 UTC
Sounds like there are really two bugs that should be opened here:

1) the root cause of the container error
2) the fact that the error is being exposed in the oc get pods output when it should be masked into a general pod state like "ContainerCannotRun"

Comment 8 Vivek Goyal 2017-05-12 11:47:09 UTC
Please always paste "docker info" output in any bug you open; that way we have some idea what version you are working with.

I think this is a runc issue, and I suspect it is related to one of the races we had fixed:

https://bugzilla.redhat.com/show_bug.cgi?id=1445134

The fix has been committed to the docker-1.12.6 branch now.


commit ddc16b9c2ad8408ee4557b52abbb55e179a56dfc
Author: Antonio Murdaca <runcom>
Date:   Mon Apr 24 20:01:35 2017 +0200

    Fix BZ#1409076: Race condition in container creation using devicemapper "no such file or directory"
    
    Signed-off-by: Antonio Murdaca <runcom>


You will need to make sure that your docker contains this fix.

Comment 9 Vikas Laad 2017-05-12 13:52:03 UTC
Hi Vivek,

Here is the docker version; please let me know if you want me to run this test with any specific version of docker.

root@ip-172-31-4-211: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Comment 10 Vikas Laad 2017-05-12 13:54:05 UTC
(In reply to Ben Parees from comment #7)
> Sounds like there are really two bugs that should be opened here:
> 
> 1) the root cause of the container error
> 2) the fact that the error is being exposed in the oc get pods output when
> it should be masked into a general pod state like "ContainerCannotRun"

Hi Ben,

I already have a bug open for the root cause:

https://bugzilla.redhat.com/show_bug.cgi?id=1436731

Comment 11 Ben Parees 2017-05-12 13:55:42 UTC
Thanks; then I think this bug should go back to the kubernetes component to determine why the error is showing up in the get pods output.

Comment 12 Vivek Goyal 2017-05-12 19:33:46 UTC
Vikas, I don't know which docker version has the fix; just upgrade to the latest available.

Antonio, do you know which docker version has this fix?

Comment 13 Antonio Murdaca 2017-05-12 20:42:22 UTC
It's available from build "-18":

commit e29c0400453c4e759d9892943280cd5e198864df
Author: Lokesh Mandvekar <lsm5>
Date:   Mon Apr 24 14:55:10 2017 -0400

    docker-2:1.12.6-18.git29d6f69

    - built docker @projectatomic/docker-1.12.6 commit 29d6f69
    - built docker-novolume-plugin commit 385ec70
    - built rhel-push-plugin commit 70653ed
    - built docker-lvm-plugin commit 8647404
    - built docker-runc @projectatomic/docker-1.12.6 commit 81b2542
    - built docker-containerd @projectatomic/docker-1.12.6 commit f3f35e9

    Signed-off-by: Lokesh Mandvekar <lsm5>

Comment 14 Vikas Laad 2017-05-12 20:50:22 UTC
I will run the tests again with the *-18 version. I had to stop testing on this version of docker in the last run because of the following issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1449297

Comment 15 Vikas Laad 2017-06-13 15:22:08 UTC
I am still seeing this issue with the following versions:

openshift v3.6.106
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64


root@ip-172-31-22-163: ~/svt/openshift_performance/ci/scripts # oc get pods -n proj21
NAME                             READY     STATUS                                                 RESTARTS   AGE
cakephp-mysql-example-22-build   0/1       Completed                                              0          12m
cakephp-mysql-example-23-build   0/1       Completed                                              0          10m
cakephp-mysql-example-24-build   0/1       rpc error: code = 4 desc = context deadline exceeded   0          8m
cakephp-mysql-example-3-build    0/1       Completed                                              0          51m

Also attaching the describe pod output.

Comment 16 Vikas Laad 2017-06-13 15:23:55 UTC
Created attachment 1287371 [details]
describe pod

Comment 17 Vivek Goyal 2017-06-14 18:32:11 UTC
Vikas, 

How do you know it's the same error? I don't see any "lstat" errors as we were seeing previously.

I think this is a different issue.

Comment 18 Vikas Laad 2017-06-14 18:37:42 UTC
Hi Vivek,

I think there is some confusion; there were two problems, as Ben noted in comment 7:
1. The root cause of the problem, for which we have bug https://bugzilla.redhat.com/show_bug.cgi?id=1436731
2. The error message showing in the list instead of the status of the pod.

We are tracking the second problem in this bug; if you are talking about the actual error, it is tracked in the above-mentioned bug.

Comment 19 Vivek Goyal 2017-06-14 18:48:43 UTC
Vikas, 

OK, if you are not tracking the "lstat" error, then it's some other problem. I am not sure why it should be assigned to me.

Comment 20 Vivek Goyal 2017-06-14 19:15:34 UTC
OK, if the complaint is that the status field should just show "Error" and not the actual error message, then it is an OpenShift issue and should be assigned to an OpenShift engineer.

Comment 21 Ben Parees 2017-06-14 19:16:21 UTC
It's a k8s issue; moving to the k8s component.

Comment 22 Vikas Laad 2017-06-14 19:26:03 UTC
Jhon, what information do you need from me?

Comment 24 Seth Jennings 2017-06-21 03:25:00 UTC
Looking at the pod description:

    State:		Waiting
      Reason:		rpc error: code = 2 desc = failed to start container "f4b8445c1ef6564ff2e6641a98c59f2cdc83cf714854bf1924d41076a7894b62": Error response from daemon: {"message":"linux runtime spec devices: lstat /dev/block/253:19: no such file or directory"}
    Last State:		Terminated
      Reason:		ContainerCannotRun
      Message:		linux runtime spec devices: lstat /dev/block/253:19: no such file or directory
      Exit Code:	128
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Wed, 10 May 2017 15:33:18 -0400

The reason in the Waiting state is the CRI error. Looking at the code, the reason should only ever be ContainerCreating, ErrImagePull, ImagePullBackOff, or Blocked while in the Waiting state. Init containers can also have the reason PodInitializing while in the Waiting state.

I think this is where the raw error message is getting into the status:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1378
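
To make the before/after concrete, here is a minimal runnable Go sketch of the behavior in question. It is illustrative only: the type below is a stand-in, not the actual kubelet/API code, and "ContainerCannotRun" is used as an example of an enumerated reason (what 3.5 showed per comment 6), not necessarily the value the PR chose. The point is that oc/kubectl print the Waiting reason verbatim in the STATUS column, so whatever string lands there is what users see:

    package main

    import (
        "errors"
        "fmt"
    )

    // Illustrative stand-in for the waiting-state fields of a container
    // status; not the real kubelet/API types.
    type waiting struct {
        Reason  string // printed verbatim in the STATUS column by oc get pods
        Message string // only surfaced by oc describe pod / pod yaml
    }

    // Buggy behavior: the raw CRI/runtime error string becomes the reason.
    func waitingBuggy(err error) waiting {
        return waiting{Reason: err.Error()}
    }

    // Fixed behavior: an enumerated reason goes in STATUS, and the raw
    // error stays available in Message. "ContainerCannotRun" is only an
    // example value here, not necessarily what the PR chose.
    func waitingFixed(err error) waiting {
        return waiting{Reason: "ContainerCannotRun", Message: err.Error()}
    }

    func main() {
        err := errors.New(`rpc error: code = 2 desc = failed to start container: Error response from daemon: ...`)
        fmt.Println("STATUS before fix:", waitingBuggy(err).Reason)
        fmt.Println("STATUS after fix: ", waitingFixed(err).Reason)
    }

The raw error is not lost under the fixed behavior; it just moves out of the column that is supposed to hold enumerated states.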

Comment 25 Seth Jennings 2017-06-21 13:59:50 UTC
Kube PR:
https://github.com/kubernetes/kubernetes/pull/47823

Comment 26 Seth Jennings 2017-06-23 03:31:55 UTC
The PR is in flight and needs review/approval upstream. It should make the next release.

Comment 27 Seth Jennings 2017-06-26 18:29:24 UTC
Origin PR:
https://github.com/openshift/origin/pull/14887

Comment 29 DeShuai Ma 2017-07-05 08:51:32 UTC
@Vikas Laad, please help verify the bug. Thanks.

Comment 30 Vikas Laad 2017-07-10 19:52:13 UTC
After running a lot of builds, I see the following status for a few pods:

proj20      cakephp-mysql-example-33-build   0/1       ExitCode:0             0          53m

Also attaching the yaml file for the pod. I will send an email with cluster details in case you want to look at the cluster.

Comment 31 Vikas Laad 2017-07-10 19:53:20 UTC
Created attachment 1295932 [details]
pod yaml where I see ExitCode:0 status

Comment 32 Seth Jennings 2017-07-10 21:58:38 UTC
Moving back to QA.  This is not the same issue.  The original issue was raw CRI error messages being reported in the status.

However, this does seem to be another bug.

Status that results in "Complete":

    state:
      terminated:
        containerID: docker://7fd8bf26c913de11b21e61d59828294a5f2deb23f7812e5e20aec926981b9c0d
        exitCode: 0
        finishedAt: 2017-07-10T17:17:49Z
        reason: Completed
        startedAt: 2017-07-10T17:16:24Z

Status that results in "ExitCode:0"

    state:
      terminated:
        containerID: docker://da46510645225c5d3cd86cc59e08ff232c35b08c15edff18673de27c003b5157
        exitCode: 0
        finishedAt: null
        startedAt: null

The startedAt, finishedAt, and reason are not being populated in some cases.
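
That lines up with how the oc/kubectl printer appears to derive the STATUS text for a terminated container. A paraphrased Go sketch of that fallback (from memory, not the exact upstream printer code):

    package main

    import "fmt"

    // Illustrative stand-in for the terminated-state fields; not the real type.
    type terminated struct {
        Reason   string
        Signal   int32
        ExitCode int32
    }

    // Reason is shown when set; otherwise the printer falls back to the
    // signal or exit code. Paraphrased, not the exact upstream code.
    func statusText(t terminated) string {
        if t.Reason != "" {
            return t.Reason // normal case: "Completed", "Error", ...
        }
        if t.Signal != 0 {
            return fmt.Sprintf("Signal:%d", t.Signal)
        }
        return fmt.Sprintf("ExitCode:%d", t.ExitCode)
    }

    func main() {
        fmt.Println(statusText(terminated{Reason: "Completed"})) // Completed
        fmt.Println(statusText(terminated{}))                    // ExitCode:0
    }

With reason unset, the fallback is the exit code, which is why the second status above prints as "ExitCode:0".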

Please open a separate bug for this.

Comment 33 Vikas Laad 2017-07-11 13:48:01 UTC
Verified in

openshift v3.6.114
kubernetes v1.6.1+5115d708d7
etcd 3.2.0


Completed around 1500 builds with a lot of failures and did not see the error message in the status field. Will open another bug as mentioned in Comment #32.

Comment 34 Seth Jennings 2017-07-14 21:17:30 UTC
*** Bug 1470695 has been marked as a duplicate of this bug. ***

Comment 39 errata-xmlrpc 2017-11-28 21:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188