Description of problem:

oc get pods shows the actual error message in the Status field instead of just showing "Error":

root@ip-172-31-4-211: ~ # oc get pods --all-namespaces -o wide | grep Error
proj6   cakephp-mysql-example-17-build   0/1   rpc error: code = 2 desc = failed to start container "f4b8445c1ef6564ff2e6641a98c59f2cdc83cf714854bf1924d41076a7894b62": Error response from daemon: {"message":"linux runtime spec devices: lstat /dev/block/253:19: no such file or directory"}   0   2m    172.20.2.193   ip-172-31-17-182.us-west-2.compute.internal
proj6   cakephp-mysql-example-3-build    0/1   Error   0   48m   172.20.2.45    ip-172-31-17-182.us-west-2.compute.internal

Version-Release number of selected component (if applicable):

root@ip-172-31-4-211: ~ # oc version
oc v3.6.74
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-4-211.us-west-2.compute.internal:8443
openshift v3.6.74
kubernetes v1.6.1+5115d708d7

How reproducible:
A few failed builds show this, not all of them.

Steps to Reproduce:
1. Start cakephp builds; I am running 30 concurrent builds.
2. Check the Status field for the failed builds.

Actual results:
The Status field shows the actual error message.

Expected results:
The Status field should show "Error", not the actual error message.

Additional info:
Created attachment 1277696 [details] oc describe pod
Created attachment 1277697 [details] oc describe build
I don't think that's right, but I also don't think it's a build issue. We don't set that field; this seems like a pure pod issue.
The kubelet is reporting the error returned from the docker daemon when the container fails to start. The reason is "ContainerCannotRun", and the message seems too useful to omit in this case: "linux runtime spec devices: lstat /dev/block/253:19: no such file or directory". I am moving this to the containers team to understand why or when this message would come back from docker.
I agree it's a useful message, but should it really be reported in the status column like that, rather than in an event or something? I'd expect the status column to contain well-defined enumerated values like "new", "pending", "running", "complete", "error", and "failed".
This has started happening recently; in 3.5 it used to show ContainerCannotRun in the Status column.
Sounds like there are really two bugs that should be opened here:

1) the root cause of the container error
2) the fact that the error is being exposed in the oc get pods output when it should be masked into a general pod state like "ContainerCannotRun"
Please always paste "docker info" output in all the bugs you open. That way we have some idea what version you are working with.

I think this is a runc issue and I suspect it is related to one of the races we had fixed.

https://bugzilla.redhat.com/show_bug.cgi?id=1445134

The fix has been committed to the docker-1.12.6 branch now.

commit ddc16b9c2ad8408ee4557b52abbb55e179a56dfc
Author: Antonio Murdaca <runcom>
Date:   Mon Apr 24 20:01:35 2017 +0200

    Fix BZ#1409076: Race condition in container creation using devicemapper "no such file or directory"

    Signed-off-by: Antonio Murdaca <runcom>

You will need to make sure that your docker contains this fix.
Hi Vivek,

Here is the docker version; please let me know if you want me to run this test with any specific version of docker.

root@ip-172-31-4-211: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64
(In reply to Ben Parees from comment #7)
> sounds like there are really two bugs that should be open here.
>
> 1) the root cause of the container error
> 2) the fact that the error is being exposed in the oc get pods output when
> it should masked into a general pod state like "ContainerCannotRun"

Hi Ben,

I already have a bug open for the root cause:
https://bugzilla.redhat.com/show_bug.cgi?id=1436731
Thanks, then I think this bug should go back to the kubernetes component to determine why the error is showing up in the get pods output.
Vikas, I don't know which docker version has the fix. Just upgrade to the latest available.

Antonio, do you know which docker version has this fix?
It's available from build "-18":

commit e29c0400453c4e759d9892943280cd5e198864df
Author: Lokesh Mandvekar <lsm5>
Date:   Mon Apr 24 14:55:10 2017 -0400

    docker-2:1.12.6-18.git29d6f69

    - built docker @projectatomic/docker-1.12.6 commit 29d6f69
    - built docker-novolume-plugin commit 385ec70
    - built rhel-push-plugin commit 70653ed
    - built docker-lvm-plugin commit 8647404
    - built docker-runc @projectatomic/docker-1.12.6 commit 81b2542
    - built docker-containerd @projectatomic/docker-1.12.6 commit f3f35e9

    Signed-off-by: Lokesh Mandvekar <lsm5>
I will run the tests again with the *-18 version. I had to stop testing on this version of docker in the last run because of the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1449297
I am still seeing this issue in the following version:

openshift v3.6.106
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

root@ip-172-31-22-163: ~/svt/openshift_performance/ci/scripts # oc get pods -n proj21
NAME                             READY     STATUS                                                  RESTARTS   AGE
cakephp-mysql-example-22-build   0/1       Completed                                               0          12m
cakephp-mysql-example-23-build   0/1       Completed                                               0          10m
cakephp-mysql-example-24-build   0/1       rpc error: code = 4 desc = context deadline exceeded    0          8m
cakephp-mysql-example-3-build    0/1       Completed                                               0          51m

Also attaching describe pod output.
Created attachment 1287371 [details] describe pod
Vikas,

How do you know it's the same error? I don't see any "lstat" errors as we were seeing previously. I think this is a different issue.
Hi Vivek,

I think there is some confusion. There are two problems, as Ben noted in comment 7:

1. The root cause of the problem, for which we have bug https://bugzilla.redhat.com/show_bug.cgi?id=1436731
2. The error message showing in the pod list instead of the status of the pod.

We are tracking the second problem in this bug; if you are talking about the actual error, it's tracked in the above-mentioned bug.
Vikas,

Ok, if you are not tracking the "lstat" error, then it's some other problem. I am not sure why it should be assigned to me.
Ok, if the complaint is that the Status field should just show "Error" and not the actual error message, then it is an OpenShift issue and should be assigned to an OpenShift engineer.
It's a k8s issue; moving to the k8s component.
John, what information do you need from me?
Looking at the pod description:

    State:          Waiting
      Reason:       rpc error: code = 2 desc = failed to start container "f4b8445c1ef6564ff2e6641a98c59f2cdc83cf714854bf1924d41076a7894b62": Error response from daemon: {"message":"linux runtime spec devices: lstat /dev/block/253:19: no such file or directory"}
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      linux runtime spec devices: lstat /dev/block/253:19: no such file or directory
      Exit Code:    128
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Wed, 10 May 2017 15:33:18 -0400

The reason in the Waiting state is the raw CRI error. Looking at the code, the reason should only ever be ContainerCreating, ErrImagePull, ImagePullBackOff, or Blocked while in the Waiting state. Init containers can also have the reason PodInitializing in the Waiting state.

I think this is where the raw error message is getting into the status:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1378
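To illustrate the difference, here is a minimal, self-contained Go sketch of the behavior being described. The types and function names are simplified stand-ins (not the actual kubelet code or the real k8s.io/api types): it only shows how copying a raw sync error into the waiting Reason leaks it into the STATUS column, versus masking it behind a well-known reason and keeping the detail in Message.

```go
package main

import "fmt"

// Simplified stand-in for the container waiting state; the real type is
// v1.ContainerStateWaiting in k8s.io/api/core/v1. Sketch only.
type containerStateWaiting struct {
	Reason  string // surfaced in the STATUS column of `oc get pods`
	Message string // only surfaced by `oc describe pod`
}

// buggyWaitingState mirrors the reported behavior: the raw CRI/docker error
// string is copied straight into Reason, so it ends up in the STATUS column.
func buggyWaitingState(syncErr error) containerStateWaiting {
	return containerStateWaiting{
		Reason:  syncErr.Error(),
		Message: syncErr.Error(),
	}
}

// maskedWaitingState is the behavior the fix aims for: an enumerable reason
// in Reason, with the detailed error preserved in Message.
func maskedWaitingState(syncErr error) containerStateWaiting {
	return containerStateWaiting{
		Reason:  "ContainerCannotRun",
		Message: syncErr.Error(),
	}
}

func main() {
	err := fmt.Errorf(`rpc error: code = 2 desc = failed to start container: ` +
		`linux runtime spec devices: lstat /dev/block/253:19: no such file or directory`)
	fmt.Printf("buggy reason:  %q\n", buggyWaitingState(err).Reason)
	fmt.Printf("masked reason: %q\n", maskedWaitingState(err).Reason)
}
```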
Kube PR: https://github.com/kubernetes/kubernetes/pull/47823
PR is in flight. Needs review/approval upstream. Should make next release.
Origin PR: https://github.com/openshift/origin/pull/14887
@Vikas Laad, please help verify the bug, thanks.
After running a lot of builds I see the following status for a few pods:

proj20   cakephp-mysql-example-33-build   0/1   ExitCode:0   0   53m

Also attaching the yaml file for the pod. I will send an email with cluster details in case you want to look at the cluster.
Created attachment 1295932 [details] pod yaml where i see ExitCode:0 status
Moving back to QA. This is not the same issue. The original issue was raw CRI error messages being reported in the status. However, this does seem to be another bug.

Status that results in "Complete":

    state:
      terminated:
        containerID: docker://7fd8bf26c913de11b21e61d59828294a5f2deb23f7812e5e20aec926981b9c0d
        exitCode: 0
        finishedAt: 2017-07-10T17:17:49Z
        reason: Completed
        startedAt: 2017-07-10T17:16:24Z

Status that results in "ExitCode:0":

    state:
      terminated:
        containerID: docker://da46510645225c5d3cd86cc59e08ff232c35b08c15edff18673de27c003b5157
        exitCode: 0
        finishedAt: null
        startedAt: null

The startedAt, finishedAt, and reason are not being populated in some cases. Please open a separate bug for this.
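For context, here is a rough Go sketch of the kind of fallback the pod status printer applies when the terminated reason is empty. The types and function are simplified placeholders, not the actual kubectl source: with Reason unset and no signal, the column falls back to "ExitCode:<n>", which is consistent with these pods showing "ExitCode:0".

```go
package main

import "fmt"

// Simplified view of a terminated container state; the real type is
// v1.ContainerStateTerminated in k8s.io/api/core/v1. Sketch only.
type terminated struct {
	Reason   string
	Signal   int32
	ExitCode int32
}

// statusColumn sketches the fallback: prefer a populated Reason, then a
// signal, and only then print the bare exit code.
func statusColumn(t terminated) string {
	switch {
	case t.Reason != "":
		return t.Reason
	case t.Signal != 0:
		return fmt.Sprintf("Signal:%d", t.Signal)
	default:
		return fmt.Sprintf("ExitCode:%d", t.ExitCode)
	}
}

func main() {
	fmt.Println(statusColumn(terminated{Reason: "Completed", ExitCode: 0})) // Completed
	fmt.Println(statusColumn(terminated{ExitCode: 0}))                      // ExitCode:0
}
```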
Verified in

openshift v3.6.114
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Completed around 1500 builds with a lot of failures and did not see the error message in the Status field. Will open another bug as mentioned in Comment #32.
*** Bug 1470695 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188