Description of problem:
Failed to "StartContainer" for "sti-build" with RunContainerError:
stat /usr/bin/openshift-sti-build: no such file or directory

Version-Release number of selected component (if applicable):
v3.4.1.24

How reproducible:
Not on every node and not for every image.

Steps to Reproduce:
1. Start a build, or use "docker run", on one of the faulty node/image combinations.

Actual results:
Builds fail with:

Jun 8 18:52:56 shift-node-7kgh00o8 atomic-openshift-node: E0608 18:52:56.327484 17907 pod_workers.go:184] Error syncing pod e7dee8f4-4c6a-11e7-a1ea-fa163e52ca24, skipping: failed to "StartContainer" for "sti-build" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"exec: \\\\\\\\\\\\\\\"/usr/bin/openshift-sti-build\\\\\\\\\\\\\\\": stat /usr/bin/openshift-sti-build: no such file or directory\\\\\\\"\\\\n\\\"\"}"

And docker run with:

+ docker run -it openshift3/ose-deployer:v3.4.1.24
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"exec: \\\"/usr/bin/openshift-deploy\\\": stat /usr/bin/openshift-deploy: no such file or directory\"\n".

Expected results:
Builds finish and docker run works.

Additional info:
We already checked:
- package versions are the same on faulty and working nodes
- SELinux context
- the docker process is running with the same parameters
- docker inspect works and shows all layers
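For the record, the comparison between a faulty and a working node was along these lines (a rough sketch of the commands; exact package names may vary per install):

# Same package versions on faulty and working nodes:
rpm -q docker atomic-openshift-node
# Same SELinux context on docker's storage:
ls -dZ /var/lib/docker /var/lib/docker/devicemapper
# Same daemon parameters:
systemctl show --property=ExecStart docker
# docker inspect works and shows all layers:
docker inspect openshift3/ose-deployer:v3.4.1.24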
More examples of failed runs:

container_linux.go:247: starting container process caused "exec: \"/usr/bin/openshift-deploy\": stat /usr/bin/openshift-deploy: no such file or directory"

/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"exec: \\\"/usr/bin/openshift-deploy\\\": stat /usr/bin/openshift-deploy: no such file or directory\"\n".

The error does not happen with every tag: for instance, docker run works with openshift3/ose-deployer:latest but not with openshift3/ose-deployer:v3.4.1.24.

It also happens with non-s2i images:

[root@shift-node-ww5m32us ~]# docker run -ti --rm rhscl/mysql-56-rhel7
/usr/bin/run-mysqld: line 3: /usr/share/container-scripts/mysql/common.sh: No such file or directory
/usr/bin/run-mysqld: line 6: export_setting_variables: command not found

If I try to dump the image I get an error as well:

[root@shift-node-ww5m32us cloud-user]# docker save registry.access.redhat.com/rhscl/mysql-56-rhel7:5.6 -o mysql-5.6-rhel7.tar
Error response from daemon: open /var/lib/docker/devicemapper/mnt/f9c3760aac67f0e8dc79a696a67692340eab0bc1f5b8c3dccb0e6757f5d7a79e/rootfs/usr/share/container-scripts/mysql/README.md: no such file or directory
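One additional check that could help narrow this down (a sketch; it only assumes the mysql image from the report): docker export dumps a container's flattened filesystem, so listing it shows whether the "missing" file is really absent from the assembled rootfs or only fails at exec/save time:

cid=$(docker create rhscl/mysql-56-rhel7:5.6)
docker export "$cid" | tar -tf - | grep container-scripts/mysql/common.sh \
    || echo "common.sh missing from assembled rootfs"
docker rm "$cid"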
I am not surprised; this looks very strange. I would like to see if we can recreate this on a newer version of docker.

For the containers this happens on, can you execute /bin/sh as the entrypoint, or does that fail as well?
(In reply to Daniel Walsh from comment #11)
> I am not surprised; this looks very strange. I would like to see if we can
> recreate this on a newer version of docker.
>
> For the containers this happens on, can you execute /bin/sh as the
> entrypoint, or does that fail as well?

This is what the customer reported:

--- docker run -it rhscl/mysql-56-rhel7:5.6 ---
the input device is not a TTY
--- /docker run -it rhscl/mysql-56-rhel7:5.6 ---

--- docker save registry.access.redhat.com/rhscl/mysql-56-rhel7:5.6 -o mysql-5.6-rhel7.tar ---
Error response from daemon: open /var/lib/docker/devicemapper/mnt/f9c3760aac67f0e8dc79a696a67692340eab0bc1f5b8c3dccb0e6757f5d7a79e/rootfs/usr/share/container-scripts/mysql/README.md: no such file or directory
--- /docker save registry.access.redhat.com/rhscl/mysql-56-rhel7:5.6 -o mysql-5.6-rhel7.tar ---

--- docker run -it --entrypoint /bin/sh rhscl/mysql-56-rhel7:5.6 -c 'ls -ltr /usr/share/container-scripts/mysql/common.sh' ---
the input device is not a TTY
--- /docker run -it --entrypoint /bin/sh rhscl/mysql-56-rhel7:5.6 -c 'ls -ltr /usr/share/container-scripts/mysql/common.sh' ---

--- docker run -it --entrypoint /bin/sh rhscl/mysql-56-rhel7:5.6 -c 'ls -ltr /usr/share/ | grep container-scripts' ---
the input device is not a TTY
--- /docker run -it --entrypoint /bin/sh rhscl/mysql-56-rhel7:5.6 -c 'ls -ltr /usr/share/ | grep container-scripts' ---
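Note that every one of those runs failed with "the input device is not a TTY", so none of the ls output was actually captured; the commands appear to have been run without a terminal attached (e.g. from a script or a non-interactive session). Dropping -t should make them usable, e.g. (same image and paths as above):

docker run -i --rm --entrypoint /bin/sh rhscl/mysql-56-rhel7:5.6 \
    -c 'ls -ltr /usr/share/container-scripts/mysql/common.sh'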
Seems like something very bad has happened to the device. Vivek, have you seen something like this?
I know that OpenShift used to do force removals. Could this be one that failed and then left the image/container in a weird state?
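If that is what happened, comparing device-mapper's view of the thin devices with docker's own bookkeeping on a faulty node might show leftovers (a generic sketch, nothing specific to this report):

# Thin devices as device-mapper sees them
dmsetup ls | grep docker
# docker's storage-driver report (pool name, data/metadata usage, udev sync)
docker info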
Jon, this looks like it is not a container runtime problem but a problem with the actual container image, which does not have the proper content.
The image looks fine to me (v3.4.1.44, as referenced in comment 17):

$ docker run --entrypoint=/bin/sh -it registry.access.redhat.com/openshift3/ose-sti-builder:v3.4.1.44
# ls -l /usr/bin/openshift-sti-build
lrwxrwxrwx. 1 root root 9 Jun 29 13:53 /usr/bin/openshift-sti-build -> openshift
# ls -l /usr/bin/openshift
-rwxr-xr-x. 1 root root 213359736 Jun 23 18:47 /usr/bin/openshift

$ docker run -it registry.access.redhat.com/openshift3/ose-sti-builder:v3.4.1.44
The Build "" is invalid:
* metadata.name: Required value: name or generateName is required
* metadata.namespace: Required value
* spec.strategy: Invalid value: {"DockerStrategy":null,"SourceStrategy":null,"CustomStrategy":null,"JenkinsPipelineStrategy":null}: must provide a value for exactly one of sourceStrategy, customStrategy, dockerStrategy, or jenkinsPipelineStrategy

Note that the original bug report described problems with two different images.

Builds fail with:

Jun 8 18:52:56 shift-node-7kgh00o8 atomic-openshift-node: E0608 18:52:56.327484 17907 pod_workers.go:184] Error syncing pod e7dee8f4-4c6a-11e7-a1ea-fa163e52ca24, skipping: failed to "StartContainer" for "sti-build" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"exec: \\\\\\\\\\\\\\\"/usr/bin/openshift-sti-build\\\\\\\\\\\\\\\": stat /usr/bin/openshift-sti-build: no such file or directory\\\\\\\"\\\\n\\\"\"}"

And docker run with:

+ docker run -it openshift3/ose-deployer:v3.4.1.24
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"exec: \\\"/usr/bin/openshift-deploy\\\": stat /usr/bin/openshift-deploy: no such file or directory\"\n".

Neither image could run its binary, and things worked on some nodes but not others (presumably with the same images). So this seems much more likely to be an issue with either the container runtime or bad images somehow being pulled to some nodes but not others. Either way, that looks like a container issue to me.
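One way to test the "bad image pulled to some nodes" theory would be to compare the image ID and registry digest for the same tag on a working node and on a failing node; a mismatch would point at divergent content rather than the runtime (a sketch):

docker inspect --format '{{.Id}} {{.RepoDigests}}' openshift3/ose-deployer:v3.4.1.24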
Still looking for an update and a potential workaround for this issue.
Ben, no one knows what is going on here. :^(
Benjamin - I had some initial trouble setting up my test system that was not at all related to this issue. After having done so, I ran a number of tests and was not able to recreate the issue that Rackspace is seeing.

If you could, can you grab the docker and RHEL versions from the Rackspace system please? Is it at all possible to upgrade docker/RHEL there to the latest and greatest? I know there were some changes on both sides that may or may not have helped this situation.

FWIW, my versions are:

# uname -a
Linux rhelbz.localdomain 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
# docker -v
Docker version 1.12.6, build 85d7426/1.12.6

Also, if possible, when this error occurs there, could you grab the portion of the journalctl from a few minutes before to a minute or so after?

From looking at the issues and browsing through everything, I think something might be corrupted down in their /var/lib/docker directories. Vivek - would it make sense to recommend deleting that directory structure, based on the errors in these reports?
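For reference, the journal capture and the wipe I am asking Vivek about would look roughly like this (the wipe is destructive: it removes every local image and container, so it is a last resort):

# Capture docker's journal around a failure
journalctl -u docker --since "10 min ago" > docker-journal.log

# Nuke and re-create docker's local storage (destroys ALL local images/containers)
systemctl stop atomic-openshift-node docker
rm -rf /var/lib/docker
systemctl start docker atomic-openshift-node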
Please re-open if the issue is reproduced.
Their version of OpenShift:

# rpm -qa | grep openshift
atomic-openshift-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
atomic-openshift-sdn-ovs-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
tuned-profiles-atomic-openshift-node-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
atomic-openshift-clients-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
atomic-openshift-node-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
atomic-openshift-excluder-3.5.5.31-1.git.0.b6f55a2.el7.noarch
atomic-openshift-docker-excluder-3.5.5.31-1.git.0.b6f55a2.el7.noarch

If you think upgrading the docker version will help, which version should I upgrade to for the OpenShift version above?
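To see which docker builds are actually available before picking one (a sketch; the result depends on the attached repos/subscription, and on OpenShift nodes the docker excluder may need lifting first):

atomic-openshift-docker-excluder unexclude
yum list docker --showduplicates
# then, for the chosen build:
yum update docker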
Yes, I would get them upgraded.
We're tracking latest issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1518519
Verified this with:

docker run -it openshift3/ose-deployer:v3.9.0-0.20.0

openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-68.gitec8512b.el7.x86_64
 Go version:      go1.8.3
 Git commit:      ec8512b/1.12.6
 Built:           Thu Nov 16 15:19:17 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-68.gitec8512b.el7.x86_64
 Go version:      go1.8.3
 Git commit:      ec8512b/1.12.6
 Built:           Thu Nov 16 15:19:17 2017
 OS/Arch:         linux/amd64