Description of problem:
Running 'oadm diagnostics NetworkCheck' on a containerized install env shows the error:
chroot: can't execute 'openshift': No such file or directory

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.24+52fd77b
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with a containerized install
2. Run 'oadm diagnostics NetworkCheck'

Actual results:
# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'
[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
Info:  Output from the network diagnostic pod on node "qe-zzhao-34-acc-node-registry-router-1":
       chroot: can't execute 'openshift': No such file or directory
[Note] Summary of diagnostics execution (version v3.4.0.24+52fd77b):
[Note] Completed with no errors or warnings seen.

Expected results:
No error.

Additional info:
This error only happens on a containerized install env.
There is no openshift binary present in the node role after installing the OCP env with openshift-ansible. Maybe we can ask the productization team to fix this the same way they did for the master role.
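For context, a minimal sketch of the failure mode (the /host mount path and the command shown are assumptions for illustration, not the actual pod spec):

# Run from inside the busybox-based diagnostic pod; the host filesystem is
# assumed to be mounted at /host.
chroot /host openshift version
# On a containerized install there is no openshift binary on the host, so
# busybox chroot reports: can't execute 'openshift': No such file or directory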
Ravi: Can we bundle the openshift binary in the debug pod somehow?
Ben: We intentionally did not bundle the openshift binary in the debug pod. The OpenShift image is 500MB+ and pulling it on every node is expensive; instead we want to use the busybox image, which is only ~1MB, and rely on the openshift binary already on the node.
The OpenShift image is 500M but the binary is smaller... Not sure how much smaller...
(In reply to Eric Paris from comment #4)
> The OpenShift image is 500M but the binary is smaller... Not sure how much smaller...

The openshift binary is 200M+ and the oc binary is 100M+ for now, and both keep increasing. That is still too big for a debug pod IMO.
Fixed in https://github.com/openshift/openshift-ansible/pull/2804
Tried to add an option to the diagnostics cmd that pulls the openshift image for network diagnostics instead of reusing the existing binary on the node. Code branch: https://github.com/openshift/origin/compare/master...pravisankar:network-diag-pull-openshift-bin?expand=1

While testing the code, I realized that pulling the openshift/origin image is not sufficient. We need an image that has the openshift, docker, tc, ip, brctl, ovs-ctl, etc. binaries. I made it work by hacking a little, but we would need to maintain a new image that has all these required binaries. As of today, the openshift image is 540M and even the openshift binary alone is 252M. Network diagnostics with this pull-openshift-image option is significantly slower than the current method (machine has 16G RAM and 32 cpu cores). I don't see many issues with our current approach, so I prefer to trash this code.
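For the record, roughly what maintaining such an image would look like; this is a hypothetical sketch only, and the base image, package names, tag, and availability of the repos are all assumptions:

cat > Dockerfile.network-debug <<'EOF'
FROM openshift/origin
# iproute provides ip/tc, bridge-utils provides brctl; the openvswitch and
# docker client binaries would also have to come from somewhere (repos assumed).
RUN yum install -y iproute bridge-utils openvswitch docker && yum clean all
EOF
docker build -t openshift/origin-network-debug -f Dockerfile.network-debug .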
Ok, not maintaining that image seems reasonable for now. Let's hold the code in case we need it later, but for now, let's not do anything with it. Thanks! Does it make sense to add a way to pass a custom path to the debug pod?
If we don't get this fixed for 3.4 then we need to release note that containerized install has trouble with this.
Fixed in https://github.com/openshift/origin/pull/12127
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/d4744a9bb6773509f99864f7637318925799b486
Bug 1393716 - Fix network diagnostics on containerized openshift install

Added openshift/diagnostics-deployer image: it contains the openshift-network-debug script, which handles both containerized and non-containerized openshift environments and initiates network diagnostics on the node.
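Not the actual script, but a rough sketch of the branching such a wrapper needs; the /host mount path, the node container name, and the invoked command are assumptions:

#!/bin/sh
if chroot /host which openshift >/dev/null 2>&1; then
    # RPM install: the binary exists on the host, run it via chroot as before.
    chroot /host openshift "$@"
else
    # Containerized install: run the binary inside the node container instead.
    chroot /host docker exec origin-node openshift "$@"
fi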
This has been merged into ocp and is in OCP v3.5.0.19 or newer.
Verified this bug on v3.5.0.19; it has been fixed.
@zhaozhanqi, @bomeng, can you please reverify this bug once this PR https://github.com/openshift/origin/pull/12982 is merged?
Assigned this bug since the PR https://github.com/openshift/origin/pull/12982 has not been merged yet.
Tested this issue on v3.5.0.35 on an atomic host env; the network diagnostics pod cannot reach Running.

# openshift version
openshift v3.5.0.35
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.3

QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From  SubObjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  8s  8s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Normal  Pulled  Container image "openshift3/ose" already present on machine
  8s  8s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Normal  Created  Created container with docker id 528da7892463; Security:[seccomp=unconfined]
  7s  7s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}  Warning  Failed  Failed to start container with docker id 528da7892463 with error: Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  7s  7s  1  {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}    Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-kbst5" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:359: container init caused \\\\\\\\\\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"
This has proven to be hard to reproduce. Weibin was going to try to reproduce.
Ravi is accessing and debugging the issue on: host-8-174-253.host.centralci.eng.rdu2.redhat.com
Flagged for the upcoming release since this is not a regression.
'oadm diagnostics NetworkCheck' output was not helpful and there are no relevant logs on the node, as the test namespaces/pods/services were deleted after the operation. Created custom openshift/origin and openshift/node images with debug info, and the test run showed it was failing due to https://bugzilla.redhat.com/show_bug.cgi?id=1439142. I will revisit this issue once we tackle bug#1439142.
https://bugzilla.redhat.com/show_bug.cgi?id=1439142 has merged... any update on this bug? Thanks
*** Bug 1421643 has been marked as a duplicate of this bug. ***
@zhaozhanqi Can you check whether this diagnostics issue can still be reproduced with SELinux disabled on the node (setenforce 0)?
@Ravi Sankar With setenforce 0 set on the node, the error in comment 17 is NOT seen, but the pod container still cannot reach Running and fails with:

# docker logs 0bef11792cb4
error: --deployment or OPENSHIFT_DEPLOYMENT_NAME is required

FYI, it seems the latest 3.6 atomic host containerized install always fails on my side; I will research this later. I am just using the 3.5 version here.
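For anyone retesting, a quick way to confirm whether SELinux is what blocks the secret mount (standard commands, run directly on the node):

getenforce                   # Enforcing / Permissive / Disabled
ausearch -m avc -ts recent   # look for AVC denials around the failed mount
setenforce 0                 # temporarily switch to permissive for the retest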
We are trying to solve two issues for this bug:
(1) For containerized openshift installs, expecting a specific node image name format is fragile; that is what caused the '--deployment <name> is required' error.
(2) We would like to remove the hack where we call docker directly to fetch the node container id and pid. After discussion with the container team (mrunalp), this will be rewritten using runc so that it works with other container runtimes (CRI-O, docker, etc.); see the sketch below.
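A sketch of the two approaches for item (2); the container IDs are placeholders and the runc state root directory depends on the engine, so treat this as illustrative only:

# Current hack (docker-specific): ask the docker daemon for the node container's pid.
docker inspect --format '{{.State.Pid}}' <node-container-id>

# Runtime-agnostic alternative: read the OCI runtime state; the JSON that runc
# prints includes the container's init pid, whether the engine on top is docker or CRI-O.
runc state <container-id>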
@mrunalp will you work with Ravi on this please?
*** Bug 1501797 has been marked as a duplicate of this bug. ***
CRI changes are too risky for 3.7.
In 3.10 this is no longer supported.