Description of problem:
'oadm diagnostics NetworkCheck' times out because pulling the image 'openshift/diagnostics-deployer' fails.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.0.19+199197c
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. Set up a multi-node env.
2. Run 'oadm diagnostics NetworkCheck'.
3. Monitor the pods in another terminal.

Actual results:

Step 2:
# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

ERROR: [DNet2007 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:137]
       timed out waiting for the condition

[Note] Summary of diagnostics execution (version v3.5.0.19+199197c):
[Note] Errors seen: 1

Step 3: found that the pull of the 'openshift/diagnostics-deployer' image failed:
# oc describe pod network-diag-pod-jkb8q -n network-diag-ns-46kph
Name:            network-diag-pod-jkb8q
Namespace:       network-diag-ns-46kph
Security Policy: privileged
Node:            ip-172-18-2-17.ec2.internal/172.18.2.17
Start Time:      Mon, 13 Feb 2017 05:06:36 -0500
Labels:          <none>
Status:          Pending
IP:              172.18.2.17
Controllers:     <none>
Containers:
  network-diag-pod-jkb8q:
    Container ID:
    Image:         openshift/diagnostics-deployer
    Image ID:
    Port:
    Command:
      sh
      -c
      openshift-network-debug /host openshift infra network-diagnostic-pod -l 1
    State:         Waiting
      Reason:      ImagePullBackOff
    Ready:         False
    Restart Count: 0
    Volume Mounts:
      /host from host-root-dir (rw)
      /host/secrets from kconfig-secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-h8wg1 (ro)
    Environment Variables:
      KUBECONFIG: /secrets/kubeconfig
Conditions:
  Type          Status
  Initialized   True
  Ready         False
Volumes:
  host-root-dir:
    Type: HostPath (bare host directory volume)
    Path: /
  kconfig-secret:
    Type:       Secret (a volume populated by a Secret)
    SecretName: network-diag-secret
  default-token-h8wg1:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-h8wg1
QoS Class:   BestEffort
Tolerations: <none>
Events:
  FirstSeen  LastSeen  Count  From                                   SubObjectPath                             Type     Reason      Message
  ---------  --------  -----  ----                                   -------------                             -------- ------      -------
  40s        15s       2      {kubelet ip-172-18-2-17.ec2.internal}  spec.containers{network-diag-pod-jkb8q}   Normal   BackOff     Back-off pulling image "openshift/diagnostics-deployer"
  40s        15s       2      {kubelet ip-172-18-2-17.ec2.internal}                                            Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-jkb8q" with ImagePullBackOff: "Back-off pulling image \"openshift/diagnostics-deployer\""
  42s        2s        3      {kubelet ip-172-18-2-17.ec2.internal}  spec.containers{network-diag-pod-jkb8q}   Normal   Pulling     pulling image "openshift/diagnostics-deployer"
  41s        1s        3      {kubelet ip-172-18-2-17.ec2.internal}  spec.containers{network-diag-pod-jkb8q}   Warning  Failed      Failed to pull image "openshift/diagnostics-deployer": unauthorized: authentication required
  41s        1s        3      {kubelet ip-172-18-2-17.ec2.internal}                                            Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-jkb8q" with ErrImagePull: "unauthorized: authentication required"

Expected results:
'oadm diagnostics NetworkCheck' works without errors.

Additional info:
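For context on the DNet2007 error: "timed out waiting for the condition" is the generic error the Kubernetes wait/poll helper returns when a condition function never succeeds before its deadline; here the diagnostic pod never leaves Pending because the image pull keeps failing. A minimal, self-contained sketch of that mechanism (the condition function and durations are illustrative, not the actual run_pod.go code, and the import path matches current Kubernetes libraries rather than the 1.5-era vendoring):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// podIsRunning stands in for the check the diagnostic makes against the
// API server; while the image pull keeps failing the pod never leaves
// Pending, so the condition never becomes true.
func podIsRunning() (bool, error) {
	return false, nil
}

func main() {
	// When the condition never succeeds before the deadline, wait.Poll
	// returns wait.ErrWaitTimeout, whose message is exactly the
	// "timed out waiting for the condition" seen in the output above.
	err := wait.Poll(500*time.Millisecond, 3*time.Second, podIsRunning)
	fmt.Println(err) // timed out waiting for the condition
}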
The image isn't available yet. But for OpenShift Container Platform, it would need to be openshift3/ose-diagnostics-deployer.
Fixed in https://github.com/openshift/origin/pull/12982
Ravi, because of the 75 bazillion flakes we need this fixed in origin/master and origin/release-1.5. Can you do that backport?
PR on origin/release-1.5: https://github.com/openshift/origin/pull/13062
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/b640a0e55fdebb6e2b3eb52fc404ecfee16ead98
Bug 1421643 - Use existing openshift/origin image instead of new openshift/diagnostics-deployer

Any new image like 'openshift/diagnostics-deployer' incurs build/lifecycle costs to maintain, and the diagnostics-deployer image contains only a small block of shell code. To avoid this, the script is now embedded in the pod definition and openshift/origin is used as the diagnostics deployer image. On dev machines openshift/origin is currently close to 800MB, but we expect the size to be under 200MB when it is released (compressed, debug headers removed).
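To illustrate the approach the commit describes, here is a minimal sketch of a pod definition that reuses the existing openshift/origin image and inlines the script into the container command, so no extra image has to be built and maintained. The helper name and script body are illustrative (the command string is taken from the pod description in the original report), not the actual origin source:

package main

import (
	"fmt"

	kapi "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// networkDiagScript stands in for the small shell block that used to ship
// in the separate openshift/diagnostics-deployer image.
const networkDiagScript = "openshift-network-debug /host openshift infra network-diagnostic-pod -l 1"

// networkDiagPod builds a pod that reuses the existing openshift/origin
// image and embeds the diagnostic script directly in the container command.
func networkDiagPod(name string) *kapi.Pod {
	return &kapi.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: kapi.PodSpec{
			RestartPolicy: kapi.RestartPolicyNever,
			Containers: []kapi.Container{{
				Name:    name,
				Image:   "openshift/origin",
				Command: []string{"sh", "-c", networkDiagScript},
			}},
		},
	}
}

func main() {
	pod := networkDiagPod("network-diag-pod-example")
	fmt.Printf("%s runs %q\n", pod.Spec.Containers[0].Image, pod.Spec.Containers[0].Command)
}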
It still does not seem to work on an Atomic Host env:

# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.3

# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

Info:  Output from the network diagnostic pod on node "host-8-175-189.host.centralci.eng.rdu2.redhat.com":
Info:  Output from the network diagnostic pod on node "host-8-174-76.host.centralci.eng.rdu2.redhat.com":

[Note] Summary of diagnostics execution (version v3.5.0.35):
[Note] Completed with no errors or warnings seen.

***************************

Checking the logs shows that some pods still cannot reach the Running state:

Events:
  FirstSeen  LastSeen  Count  From                                                          SubObjectPath                             Type     Reason      Message
  ---------  --------  -----  ----                                                          -------------                             -------- ------      -------
  8s         8s        1      {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}   Normal   Pulled      Container image "openshift3/ose" already present on machine
  8s         8s        1      {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}   Normal   Created     Created container with docker id 528da7892463; Security:[seccomp=unconfined]
  7s         7s        1      {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}  spec.containers{network-diag-pod-kbst5}   Warning  Failed      Failed to start container with docker id 528da7892463 with error: Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  7s         7s        1      {kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}                                            Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-kbst5" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:359: container init caused \\\\\\\\\\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"
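This time the failing mkdir points at the volume layout rather than the image: per the pod description in the original report, the kconfig-secret volume is mounted at /host/secrets, i.e. nested inside the hostPath mount of the host's / at /host, so the runtime has to create host/secrets under the container rootfs, and that mkdir is what gets denied on Atomic Host. A small sketch reconstructing those two mounts (field values copied from the describe output above; this is an illustration, not the origin source):

package main

import (
	"fmt"

	kapi "k8s.io/api/core/v1"
)

// diagVolumeMounts mirrors the mount layout from the pod description:
// the secret mount target lives *inside* the hostPath mount, which is
// what forces the mkdir that fails in the event log above.
func diagVolumeMounts() []kapi.VolumeMount {
	return []kapi.VolumeMount{
		{Name: "host-root-dir", MountPath: "/host"},                          // hostPath: /
		{Name: "kconfig-secret", MountPath: "/host/secrets", ReadOnly: true}, // secret nested under it
	}
}

func main() {
	for _, m := range diagVolumeMounts() {
		fmt.Printf("%s -> %s (ro=%v)\n", m.Name, m.MountPath, m.ReadOnly)
	}
}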
Ravi is working on this to make the code more robust when only some nodes manage to pull the image.
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4f5f8a6ae45b19e7c4c00cee2025bcf329d77b34
Bug 1421643 - Fix network diagnostics timeouts

waitForNetworkPod() is called from a few places and had a fixed timeout of 82 seconds, which was insufficient in cases where network bandwidth is low or network latency is high. This change makes waitForNetworkPod() take a custom timeout value based on the operation performed.
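A minimal sketch of the shape of that change, using the same wait helper shown after the original report; the signature and durations are illustrative, not the actual origin code:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForNetworkPod sketches the fix: the caller supplies a timeout sized
// to the operation (e.g. a longer budget for pod startup on a slow link)
// instead of relying on one fixed 82-second deadline.
func waitForNetworkPod(timeout time.Duration, podReady wait.ConditionFunc) error {
	return wait.Poll(2*time.Second, timeout, podReady)
}

func main() {
	// Toy condition that never succeeds, to exercise the timeout path.
	err := waitForNetworkPod(4*time.Second, func() (bool, error) {
		return false, nil
	})
	fmt.Println(err) // timed out waiting for the condition
}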
Checked this issue on openshift v3.6.94 in an Atomic Host env; still hit the error when describing the pod (`oc describe pod network-diag-pod-rwr53 -n network-diag-ns-w06vl`):

Events:
  FirstSeen  LastSeen  Count  From                                    SubObjectPath                             Type     Reason      Message
  ---------  --------  -----  ----                                    -------------                             -------- ------      -------
  27s        27s       1      kubelet, ip-172-18-7-114.ec2.internal  spec.containers{network-diag-pod-rwr53}   Normal   Pulled      Container image "registry.access.redhat.com/openshift3/ose" already present on machine
  27s        27s       1      kubelet, ip-172-18-7-114.ec2.internal  spec.containers{network-diag-pod-rwr53}   Normal   Created     Created container with id 74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431
  26s        26s       1      kubelet, ip-172-18-7-114.ec2.internal  spec.containers{network-diag-pod-rwr53}   Warning  Failed      Failed to start container with id 74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431 with error: rpc error: code = 2 desc = failed to start container "74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/9eb87cf1-49b9-11e7-a4f6-0e2345259696/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  26s        26s       1      kubelet, ip-172-18-7-114.ec2.internal                                            Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-rwr53" with rpc error: code = 2 desc = failed to start container "74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/9eb87cf1-49b9-11e7-a4f6-0e2345259696/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}: "Start Container Failed"
*** Bug 1439142 has been marked as a duplicate of this bug. ***
The original timeout issue is fixed. The current issue in comment #10 is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1393716 (comment #17). Closing this bug, as the current issue already has an open bug.
*** This bug has been marked as a duplicate of bug 1393716 ***
As comment 12 said, the original timeout issue is fixed, so in fact this is not a duplicate bug. Marking this as 'verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716