Bug 1534513 - 'oc adm diagnostics NetworkCheck' does not work
Summary: 'oc adm diagnostics NetworkCheck' does not work
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.9.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Luke Meyer
QA Contact: Xingxing Xia
URL:
Whiteboard:
Duplicates: 1537478
Depends On:
Blocks:
 
Reported: 2018-01-15 12:10 UTC by zhaozhanqi
Modified: 2018-09-11 18:26 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-11 18:26:31 UTC
Target Upstream Version:
Embargoed:



Description zhaozhanqi 2018-01-15 12:10:58 UTC
Description of problem:
Running `oc adm diagnostics NetworkCheck` fails with the error:
chroot: failed to run command 'openshift-diagnostics': No such file or directory

Version-Release number of selected component (if applicable):
 oc version
oc v3.9.0-0.19.0
kubernetes v1.9.0-beta1
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
always

Steps to Reproduce:
1. Set up an environment with the multitenant network plugin
2. Run `oc adm diagnostics NetworkCheck`

Actual results:
# oc adm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
Info:  Output from the network diagnostic pod on node "172.16.120.88":
       chroot: failed to run command 'openshift-diagnostics': No such file or directory
       
Info:  Output from the network diagnostic pod on node "172.16.120.100":
       chroot: failed to run command 'openshift-diagnostics': No such file or directory
       
[Note] Summary of diagnostics execution (version v3.9.0-0.19.0):
[Note] Completed with no errors or warnings seen.


Expected results:
The diagnostic should complete without this error.

Additional info:
Found that two of the diagnostic pods fail to run:

# oc describe pod network-diag-pod-4vj7l -n network-diag-ns-jrhp9
Name:         network-diag-pod-4vj7l
Namespace:    network-diag-ns-jrhp9
Node:         172.16.120.88/172.16.120.88
Start Time:   Mon, 15 Jan 2018 04:50:14 -0500
Labels:       <none>
Annotations:  openshift.io/scc=privileged
Status:       Failed
IP:           172.16.120.88
Containers:
  network-diag-pod-4vj7l:
    Container ID:  docker://6574a1027c4b773fbbe120b583ba9a5990dcaf75806d9a30ef0c53a770f2767d
    Image:         openshift3/ose:v3.9.0-0.19.0
    Image ID:      docker-pullable://openshift3/ose@sha256:48ab445c678ee7a35ab9db61d3bd3dd015ac4de81f239ae4a45545b32e0d1f63
    Port:          <none>
    Command:
      /bin/bash
      -c
    Args:
      
#!/bin/bash
#
# Based on containerized/non-containerized openshift install,
# this script sets the environment so that docker, openshift, iptables, etc.
# binaries are available for network diagnostics.
#
set -o nounset
set -o pipefail

node_rootfs=/host
cmd="openshift-diagnostics network-diagnostic-pod -l 1"

# Origin image: openshift/node, OSE image: openshift3/node
node_image_regex="^openshift.*/node"

node_container_id="$(chroot "${node_rootfs}" docker ps --format='{{.Image}} {{.ID}}' | grep "${node_image_regex}" | cut -d' ' -f2)"

if [[ -z "${node_container_id}" ]]; then # non-containerized openshift env

    chroot "${node_rootfs}" ${cmd}

else # containerized env

    # On containerized install, docker on the host is used by node container,
    # For the privileged network diagnostics pod to use all the binaries on the node:
    # - Copy kubeconfig secret to node mount namespace
    # - Run openshift under the mount namespace of node

    node_docker_pid="$(chroot "${node_rootfs}" docker inspect --format='{{.State.Pid}}' "${node_container_id}")"
    kubeconfig="/etc/origin/node/kubeconfig"
    cp "${node_rootfs}/secrets/kubeconfig" "${node_rootfs}/${kubeconfig}"

    chroot "${node_rootfs}" nsenter -m -t "${node_docker_pid}" -- /bin/bash -c 'KUBECONFIG='"${kubeconfig} ${cmd}"''

fi
    State:          Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Mon, 15 Jan 2018 04:50:16 -0500
      Finished:     Mon, 15 Jan 2018 04:50:16 -0500
    Ready:          False
    Restart Count:  0
    Environment:
      KUBECONFIG:  /secrets/kubeconfig
    Mounts:
      /host from host-root-dir (rw)
      /host/secrets from kconfig-secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5hrdw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  host-root-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  kconfig-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  network-diag-secret
    Optional:    false
  default-token-5hrdw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5hrdw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type    Reason                 Age   From                    Message
  ----    ------                 ----  ----                    -------
  Normal  SuccessfulMountVolume  19s   kubelet, 172.16.120.88  MountVolume.SetUp succeeded for volume "host-root-dir"
  Normal  SuccessfulMountVolume  19s   kubelet, 172.16.120.88  MountVolume.SetUp succeeded for volume "default-token-5hrdw"
  Normal  SuccessfulMountVolume  19s   kubelet, 172.16.120.88  MountVolume.SetUp succeeded for volume "kconfig-secret"
  Normal  Pulled                 18s   kubelet, 172.16.120.88  Container image "openshift3/ose:v3.9.0-0.19.0" already present on machine
  Normal  Created                18s   kubelet, 172.16.120.88  Created container
  Normal  Started                17s   kubelet, 172.16.120.88  Started container
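
For reference, the wrapper script in the pod spec above chroots into the node's root filesystem (mounted at /host) and runs `openshift-diagnostics` there, so exit code 127 just means the host has no such binary. A quick check on an affected node (assuming a root shell on the node):

# which openshift-diagnostics || echo "openshift-diagnostics is not present on the node host"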

Comment 2 Luke Meyer 2018-01-19 22:19:58 UTC
https://github.com/openshift/origin/pull/18186 is the current favored candidate fix.

Comment 3 Luke Meyer 2018-01-23 16:45:49 UTC
*** Bug 1537478 has been marked as a duplicate of this bug. ***

Comment 4 Luke Meyer 2018-01-25 13:31:03 UTC
https://github.com/openshift/origin/pull/18186 merged

Comment 6 zhaozhanqi 2018-02-06 10:18:01 UTC
The issue above has been fixed.

However, when verifying this bug on oc v3.9.0-0.38.0, I always hit 'Failed to pull image "openshift3/ose-deployer:v3.9.0-0.38.0": rpc error: code = Unknown desc = repository docker.io/openshift3/ose-deployer not found: does not exist or no pull access'


# oc describe pod network-diag-test-pod-b4bfk -n network-diag-ns-2zf55
Name:         network-diag-test-pod-b4bfk
Namespace:    network-diag-ns-2zf55
Node:         172.16.120.64/172.16.120.64
Start Time:   Tue, 06 Feb 2018 05:13:46 -0500
Labels:       network-diag-pod-name=network-diag-test-pod-b4bfk
Annotations:  openshift.io/scc=anyuid
Status:       Pending
IP:           
Containers:
  network-diag-test-pod-b4bfk:
    Container ID:  
    Image:         openshift3/ose-deployer:v3.9.0-0.38.0
    Image ID:      
    Port:          <none>
    Command:
      socat
      -T
      1
      -d
      TCP-l:8080,reuseaddr,fork,crlf
      system:"echo 'HTTP/1.0 200 OK'; echo 'Content-Type: text/plain'; echo; echo 'Hello OpenShift'"
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jbbbm (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  default-token-jbbbm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jbbbm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age                From                    Message
  ----     ------                 ----               ----                    -------
  Normal   SuccessfulMountVolume  31s                kubelet, 172.16.120.64  MountVolume.SetUp succeeded for volume "default-token-jbbbm"
  Normal   Pulling                29s                kubelet, 172.16.120.64  pulling image "openshift3/ose-deployer:v3.9.0-0.38.0"
  Warning  Failed                 27s                kubelet, 172.16.120.64  Failed to pull image "openshift3/ose-deployer:v3.9.0-0.38.0": rpc error: code = Unknown desc = repository docker.io/openshift3/ose-deployer not found: does not exist or no pull access
  Warning  Failed                 27s                kubelet, 172.16.120.64  Error: ErrImagePull
  Normal   SandboxChanged         0s (x10 over 27s)  kubelet, 172.16.120.64  Pod sandbox changed, it will be killed and re-created.
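
The pull failure can also be reproduced directly on the node, outside the diagnostic; an unqualified pull goes through the same registry search and should fail with the same "not found" error as in the events above:

# docker pull openshift3/ose-deployer:v3.9.0-0.38.0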

Comment 7 Luke Meyer 2018-02-07 21:47:53 UTC
I would certainly not expect docker.io/openshift3/ose-deployer to exist (at any version). And of course this version is not yet shipped via registry.access.redhat.com either. To test this diagnostic with pre-GA OCP images you'll have to either:

1. configure docker to include a registry that does have exactly the right image requested, or
2. use the available flags on the NetworkCheck diagnostic to specify (probably including registry) an image that is available; see the sketch after this list.
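
For example, a minimal sketch of option 2 (flag names as listed by `oc adm diagnostics --help` in 3.9, so please confirm locally; <registry> is a placeholder for any registry the nodes can actually pull these pre-GA images from):

# oc adm diagnostics NetworkCheck \
    --network-pod-image=<registry>/openshift3/ose:v3.9.0-0.38.0 \
    --network-test-pod-image=<registry>/openshift3/ose-deployer:v3.9.0-0.38.0

The first flag should cover the privileged diagnostic pod (openshift3/ose in the description) and the second the test pods (openshift3/ose-deployer in comment 6).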

Since the test pods aren't actually getting deployed I don't think you've yet verified that the diagnostic is able to invoke them successfully.

(Worth noting re methods above -- there's some question at https://github.com/openshift/origin/pull/18260#issuecomment-360302197 about whether NetworkCheck should continue to omit the registry from the default image; if that is reverted then the second method would be required for testing, as with DiagnosticPod; hopefully they will become consistent one way or the other shortly.)

There is actually a third method, I guess, which is to docker tag all the necessary images on all nodes before testing the diagnostic. And to be quite clear, all this is only necessary for using pre-GA or non-RH images.
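
A rough sketch of that third method, to be run on each node, with <internal-registry> as a placeholder for wherever the pre-GA build is actually hosted:

# docker pull <internal-registry>/openshift3/ose-deployer:v3.9.0-0.38.0
# docker tag <internal-registry>/openshift3/ose-deployer:v3.9.0-0.38.0 openshift3/ose-deployer:v3.9.0-0.38.0

The same would apply to any other image the diagnostic requests, such as openshift3/ose.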

Comment 8 zhaozhanqi 2018-02-09 06:59:35 UTC
According to comment 7, once docker can pull the correct image, the diagnostic works.

Verified this bug.

