Bug 1421643 - 'oadm diagnostics NetworkCheck' timeout due to image 'openshift/diagnostics-deployer' pull failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1439142
Depends On:
Blocks: 1481550 1481551 1505898
 
Reported: 2017-02-13 10:41 UTC by zhaozhanqi
Modified: 2017-10-24 14:13 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1481550 1481551
Environment:
Last Closed: 2017-08-10 05:17:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 12982 0 None None None 2017-02-16 02:45:59 UTC
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description zhaozhanqi 2017-02-13 10:41:22 UTC
Description of problem:
'oadm diagnostics NetworkCheck' times out because the image 'openshift/diagnostics-deployer' fails to pull

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.0.19+199197c
kubernetes v1.5.2+43a9be4
etcd 3.1.0


How reproducible:
always

Steps to Reproduce:
1. setup multi-node env 
2. run 'oadm diagnostics NetworkCheck'
3. monitor the pods in another terminal


Actual results:

step2:
# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
ERROR: [DNet2007 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:137]
       timed out waiting for the condition
       
[Note] Summary of diagnostics execution (version v3.5.0.19+199197c):
[Note] Errors seen: 1


step 3: The pull of the 'openshift/diagnostics-deployer' image failed:

# oc describe pod network-diag-pod-jkb8q -n network-diag-ns-46kph
Name:			network-diag-pod-jkb8q
Namespace:		network-diag-ns-46kph
Security Policy:	privileged
Node:			ip-172-18-2-17.ec2.internal/172.18.2.17
Start Time:		Mon, 13 Feb 2017 05:06:36 -0500
Labels:			<none>
Status:			Pending
IP:			172.18.2.17
Controllers:		<none>
Containers:
  network-diag-pod-jkb8q:
    Container ID:	
    Image:		openshift/diagnostics-deployer
    Image ID:		
    Port:		
    Command:
      sh
      -c
      openshift-network-debug /host openshift infra network-diagnostic-pod -l 1
    State:		Waiting
      Reason:		ImagePullBackOff
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /host from host-root-dir (rw)
      /host/secrets from kconfig-secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-h8wg1 (ro)
    Environment Variables:
      KUBECONFIG:	/secrets/kubeconfig
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
Volumes:
  host-root-dir:
    Type:	HostPath (bare host directory volume)
    Path:	/
  kconfig-secret:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	network-diag-secret
  default-token-h8wg1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-h8wg1
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----					-------------				--------	------		-------
  40s		15s		2	{kubelet ip-172-18-2-17.ec2.internal}	spec.containers{network-diag-pod-jkb8q}	Normal		BackOff		Back-off pulling image "openshift/diagnostics-deployer"
  40s		15s		2	{kubelet ip-172-18-2-17.ec2.internal}						Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-jkb8q" with ImagePullBackOff: "Back-off pulling image \"openshift/diagnostics-deployer\""

  42s	2s	3	{kubelet ip-172-18-2-17.ec2.internal}	spec.containers{network-diag-pod-jkb8q}	Normal	Pulling		pulling image "openshift/diagnostics-deployer"
  41s	1s	3	{kubelet ip-172-18-2-17.ec2.internal}	spec.containers{network-diag-pod-jkb8q}	Warning	Failed		Failed to pull image "openshift/diagnostics-deployer": unauthorized: authentication required
  41s	1s	3	{kubelet ip-172-18-2-17.ec2.internal}						Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-jkb8q" with ErrImagePull: "unauthorized: authentication required"

Expected results:

oadm diagnostics NetworkCheck completes without errors


Additional info:

Comment 1 Troy Dawson 2017-02-13 20:22:54 UTC
The image isn't available yet.  But for OpenShift Container Platform, it needs to be openshift3/ose-diagnostics-deployer

Comment 2 Ravi Sankar 2017-02-16 02:46:00 UTC
Fixed in https://github.com/openshift/origin/pull/12982

Comment 3 Eric Paris 2017-02-22 16:30:45 UTC
Ravi, because of the 75 bazillion flakes we need this fixed in origin/master and origin/release-1.5. Can you do that backport?

Comment 4 Ravi Sankar 2017-02-22 18:34:37 UTC
PR on origin/release-1.5 : https://github.com/openshift/origin/pull/13062

Comment 5 openshift-github-bot 2017-02-23 01:25:42 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/b640a0e55fdebb6e2b3eb52fc404ecfee16ead98
Bug 1421643 - Use existing openshift/origin image instead of new openshift/diagnostics-deployer

Any new image like 'openshift/diagnostics-deployer' incurs build/lifecycle costs to maintain, and the diagnostics-deployer image contains only a small block of shell code. To avoid this, the script is now embedded into the pod definition, and openshift/origin is used as the diagnostics deployer image. On dev machines, openshift/origin is currently close to 800MB, but we expect the size to be under 200MB when it is released (compressed, debug headers removed).
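The approach described in the commit message can be pictured as a pod spec along these lines (an illustrative sketch only; the actual pod definition is constructed programmatically by the diagnostic, and field layout here is an assumption — the image and command come from the `oc describe pod` output above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: network-diag-pod-   # matches the pod names seen above
spec:
  containers:
  - name: network-diag-pod
    image: openshift/origin          # existing image reused; no dedicated deployer image
    command:
    - sh
    - -c
    # previously this script shipped in openshift/diagnostics-deployer;
    # now it is embedded directly in the generated pod definition
    - openshift-network-debug /host openshift infra network-diagnostic-pod -l 1
```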

Comment 6 zhaozhanqi 2017-03-01 05:19:30 UTC
This still does not work in an Atomic Host environment.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Atomic Host release 7.3


# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
Info:  Output from the network diagnostic pod on node "host-8-175-189.host.centralci.eng.rdu2.redhat.com":
Info:  Output from the network diagnostic pod on node "host-8-174-76.host.centralci.eng.rdu2.redhat.com":
[Note] Summary of diagnostics execution (version v3.5.0.35):
[Note] Completed with no errors or warnings seen.


***************************

Checking the logs shows that some pods still fail to run:


Events:
  FirstSeen	LastSeen	Count	From								SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----								-------------				--------	------		-------
  8s		8s		1	{kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}	spec.containers{network-diag-pod-kbst5}	Normal		Pulled		Container image "openshift3/ose" already present on machine
  8s		8s		1	{kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}	spec.containers{network-diag-pod-kbst5}	Normal		Created		Created container with docker id 528da7892463; Security:[seccomp=unconfined]
  7s		7s		1	{kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}	spec.containers{network-diag-pod-kbst5}	Warning		Failed		Failed to start container with docker id 528da7892463 with error: Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  7s		7s		1	{kubelet host-8-175-189.host.centralci.eng.rdu2.redhat.com}						Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-kbst5" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:359: container init caused \\\\\\\\\\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/af3c94ac-fe3d-11e6-a07f-fa163ebcec06/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/7bbe3c708f48e9867c0a16f0e5fd162337c88284e3972b6e971d3dcc7abb6b5c/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"

Comment 7 Ben Bennett 2017-04-19 13:52:47 UTC
Ravi is working on this to make the code more robust when only some nodes manage to pull the image.

Comment 8 openshift-github-bot 2017-04-25 13:52:31 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4f5f8a6ae45b19e7c4c00cee2025bcf329d77b34
Bug 1421643 - Fix network diagnostics timeouts

waitForNetworkPod() is called from a few places and had a fixed
timeout of 82 seconds, which was insufficient in cases where
network bandwidth is low or network latency is high.
This change makes waitForNetworkPod() take a custom timeout
value based on the operation being performed.

Comment 10 zhaozhanqi 2017-06-05 06:41:45 UTC
Checked this issue on openshift v3.6.94 in an Atomic Host environment.
The error still occurs when describing the pod (`oc describe pod network-diag-pod-rwr53 -n network-diag-ns-w06vl`):

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----					-------------				--------	------		-------
  27s		27s		1	kubelet, ip-172-18-7-114.ec2.internal	spec.containers{network-diag-pod-rwr53}	Normal		Pulled		Container image "registry.access.redhat.com/openshift3/ose" already present on machine
  27s		27s		1	kubelet, ip-172-18-7-114.ec2.internal	spec.containers{network-diag-pod-rwr53}	Normal		Created		Created container with id 74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431
  26s		26s		1	kubelet, ip-172-18-7-114.ec2.internal	spec.containers{network-diag-pod-rwr53}	Warning		Failed		Failed to start container with id 74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431 with error: rpc error: code = 2 desc = failed to start container "74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/9eb87cf1-49b9-11e7-a4f6-0e2345259696/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
  26s		26s		1	kubelet, ip-172-18-7-114.ec2.internal						Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "network-diag-pod-rwr53" with rpc error: code = 2 desc = failed to start container "74442f6f035c0a61258c771e21b014669269c2bda0a83079e33afbfbee610431": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/var/lib/origin/openshift.local.volumes/pods/9eb87cf1-49b9-11e7-a4f6-0e2345259696/volumes/kubernetes.io~secret/kconfig-secret\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"mkdir /var/lib/docker/devicemapper/mnt/3478616fca5a482f2c2a809dca2309074e025758abdf7e365faad7d2ff6eaff4/rootfs/host/secrets: permission denied\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}: "Start Container Failed"

Comment 11 zhaozhanqi 2017-06-06 08:43:50 UTC
*** Bug 1439142 has been marked as a duplicate of this bug. ***

Comment 12 Ravi Sankar 2017-06-27 18:53:12 UTC
The original timeout issue is fixed; the current issue in comment 10 is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1393716 (comment 17).
Closing this bug, since the remaining issue is tracked by an open bug.

Comment 13 Ravi Sankar 2017-06-27 18:55:46 UTC

*** This bug has been marked as a duplicate of bug 1393716 ***

Comment 14 zhaozhanqi 2017-06-28 03:14:33 UTC
As comment 12 said, the original timeout issue is fixed, so this is not in fact a duplicate bug.
Marking this as 'verified'.

Comment 16 errata-xmlrpc 2017-08-10 05:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

