Bug 1481147 - oc adm diagnostics gets stuck in disconnected environment
Summary: oc adm diagnostics gets stuck in disconnected environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1476232
Depends On:
Blocks: 1505900
 
Reported: 2017-08-14 08:36 UTC by Marko Myllynen
Modified: 2018-07-18 19:58 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Incorrect image matching for the default network diagnostics pod image.
Consequence: Network diagnostics failed.
Fix: Fixed the default pod image check.
Result: Network diagnostics works without errors.
Clone Of:
Cloned To: 1505900
Environment:
Last Closed: 2017-11-28 22:07:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Marko Myllynen 2017-08-14 08:36:41 UTC
Description of problem:
In a disconnected environment oc adm diagnostics gets stuck:

$ oc adm diagnostics --images=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5 --network-pod-image=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5
...
[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

^CERROR: [DNet2006 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:136]
       Creating network diagnostic pod "network-diag-pod-qth3j" on node "infra01.test.example.com" with command "openshift infra network-diagnostic-pod -l 1" failed: namespaces "network-diag-ns-nzdz1" not found

Not sure whether this is related to this later warning:

ERROR: [DClu1019 from diagnostic ClusterRegistry@openshift/origin/pkg/diagnostics/cluster/registry.go:343]
       Diagnostics created a test ImageStream and compared the registry IP
       it received to the registry IP available via the docker-registry service.

       docker-registry      : 172.30.116.71:5000
       ImageStream registry : docker-registry.default.svc:5000

       They do not match, which probably means that an administrator re-created
       the docker-registry service but the master has cached the old service
       IP address. Builds or deployments that use ImageStreams with the wrong
       docker-registry IP will fail under this condition.

       To resolve this issue, restarting the master (to clear the cache) should
       be sufficient. Existing ImageStreams may need to be re-created.
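
For reference, restarting the master on OCP 3.6 is typically done with systemctl; the unit names below are the usual ones but depend on the installation type (single master vs. native HA), so treat them as assumptions rather than output from this run:

$ systemctl restart atomic-openshift-master
# or, on native HA masters:
$ systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers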

Version-Release number of selected component (if applicable):
atomic-openshift-clients-3.6.173.0.5-1.git.0.f30b99e.el7.x86_64

Comment 1 Marko Myllynen 2017-09-04 09:56:11 UTC
Also

$ oc adm diagnostics --images=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5 --network-pod-image=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5 --network-test-pod-image=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5

is failing:

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
ERROR: [DNet2005 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:119]
       Setting up test environment for network diagnostics failed: Failed to run network diags test pod and service: Failed to run network diags test pods, failed: 40, total: 40

I've reported the registry issue separately at https://bugzilla.redhat.com/show_bug.cgi?id=1488059 so we can keep this BZ only about the network test.

Thanks.

Comment 2 David Sundqvist 2017-09-05 06:20:43 UTC
I also seem to run into this issue on two separate clusters. The node logs contain entries like the ones below, and if you grab the logs out of one of the test pods (for example by suspending the diagnostics command so you have time to do so), the pod log only contains: error: --deployment or OPENSHIFT_DEPLOYMENT_NAME is required

Sep 01 13:59:26 ocp-4 oci-register-machine[38529]: 2017/09/01 13:59:26 Register machine: prestart d8d507821244185b55f3486316140f2748aa211b5c47396f67e715b49fbdaf7b 385
15 /var/lib/docker/devicemapper/mnt/f6fbd5d819bdf917d23a1b90b6096262670219414f59a35b7cba1d65e038c828/rootfs
Sep 01 13:59:26 ocp-4 systemd-machined[2963]: New machine d8d507821244185b55f3486316140f27.
Sep 01 13:59:26 ocp-4 oci-systemd-hook[38536]: systemdhook <info>: gidMappings not found in config
Sep 01 13:59:26 ocp-4 oci-systemd-hook[38536]: systemdhook <debug>: GID: 0
Sep 01 13:59:26 ocp-4 oci-systemd-hook[38536]: systemdhook <info>: uidMappings not found in config
Sep 01 13:59:26 ocp-4 oci-systemd-hook[38536]: systemdhook <debug>: UID: 0
Sep 01 13:59:26 ocp-4 oci-systemd-hook[38536]: systemdhook <debug>: Skipping as container command is /usr/bin/openshift-deploy, not init or systemd
...
Sep 01 13:59:27 ocp-4 kernel: XFS (dm-19): Starting recovery (logdev: internal)
Sep 01 13:59:27 ocp-4 dockerd-current[1317]: error: --deployment or OPENSHIFT_DEPLOYMENT_NAME is required
Sep 01 13:59:27 ocp-4 systemd[1]: Stopped docker container 99ecb2540ef8246bb60f4b85b486ab5110c91bf39e365719efaae13ef73d984d.
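
For reference, one way to capture those pod logs before the temporary namespace is torn down is to suspend the diagnostics run (Ctrl+Z) and read the test pod logs directly; the pod and namespace names below are placeholders, since they are generated per run:

$ oc get pods --all-namespaces | grep network-diag
$ oc logs network-diag-test-pod-xxxxx -n network-diag-ns-xxxxx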

Comment 3 Magnus Glantz 2017-09-06 11:23:46 UTC
I can replicate this issue as well, on a fresh install of a connected OCP 3.6 cluster (OpenShift Master: v3.6.173.0.5, Kubernetes Master: v1.6.1+5115d708d7).

[root@ocpm-0 ~]# oc adm diagnostics 
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'
Info:  Using context for cluster-admin access: 'appx-prod/rhf17ocpmaster-northeurope-cloudapp-azure-com:8443/system:admin'
[Note] Performing systemd discovery

[Note] Running diagnostic: ConfigContexts[appx-test/rhf17ocpmaster-northeurope-cloudapp-azure-com:8443/system:admin]
       Description: Validate client config context is complete and has connectivity
       
Info:  For client config context 'appx-test/rhf17ocpmaster-northeurope-cloudapp-azure-com:8443/system:admin':
       The server URL is 'https://rhf17ocpmaster.northeurope.cloudapp.azure.com:8443'
       The user authentication is 'system:admin/rhf17ocpmaster-northeurope-cloudapp-azure-com:8443'
       The current project is 'appx-test'
       Successfully requested project list; has access to project(s):
         [appx-dev appx-prod appx-test default demo-hpa kube-public kube-system logging management-infra openshift ...]
       
[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint
       
Info:  Output from the diagnostic pod (image registry.access.redhat.com/openshift3/ose-deployer:v3.6.173.0.5):
       [Note] Running diagnostic: PodCheckAuth
              Description: Check that service account credentials authenticate as expected
              
       Info:  Service account token successfully authenticated to master
       Info:  Service account token was authenticated by the integrated registry.
       [Note] Running diagnostic: PodCheckDns
              Description: Check that DNS within a pod works as expected
              
       [Note] Summary of diagnostics execution (version v3.6.173.0.5):
       [Note] Completed with no errors or warnings seen.
       
[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
ERROR: [DNet2005 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:119]
       Setting up test environment for network diagnostics failed: Failed to run network diags test pod and service: Failed to run network diags test pods, failed: 14, total: 16

Comment 5 Ben Bennett 2017-09-15 18:17:44 UTC
*** Bug 1476232 has been marked as a duplicate of this bug. ***

Comment 8 Ravi Sankar 2017-09-19 22:03:42 UTC
Fixed in https://github.com/openshift/origin/pull/16439

Comment 9 openshift-github-bot 2017-09-23 10:23:25 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/85aa3968871269b21c823edef55233bac9adbd01
Bug 1481147 - Fix default pod image for network diagnostics

- This also ensures the network diagnostics pod and test pod images use the
  deployed OpenShift version/tag (not latest), so the node does not need to
  download the same image again under the latest tag.

https://github.com/openshift/origin/commit/fc7190d95e791408538abef8f779ba8493bec867
Merge pull request #16439 from pravisankar/netdiags-image-check

Automatic merge from submit-queue

Bug 1481147 - Fix default pod image for network diagnostics

- This also ensures the network diagnostics pod and test pod images use the
  deployed OpenShift version/tag (not latest), so the node does not need to
  download the same image again under the latest tag.

- Print more details when network diagnostics test setup fails.
  Currently when network diags fails, it only informs how many test pods failed but doesn't provide why those pods failed. This change will fetch logs for the pods in case of setup failure.
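
As a note for clusters still running a pre-fix build: comment 12 verifies the fix by pointing --network-pod-image at the ose image and --network-test-pod-image at the ose-deployer image with the exact deployed tag, so an analogous invocation against a local registry might look like the following (registry host and tag are illustrative; this is a sketch, not a confirmed workaround for 3.6):

$ oc adm diagnostics NetworkCheck \
    --network-pod-image=registry.example.com:5000/openshift3/ose:v3.6.173.0.5 \
    --network-test-pod-image=registry.example.com:5000/openshift3/ose-deployer:v3.6.173.0.5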

Comment 10 Vladislav Walek 2017-10-11 12:34:38 UTC
Hello,

When will the fix be released in the enterprise edition? In which version will it be included?
Thank you

Comment 11 Luke Meyer 2017-10-11 19:25:04 UTC
Should be released with OCP 3.7 at GA. Not sure if something needs to be done to get this bug attached to the errata, but QE should have a build to test, so moving to ON_QA.

Comment 12 zhaozhanqi 2017-10-12 08:30:46 UTC
Verified this bug on v3.7.0-0.143.2


# oadm diagnostics NetworkCheck --network-pod-image='registry.ops.openshift.com/openshift3/ose:v3.7.0-0.143.2' --network-test-pod-image='registry.ops.openshift.com/openshift3/ose-deployer:v3.7.0-0.143.2'
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
Info:  Output from the network diagnostic pod on node "ip-172-18-3-195.ec2.internal":
       [Note] Running diagnostic: CheckExternalNetwork
              Description: Check that external network is accessible within a pod
              
       [Note] Running diagnostic: CheckNodeNetwork
              Description: Check that pods in the cluster can access its own node.
              
       [Note] Running diagnostic: CheckPodNetwork
              Description: Check pod to pod communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with each other and in case of multitenant network plugin, pods in non-global projects should be isolated and pods in global projects should be able to access any pod in the cluster and vice versa.
              
       [Note] Running diagnostic: CheckServiceNetwork
              Description: Check pod to service communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with all services and in case of multitenant network plugin, services in non-global projects should be isolated and pods in global projects should be able to access any service in the cluster.
              
       [Note] Running diagnostic: CollectNetworkInfo
              Description: Collect network information in the cluster.
              
       [Note] Summary of diagnostics execution (version v3.7.0-0.143.2):
       [Note] Completed with no errors or warnings seen.
       
Info:  Output from the network diagnostic pod on node "ip-172-18-2-33.ec2.internal":
       [Note] Running diagnostic: CheckExternalNetwork
              Description: Check that external network is accessible within a pod
              
       [Note] Running diagnostic: CheckNodeNetwork
              Description: Check that pods in the cluster can access its own node.
              
       [Note] Running diagnostic: CheckPodNetwork
              Description: Check pod to pod communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with each other and in case of multitenant network plugin, pods in non-global projects should be isolated and pods in global projects should be able to access any pod in the cluster and vice versa.
              
       [Note] Running diagnostic: CheckServiceNetwork
              Description: Check pod to service communication in the cluster. In case of ovs-subnet network plugin, all pods should be able to communicate with all services and in case of multitenant network plugin, services in non-global projects should be isolated and pods in global projects should be able to access any service in the cluster.
              
       [Note] Running diagnostic: CollectNetworkInfo
              Description: Collect network information in the cluster.
              
       [Note] Summary of diagnostics execution (version v3.7.0-0.143.2):
       [Note] Completed with no errors or warnings seen.

Comment 17 errata-xmlrpc 2017-11-28 22:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

