Description of problem:
When the test pods fail to start in the network-diag-ns-xxxx namespaces, 'oadm diagnostics NetworkCheck' should delete them at the end; otherwise the leftover pods affect the next execution.

Version-Release number of selected component (if applicable):
# openshift version
openshift v1.4.0-alpha.0+0787d9f-738
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1

How reproducible:
Always

Steps to Reproduce:
1. Set up an openshift cluster with 2 nodes
2. Make the test image pull fail on one node (e.g. by configuring a wrong registry)
3. Run 'oadm diagnostics NetworkCheck'
4. Check the test pods in the test namespaces

Actual results:
# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

ERROR: [DNet2005 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:101]
       Setting up test environment for network diagnostics failed: Failed to run network diags test pod and service: timed out waiting for the condition

[Note] Summary of diagnostics execution (version v1.4.0-alpha.0+0787d9f-738):
[Note] Errors seen: 1

Step 4:
# oc get pod --all-namespaces -o wide
NAMESPACE               NAME                           READY     STATUS              RESTARTS   AGE       IP              NODE
default                 docker-registry-2-elbmv        1/1       Running             0          1h        10.1.2.7        ip-172-18-15-181.ec2.internal
default                 docker-registry-2-wsb31        1/1       Running             1          2h        10.1.2.3        ip-172-18-15-181.ec2.internal
default                 registry-console-1-nhlv1       1/1       Running             0          1h        10.1.2.8        ip-172-18-15-181.ec2.internal
default                 router-1-8gwi3                 1/1       Running             1          2h        172.18.15.181   ip-172-18-15-181.ec2.internal
default                 router-1-apwkm                 0/1       Pending             0          32m       <none>
install-test            dancer-mysql-example-1-build   0/1       Completed           0          2h        10.1.2.2        ip-172-18-15-181.ec2.internal
install-test            dancer-mysql-example-1-sw6og   1/1       Running             6          2h        10.1.2.2        ip-172-18-15-181.ec2.internal
install-test            database-1-l2hvc               1/1       Running             0          1h        10.1.2.9        ip-172-18-15-181.ec2.internal
network-diag-ns-1jf8t   network-diag-test-pod-70os8    0/1       ContainerCreating   0          12m       <none>          ip-172-18-13-225.ec2.internal
network-diag-ns-1jf8t   network-diag-test-pod-mga0r    0/1       ContainerCreating   0          12m       <none>          ip-172-18-13-225.ec2.internal
network-diag-ns-1jf8t   network-diag-test-pod-r30nm    0/1       OutOfpods           0          12m       <none>          ip-172-18-15-181.ec2.internal
network-diag-ns-1jf8t   network-diag-test-pod-ryi4a    0/1       ContainerCreating   0          12m       <none>          ip-172-18-13-225.ec2.internal
network-diag-ns-1jf8t   network-diag-test-pod-tkxum    0/1       OutOfpods           0          12m       <none>          ip-172-18-15-181.ec2.internal

Expected results:
1. The diagnostic should ignore the not-ready node and give a meaningful message such as 'node ip-172-18-15-180.ec2.internal is not ready'.

Additional info:
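For reference, the expected behavior in item 1 amounts to checking each node's Ready condition before scheduling test pods on it. Below is a minimal sketch of that check, assuming current k8s.io/api types (the origin code of this era used its own vendored API); isNodeReady and filterReadyNodes are illustrative names, not the actual run_pod.go implementation:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isNodeReady reports whether the node's Ready condition is True.
func isNodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// filterReadyNodes drops not-ready nodes up front and reports each one,
// instead of scheduling test pods that will sit in ContainerCreating forever.
func filterReadyNodes(nodes []corev1.Node) []corev1.Node {
	ready := nodes[:0]
	for i := range nodes {
		if !isNodeReady(&nodes[i]) {
			fmt.Printf("node %s is not ready, skipping network diagnostics on it\n", nodes[i].Name)
			continue
		}
		ready = append(ready, nodes[i])
	}
	return ready
}

func main() {
	notReady := corev1.Node{}
	notReady.Name = "ip-172-18-13-225.ec2.internal"
	notReady.Status.Conditions = []corev1.NodeCondition{
		{Type: corev1.NodeReady, Status: corev1.ConditionFalse},
	}
	filterReadyNodes([]corev1.Node{notReady}) // prints the "not ready" message
}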
GitHub PR: https://github.com/openshift/origin/pull/11719
@zhaozhanqi 'make one node pull the test image failed' => did you remove the existing hello-openshift image and disable internet access to docker containers by setting sysctl net.bridge.bridge-nf-call-iptables/net.ipv4.ip_forward=0?
@ravi I configured a wrong registry for pulling the hello-openshift image, so the pull fails.
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/78378e8969467d7c2a51033a60423b7a43bdac18
Bug 1388026 - Ensure deletion of namespaces created by network diagnostics command
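The commit above makes the diagnostics delete the namespaces it created. A minimal sketch of that cleanup under client-go, assuming the namespaces can be recognized by their network-diag- name prefix; cleanupDiagNamespaces is a hypothetical name, not the identifier used in the commit:

package diag

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupDiagNamespaces deletes every namespace created by the network
// diagnostics run, identified here by the network-diag- name prefix.
func cleanupDiagNamespaces(client kubernetes.Interface) error {
	nsList, err := client.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, ns := range nsList.Items {
		if !strings.HasPrefix(ns.Name, "network-diag-") {
			continue
		}
		if err := client.CoreV1().Namespaces().Delete(context.TODO(), ns.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}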
Tested this issue on:
openshift version
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

I can still reproduce this issue in my local env. I'm not sure, but maybe the CPU and memory are insufficient, so the pods cannot reach Running.

# oadm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/root/.kube/config'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

ERROR: [DNet2005 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:114]
       Setting up test environment for network diagnostics failed: Failed to run network diags test pod and service: [timed out waiting for the condition, timed out waiting for the condition]

[Note] Summary of diagnostics execution (version v3.4.0.23+24b1a58):
[Note] Errors seen: 1

[root@minion1 ~]# oc get pod --all-namespaces
NAMESPACE                      NAME                          READY     STATUS              RESTARTS   AGE
default                        caddy-docker                  1/1       Running             1          4d
default                        router-9-70hqt                1/1       Running             0          3d
default                        test-rc-jzgcp                 1/1       Running             0          3d
default                        test-rc-wnf9x                 1/1       Running             0          3d
network-diag-global-ns-dddni   network-diag-test-pod-l9xm8   0/1       OutOfpods           0          2m
network-diag-global-ns-dddni   network-diag-test-pod-pwal3   0/1       OutOfpods           0          2m
network-diag-global-ns-dddni   network-diag-test-pod-ufdwz   0/1       OutOfpods           0          2m
network-diag-global-ns-rjx7c   network-diag-test-pod-49win   0/1       OutOfpods           0          2m
network-diag-global-ns-rjx7c   network-diag-test-pod-adtmm   0/1       OutOfpods           0          2m
network-diag-global-ns-rjx7c   network-diag-test-pod-tzejk   0/1       OutOfpods           0          2m
network-diag-ns-t7mnc          network-diag-test-pod-eshrk   0/1       OutOfpods           0          2m
network-diag-ns-t7mnc          network-diag-test-pod-w5id3   1/1       Running             0          2m
network-diag-ns-t7mnc          network-diag-test-pod-wuiuw   0/1       OutOfpods           0          2m
network-diag-ns-vtdqb          network-diag-test-pod-c3sbw   0/1       ContainerCreating   0          2m
network-diag-ns-vtdqb          network-diag-test-pod-iv4fl   1/1       Running             0          2m
network-diag-ns-vtdqb          network-diag-test-pod-ksy20   0/1       ContainerCreating   0          2m
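For context, the 'timed out waiting for the condition' text in both failures is the generic error from the apimachinery wait helpers: the diagnostic polls pod status until a timeout, and pods stuck in OutOfpods/ContainerCreating never satisfy the condition. A minimal sketch of that polling pattern, with a hypothetical waitForTestPod helper (not the actual origin code):

package diag

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForTestPod polls until the pod reaches Running or the timeout expires.
// On timeout it returns wait.ErrWaitTimeout, whose message is exactly
// "timed out waiting for the condition" as seen in the diagnostics output.
func waitForTestPod(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Pods stuck in ContainerCreating or OutOfpods never reach Running,
		// so the poll keeps returning false until the timeout fires.
		return pod.Status.Phase == corev1.PodRunning, nil
	})
}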
Fixed in https://github.com/openshift/origin/pull/11818
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/9566c24cc21b5480b98822fbed67c762fbae1923
Bug 1388026 - Fix network diagnostics cleanup when test setup fails

Diagnostics cleanup is called after the network diagnostics tests complete, or when anything goes wrong after the setup (launching test pods/services) is done. But setup itself can fail if it is unable to deploy pods or services on the nodes. So make cleanup run even if setup fails.
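The control-flow change the commit describes boils down to registering cleanup before running setup rather than only after a successful setup. A minimal sketch with hypothetical names (Setup, Cleanup, and runTests are illustrative, not the actual origin identifiers):

package diag

import "fmt"

// NetworkDiagnostic is a stand-in for the real diagnostic type.
type NetworkDiagnostic struct{}

func (d NetworkDiagnostic) Setup() error    { return fmt.Errorf("failed to run network diags test pod and service") }
func (d NetworkDiagnostic) Cleanup()        { fmt.Println("deleting network-diag-* namespaces") }
func (d NetworkDiagnostic) runTests() error { return nil }

// Check registers Cleanup with defer before calling Setup, so partially
// created pods, services, and namespaces are removed even when Setup fails.
// Previously cleanup only ran after a successful setup, which left the
// network-diag-ns-xxxx namespaces behind.
func (d NetworkDiagnostic) Check() error {
	defer d.Cleanup()

	if err := d.Setup(); err != nil {
		return fmt.Errorf("setting up test environment for network diagnostics failed: %v", err)
	}
	return d.runTests()
}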
This issue should be fixed in:
openshift version
openshift v3.4.0.25+1f36858
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Could you help change the status to 'ON_QA'? Then I will verify this bug.
Verified this bug according to comment 8.