Good catch; that situation should be handled without a panic. I also think the duplicate warnings could be suppressed.
Under way in https://github.com/openshift/origin/pull/7317
I think the above PR fixes this; however, I've run into https://github.com/openshift/origin/issues/7358, so I'm getting no actual logs from the diagnostic pod and can't test until that's resolved.
Planning to merge https://github.com/openshift/origin/issues/7358 tonight. Note that the code that was panicking runs inside the deployer container, so the deployer image needs to be rebuilt with the updated openshift binary in order to pick up the fix. Note also, I couldn't reproduce this as described; what did reproduce it, however, was breaking the master DNS. In the master config I changed it to:

    dnsConfig:
      bindAddress: 0.0.0.0:8053

This completely breaks DNS and causes the panic in unpatched code. The updated code should calmly report errors and warnings (without so many duplicates).
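For anyone else wanting to reproduce: a sketch of the relevant master-config.yaml stanza, with my reading of why it breaks (the file path and the default port are my assumptions about this devenv setup, not stated above):

```yaml
# master-config.yaml (location varies by install; on a devenv it is
# typically under openshift.local.config/master/)
dnsConfig:
  # SkyDNS normally binds port 53, which is where the pods'
  # /etc/resolv.conf sends queries. Moving it to 8053 leaves nothing
  # listening on 53, so pod DNS queries fail with "connection refused".
  bindAddress: 0.0.0.0:8053
```

This matches the "read udp 172.18.0.170:53: connection refused" errors in the diagnostic output below: the pod still queries port 53 while the master DNS is listening elsewhere.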
ON_QA assuming merge completes
Thanks for fixing this. I will test once the bug fix PR https://github.com/openshift/origin/pull/7317 gets merged into EC2 images. (The latest devenv-rhel7_3509 does not contain this fix.)
Finally merged.
Issue was reproduced with devenv-rhel7_3522 by setting this in master-config.yaml:

    dnsConfig:
      bindAddress: 0.0.0.0:8053

The duplicated-message problem is already fixed except for DP2015. Here are the full logs on devenv-rhel7_3522:

[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint

ERROR: [DCli2012 from diagnostic DiagnosticPod@openshift/origin/pkg/diagnostics/client/run_diagnostics_pod.go:155]
       See the errors below in the output from the diagnostic pod:

[Note] Running diagnostic: PodCheckAuth
       Description: Check that service account credentials authenticate as expected

ERROR: [CED5001 from controller openshift/origin/pkg/cmd/experimental/diagnostics/pod.go]
       While running the PodCheckAuth diagnostic, a panic was encountered. This is a bug in diagnostics.
       Error and stack trace follow: runtime error: invalid memory address or nil pointer dereference
       /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:168 (0x537812)
       /usr/lib/golang/src/runtime/asm_amd64.s:401 (0x43e335)
       /usr/lib/golang/src/runtime/panic.go:387 (0x414d58)
       /usr/lib/golang/src/runtime/panic.go:42 (0x41407e)
       /usr/lib/golang/src/runtime/sigpanic_unix.go:26 (0x41a904)
       /go/src/github.com/openshift/origin/pkg/diagnostics/pod/auth.go:109 (0xc54d63)
       /go/src/github.com/openshift/origin/pkg/diagnostics/pod/auth.go:56 (0xc541b5)
       <autogenerated>:4 (0xc590f2)
       /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:185 (0x5383a8)
       /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:191 (0x53433e)
       /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:120 (0x533929)
       /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:58 (0x53700b)
       /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:572 (0x4b5d7f)
       /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:662 (0x4b65eb)
       /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:618 (0x4b60ea)
       /go/src/github.com/openshift/origin/cmd/openshift/openshift.go:27 (0x401de5)
       /usr/lib/golang/src/runtime/proc.go:63 (0x4168a3)
       /usr/lib/golang/src/runtime/asm_amd64.s:2232 (0x440471)

[Note] Running diagnostic: PodCheckDns
       Description: Check that DNS within a pod works as expected

ERROR: [DP2003 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:71]
       The first /etc/resolv.conf nameserver 172.18.0.170 could not resolve kubernetes.default.svc.cluster.local.
       Error: read udp 172.18.0.170:53: connection refused
       This nameserver points to the master's SkyDNS which is critical for resolving cluster names, e.g. for Services.

WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
       Error querying nameserver 172.18.0.170: read udp 172.18.0.170:53: connection refused
       This may indicate a problem with DNS.

WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
       Error querying nameserver 172.18.0.170: read udp 172.18.0.170:53: connection refused
       This may indicate a problem with DNS.

WARN:  [DP2014 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:113]
       A request to the nameserver 172.18.0.170 timed out. This could be temporary but could also indicate network or DNS problems.

WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
       Error querying nameserver 172.18.0.170: read udp 172.18.0.170:53: connection refused
       This may indicate a problem with DNS.

[Note] Summary of diagnostics execution (version v1.1.3):
[Note] Warnings seen: 4
[Note] Errors seen: 2
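For context on the CED5001 message above: turning a panic like this nil dereference into an ordinary reported error is Go's standard recover pattern. A minimal sketch of that pattern (function names here are hypothetical illustrations, not the actual code from pod.go or PR 7317):

```go
package main

import "fmt"

// runDiagnostic runs a single diagnostic check, converting any panic
// (e.g. a nil pointer dereference) into an ordinary error instead of
// crashing the whole diagnostics pod.
func runDiagnostic(name string, check func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// r carries the runtime error, e.g.
			// "runtime error: invalid memory address or nil pointer dereference"
			err = fmt.Errorf("while running the %s diagnostic, a panic was encountered: %v", name, r)
		}
	}()
	return check()
}

func main() {
	// A check that dereferences a nil pointer, as the unpatched code did.
	err := runDiagnostic("PodCheckAuth", func() error {
		var p *int
		_ = *p // panics: nil pointer dereference
		return nil
	})
	fmt.Println(err) // the panic is reported, not fatal
}
```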
This output from the diagnostic running in the pod indicates it's not running the latest version of the image on the devenv:

    [Note] Summary of diagnostics execution (version v1.1.3):

If you don't specify anything in --images or --latest-images, the client default tag is used (not the hash, but the base client version). The :v1.1.3 tag would attempt to run that image from Docker Hub (unless you tagged some other image with that pre-emptively).

With devenv-rhel7_3526, if I do this:

    # openshift ex diagnostics DiagnosticPod --latest-images

then the end of the diagnostic output looks like:

    [Note] Summary of diagnostics execution (version v1.1.3-245-g806dd7e):
    [Note] Warnings seen: 3
    [Note] Errors seen: 2
    [Note] Summary of diagnostics execution (version v1.1.3-245-g806dd7e):
    [Note] Errors seen: 1

The images on the node are:

    # docker images | grep deployer
    openshift/origin-deployer             806dd7e   f64522cb8094   9 hours ago    520 MB
    openshift/origin-deployer             latest    f64522cb8094   9 hours ago    520 MB
    docker.io/openshift/origin-deployer   latest    18e3884452ac   11 hours ago   519.7 MB

The 806dd7e code hash is consistent. Oddly enough, it downloads the latest deployer from docker.io but doesn't use it; I don't quite understand that behavior, but it does what we want in the end by using the on-node image. Once the image tag is synced with the client version this won't be an issue, but for now, until the next version is tagged and released to Docker Hub, I think you need --latest-images on the diagnostics invocation to test this. It works for me.
Alternatively, tag the built image with the name it's looking for, e.g. for devenv-rhel7_3526:

    # docker tag f64522cb8094 openshift/origin-deployer:v1.1.3

I just tested this and it works this way too, without downloading anything from Docker Hub.
Oh, and I now realize that the image I used yesterday was wrong. Thanks for the info; yes, the issue is fixed. Tested on devenv-rhel7_3532 with:

    openshift ex diagnostics --images=openshift/origin-deployer:latest
    openshift ex diagnostics --latest-images
    openshift ex diagnostics DiagnosticPod --latest-images

Image tested: openshift/origin-deployer 36bb05708ed1

Output (both the code panic and the duplicate error messages are fixed):

[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint

ERROR: [DCli2012 from diagnostic DiagnosticPod@openshift/origin/pkg/diagnostics/client/run_diagnostics_pod.go:155]
       See the errors below in the output from the diagnostic pod:

[Note] Running diagnostic: PodCheckAuth
       Description: Check that service account credentials authenticate as expected

WARN:  [DP1005 from diagnostic PodCheckAuth@openshift/origin/pkg/diagnostics/pod/auth.go:84]
       A request to the master timed out. This could be temporary but could also indicate network or DNS problems.

ERROR: [DP1016 from diagnostic PodCheckAuth@openshift/origin/pkg/diagnostics/pod/auth.go:112]
       DNS resolution for registry address docker-registry.default.svc.cluster.local returned an error;
       container DNS is likely incorrect. The error was: read udp 172.18.8.207:53: connection refused

[Note] Running diagnostic: PodCheckDns
       Description: Check that DNS within a pod works as expected

ERROR: [DP2003 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:72]
       The first /etc/resolv.conf nameserver 172.18.8.207 could not resolve kubernetes.default.svc.cluster.local.
       Error: read udp 172.18.8.207:53: connection refused
       This nameserver points to the master's SkyDNS which is critical for resolving cluster names, e.g. for Services.

WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:124]
       Error querying nameserver 172.18.8.207: read udp 172.18.8.207:53: connection refused
       This may indicate a problem with DNS.

WARN:  [DP2014 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:119]
       A request to the nameserver 172.18.8.207 timed out. This could be temporary but could also indicate network or DNS problems.

[Note] Summary of diagnostics execution (version v1.1.3-267-g0842757):
[Note] Warnings seen: 3
[Note] Errors seen: 2

So closing the issue as fixed, thanks again!