Bug 1302649 - [Intservice_public_88] A panic was encountered while running the PodCheckAuth diagnostic when SDN br0 is down
Status: CLOSED CURRENTRELEASE
Product: OpenShift Origin
Classification: Red Hat
Component: Command Line Interface
3.x
Unspecified Unspecified
medium Severity medium
: 3.x
Assigned To: Luke Meyer
Wei Sun
Depends On: 1309193
Blocks:
Reported: 2016-01-28 05:24 EST by Xia Zhao
Modified: 2016-05-12 13:10 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 13:10:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Comment 1 Luke Meyer 2016-01-28 10:17:59 EST
Good catch, that situation should be handled without a panic. Also I think the duplicate warnings could be suppressed.
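As an illustration of suppressing duplicate warnings, here is a minimal Go sketch. It is not the diagnostics code; the `message` type and `dedupe` function are hypothetical names, and the sketch assumes two entries count as duplicates when both their ID and text match.

```go
package main

import "fmt"

// message is a simplified diagnostic log entry (hypothetical type).
type message struct {
	ID   string
	Text string
}

// dedupe returns msgs with repeated (ID, Text) pairs removed,
// keeping the first occurrence of each and preserving order.
func dedupe(msgs []message) []message {
	seen := make(map[message]bool)
	var out []message
	for _, m := range msgs {
		if seen[m] {
			continue
		}
		seen[m] = true
		out = append(out, m)
	}
	return out
}

func main() {
	msgs := []message{
		{"DP2015", "Error querying nameserver"},
		{"DP2015", "Error querying nameserver"},
		{"DP2014", "A request to the nameserver timed out."},
	}
	for _, m := range dedupe(msgs) {
		fmt.Printf("WARN: [%s] %s\n", m.ID, m.Text)
	}
}
```

A struct of strings is comparable in Go, so it can be used directly as the map key without building a composite string.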
Comment 2 Luke Meyer 2016-02-15 17:10:14 EST
Under way in https://github.com/openshift/origin/pull/7317
Comment 3 Luke Meyer 2016-02-16 15:28:51 EST
I think the above PR fixes this; however, I've run into https://github.com/openshift/origin/issues/7358, so I am getting no logs from the diagnostic pod and can't test until that's resolved.
Comment 4 Luke Meyer 2016-02-22 15:41:12 EST
Planning to merge https://github.com/openshift/origin/issues/7358 tonight. Note that the code that was panicking is inside the deployer container, so the deployer image needs to be rebuilt with the updated openshift binary in order to pick up the fix.

Note also, I couldn't reproduce this as described; what did reproduce it, however, was breaking the master DNS. In the master config I changed dnsConfig to:

dnsConfig:
  bindAddress: 0.0.0.0:8053

This completely breaks DNS and causes the panic in unpatched code. Updated code should calmly report errors and warnings (without so many duplicates).
Comment 5 Luke Meyer 2016-02-22 21:59:13 EST
ON_QA assuming merge completes
Comment 6 Xia Zhao 2016-02-22 22:10:45 EST
Thanks for fixing this. I will test once the fix in PR https://github.com/openshift/origin/pull/7317 gets merged into the EC2 images. (The latest image, devenv-rhel7_3509, does not contain this fix.)
Comment 7 Luke Meyer 2016-02-23 11:37:39 EST
Finally merged.
Comment 8 Xia Zhao 2016-02-24 00:38:38 EST
Issue was reproduced with devenv-rhel7_3522 by setting this in master-config.yaml:
dnsConfig:
  bindAddress: 0.0.0.0:8053

The duplicated messages are already fixed, except for DP2015.

Here are the full logs on devenv-rhel7_3522:

[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint
       
ERROR: [DCli2012 from diagnostic DiagnosticPod@openshift/origin/pkg/diagnostics/client/run_diagnostics_pod.go:155]
       See the errors below in the output from the diagnostic pod:
       [Note] Running diagnostic: PodCheckAuth
              Description: Check that service account credentials authenticate as expected
              
       ERROR: [CED5001 from controller openshift/origin/pkg/cmd/experimental/diagnostics/pod.go]
              While running the PodCheckAuth diagnostic, a panic was encountered.
              This is a bug in diagnostics. Error and stack trace follow: 
              runtime error: invalid memory address or nil pointer dereference
              /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:168 (0x537812)
              /usr/lib/golang/src/runtime/asm_amd64.s:401 (0x43e335)
              /usr/lib/golang/src/runtime/panic.go:387 (0x414d58)
              /usr/lib/golang/src/runtime/panic.go:42 (0x41407e)
              /usr/lib/golang/src/runtime/sigpanic_unix.go:26 (0x41a904)
              /go/src/github.com/openshift/origin/pkg/diagnostics/pod/auth.go:109 (0xc54d63)
              /go/src/github.com/openshift/origin/pkg/diagnostics/pod/auth.go:56 (0xc541b5)
              <autogenerated>:4 (0xc590f2)
              /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:185 (0x5383a8)
              /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:191 (0x53433e)
              /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:120 (0x533929)
              /go/src/github.com/openshift/origin/pkg/cmd/experimental/diagnostics/pod.go:58 (0x53700b)
              /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:572 (0x4b5d7f)
              /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:662 (0x4b65eb)
              /go/src/github.com/openshift/origin/Godeps/_workspace/src/github.com/spf13/cobra/command.go:618 (0x4b60ea)
              /go/src/github.com/openshift/origin/cmd/openshift/openshift.go:27 (0x401de5)
              /usr/lib/golang/src/runtime/proc.go:63 (0x4168a3)
              /usr/lib/golang/src/runtime/asm_amd64.s:2232 (0x440471)
              
       [Note] Running diagnostic: PodCheckDns
              Description: Check that DNS within a pod works as expected
              
       ERROR: [DP2003 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:71]
              The first /etc/resolv.conf nameserver 172.18.0.170
              could not resolve kubernetes.default.svc.cluster.local.
              Error: read udp 172.18.0.170:53: connection refused
              This nameserver points to the master's SkyDNS which is critical for
              resolving cluster names, e.g. for Services.
              
       WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
              Error querying nameserver 172.18.0.170:
                read udp 172.18.0.170:53: connection refused
              This may indicate a problem with DNS.
              
       WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
              Error querying nameserver 172.18.0.170:
                read udp 172.18.0.170:53: connection refused
              This may indicate a problem with DNS.
              
       WARN:  [DP2014 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:113]
              A request to the nameserver 172.18.0.170 timed out.
              This could be temporary but could also indicate network or DNS problems.
              
       WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:116]
              Error querying nameserver 172.18.0.170:
                read udp 172.18.0.170:53: connection refused
              This may indicate a problem with DNS.
              
       [Note] Summary of diagnostics execution (version v1.1.3):
       [Note] Warnings seen: 4
       [Note] Errors seen: 2
Comment 9 Luke Meyer 2016-02-24 11:05:29 EST
This output from the diagnostic running in the pod indicates it's not running the latest version of the image on the devenv:

       [Note] Summary of diagnostics execution (version v1.1.3):

If you don't specify anything in --images or --latest-images, the client default tag is used (not the hash, but the base client version). The :v1.1.3 tag would attempt to run that image from dockerhub (unless you tagged some other image with that pre-emptively).
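
To make the tag selection concrete, here is a small Go sketch of the behavior described above. It is an assumption-laden illustration, not the client's implementation: `baseTag` and `deployerImage` are hypothetical names, the version string is assumed to be in git-describe form (base tag, commit count, hash), and --latest-images is assumed to map to the `latest` tag.

```go
package main

import (
	"fmt"
	"strings"
)

// baseTag strips a git-describe style suffix ("-<commits>-g<hash>") from a
// version string, leaving the base release tag (hypothetical helper).
// It cuts at the first "-", so it would also truncate tags like "v1.2.0-rc1".
func baseTag(version string) string {
	if i := strings.Index(version, "-"); i >= 0 {
		return version[:i]
	}
	return version
}

// deployerImage picks the image reference: the "latest" tag when
// latestImages is set, otherwise the base client version.
func deployerImage(version string, latestImages bool) string {
	tag := baseTag(version)
	if latestImages {
		tag = "latest"
	}
	return "openshift/origin-deployer:" + tag
}

func main() {
	fmt.Println(deployerImage("v1.1.3-245-g806dd7e", false))
	fmt.Println(deployerImage("v1.1.3-245-g806dd7e", true))
}
```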

With devenv-rhel7_3526 if I do this:

    # openshift ex diagnostics DiagnosticPod --latest-images

Then the end of the diagnostic output looks like:

-------------
       [Note] Summary of diagnostics execution (version v1.1.3-245-g806dd7e):
       [Note] Warnings seen: 3
       [Note] Errors seen: 2
       
[Note] Summary of diagnostics execution (version v1.1.3-245-g806dd7e):
[Note] Errors seen: 1
-------------

The images on the node are:
# docker images | grep deployer
openshift/origin-deployer                806dd7e             f64522cb8094        9 hours ago         520 MB
openshift/origin-deployer                latest              f64522cb8094        9 hours ago         520 MB
docker.io/openshift/origin-deployer      latest              18e3884452ac        11 hours ago        519.7 MB

The 806dd7e code hash is consistent. Oddly enough, it seems to download the latest deployer from docker.io, but it doesn't use it. I don't quite understand that behavior but it does what we want in the end by using the on-node image.

Once the image tag is synced with the version of the client, this won't be an issue, but for now, until the next version is tagged and released to dockerhub, to test this I think you need --latest-images on the diagnostics invocation. It works for me.
Comment 10 Luke Meyer 2016-02-24 11:09:34 EST
Alternatively, tag the built image with the tag it's looking for, e.g. for devenv-rhel7_3526:

docker tag f64522cb8094 openshift/origin-deployer:v1.1.3

I just tested this and it works this way too, without downloading anything from dockerhub.
Comment 11 Xia Zhao 2016-02-25 04:42:16 EST
Oh, I now realize that the image I used yesterday was wrong. Thanks for the info. Yes, the issue is fixed. Tested on devenv-rhel7_3532 with:
openshift ex diagnostics --images=openshift/origin-deployer:latest
openshift ex diagnostics --latest-images
openshift ex diagnostics DiagnosticPod --latest-images

Image tested is:
openshift/origin-deployer            36bb05708ed1

The output is as follows (both the panic and the duplicate error messages are fixed):

[Note] Running diagnostic: DiagnosticPod
       Description: Create a pod to run diagnostics from the application standpoint
       
ERROR: [DCli2012 from diagnostic DiagnosticPod@openshift/origin/pkg/diagnostics/client/run_diagnostics_pod.go:155]
       See the errors below in the output from the diagnostic pod:
       [Note] Running diagnostic: PodCheckAuth
              Description: Check that service account credentials authenticate as expected
              
       WARN:  [DP1005 from diagnostic PodCheckAuth@openshift/origin/pkg/diagnostics/pod/auth.go:84]
              A request to the master timed out.
              This could be temporary but could also indicate network or DNS problems.
              
       ERROR: [DP1016 from diagnostic PodCheckAuth@openshift/origin/pkg/diagnostics/pod/auth.go:112]
              DNS resolution for registry address docker-registry.default.svc.cluster.local returned an error; container DNS is likely incorrect. The error was: read udp 172.18.8.207:53: connection refused
              
       [Note] Running diagnostic: PodCheckDns
              Description: Check that DNS within a pod works as expected
              
       ERROR: [DP2003 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:72]
              The first /etc/resolv.conf nameserver 172.18.8.207
              could not resolve kubernetes.default.svc.cluster.local.
              Error: read udp 172.18.8.207:53: connection refused
              This nameserver points to the master's SkyDNS which is critical for
              resolving cluster names, e.g. for Services.
              
       WARN:  [DP2015 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:124]
              Error querying nameserver 172.18.8.207:
                read udp 172.18.8.207:53: connection refused
              This may indicate a problem with DNS.
              
       WARN:  [DP2014 from diagnostic PodCheckDns@openshift/origin/pkg/diagnostics/pod/dns.go:119]
              A request to the nameserver 172.18.8.207 timed out.
              This could be temporary but could also indicate network or DNS problems.
              
       [Note] Summary of diagnostics execution (version v1.1.3-267-g0842757):
       [Note] Warnings seen: 3
       [Note] Errors seen: 2


So closing the issue as fixed. Thanks again!
