Bug 1320939 - oadm diagnostics failed at "Check if master is also running node" step.
Summary: oadm diagnostics failed at "Check if master is also running node" step.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Luke Meyer
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-03-24 10:35 UTC by Johnny Liu
Modified: 2016-05-12 16:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:34:04 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2016:1064
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update
Last Updated: 2016-05-12 20:19:17 UTC

Description Johnny Liu 2016-03-24 10:35:53 UTC
Description of problem:
Running "oadm diagnostics" on the master produces an error:
<--snip-->
ERROR: [DClu3002 from diagnostic MasterNode@openshift/origin/pkg/diagnostics/cluster/master_node.go:99]
       Client error while retrieving node records. Client retrieved records
       during discovery, so this is likely to be a transient error. Try running
       diagnostics again. If this message persists, there may be a permissions
       problem with getting node records. The error was:
       
       (*errors.StatusError) found '<', expected: !, identifier, or 'end of string'
<--snip-->

ose-3.1 does not have this issue.

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.0.6-1.git.0.19d1bde.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Luke Meyer 2016-03-24 13:13:37 UTC
I've seen this too but have not had a chance to look into it. Devan, do you have any idea what's going on here?

Comment 2 Avesh Agarwal 2016-03-24 16:40:54 UTC
This is reproducing in latest origin:


[Note] Running diagnostic: MasterNode
       Description: Check if master is also running node (for Open vSwitch)
       
ERROR: [DClu3002 from diagnostic MasterNode@openshift/origin/pkg/diagnostics/cluster/master_node.go:99]
       Client error while retrieving node records. Client retrieved records
       during discovery, so this is likely to be a transient error. Try running
       diagnostics again. If this message persists, there may be a permissions
       problem with getting node records. The error was:
       
       (*errors.StatusError) found '<', expected: !, identifier, or 'end of string'

Comment 3 Avesh Agarwal 2016-03-24 17:26:35 UTC
I don't see this error if I run with the master config file:

oadm diagnostics --master-config=./openshift.local.config/master/master-config.yaml

And it shows: [Note] Skipping diagnostic: MasterNode
       Description: Check if master is also running node (for Open vSwitch)
       Because: Network plugin does not require master to also run node:

I am testing it in a one master and one node setup.

Comment 4 Avesh Agarwal 2016-03-24 17:30:38 UTC
It seems to me that this error happens when the master is running only as a master, not as a combined master and node for Open vSwitch. By default (without --master-config), this oadm diagnostic seems to assume that the master is also running as a node, which is not the case here, hence the error.

Comment 5 Avesh Agarwal 2016-03-24 17:58:49 UTC
Since this bug does not include enough information about the setup (the OpenShift cluster) or the steps to reproduce, I am assuming that the cause of the error is the same as what I am seeing.

Comment 6 Avesh Agarwal 2016-03-24 18:11:50 UTC
From the comment in CanRun() in pkg/diagnostics/cluster/master_node.go,  

        // If there is a master config file available, we'll perform an additional
        // check to see if an OVS network plugin is in use. If no master config,
        // we assume this is the case for now and let the check run anyhow.

It seems pretty obvious that this diagnostic assumes the master is always running as a node too when no master config file is provided, which should be true in real-world deployments.

However, in my dev setup I am running the master only as a master, and not passing any master config to "oadm diagnostics" causes this error, which seems harmful. It would be better for "oadm diagnostics" to detect this by itself, but this does not seem like a blocker to me at the moment unless Johnny Liu (the reporter) confirms otherwise.

Comment 7 Avesh Agarwal 2016-03-24 18:21:01 UTC
I meant that the error seems harmless, not harmful.

Comment 8 Avesh Agarwal 2016-03-24 19:00:53 UTC
Somehow I think that the following line:

nodes, err := d.KubeClient.Nodes().List(kapi.ListOptions{LabelSelector: labels.Nothing()})

in Check() in pkg/diagnostics/cluster/master_node.go seems strange, as why it is trying to find "nodes without any label selector", or does it mean "nodes with any label selector"?

If the latter, shouldn't it be:
nodes, err := d.KubeClient.Nodes().List(kapi.ListOptions{})
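
For readers following the question above, here is a toy model of the two selector flavours, going by the documented meaning of labels.Nothing() (matches no label set) and labels.Everything() (matches every label set). This is only a sketch: the `Selector`, `node`, and `list` names are hypothetical stand-ins, not the Kubernetes labels package or the client used in master_node.go.

```go
package main

import "fmt"

// Selector is a hypothetical stand-in for a label selector.
type Selector interface {
	Matches(labels map[string]string) bool
}

// everything models a selector that accepts every label set,
// including an empty one.
type everything struct{}

func (everything) Matches(map[string]string) bool { return true }

// nothing models a selector that rejects every label set.
type nothing struct{}

func (nothing) Matches(map[string]string) bool { return false }

// node is a hypothetical, minimal node record.
type node struct {
	Name   string
	Labels map[string]string
}

// list returns the names of the nodes whose labels satisfy sel.
func list(nodes []node, sel Selector) []string {
	var out []string
	for _, n := range nodes {
		if sel.Matches(n.Labels) {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{Name: "master"}, // no labels at all
		{Name: "node1", Labels: map[string]string{"region": "infra"}},
	}
	fmt.Println("everything:", list(nodes, everything{}))
	fmt.Println("nothing:", list(nodes, nothing{}))
}
```

In this model a "nothing" selector can never return a node, labelled or not, so a check that needs to see all nodes would want the "everything" flavour.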

Comment 9 Avesh Agarwal 2016-03-24 20:22:11 UTC
I had a discussion with dgoodwin on IRC and sent a PR to fix an issue where oadm diagnostics does not seem to find any nodes on the same machine as the master:

https://github.com/openshift/origin/pull/8249

However, here are more thoughts on the different cases for oadm diagnostics (specifically the master-node check for the Open vSwitch SDN):

1. oadm diagnostics is run with --master-config
behavior: the diagnostic should determine whether the OVS SDN plugin is in use

1a) if OVS is in use and a node exists, the diagnostic should pass; if it fails, something is wrong.
1b) if OVS is in use and a node does not exist, the diagnostic fails.

2. oadm diagnostics is run without --master-config
behavior: the diagnostic cannot determine whether the OVS SDN plugin is in use, but continues with the check anyway:

2a) a node exists: it passes; if it fails, something is wrong.
2b) a node does not exist: it fails.

Note: currently it seems that the diagnostic cannot tell whether a node exists only for the Open vSwitch SDN (unschedulable) or a real node exists on the same machine (schedulable or not). Maybe it does not matter, but I am pointing it out. Also, the diagnostic does not seem to take the node's status (Ready, NotReady) into account (again, not sure if it matters).

Perhaps more discussion is needed to figure out what is expected out of this diagnostic to make it more useful.
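
The cases above can be sketched as a small decision function. This is an illustration under assumptions, not the code in master_node.go: the names `Outcome` and `masterNodeOutcome` are hypothetical, and the skip branch is taken from the "Skipping diagnostic: MasterNode" output in comment 3.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what the diagnostic reports.
type Outcome string

const (
	Skip Outcome = "skip"
	Pass Outcome = "pass"
	Fail Outcome = "fail"
)

// masterNodeOutcome models the cases listed in this comment.
//
//	haveConfig   - --master-config was passed
//	ovsPlugin    - the config says an OVS network plugin is in use
//	               (only meaningful when haveConfig is true)
//	nodeOnMaster - a node record exists for the master's machine
func masterNodeOutcome(haveConfig, ovsPlugin, nodeOnMaster bool) Outcome {
	// With a config and no OVS plugin, the master does not need to run
	// a node, so the check is skipped (as seen in comment 3).
	if haveConfig && !ovsPlugin {
		return Skip
	}
	// Cases 1a/2a: a node exists, so the check should pass.
	if nodeOnMaster {
		return Pass
	}
	// Cases 1b/2b: no node was found on the master.
	return Fail
}

func main() {
	fmt.Println(masterNodeOutcome(true, true, true))    // case 1a
	fmt.Println(masterNodeOutcome(true, true, false))   // case 1b
	fmt.Println(masterNodeOutcome(false, false, true))  // case 2a
	fmt.Println(masterNodeOutcome(false, false, false)) // case 2b
}
```

The sketch deliberately ignores schedulability and Ready/NotReady status, which are the open questions raised in the note above.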

Comment 12 Avesh Agarwal 2016-03-28 13:39:13 UTC
Johnny, thanks for the information. I sent this PR to origin, and it should fix this issue:

https://github.com/openshift/origin/pull/8249

Comment 13 Luke Meyer 2016-03-28 14:15:38 UTC
(In reply to Avesh Agarwal from comment #8)
> nodes, err := d.KubeClient.Nodes().List(kapi.ListOptions{LabelSelector:
> labels.Nothing()})
> 
> in Check() in pkg/diagnostics/cluster/master_node.go seems strange, as why
> it is trying to find "nodes without any label selector", or does it mean
> "nodes with any label selector"?

It means nodes that match an empty label selector. Which is all nodes of course. There is no way to match only nodes that *don't* have any labels. I agree it's a bit confusing...

> If the latter, shouldn't it be:
> nodes, err := d.KubeClient.Nodes().List(kapi.ListOptions{})

In the past, yes, but at some point that apparently became a malformed request, i.e. the LabelSelector element became mandatory, even if the selector itself is empty.

Thanks for the PR, it seems to fix the issue.

Comment 14 Luke Meyer 2016-03-28 16:36:47 UTC
Fix merged in Origin.

Comment 15 Troy Dawson 2016-03-30 18:59:18 UTC
Should be in atomic-openshift-3.2.0.9-1.git.0.b99af7d.el7, which is now built and ready for testing.

Comment 16 Johnny Liu 2016-03-31 13:12:51 UTC
Verified this bug with atomic-openshift-3.2.0.9-1.git.0.b99af7d.el7.x86_64, and PASS.
<--output-->
[Note] Running diagnostic: MasterNode
       Description: Check if master is also running node (for Open vSwitch)
       
Info:  Found a node with same IP as master: 10.66.78.46
<--output-->

Comment 18 errata-xmlrpc 2016-05-12 16:34:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

