Bug 1459241 - oadm diagnostics NetworkCheck cannot deploy pods on non default nodes
oadm diagnostics NetworkCheck cannot deploy pods on non default nodes
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Command Line Interface
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Assigned To: Luke Meyer
QA Contact: zhaozhanqi
Duplicates: 1431588
Reported: 2017-06-06 11:31 EDT by Serhat Dirik
Modified: 2017-08-16 15 EDT
CC: 11 users

Doc Type: Bug Fix
Doc Text:
Cause: When the master config specifies a default nodeSelector for the cluster, the test projects created by oadm diagnostics NetworkCheck inherited this nodeSelector, so the test pods were confined to the nodes it selected.
Consequence: NetworkCheck test pods could only be scheduled on a subset of nodes, preventing the diagnostic from covering the entire cluster; in some clusters this could even leave too few pods running for the diagnostic to succeed even when the cluster was healthy.
Fix: NetworkCheck now creates the test projects with an empty nodeSelector so they can land on any schedulable node.
Result: The diagnostic should be more robust and meaningful.
Last Closed: 2017-08-10 01:26:47 EDT
Type: Bug


Description Serhat Dirik 2017-06-06 11:31:29 EDT
Description of problem:

oadm diagnostics NetworkCheck cannot deploy diagnostic pods on some nodes

Version-Release number of selected component (if applicable):
  oc v3.5.5.15
  kubernetes v1.5.2+43a9be4
  features: Basic-Auth GSSAPI Kerberos SPNEGO
  Server https://ocp-l01.ocp.trkc.tgc:443
  openshift v3.5.5.15
  kubernetes v1.5.2+43a9be4


How reproducible:

Steps to Reproduce:
1. Create a cluster in which some nodes do not carry the labels matched by the default node selector
2. Execute oadm diagnostics NetworkCheck

Actual results:
------------------
ERROR: [DNet2008 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:147]

       [Logs for network diagnostic pod on node "ocp-i03.ocp.trkc.tgc" failed: container "network-diag-pod-gsxm7" in pod "network-diag-pod-gsxm7" is not available, Logs for network diagnostic pod on node "ocp-i02.ocp.trkc.tgc" failed: container "network-diag-pod-gx2x0" in pod "network-diag-pod-gx2x0" is not available, Logs for network diagnostic pod on node "ocp-i01.ocp.trkc.tgc" failed: container "network-diag-pod-w4tt4" in pod "network-diag-pod-w4tt4" is not available]
 
Node's syslog:

Jun  4 11:27:48 ocp-i01 atomic-openshift-node: I0604 11:27:48.813533   13536 kubelet.go:1782] SyncLoop (ADD, "api"): "network-diag-pod-j06c0_network-diag-ns-6k9vs(b163c02d-48ff-11e7-b9d6-00505697fb55)"

Jun  4 11:27:48 ocp-i01 atomic-openshift-node: I0604 11:27:48.813695   13536 predicate.go:84] Predicate failed on Pod: network-diag-pod-j06c0_network-diag-ns-6k9vs(b163c02d-48ff-11e7-b9d6-00505697fb55), for reason: Predicate MatchNodeSelector failed

Expected results:
   Successful execution 

Additional info:
    Might be same bug as in https://bugzilla.redhat.com/show_bug.cgi?id=1431588
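
The "Predicate MatchNodeSelector failed" entries in the syslog can be cross-checked against the project's node selector. This is a hypothetical inspection sketch, assuming the test namespace from the error output (network-diag-ns-6k9vs) still exists; the openshift.io/node-selector annotation shows the selector the project inherited from the master's defaultNodeSelector:

```shell
# Show the node selector the diagnostic project inherited (namespace name
# taken from the syslog above; it is deleted after the run, so this only
# works while the diagnostic is in flight or if cleanup failed).
oc get namespace network-diag-ns-6k9vs -o yaml | grep node-selector

# Compare against the labels on a node where the pod failed to schedule:
oc get node ocp-i01.ocp.trkc.tgc --show-labels
```

If the annotation value does not match any label on the node, the scheduler rejects the pod exactly as logged.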
Comment 2 Luke Meyer 2017-06-09 11:03:02 EDT
"Predicate MatchNodeSelector failed" simply means that the pods have a nodeSelector that the node label doesn't match. This is a normal scheduling message where the pod doesn't fit the node.

The likely reason is that there is a default node selector in the master config, and the projects created for this diagnostic just inherit that. Then they won't run on any nodes that aren't selected by the default node selector.

Indeed it is the same bug as (RFE) https://bugzilla.redhat.com/show_bug.cgi?id=1431588

Not exactly a bug, just normal functioning; but since users expect the network diagnostic pods to land everywhere, it should be possible to implement that by simply creating the test projects with an empty node selector.
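
The suggested approach can be sketched manually: a project created with an explicit empty node selector overrides the master's defaultNodeSelector, so pods in it can schedule on any node. The project name below is illustrative:

```shell
# Sketch of the fix idea: --node-selector="" overrides the cluster-wide
# defaultNodeSelector for this project (hypothetical project name).
oadm new-project network-diag-test --node-selector=""

# Verify the override: the annotation should be present but empty.
oc get namespace network-diag-test -o yaml | grep openshift.io/node-selector
```

This is the same mechanism the eventual fix applies to the diagnostic's own test projects.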
Comment 3 Serhat Dirik 2017-06-09 11:49:42 EDT
(In reply to Luke Meyer from comment #2)
> "Predicate MatchNodeSelector failed" simply means that the pods have a
> nodeSelector that the node label doesn't match. This is a normal scheduling
> message where the pod doesn't fit the node.
> 
> The likely reason is that there is a default node selector in the master
> config, and the projects created for this diagnostic just inherit that. Then
> they won't run on any nodes that aren't selected by the default node
> selector.
> 
> Indeed is same bug as (RFE)
> https://bugzilla.redhat.com/show_bug.cgi?id=1431588
> 
> Not exactly a bug, just normal functioning, but if users are expecting the
> network pods to land everywhere, I think it should be possible to implement
> by just creating the projects with an empty node selector.

I think it's better to change the default behavior to "run it everywhere", because large clusters always have some special groups of nodes that are kept out of the default nodes specified with "osm_default_node_selector". Infrastructure nodes are just one example of that. When running diagnostic tools, users are not trying to make deployments; they are simply trying to diagnose the cluster, so from their point of view how OCP runs this diagnostic internally is irrelevant.
Comment 4 Luke Meyer 2017-06-16 10:57:59 EDT
https://github.com/openshift/origin/pull/14686
Comment 6 zhaozhanqi 2017-06-29 02:54:02 EDT
Verified this bug on:
# openshift version
openshift v3.6.126.1
kubernetes v1.6.1+5115d708d7
etcd 3.2.0


1. Changed master-config.yaml:
     defaultNodeSelector: "test=zzhao"
2. Restarted the master service
3. Ran 'oadm diagnostics NetworkCheck'
4. Checked that the diagnostic pods were scheduled on the nodes, including nodes not matching the default selector.
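
A sketch of how step 4 might be checked, assuming the default node selector from step 1 is in place; the diagnostic creates namespaces prefixed network-diag-ns-, and with the fix the pods should land on every schedulable node, not just nodes labeled test=zzhao:

```shell
# While the diagnostic is running, list its pods with their assigned nodes;
# every schedulable node should appear in the NODE column.
oc get pods --all-namespaces -o wide | grep network-diag
```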
Comment 7 zhaozhanqi 2017-06-29 03:57:54 EDT
*** Bug 1431588 has been marked as a duplicate of this bug. ***
Comment 9 errata-xmlrpc 2017-08-10 01:26:47 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
