1459241 – oadm diagnostics NetworkCheck cannot deploy pods on non default nodes

Summary: oadm diagnostics NetworkCheck cannot deploy pods on non default nodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	oc
Sub Component:
Version:	3.5.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Luke Meyer
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1431588
Depends On:
Blocks:

Reported:	2017-06-06 15:31 UTC by Serhat Dirik
Modified:	2020-08-13 09:19 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When the master config specifies a default nodeSelector for the cluster, test projects created by oadm diagnostics NetworkCheck inherited this nodeSelector, and therefore the test pods were also confined to it. Consequence: NetworkCheck test pods could only be scheduled on a subset of nodes, preventing the diagnostic from covering the entire cluster; in some clusters this could even leave too few pods running for the diagnostic to succeed although the cluster itself was healthy. Fix: NetworkCheck now creates the test projects with an empty nodeSelector so the test pods can land on any schedulable node. Result: The diagnostic should be more robust and meaningful.
Clone Of:
Environment:
Last Closed:	2017-08-10 05:26:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1534775	0	medium	CLOSED	oadm diagnostics NetworkCheck fails to schedule pods if there is a default node selector in the master-config	2021-06-10 14:12:24 UTC
Red Hat Product Errata	RHEA-2017:1716	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.6 RPM Release Advisory	2017-08-10 09:02:50 UTC

Internal Links: 1534775

Description Serhat Dirik 2017-06-06 15:31:29 UTC

Description of problem:

oadm diagnostics NetworkCheck cannot deploy diagnostic pods on some nodes

Version-Release number of selected component (if applicable):
  oc v3.5.5.15
  kubernetes v1.5.2+43a9be4
  features: Basic-Auth GSSAPI Kerberos SPNEGO
  Server https://ocp-l01.ocp.trkc.tgc:443
  openshift v3.5.5.15
  kubernetes v1.5.2+43a9be4


How reproducible:

Steps to Reproduce:
1. Create a cluster in which some nodes do not carry the labels matched by the default node selector.
2. Run oadm diagnostics NetworkCheck.

Actual results:
------------------
ERROR: [DNet2008 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:147]

       [Logs for network diagnostic pod on node "ocp-i03.ocp.trkc.tgc" failed: container "network-diag-pod-gsxm7" in pod "network-diag-pod-gsxm7" is not available, Logs for network diagnostic pod on node "ocp-i02.ocp.trkc.tgc" failed: container "network-diag-pod-gx2x0" in pod "network-diag-pod-gx2x0" is not available, Logs for network diagnostic pod on node "ocp-i01.ocp.trkc.tgc" failed: container "network-diag-pod-w4tt4" in pod "network-diag-pod-w4tt4" is not available]
 
Node's syslog:

Jun  4 11:27:48 ocp-i01 atomic-openshift-node: I0604 11:27:48.813533   13536 kubelet.go:1782] SyncLoop (ADD, "api"): "network-diag-pod-j06c0_network-diag-ns-6k9vs(b163c02d-48ff-11e7-b9d6-00505697fb55)"

Jun  4 11:27:48 ocp-i01 atomic-openshift-node: I0604 11:27:48.813695   13536 predicate.go:84] Predicate failed on Pod: network-diag-pod-j06c0_network-diag-ns-6k9vs(b163c02d-48ff-11e7-b9d6-00505697fb55), for reason: Predicate MatchNodeSelector failed

Expected results:
   Successful execution 

Additional info:
    Might be the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1431588

Comment 2 Luke Meyer 2017-06-09 15:03:02 UTC

"Predicate MatchNodeSelector failed" simply means that the pods have a nodeSelector that the node label doesn't match. This is a normal scheduling message where the pod doesn't fit the node.

The likely reason is that there is a default node selector in the master config, and the projects created for this diagnostic just inherit that. Then they won't run on any nodes that aren't selected by the default node selector.
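As a rough illustration of the scheduling effect described above, the following Go sketch models a node selector check. It is a simplified stand-in, not the actual kubelet or scheduler code; the function name `matchesNodeSelector` and the `region=primary` / `region=infra` labels are hypothetical.

```go
package main

import "fmt"

// matchesNodeSelector reports whether every key/value pair in the pod's
// nodeSelector is present in the node's labels. An empty selector places
// no constraint, so it matches any node.
// (Hypothetical helper for illustration; not taken from Kubernetes.)
func matchesNodeSelector(selector, nodeLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Assumed example: the cluster default selector is region=primary,
	// and an infrastructure node is labeled region=infra.
	inherited := map[string]string{"region": "primary"}
	infraNode := map[string]string{"region": "infra"}

	// A pod that inherited the default selector does not fit the infra node.
	fmt.Println(matchesNodeSelector(inherited, infraNode))

	// A pod with an empty selector fits the same node.
	fmt.Println(matchesNodeSelector(map[string]string{}, infraNode))
}
```

Under these assumptions the first call returns false and the second returns true, which corresponds to the "Predicate MatchNodeSelector failed" message appearing only on nodes outside the default selector.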

Indeed, this is the same bug as (RFE) https://bugzilla.redhat.com/show_bug.cgi?id=1431588

Not exactly a bug, just normal functioning, but if users are expecting the network pods to land everywhere, I think it should be possible to implement by just creating the projects with an empty node selector.
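A minimal Go sketch of the idea, assuming the project-level selector is carried in the `openshift.io/node-selector` annotation and that an annotation that is present but empty takes precedence over the cluster default. The function `effectiveNodeSelector` and the `region=primary` default are hypothetical and only model the behavior; this is not the code from the origin repository.

```go
package main

import "fmt"

// nodeSelectorAnnotation is the project annotation assumed to hold a
// per-project node selector.
const nodeSelectorAnnotation = "openshift.io/node-selector"

// effectiveNodeSelector returns the selector applied to pods in a project.
// If the annotation is present, even with an empty value, it wins;
// otherwise the cluster-wide default is used.
// (Hypothetical helper for illustration only.)
func effectiveNodeSelector(annotations map[string]string, clusterDefault string) string {
	if sel, ok := annotations[nodeSelectorAnnotation]; ok {
		return sel
	}
	return clusterDefault
}

func main() {
	clusterDefault := "region=primary" // assumed defaultNodeSelector value

	// Project created without the annotation: inherits the default.
	plain := map[string]string{}
	fmt.Printf("%q\n", effectiveNodeSelector(plain, clusterDefault))

	// Project created with an explicitly empty selector: no restriction.
	unrestricted := map[string]string{nodeSelectorAnnotation: ""}
	fmt.Printf("%q\n", effectiveNodeSelector(unrestricted, clusterDefault))
}
```

The point of the sketch is the distinction between a missing annotation and an empty one: only the latter lets the diagnostic pods land on nodes outside the default selector.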

Comment 3 Serhat Dirik 2017-06-09 15:49:42 UTC

(In reply to Luke Meyer from comment #2)
> "Predicate MatchNodeSelector failed" simply means that the pods have a
> nodeSelector that the node label doesn't match. This is a normal scheduling
> message where the pod doesn't fit the node.
> 
> The likely reason is that there is a default node selector in the master
> config, and the projects created for this diagnostic just inherit that. Then
> they won't run on any nodes that aren't selected by the default node
> selector.
> 
> Indeed, this is the same bug as (RFE)
> https://bugzilla.redhat.com/show_bug.cgi?id=1431588
> 
> Not exactly a bug, just normal functioning, but if users are expecting the
> network pods to land everywhere, I think it should be possible to implement
> by just creating the projects with an empty node selector.

I think it's better to change the default behavior to "run it everywhere", because large clusters always have some special groups of nodes that are kept out of the default nodes specified with "osm_default_node_selector". Infrastructure nodes are just one example of that. When running diagnostic tools, users are not trying to make deployments; they're simply trying to diagnose the cluster, so from their point of view how OCP runs these diagnostics internally is irrelevant.

Comment 4 Luke Meyer 2017-06-16 14:57:59 UTC

https://github.com/openshift/origin/pull/14686

Comment 6 zhaozhanqi 2017-06-29 06:54:02 UTC

Verified this bug on:
# openshift version
openshift v3.6.126.1
kubernetes v1.6.1+5115d708d7
etcd 3.2.0


1. Changed master-config.yaml:
     defaultNodeSelector: "test=zzhao"
2. Restarted the master service.
3. Ran 'oadm diagnostics NetworkCheck'.
4. Checked that the diagnostic pods are scheduled on the nodes.

Comment 7 zhaozhanqi 2017-06-29 07:57:54 UTC

*** Bug 1431588 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2017-08-10 05:26:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
