Bug 1534775

Summary: oadm diagnostics NetworkCheck fails to schedule pods if there is a default node selector in the master-config
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: oc
Assignee: Luke Meyer <lmeyer>
Status: CLOSED ERRATA
QA Contact: zhaozhanqi <zzhao>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.5.0
CC: aos-bugs, erjones, jokerman, lmeyer, mmccomas
Target Milestone: ---
Target Release: 3.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the master config specifies a default nodeSelector for the cluster, the test projects created by oadm diagnostics NetworkCheck received this nodeSelector, so the test pods were also confined to it.
Consequence: NetworkCheck test pods could only be scheduled on a subset of nodes, preventing the diagnostic from covering the entire cluster; in some clusters this could even leave too few pods running for the diagnostic to succeed, even when the cluster was healthy.
Fix: NetworkCheck now creates its test projects with an empty nodeSelector so the test pods can land on any schedulable node.
Result: The diagnostic should be more robust and meaningful.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-12 06:01:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Eric Jones 2018-01-15 23:14:04 UTC
Description of problem:
A default node selector in the master config seems to prevent the diagnostics pods from deploying properly.


The customer has the following setting in their master config, and oadm diagnostics NetworkCheck fails, complaining about this selector (output of `oadm diagnostics NetworkCheck --diaglevel=0 --loglevel=8` to follow shortly):

projectConfig:
  defaultNodeSelector: "purpose=work"
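
For illustration only (the label value comes from the config above), with that default selector in place, only matching nodes are eligible to run the NetworkCheck test pods, which you can see with something like:

  $ oc get nodes -l purpose=work       # the only nodes where the test pods can be scheduled
  $ oc get nodes -l 'purpose!=work'    # nodes the diagnostic cannot cover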

Comment 3 Luke Meyer 2018-01-16 21:55:24 UTC
This problem was fixed in origin with https://github.com/openshift/origin/pull/14686, which was released in 3.6.z per https://access.redhat.com/errata/RHEA-2017:1716.

The 3.5 backport occurred in the private repo with https://github.com/openshift/ose/pull/849, which merged 2017-08-23. No formal bug was created to track this into an errata; however, I would expect it to have been built and released a week later with https://access.redhat.com/errata/RHBA-2017:1828 in atomic-openshift-clients-3.5.5.31.19-1.git.0.b23f57a.el7.x86_64.rpm. That said, the package could have been built before the merge and released after, so we might need to look a little later to be exact.

I can't see the specific version of the client in this case, so I can't tell whether it's as recent as that. It is the client that defines the test projects with or without an empty node selector, so there is no need for server-side updates; you should be able to test with just an updated client. If it doesn't work with a more recent 3.5 (or even 3.6/3.7) client, then we need to figure out why the fix isn't being included.
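
As a rough sketch of that client-side behavior (the project name below is made up, purely to illustrate the idea of an empty node selector), the fixed client creates its temporary test projects with an empty node selector, which overrides the cluster default from master-config.yaml:

  $ oadm new-project network-diag-example --node-selector=''
  $ oc get namespace network-diag-example -o yaml | grep node-selector
  # expected: openshift.io/node-selector: ""   <- empty selector, so pods can land on any schedulable node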

Comment 5 Eric Jones 2018-01-17 16:50:07 UTC
@Luke, apologies, you didn't see the exact version the customer is running because I failed to include it. They recently upgraded to 3.5.5.31 and are still seeing this behavior, which makes me think we likely did not include the fix in that release.

Comment 7 Luke Meyer 2018-01-17 19:38:14 UTC
I saw 3.5.5.31 in the client version, but the errata release is more specific: 3.5.5.31.19-1 -- that's why I don't know whether this is expected or not. I would be surprised if the most recent 3.5 client exhibited this problem, though. It's also fixed in 3.6+.

Comment 9 zhaozhanqi 2018-02-01 05:52:50 UTC
Verified this bug on oc v3.5.5.31.60

It has been fixed.

Steps (a command-level sketch follows below):

1. Change the master-config.yaml:
     defaultNodeSelector: "test=zzhao"
2. Restart the master service.
3. Run 'oadm diagnostics NetworkCheck'.
4. Check that the test pods are scheduled on the nodes.
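
A minimal sketch of these steps, assuming a single master with default paths (the service name, config path, and the "network-diag" name filter may differ per environment):

  # 1. add under projectConfig in /etc/origin/master/master-config.yaml:
  #      defaultNodeSelector: "test=zzhao"
  $ systemctl restart atomic-openshift-master                    # 2. restart the master
  $ oadm diagnostics NetworkCheck                                 # 3. run the diagnostic
  $ oc get pods --all-namespaces -o wide | grep network-diag      # 4. confirm the test pods were scheduled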

Comment 10 Luke Meyer 2018-02-06 01:00:29 UTC
So, it seems this was fixed with a previous errata.

Comment 11 Eric Jones 2018-02-21 21:47:27 UTC
@Luke, do you know what version of OpenShift the oc v3.5.5.31.60 came with?

I got my customer to give me the following:

$ rpm -qa | grep -ie openshift -ie ocp -ie ansible -ie ose
atomic-openshift-master-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-filter-plugins-3.5.78-1.git.0.f7be576.el7.noarch
tuned-profiles-atomic-openshift-node-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-3.5.78-1.git.0.f7be576.el7.noarch
atomic-openshift-excluder-3.5.5.31.36-1.git.0.fd415e7.el7.noarch
atomic-openshift-node-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-lookup-plugins-3.5.78-1.git.0.f7be576.el7.noarch
openshift-ansible-playbooks-3.5.78-1.git.0.f7be576.el7.noarch
atomic-openshift-docker-excluder-3.5.5.31.36-1.git.0.fd415e7.el7.noarch
atomic-openshift-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-callback-plugins-3.5.78-1.git.0.f7be576.el7.noarch
atomic-openshift-utils-3.5.78-1.git.0.f7be576.el7.noarch
atomic-openshift-clients-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-docs-3.5.78-1.git.0.f7be576.el7.noarch
ansible-2.2.3.0-1.el7.noarch
atomic-openshift-sdn-ovs-3.5.5.31-1.git.0.b6f55a2.el7.x86_64
openshift-ansible-roles-3.5.78-1.git.0.f7be576.el7.noarch

Comment 12 Luke Meyer 2018-02-22 14:06:10 UTC
(In reply to Eric Jones from comment #11)
> @Luke, do you know what version of OpenShift the oc v3.5.5.31.60 came with?

It doesn't look like 3.5.5.31.60 has been released yet, but 3.5.5.31.48 was released two months ago, and that should be fine:
https://access.redhat.com/errata/RHBA-2017:3438
https://access.redhat.com/downloads/content/rhel---7/x86_64/5801/atomic-openshift-clients/3.5.5.31.48-1.git.0.245c039.el7/x86_64/fd431d51/package

Again, this is purely a client-side fix.

> atomic-openshift-clients-3.5.5.31-1.git.0.b6f55a2.el7.x86_64

I think that's the client at GA.
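
For reference, a quick way to confirm which client build is actually installed on the host running the diagnostic:

  $ rpm -q atomic-openshift-clients
  $ oc version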

Comment 15 errata-xmlrpc 2018-04-12 06:01:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106