Bug 1617990 - Task [Retrieve list of schedulable nodes matching selector] returns an empty list, causing the Prometheus installation along with OCP to fail on OpenStack
Summary: Task [Retrieve list of schedulable nodes matching selector] returns an empty list, causing the Prometheus installation along with OCP to fail on OpenStack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-16 11:02 UTC by Junqi Zhao
Modified: 2018-10-11 07:25 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The sync pods were not yet available when the Prometheus install checked for available nodes.
Consequence: If a custom label is used to select the nodes for the Prometheus install, the sync pods must have already applied that label to the nodes; otherwise the Prometheus installer will not find any nodes with a matching label.
Fix: A check was added to the install process to wait for the sync pods to become available before continuing.
Result: This ensures that the node labeling process is complete, so the nodes have the correct labels by the time the Prometheus pod is scheduled.
Clone Of:
Environment:
Last Closed: 2018-10-11 07:25:20 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Bugzilla 1609019 (medium, CLOSED): Prometheus deployment failed due to "No schedulable nodes found matching node selector" but it is not true (last updated 2021-02-22 00:41:40 UTC)
Red Hat Product Errata RHBA-2018:2652 (last updated 2018-10-11 07:25:45 UTC)

Internal Links: 1609019

Description Junqi Zhao 2018-08-16 11:02:11 UTC
Description of problem:
Installing Prometheus along with OCP on an OpenStack HA environment fails at TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on],
because TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] returns an empty list, even though the list is not actually empty (matching nodes exist, as shown below).
This issue happens frequently on OpenStack; GCE/AWS do not have this issue.

********************************************************************************
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Thursday 16 August 2018  12:13:06 +0800 (0:00:00.106)       0:24:10.646 ******* 
ok: [host-8-241-153.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "results": {"cmd": "/usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 16 August 2018  12:13:07 +0800 (0:00:00.462)       0:24:11.109 ******* 
fatal: [host-8-241-153.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false, 
    "changed": false, 
    "evaluated_to": false, 
    "msg": "No schedulable nodes found matching node selector for Prometheus - 'role=node'"
}
    to retry, use: --limit @/home/slave6/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry
********************************************************************************
# /usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true
NAME                              STATUS    ROLES     AGE       VERSION
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0


# oc get node --show-labels | grep role=node
preserve0-shareir2-node-1          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-1,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-2          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-2,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-3          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-3,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-infra-1    Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-1,node-role.kubernetes.io/infra=true,role=node
preserve0-shareir2-node-infra-2    Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-2,node-role.kubernetes.io/infra=true,role=node
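
For anyone debugging a similar failure: these labels are applied by the sync pods (see the Doc Text above), so it is worth checking whether the sync daemonset had finished rolling out when the task ran. The daemonset name "sync" and namespace "openshift-node" below are assumptions based on OCP 3.10+ conventions, not taken from this report:

# oc get daemonset sync -n openshift-node
# oc get pods -n openshift-node -o wide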

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-docs-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-roles-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus along with OCP on OpenStack; see the [Additional info] section below for the relevant inventory parameters.

Actual results:
The install fails with "No schedulable nodes found matching node selector for Prometheus - 'role=node'", even though nodes carrying the role=node label exist.

Expected results:
The installer finds the labeled nodes and the Prometheus install completes successfully.

Additional info:
openshift_hosted_prometheus_deploy: true
openshift_prometheus_node_selector: '{"role":"node"}'

Comment 1 Paul Gier 2018-08-16 20:45:49 UTC
I can't think of any reason why GCE/AWS would behave differently from OpenStack here unless there is some difference in the installation options that adds the "role=node" label on GCE/AWS.  By default this label is no longer added (as of 3.10, I think), and instead the selector for non-infra, non-master nodes is "node-role.kubernetes.io/compute=true".

In order to add the "role=node" label at install time, you'd have to modify the node group definition (https://docs.openshift.com/container-platform/3.10/install/configuring_inventory_file.html#configuring-node-host-labels)
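
For example, a node group entry along these lines would apply the label at install time (a sketch based on the linked docs; the group name and label list here are illustrative, adjust for your inventory):

openshift_node_groups=[{'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true', 'role=node']}]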

I was able to reproduce the issue using this selector in my inventory:
openshift_prometheus_node_selector: {"role":"node"}

I think this is expected, since the label is not added by default.  After manually adding the "role=node" label to the node, the Prometheus installer correctly started the pods on the labeled node.
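
For reference, the manual labeling step is the standard oc command (<node-name> is a placeholder):

# oc label node <node-name> role=node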

Does it work for you to use the newer label?
openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute": "true"}
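
To confirm in advance which nodes that selector would match, you can run the same query the installer uses:

# oc get node --selector=node-role.kubernetes.io/compute=true --field-selector=spec.unschedulable!=true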

Comment 2 Junqi Zhao 2018-08-17 00:48:08 UTC
Will try with openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute": "true"} later; maybe the problem is related to our environment template.

Comment 3 Junqi Zhao 2018-08-17 09:05:38 UTC
Same issue with 3.10; see Bug 1609019.

Comment 4 Paul Gier 2018-08-24 13:06:33 UTC
PR with the fix for 3.11: https://github.com/openshift/openshift-ansible/pull/9661
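
For context, the fix gates the install on the sync pods having finished rolling out before the node-selector check runs. A minimal shell sketch of an equivalent wait (the daemonset name "sync" and namespace "openshift-node" are assumptions, not taken from the PR):

# Poll the sync daemonset until every desired pod is ready, i.e. the
# node-labeling pass is complete, before checking for schedulable nodes.
while true; do
  desired=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.desiredNumberScheduled}')
  ready=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.numberReady}')
  if [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$desired" = "$ready" ]; then
    break
  fi
  sleep 10
done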

Comment 5 Junqi Zhao 2018-08-29 01:53:33 UTC
The issue is fixed; installing Prometheus along with OCP is successful now.

# openshift version
openshift v3.11.0-0.24.0

openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

Comment 7 errata-xmlrpc 2018-10-11 07:25:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

