Bug 1617990 - Task [Retrieve list of schedulable nodes matching selector] returns an empty list, causing the Prometheus installation along with OCP to fail on OpenStack
Summary: Task [Retrieve list of schedulable nodes matching selector] returns an empty list, causing the Prometheus installation along with OCP to fail on OpenStack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Paul Gier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-16 11:02 UTC by Junqi Zhao
Modified: 2018-10-11 07:25 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The sync pods were not yet available when the Prometheus install checked for available nodes.
Consequence: If a custom label is used to select the nodes for the Prometheus install, the sync pods must have already applied that label to the nodes; otherwise the Prometheus installer will not find any nodes with a matching label.
Fix: A check was added to the install process to wait for the sync pods to become available before continuing.
Result: This ensures that the node labeling process is complete, so the nodes have the correct labels by the time the Prometheus pod is scheduled.
Clone Of:
Environment:
Last Closed: 2018-10-11 07:25:20 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Bugzilla 1609019 (medium, CLOSED): Prometheus deployment failed due to "No schedulable nodes found matching node selector" but it is not true (last updated 2021-02-22 00:41:40 UTC)
Red Hat Product Errata RHBA-2018:2652 (last updated 2018-10-11 07:25:45 UTC)

Internal Links: 1609019

Description Junqi Zhao 2018-08-16 11:02:11 UTC
Description of problem:
Installing Prometheus along with OCP on an OpenStack HA environment fails at TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on],
because TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] returns an empty list, even though the list is not actually empty (matching nodes exist, as shown below).
This issue happens frequently on OpenStack; GCE/AWS do not have this issue.

********************************************************************************
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Thursday 16 August 2018  12:13:06 +0800 (0:00:00.106)       0:24:10.646 ******* 
ok: [host-8-241-153.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "results": {"cmd": "/usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 16 August 2018  12:13:07 +0800 (0:00:00.462)       0:24:11.109 ******* 
fatal: [host-8-241-153.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false, 
    "changed": false, 
    "evaluated_to": false, 
    "msg": "No schedulable nodes found matching node selector for Prometheus - 'role=node'"
}
    to retry, use: --limit @/home/slave6/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry
********************************************************************************
# /usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true
NAME                              STATUS    ROLES     AGE       VERSION
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0


# oc get node --show-labels | grep role=node
preserve0-shareir2-node-1          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-1,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-2          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-2,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-3          Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-3,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-infra-1    Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-1,node-role.kubernetes.io/infra=true,role=node
preserve0-shareir2-node-infra-2    Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-2,node-role.kubernetes.io/infra=true,role=node
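
For anyone debugging a similar failure: these labels are applied by the sync pods (see the Doc Text above), so it is worth checking whether the sync daemonset had finished rolling out when the task ran. The daemonset name "sync" and namespace "openshift-node" below are assumptions based on OCP 3.10+ conventions, not taken from this report:

# oc get daemonset sync -n openshift-node
# oc get pods -n openshift-node -o wide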

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-docs-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-roles-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus along with OCP on OpenStack; see the [Additional info] section below for the relevant inventory parameters.

Actual results:
The install fails with "No schedulable nodes found matching node selector for Prometheus - 'role=node'", even though nodes carrying the role=node label exist.

Expected results:
The installer finds the labeled nodes and the Prometheus install completes successfully.

Additional info:
openshift_hosted_prometheus_deploy: true
openshift_prometheus_node_selector: '{"role":"node"}'

Comment 1 Paul Gier 2018-08-16 20:45:49 UTC
I can't think of any reason why GCE/AWS would behave differently from OpenStack here unless there is some difference in the installation options that adds the "role=node" label on GCE/AWS.  By default this label is no longer added (as of 3.10, I think), and instead the selector for non-infra, non-master nodes is "node-role.kubernetes.io/compute=true".

In order to add the "role=node" label at install time, you'd have to modify the node group definition (https://docs.openshift.com/container-platform/3.10/install/configuring_inventory_file.html#configuring-node-host-labels)
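
For example, a node group entry along these lines would apply the label at install time (a sketch based on the linked docs; the group name and label list here are illustrative, adjust for your inventory):

openshift_node_groups=[{'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true', 'role=node']}]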

I was able to reproduce the issue using this selector in my inventory:
openshift_prometheus_node_selector: {"role":"node"}

I think this is expected, since the label is not added by default.  After manually adding the "role=node" label to the node, the Prometheus installer correctly started the pods on the labeled node.
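
For reference, the manual labeling step is the standard oc command (<node-name> is a placeholder):

# oc label node <node-name> role=node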

Does it work for you to use the newer label?
openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute": "true"}
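
To confirm in advance which nodes that selector would match, you can run the same query the installer uses:

# oc get node --selector=node-role.kubernetes.io/compute=true --field-selector=spec.unschedulable!=true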

Comment 2 Junqi Zhao 2018-08-17 00:48:08 UTC
Will try with openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute": "true"} later; maybe the problem is related to our environment template.

Comment 3 Junqi Zhao 2018-08-17 09:05:38 UTC
Same issue with 3.10; see Bug 1609019.

Comment 4 Paul Gier 2018-08-24 13:06:33 UTC
PR with the fix for 3.11: https://github.com/openshift/openshift-ansible/pull/9661
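
For context, the fix gates the install on the sync pods having finished rolling out before the node-selector check runs. A minimal shell sketch of an equivalent wait (the daemonset name "sync" and namespace "openshift-node" are assumptions, not taken from the PR):

# Poll the sync daemonset until every desired pod is ready, i.e. the
# node-labeling pass is complete, before checking for schedulable nodes.
while true; do
  desired=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.desiredNumberScheduled}')
  ready=$(oc get daemonset sync -n openshift-node -o jsonpath='{.status.numberReady}')
  if [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$desired" = "$ready" ]; then
    break
  fi
  sleep 10
done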

Comment 5 Junqi Zhao 2018-08-29 01:53:33 UTC
The issue is fixed; installing Prometheus along with OCP is successful now.

# openshift version
openshift v3.11.0-0.24.0

openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

Comment 7 errata-xmlrpc 2018-10-11 07:25:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

