Description of problem:
Installing Prometheus along with OCP on an OpenStack HA environment fails at TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on], because TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] returns an empty list, even though running the same command manually does not return an empty list. This issue happens frequently on OpenStack; GCE/AWS do not have this issue.

********************************************************************************
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Thursday 16 August 2018  12:13:06 +0800 (0:00:00.106)       0:24:10.646 *******
ok: [host-8-241-153.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "results": {"cmd": "/usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 16 August 2018  12:13:07 +0800 (0:00:00.462)       0:24:11.109 *******
fatal: [host-8-241-153.host.centralci.eng.rdu2.redhat.com]: FAILED! => { "assertion": false, "changed": false, "evaluated_to": false, "msg": "No schedulable nodes found matching node selector for Prometheus - 'role=node'" }
        to retry, use: --limit @/home/slave6/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry
********************************************************************************

Running the same command manually on the cluster returns the expected nodes:

# /usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true
NAME                              STATUS    ROLES     AGE       VERSION
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0

# oc get node --show-labels | grep role=node
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-1,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-2,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-3,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-1,node-role.kubernetes.io/infra=true,role=node
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-2,node-role.kubernetes.io/infra=true,role=node

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-docs-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-roles-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus along with OCP on OpenStack; for the parameters, see the [Additional info] section.
2.
3.

Actual results:

Expected results:

Additional info:
openshift_hosted_prometheus_deploy: true
openshift_prometheus_node_selector: '{"role":"node"}'
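Whether the task's `oc get node --selector=role=node` call finds anything comes down to equality-based label matching: a node is returned only if its label map contains every `key=value` clause of the selector. As an illustration only (this is not openshift-ansible code; `matches_selector` and the label maps below are made up for this sketch, modeled on the nodes in this report):

```python
# Illustrative sketch (not openshift-ansible code): how an equality-based
# Kubernetes label selector such as "role=node" is evaluated against the
# node label maps shown in the --show-labels output above.

def matches_selector(labels, selector):
    """Return True if every key=value clause in the selector is satisfied."""
    for clause in selector.split(","):
        key, _, value = clause.partition("=")
        if labels.get(key) != value:
            return False
    return True

# Hypothetical label maps modeled on the nodes in this report.
labelled = {"node-role.kubernetes.io/compute": "true", "role": "node"}
unlabelled = {"node-role.kubernetes.io/compute": "true"}

print(matches_selector(labelled, "role=node"))    # True
print(matches_selector(unlabelled, "role=node"))  # False
```

Note that in this report the nodes do carry `role=node` (see the `--show-labels` output), so the manual `oc` command matches them; the surprising part is that the Ansible task nevertheless received an empty `items` list.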
I can't think of any reason why GCE/AWS would behave differently from OpenStack here unless there is some difference in the installation options that adds the "role=node" label on GCE/AWS. By default this label is no longer added (as of 3.10, I think); instead, the selector for non-infra, non-master nodes is "node-role.kubernetes.io/compute=true". To add the "role=node" label at install time, you would have to modify the node group definition (https://docs.openshift.com/container-platform/3.10/install/configuring_inventory_file.html#configuring-node-host-labels).

I was able to reproduce the issue by using the selector in my inventory:

openshift_prometheus_node_selector: {"role":"node"}

which I think is expected, since this label is not added by default. After manually adding the "role=node" label to the node, the Prometheus installer correctly started the pods on the labelled node.

Does it work for you to use the newer label?

openshift_prometheus_node_selector: '{"node-role.kubernetes.io/compute": "true"}'
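For reference, a hedged sketch of what the node group change mentioned above might look like in the inventory (the group name `node-config-compute` and label list here are illustrative assumptions; check the linked 3.10 documentation for the exact node group definitions used in your deployment):

```ini
# Illustrative inventory snippet -- not taken from this report.
# Adds the legacy "role=node" label to the compute node group so that
# a selector like 'role=node' can match those nodes at install time.
openshift_node_groups=[{'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true', 'role=node']}]
```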
Will try with openshift_prometheus_node_selector: '{"node-role.kubernetes.io/compute": "true"}' later; maybe the problem is related to our environment template.
same issue with 3.10, see Bug 1609019
PR with fix for 3.11 https://github.com/openshift/openshift-ansible/pull/9661
Issue is fixed; installing Prometheus along with OCP is successful now.

# openshift version
openshift v3.11.0-0.24.0

openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652