Bug 1617990
| Summary: | Task [Retrieve list of schedulable nodes matching selector] returns an empty list, causing Prometheus installation along with OCP to fail on OpenStack | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Paul Gier <pgier> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The sync pods are not yet available when the Prometheus install checks for available nodes. Consequence: If a custom label is used so the Prometheus install can select an appropriate node, the sync pods must already have applied that label to the nodes; otherwise, the Prometheus installer will not find any nodes with a matching label. Fix: Add a check to the install process that waits for the sync pods to become available before continuing. Result: This ensures that the node labelling process is complete and that the nodes have the correct labels for the Prometheus pod to be scheduled. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-11 07:25:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
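The fix described in the Doc Text above (wait for the sync pods before checking for labelled nodes) amounts to a retry loop around a readiness check. A minimal Python sketch of that idea, where `check_ready` is a hypothetical stand-in for the real openshift-ansible query against the sync daemonset pods:

```python
import time

def wait_for_sync_pods(check_ready, timeout=300, interval=5, sleep=time.sleep):
    """Poll check_ready() until it returns True or the timeout expires.

    check_ready is a stand-in for the real openshift-ansible check that
    queries the sync daemonset pods; it is injected here (along with sleep)
    so the polling logic itself can be exercised without a cluster.
    """
    waited = 0
    while waited < timeout:
        if check_ready():
            return True
        sleep(interval)
        waited += interval
    return False
```

Once this returns True, the node-labelling pass performed by the sync pods has had a chance to complete, so the subsequent selector query can find the labelled nodes.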
I can't think of any reason why GCE/AWS would behave differently from OpenStack here unless there is some difference in the installation options which adds the "role=node" label on GCE/AWS. By default this label is no longer added (as of 3.10, I think); instead, the selector for non-infra, non-master nodes is "node-role.kubernetes.io/compute=true". In order to add the "role=node" label at install time, you'd have to modify the node group definition (https://docs.openshift.com/container-platform/3.10/install/configuring_inventory_file.html#configuring-node-host-labels).

I was able to reproduce the issue by using this selector in my inventory:

openshift_prometheus_node_selector: {"role":"node"}

which I think is expected, since this label is not added by default. After manually adding the "role=node" label to the node, the Prometheus installer correctly started the pods on the labelled node. Does it work for you to use the newer label?

openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute":"true"}

Will try with openshift_prometheus_node_selector: {"node-role.kubernetes.io/compute":"true"} later; maybe the problem is related to our environment template.
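The behavior discussed above can be seen by matching a selector against a node's label set directly: a selector matches only if every key/value pair is present. A short sketch (the label data below is illustrative, modeled on the nodes in this report before any sync pod adds "role=node"):

```python
def matches_selector(node_labels, selector):
    """Return True if every key/value pair in selector is present in node_labels."""
    return all(node_labels.get(k) == v for k, v in selector.items())

# Labels as applied by a default 3.10+ install, before "role=node" is added:
node = {
    "node-role.kubernetes.io/compute": "true",
    "kubernetes.io/hostname": "preserve0-shareir2-node-1",
}

matches_selector(node, {"role": "node"})                             # no match
matches_selector(node, {"node-role.kubernetes.io/compute": "true"})  # matches
```

This is why the custom "role=node" selector returns an empty node list until the sync pods finish labelling, while the default compute-role selector matches immediately.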
Same issue with 3.10, see Bug 1609019.

PR with fix for 3.11: https://github.com/openshift/openshift-ansible/pull/9661

Issue is fixed; installing Prometheus along with OCP now succeeds.

# openshift version
openshift v3.11.0-0.24.0

openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
Description of problem:
When installing Prometheus along with OCP on an OpenStack HA environment, the install fails at TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on], because TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] returns an empty list, even though the list is actually not empty. This issue happens frequently on OpenStack; GCE/AWS do not have this issue.

********************************************************************************
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Thursday 16 August 2018  12:13:06 +0800 (0:00:00.106)       0:24:10.646 *******
ok: [host-8-241-153.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "results": {"cmd": "/usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Prometheus has nodes to run on] ****
Thursday 16 August 2018  12:13:07 +0800 (0:00:00.462)       0:24:11.109 *******
fatal: [host-8-241-153.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false,
    "changed": false,
    "evaluated_to": false,
    "msg": "No schedulable nodes found matching node selector for Prometheus - 'role=node'"
}
	to retry, use: --limit @/home/slave6/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry
********************************************************************************

# /usr/bin/oc get node --selector=role=node --field-selector=spec.unschedulable!=true
NAME                              STATUS    ROLES     AGE       VERSION
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0

# oc get node --show-labels | grep role=node
preserve0-shareir2-node-1         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-1,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-2         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-2,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-3         Ready     compute   4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-3,node-role.kubernetes.io/compute=true,role=node
preserve0-shareir2-node-infra-1   Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-1,node-role.kubernetes.io/infra=true,role=node
preserve0-shareir2-node-infra-2   Ready     infra     4h        v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e52c6e51-9468-457a-b51f-d47418590fed,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=regionOne,failure-domain.beta.kubernetes.io/zone=nova,kubernetes.io/hostname=preserve0-shareir2-node-infra-2,node-role.kubernetes.io/infra=true,role=node

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-docs-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm
openshift-ansible-roles-3.11.0-0.16.0.git.0.e82689aNone.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus along with OCP in OpenStack; for parameters, see the [Additional info] part
2.
3.

Actual results:

Expected results:

Additional info:
openshift_hosted_prometheus_deploy: true
openshift_prometheus_node_selector: '{"role":"node"}'
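The failing assertion keys off the `items` field of the JSON that `oc get node -o json` returns. A sketch of the same emptiness check the playbook performs, using a payload shaped like the one captured in the task output above (at the moment the sync pods had not yet labelled any node):

```python
import json

# Shape of the `oc get node --selector=role=node ... -o json` result recorded
# in the failing task output, while no node yet carried the "role=node" label:
payload = ('{"apiVersion": "v1", "items": [], "kind": "List", '
           '"metadata": {"resourceVersion": "", "selfLink": ""}}')

def schedulable_nodes(oc_json):
    """Return the node items from an `oc get node -o json` result."""
    return json.loads(oc_json).get("items", [])

nodes = schedulable_nodes(payload)
# An empty list here is what trips the
# "No schedulable nodes found matching node selector for Prometheus" assertion.
```

Once the sync pods have applied the labels, the same query returns a non-empty `items` list and the assertion passes.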