Bug 1628208 - Install failed at TASK [openshift_control_plane : Ensure that Cluster Monitoring Operator has nodes to run on], which should not happen
Summary: Install failed at TASK [openshift_control_plane : Ensure that Cluster Monitor...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Vadim Rutkovsky
QA Contact: Weihua Meng
URL:
Whiteboard:
Duplicates: 1628357
Depends On:
Blocks:
 
Reported: 2018-09-12 13:08 UTC by Weihua Meng
Modified: 2018-12-21 15:23 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The sync daemonset did not wait for node restarts when applying node configs.
Consequence: The install proceeded while some nodes had not yet restarted with the new config settings.
Fix: Ansible now waits for the config to be applied before proceeding.
Result: Other tasks no longer complain that some nodes do not match the default node selectors.
Clone Of:
Environment:
Last Closed: 2018-12-21 15:23:10 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Weihua Meng 2018-09-12 13:08:35 UTC
Description of problem:
Install failed at TASK [openshift_control_plane : Ensure that Cluster Monitoring Operator has nodes to run on], which should not happen

[root@wmen1gr311-master-etcd-1 ~]# /usr/bin/oc get node --selector=node-role.kubernetes.io/infra=true
NAME                      STATUS    ROLES     AGE       VERSION
wmen1gr311-node-infra-1   Ready     infra     14m       v1.11.0+d4cacc0
wmen1gr311-node-infra-2   Ready     infra     14m       v1.11.0+d4cacc0

Version-Release number of the following components:
openshift-ansible-3.11.1

How reproducible:
70% on OpenStack

Steps to Reproduce:
1. Install OCP v3.11 HA on OpenStack

Actual results:
Install failed
TASK [openshift_control_plane : Retrieve list of schedulable nodes matching selector] ***
Wednesday 12 September 2018  20:45:12 +0800 (0:00:00.162)       0:22:29.935 *** 
ok: [host-8-252-232.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "results": {"cmd": "/usr/bin/oc get node --selector=node-role.kubernetes.io/infra=true --field-selector=spec.unschedulable!=true -o json -n default", "results": [{"apiVersion": "v1", "items": [], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

TASK [openshift_control_plane : Ensure that Cluster Monitoring Operator has nodes to run on] ***
Wednesday 12 September 2018  20:45:13 +0800 (0:00:00.477)       0:22:30.413 *** 
fatal: [host-8-252-232.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "assertion": false, 
    "changed": false, 
    "evaluated_to": false, 
    "msg": "No schedulable nodes found matching node selector for Cluster Monitoring Operator - 'node-role.kubernetes.io/infra=true'"
}

Expected results:
Install succeeds

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Vadim Rutkovsky 2018-09-12 14:04:30 UTC
Same root cause as bug #1609019: the sync DS restarts infra nodes to apply node labels, but openshift-ansible proceeds, finds no infra nodes, and fails

PR https://github.com/openshift/openshift-ansible/pull/9983 should resolve this; waiting for Clayton to approve the solution there
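
For illustration, a minimal sketch of the approach (hypothetical; not the actual contents of PR 9983): instead of asserting immediately, retry the node query until at least one schedulable node matches the infra selector, giving the sync DS time to finish restarting nodes.

- name: Wait for schedulable nodes matching the infra selector
  command: >
    oc get node --selector=node-role.kubernetes.io/infra=true
    --field-selector=spec.unschedulable!=true -o name
  register: infra_nodes
  # Poll every 10s, up to 30 times, while the sync DS restarts nodes;
  # the retries/delay values are illustrative, not taken from the PR.
  until: infra_nodes.stdout_lines | length > 0
  retries: 30
  delay: 10
  changed_when: false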

Comment 4 Scott Dodson 2018-09-12 20:00:59 UTC
*** Bug 1628357 has been marked as a duplicate of this bug. ***

Comment 5 Scott Dodson 2018-09-13 01:47:27 UTC
https://github.com/openshift/openshift-ansible/pull/10039 release-3.11 pick

Comment 6 Scott Dodson 2018-09-13 12:07:12 UTC
In openshift-ansible-3.11.2-1

Comment 7 Weihua Meng 2018-09-14 06:45:33 UTC
Fixed.

openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 8 Chance Zibolski 2018-10-13 00:54:28 UTC
I just hit this even after manually building the openshift-ansible Docker image on the release-3.11 branch.

Here's my ansible run output in a gist: https://gist.github.com/chancez/064ed08c5016513e5c4fd67edb43c6bc

You can see in the output that it gets to the tasks added in https://github.com/openshift/openshift-ansible/pull/10039, e.g. "Wait for sync DS to set annotations on all nodes"

Comment 9 Junqi Zhao 2018-10-15 00:57:41 UTC
(In reply to Chance Zibolski from comment #8)
> I just hit this even after manually building the openshift-ansible Docker
> image on the release-3.11 branch.
> 
> Here's my ansible run output in a gist:
> https://gist.github.com/chancez/064ed08c5016513e5c4fd67edb43c6bc
> 
> You can see in the output that it gets to the tasks added in
> https://github.com/openshift/openshift-ansible/pull/10039, e.g. "Wait for
> sync DS to set annotations on all nodes"

Make sure you have nodes labeled with node-role.kubernetes.io/infra=true; the error indicates that no nodes carry that label. You can check the labels via
# oc get node --show-labels | grep node-role.kubernetes.io/infra=true

If you want to use a different nodeSelector, you can set it with the openshift_cluster_monitoring_operator_node_selector parameter, e.g.:
openshift_cluster_monitoring_operator_node_selector={'role': 'node'}
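
If a node is missing the label, it can be added with a standard oc label command, for example (replace <node-name> with the actual node):
# oc label node <node-name> node-role.kubernetes.io/infra=true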

Comment 10 Junqi Zhao 2018-10-15 05:38:50 UTC
(In reply to Junqi Zhao from comment #9)
> Make sure you have nodes labeled with node-role.kubernetes.io/infra=true;
> the error indicates that no nodes carry that label. You can check the
> labels via
> # oc get node --show-labels | grep node-role.kubernetes.io/infra=true

Sorry, I checked again: you do have a node labeled with node-role.kubernetes.io/infra=true, so it seems the issue has reproduced.

Comment 11 Junqi Zhao 2018-10-15 09:09:38 UTC
Tested today; I did not hit this issue in my environment. Please make sure you have nodes labeled with node-role.kubernetes.io/infra=true during the installation process.

openshift-ansible version:
openshift-ansible-3.11.21-1.git.0.7dc17ca.el7.noarch

Comment 12 Chance Zibolski 2018-10-15 16:23:44 UTC
I used the GCP playbook from origin CI (https://github.com/openshift/release/tree/master/cluster/test-deploy), which automatically creates the VMs so that they should have the correct node labels.

Comment 13 Chance Zibolski 2018-10-15 17:28:22 UTC
Well, I confirmed they're not getting labeled, and I think I see why. It's config-related.

Comment 14 Luke Meyer 2018-12-21 15:23:10 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

