Bug 1590740
| Summary: | master/infra nodes are labeled with "compute" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Johnny Liu <jialiu> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Johnny Liu <jialiu> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.10.0 | CC: | aos-bugs, jialiu, jokerman, mmccomas, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: a race condition between the sync service and the openshift install. Consequence: master nodes are labeled as compute. Fix: all node labels are now applied by the sync daemonset. Result: nodes have the expected labels. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-20 21:42:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Johnny,

I think the problem is that you don't have openshift_node_group_name set on the master in the masters group. I'll see if we can reproduce today, but I suspect adding that will fix it, and then we just need to make sure to improve the documentation.

It seems to be a race condition between sync and the ansible install. Does this happen often?

Here's what's happening in the log:
1) sync DS started, couldn't find the qe-master configmap, so it paused for 180 secs
2) qe-master configmap got created
3) ansible listed all the nodes, which didn't have master and infra labels yet, and marked those as compute
4) sync DS woke up, found the qe-master configmap and assigned the master=true label

The simplest solution here would be explicitly marking masters with master=true, although a better idea is rewriting the sync service to use 'oc observe' instead of 'oc get' and 'sleep'.

Created https://github.com/openshift/openshift-ansible/pull/8743 which should help with this issue.

PR from comment #3 was merged in openshift-ansible-3.10.0-0.69.0; it seems the compute label doesn't get assigned to master anymore.

Re-tested this bug with openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch, still 100% reproduced.

(In reply to Scott Dodson from comment #1)
> Johnny,
>
> I think the problem is that you don't have openshift_node_group_name set on
> the master in the masters group.

I also tried setting openshift_node_group_name; the bug still reproduced.

[masters]
qe-jialiu3101-master-etcd-1.0619-q06.qe.rhcloud.com openshift_hostname=qe-jialiu3101-master-etcd-1 openshift_node_group_name='qe-master'

(In reply to Johnny Liu from comment #6)
> (In reply to Scott Dodson from comment #1)
> > Johnny,
> >
> > I think the problem is that you don't have openshift_node_group_name set on
> > the master in the masters group.
> I also tried to set openshift_node_group_name, also reproduced.
>
> [masters]
> qe-jialiu3101-master-etcd-1.0619-q06.qe.rhcloud.com
> openshift_hostname=qe-jialiu3101-master-etcd-1
> openshift_node_group_name='qe-master'

We're going to need the ansible-playbook logs and the journalctl logs for the origin-node service to find out more.

Created attachment 1452912
installation log with inventory file embedded 1
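
The four-step race described in the log walkthrough above can be modeled with a small, purely illustrative Python sketch. Nothing here talks to a cluster: the function name, its parameters, and the timestamps are hypothetical, and the only value taken from this report is the 180-second pause of the sync DS. It assumes the sync loop labels the node on its first poll after the configmap exists, and that ansible adds compute only when it sees neither master nor infra.

```python
# Hypothetical, simplified model of the race; no real cluster is involved.
SYNC_RETRY_SECONDS = 180  # pause reported for the sync DS in the log walkthrough


def simulate(configmap_created_at, ansible_labels_at, sync_first_check_at=0):
    """Return the sorted roles on a master after both actors have run."""
    roles = set()

    # The sync loop polls at fixed intervals and labels the node on the
    # first poll that does not happen before the configmap is created.
    poll = sync_first_check_at
    while poll < configmap_created_at:
        poll += SYNC_RETRY_SECONDS
    sync_labels_at = poll

    # Replay both labeling events in time order.
    events = sorted([(sync_labels_at, "sync"), (ansible_labels_at, "ansible")])
    for _, actor in events:
        if actor == "sync":
            roles.add("master")
        elif not roles & {"master", "infra"}:
            # Ansible sees a node without master/infra labels and
            # therefore treats it as a plain compute node.
            roles.add("compute")
    return sorted(roles)


if __name__ == "__main__":
    # Ansible runs while the sync loop is still sleeping: both roles end up set.
    print(simulate(configmap_created_at=10, ansible_labels_at=60))
    # Ansible runs after the sync loop has labeled the node: only master is set.
    print(simulate(configmap_created_at=10, ansible_labels_at=200))
```

In this model the outcome depends only on whether the ansible labeling step runs before or after the sync poll that finds the configmap, which is why the bug can reproduce reliably in some environments and not in others.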
In another install of mine, not only the master but also the infra nodes are labeled with "compute"; please refer to comment 8 for the install log and inventory file.

Hit this bug with openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch.rpm on OpenStack.
[root@shared-wmeng310ch-master-etcd-1 ~]# oc get node
NAME                               STATUS    ROLES            AGE       VERSION
shared-wmeng310ch-master-etcd-1    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-master-etcd-2    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-master-etcd-3    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-1   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-2   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-3   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-nrri-1           Ready     compute,infra    4h        v1.10.0+b81c8f8
shared-wmeng310ch-nrri-2           Ready     compute,infra    4h        v1.10.0+b81c8f8
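
For anyone re-checking a cluster for this symptom, a listing like the one above can be scanned mechanically. The following is a minimal sketch, assuming the plain `oc get node` text has been captured into a string with the header row first; the helper name `find_mislabeled` and the `SAMPLE` constant are hypothetical and not part of any OpenShift tooling.

```python
def find_mislabeled(oc_get_node_output):
    """Return names of nodes whose ROLES combine compute with master or infra."""
    flagged = []
    lines = oc_get_node_output.strip().splitlines()
    for line in lines[1:]:  # skip the NAME/STATUS/ROLES header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated rows
        name, roles = fields[0], set(fields[2].split(","))
        if "compute" in roles and roles & {"master", "infra"}:
            flagged.append(name)
    return flagged


# A few rows copied from the listing in this report.
SAMPLE = """\
NAME STATUS ROLES AGE VERSION
shared-wmeng310ch-master-etcd-1 Ready compute,master 4h v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-1 Ready compute 4h v1.10.0+b81c8f8
shared-wmeng310ch-nrri-1 Ready compute,infra 4h v1.10.0+b81c8f8
"""

if __name__ == "__main__":
    for node in find_mislabeled(SAMPLE):
        print(node)
```

The check relies only on whitespace-separated columns, so it works on both the aligned and the single-space form of the output, but it does not handle a shell prompt line preceding the header.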
TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 20 June 2018 00:20:23 -0400 (0:00:01.059) 0:34:26.928 ********
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-3) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-3", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-3 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 20 June 2018 00:23:02 -0400 (0:00:01.181) 0:37:05.557 ********
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-3) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-3", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-3 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-nrri-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-nrri-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-nrri-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-nrri-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-nrri-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-nrri-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
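
When triaging a long install log, it can help to pull out just the nodes that the labeling task touched. The sketch below assumes the log text contains the `oc label node <name> node-role.kubernetes.io/compute=true` command string exactly as it appears in the excerpts above; the function name `nodes_labeled_compute` is hypothetical.

```python
import re

# Matches the label command as it appears in the "cmd" field of the task output.
LABEL_CMD = re.compile(
    r"oc label node (\S+) node-role\.kubernetes\.io/compute=true"
)


def nodes_labeled_compute(log_text):
    """Return node names, in log order, that the task labeled as compute."""
    return LABEL_CMD.findall(log_text)


if __name__ == "__main__":
    # Shortened excerpt of one result line from this report.
    line = (
        'changed: [...] => (item=shared-wmeng310ch-master-etcd-1) => '
        '{"results": {"cmd": "/usr/bin/oc label node '
        'shared-wmeng310ch-master-etcd-1 '
        'node-role.kubernetes.io/compute=true --overwrite"}}'
    )
    print(nodes_labeled_compute(line))
```

Comparing this list against the hosts in the masters and infra inventory groups shows directly which nodes received the unwanted label.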
Created https://github.com/openshift/openshift-ansible/pull/8868 to fix this - now additional labels are not applied by ansible; all labels have to be defined in the node config. The sync DS takes care of applying those.

Fix is available in openshift-ansible-3.10.2-1

Verified this bug with openshift-ansible-3.10.2-1, and PASS.

Master and infra nodes are labeled correctly without the extra 'compute' role, and the non-infra, non-master, non-compute nodes are also labeled correctly.

[root@ip-172-18-29-128 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-10-66.ec2.internal    Ready     master    5h        v1.10.0+b81c8f8
ip-172-18-11-229.ec2.internal   Ready     infra     5h        v1.10.0+b81c8f8
ip-172-18-2-185.ec2.internal    Ready     <none>    5h        v1.10.0+b81c8f8
ip-172-18-26-161.ec2.internal   Ready     <none>    5h        v1.10.0+b81c8f8
ip-172-18-28-2.ec2.internal     Ready     infra     5h        v1.10.0+b81c8f8
ip-172-18-29-128.ec2.internal   Ready     master    5h        v1.10.0+b81c8f8
ip-172-18-4-140.ec2.internal    Ready     master    5h        v1.10.0+b81c8f8

# oc get node ip-172-18-2-185.ec2.internal --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-172-18-2-185.ec2.internal Ready <none> 5h v1.10.0+b81c8f8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-2-185.ec2.internal,region=primary,role=node
Created attachment 1450832
installation log with inventory file embedded

Description of problem:

TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 13 June 2018 02:23:59 -0400 (0:00:00.734) 0:12:24.600 ********
changed: [qe-jialiu3102-master-etcd-1.0613-l70.qe.rhcloud.com -> qe-jialiu3102-master-etcd-1.0613-l70.qe.rhcloud.com] => (item=qe-jialiu3102-master-etcd-1) => {"changed": true, "failed": false, "item": "qe-jialiu3102-master-etcd-1", "results": {"cmd": "/usr/bin/oc label node qe-jialiu3102-master-etcd-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}

The above task labels a master as a compute node, even though I only set the "master" node name for the master node.

Version-Release number of the following components:
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Trigger a 3.10 installation; the inventory file and install log will be attached later.
2. After installation, check the node labels.

Actual results:
[root@qe-jialiu3102-master-etcd-1 ~]# oc get node
NAME                                   STATUS    ROLES            AGE       VERSION
qe-jialiu3102-master-etcd-1            Ready     compute,master   3h        v1.10.0+b81c8f8
qe-jialiu3102-node-registry-router-1   Ready     compute          3h        v1.10.0+b81c8f8

[root@qe-jialiu3102-master-etcd-1 ~]# oc get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
qe-jialiu3102-master-etcd-1 Ready compute,master 3h v1.10.0+b81c8f8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=qe-jialiu3102-master-etcd-1,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/master=true
qe-jialiu3102-node-registry-router-1 Ready compute 3h v1.10.0+b81c8f8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=qe-jialiu3102-node-registry-router-1,node-role.kubernetes.io/compute=true,registry=enabled,role=node,router=enabled

Expected results:
No "compute" label for master.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag