Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1590740

Summary: master/infra nodes are labeled with "compute"
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Installer
Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED CURRENTRELEASE
QA Contact: Johnny Liu <jialiu>
Severity: medium
Docs Contact:
Priority: high
Version: 3.10.0
CC: aos-bugs, jialiu, jokerman, mmccomas, wmeng
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: a race condition between the sync service and the OpenShift installer. Consequence: master nodes are labeled as compute. Fix: all node labels are now applied by the sync daemonset. Result: nodes have the expected labels.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-20 21:42:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
installation log with inventory file embedded (flags: none)
installation log with inventory file embedded 1 (flags: none)

Description Johnny Liu 2018-06-13 09:56:36 UTC
Created attachment 1450832 [details]
installation log with inventory file embedded

Description of problem:
TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 13 June 2018  02:23:59 -0400 (0:00:00.734)       0:12:24.600 ******** 

changed: [qe-jialiu3102-master-etcd-1.0613-l70.qe.rhcloud.com -> qe-jialiu3102-master-etcd-1.0613-l70.qe.rhcloud.com] => (item=qe-jialiu3102-master-etcd-1) => {"changed": true, "failed": false, "item": "qe-jialiu3102-master-etcd-1", "results": {"cmd": "/usr/bin/oc label node qe-jialiu3102-master-etcd-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}

The above task labels a master as a compute node, even though I only set the "master" node name for the master node.

Version-Release number of the following components:
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Trigger a 3.10 installation, inventory file and install log will be attached later

2. After installation, check the node label.
3.

Actual results:
[root@qe-jialiu3102-master-etcd-1 ~]# oc get node
NAME                                   STATUS    ROLES            AGE       VERSION
qe-jialiu3102-master-etcd-1            Ready     compute,master   3h        v1.10.0+b81c8f8
qe-jialiu3102-node-registry-router-1   Ready     compute          3h        v1.10.0+b81c8f8
[root@qe-jialiu3102-master-etcd-1 ~]# oc get node --show-labels
NAME                                   STATUS    ROLES            AGE       VERSION           LABELS
qe-jialiu3102-master-etcd-1            Ready     compute,master   3h        v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=qe-jialiu3102-master-etcd-1,node-role.kubernetes.io/compute=true,node-role.kubernetes.io/master=true
qe-jialiu3102-node-registry-router-1   Ready     compute          3h        v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-1,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=qe-jialiu3102-node-registry-router-1,node-role.kubernetes.io/compute=true,registry=enabled,role=node,router=enabled


Expected results:
No "compute" label for master.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Scott Dodson 2018-06-13 12:45:09 UTC
Johnny,

I think the problem is that you don't have openshift_node_group_name set on the master in the masters group. I'll see if we can reproduce today but I suspect adding that will fix it and then we just need to make sure to improve documentation.

Comment 2 Vadim Rutkovsky 2018-06-13 14:16:18 UTC
It seems to be a race condition between sync and ansible install.

Does this happen often?

Here's what's happening in the log:
1) sync DS started, couldn't find qe-master configmap, so it paused for 180 secs
2) qe-master configmap got created
3) ansible listed all the nodes, which don't have master and infra labels - and marked those as compute.
4) sync DS woke up, found qe-master configmap and assigned master=true label
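The sequence above can be sketched as a toy race (all names are hypothetical; the real sync pod is a shell loop and ansible runs `oc label`, not Python):

```python
import threading
import time

# Toy model of the API server's view of the master node's labels.
labels = {"master": {}}

def sync_ds():
    # Steps 1 and 4: the sync DS can't find the qe-master configmap yet,
    # so it sleeps (stand-in for the 180 s pause), then applies the
    # master label once it wakes up.
    time.sleep(0.2)
    labels["master"]["node-role.kubernetes.io/master"] = "true"

def ansible_install():
    # Step 3: ansible lists nodes while sync is still asleep; the master
    # has no master/infra label yet, so it is treated as a plain node
    # and labelled compute.
    time.sleep(0.05)
    if "node-role.kubernetes.io/master" not in labels["master"]:
        labels["master"]["node-role.kubernetes.io/compute"] = "true"

t1 = threading.Thread(target=sync_ds)
t2 = threading.Thread(target=ansible_install)
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(labels["master"]))  # the master ends up with both labels
```

Because ansible's check races the sync DS's sleep, the master ends up with both the compute and the master label, which matches the `oc get node` output in the description.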


The simplest solution here would be explicitly marking masters with master=true, although a better idea is rewriting the sync service to use 'oc observe' instead of 'oc get' and 'sleep'.
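Roughly, the difference between the two approaches looks like this (a sketch, not the actual sync pod script; `apply-node-labels.sh` is a hypothetical name for whatever applies the labels from the configmap):

```
# current behaviour (simplified): poll for the configmap, sleeping a
# fixed interval between attempts
until oc get configmap "$NODE_GROUP" -n openshift-node >/dev/null 2>&1; do
  sleep 180
done
oc label node "$NODE" --overwrite ...

# suggested: react as soon as the configmap appears, instead of sleeping
oc observe configmaps -n openshift-node -- apply-node-labels.sh
```

With `oc observe`, the handler runs the moment the configmap is created, which closes the window in which ansible can see an unlabelled master.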

Comment 3 Vadim Rutkovsky 2018-06-13 15:05:59 UTC
Created https://github.com/openshift/openshift-ansible/pull/8743 which should help with this issue

Comment 4 Vadim Rutkovsky 2018-06-15 08:29:29 UTC
PR from comment #3 was merged in openshift-ansible-3.10.0-0.69.0; it seems the compute label no longer gets assigned to masters.

Comment 5 Johnny Liu 2018-06-19 09:23:43 UTC
Re-tested this bug with openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch, still 100% reproduced.

Comment 6 Johnny Liu 2018-06-19 10:20:44 UTC
(In reply to Scott Dodson from comment #1)
> Johnny,
> 
> I think the problem is that you don't have openshift_node_group_name set on
> the master in the masters group.
I also tried setting openshift_node_group_name; the issue was still reproduced.

[masters]
qe-jialiu3101-master-etcd-1.0619-q06.qe.rhcloud.com openshift_hostname=qe-jialiu3101-master-etcd-1 openshift_node_group_name='qe-master'

Comment 7 Vadim Rutkovsky 2018-06-19 11:02:51 UTC
(In reply to Johnny Liu from comment #6)
> (In reply to Scott Dodson from comment #1)
> > Johnny,
> > 
> > I think the problem is that you don't have openshift_node_group_name set on
> > the master in the masters group.
> I also tried to set openshift_node_group_name, also reproduced.
> 
> [masters]
> qe-jialiu3101-master-etcd-1.0619-q06.qe.rhcloud.com
> openshift_hostname=qe-jialiu3101-master-etcd-1
> openshift_node_group_name='qe-master'

We're gonna need ansible-playbook logs and journalctl logs for the origin-node service to find out more

Comment 8 Johnny Liu 2018-06-19 11:15:47 UTC
Created attachment 1452912 [details]
installation log with inventory file embedded 1

Comment 10 Johnny Liu 2018-06-19 11:20:30 UTC
In another install, I found that not only master nodes but also infra nodes are labeled with "compute"; please refer to comment 8 for the install log and inventory file.

Comment 11 Weihua Meng 2018-06-20 08:48:44 UTC
hit this bug with openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch.rpm
on openstack

[root@shared-wmeng310ch-master-etcd-1 ~]# oc get node
NAME                               STATUS    ROLES            AGE       VERSION
shared-wmeng310ch-master-etcd-1    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-master-etcd-2    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-master-etcd-3    Ready     compute,master   4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-1   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-2   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-node-primary-3   Ready     compute          4h        v1.10.0+b81c8f8
shared-wmeng310ch-nrri-1           Ready     compute,infra    4h        v1.10.0+b81c8f8
shared-wmeng310ch-nrri-2           Ready     compute,infra    4h        v1.10.0+b81c8f8

TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 20 June 2018  00:20:23 -0400 (0:00:01.059)       0:34:26.928 ******** 
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-master-etcd-3) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-master-etcd-3", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-master-etcd-3 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}

TASK [openshift_manage_node : label non-master non-infra nodes compute] ********
Wednesday 20 June 2018  00:23:02 -0400 (0:00:01.181)       0:37:05.557 ******** 
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-node-primary-3) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-node-primary-3", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-node-primary-3 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-nrri-1) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-nrri-1", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-nrri-1 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}
changed: [dhcp-89-154.sjc.redhat.com -> dhcp-89-154.sjc.redhat.com] => (item=shared-wmeng310ch-nrri-2) => {"changed": true, "failed": false, "item": "shared-wmeng310ch-nrri-2", "results": {"cmd": "/usr/bin/oc label node shared-wmeng310ch-nrri-2 node-role.kubernetes.io/compute=true --overwrite", "results": {}, "returncode": 0}, "state": "add"}

Comment 12 Vadim Rutkovsky 2018-06-20 14:52:53 UTC
Created https://github.com/openshift/openshift-ansible/pull/8868 to fix this - now additional labels are not applied by ansible; all labels have to be defined in the node config, and the sync DS takes care of applying them.
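For illustration, the labels then live in the node group definitions in the inventory rather than in ansible tasks (a sketch based on the 3.10-style openshift_node_groups variable; check the openshift-ansible documentation for the exact defaults):

```
# hypothetical inventory snippet: each node group declares its own labels,
# and the sync DS applies whatever the node's group config declares
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]
```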

Comment 13 Vadim Rutkovsky 2018-06-21 07:54:09 UTC
Fix is available in openshift-ansible-3.10.2-1

Comment 14 Johnny Liu 2018-06-21 08:23:17 UTC
Verified this bug with openshift-ansible-3.10.2-1, and PASS.

Master and infra nodes are labeled correctly without the extra 'compute' label, and non-infra, non-master, non-compute nodes are also labeled correctly.

[root@ip-172-18-29-128 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-10-66.ec2.internal    Ready     master    5h        v1.10.0+b81c8f8
ip-172-18-11-229.ec2.internal   Ready     infra     5h        v1.10.0+b81c8f8
ip-172-18-2-185.ec2.internal    Ready     <none>    5h        v1.10.0+b81c8f8
ip-172-18-26-161.ec2.internal   Ready     <none>    5h        v1.10.0+b81c8f8
ip-172-18-28-2.ec2.internal     Ready     infra     5h        v1.10.0+b81c8f8
ip-172-18-29-128.ec2.internal   Ready     master    5h        v1.10.0+b81c8f8
ip-172-18-4-140.ec2.internal    Ready     master    5h        v1.10.0+b81c8f8

# oc get node ip-172-18-2-185.ec2.internal --show-labels
NAME                           STATUS    ROLES     AGE       VERSION           LABELS
ip-172-18-2-185.ec2.internal   Ready     <none>    5h        v1.10.0+b81c8f8   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-2-185.ec2.internal,region=primary,role=node