Bug 1566805

Summary: ansible installer gets confused about node identification when gluster nodes are added.
Product: OpenShift Container Platform
Reporter: raffaele spazzoli <rspazzol>
Component: Installer
Assignee: Scott Dodson <sdodson>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED NOTABUG
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, boris.ruppert, jarrpa, jokerman, mmccomas, pkanthal, rspazzol, rteague, sbain
Version: 3.9.0
Keywords: Triaged
Target Milestone: ---
Flags: rspazzol: needinfo-
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-14 20:47:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Attachments:
working_inventory (flags: none)
non-working inventory (flags: none)

Description raffaele spazzoli 2018-04-13 02:51:09 UTC
Description of problem:
The 3.9 installer has new logic to identify the role of each node and to distinguish between master, infra, and compute nodes. Without CNS nodes everything seems to work fine, but when I add CNS nodes the expected labels are no longer applied.
Here are the relevant parts of my inventory:

To the masters I apply the following labels:
openshift_node_labels: 
  region: master 

To the infranodes I apply the following labels:
openshift_node_labels: 
  region: infra
  node-role.kubernetes.io/infranode: true

To the app nodes I apply the following labels:
openshift_node_labels: 
  region: primary

To the CNS nodes I apply the following labels:
openshift_node_labels: 
  region: cns
  node-role.kubernetes.io/cnsnode: true

The default node selector is the following:
osm_default_node_selector: 'region=primary'
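
For reference, in YAML-inventory form (roughly what "ansible-inventory --list --yaml" would also show) these per-host labels would look something like the following; this is an illustrative sketch only, with hostnames taken from the node list below and grouping assumed:

all:
  children:
    nodes:
      hosts:
        env1-node-hzn2:
          openshift_node_labels:
            region: primary
        env1-cnsnode-4prb:
          openshift_node_labels:
            region: cns
            node-role.kubernetes.io/cnsnode: "true"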

The resulting node labels are the following:

[root@env1-master-sq3z ~]# oc get nodes --show-labels
NAME                  STATUS    ROLES             AGE       VERSION             LABELS
env1-cnsnode-4prb     Ready     cnsnode,compute   38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=env1-cnsnode-4prb,node-role.kubernetes.io/cnsnode=True,node-role.kubernetes.io/compute=true,region=cns
env1-cnsnode-c531     Ready     cnsnode,compute   38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kubernetes.io/hostname=env1-cnsnode-c531,node-role.kubernetes.io/cnsnode=True,node-role.kubernetes.io/compute=true,region=cns
env1-cnsnode-s864     Ready     cnsnode,compute   38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=env1-cnsnode-s864,node-role.kubernetes.io/cnsnode=True,node-role.kubernetes.io/compute=true,region=cns
env1-infranode-7t4s   Ready     infranode         38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=env1-infranode-7t4s,node-role.kubernetes.io/infranode=True
env1-infranode-g9m6   Ready     infranode         38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kubernetes.io/hostname=env1-infranode-g9m6,node-role.kubernetes.io/infranode=True
env1-infranode-xpwf   Ready     infranode         38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=env1-infranode-xpwf,node-role.kubernetes.io/infranode=True
env1-master-j2f4      Ready     master            38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kubernetes.io/hostname=env1-master-j2f4,node-role.kubernetes.io/master=true
env1-master-sq3z      Ready     master            38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=env1-master-sq3z,node-role.kubernetes.io/master=true
env1-master-tv6g      Ready     master            38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=env1-master-tv6g,node-role.kubernetes.io/master=true
env1-node-hzn2        Ready     compute           38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=env1-node-hzn2,node-role.kubernetes.io/compute=true
env1-node-z2b1        Ready     compute           38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=env1-node-z2b1,node-role.kubernetes.io/compute=true
env1-node-zc97        Ready     compute           38m       v1.9.1+a0ce1bc657   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kubernetes.io/hostname=env1-node-zc97,node-role.kubernetes.io/compute=true

Version: 3.9.14

How reproducible: 100%


Comment 1 Scott Dodson 2018-04-13 03:11:41 UTC
To clarify, the app nodes in your example are env1-node-hzn2, env1-node-z2b1, and env1-node-zc97? This is hard to follow without actually seeing the group mappings and variables.

The anomalies are:
masters aren't labeled region=master
app nodes aren't labeled region=primary
infra nodes aren't labeled region=infra

The only nodes that actually get the region label applied are the cns nodes?

Can you please provide your inventory and group_vars?

Comment 2 raffaele spazzoli 2018-04-13 11:17:36 UTC
Created attachment 1421324 [details]
working_inventory

Comment 3 raffaele spazzoli 2018-04-13 11:18:01 UTC
Created attachment 1421325 [details]
non-working inventory

Comment 4 raffaele spazzoli 2018-04-13 11:19:05 UTC
Scott, you have correctly identified the anomaly.
I have attached an example of a working inventory and a non-working inventory.

Comment 5 Russell Teague 2018-06-22 19:47:38 UTC
Thank you for the example inventories. I am unable to track down the issue with the information provided so far. Please attach a log file with -vvv output. Please also attach a YAML dump of the affected inventory using:

$ ansible-inventory -i hosts --list --yaml

Comment 6 boris.ruppert@consol.de 2018-06-27 08:56:15 UTC
We see exactly the same behavior with the Ansible installer 3.9.27 and 3.9.30.

All region labels get lost during the installation (when GlusterFS is deployed); only region=storage survives.
All other node labels survive the installation.

Our current workaround is to change the node labels and selectors to something different, e.g. type=master, type=infra, etc.
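
For example, a relabeling along these lines (an illustrative sketch only; the exact group layout and values will differ per inventory):

# per-group node labels (illustrative)
openshift_node_labels:
  type: infra            # masters would use type: master, app nodes type: primary, etc.

# cluster-wide default node selector (illustrative)
osm_default_node_selector: 'type=primary'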

Comment 9 stbain 2018-11-07 19:04:16 UTC
We ran the following command against a customer's setup where we believe this bug may be manifesting itself:

ansible-inventory -i [inventory file] --list --yaml

The resulting YAML file shows the correct openshift_node_labels across the masters, infra, and workers. Regardless of what Ansible is receiving from the inventory, installing with two Gluster clusters (one for apps and one for logging, metrics, and registry) results in all nodes being labeled as region=infra. We plan to test and see if the behavior is repeatable without the Gluster installation playbooks being run.

We are also looking at the 3.9 roles and playbooks where Ansible's oc_label module is used, to try to determine where Ansible may be applying the incorrect label.

Comment 10 Jose A. Rivera 2018-11-09 14:01:26 UTC
The problem here is that you're using "region" as your node-selector label for CNS. Each node can only have one label with a given key, so any node that is designated for GlusterFS will have its region label value changed to "cns". An easier solution would be to change the node selector to something like "storage=glusterfs".
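
For example (an illustrative sketch; variable placement depends on the inventory layout):

# labels on the nodes intended for GlusterFS (illustrative)
openshift_node_labels:
  region: primary              # the node keeps whatever region it needs for scheduling
  storage: glusterfs

# GlusterFS node selector using a dedicated key instead of 'region' (illustrative)
openshift_storage_glusterfs_nodeselector: 'storage=glusterfs'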

Comment 11 Russell Teague 2018-11-20 19:03:14 UTC
The problem described above results from the combination of the following inventory variables:

openshift_storage_glusterfs_nodeselector: "region=cns"
openshift_storage_glusterfs_wipe: true

By specifying a node selector whose key is 'region' and also setting openshift_storage_glusterfs_wipe=True, a task [1] during install removes from all hosts every label whose key matches the key used in the node selector. Thus, all 'region' labels are removed from all hosts, and then only the label region=cns is added.

To resolve this issue, either specify a custom node selector that does not use 'region' as the key, or do not specify openshift_storage_glusterfs_nodeselector in the inventory at all, which allows the role's default node selector, 'glusterfs=storage-host' [2], to be used.
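
For example, the two options in inventory-variable form (an illustrative sketch only):

# Option 1: a custom node selector whose key is not 'region' (illustrative),
# with the GlusterFS nodes carrying a matching storage=glusterfs label
openshift_storage_glusterfs_nodeselector: 'storage=glusterfs'

# Option 2: simply omit openshift_storage_glusterfs_nodeselector from the
# inventory; the role then uses its default selector, glusterfs=storage-host [2]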

This issue does not exist in release-3.10, where the 'Unlabel' task was removed as part of refactoring and additional improvements in the openshift_storage_glusterfs role.

Please report back whether the above recommendations resolve this issue.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_storage_glusterfs/tasks/glusterfs_deploy.yml#L19-L26
[2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_storage_glusterfs/README.md#role-variables

Comment 13 Scott Dodson 2018-12-14 20:47:15 UTC
This is believed to be a misconfiguration. Please see the suggestion in Comment 11.