Bug 1635864

Summary: Scaling out a splitstack environment with blacklisted nodes fails while running External deployment step 1
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: John Fulton <johfulto>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: urgent
Priority: high
Version: 14.0 (Rocky)
CC: agurenko, aschoen, ceph-eng-bugs, dbecker, gamado, gfidente, gmeno, james.bagwell, johfulto, mariel, mburns, morazi, nthomas, sankarshan
Target Milestone: beta
Keywords: Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-9.0.0-0.20181001174824.90afd18.0rc2.0rc2.0rc2.el7ost
Doc Type: Bug Fix
Doc Text:
Blacklisting configuration updates against Ceph nodes no longer results in failed deployments.
Last Closed: 2019-01-11 11:53:36 UTC
Type: Bug
Attachments: logs.tar.gz (flags: none)

Description Marius Cornea 2018-10-03 19:28:24 UTC
Created attachment 1490294
logs.tar.gz

Description of problem:
Scaling out a splitstack environment with blacklisted nodes fails while running External deployment step 1:

[root@undercloud-0 stack]# tail -30 /var/lib/mistral/overcloud/ansible.log
2018-10-03 15:07:29,090 p=23844 u=mistral |  skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-10-03 15:07:29,122 p=23844 u=mistral |  skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-10-03 15:07:29,217 p=23844 u=mistral |  skipping: [ceph-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-10-03 15:07:29,263 p=23844 u=mistral |  skipping: [ceph-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-10-03 15:07:29,282 p=23844 u=mistral |  skipping: [ceph-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-10-03 15:07:36,258 p=23844 u=mistral |  changed: [compute-2] => {"changed": true, "cmd": ["ntpdate", "-u", "clock.redhat.com"], "delta": "0:00:06.882513", "end": "2018-10-03 15:07:36.235121", "rc": 0, "start": "2018-10-03 15:07:29.352608", "stderr": "", "stderr_lines": [], "stdout": " 3 Oct 15:07:36 ntpdate[17583]: adjust time server 10.11.160.238 offset -0.003281 sec", "stdout_lines": [" 3 Oct 15:07:36 ntpdate[17583]: adjust time server 10.11.160.238 offset -0.003281 sec"]}
2018-10-03 15:07:36,269 p=23844 u=mistral |  PLAY [External deployment step 1] **********************************************
2018-10-03 15:07:36,294 p=23844 u=mistral |  TASK [set blacklisted_hostnames] ***********************************************
2018-10-03 15:07:36,294 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:36 -0400 (0:00:07.326)       0:11:31.487 ***** 
2018-10-03 15:07:36,346 p=23844 u=mistral |  ok: [undercloud] => {"ansible_facts": {"blacklisted_hostnames": ["compute-0", "compute-1"]}, "changed": false}
2018-10-03 15:07:36,367 p=23844 u=mistral |  TASK [create ceph-ansible temp dirs] *******************************************
2018-10-03 15:07:36,368 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:36 -0400 (0:00:00.073)       0:11:31.560 ***** 
2018-10-03 15:07:36,582 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/group_vars) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/group_vars", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/group_vars", "size": 88, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,758 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/host_vars) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/host_vars", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/host_vars", "size": 174, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,924 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/fetch_dir) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/fetch_dir", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/fetch_dir", "size": 80, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,944 p=23844 u=mistral |  TASK [generate inventory] ******************************************************
2018-10-03 15:07:36,944 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:36 -0400 (0:00:00.576)       0:11:32.136 ***** 
2018-10-03 15:07:38,740 p=23844 u=mistral |  fatal: [undercloud]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'ansible_hostname'\n\nThe error appears to have been in '/var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml': line 15, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n    - '{{playbook_dir}}/ceph-ansible/fetch_dir'\n  - copy:\n    ^ here\n"}
2018-10-03 15:07:38,740 p=23844 u=mistral |  NO MORE HOSTS LEFT *************************************************************
2018-10-03 15:07:38,741 p=23844 u=mistral |  PLAY RECAP *********************************************************************
2018-10-03 15:07:38,741 p=23844 u=mistral |  ceph-0                     : ok=124  changed=29   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  ceph-1                     : ok=124  changed=29   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  ceph-2                     : ok=124  changed=29   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  compute-2                  : ok=136  changed=59   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  controller-0               : ok=190  changed=33   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  controller-1               : ok=190  changed=33   unreachable=0    failed=0   
2018-10-03 15:07:38,742 p=23844 u=mistral |  controller-2               : ok=190  changed=33   unreachable=0    failed=0   
2018-10-03 15:07:38,743 p=23844 u=mistral |  undercloud                 : ok=4    changed=0    unreachable=0    failed=1   
2018-10-03 15:07:38,743 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:38 -0400 (0:00:01.798)       0:11:33.935 ***** 
2018-10-03 15:07:38,743 p=23844 u=mistral |  =============================================================================== 
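
When the log is long, the PLAY RECAP block is the quickest way to see which host failed. The following is a minimal Python sketch, not part of the original report, that scans ansible.log lines for recap entries and lists hosts with a nonzero failed or unreachable count. The regular expression is an assumption based only on the line layout shown above, and the function name failed_hosts is hypothetical.

```python
import re

# Matches a PLAY RECAP host line as it appears in the log above, e.g.
# "... u=mistral |  undercloud   : ok=4  changed=0  unreachable=0  failed=1"
# (pattern derived from this report's log layout; other Ansible versions may differ)
RECAP_RE = re.compile(
    r"\|\s+(?P<host>\S+)\s+:\s+ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)"
    r"\s+unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def failed_hosts(log_lines):
    """Return the hosts whose recap line reports failed > 0 or unreachable > 0."""
    bad = []
    for line in log_lines:
        match = RECAP_RE.search(line)
        if match and (int(match["failed"]) > 0 or int(match["unreachable"]) > 0):
            bad.append(match["host"])
    return bad


# Two recap lines copied from the log above.
sample = [
    "2018-10-03 15:07:38,742 p=23844 u=mistral |  compute-2                  : ok=136  changed=59   unreachable=0    failed=0   ",
    "2018-10-03 15:07:38,743 p=23844 u=mistral |  undercloud                 : ok=4    changed=0    unreachable=0    failed=1   ",
]
print(failed_hosts(sample))
```

For this report, only the undercloud is flagged, which matches the fatal line in the generate inventory task.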


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.0-0.20180919080941.0rc1.0rc1.el7ost.noarch
openstack-tripleo-common-9.3.1-0.20180923215325.d22cb3e.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy a splitstack environment with 3 controller, 2 compute, and 3 Ceph nodes.

2. Create blacklist.yaml:
parameter_defaults:
  DeploymentServerBlacklist:
    - compute-0
    - compute-1

3. Set ComputeDeployedServerCount: 3 (scaling out from 2 to 3 compute nodes)

4. Run the overcloud deploy command with blacklist.yaml:

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--overcloud-ssh-user stack \
--disable-validation \
-r /home/stack/composable_roles/roles/roles_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-bootstrap-environment-rhel.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-pacemaker-environment.yaml \
-e /home/stack/composable_roles/network-config.yaml \
-e /home/stack/composable_roles/ctrlplane-template.yml \
-e /home/stack/composable_roles/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/composable_roles/roles-port-config.yml \
-e /home/stack/composable_roles/network/network-environment.yaml \
-e /home/stack/composable_roles/enable-tls.yaml \
-e /home/stack/composable_roles/inject-trust-anchor.yaml \
-e /home/stack/composable_roles/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/composable_roles/debug.yaml \
-e /home/stack/blacklist.yaml \
-e /home/stack/composable_roles/docker-images.yaml \
--log-file overcloud_deployment_64.log
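
As a rough mental model of the steps above, the blacklist can be pictured as a filter over the overcloud hosts. The Python sketch below is a simplification and not how tripleo-heat-templates implements DeploymentServerBlacklist. The host names are taken from the log in this report, and the function name hosts_to_configure is hypothetical.

```python
def hosts_to_configure(all_hosts, blacklist):
    """Return the hosts that remain configuration targets after the blacklist is applied."""
    blocked = set(blacklist)
    return [host for host in all_hosts if host not in blocked]


# Hosts after scaling out to 3 computes (names as they appear in the log).
overcloud_hosts = [
    "controller-0", "controller-1", "controller-2",
    "compute-0", "compute-1", "compute-2",
    "ceph-0", "ceph-1", "ceph-2",
]

# Same entries as DeploymentServerBlacklist in blacklist.yaml.
deployment_server_blacklist = ["compute-0", "compute-1"]

remaining = hosts_to_configure(overcloud_hosts, deployment_server_blacklist)
print(remaining)
```

The remaining hosts are the same overcloud nodes that appear in the PLAY RECAP of the failed run: compute-0 and compute-1 are absent, while the new compute-2 is configured.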


Actual results:
The Ansible run fails at the generate inventory task:

2018-10-03 15:07:36,367 p=23844 u=mistral |  TASK [create ceph-ansible temp dirs] *******************************************
2018-10-03 15:07:36,368 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:36 -0400 (0:00:00.073)       0:11:31.560 ***** 
2018-10-03 15:07:36,582 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/group_vars) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/group_vars", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/group_vars", "size": 88, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,758 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/host_vars) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/host_vars", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/host_vars", "size": 174, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,924 p=23844 u=mistral |  ok: [undercloud] => (item=/var/lib/mistral/overcloud/ceph-ansible/fetch_dir) => {"changed": false, "gid": 42430, "group": "mistral", "item": "/var/lib/mistral/overcloud/ceph-ansible/fetch_dir", "mode": "0755", "owner": "mistral", "path": "/var/lib/mistral/overcloud/ceph-ansible/fetch_dir", "size": 80, "state": "directory", "uid": 42430}
2018-10-03 15:07:36,944 p=23844 u=mistral |  TASK [generate inventory] ******************************************************
2018-10-03 15:07:36,944 p=23844 u=mistral |  Wednesday 03 October 2018  15:07:36 -0400 (0:00:00.576)       0:11:32.136 ***** 
2018-10-03 15:07:38,740 p=23844 u=mistral |  fatal: [undercloud]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'ansible_hostname'\n\nThe error appears to have been in '/var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml': line 15, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n    - '{{playbook_dir}}/ceph-ansible/fetch_dir'\n  - copy:\n    ^ here\n"}


Expected results:
The scale-out completes without failure.

Additional info:
Attaching /var/lib/mistral and the home directory with the templates.

Comment 2 Marius Cornea 2018-10-03 20:10:13 UTC
Moving it to the Ceph DFG since it looks like a problem with ceph-ansible when a blacklist is used.

Comment 7 John Fulton 2018-10-10 12:37:02 UTC
(In reply to Marius Cornea from comment #0)
> Scaling out a splitstack environment with blacklisted nodes fails while
> running External deployment step 1:
...
> 2018-10-03 15:07:36,944 p=23844 u=mistral |  TASK [generate inventory]
> ******************************************************
> 2018-10-03 15:07:36,944 p=23844 u=mistral |  Wednesday 03 October 2018 
> 15:07:36 -0400 (0:00:00.576)       0:11:32.136 ***** 
> 2018-10-03 15:07:38,740 p=23844 u=mistral |  fatal: [undercloud]: FAILED! =>
> {"msg": "The task includes an option with an undefined variable. The error
> was: 'dict object' has no attribute 'ansible_hostname'\n\nThe error appears
> to have been in
> '/var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml': line 15,
> column 5, but may\nbe elsewhere in the file depending on the exact syntax
> problem.\n\nThe offending line appears to be:\n\n    -
> '{{playbook_dir}}/ceph-ansible/fetch_dir'\n  - copy:\n    ^ here\n"}

It looks like this Ansible code embedded in tripleo-heat-templates:

https://github.com/openstack/tripleo-heat-templates/blob/stable/rocky/docker/services/ceph-ansible/ceph-base.yaml#L379-L399

must not access hostvars.raw_get(host)['ansible_hostname'] unless that variable is set.

  John

PS: The generated Ansible task was:

  - copy:
      content: "{%- set ceph_groups = ['mgr', 'mon', 'osd', 'mds', 'rgw', 'nfs', 'rbdmirror',\
        \ 'client'] -%}\n{%- for ceph_group in ceph_groups -%}\n{%- if 'ceph_' ~ ceph_group\
        \ in groups %}\n\n{{ ceph_group ~ 's:' }}\n  hosts:\n    {% for host in groups['ceph_'\
        \ ~ ceph_group] -%}\n    {%- if hostvars.raw_get(host)['ansible_hostname']\
        \ not in blacklisted_hostnames -%}\n    {{ hostvars.raw_get(host)['ansible_hostname']\
        \ }}:\n      ansible_user: {{ hostvars.raw_get(host)['ansible_ssh_user'] |\
        \ default('root') }}\n      ansible_host: {{ hostvars.raw_get(host)['ansible_host']\
        \ | default(host) }}\n      ansible_become: true\n    {% endif -%}\n    {%-\
        \ endfor -%}\n\n{%- endif -%}\n{%- endfor %}\n"
      dest: '{{playbook_dir}}/ceph-ansible/inventory.yml'
    name: generate inventory
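
The guard suggested above can be illustrated outside of Jinja. The Python sketch below mirrors the loop in the generated template and skips any host whose variables lack ansible_hostname instead of failing on it. It is an illustration only and not the patch that was merged. The assumption that blacklisted hosts have no gathered facts (and therefore no ansible_hostname) is a guess at the cause, and the function name build_inventory and the sample data are hypothetical.

```python
CEPH_GROUPS = ["mgr", "mon", "osd", "mds", "rgw", "nfs", "rbdmirror", "client"]


def build_inventory(groups, hostvars, blacklisted_hostnames):
    """Build a ceph-ansible style inventory dict, tolerating hosts without facts."""
    inventory = {}
    for ceph_group in CEPH_GROUPS:
        members = groups.get("ceph_" + ceph_group)
        if members is None:
            continue  # group not present in this deployment
        hosts = {}
        for host in members:
            facts = hostvars.get(host, {})
            name = facts.get("ansible_hostname")
            # Guard: a host with no ansible_hostname is skipped rather than
            # raising, which is where the original template failed.
            if name is None or name in blacklisted_hostnames:
                continue
            hosts[name] = {
                "ansible_user": facts.get("ansible_ssh_user", "root"),
                "ansible_host": facts.get("ansible_host", host),
                "ansible_become": True,
            }
        inventory[ceph_group + "s"] = {"hosts": hosts}
    return inventory


# Hypothetical data: compute-0 is blacklisted and has no facts at all.
groups = {"ceph_osd": ["ceph-0"], "ceph_client": ["compute-0", "compute-2"]}
hostvars = {
    "ceph-0": {"ansible_hostname": "ceph-0"},
    "compute-0": {},
    "compute-2": {"ansible_hostname": "compute-2"},
}
inventory = build_inventory(groups, hostvars, ["compute-0", "compute-1"])
print(sorted(inventory["clients"]["hosts"]))
```

In a Jinja template the equivalent check would test whether the variable is defined before comparing it against blacklisted_hostnames.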

Comment 13 Jim Bagwell 2018-11-26 18:20:05 UTC
So what was the actual fix? I'm still encountering this issue with RPM version openstack-tripleo-heat-templates-9.0.1-0.20181013060858.ffbe879.el7.noarch.

Comment 14 John Fulton 2018-11-26 19:07:19 UTC
(In reply to Jim Bagwell from comment #13)
> So what was the actual fix? Im encountering this issue with rpm version
> openstack-tripleo-heat-templates-9.0.1-0.20181013060858.ffbe879.el7.noarch
> still

This bug produced the fix https://review.openstack.org/#/c/609682, but there was also a follow-up bug and fix regarding blacklisting, which might explain what you're encountering even though you have the first fix:

https://github.com/openstack/tripleo-heat-templates/commit/6c17d0f1c3c8c28c401dae56d88c6dd4075bf046#diff-f84644f1b5951f535bc7c3f4151a9a90

Comment 15 John Fulton 2018-11-26 20:08:03 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1639038

Comment 17 errata-xmlrpc 2019-01-11 11:53:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045