Bug 1719295

Summary: HCI + Fast Forward Upgrade: converge step always fails at WorkflowTasks_Step2_Execution
Product: Red Hat OpenStack
Component: ceph-ansible
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Assignee: Giulio Fidente <gfidente>
QA Contact: Yogev Rabl <yrabl>
Reporter: Punit Kundal <pkundal>
CC: fpantano, gfidente, johfulto
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2019-06-26 12:55:04 UTC
Attachments:
  required logs
  ansible-mistral run files
  templates used with the upgrade

Description Punit Kundal 2019-06-11 12:57:27 UTC
Created attachment 1579362 [details]
required logs

Description of problem:

I am trying to run a fast forward upgrade on an HCI deployment that was originally deployed with RHOSP 10.

I've already upgraded:

- the Controller and Compute services to RHOSP 13
- the RHCS cluster from Ceph 2 to Ceph 3

following the steps in the guide at [1].

I used the following command to upgrade my Ceph cluster from RHCS 2 (as deployed with director during the RHOSP 10 deployment) to RHCS 3:

(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ceph-upgrade run --templates /usr/share/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e ~/templates/ips-from-pool-all.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -r ~/templates/roles_data.yaml -e ~/templates/scheduler_hints_env.yaml -e ~/templates/custom_repositories_script.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e ~/templates/storage-environment.yaml -e ~/templates/extra-configs-upgrade.yaml -e ~/templates/overcloud_images.yaml -e ~/templates/node-info.yaml --ceph-ansible-playbook '/usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml,/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml' | tee ceph-upgrade-1.log 


This command completed successfully and my Ceph cluster was upgraded; here is some output from a controller/mon node:


[root@overcloud-ctrl-0 ~]# ceph -s
  cluster:
    id:     91d282cd-1eb2-4bce-96f9-597b7f728df1
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum overcloud-ctrl-0,overcloud-ctrl-1,overcloud-ctrl-2
    mgr: overcloud-ctrl-0(active), standbys: overcloud-ctrl-2, overcloud-ctrl-1
    osd: 5 osds: 5 up, 5 in
 
  data:
    pools:   6 pools, 189 pgs
    objects: 0 objects, 0B
    usage:   547MiB used, 189GiB / 190GiB avail
    pgs:     189 active+clean
 
[root@overcloud-ctrl-0 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                      STATUS REWEIGHT PRI-AFF 
-1       0.18547 root default                                           
-2       0.11128     host overcloud-ceph-cmpt-0                         
 0       0.03709         osd.0                      up  1.00000 1.00000 
 1       0.03709         osd.1                      up  1.00000 1.00000 
 2       0.03709         osd.2                      up  1.00000 1.00000 
-3       0.07419     host overcloud-ceph-cmpt-1                         
 3       0.03709         osd.3                      up  1.00000 1.00000 
 4       0.03709         osd.4                      up  1.00000 1.00000 
[root@overcloud-ctrl-0 ~]# 

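For completeness, a couple of extra checks that should confirm the daemons are actually on the RHCS 3 (luminous) binaries and containerized; I have not pasted their output here, and they assume the luminous CLI and docker are available on the mon node:

    ceph versions                    # per-daemon version report; everything should be 12.2.x (luminous)
    docker ps --filter name=ceph     # mon/mgr containers on the controllers (the OSDs live on the HCI compute nodes)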

Just for reference, I will attach ceph-upgrade-1.log to the bugzilla.

Now I am at the converge step as per the link at [2].

For this stage, the command that I am running is:


(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ffwd-upgrade converge --templates /usr/share/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e ~/templates/ips-from-pool-all.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -r ~/templates/roles_data.yaml -e ~/templates/scheduler_hints_env.yaml -e ~/templates/custom_repositories_script.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e ~/templates/storage-environment.yaml -e ~/templates/extra-configs-upgrade.yaml -e ~/templates/overcloud_images.yaml -e ~/templates/node-info.yaml --yes | tee upgrade-converge-1.log

This step always fails with:

(undercloud) [stack@undercloud-10 ~]$ openstack stack failures list overcloud 
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: c392f6cc-e462-42b8-8f00-4905112f0920
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: Failure caused by error in tasks: ceph_base_ansible_workflow
    
      ceph_base_ansible_workflow [task_ex_id=62e46f17-c447-4cf2-86bf-7be533313cb2] -> Failure caused by error in tasks: ceph_install
    
      ceph_install [task_ex_id=0d67a37a-6568-4b17-aaa8-50ca3d20048a] -> One or more actions had failed.
.....

(omitted for brevity....full output will be attached)
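For whoever picks this up: the failing task execution can be dug into further on the undercloud using the task_ex_id values from the output above and the mistral workflow log. The client syntax below is from memory of python-mistralclient's OSC plugin, so treat it as a pointer rather than a verified command line:

    openstack task execution show 0d67a37a-6568-4b17-aaa8-50ca3d20048a   # the failed ceph_install task
    sudo tail -n 200 /var/log/mistral/ceph-install-workflow.log          # the underlying ansible output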


Here's the ansible-playbook command which runs for this step:

Command: ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg

Looking inside that directory, I can see the wrapper script that runs the playbook:

[root@undercloud-10 ~]# cat /tmp/ansible-mistral-actionHi9KFw/ansible-playbook-command.sh 
#!/bin/bash

PROFILE_TASKS_TASK_OUTPUT_LIMIT="0"
ANSIBLE_RETRY_FILES_ENABLED="False"
ANSIBLE_CONFIG="/usr/share/ceph-ansible/ansible.cfg"
ANSIBLE_LOG_PATH="/var/log/mistral/ceph-install-workflow.log"
DEFAULT_FORKS="25"
ANSIBLE_LIBRARY="/usr/share/ceph-ansible/library/"
ANSIBLE_HOST_KEY_CHECKING="False"
ANSIBLE_ROLES_PATH="/usr/share/ceph-ansible/roles/"
ANSIBLE_LOCAL_TEMP="/tmp/ansible-mistral-actionHi9KFw"
HOME="/tmp/ansible-mistral-actionHi9KFw"
ANSIBLE_CALLBACK_WHITELIST="profile_tasks"
ANSIBLE_SSH_RETRIES="3"
ANSIBLE_ACTION_PLUGINS="/usr/share/ceph-ansible/plugins/actions/"

ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg "$@"

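Because the generated wrapper ends in "$@", any extra ansible-playbook options get passed through, so the failure can be reproduced by hand with more verbosity against just the failing host (assuming the /tmp/ansible-mistral-actionHi9KFw/ directory still exists):

    sudo bash /tmp/ansible-mistral-actionHi9KFw/ansible-playbook-command.sh -vvv --limit 192.168.24.8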

So /var/log/mistral/ceph-install-workflow.log points to the cause of the failure:

12972 2019-06-11 17:20:00,869 p=27741 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

So it's trying to check for the ceph.keyring file under the /var/lib/ceph/bootstrap-rgw directory on the controller nodes, but it does not exist:


[root@overcloud-ctrl-0 ~]# ls -lR /var/lib/ceph/bootstrap-*
/var/lib/ceph/bootstrap-mds:
total 4
-rw-------. 1 ceph ceph 71 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-osd:
total 4
-rw-------. 1 ceph ceph 113 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-rbd:
total 4
-rw-------. 1 ceph ceph 113 Jun 11 11:55 ceph.keyring

/var/lib/ceph/bootstrap-rgw:
total 0

Note that I am not using the ceph-rgw service at all; I only use the ceph-mon, ceph-osd and ceph-mgr services. This environment is an upgrade from RHOSP 10 to RHOSP 13, and I did not configure the ceph-rgw service at deployment time, nor do I want it configured now.
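To make the failure mode clearer, below is my own minimal reconstruction of the pattern the log suggests: a stat of each staged bootstrap keyring delegated to localhost (not allowed to fail on its own), followed by a copy that loops over the keyring paths together with the stat results. This is illustrative only, not the actual ceph-ansible task; the fetch directory and keyring list are taken from the log above, everything else is assumed:

+++
# Illustrative reconstruction only -- NOT the actual ceph-ansible task.
# fetch_dir and the keyring list come from the log; the task layout is assumed.
- hosts: mons
  gather_facts: false
  vars:
    fetch_dir: /tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1/
    bootstrap_keyrings:
      - /var/lib/ceph/bootstrap-mds/ceph.keyring
      - /var/lib/ceph/bootstrap-osd/ceph.keyring
      - /var/lib/ceph/bootstrap-rbd/ceph.keyring
      - /var/lib/ceph/bootstrap-rgw/ceph.keyring    # never created on this cluster
  tasks:
    - name: stat the staged keyrings on the ansible host
      stat:
        path: "{{ fetch_dir }}{{ item }}"
      delegate_to: localhost
      failed_when: false                # the stat itself never fails the play
      register: staged_keys
      with_items: "{{ bootstrap_keyrings }}"

    - name: copy the staged keyrings out to the node
      copy:
        src: "{{ fetch_dir }}{{ item.0 }}"
        dest: "{{ item.0 }}"
        owner: ceph
        group: ceph
        mode: "0600"
      # this is where it dies: the bootstrap-rgw source was never staged,
      # so copy aborts with "file not found" even though the rgws group is empty
      with_together:
        - "{{ bootstrap_keyrings }}"
        - "{{ staged_keys.results }}"
+++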

So far, I've run the converge step twice and it has failed in both attempts with the same issue; here's a log trace from a different timestamp:

11986 2019-06-11 11:55:09,778 p=14286 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

I've also checked the inventory file under /tmp/ansible-mistral-actionHi9KFw/ and I can see that there are no hosts configured for the rgw service, which is expected:

+++
[root@undercloud-10 ansible-mistral-action0ZZyC0]# cat inventory.yaml | grep -i rgw -C2
    ceph_conf_overrides:
      global: {osd_pool_default_min_size: 1, osd_pool_default_pg_num: 25, osd_pool_default_pgp_num: 25,
        osd_pool_default_size: 2, rgw_keystone_accepted_roles: 'Member, admin', rgw_keystone_admin_domain: default,
        rgw_keystone_admin_password: d8hwT3XKp4RCKPnH7hG34wZYT, rgw_keystone_admin_project: service,
        rgw_keystone_admin_user: swift, rgw_keystone_api_version: 3, rgw_keystone_implicit_tenants: 'true',
        rgw_keystone_revocation_interval: '0', rgw_keystone_url: 'http://172.168.20.20:5000',
        rgw_s3_auth_use_keystone: 'true'}
    ceph_docker_image: rhceph/rhceph-3-rhel7
    ceph_docker_image_tag: 3-27
--
rbdmirrors:
  hosts: {}
rgws:
  hosts: {}
+++

To work around this, I tried adding the following:

[root@undercloud-10 ~]# cat /home/stack/templates/storage-environment.yaml | grep -i heat
## A Heat environment file which can be used to set up storage
  OS::TripleO::Services::CephMgr: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mgr.yaml
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-osd.yaml
  OS::TripleO::Services::CephClient: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-client.yaml
  OS::TripleO::Services::CephRgw: OS::Heat::None   << disables the ceph-rgw service completely, but this didn't help either
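For context, those overrides live under the resource_registry section of the environment file, so the relevant part of storage-environment.yaml looks roughly like this (reconstructed from the grep above; the rest of the file, e.g. parameter_defaults, is omitted):

resource_registry:
  OS::TripleO::Services::CephMgr: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mgr.yaml
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-osd.yaml
  OS::TripleO::Services::CephClient: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-client.yaml
  OS::TripleO::Services::CephRgw: OS::Heat::None    # map the rgw service to a no-op, i.e. do not deploy ceph-rgw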


So the question is: if the ceph-rgw service is not even running, why does ceph-ansible check for the existence of the ceph.keyring file inside the rgw bootstrap directory?

Attaching to the bugzilla:

1. Templates used for the upgrade
2. ceph-upgrade-1.log
3. upgrade-converge-1.log
4. ceph-install-workflow.log
5. tarball of /tmp/ansible-mistral-actionHi9KFw/ for all the required files
6. Full output of openstack stack failures list overcloud --long

If needed, I can attach my templates from the original RHOSP 10 deployment; let me know if those are required.


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud#finalizing_the_fast_forward_upgrade



Comment 1 Punit Kundal 2019-06-11 12:58:12 UTC
Created attachment 1579363 [details]
ansible-mistral run files

Comment 2 Punit Kundal 2019-06-11 12:58:42 UTC
Created attachment 1579365 [details]
templates used with the upgrade

Comment 9 Red Hat Bugzilla 2023-09-14 05:30:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days