Bug 1719295 - HCI + Fast Forward Upgrade: converge step always fails at WorkflowTasks_Step2_Execution
Summary: HCI + Fast Forward Upgrade: converge step always fails at WorkflowTasks_Step2_Execution
Keywords:
Status: CLOSED DUPLICATE of bug 1697860
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph-ansible
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-11 12:57 UTC by Punit Kundal
Modified: 2023-09-14 05:30 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-26 12:55:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
required logs (476.42 KB, application/gzip)
2019-06-11 12:57 UTC, Punit Kundal
no flags Details
ansible-mistral run files (52.70 KB, application/gzip)
2019-06-11 12:58 UTC, Punit Kundal
no flags Details
templates used with the upgrade (5.87 KB, application/gzip)
2019-06-11 12:58 UTC, Punit Kundal
no flags Details

Description Punit Kundal 2019-06-11 12:57:27 UTC
Created attachment 1579362 [details]
required logs

Description of problem:

I am trying to run a fast forward upgrade of an HCI deployment from RHOSP 10 to RHOSP 13.

I've already upgraded:

- the controller and compute services to RHOSP 13
- the RHCS cluster from Ceph 2 to Ceph 3

using the steps mentioned in the guide at [1].

The actual command that I used to upgrade my ceph cluster from RHCS 2 (as deployed with director during the RHOSP 10 deployment) to RHCS 3 was:

(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ceph-upgrade run \
  --templates /usr/share/openstack-tripleo-heat-templates/ \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/ips-from-pool-all.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  -r ~/templates/roles_data.yaml \
  -e ~/templates/scheduler_hints_env.yaml \
  -e ~/templates/custom_repositories_script.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e ~/templates/storage-environment.yaml \
  -e ~/templates/extra-configs-upgrade.yaml \
  -e ~/templates/overcloud_images.yaml \
  -e ~/templates/node-info.yaml \
  --ceph-ansible-playbook '/usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml,/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml' \
  | tee ceph-upgrade-1.log


This command went well and my ceph cluster was successfully upgraded; here is some output from a controller/mon node:


[root@overcloud-ctrl-0 ~]# ceph -s
  cluster:
    id:     91d282cd-1eb2-4bce-96f9-597b7f728df1
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum overcloud-ctrl-0,overcloud-ctrl-1,overcloud-ctrl-2
    mgr: overcloud-ctrl-0(active), standbys: overcloud-ctrl-2, overcloud-ctrl-1
    osd: 5 osds: 5 up, 5 in
 
  data:
    pools:   6 pools, 189 pgs
    objects: 0 objects, 0B
    usage:   547MiB used, 189GiB / 190GiB avail
    pgs:     189 active+clean
 
[root@overcloud-ctrl-0 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                      STATUS REWEIGHT PRI-AFF 
-1       0.18547 root default                                           
-2       0.11128     host overcloud-ceph-cmpt-0                         
 0       0.03709         osd.0                      up  1.00000 1.00000 
 1       0.03709         osd.1                      up  1.00000 1.00000 
 2       0.03709         osd.2                      up  1.00000 1.00000 
-3       0.07419     host overcloud-ceph-cmpt-1                         
 3       0.03709         osd.3                      up  1.00000 1.00000 
 4       0.03709         osd.4                      up  1.00000 1.00000 
[root@overcloud-ctrl-0 ~]# 


Just for reference, I will attach ceph-upgrade-1.log to the bugzilla.

Now I am at the converge step as per the link at [2].

For this stage, the command that I am running is:


(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ffwd-upgrade converge \
  --templates /usr/share/openstack-tripleo-heat-templates/ \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/ips-from-pool-all.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  -r ~/templates/roles_data.yaml \
  -e ~/templates/scheduler_hints_env.yaml \
  -e ~/templates/custom_repositories_script.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e ~/templates/storage-environment.yaml \
  -e ~/templates/extra-configs-upgrade.yaml \
  -e ~/templates/overcloud_images.yaml \
  -e ~/templates/node-info.yaml \
  --yes | tee upgrade-converge-1.log

This step always fails with:

(undercloud) [stack@undercloud-10 ~]$ openstack stack failures list overcloud 
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: c392f6cc-e462-42b8-8f00-4905112f0920
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: Failure caused by error in tasks: ceph_base_ansible_workflow
    
      ceph_base_ansible_workflow [task_ex_id=62e46f17-c447-4cf2-86bf-7be533313cb2] -> Failure caused by error in tasks: ceph_install
    
      ceph_install [task_ex_id=0d67a37a-6568-4b17-aaa8-50ca3d20048a] -> One or more actions had failed.
.....

(omitted for brevity....full output will be attached)


Here's the ansible-playbook command which runs for this step:

Command: ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg

Looking inside the directory, I can see the ansible-playbook wrapper script, which points at the log file:

[root@undercloud-10 ~]# cat /tmp/ansible-mistral-actionHi9KFw/ansible-playbook-command.sh 
#!/bin/bash

PROFILE_TASKS_TASK_OUTPUT_LIMIT="0"
ANSIBLE_RETRY_FILES_ENABLED="False"
ANSIBLE_CONFIG="/usr/share/ceph-ansible/ansible.cfg"
ANSIBLE_LOG_PATH="/var/log/mistral/ceph-install-workflow.log"
DEFAULT_FORKS="25"
ANSIBLE_LIBRARY="/usr/share/ceph-ansible/library/"
ANSIBLE_HOST_KEY_CHECKING="False"
ANSIBLE_ROLES_PATH="/usr/share/ceph-ansible/roles/"
ANSIBLE_LOCAL_TEMP="/tmp/ansible-mistral-actionHi9KFw"
HOME="/tmp/ansible-mistral-actionHi9KFw"
ANSIBLE_CALLBACK_WHITELIST="profile_tasks"
ANSIBLE_SSH_RETRIES="3"
ANSIBLE_ACTION_PLUGINS="/usr/share/ceph-ansible/plugins/actions/"

ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg "$@"


So /var/log/mistral/ceph-install-workflow.log points to the cause of the failure:

12972 2019-06-11 17:20:00,869 p=27741 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

So it's trying to check for the ceph.keyring file under the /var/lib/ceph/bootstrap-rgw directory on the controller nodes, but it does not exist:


[root@overcloud-ctrl-0 ~]# ls -lR /var/lib/ceph/bootstrap-*
/var/lib/ceph/bootstrap-mds:
total 4
-rw-------. 1 ceph ceph 71 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-osd:
total 4
-rw-------. 1 ceph ceph 113 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-rbd:
total 4
-rw-------. 1 ceph ceph 113 Jun 11 11:55 ceph.keyring

/var/lib/ceph/bootstrap-rgw:
total 0

Note that I am not using the ceph-rgw service at all; I am only using the ceph-mon, ceph-osd and ceph-mgr services. This environment is an upgrade from RHOSP 10 to RHOSP 13, and at the time of deployment I didn't configure the ceph-rgw service and I don't want it configured either.

So far, I've tried running the converge step twice and it has failed in both attempts with the same issue; here's a log trace from a different timestamp:

11986 2019-06-11 11:55:09,778 p=14286 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

I've also checked the inventory file under /tmp/ansible-mistral-actionHi9KFw/ and I can see that there are no hosts configured for the rgw service, which is expected:

+++
[root@undercloud-10 ansible-mistral-action0ZZyC0]# cat inventory.yaml | grep -i rgw -C2
    ceph_conf_overrides:
      global: {osd_pool_default_min_size: 1, osd_pool_default_pg_num: 25, osd_pool_default_pgp_num: 25,
        osd_pool_default_size: 2, rgw_keystone_accepted_roles: 'Member, admin', rgw_keystone_admin_domain: default,
        rgw_keystone_admin_password: d8hwT3XKp4RCKPnH7hG34wZYT, rgw_keystone_admin_project: service,
        rgw_keystone_admin_user: swift, rgw_keystone_api_version: 3, rgw_keystone_implicit_tenants: 'true',
        rgw_keystone_revocation_interval: '0', rgw_keystone_url: 'http://172.168.20.20:5000',
        rgw_s3_auth_use_keystone: 'true'}
    ceph_docker_image: rhceph/rhceph-3-rhel7
    ceph_docker_image_tag: 3-27
--
rbdmirrors:
  hosts: {}
rgws:
  hosts: {}
+++

To circumvent this, I tried adding the below to my storage environment file:

[root@undercloud-10 ~]# cat /home/stack/templates/storage-environment.yaml | grep -i heat
## A Heat environment file which can be used to set up storage
  OS::TripleO::Services::CephMgr: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mgr.yaml
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-osd.yaml
  OS::TripleO::Services::CephClient: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-client.yaml
  OS::TripleO::Services::CephRgw: OS::Heat::None   << disables the ceph-rgw service completely, but this didn't help either
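
For reference, here is a sketch of how that mapping sits in the storage environment file; the resource_registry section is inferred from the grep output above, and any parameter_defaults the file also carries are omitted:

resource_registry:
  OS::TripleO::Services::CephMgr: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mgr.yaml
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-osd.yaml
  OS::TripleO::Services::CephClient: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-client.yaml
  # mapping the service to OS::Heat::None is the usual way to disable it,
  # yet the converge step still fails on the rgw bootstrap keyring
  OS::TripleO::Services::CephRgw: OS::Heat::None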


So the question is: if the ceph-rgw service is not even running, why should ceph-ansible check for the existence of the ceph.keyring file inside the rgw bootstrap directory?
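
For illustration only (this is not the actual ceph-ansible source, and the fetch_dir/fsid variable names are made up for the example), here is a rough Ansible sketch of the kind of stat-then-copy loop that would produce exactly this failure whenever one of the bootstrap keyrings was never created on the monitors:

# hypothetical sketch, not ceph-ansible's real tasks
- name: stat the bootstrap keyrings previously fetched from the first monitor
  stat:
    path: "{{ fetch_dir }}/{{ fsid }}{{ item }}"
  delegate_to: localhost
  register: keyring_stats
  failed_when: false
  with_items:
    - /var/lib/ceph/bootstrap-mds/ceph.keyring
    - /var/lib/ceph/bootstrap-osd/ceph.keyring
    - /var/lib/ceph/bootstrap-rbd/ceph.keyring
    - /var/lib/ceph/bootstrap-rgw/ceph.keyring

- name: copy the fetched bootstrap keyrings onto the other nodes
  copy:
    src: "{{ fetch_dir }}/{{ fsid }}{{ item.0 }}"
    dest: "{{ item.0 }}"
    owner: ceph
    group: ceph
    mode: '0600'
  # looping over every path unconditionally fails with "file not found" for
  # bootstrap-rgw; a guard like the commented 'when' below would skip it
  # when: item.1.stat.exists | bool
  with_together:
    - - /var/lib/ceph/bootstrap-mds/ceph.keyring
      - /var/lib/ceph/bootstrap-osd/ceph.keyring
      - /var/lib/ceph/bootstrap-rbd/ceph.keyring
      - /var/lib/ceph/bootstrap-rgw/ceph.keyring
    - "{{ keyring_stats.results }}"

If the rgw service really is meant to be skippable, a guard of that sort on the copy step looks like the missing piece, but that is just my reading of the failure.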

Attaching to the bugzilla:

1. Templates used for the upgrade
2. ceph-upgrade-1.log
3. upgrade-converge-1.log
4. ceph-install-workflow.log
5. tarball of /tmp/ansible-mistral-actionHi9KFw/ for all the required files
6. Full output of openstack stack failures list overcloud --long

If needed, I can attach my templates from the original RHOSP 10 deployment; let me know if those are required.


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud#finalizing_the_fast_forward_upgrade



Comment 1 Punit Kundal 2019-06-11 12:58:12 UTC
Created attachment 1579363 [details]
ansible-mistral run files

Comment 2 Punit Kundal 2019-06-11 12:58:42 UTC
Created attachment 1579365 [details]
templates used with the upgrade

Comment 9 Red Hat Bugzilla 2023-09-14 05:30:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

