Created attachment 1579362 [details]
required logs

Description of problem:

I am trying to run a fast forward upgrade of an HCI deployment from RHOSP 10. I have already upgraded:

- the Controller and Compute services to RHOSP 13
- the RHCS cluster from Ceph 2 to Ceph 3, following the steps in the guide at [1]

The command I used to upgrade the Ceph cluster from RHCS 2 (as deployed by director during the RHOSP 10 deployment) to RHCS 3 was:

(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ceph-upgrade run --templates /usr/share/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e ~/templates/ips-from-pool-all.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -r ~/templates/roles_data.yaml -e ~/templates/scheduler_hints_env.yaml -e ~/templates/custom_repositories_script.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e ~/templates/storage-environment.yaml -e ~/templates/extra-configs-upgrade.yaml -e ~/templates/overcloud_images.yaml -e ~/templates/node-info.yaml --ceph-ansible-playbook '/usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml,/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml' | tee ceph-upgrade-1.log

This command completed successfully and the Ceph cluster was upgraded. Here is the state reported from a controller/mon node:

[root@overcloud-ctrl-0 ~]# ceph -s
  cluster:
    id:     91d282cd-1eb2-4bce-96f9-597b7f728df1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum overcloud-ctrl-0,overcloud-ctrl-1,overcloud-ctrl-2
    mgr: overcloud-ctrl-0(active), standbys: overcloud-ctrl-2, overcloud-ctrl-1
    osd: 5 osds: 5 up, 5 in

  data:
    pools:   6 pools, 189 pgs
    objects: 0 objects, 0B
    usage:   547MiB used, 189GiB / 190GiB avail
    pgs:     189 active+clean

[root@overcloud-ctrl-0 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME                       STATUS REWEIGHT PRI-AFF
-1       0.18547 root default
-2       0.11128     host overcloud-ceph-cmpt-0
 0       0.03709         osd.0                       up  1.00000 1.00000
 1       0.03709         osd.1                       up  1.00000 1.00000
 2       0.03709         osd.2                       up  1.00000 1.00000
-3       0.07419     host overcloud-ceph-cmpt-1
 3       0.03709         osd.3                       up  1.00000 1.00000
 4       0.03709         osd.4                       up  1.00000 1.00000
[root@overcloud-ctrl-0 ~]#

For reference, I will attach ceph-upgrade-1.log to this bugzilla.

I am now at the converge step described at [2].
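(For completeness: before running converge, a couple of quick checks could confirm that every daemon is containerized and running RHCS 3 / luminous binaries. These commands are not from my run above, just suggested verification steps.)

# Suggested sanity checks on a controller/mon node (assumption: not captured in the attached logs):
[root@overcloud-ctrl-0 ~]# ceph versions
# every mon/mgr/osd entry should report a 12.2.x (luminous) build
[root@overcloud-ctrl-0 ~]# docker ps --filter name=ceph --format '{{.Names}}: {{.Image}}'
# the mon/mgr/osd containers should be using the same rhceph/rhceph-3-rhel7:3-27 image referenced in the generated inventory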
For this stage, the command I am running is:

(undercloud) [stack@undercloud-10 ~]$ openstack overcloud ffwd-upgrade converge --templates /usr/share/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e ~/templates/ips-from-pool-all.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -r ~/templates/roles_data.yaml -e ~/templates/scheduler_hints_env.yaml -e ~/templates/custom_repositories_script.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml -e ~/templates/storage-environment.yaml -e ~/templates/extra-configs-upgrade.yaml -e ~/templates/overcloud_images.yaml -e ~/templates/node-info.yaml --yes | tee upgrade-converge-1.log

This step always fails with:

(undercloud) [stack@undercloud-10 ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: c392f6cc-e462-42b8-8f00-4905112f0920
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: Failure caused by error in tasks: ceph_base_ansible_workflow
      ceph_base_ansible_workflow [task_ex_id=62e46f17-c447-4cf2-86bf-7be533313cb2] -> Failure caused by error in tasks: ceph_install
        ceph_install [task_ex_id=0d67a37a-6568-4b17-aaa8-50ca3d20048a] -> One or more actions had failed.
..... (omitted for brevity; the full output will be attached)

Here is the ansible-playbook command which runs for this step:

Command: ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg

Looking inside that directory, I can see the wrapper script that launches it:

[root@undercloud-10 ~]# cat /tmp/ansible-mistral-actionHi9KFw/ansible-playbook-command.sh
#!/bin/bash
PROFILE_TASKS_TASK_OUTPUT_LIMIT="0"
ANSIBLE_RETRY_FILES_ENABLED="False"
ANSIBLE_CONFIG="/usr/share/ceph-ansible/ansible.cfg"
ANSIBLE_LOG_PATH="/var/log/mistral/ceph-install-workflow.log"
DEFAULT_FORKS="25"
ANSIBLE_LIBRARY="/usr/share/ceph-ansible/library/"
ANSIBLE_HOST_KEY_CHECKING="False"
ANSIBLE_ROLES_PATH="/usr/share/ceph-ansible/roles/"
ANSIBLE_LOCAL_TEMP="/tmp/ansible-mistral-actionHi9KFw"
HOME="/tmp/ansible-mistral-actionHi9KFw"
ANSIBLE_CALLBACK_WHITELIST="profile_tasks"
ANSIBLE_SSH_RETRIES="3"
ANSIBLE_ACTION_PLUGINS="/usr/share/ceph-ansible/plugins/actions/"
ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionHi9KFw/inventory.yaml --private-key /tmp/ansible-mistral-actionHi9KFw/ssh_private_key --skip-tags package-install,with_pkg "$@"

/var/log/mistral/ceph-install-workflow.log points to the cause of the failure:
12972 2019-06-11 17:20:00,869 p=27741 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-actionNeRu4Y/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

So the task is looking for a ceph.keyring file under the /var/lib/ceph/bootstrap-rgw directory on the controller nodes, but that file does not exist:

[root@overcloud-ctrl-0 ~]# ls -lR /var/lib/ceph/bootstrap-*
/var/lib/ceph/bootstrap-mds:
total 4
-rw-------. 1 ceph ceph 71 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-osd:
total 4
-rw-------. 1 ceph ceph 113 Jun  9 01:28 ceph.keyring

/var/lib/ceph/bootstrap-rbd:
total 4
-rw-------. 1 ceph ceph 113 Jun 11 11:55 ceph.keyring

/var/lib/ceph/bootstrap-rgw:
total 0

Note that I am not using the ceph-rgw service at all; only ceph-mon, ceph-osd and ceph-mgr are in use. This environment is an upgrade from RHOSP 10 to RHOSP 13, ceph-rgw was never configured at deployment time, and I do not want it configured now either.
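My understanding (an assumption, I have not verified it against the role source) is that the failing "fetch keys" step in ceph-ansible iterates over a fixed list of bootstrap keyrings -- osd, mds, rbd and rgw -- regardless of which services are actually present in the inventory, which would explain why it trips over bootstrap-rgw even though no rgws are deployed here. Something along these lines can be checked on the undercloud:

# Locate the task(s) and defaults that reference the rgw bootstrap keyring (exact file names are my guess; the grep shows the real ones):
[root@undercloud-10 ~]# grep -rn "bootstrap-rgw" /usr/share/ceph-ansible/roles/ /usr/share/ceph-ansible/group_vars/
# Confirm that the generated inventory really contains no rgw hosts:
[root@undercloud-10 ~]# grep -A1 "rgws:" /tmp/ansible-mistral-actionHi9KFw/inventory.yaml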
So far I have run the converge step twice and it has failed with the same issue in both attempts; here is a log trace from a different timestamp:

11986 2019-06-11 11:55:09,778 p=14286 u=mistral |  failed: [192.168.24.8] (item=[u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', {'_ansible_parsed': True, u'stat': {u'exists': False}, '_ansible_item_result': True, '_ansible_no_log': False, '_ansible_delegated_vars': {'ansible_delegated_host': u'localhost', 'ansible_host': u'localhost'}, u'changed': False, 'failed': False, 'item': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring', u'invocation': {u'module_args': {u'checksum_algorithm': u'sha1', u'get_checksum': True, u'follow': False, u'path': u'/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring', u'get_md5': None, u'get_mime': True, u'get_attributes': True}}, 'failed_when_result': False, '_ansible_ignore_errors': None, '_ansible_item_label': u'/var/lib/ceph/bootstrap-rgw/ceph.keyring'}]) => {"changed": false, "item": ["/var/lib/ceph/bootstrap-rgw/ceph.keyring", {"_ansible_delegated_vars": {"ansible_delegated_host": "localhost", "ansible_host": "localhost"}, "_ansible_ignore_errors": null, "_ansible_item_label": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "failed": false, "failed_when_result": false, "invocation": {"module_args": {"checksum_algorithm": "sha1", "follow": false, "get_attributes": true, "get_checksum": true, "get_md5": null, "get_mime": true, "path": "/tmp/file-mistral-action4C4UPg/91d282cd-1eb2-4bce-96f9-597b7f728df1//var/lib/ceph/bootstrap-rgw/ceph.keyring"}}, "item": "/var/lib/ceph/bootstrap-rgw/ceph.keyring", "stat": {"exists": false}}], "msg": "file not found: /var/lib/ceph/bootstrap-rgw/ceph.keyring"}

I have also checked the generated inventory file and, as expected, there are no hosts configured for the rgw service:

+++
[root@undercloud-10 ansible-mistral-action0ZZyC0]# cat inventory.yaml | grep -i rgw -C2
  ceph_conf_overrides:
    global: {osd_pool_default_min_size: 1, osd_pool_default_pg_num: 25, osd_pool_default_pgp_num: 25, osd_pool_default_size: 2, rgw_keystone_accepted_roles: 'Member, admin', rgw_keystone_admin_domain: default, rgw_keystone_admin_password: d8hwT3XKp4RCKPnH7hG34wZYT, rgw_keystone_admin_project: service, rgw_keystone_admin_user: swift, rgw_keystone_api_version: 3, rgw_keystone_implicit_tenants: 'true', rgw_keystone_revocation_interval: '0', rgw_keystone_url: 'http://172.168.20.20:5000', rgw_s3_auth_use_keystone: 'true'}
  ceph_docker_image: rhceph/rhceph-3-rhel7
  ceph_docker_image_tag: 3-27
--
  rbdmirrors:
    hosts: {}
  rgws:
    hosts: {}
+++

To work around this, I tried mapping the CephRgw service to OS::Heat::None in my storage environment file to disable the ceph-rgw service completely:

[root@undercloud-10 ~]# cat /home/stack/templates/storage-environment.yaml | grep -i heat
## A Heat environment file which can be used to set up storage
  OS::TripleO::Services::CephMgr: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mgr.yaml
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-osd.yaml
  OS::TripleO::Services::CephClient: /usr/share/openstack-tripleo-heat-templates/docker/services/ceph-ansible/ceph-client.yaml
  OS::TripleO::Services::CephRgw: OS::Heat::None    << added to disable the ceph-rgw service completely

but this didn't help either.

So the question is: if the ceph-rgw service is not even deployed, why does ceph-ansible check for the existence of a ceph.keyring file inside the rgw bootstrap directory?

Attaching to the bugzilla:
1. Templates used for the upgrade
2. ceph-upgrade-1.log
3. upgrade-converge-1.log
4. ceph-install-workflow.log
5. Tarball of /tmp/ansible-mistral-actionHi9KFw/ with all the required files
6. Full output of "openstack stack failures list overcloud --long"

If needed, I can also attach my templates from the original RHOSP 10 deployment; let me know if those are required.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-upgrading_the_overcloud#finalizing_the_fast_forward_upgrade

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
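One possible workaround that I have NOT tried (treat it purely as an assumption on my part): pre-create the missing bootstrap-rgw keyring on each controller so that the fetch task finds something to copy. This assumes the cluster's auth database already contains a client.bootstrap-rgw key:

# Check whether the bootstrap-rgw key exists in the cluster auth database:
[root@overcloud-ctrl-0 ~]# ceph auth ls | grep bootstrap-rgw
# If it does, export it into the currently empty bootstrap-rgw directory (repeat on every mon):
[root@overcloud-ctrl-0 ~]# ceph auth get client.bootstrap-rgw -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
[root@overcloud-ctrl-0 ~]# chown ceph:ceph /var/lib/ceph/bootstrap-rgw/ceph.keyring
[root@overcloud-ctrl-0 ~]# chmod 600 /var/lib/ceph/bootstrap-rgw/ceph.keyring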
Created attachment 1579363 [details] ansible-mistral run files
Created attachment 1579365 [details] templates used with the upgrade