Rolling upgrade issues with Satellite installation

#Issue-1

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/installation_guide_for_red_hat_enterprise_linux/#upgrading_between_minor_versions_and_applying_asynchronous_updates
5.2. Upgrading Between Minor Versions and Applying Asynchronous Updates

We have opened a doc bug: https://bugzilla.redhat.com/show_bug.cgi?id=1479074

This section does not mention that the customer first needs to upgrade ceph-ansible on the Ansible administration node. Because of this, the customer ran the rolling upgrade without upgrading ceph-ansible (the installed version was still the one we shipped with Red Hat Ceph Storage 2.1). The rolling upgrade playbook otherwise went fine, but:

- it listed one OSD node as failed, although we verified that the OSDs on this node were running the latest RPM and the latest in-memory version
- it did not unset the nodeep-scrub, noout and noscrub flags

We started troubleshooting the issue. First, we upgraded to the latest ceph-ansible:

  ansible-2.2.3.0-1.el7.noarch
  ceph-ansible-2.2.11-1.el7scon.noarch

#Issue-2

As soon as we upgraded to the latest ceph-ansible, the playbook started failing, only on the first monitor node, with the error below.

The following was set in all.yml because of the Satellite installation:

  ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT
  ceph_stable_rh_storage: true

The playbook failed with:

  2017-08-07 16:55:00,788 p=4873 u=root | TASK [ceph-common : verify that a method was chosen for red hat storage] *******
  2017-08-07 16:55:00,817 p=4873 u=root | fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose between ceph_rhcs_cdn_install and ceph_rhcs_iso_install"}

#Issue-3

ceph_origin was still set in all.yml because of the Satellite installation:

  ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT

with the following lines commented out:

  #ceph_stable_rh_storage: true
  #ceph_rhcs_cdn_install: false

This time the playbook failed with:

  2017-08-07 17:13:27,230 p=7053 u=root | fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose an upstream installation source or read https://github.com/ceph/ceph-ansible/wiki"}

We have manually unset the nodeep-scrub, noscrub and noout flags.
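For reference, a minimal sketch of how these flags can be cleared by hand with the standard ceph CLI, run from a monitor or admin node (the exact commands the customer used are not captured in this report):

  # clear the flags the interrupted rolling upgrade left behind
  ceph osd unset noout
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub

  # confirm the flags line no longer lists them
  ceph osd dump | grep flags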
From what I have seen: the old rolling_update.yml is still being used, and it looks for a group_vars/all file (pre_tasks, lines 76-85 of the playbook):

  pre_tasks:
    - include_vars: roles/ceph-common/defaults/main.yml
    - include_vars: roles/ceph-mon/defaults/main.yml
    - include_vars: roles/ceph-restapi/defaults/main.yml
    - include_vars: group_vars/all
      failed_when: false
    - include_vars: group_vars/{{ mon_group_name }}
      failed_when: false
    - include_vars: group_vars/{{ restapi_group_name }}
      failed_when: false

but group_vars/all doesn't exist:

  $~/Downloads/debug_bz/usr/share/ceph-ansible$ ls group_vars/all
  ls: group_vars/all: No such file or directory

Therefore it takes the default value for ceph_origin, which is:

  roles/ceph-common/defaults/main.yml:83:ceph_origin: 'upstream' # or 'distro' or 'local'

It then ends up with the error we can see in the logs because it enters this condition:
https://github.com/ceph/ceph-ansible/blob/v2.2.11/roles/ceph-common/tasks/checks/check_mandatory_vars.yml#L12-L23

I think running the new version of rolling_update.yml should fix this error, since it looks for group_vars/all.yml (which actually exists) and will set ceph_origin to 'distro' as expected.
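Until the newer rolling_update.yml is in use, one possible workaround (my suggestion only, not something tested in this case) would be to give the old playbook the file name it expects by symlinking group_vars/all to the existing all.yml and re-running the upgrade. Paths assume the default /usr/share/ceph-ansible layout:

  cd /usr/share/ceph-ansible
  ls group_vars/                 # only all.yml is present, "all" is missing
  ln -s all.yml group_vars/all   # makes "include_vars: group_vars/all" resolve to all.yml

  # re-run the rolling upgrade with the same playbook path and inventory the
  # customer used (exact invocation not captured in this report)
  ansible-playbook rolling_update.yml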