Bug 1479078

Summary: rolling upgrade issues with satellite installation
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.3
CC: adeza, aschoen, ceph-eng-bugs, gmeno, jbautist, nthomas, sankarshan
Target Milestone: rc
Target Release: 2.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-28 21:41:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Vikhyat Umrao 2017-08-07 21:53:12 UTC
rolling upgrade issues with satellite installation 

#Issue-1

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/installation_guide_for_red_hat_enterprise_linux/#upgrading_between_minor_versions_and_applying_asynchronous_updates

5.2. Upgrading Between Minor Versions and Applying Asynchronous Updates

We have opened a doc bug, https://bugzilla.redhat.com/show_bug.cgi?id=1479074, because this section does not mention that the customer first needs to upgrade the Ansible administration node.

Because of this, the customer ran the rolling upgrade without upgrading ceph-ansible (the version in place was the one we shipped with Red Hat Ceph Storage 2.1).

The rolling upgrade playbook itself completed, but it:

- listed one OSD node as failed, even though we verified that the OSDs on that node have the latest RPM installed and are running the latest in-memory version (see the verification commands sketched below)
- did not unset the nodeep-scrub, noout and noscrub flags
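
For reference, a minimal sketch of the kind of checks used to verify this (the OSD id is illustrative):

# on the "failed" OSD node: installed package version
rpm -q ceph-osd
# running (in-memory) version of an OSD hosted on that node
ceph tell osd.0 version
# flags still set cluster-wide
ceph osd dump | grep flags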

We started troubleshooting the issue:

First, we upgraded to the latest ceph-ansible:

ansible-2.2.3.0-1.el7.noarch
ceph-ansible-2.2.11-1.el7scon.noarch
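
For reference, the administration node upgrade itself is just a yum update of the two packages (the repositories here come from the Satellite content views, so the repo enablement step may differ per environment):

# on the Ansible administration node
yum update ansible ceph-ansible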

#Issue-2

As soon as we upgraded to the latest ceph-ansible, the playbook started failing, but only on the first monitor node, with the following error:

The following was set in all.yml because of the Satellite installation:
ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT
ceph_stable_rh_storage: true


2017-08-07 16:55:00,788 p=4873 u=root |  TASK [ceph-common : verify that a method was chosen for red hat storage] *******

2017-08-07 16:55:00,817 p=4873 u=root |  fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose between ceph_rhcs_cdn_install and ceph_rhcs_iso_install"}
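
To see which variables this check actually expects, one can grep for the failing task directly in the installed playbooks (the task name is taken from the log line above; paths match ceph-ansible 2.2.x under /usr/share/ceph-ansible):

cd /usr/share/ceph-ansible
grep -n -A 10 'method was chosen for red hat storage' roles/ceph-common/tasks/checks/check_mandatory_vars.yml
grep -n -E 'ceph_origin|ceph_rhcs|ceph_stable_rh_storage' group_vars/all.yml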

#Issue-3

The same ceph_origin setting was in place in all.yml because of the Satellite installation:
ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT

and the following lines were commented out:
#ceph_stable_rh_storage: true
#ceph_rhcs_cdn_install: false 

2017-08-07 17:13:27,230 p=7053 u=root |  fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose an upstream installation source or read https://github.com/ceph/ceph-ansible/wiki"}


We manually unset the nodeep-scrub, noscrub and noout flags.
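
For reference, these are the standard flag-unset commands:

ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub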

Comment 5 Guillaume Abrioux 2017-08-09 20:45:26 UTC
From what I have seen:

The old rolling_update.yml is still being used, and it looks for a group_vars/all file:

  pre_tasks:
    - include_vars: roles/ceph-common/defaults/main.yml
    - include_vars: roles/ceph-mon/defaults/main.yml
    - include_vars: roles/ceph-restapi/defaults/main.yml
    - include_vars: group_vars/all
      failed_when: false
    - include_vars: group_vars/{{ mon_group_name }}
      failed_when: false
    - include_vars: group_vars/{{ restapi_group_name }}
      failed_when: false


but group_vars/all doesn't exist:

$~/Downloads/debug_bz/usr/share/ceph-ansible$ ls group_vars/all
ls: group_vars/all: No such file or directory
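
whereas the .yml variant does exist (as noted at the end of this comment):

$ ls group_vars/all.yml
group_vars/all.yml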

Therefore it takes the default value for ceph_origin, which is:

roles/ceph-common/defaults/main.yml:83:ceph_origin: 'upstream' # or 'distro' or 'local'

It then ends up with the error we can see in the logs because it enters this condition:

https://github.com/ceph/ceph-ansible/blob/v2.2.11/roles/ceph-common/tasks/checks/check_mandatory_vars.yml#L12-L23

I think running the new version of rolling_update.yml should fix this error, since it will look for group_vars/all.yml (which actually exists) and set ceph_origin to 'distro' as expected.
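
A minimal sketch of that, assuming the usual Red Hat Ceph Storage 2 layout where the updated playbook ships under infrastructure-playbooks/:

cd /usr/share/ceph-ansible
# copy the current playbook next to group_vars/ and run it from there
cp infrastructure-playbooks/rolling_update.yml .
ansible-playbook rolling_update.yml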