Bug 1479078

Summary: rolling upgrade issues with satellite installation
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.3
CC: adeza, aschoen, ceph-eng-bugs, gmeno, jbautist, nthomas, sankarshan
Target Milestone: rc
Target Release: 2.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-28 21:41:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Vikhyat Umrao 2017-08-07 21:53:12 UTC
rolling upgrade issues with satellite installation 

#Issue-1

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/installation_guide_for_red_hat_enterprise_linux/#upgrading_between_minor_versions_and_applying_asynchronous_updates

5.2. Upgrading Between Minor Versions and Applying Asynchronous Updates

We have opened a doc bug, https://bugzilla.redhat.com/show_bug.cgi?id=1479074, because this section does not mention that the customer first needs to upgrade the Ansible administration node.

Because of this, the customer ran the rolling upgrade without upgrading ceph-ansible (the version in place was the one we shipped with Red Hat Ceph Storage 2.1).

The rolling upgrade playbook itself completed, but it:

- listed one OSD node as failed, even though we verified that the OSDs on that node have the latest RPM installed and are running the latest in-memory version (see the verification commands sketched below)
- did not unset the nodeep-scrub, noout and noscrub flags
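
For reference, a minimal sketch of the kind of checks used to verify this (the OSD id is illustrative):

# on the "failed" OSD node: installed package version
rpm -q ceph-osd
# running (in-memory) version of an OSD hosted on that node
ceph tell osd.0 version
# flags still set cluster-wide
ceph osd dump | grep flags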

We started troubleshooting the issue:

First, we upgraded to the latest ceph-ansible:

ansible-2.2.3.0-1.el7.noarch
ceph-ansible-2.2.11-1.el7scon.noarch
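
For reference, the administration node upgrade itself is just a yum update of the two packages (the repositories here come from the Satellite content views, so the repo enablement step may differ per environment):

# on the Ansible administration node
yum update ansible ceph-ansible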

#Issue-2

As soon as we upgraded to the latest ceph-ansible, the playbook started failing, but only on the first monitor node, with the following error:

The following was set in all.yml because of the Satellite installation:
ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT
ceph_stable_rh_storage: true


2017-08-07 16:55:00,788 p=4873 u=root |  TASK [ceph-common : verify that a method was chosen for red hat storage] *******

2017-08-07 16:55:00,817 p=4873 u=root |  fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose between ceph_rhcs_cdn_install and ceph_rhcs_iso_install"}
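
To see which variables this check actually expects, one can grep for the failing task directly in the installed playbooks (the task name is taken from the log line above; paths match ceph-ansible 2.2.x under /usr/share/ceph-ansible):

cd /usr/share/ceph-ansible
grep -n -A 10 'method was chosen for red hat storage' roles/ceph-common/tasks/checks/check_mandatory_vars.yml
grep -n -E 'ceph_origin|ceph_rhcs|ceph_stable_rh_storage' group_vars/all.yml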

#Issue-3

The same ceph_origin setting was in place in all.yml because of the Satellite installation:
ceph_origin: 'distro' # or 'distro' NEEDED FOR SAT

and the following lines were commented out:
#ceph_stable_rh_storage: true
#ceph_rhcs_cdn_install: false 

2017-08-07 17:13:27,230 p=7053 u=root |  fatal: [node1]: FAILED! => {"changed": false, "failed": true, "msg": "choose an upstream installation source or read https://github.com/ceph/ceph-ansible/wiki"}


We manually unset the nodeep-scrub, noscrub and noout flags.
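
For reference, these are the standard flag-unset commands:

ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub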

Comment 5 Guillaume Abrioux 2017-08-09 20:45:26 UTC
From what I have seen:

The old rolling_update.yml is still being used, and it looks for a group_vars/all file:

  pre_tasks:
    - include_vars: roles/ceph-common/defaults/main.yml
    - include_vars: roles/ceph-mon/defaults/main.yml
    - include_vars: roles/ceph-restapi/defaults/main.yml
    - include_vars: group_vars/all
      failed_when: false
    - include_vars: group_vars/{{ mon_group_name }}
      failed_when: false
    - include_vars: group_vars/{{ restapi_group_name }}
      failed_when: false


but group_vars/all doesn't exist:

$~/Downloads/debug_bz/usr/share/ceph-ansible$ ls group_vars/all
ls: group_vars/all: No such file or directory
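
whereas the .yml variant does exist (as noted at the end of this comment):

$ ls group_vars/all.yml
group_vars/all.yml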

Therefore it takes the default value for ceph_origin, which is:

roles/ceph-common/defaults/main.yml:83:ceph_origin: 'upstream' # or 'distro' or 'local'

It then ends up with the error we can see in the logs because it enters this condition:

https://github.com/ceph/ceph-ansible/blob/v2.2.11/roles/ceph-common/tasks/checks/check_mandatory_vars.yml#L12-L23

I think running the new version of rolling_update.yml should fix this error, since it will look for group_vars/all.yml (which actually exists) and set ceph_origin to 'distro' as expected.
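
A minimal sketch of that, assuming the usual Red Hat Ceph Storage 2 layout where the updated playbook ships under infrastructure-playbooks/:

cd /usr/share/ceph-ansible
# copy the current playbook next to group_vars/ and run it from there
cp infrastructure-playbooks/rolling_update.yml .
ansible-playbook rolling_update.yml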