Bug 1372481

Summary:	[ceph-ansible] : rolling_update got hung in task 'compress the store as much as possible'
Product:	[Red Hat Storage] Red Hat Storage Console	Reporter:	Rachana Patel <racpatel>
Component:	ceph-ansible	Assignee:	seb
Status:	CLOSED ERRATA	QA Contact:	Rachana Patel <racpatel>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	2	CC:	adeza, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, racpatel, rcyriac, rghatvis, sankarshan, seb, vsarmila
Target Milestone:	---
Target Release:	2
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	ceph-ansible-1.0.5-33.el7scon	Doc Type:	Bug Fix
Doc Text:	Issuing the command to compact the monitor data store during a rolling upgrade could render the Ceph monitors unresponsive. To avoid this behaviour, the rolling upgrade now skips the command to compact the data store. As a result, the Ceph monitors remain responsive.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-10-19 15:22:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1357777

Description Rachana Patel 2016-09-01 21:28:12 UTC

Description of problem:
=======================
When a cluster has more than one monitor, the rolling update hangs in the task 'compress the store as much as possible' on the second monitor node.


Version-Release number of selected component (if applicable):
==============================================================
update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64


How reproducible:
=================
always


Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible with 3 MON, 3 OSD, and 1 RGW nodes (10.2.2-38.el7cp.x86_64):
[root@magna044 ceph-ansible]# cat /etc/ansible/hosts
[mons]
magna078
magna084
magna085

[osds]
magna090
magna091
magna085

[rgws]
magna094


2. Create a repo file on all nodes that points to the 10.2.2-39.el7cp.x86_64 bits.
3. Change the value of 'serial:' to adjust the number of servers updated at a time.
4. Use rolling_update.yml to update all nodes.
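
The inventory from step 1 can be checked programmatically, for example to confirm the group sizes before running the playbook. The following is a minimal sketch and not part of the original report. It assumes Python's standard configparser is adequate for this simple inventory (real Ansible inventories support syntax it does not handle), and the function names hosts_by_group and shared_hosts are hypothetical.

```python
import configparser

# Inventory copied from step 1 of this report.
INVENTORY = """
[mons]
magna078
magna084
magna085

[osds]
magna090
magna091
magna085

[rgws]
magna094
"""


def hosts_by_group(text):
    """Return {group: [hosts]} for a simple INI-style inventory."""
    # allow_no_value lets bare hostnames parse as keys without values.
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.read_string(text)
    return {section: list(parser[section]) for section in parser.sections()}


def shared_hosts(groups):
    """Return {host: [groups]} for hosts listed in more than one group."""
    seen = {}
    for group, hosts in groups.items():
        for host in hosts:
            seen.setdefault(host, []).append(group)
    return {host: names for host, names in seen.items() if len(names) > 1}
```

Calling shared_hosts(hosts_by_group(INVENTORY)) reports that magna085 is listed under both [mons] and [osds]. Nothing in this report ties that co-location to the hang; the sketch only makes the layout explicit.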

Actual results:
===============
[root@magna044 ceph-ansible]# ansible-playbook rolling_update.yml 
Are you sure you want to upgrade the cluster? [no]: yes

PLAY [confirm whether user really meant to upgrade the cluster] *************** 

GATHERING FACTS *************************************************************** 
ok: [localhost]

TASK: [exit playbook, if user did not mean to upgrade cluster] **************** 
skipping: [localhost]

PLAY [mons;osds;mdss;rgws] **************************************************** 

GATHERING FACTS *************************************************************** 
ok: [magna084]
ok: [magna078]
ok: [magna085]
ok: [magna091]
ok: [magna090]
ok: [magna094]

TASK: [debug msg="gather facts on all Ceph hosts for following reference"] **** 
ok: [magna078] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna084] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna085] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna094] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna090] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna091] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}

TASK: [check if sysvinit] ***************************************************** 
ok: [magna084]
ok: [magna090]
ok: [magna091]
ok: [magna078]
ok: [magna085]
ok: [magna094]

TASK: [check if upstart] ****************************************************** 
ok: [magna084]
ok: [magna078]
ok: [magna090]
ok: [magna085]
ok: [magna091]
ok: [magna094]

TASK: [check if systemd] ****************************************************** 
changed: [magna090]
changed: [magna084]
changed: [magna085]
changed: [magna078]
changed: [magna094]
changed: [magna091]

PLAY [mons] ******************************************************************* 

GATHERING FACTS *************************************************************** 
ok: [magna084]
ok: [magna078]
ok: [magna085]

TASK: [compress the store as much as possible] ******************************** 
changed: [magna078]



Expected results:
=================
The playbook should update all nodes.


Additional info:

Comment 3 seb 2016-09-02 15:21:45 UTC

Can you give me the state of the cluster prior to running this?
Were all the monitors started?
Can you try running the compress command manually on the monitor nodes?
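
One way to run the manual check without risking another indefinite hang is to wrap the command in a timeout. The sketch below is illustrative only and not part of the original comment. It assumes the task corresponds to `ceph tell mon.<id> compact`, that monitor IDs match the short hostnames in the [mons] group, and that 60 seconds is a reasonable limit; the function names are hypothetical.

```python
import subprocess

# Assumption: monitor IDs equal the hostnames in the [mons] inventory group.
MONITORS = ["magna078", "magna084", "magna085"]


def compact_command(mon_id):
    """Build the argv assumed to compact one monitor's store."""
    return ["ceph", "tell", f"mon.{mon_id}", "compact"]


def run_with_timeout(argv, timeout_s=60, runner=subprocess.run):
    """Run argv and return 'ok', 'failed', or 'hung' instead of blocking forever."""
    try:
        result = runner(argv, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        # The command did not return within timeout_s seconds.
        return "hung"
    except FileNotFoundError:
        # The ceph CLI is not installed on this host.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"
```

Looping over MONITORS and printing run_with_timeout(compact_command(mon)) for each would show which monitor, if any, fails to return within the limit.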

Comment 5 Rachana Patel 2016-09-06 15:08:07 UTC

I do have one question here.
AFAIK, all MONs share the same DB, so why do we need to compress on all MONs?

We could compress on one MON node and it should work fine, right?

Please correct me if I am wrong.

Comment 11 seb 2016-09-13 08:07:18 UTC

It's weird that we don't know the root cause of that. Even though the compaction is not needed by the upgrade, I think it's nice to have.
I ran the playbook several times and the only case where the compact command hung was when the monitor was stopped...

I can remove the compact command from the playbook anyway.

Comment 12 seb 2016-09-13 08:12:19 UTC

https://github.com/ceph/ceph-ansible/pull/975

Comment 24 errata-xmlrpc 2016-10-19 15:22:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2082