Description of problem:
=======================
When the cluster has more than one monitor, the rolling update hangs in the task 'compress the store as much as possible' on the second monitor node.

Version-Release number of selected component (if applicable):
==============================================================
update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64

How reproducible:
=================
always

Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible having 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64)

[root@magna044 ceph-ansible]# cat /etc/ansible/hosts
[mons]
magna078
magna084
magna085

[osds]
magna090
magna091
magna085

[rgws]
magna094

2. Create a repo file on all nodes which points to the 10.2.2-39.el7cp.x86_64 bits (a sketch is shown under "Additional info" below).
3. Change the value of 'serial:' to adjust the number of servers to be updated at a time.
4. Use rolling_update.yml to update all nodes.

Actual results:
===============
[root@magna044 ceph-ansible]# ansible-playbook rolling_update.yml
Are you sure you want to upgrade the cluster? [no]: yes

PLAY [confirm whether user really meant to upgrade the cluster] ***************

GATHERING FACTS ***************************************************************
ok: [localhost]

TASK: [exit playbook, if user did not mean to upgrade cluster] ****************
skipping: [localhost]

PLAY [mons;osds;mdss;rgws] ****************************************************

GATHERING FACTS ***************************************************************
ok: [magna084]
ok: [magna078]
ok: [magna085]
ok: [magna091]
ok: [magna090]
ok: [magna094]

TASK: [debug msg="gather facts on all Ceph hosts for following reference"] ****
ok: [magna078] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna084] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna085] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna094] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna090] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}
ok: [magna091] => {
    "msg": "gather facts on all Ceph hosts for following reference"
}

TASK: [check if sysvinit] *****************************************************
ok: [magna084]
ok: [magna090]
ok: [magna091]
ok: [magna078]
ok: [magna085]
ok: [magna094]

TASK: [check if upstart] ******************************************************
ok: [magna084]
ok: [magna078]
ok: [magna090]
ok: [magna085]
ok: [magna091]
ok: [magna094]

TASK: [check if systemd] ******************************************************
changed: [magna090]
changed: [magna084]
changed: [magna085]
changed: [magna078]
changed: [magna094]
changed: [magna091]

PLAY [mons] *******************************************************************

GATHERING FACTS ***************************************************************
ok: [magna084]
ok: [magna078]
ok: [magna085]

TASK: [compress the store as much as possible] ********************************
changed: [magna078]

(The playbook hangs at this point; the task never completes for the second monitor node.)

Expected results:
=================
It should update all nodes.

Additional info:
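For step 2 above, the exact repo file was not captured in this report; a minimal sketch of what was placed on each node could look like the following. The repo id, file name, and baseurl are placeholders, so substitute the actual location of the 10.2.2-39.el7cp build:

[root@magna044 ~]# cat > /etc/yum.repos.d/ceph-update.repo <<EOF
[ceph-update]
name=Ceph 10.2.2-39 update bits
# placeholder baseurl -- point this at the actual 10.2.2-39.el7cp x86_64 repository
baseurl=http://example.com/ceph/10.2.2-39/el7/x86_64/
enabled=1
gpgcheck=0
EOF

The same file is then copied to every node listed in /etc/ansible/hosts.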
Can you give me the state of the cluster prior to running this? Were all the monitors started? Can you try to run the compact command manually on the monitor nodes?
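For reference, a manual check on each monitor node could look roughly like the following. The systemd unit name ceph-mon@<hostname> assumes a default systemd deployment, and `ceph tell mon.<id> compact` is the same kind of on-demand store compaction the playbook task issues:

[root@magna078 ~]# systemctl status ceph-mon@magna078    # is the monitor actually running?
[root@magna078 ~]# ceph -s                               # cluster health and monitor quorum before upgrading
[root@magna078 ~]# ceph tell mon.magna078 compact        # run the compaction by hand on this monitor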
I do have one question here. AFAIK, all MONs share the same db, so why do we need to compress on all MONs? We could compress on one MON node and it should work fine, right? Please correct me if I am wrong.
It's weird that we don't know the root cause of that; even if the compaction is not needed by the upgrade, I think it's a nice-to-have. I ran the playbook several times, and the only case where the compact command hung was when the monitor was stopped... I can remove the compact command from the playbook anyway.
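One possible workaround, independent of whatever the pull request linked below ends up doing, would be to only issue the compaction when the monitor daemon is actually up; a hypothetical sketch of that guard, not taken from the playbook:

# hypothetical guard: skip the compaction if the local mon daemon is not active
if systemctl is-active --quiet ceph-mon@"$(hostname -s)"; then
    ceph tell mon."$(hostname -s)" compact
else
    echo "ceph-mon not active on $(hostname -s), skipping compact"
fi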
https://github.com/ceph/ceph-ansible/pull/975
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2082