| Summary: | [ceph-ansible] : rolling update is failing if cluster takes time to achieve OK state after OSD upgrade | ||
|---|---|---|---|
| Product: | Red Hat Storage Console | Reporter: | Rachana Patel <racpatel> |
| Component: | ceph-ansible | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED ERRATA | QA Contact: | Tejas <tchandra> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 2 | CC: | adeza, aschoen, ceph-eng-bugs, flucifre, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb |
| Target Milestone: | --- | ||
| Target Release: | 2 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-ansible-1.0.5-35.el7scon | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-22 23:41:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Ok, we should be able to customize this timeout. Would you mind giving this a try? https://github.com/ceph/ceph-ansible/pull/1001 Thanks!

This will ship concurrently with RHCS 2.1.

This will be tested as part of the rolling_update tests.

Verified in build: ceph-ansible-1.0.5-39.el7scon. The timeout is sufficient for the cluster to reach a WARN or OK state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2817
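The linked pull request makes the retry count and delay of the post-upgrade health check configurable instead of hard-coded. As a rough illustration of what such a parameterized "waiting for clean pgs" task can look like in rolling_update.yml, here is a minimal sketch; the variable names (health_osd_check_retries, health_osd_check_delay), the use of the cluster variable, and the delegation target are assumptions for illustration and may differ from the merged change:

```yaml
# Illustrative sketch only -- variable names approximate what the linked PR
# introduces; check rolling_update.yml in your ceph-ansible version.
- name: waiting for clean pgs...
  shell: >
    test "$(ceph pg stat --cluster {{ cluster }} | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq
    "$(ceph pg stat --cluster {{ cluster }} | sed 's/pgs.*//;s/^.*://;s/ //')" &&
    ceph health --cluster {{ cluster }} | egrep -sq "HEALTH_OK|HEALTH_WARN"
  register: pgs_clean
  until: pgs_clean.rc == 0
  retries: "{{ health_osd_check_retries | default(40) }}"   # attempts before aborting
  delay: "{{ health_osd_check_delay | default(30) }}"       # seconds between attempts
  delegate_to: "{{ groups.mons[0] }}"                       # run the check from a monitor
```

The point of the change is that a cluster holding a lot of data can be given a longer total wait (retries x delay) before the play is aborted.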
Description of problem:
=======================
After all OSDs have been upgraded, the playbook checks the cluster health. If the cluster holds a lot of data and needs time to reach the OK state, the rolling update fails, because the waiting time before aborting is too short.

Version-Release number of selected component (if applicable):
============================================================
Update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible with 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64). Create lots of data on that cluster (around 50% full).
2. Create a repo file on all nodes which points to the 10.2.2-39.el7cp.x86_64 bits.
3. Change the value of 'serial:' to adjust the number of servers to be updated.
4. Use rolling_update.yml to update all nodes.

Actual results:
================
TASK: [waiting for clean pgs...] **********************************************
failed: [magna090 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.449153", "end": "2016-09-07 20:22:19.111983", "failed": true, "rc": 2, "start": "2016-09-07 20:22:09.662830", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

failed: [magna091 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:18.451028", "end": "2016-09-07 20:22:47.032077", "failed": true, "rc": 2, "start": "2016-09-07 20:22:28.581049", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

failed: [magna094 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.440957", "end": "2016-09-07 20:22:54.999292", "failed": true, "rc": 2, "start": "2016-09-07 20:22:45.558335", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/rolling_update.retry

localhost                  : ok=1    changed=0    unreachable=0    failed=0
magna078                   : ok=153  changed=8    unreachable=0    failed=0
magna084                   : ok=153  changed=8    unreachable=0    failed=0
magna085                   : ok=153  changed=8    unreachable=0    failed=0
magna090                   : ok=231  changed=9    unreachable=0    failed=1
magna091                   : ok=231  changed=9    unreachable=0    failed=1
magna094                   : ok=231  changed=9    unreachable=0    failed=1
magna095                   : ok=5    changed=1    unreachable=0    failed=0

[root@magna078 ceph]# ceph -s --cluster ceph1
    cluster 5521bc4c-e0c5-4f12-9078-31b0e37739d4
     health HEALTH_ERR
Expected results:
=================
Increase the number of retries or the waiting time so that the cluster gets enough time to reach a healthy state and the rolling update does not abort the operation.

Additional info:
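If the cluster needs longer to settle after the OSD upgrade, the retry and delay variables from the sketch earlier in this report could be raised before re-running rolling_update.yml. A minimal sketch of such an override, assuming those variable names (they are illustrative; use whatever the shipped playbook actually defines):

```yaml
# group_vars/all (or passed with --extra-vars) -- names assume the variables
# sketched above; adjust to the ones used by your rolling_update.yml.
health_osd_check_retries: 100   # number of health-check attempts before aborting
health_osd_check_delay: 30      # seconds to wait between attempts
```

With values like these the playbook would wait up to roughly 50 minutes (100 x 30 s) for the PGs to become active+clean before giving up, instead of failing after the default number of attempts.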