Description of problem:
=======================
After all OSDs are upgraded, the playbook checks cluster health. If the cluster holds a large amount of data and takes a long time to reach the OK state, the rolling update fails with an error because the waiting time before aborting is too short.
Version-Release number of selected component (if applicable):
============================================================
update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64
How reproducible:
=================
always
Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible with 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64). Create a lot of data on that cluster (around 50% full).
2. Create a repo file on all nodes that points to the 10.2.2-39.el7cp.x86_64 bits.
3. Change the value of 'serial:' to adjust the number of servers to be updated.
4. Use rolling_update.yml to update all nodes.
Actual results:
================
TASK: [waiting for clean pgs...] **********************************************
failed: [magna090 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.449153", "end": "2016-09-07 20:22:19.111983", "failed": true, "rc": 2, "start": "2016-09-07 20:22:09.662830", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered
failed: [magna091 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:18.451028", "end": "2016-09-07 20:22:47.032077", "failed": true, "rc": 2, "start": "2016-09-07 20:22:28.581049", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered
failed: [magna094 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.440957", "end": "2016-09-07 20:22:54.999292", "failed": true, "rc": 2, "start": "2016-09-07 20:22:45.558335", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/rolling_update.retry
localhost : ok=1 changed=0 unreachable=0 failed=0
magna078 : ok=153 changed=8 unreachable=0 failed=0
magna084 : ok=153 changed=8 unreachable=0 failed=0
magna085 : ok=153 changed=8 unreachable=0 failed=0
magna090 : ok=231 changed=9 unreachable=0 failed=1
magna091 : ok=231 changed=9 unreachable=0 failed=1
magna094 : ok=231 changed=9 unreachable=0 failed=1
magna095 : ok=5 changed=1 unreachable=0 failed=0
[root@magna078 ceph]# ceph -s --cluster ceph1
cluster 5521bc4c-e0c5-4f12-9078-31b0e37739d4
health HEALTH_ERR
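Note on the stderr above: `test ... -eq` receives the string "40 active+undersized+degraded, 8 undersized+degraded+peered, 56 " rather than a single number, so the comparison appears to fail on parsing whenever other PG states are listed before active+clean. Below is a minimal Python sketch of a count comparison that tolerates extra states. It is not the ceph-ansible implementation. The function names are hypothetical, and the sample line is reconstructed from the stderr (40 + 8 + 56 = 104); its `v1:` prefix and the text after `;` are assumptions about the `ceph pg stat` format.

```python
import re


def pg_counts(pg_stat_line):
    """Return (total_pgs, active_clean_pgs) parsed from one `ceph pg stat` line."""
    total_match = re.search(r"(\d+)\s+pgs:", pg_stat_line)
    if total_match is None:
        raise ValueError("no '<N> pgs:' field found")
    # The per-state counts follow "pgs:" and end at the first ';'.
    states_part = pg_stat_line[total_match.end():].split(";", 1)[0]
    clean = 0
    for entry in states_part.split(","):
        fields = entry.split()
        # Count only the exact "active+clean" state, not e.g. "active+clean+scrubbing".
        if len(fields) == 2 and fields[1] == "active+clean":
            clean = int(fields[0])
    return int(total_match.group(1)), clean


def all_pgs_clean(pg_stat_line):
    """True only when every PG is reported as active+clean."""
    total, clean = pg_counts(pg_stat_line)
    return total > 0 and total == clean


# Sample reconstructed from the stderr above; prefix and suffix are assumed.
sample = ("v1: 104 pgs: 40 active+undersized+degraded, "
          "8 undersized+degraded+peered, 56 active+clean; 0 bytes data")
```

With this sample, `all_pgs_clean(sample)` is false because only 56 of 104 PGs are clean, and no parsing error is raised.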
Expected results:
=================
Increase the number of retries or the waiting time so that the cluster gets enough time to reach a healthy state and the rolling update does not abort the operation.
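The requested behavior can be sketched as a polling loop with a configurable retry count and delay. This is only an illustration in Python, not the playbook's code. `wait_for_clean`, its parameters, and the default values of 40 retries and 30 seconds are hypothetical (the log above shows the task giving up after 10 attempts).

```python
import time


def wait_for_clean(check, retries=40, delay=30.0, sleep=time.sleep):
    """Poll `check` until it returns True; give up after `retries` attempts.

    Returns the attempt number that succeeded. `retries` and `delay` are
    illustrative defaults and should be sized to the expected recovery time.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Wait before the next poll, but not after the final attempt.
            sleep(delay)
    raise TimeoutError("cluster not clean after %d attempts" % retries)
```

Here `check` is any callable that reports whether the cluster is healthy, for example one that runs `ceph pg stat` and compares the PG counts. The total wait is bounded by roughly `retries * delay` seconds plus the time spent in the checks themselves.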
Additional info:
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2016:2817