Description of problem:
=======================
After all of the OSDs are upgraded, the playbook checks the cluster health. If the cluster holds a lot of data and needs time to reach an OK state, the rolling update fails, because the waiting time before aborting is too short.

Version-Release number of selected component (if applicable):
=============================================================
update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64

How reproducible:
=================
always

Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible with 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64). Create a large amount of data on that cluster (around 50% full).
2. Create a repo file on all nodes that points to the 10.2.2-39.el7cp.x86_64 bits.
3. Change the value of 'serial:' to adjust the number of servers to be updated.
4. Use rolling_update.yml to update all nodes.

Actual results:
===============
TASK: [waiting for clean pgs...] **********************************************
failed: [magna090 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.449153", "end": "2016-09-07 20:22:19.111983", "failed": true, "rc": 2, "start": "2016-09-07 20:22:09.662830", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered
failed: [magna091 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:18.451028", "end": "2016-09-07 20:22:47.032077", "failed": true, "rc": 2, "start": "2016-09-07 20:22:28.581049", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered
failed: [magna094 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.440957", "end": "2016-09-07 20:22:54.999292", "failed": true, "rc": 2, "start": "2016-09-07 20:22:45.558335", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/rolling_update.retry

localhost  : ok=1    changed=0  unreachable=0  failed=0
magna078   : ok=153  changed=8  unreachable=0  failed=0
magna084   : ok=153  changed=8  unreachable=0  failed=0
magna085   : ok=153  changed=8  unreachable=0  failed=0
magna090   : ok=231  changed=9  unreachable=0  failed=1
magna091   : ok=231  changed=9  unreachable=0  failed=1
magna094   : ok=231  changed=9  unreachable=0  failed=1
magna095   : ok=5    changed=1  unreachable=0  failed=0

[root@magna078 ceph]# ceph -s --cluster ceph1
    cluster 5521bc4c-e0c5-4f12-9078-31b0e37739d4
     health HEALTH_ERR
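For context, the failing check can be reconstructed from the "cmd" field in the output above. Below is a rough sketch of what the "waiting for clean pgs" task appears to look like; the register name, the delay value, and the use of {{ cluster }} and the MON delegation expression are assumptions, not text copied from rolling_update.yml. The check compares the active+clean PG count against the total PG count; while PGs are still recovering, the first sed expression emits the whole PG state breakdown instead of a single number, which is why "test" reports "integer expression expected" until the cluster settles.

    - name: waiting for clean pgs...
      # The number of active+clean PGs must equal the total PG count, and
      # overall health must be HEALTH_OK or HEALTH_WARN. While PGs are still
      # degraded, the first sed expression returns the full state breakdown
      # instead of a single integer, so "test -eq" exits with rc 2 and the
      # task is retried.
      shell: >
        test "$(ceph pg stat --cluster {{ cluster }} | sed 's/^.*pgs://;s/active+clean.*//;s/ //')"
        -eq "$(ceph pg stat --cluster {{ cluster }} | sed 's/pgs.*//;s/^.*://;s/ //')"
        && ceph health --cluster {{ cluster }} | egrep -sq "HEALTH_OK|HEALTH_WARN"
      register: pgs_check                   # hypothetical register name
      until: pgs_check.rc == 0
      retries: 10                           # matches "attempts": 10 in the log above
      delay: 10                             # assumed; the actual delay is not visible in the log
      delegate_to: "{{ groups.mons[0] }}"   # the log shows delegation to magna078 (a MON)

With 10 attempts and a short delay, a cluster that is ~50% full simply cannot finish recovery before the retries run out, and the whole play aborts.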
Expected results:
=================
Increase the number of retries or the waiting time so that the cluster gets enough time to reach a healthy state and the rolling update does not abort.

Additional info:
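One way to do this (a sketch only; the variable names health_osd_check_retries and health_osd_check_delay are assumptions about how such a knob might be exposed, not confirmed contents of the change linked in the comments below) would be to replace the hard-coded retries/delay on the task with variables, e.g. retries: "{{ health_osd_check_retries }}" and delay: "{{ health_osd_check_delay }}", so the operator can raise the defaults for large clusters:

    # group_vars/all (hypothetical location and variable names)
    # Give a heavily loaded cluster more time to return to active+clean
    # before the rolling update gives up: total wait = retries * delay seconds.
    health_osd_check_retries: 40   # number of attempts
    health_osd_check_delay: 30     # seconds between attempts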
Ok, we should be able to customize this timeout.
Would you mind giving this a try? https://github.com/ceph/ceph-ansible/pull/1001 Thanks!
This will ship concurrently with RHCS 2.1.
This will be tested as part of the rolling_update tests.
Verified in build: ceph-ansible-1.0.5-39.el7scon
The timeout is sufficient for the cluster to reach a WARN or OK state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2817