| Summary: | [ceph-ansible] : rolling update is failing if cluster takes time to achieve OK state after OSD upgrade | ||
|---|---|---|---|
| Product: | Red Hat Storage Console | Reporter: | Rachana Patel <racpatel> |
| Component: | ceph-ansible | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED ERRATA | QA Contact: | Tejas <tchandra> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 2 | CC: | adeza, aschoen, ceph-eng-bugs, flucifre, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb |
| Target Milestone: | --- | ||
| Target Release: | 2 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-ansible-1.0.5-35.el7scon | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-22 23:41:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Ok, we should be able to customize this timeout. Would you mind giving this a try? https://github.com/ceph/ceph-ansible/pull/1001 Thanks!

This will ship concurrently with RHCS 2.1.

This will be tested as part of the rolling_update tests.

Verified in build: ceph-ansible-1.0.5-39.el7scon. The timeout is sufficient for the cluster to reach a WARN or OK state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2817
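The linked pull request makes the retry count and delay of the post-upgrade health check configurable instead of hard-coded. As a rough illustration of what such a parameterized "waiting for clean pgs" task can look like in rolling_update.yml, here is a minimal sketch; the variable names (health_osd_check_retries, health_osd_check_delay), the use of the cluster variable, and the delegation target are assumptions for illustration and may differ from the merged change:

```yaml
# Illustrative sketch only -- variable names approximate what the linked PR
# introduces; check rolling_update.yml in your ceph-ansible version.
- name: waiting for clean pgs...
  shell: >
    test "$(ceph pg stat --cluster {{ cluster }} | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq
    "$(ceph pg stat --cluster {{ cluster }} | sed 's/pgs.*//;s/^.*://;s/ //')" &&
    ceph health --cluster {{ cluster }} | egrep -sq "HEALTH_OK|HEALTH_WARN"
  register: pgs_clean
  until: pgs_clean.rc == 0
  retries: "{{ health_osd_check_retries | default(40) }}"   # attempts before aborting
  delay: "{{ health_osd_check_delay | default(30) }}"       # seconds between attempts
  delegate_to: "{{ groups.mons[0] }}"                       # run the check from a monitor
```

The point of the change is that a cluster holding a lot of data can be given a longer total wait (retries x delay) before the play is aborted.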
Description of problem:
=======================
After all OSDs have been upgraded, the playbook checks the cluster health. If the cluster holds a lot of data and needs time to reach the OK state, the rolling update fails, because the waiting time before aborting is too short.

Version-Release number of selected component (if applicable):
============================================================
Update from 10.2.2-38.el7cp.x86_64 to 10.2.2-39.el7cp.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a cluster via ceph-ansible with 3 MON, 3 OSD and 1 RGW node (10.2.2-38.el7cp.x86_64). Create lots of data on that cluster (around 50% full).
2. Create a repo file on all nodes which points to the 10.2.2-39.el7cp.x86_64 bits.
3. Change the value of 'serial:' to adjust the number of servers to be updated.
4. Use rolling_update.yml to update all nodes.

Actual results:
================
TASK: [waiting for clean pgs...] **********************************************
failed: [magna090 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.449153", "end": "2016-09-07 20:22:19.111983", "failed": true, "rc": 2, "start": "2016-09-07 20:22:09.662830", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

failed: [magna091 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:18.451028", "end": "2016-09-07 20:22:47.032077", "failed": true, "rc": 2, "start": "2016-09-07 20:22:28.581049", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

failed: [magna094 -> magna078] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster ceph1 | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster ceph1 | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health--cluster ceph1 | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:09.440957", "end": "2016-09-07 20:22:54.999292", "failed": true, "rc": 2, "start": "2016-09-07 20:22:45.558335", "warnings": []}
stderr: /bin/sh: line 0: test: 40 active+undersized+degraded, 8 undersized+degraded+peered, 56 : integer expression expected
msg: Task failed as maximum retries was encountered

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/rolling_update.retry

localhost                  : ok=1    changed=0    unreachable=0    failed=0
magna078                   : ok=153  changed=8    unreachable=0    failed=0
magna084                   : ok=153  changed=8    unreachable=0    failed=0
magna085                   : ok=153  changed=8    unreachable=0    failed=0
magna090                   : ok=231  changed=9    unreachable=0    failed=1
magna091                   : ok=231  changed=9    unreachable=0    failed=1
magna094                   : ok=231  changed=9    unreachable=0    failed=1
magna095                   : ok=5    changed=1    unreachable=0    failed=0

[root@magna078 ceph]# ceph -s --cluster ceph1
    cluster 5521bc4c-e0c5-4f12-9078-31b0e37739d4
     health HEALTH_ERR
Expected results:
=================
Increase the number of retries or the waiting time so that the cluster gets enough time to reach a healthy state and the rolling update does not abort the operation.

Additional info:
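If the cluster needs longer to settle after the OSD upgrade, the retry and delay variables from the sketch earlier in this report could be raised before re-running rolling_update.yml. A minimal sketch of such an override, assuming those variable names (they are illustrative; use whatever the shipped playbook actually defines):

```yaml
# group_vars/all (or passed with --extra-vars) -- names assume the variables
# sketched above; adjust to the ones used by your rolling_update.yml.
health_osd_check_retries: 100   # number of health-check attempts before aborting
health_osd_check_delay: 30      # seconds to wait between attempts
```

With values like these the playbook would wait up to roughly 50 minutes (100 x 30 s) for the PGs to become active+clean before giving up, instead of failing after the default number of attempts.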