Bug 1624962
Summary: | [RFE] Set flag noup during scale-out and unset it when all new OSD daemons are running | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
Component: | Ceph-Ansible | Assignee: | Sébastien Han <shan> |
Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> |
Severity: | high | Docs Contact: | Bara Ancincova <bancinco> |
Priority: | high | ||
Version: | 3.0 | CC: | anharris, aschoen, ceph-eng-bugs, gabrioux, gmeno, hnallurv, mamccoma, nthomas, sankarshan, shan, tnielsen, tserlin |
Target Milestone: | rc | Keywords: | FutureFeature |
Target Release: | 3.2 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | RHEL: ceph-ansible-3.2.0-0.1.beta6.el7cp Ubuntu: ceph-ansible_3.2.0~beta6-2redhat1 | Doc Type: | Enhancement |
Doc Text: |
.The `noup` flag is now set before creating OSDs to distribute PGs properly
The `ceph-ansible` utility now sets the `noup` flag before creating OSDs to prevent them from changing their status to `up` before all OSDs are created. Previously, if the flag was not set, placement groups (PGs) were created on only one OSD and got stuck in creation or activation. With this update, the `noup` flag is set before creating OSDs and unset after the creation is complete. As a result, PGs are distributed properly among all OSDs.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-01-03 19:01:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1651060 | ||
Bug Blocks: | 1629656 |
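The Doc Text above amounts to an ordering guarantee around OSD creation: set `noup` first, create every OSD, unset `noup` last. A minimal sketch of that ordering follows; `ceph` is stubbed out as an echo so the snippet runs without a cluster (on a real deployment you would use the actual `ceph` CLI and your OSD-creation tooling):

```shell
# Stub standing in for the real ceph CLI -- illustration only.
ceph() { echo "ceph $*"; }

ceph osd set noup     # new OSDs stay "down" even once their daemons start
# ... create and activate all new OSDs here ...
ceph osd unset noup   # all new OSDs come "up" together, so PGs are
                      # distributed across all of them instead of one
```

Because no OSD can reach `up` before the final `unset`, no PG can be created on a lone early-starting OSD, which is what previously left PGs stuck in creating or activating.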
Description
Vikhyat Umrao
2018-09-03 17:18:51 UTC
Why PGs get stuck in activating is explained in this KCS article: https://access.redhat.com/solutions/3526531. This is a new feature in Luminous (RHCS 3) to avoid a large number of PGs getting mapped to one OSD.

*** Test from lab environment with "noup" flag ***

Before any changes in my environment (baseline):

```
[root@vm250-137 ~]# ceph -s
  cluster:
    id:     256b60c8-8d8e-47bb-9dfe-492055072a7e
    health: HEALTH_WARN
            application not enabled on 1 pool(s)
            1/3 mons down, quorum vm250-8,vm250-137

  services:
    mon: 3 daemons, quorum vm250-8,vm250-137, out of quorum: vm250-194
    mgr: vm250-137(active), standbys: vm250-8
    osd: 9 osds: 9 up, 9 in
    rgw: 2 daemons active
    tcmu-runner: 2 daemons active

  data:
    pools:   9 pools, 576 pgs
    objects: 231 objects, 3873 bytes
    usage:   1112 MB used, 87876 MB / 88988 MB avail
    pgs:     576 active+clean

  io:
    client: 170 B/s rd, 0 op/s rd, 0 op/s wr

[root@vm250-137 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.08464 root default
-5       0.02888     host vm250-248
 1   hdd 0.00929         osd.1           up  1.00000 1.00000
 3   hdd 0.00980         osd.3           up  1.00000 1.00000
 6   hdd 0.00980         osd.6           up  1.00000 1.00000
-7       0.02788     host vm251-254
 2   hdd 0.00929         osd.2           up  1.00000 1.00000
 5   hdd 0.00929         osd.5           up  1.00000 1.00000
 8   hdd 0.00929         osd.8           up  1.00000 1.00000
-3       0.02788     host vm253-212
 0   hdd 0.00929         osd.0           up  1.00000 1.00000
 4   hdd 0.00929         osd.4           up  1.00000 1.00000
 7   hdd 0.00929         osd.7           up  1.00000 1.00000
```

--------------------------------------------------------------------------------

** Remove OSDs/OSD node (vm251-254) and apply flags to simulate adding a new node with OSDs:

```
[root@vm250-137 ~]# ceph -s
  cluster:
    id:     256b60c8-8d8e-47bb-9dfe-492055072a7e
    health: HEALTH_WARN
            noup,nobackfill,norecover flag(s) set
            55/693 objects misplaced (7.937%)
            Degraded data redundancy: 176/693 objects degraded (25.397%), 352 pgs unclean, 21 pgs degraded, 352 pgs undersized
            application not enabled on 1 pool(s)
            1/3 mons down, quorum vm250-8,vm250-137

  services:
    mon: 3 daemons, quorum vm250-8,vm250-137, out of quorum: vm250-194
    mgr: vm250-137(active), standbys: vm250-8
    osd: 6 osds: 6 up, 6 in; 224 remapped pgs
         flags noup,nobackfill,norecover
    rgw: 2 daemons active
    tcmu-runner: 2 daemons active

  data:
    pools:   9 pools, 576 pgs
    objects: 231 objects, 3873 bytes
    usage:   750 MB used, 59087 MB / 59837 MB avail
    pgs:     176/693 objects degraded (25.397%)
             55/693 objects misplaced (7.937%)
             331 active+undersized
             214 active+clean+remapped
             21  active+undersized+degraded
             10  active+clean

  io:
    client: 127 B/s rd, 0 op/s rd, 0 op/s wr

[root@vm250-137 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.05676 root default
-5       0.02888     host vm250-248
 1   hdd 0.00929         osd.1           up  1.00000 1.00000
 3   hdd 0.00980         osd.3           up  1.00000 1.00000
 6   hdd 0.00980         osd.6           up  1.00000 1.00000
-3       0.02788     host vm253-212
 0   hdd 0.00929         osd.0           up  1.00000 1.00000
 4   hdd 0.00929         osd.4           up  1.00000 1.00000
 7   hdd 0.00929         osd.7           up  1.00000 1.00000
```

-------------------------------------------------------------------------------

** ceph-ansible playbook fails on this non-containerized task at the end of the playbook??
but still appears to be successful in applying the changes **

```
TASK [ceph-osd : manually prepare ceph "filestore" non-containerized osd disk(s) with collocated osd data and journal] *****

changed: [vm251-254] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:00.954490', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:00.910970', u'delta': u'0:00:00.043520', 'item': u'/dev/sdb', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdb'])

changed: [vm251-254] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:01.467960', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:01.408488', u'delta': u'0:00:00.059472', 'item': u'/dev/sdc', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdc'])

changed: [vm251-254] => (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:02.006833', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:01.959515', u'delta': u'0:00:00.047318', 'item': u'/dev/sdd', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdd'])

failed: [vm251-254] (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:02.437479', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:02.417113', u'delta': u'0:00:00.020366', 'item': u'/dev/sdd', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdd']) => {"changed": true, "cmd": ["ceph-disk", "prepare", "--cluster", "ceph", "--filestore", "/dev/sdd"], "delta": "0:00:01.538766", "end": "2018-09-21 14:44:46.513581", "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.020366", "end": "2018-09-21 14:44:02.437479", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/sdd print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/sdd", "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:02.417113", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/sdd"], "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:44.974815", "stderr": "Could not create partition 2 from 34 to 1048609\nError encountered; not saving changes.\n'/sbin/sgdisk --new=2:0:+512M --change-name=2:ceph journal --partition-guid=2:e07fb99d-f87e-4a44-aa6b-e6f466f7aef2 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdd' failed with status code 4", "stderr_lines": ["Could not create partition 2 from 34 to 1048609", "Error encountered; not saving changes.", "'/sbin/sgdisk --new=2:0:+512M --change-name=2:ceph journal --partition-guid=2:e07fb99d-f87e-4a44-aa6b-e6f466f7aef2 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdd' failed with status code 4"], "stdout": "", "stdout_lines": []}

failed: [vm251-254] (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:02.846679', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:02.830390', u'delta': u'0:00:00.016289', 'item': u'/dev/sdb', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdb']) => {"changed": true, "cmd": ["ceph-disk", "prepare", "--cluster", "ceph", "--filestore", "/dev/sdb"], "delta": "0:00:00.308885", "end": "2018-09-21 14:44:47.732516", "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.016289", "end": "2018-09-21 14:44:02.846679", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/sdb print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/sdb", "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:02.830390", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/sdb"], "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:47.423631", "stderr": "ceph-disk: Error: Device is mounted: /dev/sdb1", "stderr_lines": ["ceph-disk: Error: Device is mounted: /dev/sdb1"], "stdout": "", "stdout_lines": []}

failed: [vm251-254] (item=[{'_ansible_parsed': True, 'stderr_lines': [], u'cmd': u"parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", u'end': u'2018-09-21 14:44:03.289726', '_ansible_no_log': False, u'stdout': u'', '_ansible_item_result': True, u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-09-21 14:44:03.273331', u'delta': u'0:00:00.016395', 'item': u'/dev/sdc', u'rc': 1, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u'', '_ansible_ignore_errors': None, u'failed': False}, u'/dev/sdc']) => {"changed": true, "cmd": ["ceph-disk", "prepare", "--cluster", "ceph", "--filestore", "/dev/sdc"], "delta": "0:00:00.306034", "end": "2018-09-21 14:44:48.455621", "item": [{"_ansible_ignore_errors": null, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", "delta": "0:00:00.016395", "end": "2018-09-21 14:44:03.289726", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/sdc print | egrep -sq '^ 1.*ceph'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "/dev/sdc", "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:03.273331", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/sdc"], "msg": "non-zero return code", "rc": 1, "start": "2018-09-21 14:44:48.149587", "stderr": "ceph-disk: Error: Device is mounted: /dev/sdc1", "stderr_lines": ["ceph-disk: Error: Device is mounted: /dev/sdc1"], "stdout": "", "stdout_lines": []}

PLAY RECAP *****************************************************************************************************************
vm251-254                  : ok=63   changed=4    unreachable=0    failed=1
```

--------------------------------------------------------------------------------

*** Changes after the run of the playbook ***

```
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.08464 root default
-5       0.02888     host vm250-248
 1   hdd 0.00929         osd.1           up  1.00000 1.00000
 3   hdd 0.00980         osd.3           up  1.00000 1.00000
 6   hdd 0.00980         osd.6           up  1.00000 1.00000
-7       0.02788     host vm251-254
 2   hdd 0.00929         osd.2         down        0 1.00000
 5   hdd 0.00929         osd.5         down        0 1.00000
 8   hdd 0.00929         osd.8         down        0 1.00000
-3       0.02788     host vm253-212
 0   hdd 0.00929         osd.0           up  1.00000 1.00000
 4   hdd 0.00929         osd.4           up  1.00000 1.00000
 7   hdd 0.00929         osd.7           up  1.00000 1.00000

[root@vm250-137 ~]# ceph osd unset nobackfill
nobackfill is unset
[root@vm250-137 ~]# ceph osd unset norecover
norecover is unset
[root@vm250-137 ~]# ceph osd unset noup
noup is unset

[root@vm250-137 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.08464 root default
-5       0.02888     host vm250-248
 1   hdd 0.00929         osd.1           up  1.00000 1.00000
 3   hdd 0.00980         osd.3           up  1.00000 1.00000
 6   hdd 0.00980         osd.6           up  1.00000 1.00000
-7       0.02788     host vm251-254
 2   hdd 0.00929         osd.2           up  1.00000 1.00000
 5   hdd 0.00929         osd.5           up  1.00000 1.00000
 8   hdd 0.00929         osd.8           up  1.00000 1.00000
-3       0.02788     host vm253-212
 0   hdd 0.00929         osd.0           up  1.00000 1.00000
 4   hdd 0.00929         osd.4           up  1.00000 1.00000
 7   hdd 0.00929         osd.7           up  1.00000 1.00000
```

** Cluster has now successfully re-balanced **:

```
[root@vm250-137 ~]# ceph -s
  cluster:
    id:     256b60c8-8d8e-47bb-9dfe-492055072a7e
    health: HEALTH_WARN
            application not enabled on 1 pool(s)
            1/3 mons down, quorum vm250-8,vm250-137

  services:
    mon: 3 daemons, quorum vm250-8,vm250-137, out of quorum: vm250-194
    mgr: vm250-137(active), standbys: vm250-8
    osd: 9 osds: 9 up, 9 in
    rgw: 2 daemons active
    tcmu-runner: 2 daemons active

  data:
    pools:   9 pools, 576 pgs
    objects: 231 objects, 3873 bytes
    usage:   1090 MB used, 87898 MB / 88988 MB avail
    pgs:     576 active+clean

  io:
    client: 85 B/s rd, 0 op/s rd, 0 op/s wr
```

Vikhyat, for day-2 operations it is encouraged to use the playbook osd-configure.yml, which adds new OSDs. So we are going to add your request to this playbook.

lgtm

Observed that the noup flag was set and unset as required. Moving to VERIFIED state. ceph-ansible-3.2.0-0.1.rc3.el7cp.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0020
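The manual flag bracketing used in the reproducer above can be condensed into two small helpers. This is an illustrative sketch only: the helper names are invented, `ceph` is stubbed as an echo so the snippet runs without a cluster (drop the stub on a real cluster), and note that `nobackfill`/`norecover` are part of the reporter's node-replacement test rather than something ceph-ansible sets on its own.

```shell
# Stub for the real ceph CLI so this sketch is runnable anywhere.
ceph() { echo "ceph $*"; }

set_maintenance_flags() {    # hypothetical helper name
    for f in noup nobackfill norecover; do
        ceph osd set "$f"
    done
}

unset_maintenance_flags() {
    # unset noup last, matching the order used in the test above
    for f in nobackfill norecover noup; do
        ceph osd unset "$f"
    done
}

set_maintenance_flags
# ... remove the node and re-run the ceph-ansible playbook to re-create its OSDs ...
unset_maintenance_flags
```

Bracketing the playbook run this way keeps the replaced OSDs `down` and blocks recovery traffic until every new daemon is running, after which all of them come `up` together and the cluster rebalances once.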