| Summary: | document ceph-ansible method for hosts that have different block device names. | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | Documentation | Assignee: | Bara Ancincova <bancinco> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tejas <tchandra> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.0 | CC: | adeza, aschoen, asriram, bengland, ceph-eng-bugs, gmeno, hnallurv, jharriga, kbader, kdreyer, nthomas, sankarshan, seb, twilkins, wfoster |
| Target Milestone: | rc | | |
| Target Release: | 2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-28 09:37:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
It is hard to register which commands succeeded or not and then invalidate hosts based on that. Our thinking when we introduced this change was to converge on the desired state: if you set 36 disks, then you expect to get 36 OSDs, so if one disk is faulty we abort. We just don't want to end up in a situation where you get 35/36 OSDs. The idea is that you simply fix, or remove from the variable list, any faulty drives.

I disagree. There are several problems here:

- ceph-ansible identifies devices by name
- ceph-ansible expects every host to have the same set of devices
- all-or-nothing behavior

Identifying OSDs by device name: Linux does not guarantee that low-level block devices will have the same name after reboot or across hosts, unless they happen to be discovered at boot time in the same order. Higher-level block devices such as LVM volumes, which are associated with a signature written on their storage, do have persistent names. For example, in our scalability lab we've seen the system disks show up as the first two devices, /dev/sda and /dev/sdb, on some hosts, and as the *last* two devices, /dev/sdk and /dev/sdl, on other hosts, depending on the order of discovery for the storage controllers; and in the case of a missing OSD drive, they show up as /dev/sdj and /dev/sdk! In the scale lab we worked around this by forcing the same order of discovery for the two storage controllers, loading the system disk controller first, but this may not always be feasible (https://engineering.redhat.com/rt/Ticket/Display.html?id=416265). Alternative: identify OSD devices by their attributes in ansible_devices, such as size, rotational flag, number of partitions, etc.

Same number of devices on all hosts: this is a side effect of the previous problem. Ceph does not require all OSD hosts to have the same number and type of devices, but ceph-ansible does. As for "removing from the variable list any faulty drives", yes, that's what my workaround is, but I lose that device name on *every* host, not just the one where the drive failure occurred. The bigger the cluster, the more expensive this gets. And no, I cannot just swap out the bad drive; it may be months before I get it replaced, and I need to use the Ceph cluster anyway.

All-or-nothing behavior (what started this bz): if you have a large cluster with 30 hosts, you don't want to have to redo the install because one disk was down. But that's exactly what you would have to do here, since you would lose ALL the OSDs on any host that doesn't have exactly the number of drives that you expect. Alternative: register as many OSDs on the list as you can. In our example of a 20-server configuration with 36 drives per server, if two OSDs out of 20 x 36 = 720 did not come up, you could continue with the install while you use the usual maintenance procedures to replace the two missing or failed drives, with the failed OSD devices reported to the user at the end of the installation process. In the ceph-ansible implementation, could we implement the OSD bring-up steps as ones that register results but don't fail? The final step of OSD bring-up would then count how many came up and fail (on that host) only if the failed OSD count exceeds some threshold (e.g. 1-2 devices by default). By setting this threshold to zero you get the current behavior, as sketched below.
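A minimal Ansible sketch of that final counting step, assuming the earlier bring-up tasks register their per-device results instead of failing; osd_prepare_results and max_failed_osd_devices are illustrative names, not existing ceph-ansible variables:

    # Fail this host only if the number of devices whose bring-up returned a
    # non-zero rc exceeds the allowed threshold (default 0 = current behavior).
    - name: fail the host if too many OSD devices could not be brought up
      fail:
        msg: >
          {{ osd_prepare_results.results | rejectattr('rc', 'equalto', 0)
             | list | length }} OSD device(s) failed on {{ inventory_hostname }}
      when: >
        (osd_prepare_results.results | rejectattr('rc', 'equalto', 0) | list | length)
        > (max_failed_osd_devices | default(0))

With max_failed_osd_devices left at its default of 0, this reduces to the current all-or-nothing behavior.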
Alternative: instead of discovering at the end what went wrong, why not put an OSD device check at the front of the procedure and make sure that all the devices (both storage and network) that you expect are indeed working before attempting the install at all? Overall, ceph-ansible is a great scalability improvement over what we had before, but I think these issues will be important to sysadmins. If you keep sysadmins happy, they will be powerful advocates for Ceph.

I partly disagree :). Identifying OSDs by device name: I know the device declaration has been a long-standing issue, and I have had a PR attempting to address it. We just have some verification sequences that are not compatible with persistent paths such as /dev/disk/by-uuid. The rest should work normally, unless we discover an issue with ceph-disk. I actually just had an idea that should quickly solve this problem :). Same number of devices on all hosts: it is not all-or-nothing, you can edit host variables by setting specific devices as part of your inventory file, e.g.:

    [osds]
    ceph-osd-01 devices="[ '/dev/sdb', '/dev/sdc' ]"
    ceph-osd-02 devices="[ '/dev/sdb', '/dev/sdc', '/dev/sdd' ]"

Given my last answer, I think the all-or-nothing behaviour can be mitigated, right? I don't think we can check disk status at the end of the play (at least I don't know how; perhaps with a handler)... Do you mind giving this another try? The issue that you reported only happened because the device didn't exist on the system, so you should now be able to move forward. You know that I'm trying as hard as I can, and I used to do only sysadmin work at some point, so I'll make everyone happy ;) Thanks!

Seb, I greatly appreciate what you have accomplished and how hard it is to do; this bz is not a criticism of any individual. All I'm saying is that we have an opportunity to make it better, perhaps I didn't express that well enough. Also, these problems are not unique to ceph-ansible. I think OSP 8 has the same problem in its YAML config files, which define which block devices can be used for storage by device name.

I didn't know about the ability to set a different device list per host. So if I understand correctly, I could set 36 drives per server except on the host where we lost a drive, and then set 35 there with an edit to the inventory file. If so, then we can reduce the priority of this bz; I'll try it. I don't think this is mentioned in the documentation anywhere, and it certainly wasn't obvious to me that we could do this. Maybe a comment in the osds.sample file would help? And also the RHCS 2 docs? But it would be nice not to have to edit the inventory file this way, particularly when the devices list is very long. What would perhaps be easier for the end user is a way to subtract devices from the device list, like:

    c07-h01-6048r device_blacklist="[ '/dev/sdal' ]"

instead of:

    c07-h01-6048r devices="[ '/dev/sdc', '/dev/sdd', ... , '/dev/sdak' ]"

In the long run, IMHO a better way to configure OSDs is to specify rules for OSD selection instead of enumerating them. thx -ben

Lowering priority because the workaround was successful. Sebastien suggested overriding the "devices" var to remove a block device from the list and this worked; I just didn't think of it. As I said in the previous reply, I don't think it's an ideal user interface, but it allows us to use every functioning OSD device in the system, as well as incorporating OSD hosts with varying numbers of block devices.

No worries Ben, I didn't take anything personally, really :).
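For what it's worth, the proposed blacklist could probably be implemented as a simple list subtraction before the OSD tasks run; this is only a sketch, and device_blacklist is not an existing ceph-ansible variable:

    # Hypothetical: "device_blacklist" is the proposed per-host variable.
    # This rewrites "devices" so the blacklisted entries are never touched.
    - name: drop blacklisted devices from the per-host device list
      set_fact:
        devices: "{{ devices | difference(device_blacklist | default([])) }}"

With that in place, an inventory entry like the c07-h01-6048r device_blacklist example above would be enough to skip the failed drive while the group-wide devices list stays untouched.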
I know it is not ideal to pass devices like "[ '/dev/sdc', '/dev/sdd', ... , '/dev/sdak' ]", especially when we have a long list of devices... The device blacklist might be difficult to implement, so I'm not sure how we could enhance this, to be honest... Oh, and by the way, you don't need to edit the inventory for all the hosts, just for the one that is not consistent with the others ;)

Yeah, that's what I did, just the one host, and it worked fine.

Great! I'm wondering if we should not close this now, Ben; if we are looking at something else, then we should open a new BZ for that, wdyt?

So I'm changing this bz to a documentation bz. I think that long-term there are still usability issues raised in comment 5 that matter, and I'll try to find the right venue for that discussion. I think the above discussion verifies that we don't have to fix the software right now, but we should document this so people know how to deal with the situation, unless everyone but me already knew about this technique for handling non-identical servers ;-) https://access.redhat.com/documentation/en/red-hat-ceph-storage/2/paged/installation-guide-for-red-hat-enterprise-linux/ does not discuss this technique AFAICT. In your example, maybe more like this?

    [osds]
    <osd-hostname> devices="[ '<device_1>', '<device_2>', ... , '<last_dev_name>' ]"

Also, you might want to provide an example to make it clearer. For example, three hosts, host1, host2, and host3, each have devices /dev/sdb, /dev/sdc, and /dev/sdd. On host2, /dev/sdc fails and is removed. Upon subsequent reboot, the old /dev/sdc is not discovered, so host2 now only has /dev/sdb and /dev/sdc (formerly /dev/sdd). To handle this situation, we can override the devices var for host2 as follows:

    [osds]
    host1
    host2 devices="[ '/dev/sdb', '/dev/sdc' ]"
    host3

Now ceph-ansible can run without error and bring up every available disk in this configuration. Why am I obsessing about this problem? Disk failure is by far the most likely hardware failure to occur in a Ceph cluster, so it really, really matters. Not only that, but people who are prototyping Ceph clusters on non-identical servers will really appreciate knowing this.

I looked at the commit and it looks good to me. Ben, is that enough for you?

LGTM, thanks!

Moving to Verified based on Rachana's inputs.
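When the device list is long, the same per-host override can also be kept out of the inventory file entirely by using a host_vars file, a standard Ansible mechanism; the path and host name below are just the ones from the example above:

    # host_vars/host2.yml: only host2, which lost a disk, overrides the
    # group-wide "devices" list; host1 and host3 keep the default.
    devices:
      - /dev/sdb
      - /dev/sdc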
Description of problem:

If a single disk is unavailable on an OSD host in the devices: list for ceph-ansible, ceph-ansible will not bring up any of the OSDs on that host. The only reason this is not high-priority is that ceph-ansible is typically run once at install time, and there is a workaround: replace the bad disk, or even out the disk distribution across hosts with any extra disks serving as spares. It is possible to run purge-cluster.yml to clear out the old install and redo it with fewer devices. However, in a cluster with a large OSD host count, this results in a significant number of unused block devices, and significant extra work to bring those devices online later. Since autodiscover mode and SSD journaling are not compatible (right?), autodiscover mode is not a workaround.

Version-Release number of selected component (if applicable):
RHCS 2.0 - ceph-common-10.2.2-38.el7cp.x86_64
RHEL 7.3 beta - kernel-3.10.0-493.el7.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Define N hosts with 3 block devices each, and 1 extra host with only 2 of the 3 block devices.
2. Define ceph-ansible's devices: list to point to all 3 devices, and use raw_multi_journal: true.
3. ansible-playbook -i inventory-file site.yml

Actual results:
You get an error in the ceph-ansible task named:

    - name: fix partitions gpt header or labels of the osd disks
      shell: "sgdisk --zap-all --clear --mbrtogpt -g -- {{ item.1 }} || sgdisk --zap-all --clear --mbrtogpt -g -- {{ item.1 }}"
      with_together:
        - combined_osd_partition_status_results.results
        - devices
      changed_when: false
      when: (journal_collocation or raw_multi_journal) and not osd_auto_discovery and item.0.rc != 0

In this example, we have 36 drives available to ceph-ansible and /dev/sdal is the 36th drive. If there is a bad disk anywhere in /dev/sd[a-z] /dev/sda[a-l], then /dev/sdal will not exist on that host and this step will fail:

    failed: [c07-h01-6048r.rdu.openstack.engineering.redhat.com] => (item=[{u'cmd': u'parted --script /dev/sdal print > /dev/null 2>&1', u'end': u'2016-09-12 21:27:49.662849', 'failed': False, u'stdout': u'', u'changed': False, u'rc': 1, u'start': u'2016-09-12 21:27:49.659113', 'item': '/dev/sdal', u'warnings': [], u'delta': u'0:00:00.003736', 'invocation': {'module_name': u'shell', 'module_complex_args': {}, 'module_args': u'parted --script /dev/sdal print > /dev/null 2>&1'}, 'stdout_lines': [], 'failed_when_result': False, u'stderr': u''}, '/dev/sdal']) => {"changed": false, "cmd": "sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdal || sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdal", "delta": "0:00:00.137519", "end": "2016-09-12

No further commands are executed for this host! It is not obvious from inspection of the ansible log, but you do see this line in the play summary at the end:

    c07-h01-6048r.rdu.openstack.engineering.redhat.com : ok=99 changed=8 unreachable=0 failed=1

Expected results:
Although no further commands can be executed for this block device, the host should be able to continue installation on the remaining block devices. At the end, the "ceph status" and "ceph osd tree" commands will make clear which hosts did not get all their OSDs. The installation can proceed, and any failed OSDs can be replaced using normal Ceph maintenance procedures.
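For reference, a minimal group_vars sketch matching the reproduction steps above; only the variables named in this report (devices and raw_multi_journal) are shown, the device names are illustrative, and the journal device mapping is omitted because it depends on the ceph-ansible version:

    # group_vars/osds.yml (sketch): three OSD data devices per host, with
    # journals on dedicated raw devices rather than collocated on the data disks.
    devices:
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
    raw_multi_journal: true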
Additional info:

The root of the problem is that an Ansible with_together clause is used, the task is considered to have failed if the command failed for *any* of the devices in the with_together list, and the host is then disqualified from further participation in the play. Is there a way to record whether the sgdisk command succeeded or not, and then base subsequent steps for that block device on the outcome? One possible approach is sketched below.
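A minimal sketch of one way to do that, with per-device results registered and later tasks looping over those results rather than over the raw device list; the task wording and the zap_results variable are illustrative, not the actual ceph-ansible code:

    - name: zap and relabel each OSD device, recording the outcome per device
      shell: "sgdisk --zap-all --clear --mbrtogpt -g -- {{ item }}"
      with_items: "{{ devices }}"
      register: zap_results
      failed_when: false      # a missing device no longer fails the whole host
      changed_when: false

    - name: continue OSD preparation only for devices that were zapped successfully
      shell: "ceph-disk prepare {{ item.item }}"   # item.item is the original device path
      with_items: "{{ zap_results.results }}"
      when: item.rc == 0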