Created attachment 1511859 [details]
ceph-install-workflow.log

Description of problem:

The customer is using a workaround like [1] because they do not have homogeneous hardware; they are also using /dev/disk/by-id/wwn template mappings [2]. After applying the workaround [1] the deploy fails; in ceph-install-workflow.log it appears that ceph-ansible tries to inspect OSD containers that no longer exist [3]. This seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1651290.

[1]
head /etc/udev/rules.d/20-names.rules
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f1", SYMLINK+="diska%n"
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f5", SYMLINK+="diskb%n"

[2]
head csu-osd-disk-maps.yaml
# disk maps
parameter_defaults:
  NodeDataLookup: |
    {
      "D2181EBB-1B41-994D-9FF9-65C3D171352D": {
        "devices": [
          "/dev/disk/by-id/wwn-0x500003976822a88d",
          "/dev/disk/by-id/wwn-0x500003976822a991",
          "/dev/disk/by-id/wwn-0x500003976822a851",

[3]
### /cases/02265105/x-text/ceph-install-workflow.log
2018-12-03 17:16:23,740 p=31082 u=mistral |  fatal: [10.20.12.118]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "4e8a7e270552", "c5b3874bbeca", "58848a70883b", "9f686ca59f1b", "f7850013c947", "2ea7f52a9f96", "e703bc590a59", "daac219bb5cb", "6fa90e420694", "82cfa880738e", "a6b523303b27", "dae11aac396a", "8606dc679269", "f22f4fe14cfc", "3bb507ea55c7", "b0094fb48b26", "684378fec9d5", "c8eb22aeaf8a", "a49cd34c5317", "aaa89ea2cbbb"], "delta": "0:00:00.031685", "end": "2018-12-03 17:16:23.693344", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-12-03 17:16:23.661659", "stderr": "Error: No such object: c5b3874bbeca", "stderr_lines": ["Error: No such object: c5b3874bbeca"], "stdout": "[\n {\n \"Id\": \"4e8a7e270552020430208feb8cc2b971bd8c3e031a7d0028e9bb0a42eb9f7823\",\n

[jmelvin@collab-shell x-text]$ grep c5b3874bbeca /cases/02265105/sosreport-20181203-222729/os2-prd-ceph05/sos_commands/docker/docker_ps
[jmelvin@collab-shell x-text]$

Version-Release number of selected component (if applicable):

The customer updated to the latest version and still has the issue:
  Updating : ceph-ansible-3.1.5-1.el7cp.noarch           1/2
  Cleanup  : ceph-ansible-3.1.0-0.1.rc10.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. apply the workaround [1]
2. run overcloud deploy

Actual results:
deploy fails

Expected results:
deploy works

Additional info:
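As a side note for anyone reproducing this: one way to check what the udev workaround in [1] actually produced is to compare the id that udev matches against RESULT== with the symlinks present in /dev. This is only a sketch; /dev/sdc is a hypothetical device, substitute one of the OSD disks:

  # print the id that udev compares against RESULT== for this device
  # (same scsi_id flags as in the rule itself)
  /lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/sdc

  # list the diskX symlinks the rule created and the kernel names they point to
  ls -l /dev/diska* /dev/diskb*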
The "No such object" error [1] is likely caused by the container failing before it can be used. At that point another container is created with a different ID, because the systemd unit file has Restart=always (and the replacement container will likely fail for the same reason).

As a next step, on a server running OSDs, configure the OSD unit file to not restart [2] and then run `systemctl daemon-reload`. After that, run the script which starts the OSD manually, passing it the block device as an argument. For example:

  /usr/share/ceph-osd-run.sh /dev/sdc

The output of the above command is what we need next to continue diagnosis.

Also, ensure that the disks are clean for each deployment as described in doc bug 1613918. If the disks haven't been zapped, that would explain why this happened.

So please provide:

1. the output of the ceph-osd-run command above, after updating the OSD systemd unit file
2. whether or not the disks were cleaned prior to deployment

[1]
2018-12-03 17:16:23,740 p=31082 u=mistral |  fatal: [10.20.12.118]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "4e8a7e270552", "c5b3874bbeca", "58848a70883b", "9f686ca59f1b", "f7850013c947", "2ea7f52a9f96", "e703bc590a59", "daac219bb5cb", "6fa90e420694", "82cfa880738e", "a6b523303b27", "dae11aac396a", "8606dc679269", "f22f4fe14cfc", "3bb507ea55c7", "b0094fb48b26", "684378fec9d5", "c8eb22aeaf8a", "a49cd34c5317", "aaa89ea2cbbb"], "delta": "0:00:00.031685", "end": "2018-12-03 17:16:23.693344", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-12-03 17:16:23.661659", "stderr": "Error: No such object: c5b3874bbeca", "stderr_lines": ["Error: No such object: c5b3874bbeca"], "stdout": "[\n {\n \"Id\": \"4e8a7e270552020430208feb8cc2b971bd8c3e031a7d0028e9bb0a42eb9f7823\",\n

[2]
[root@ceph-0 ~]# grep Restart /etc/systemd/system/ceph-osd\@.service
Restart=never # <-------- was always
RestartSec=10s
[root@ceph-0 ~]#
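For reference, the steps above can be run as one short sequence. This is only a sketch: it assumes the unit instance is named after the device (ceph-osd@sdc), and it uses Restart=no, which is the value systemd documents for disabling restarts (equivalent in intent to the Restart=never shown in [2]):

  # stop the unit so systemd does not race the manual run
  systemctl stop ceph-osd@sdc

  # disable automatic restarts in the unit file, then reload systemd
  sed -i 's/^Restart=.*/Restart=no/' /etc/systemd/system/ceph-osd@.service
  systemctl daemon-reload

  # start the OSD by hand and capture the output requested in item 1
  /usr/share/ceph-osd-run.sh /dev/sdc 2>&1 | tee /tmp/ceph-osd-run-sdc.log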
(In reply to John Fulton from comment #5)
> So please provide:
>
> 1. the output of the ceph-osd-run command above after updating the OSD
> systemd unit file
> 2. whether or not the disks were cleaned prior to deployment

As per our conversation we don't need item #2 above (as this is a re-assert of an existing deployment). In lieu of that, please provide #1 and the output of `ceph -s` as run on a controller node.
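On an OSP13 controller the monitor runs in a container, so `ceph -s` is typically run through it. A sketch, assuming ceph-ansible's usual ceph-mon-<short hostname> container name (adjust if yours differs):

  # on a controller node, as root
  docker exec ceph-mon-$(hostname -s) ceph -s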
I talked to Seb more about this bug.

A. The inconsistency between containers named sdX vs. diskX [0] can be explained by the udev rule from comment #1 [1]; ceph-ansible 3.1 with ceph-disk will consistently do the right thing unless the udev rules and parameter inputs are modified.

B. The inconsistency between block devices on reboot happens with some hardware; the solution is to use names which are consistent between reboots, like the WWN or PCI path, which already appears to be done in this case [2].

C. Though a change to OSD container naming is coming with the move to ceph-ansible 3.2 and ceph-volume, doc bug 1660503 describes how to deal with that: continue using the old name on existing servers while using the new name on new servers, and how to transition.

So to avoid this issue going forward, do not use udev rules like the ones described in A, and do use names which will be consistent between reboots as described in B. This should continue to work through updates even when 3.2 lands. After 3.2 lands, if you wish to transition to bluestore and the new container naming convention, please follow what is described in doc bug 1660503.

[0]
CONTAINER ID   IMAGE                                             COMMAND            CREATED              STATUS               PORTS   NAMES
da2398a113c8   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   3 seconds ago        Up 2 seconds                 ceph-osd-os2-prd-ceph01-sdc
0b154cc39800   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   44 seconds ago       Up 43 seconds                ceph-osd-os2-prd-ceph01-sdj
21d8fd4d9b4b   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   46 seconds ago       Up 45 seconds                ceph-osd-os2-prd-ceph01-sdk
9660ae532957   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sdd
de8d34cc4bac   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sdl
850d3c1fa845   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sda
1361735d3cfe   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sde
9abca5b261d8   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sdf
a1ba1d6cb00d   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sdi
f5326613ade6   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   About a minute ago   Up About a minute            ceph-osd-os2-prd-ceph01-sdb
106e9058f479   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskg
1715d31b17a1   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskh
40bb07d13ef5   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskc
d936453ca5e9   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diski
3c3d52f0c70b   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskj
a406a95c4e2c   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskd
ab3e07aa5753   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diska
439df8ed7abe   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskf
e9e6ed530554   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diske
cfbf98a60353   10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"   4 hours ago          Up 4 hours                   ceph-osd-os2-prd-ceph01-diskb
2e966f3ba78d   10.20.12.10:8787/rhosp13/openstack-cron:13.0-59   "kolla_start"      3 weeks ago          Up 4 hours                   logrotate_crond

[1]
head /etc/udev/rules.d/20-names.rules
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f1", SYMLINK+="diska%n"
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f5", SYMLINK+="diskb%n"

[2]
head csu-osd-disk-maps.yaml
# disk maps
parameter_defaults:
  NodeDataLookup: |
    {
      "D2181EBB-1B41-994D-9FF9-65C3D171352D": {
        "devices": [
          "/dev/disk/by-id/wwn-0x500003976822a88d",
          "/dev/disk/by-id/wwn-0x500003976822a991",
          "/dev/disk/by-id/wwn-0x500003976822a851",