Bug 1656545 - osd container name is inconsistent between ceph-ansible versions
Summary: osd container name is inconsistent between ceph-ansible versions
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 3.*
Assignee: Sébastien Han
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1578730
 
Reported: 2018-12-05 18:14 UTC by Jeremy
Modified: 2022-03-13 17:14 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-02 15:12:15 UTC
Embargoed:


Attachments
ceph-install -workflow.log (18.99 MB, text/plain)
2018-12-05 18:14 UTC, Jeremy


Links
Red Hat Issue Tracker RHCEPH-3764 (Last Updated: 2022-03-13 17:14:00 UTC)

Description Jeremy 2018-12-05 18:14:27 UTC
Created attachment 1511859 [details]
ceph-install -workflow.log

Description of problem:
Customer is using a workaround like [1] because they do not have homogeneous hardware, and they are also using /dev/disk/by-id/wwn template mappings [2]. After applying the workaround [1], the deploy fails, and in ceph-install-workflow.log it appears that ceph-ansible tries to inspect OSD containers that no longer exist [3]. Could this be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1651290 ?



[1]
head /etc/udev/rules.d/20-names.rules
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f1", SYMLINK+="diska%n"
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f5", SYMLINK+="diskb%n"

[2]
head csu-osd-disk-maps.yaml 
# disk maps
parameter_defaults:
  NodeDataLookup: |
    {
      "D2181EBB-1B41-994D-9FF9-65C3D171352D": {
        "devices": [
          "/dev/disk/by-id/wwn-0x500003976822a88d",
          "/dev/disk/by-id/wwn-0x500003976822a991",
          "/dev/disk/by-id/wwn-0x500003976822a851",

[3]
###/cases/02265105/x-text/ceph-install-workflow.log
2018-12-03 17:16:23,740 p=31082 u=mistral |  fatal: [10.20.12.118]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "4e8a7e270552", "c5b3874bbeca", "58848a70883b", "9f686ca59f1b", "f7850013c947", "2ea7f52a9f96", "e703bc590a59", "daac219bb5cb", "6fa90e420694", "82cfa880738e", "a6b523303b27", "dae11aac396a", "8606dc679269", "f22f4fe14cfc", "3bb507ea55c7", "b0094fb48b26", "684378fec9d5", "c8eb22aeaf8a", "a49cd34c5317", "aaa89ea2cbbb"], "delta": "0:00:00.031685", "end": "2018-12-03 17:16:23.693344", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-12-03 17:16:23.661659", "stderr": "Error: No such object: c5b3874bbeca", "stderr_lines": ["Error: No such object: c5b3874bbeca"], "stdout": "[\n    {\n        \"Id\": \"4e8a7e270552020430208feb8cc2b971bd8c3e031a7d0028e9bb0a42eb9f7823\",\n      

[jmelvin@collab-shell x-text]$ grep c5b3874bbeca /cases/02265105/sosreport-20181203-222729/os2-prd-ceph05/sos_commands/docker/docker_ps
[jmelvin@collab-shell x-text]$
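
A quick way to run the same check across all of the IDs from the failed `docker inspect` call (the three IDs below are copied from the log excerpt in [3]; the loop itself is only an illustrative sketch):

 # print any container ID that docker no longer knows about
 for id in 4e8a7e270552 c5b3874bbeca 58848a70883b; do
   docker ps -a --no-trunc --format '{{.ID}}' | grep -q "^$id" || echo "missing: $id"
 done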



Version-Release number of selected component (if applicable):
 Customer updated to the latest version and still has the issue:
 Updating   : ceph-ansible-3.1.5-1.el7cp.noarch                                                                                                                                                      1/2
  Cleanup    : ceph-ansible-3.1.0-0.1.rc10.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. apply workaround
2. run overcloud deploy
3.

Actual results:
deploy fails

Expected results:
deploy works

Additional info:

Comment 5 John Fulton 2018-12-12 14:51:43 UTC
The "no such container" error [1] is likely caused by the container failing before it can be used. At that point another container is created with a different ID because the systemd unit file has Restart=always (and the new container will likely fail for the same reason). As a next step, on a server running OSDs, configure the OSD unit file to not restart [2] and then run `systemctl daemon-reload`. After that, run the script to start the OSD manually, passing the block device as an argument. For example:

 /usr/share/ceph-osd-run.sh /dev/sdc

The output of the above command is what we need next to continue diagnosis. Also, ensure that the disks are clean for each deployment, as described in doc bug 1613918. If the disks haven't been zapped, that will explain why this happened.

So please provide:

1. the output of the ceph-osd-run command above after updating the OSD systemd unit file
2. whether or not the disks were cleaned prior to deployment


[1]
2018-12-03 17:16:23,740 p=31082 u=mistral |  fatal: [10.20.12.118]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "4e8a7e270552", "c5b3874bbeca", "58848a70883b", "9f686ca59f1b", "f7850013c947", "2ea7f52a9f96", "e703bc590a59", "daac219bb5cb", "6fa90e420694", "82cfa880738e", "a6b523303b27", "dae11aac396a", "8606dc679269", "f22f4fe14cfc", "3bb507ea55c7", "b0094fb48b26", "684378fec9d5", "c8eb22aeaf8a", "a49cd34c5317", "aaa89ea2cbbb"], "delta": "0:00:00.031685", "end": "2018-12-03 17:16:23.693344", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-12-03 17:16:23.661659", "stderr": "Error: No such object: c5b3874bbeca", "stderr_lines": ["Error: No such object: c5b3874bbeca"], "stdout": "[\n    {\n        \"Id\": \"4e8a7e270552020430208feb8cc2b971bd8c3e031a7d0028e9bb0a42eb9f7823\",\n   

[2]
[root@ceph-0 ~]# grep Restart /etc/systemd/system/ceph-osd\@.service
Restart=never   # <-------- was always
RestartSec=10s
[root@ceph-0 ~]#
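
A condensed sketch of the diagnostic sequence above (the unit file path and /dev/sdc come from the examples in this comment; the ceph-osd@sdc instance name and the log path are assumptions, so adjust them for the failing OSD):

 # stop systemd from immediately replacing the failed container (after editing Restart= as in [2])
 systemctl daemon-reload
 systemctl stop ceph-osd@sdc.service          # assumed unit instance for /dev/sdc
 # start the OSD by hand and capture the output requested in item #1
 /usr/share/ceph-osd-run.sh /dev/sdc 2>&1 | tee /tmp/ceph-osd-run-sdc.log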

Comment 6 John Fulton 2018-12-12 15:08:17 UTC
(In reply to John Fulton from comment #5)
> So please provide:
> 
> 1. the output of the ceph-osd-run command above after updating the OSD
> systemd unit file
> 2. whether or not the disks were cleaned prior to deployment

As per our conversation, we don't need item #2 above (as this is a re-assert of an existing deployment).

Instead, please provide item #1 and the output of `ceph -s` run on a controller node.

Comment 11 John Fulton 2019-01-02 15:12:15 UTC
I talked to Seb more about this bug. 

A. The inconsistency between containers named sdX vs diskX [0] can be explained by the udev rule from comment #1 [1]; ceph-ansible 3.1 with ceph-disk will consistently do the right thing unless you modify the udev rules and parameter inputs. 

B. The inconsistency of block device names across reboots happens with some hardware; the solution is to use names that are consistent between reboots, such as the WWN or PCI path, which already appears to be done in this case [2].

C. Though OSD container naming will change with the move to ceph-ansible 3.2 and ceph-volume, doc bug 1660503 describes how to keep the old name on existing servers, use the new name on new servers, and transition between the two.

So, to avoid this issue going forward, don't use udev rules like the ones described in A, and do use names that are consistent between reboots as described in B. This should continue to work through updates even when 3.2 lands. After 3.2 lands, if you wish to transition to BlueStore and the new container naming convention, please follow what is described in doc bug 1660503.
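
For example, one way to verify both recommendations on an OSD node (the rule filename and the WWN are taken from the excerpts below; this is only an illustrative check):

 # A: no custom device-naming udev rules should remain
 ls /etc/udev/rules.d/ | grep 20-names.rules || echo "no custom naming rules present"
 # B: the stable by-id name should resolve to whatever kernel name the disk has today
 readlink -f /dev/disk/by-id/wwn-0x500003976822a88d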

[0]
CONTAINER ID        IMAGE                                             COMMAND             CREATED              STATUS              PORTS               NAMES
da2398a113c8        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    3 seconds ago        Up 2 seconds                            ceph-osd-os2-prd-ceph01-sdc
0b154cc39800        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    44 seconds ago       Up 43 seconds                           ceph-osd-os2-prd-ceph01-sdj
21d8fd4d9b4b        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    46 seconds ago       Up 45 seconds                           ceph-osd-os2-prd-ceph01-sdk
9660ae532957        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sdd
de8d34cc4bac        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sdl
850d3c1fa845        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sda
1361735d3cfe        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sde
9abca5b261d8        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sdf
a1ba1d6cb00d        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sdi
f5326613ade6        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    About a minute ago   Up About a minute                       ceph-osd-os2-prd-ceph01-sdb
106e9058f479        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskg
1715d31b17a1        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskh
40bb07d13ef5        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskc
d936453ca5e9        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diski
3c3d52f0c70b        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskj
a406a95c4e2c        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskd
ab3e07aa5753        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diska
439df8ed7abe        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskf
e9e6ed530554        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diske
cfbf98a60353        10.20.12.10:8787/rhceph/rhceph-3-rhel7:3-12       "/entrypoint.sh"    4 hours ago          Up 4 hours                              ceph-osd-os2-prd-ceph01-diskb
2e966f3ba78d        10.20.12.10:8787/rhosp13/openstack-cron:13.0-59   "kolla_start"       3 weeks ago          Up 4 hours                              logrotate_crond

[1]
head /etc/udev/rules.d/20-names.rules
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f1", SYMLINK+="diska%n"
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/lib/udev/scsi_id --whitelisted --replace-whitespace --device=%N", RESULT=="3500003976822a1f5", SYMLINK+="diskb%n"

[2]
head csu-osd-disk-maps.yaml 
# disk maps
parameter_defaults:
  NodeDataLookup: |
    {
      "D2181EBB-1B41-994D-9FF9-65C3D171352D": {
        "devices": [
          "/dev/disk/by-id/wwn-0x500003976822a88d",
          "/dev/disk/by-id/wwn-0x500003976822a991",
          "/dev/disk/by-id/wwn-0x500003976822a851",

