Bug 1670527 - If LVM is not installed, containers don't come up after a system reboot
Summary: If LVM is not installed, containers don't come up after a system reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.3
Assignee: Dimitri Savineau
QA Contact: Yogesh Mane
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-29 18:32 UTC by Alfredo Deza
Modified: 2019-08-21 15:10 UTC (History)
CC List: 13 users

Fixed In Version: RHEL: ceph-ansible-3.2.15-1.el7cp Ubuntu: ceph-ansible_3.2.15-2redhat1
Doc Type: Bug Fix
Doc Text:
.OSD containers running on Ubuntu start after a system reboot as expected

In a containerized deployment when using the `lvm` OSD scenario, the LVM package and service are required. On Red Hat Enterprise Linux, the LVM package is automatically installed as a dependency of the `docker` package, but it is not automatically installed on Ubuntu. Consequently, OSD containers running on Ubuntu did not start after a reboot. With this update, the `ceph-ansible` utility installs the LVM package and starts the LVM service explicitly. As a result, OSD containers running on Ubuntu start as expected after a system reboot.
Clone Of:
Environment:
Last Closed: 2019-08-21 15:10:25 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 3732 0 None closed ceph-osd: Ensure lvm2 is installed 2020-08-28 20:58:24 UTC
Github ceph ceph-ansible pull 3733 0 None closed Automatic backport of pull request #3732 2020-08-28 20:58:24 UTC
Red Hat Product Errata RHSA-2019:2538 0 None None None 2019-08-21 15:10:42 UTC

Description Alfredo Deza 2019-01-29 18:32:09 UTC
Description of problem: OSD containers provisioned with ceph-volume (bluestore) don't survive a system reboot because LVM is not installed on the host.

Somehow the system is able to produce the LVs needed for the initial provisioning (systemd units and OSDs are running correctly).

After a system reboot, the logs report that the devices can't be found. After installing LVM and restarting again, the OSDs are able to come up. 


Steps to Reproduce:
1. Provision a bluestore OSD (filestore would probably behave the same) with ceph-volume in a container
2. After provisioning is complete and the OSD is up and in, restart the system


Actual results: systemd units attempt to bring the OSD up but fail, and after a few tries systemd gives up


Expected results: LVM is ensured to be present, which allows OSDs to fully come up after a system reboot
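
For illustration, a minimal sketch of the kind of ceph-ansible task that would ensure this (the actual change was made in the pull request linked above; the task shown here is an assumption, not the merged code):

    # Hypothetical sketch: make sure the lvm2 package is present on OSD hosts
    # before activating LVM-backed OSD containers.
    - name: ensure lvm2 is installed
      package:
        name: lvm2
        state: present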


Additional info:
group vars:
    egrep "^\w" group_vars/all.yml
    osd_objectstore: "bluestore"
    osd_scenario: lvm
    num_osds: 1
    devices:
    dummy:
    fsid : "8f045acd-4f18-4850-a5a1-f61792e555cf"
    monitor_interface: eth1
    monitor_address: 192.168.111.100
    ip_version: ipv4
    mon_use_fqdn: false # if set to true, the MON name used will be the fqdn in the ceph.conf
    public_network: 192.168.111.0/24
    ceph_docker_image: "ceph/daemon"
    ceph_docker_image_tag: latest-luminous
    ceph_docker_registry: docker.io
    containerized_deployment: true



LVM is not installed in the system (Ubuntu Bionic in this case):

    root@node4:/home/vagrant# lvs

    Command 'lvs' not found, but can be installed with:

    apt install lvm2


ceph-ansible correctly provisioned the OSD:

    root@node4:/home/vagrant# systemctl status ceph-osd@1
    ● ceph-osd - Ceph OSD
       Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
       Active: active (running) since Tue 2019-01-29 18:04:22 UTC; 1min 43s ago
      Process: 9184 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
      Process: 9169 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
     Main PID: 9192 (ceph-osd-run.sh)
        Tasks: 12 (limit: 2325)
       CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
               ├─9192 /bin/bash /usr/share/ceph-osd-run.sh 1
               └─9194 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -R ceph:ceph /dev/mapper/ceph--a893229a--7adb--4104--8f3b--0e5e94eb8d67-osd--data--df8a15d1--7f80--45f6--aa87--9beb7f40c242
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: --> ceph-volume lvm activate successful for osd ID: 1
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:28  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: exec: PID 9516: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: exec: Waiting 9516 to quit
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
    Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:28.848970 7feb37263d80 -1 osd.1 0 log_to_monitors {default=true}
    Jan 29 18:04:30 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:30.161239 7feb1f054700 -1 osd.1 0 waiting for initial osdmap

Restart the system:
    root@node4:/var/log/ceph# reboot
    Connection to node4 closed by remote host.
    $ ssh node4
    Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-42-generic x86_64)

     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/advantage

    Last login: Tue Jan 29 18:05:45 2019 from 192.168.111.1
    vagrant@node4:~$ sudo su

Containers are not running:

    root@node4:/home/vagrant# docker ps
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
    root@node4:/home/vagrant# systemctl status ceph-osd@1
    ● ceph-osd - Ceph OSD
       Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
       Active: activating (auto-restart) (Result: exit-code) since Tue 2019-01-29 18:07:16 UTC; 9s ago
      Process: 1163 ExecStart=/usr/share/ceph-osd-run.sh 1 (code=exited, status=1/FAILURE)
      Process: 1132 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
      Process: 1094 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
     Main PID: 1163 (code=exited, status=1/FAILURE)

    Jan 29 18:07:16 node4 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
    Jan 29 18:07:16 node4 systemd[1]: ceph-osd: Failed with result 'exit-code'.

Attempting to start the container doesn't work:

    root@node4:/home/vagrant# systemctl start ceph-osd@1
    root@node4:/home/vagrant# systemctl status ceph-osd@1
    ● ceph-osd - Ceph OSD
       Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
       Active: active (running) since Tue 2019-01-29 18:07:41 UTC; 2s ago
      Process: 2863 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
      Process: 2841 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
     Main PID: 2877 (ceph-osd-run.sh)
        Tasks: 12 (limit: 2325)
       CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
               ├─2877 /bin/bash /usr/share/ceph-osd-run.sh 1
               └─2878 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

    Jan 29 18:07:41 node4 systemd[1]: Stopped Ceph OSD.
    Jan 29 18:07:41 node4 systemd[1]: Starting Ceph OSD...
    Jan 29 18:07:41 node4 docker[2841]: Error response from daemon: No such container: ceph-osd-1
    Jan 29 18:07:41 node4 docker[2863]: Error: No such container: ceph-osd-1
    Jan 29 18:07:41 node4 systemd[1]: Started Ceph OSD.
    Jan 29 18:07:42 node4 ceph-osd-run.sh[2877]: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
    root@node4:/home/vagrant# docker ps
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES


ceph-volume (via ceph-bluestore-tool) reports that the device is not present. This is expected, since it is LVM that creates the links under /dev/; because LVM is not installed, the devices never appear:

    root@node4:/home/vagrant# journalctl -u ceph-osd@1
    ...
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,556][ceph_volume.process][INFO  ] Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr 2019-01-29 18:08:00.887652 7f28b1e1dec0 -1 bluestore(/dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa8
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr failed to read label for
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr :
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr (2) No such file or directory
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO  ] stderr
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,891][ceph_volume][ERROR ] exception caught by decorator
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: Traceback (most recent call last):
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     return f(*a, **kw)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     terminal.dispatch(self.mapper, subcommand_args)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in dispatch
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     instance.main()
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", line 40, in main
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     terminal.dispatch(self.mapper, self.argv)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in dispatch
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     instance.main()
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 333, in main
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     self.activate(args)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, in is_root
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     return func(*a, **kw)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 257, in activate
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     activate_bluestore(lvs, no_systemd=args.no_systemd)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 164, in activate_bluestore
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     process.run(prime_command)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/process.py", line 153, in run
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     raise RuntimeError(msg)
    Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: RuntimeError: command returned non-zero exit status: 1
    Jan 29 18:08:01 node4 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
    Jan 29 18:08:01 node4 systemd[1]: ceph-osd: Failed with result 'exit-code'.
    ...skipping...

After installing LVM and restarting the system, the containers work correctly again:

Verify that LVM is installed and working:

    root@node4:/home/vagrant# lvs
      LV                                            VG                                        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      osd-data-8ca11df8-cf60-4be2-91f2-784b4d969514 ceph-3cbffed6-e087-4b2c-9e71-2bb434976b72 -wi-ao---- <10.74g
      osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242 ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67 -wi-ao---- <10.74g

Check status of OSD containers:

    root@node4:/home/vagrant# systemctl status ceph-osd@1
    ● ceph-osd - Ceph OSD
       Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
       Active: active (running) since Tue 2019-01-29 18:11:59 UTC; 19s ago
      Process: 1083 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
      Process: 1028 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
     Main PID: 1093 (ceph-osd-run.sh)
        Tasks: 10 (limit: 2323)
       CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
               ├─1093 /bin/bash /usr/share/ceph-osd-run.sh 1
               └─1108 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: ln -snf /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242 /var/lib/ceph/osd/ceph-1/block
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -R ceph:ceph /dev/dm-0
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: --> ceph-volume lvm activate successful for osd ID: 1
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: 2019-01-29 18:12:10  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: exec: PID 1906: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: exec: Waiting 1906 to quit
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
    Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: 2019-01-29 18:12:10.524402 7fd8d0b82d80 -1 osd.1 6 log_to_monitors {default=true}

Comment 1 Sébastien Han 2019-01-30 13:53:18 UTC
In a containerized environment we don't install any packages, so we assume the host has the right content.
There is not much we can do about that; we can display a message and throw an error.

If we managed to get the lv symlinks in the first place, there might be a way to trigger that from the command line inside the container.

Rishabh, please investigate whenever you have time. Thanks
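
A minimal sketch of the "display a message and throw an error" approach suggested here, written as a hypothetical Ansible pre-flight check (task names and placement are assumptions, not what was eventually merged):

    # Hypothetical pre-flight check: fail early with a clear message when
    # lvm2 is missing on the host (containerized deployment, lvm scenario).
    - name: check whether lvm2 is available on the host
      command: lvs --version
      register: lvs_check
      changed_when: false
      failed_when: false

    - name: abort when lvm2 is missing
      fail:
        msg: "lvm2 is required for osd_scenario 'lvm' but is not installed on this host"
      when: lvs_check.rc != 0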

Comment 2 Alfredo Deza 2019-01-30 15:30:27 UTC
Getting the LVM symlinks created is non-trivial; LVM itself relies on udev rules that work together with the lvmetad daemon. I am not sure how any system would go about
creating the links without being able to inspect the logical volumes (since LVM itself is not installed).

I wasn't aware that ceph-ansible does not install anything on a host for containerized deployments, but it is clear that there is a gap here. Why wouldn't ceph-ansible try
to install required dependencies like LVM if they are missing? It already does so in other playbooks.
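
Since the symlinks depend on the LVM udev/lvmetad machinery being active on the host, installing the package alone may not be enough. A sketch of a task that also keeps the LVM service running (the lvm2-lvmetad service name is an assumption and varies by distribution and release):

    # Hypothetical sketch: keep the LVM metadata daemon enabled and running so
    # LV device links are created at boot. The service name is an assumption
    # and differs between distributions and releases.
    - name: ensure the LVM metadata service is enabled and running
      service:
        name: lvm2-lvmetad
        state: started
        enabled: true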

Comment 13 errata-xmlrpc 2019-08-21 15:10:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2538

