Description of problem:
OSD containers provisioned with ceph-volume (bluestore) don't survive a system reboot because LVM is not installed on the host. Somehow the system is able to produce the LVs needed for the initial provisioning (systemd units and OSDs are running correctly), but after a system reboot the logs report that the devices can't be found. After installing LVM and rebooting again, the OSDs are able to come up.

Steps to Reproduce:
1. Provision a bluestore OSD (filestore would probably behave the same) with ceph-volume in a container.
2. After provisioning is complete and the OSD is up and in, restart the system.

Actual results:
The systemd units attempt to bring the OSD up but fail, and after a few tries they give up.

Expected results:
LVM is ensured to be present, which allows the OSDs to fully come up after a system reboot.

Additional info:

Group vars:

egrep "^\w" group_vars/all.yml
osd_objectstore: "bluestore"
osd_scenario: lvm
num_osds: 1
devices: dummy:
fsid : "8f045acd-4f18-4850-a5a1-f61792e555cf"
monitor_interface: eth1
monitor_address: 192.168.111.100
ip_version: ipv4
mon_use_fqdn: false # if set to true, the MON name used will be the fqdn in the ceph.conf
public_network: 192.168.111.0/24
ceph_docker_image: "ceph/daemon"
ceph_docker_image_tag: latest-luminous
ceph_docker_registry: docker.io
containerized_deployment: true

LVM is not installed on the system (Ubuntu Bionic in this case):

root@node4:/home/vagrant# lvs
Command 'lvs' not found, but can be installed with:
apt install lvm2

ceph-ansible correctly provisioned the OSD:

root@node4:/home/vagrant# systemctl status ceph-osd@1
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
   Active: active (running) since Tue 2019-01-29 18:04:22 UTC; 1min 43s ago
  Process: 9184 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
  Process: 9169 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
 Main PID: 9192 (ceph-osd-run.sh)
    Tasks: 12 (limit: 2325)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
           ├─9192 /bin/bash /usr/share/ceph-osd-run.sh 1
           └─9194 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -R ceph:ceph /dev/mapper/ceph--a893229a--7adb--4104--8f3b--0e5e94eb8d67-osd--data--df8a15d1--7f80--45f6--aa87--9beb7f40c242
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: --> ceph-volume lvm activate successful for osd ID: 1
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:28 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: exec: PID 9516: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: exec: Waiting 9516 to quit
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Jan 29 18:04:28 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:28.848970 7feb37263d80 -1 osd.1 0 log_to_monitors {default=true}
Jan 29 18:04:30 node4 ceph-osd-run.sh[9192]: 2019-01-29 18:04:30.161239 7feb1f054700 -1 osd.1 0 waiting for initial osdmap
Restart the system:

root@node4:/var/log/ceph# reboot
Connection to node4 closed by remote host.

$ ssh node4
Welcome to Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-42-generic x86_64)
 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
Last login: Tue Jan 29 18:05:45 2019 from 192.168.111.1
vagrant@node4:~$ sudo su

Containers are not running:

root@node4:/home/vagrant# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

root@node4:/home/vagrant# systemctl status ceph-osd@1
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2019-01-29 18:07:16 UTC; 9s ago
  Process: 1163 ExecStart=/usr/share/ceph-osd-run.sh 1 (code=exited, status=1/FAILURE)
  Process: 1132 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
  Process: 1094 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
 Main PID: 1163 (code=exited, status=1/FAILURE)

Jan 29 18:07:16 node4 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:07:16 node4 systemd[1]: ceph-osd: Failed with result 'exit-code'.

Attempting to start the container doesn't work:

root@node4:/home/vagrant# systemctl start ceph-osd@1
root@node4:/home/vagrant# systemctl status ceph-osd@1
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
   Active: active (running) since Tue 2019-01-29 18:07:41 UTC; 2s ago
  Process: 2863 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
  Process: 2841 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
 Main PID: 2877 (ceph-osd-run.sh)
    Tasks: 12 (limit: 2325)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
           ├─2877 /bin/bash /usr/share/ceph-osd-run.sh 1
           └─2878 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

Jan 29 18:07:41 node4 systemd[1]: Stopped Ceph OSD.
Jan 29 18:07:41 node4 systemd[1]: Starting Ceph OSD...
Jan 29 18:07:41 node4 docker[2841]: Error response from daemon: No such container: ceph-osd-1
Jan 29 18:07:41 node4 docker[2863]: Error: No such container: ceph-osd-1
Jan 29 18:07:41 node4 systemd[1]: Started Ceph OSD.
Jan 29 18:07:42 node4 ceph-osd-run.sh[2877]: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.

root@node4:/home/vagrant# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

ceph-volume (via ceph-bluestore-tool) reports that the device is not present. This is expected: it is LVM that creates the links in /dev/, and since LVM is not installed, the devices never appear:

root@node4:/home/vagrant# journalctl -u ceph-osd@1
...
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,556][ceph_volume.process][INFO ] Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr 2019-01-29 18:08:00.887652 7f28b1e1dec0 -1 bluestore(/dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa8
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr failed to read label for
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr :
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr (2) No such file or directory
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,888][ceph_volume.process][INFO ] stderr
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: [2019-01-29 18:08:00,891][ceph_volume][ERROR ] exception caught by decorator
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: Traceback (most recent call last):
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, in newfunc
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     return f(*a, **kw)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     terminal.dispatch(self.mapper, subcommand_args)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in dispatch
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     instance.main()
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", line 40, in main
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     terminal.dispatch(self.mapper, self.argv)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in dispatch
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     instance.main()
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 333, in main
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     self.activate(args)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, in is_root
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     return func(*a, **kw)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 257, in activate
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     activate_bluestore(lvs, no_systemd=args.no_systemd)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", line 164, in activate_bluestore
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     process.run(prime_command)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:   File "/usr/lib/python2.7/site-packages/ceph_volume/process.py", line 153, in run
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]:     raise RuntimeError(msg)
Jan 29 18:08:00 node4 ceph-osd-run.sh[3351]: RuntimeError: command returned non-zero exit status: 1
Jan 29 18:08:01 node4 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
Jan 29 18:08:01 node4 systemd[1]: ceph-osd: Failed with result 'exit-code'.
...skipping...

Installing LVM and restarting the system again makes the containers work correctly.

Verify that installing LVM worked:

root@node4:/home/vagrant# lvs
  LV                                            VG                                         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  osd-data-8ca11df8-cf60-4be2-91f2-784b4d969514 ceph-3cbffed6-e087-4b2c-9e71-2bb434976b72  -wi-ao---- <10.74g
  osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242 ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67  -wi-ao---- <10.74g

Check the status of the OSD containers:

root@node4:/home/vagrant# systemctl status ceph-osd@1
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
   Active: active (running) since Tue 2019-01-29 18:11:59 UTC; 19s ago
  Process: 1083 ExecStartPre=/usr/bin/docker rm -f ceph-osd-1 (code=exited, status=1/FAILURE)
  Process: 1028 ExecStartPre=/usr/bin/docker stop ceph-osd-1 (code=exited, status=1/FAILURE)
 Main PID: 1093 (ceph-osd-run.sh)
    Tasks: 10 (limit: 2323)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd
           ├─1093 /bin/bash /usr/share/ceph-osd-run.sh 1
           └─1108 /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=1993m --cpus=1 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/ru

Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: ln -snf /dev/ceph-a893229a-7adb-4104-8f3b-0e5e94eb8d67/osd-data-df8a15d1-7f80-45f6-aa87-9beb7f40c242 /var/lib/ceph/osd/ceph-1/block
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -R ceph:ceph /dev/dm-0
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: --> ceph-volume lvm activate successful for osd ID: 1
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: 2019-01-29 18:12:10 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: exec: PID 1906: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: exec: Waiting 1906 to quit
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Jan 29 18:12:10 node4 ceph-osd-run.sh[1093]: 2019-01-29 18:12:10.524402 7fd8d0b82d80 -1 osd.1 6 log_to_monitors {default=true}
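For reference, a rough sketch of how the same recovery could be driven without a full reboot once lvm2 has been installed. This ad-hoc play is hypothetical and not part of ceph-ansible; vgscan --mknodes and vgchange -ay are standard lvm2 commands that recreate missing /dev nodes and activate the volume groups, and osd id 1 is the one from this report:

- hosts: osds
  become: true
  tasks:
    # Assumes lvm2 was just installed; recreate any missing LV device nodes under /dev.
    - name: recreate missing LVM device nodes
      command: vgscan --mknodes

    # Activate all volume groups so /dev/ceph-<vg>/osd-data-<lv> paths exist again.
    - name: activate all volume groups
      command: vgchange -ay

    # Restart the containerized OSD unit so ceph-volume lvm activate can succeed.
    - name: restart the OSD container unit
      systemd:
        name: ceph-osd@1
        state: restarted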
In a containerized environment we don't install any packages, so we assume the host already has the right content. There is not much we can do about that; we can display a message and throw an error. If we managed to get the LV symlinks in the first place, there might be a way to trigger that from the command line inside the container. Rishabh, please investigate whenever you have time. Thanks
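A minimal sketch of that "message and error" idea as an Ansible pre-flight check, assuming it would run early in the OSD role for containerized lvm deployments (the task names and placement are mine, not existing ceph-ansible code):

- name: check whether lvm2 is available on the host
  command: lvs --version
  register: lvs_check
  changed_when: false
  failed_when: false

- name: fail with a clear message when lvm2 is missing
  fail:
    msg: >
      lvm2 is not installed on this host. ceph-volume OSDs will not survive a
      reboot because the /dev/<vg>/<lv> symlinks are never recreated; install
      lvm2 and re-run the playbook.
  when: lvs_check.rc != 0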
Getting the LVM symlinks created is non-trivial: LVM itself relies on udev rules that work together with the lvmetad daemon. I am not sure how any system would go about creating the links without being able to inspect the logical volumes (since LVM itself is not installed). I wasn't aware that ceph-ansible does not install anything on a host for containerized deployments, but it is clear that there is a gap here. Why wouldn't ceph-ansible try to install required dependencies like LVM when they are missing? It already does this in other playbooks.
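The alternative would look roughly like the following: have ceph-ansible install the dependency itself for lvm-based deployments, containerized or not. This is only a sketch of the idea, not an existing task; containerized_deployment and osd_scenario are the group_vars already shown above:

- name: ensure lvm2 is installed on OSD hosts
  package:
    name: lvm2
    state: present
  when:
    - osd_scenario == 'lvm'
    - containerized_deployment | bool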
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2538