Created attachment 1481164 [details]
ceph-install-workflow.log

Description of problem:
Ceph upgrade fails, with all of the OSDs of a node failing after the ceph-osd rpm has been removed. The OSD docker containers didn't start during the upgrade, and without the rpm and its repository, recovery is impossible. It seems that the ceph docker image was downloaded successfully to the node.

As a workaround (a shell sketch is included under Additional info below):
1) enable the ceph-osd repository
2) install the ceph-osd rpm
3) restart the OSDs with the command:
   ceph-osd -f -i {OSD_ID} --osd-data /var/lib/ceph/osd/ceph-{OSD_ID} --osd-journal /var/lib/ceph/osd/ceph-{OSD_ID}/journal

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.rc10.el7cp.noarch
openstack-tripleo-common-containers-8.6.3-10.el7ost.noarch
python-tripleoclient-9.2.3-4.el7ost.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-8.0.4-20.el7ost.noarch
openstack-tripleo-validations-8.4.2-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.1-1.el7ost.noarch
openstack-tripleo-common-8.6.3-10.el7ost.noarch
puppet-tripleo-8.3.4-5.el7ost.noarch

How reproducible:
40% of the time

Steps to Reproduce:
1. Deploy RHOSP 10 with Ceph 2.x, with multiple OSDs on each ceph storage node
2. Run FFU with local

Actual results:
The FFU fails

Expected results:
The ceph cluster should be upgraded with all of its OSDs running

Additional info:
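For reference, a minimal sketch of the workaround as shell commands. The repo id and the OSD id below are hypothetical examples for illustration and will differ per environment:

# Hypothetical repo id; use whichever repo provides ceph-osd in your environment.
subscription-manager repos --enable=rhel-7-server-rhceph-2-osd-rpms
yum install -y ceph-osd
# Restart each failed OSD in the foreground (osd.13 shown as an example id).
ceph-osd -f -i 13 \
    --osd-data /var/lib/ceph/osd/ceph-13 \
    --osd-journal /var/lib/ceph/osd/ceph-13/journal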
Created attachment 1481165 [details] sos report from the node with the failed OSDs
It seems that during the run of switch-from-non-containerized-to-containerized-ceph-daemons.yml [1], it migrated ceph-1 from non-containerized OSDs to containerized OSDs. That task must have succeeded, as it then proceeded to the next part of the FFU, which is to remove the packages and repository for the OSDs, as designed.

It seems that some of the OSDs then failed, but instead of getting the containerized OSDs working, some half-rollback steps were taken to get them running on baremetal again, including re-enabling the repository and reinstalling the OSD package. Unfortunately, this seems to have resulted in some OSDs getting corrupted [2], because the containerized OSDs were still trying to manage the disk (note that /etc/systemd/system/ceph-osd\@.service has Restart=always) while baremetal OSDs were started pointing at the same disk.

This is going to prevent figuring out why some of the OSDs failed after the upgrade in the current environment. Thus, I request that you reproduce this again, take no rollback steps, and instead ping me so we can continue debugging why some of the OSDs failed. (A sketch of a safer way to quiesce the competing OSD processes follows the log excerpt below.)

[1] https://github.com/ceph/ceph-ansible/blob/v3.1.0rc10/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml

[2] # /usr/share/ceph-osd-run.sh vde
...
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD uuid is 28d91121-afd0-4fe9-925a-d92a2ddbc920
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD id is 13
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: Marking with init system none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: ceph osd.13 data dir is ready at /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: move_mount: Moving mount to final location...
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdf1 /var/lib/ceph/osd/ceph-13
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38 /entrypoint.sh: SUCCESS
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: exec: PID 162851: spawning /usr/bin/ceph-osd --cluster ceph -f -i 13 --setuser ceph --setgroup disk
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: starting osd.13 at - osd_data /var/lib/ceph/osd/ceph-13 /var/lib/ceph/osd/ceph-13/journal
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: -23> 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: -21> 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: -1> 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 0> 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 0> 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: docker_exec.sh: line 56: 162851 Aborted "$@"
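Given the Restart=always race described above, the safer first step on a reproducer is to quiesce both writers so only one ceph-osd process can touch the disk, and only then inspect the store. A minimal sketch; the systemd unit name (ceph-osd@vdf for the device backing osd.13) is an assumption, and using ceph-objectstore-tool to pull the osdmap epoch from the assert is one suggested check, so verify the real names with systemctl list-units and docker ps first:

# Stop and disable the containerized unit first; with Restart=always it would
# otherwise keep respawning a container that remounts and writes to the disk.
systemctl disable --now ceph-osd@vdf        # unit name is an assumption
docker ps --filter name=ceph-osd            # verify no OSD container is left

# With the disk quiesced, try to extract the osdmap epoch the OSD reported
# as "got 0 bytes" to confirm whether it is really missing or corrupt.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13 \
    --journal-path /var/lib/ceph/osd/ceph-13/journal \
    --op get-osdmap --epoch 225 --file /tmp/osdmap.225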
I haven't managed to reproduce it yet.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.