Bug 1625778 - [FFU] - Ceph upgrade fails to bring OSDs back during ceph upgrade
Summary: [FFU] - Ceph upgrade fails to bring OSDs back during ceph upgrade
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: z2
Target Release: 3.2
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1578730
 
Reported: 2018-09-05 20:45 UTC by Yogev Rabl
Modified: 2023-09-15 00:11 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-01 08:35:38 UTC
Embargoed:


Attachments (Terms of Use)
ceph-install-workflow.log (10.60 MB, text/plain)
2018-09-05 20:45 UTC, Yogev Rabl
sos report from the node with the failed OSDs (15.17 MB, application/x-xz)
2018-09-05 20:46 UTC, Yogev Rabl

Description Yogev Rabl 2018-09-05 20:45:22 UTC
Created attachment 1481164 [details]
ceph-install-workflow.log

Description of problem:
Ceph upgrade fails, with all of the OSDs on a node failing after the ceph-osd rpm has been removed.
The OSD docker containers didn't start during the upgrade, and without the rpm and its repository, recovery is impossible.

It seems that the ceph docker image was downloaded successfully to the node.

As a workaround:
1) re-enable the ceph-osd repository
2) install the ceph-osd rpm
3) restart the OSDs with the command:
 ceph-osd -f -i {OSD_ID} --osd-data /var/lib/ceph/osd/ceph-{OSD_ID} --osd-journal /var/lib/ceph/osd/ceph-{OSD_ID}/journal
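The workaround steps above can be sketched as a shell snippet. This is a hypothetical illustration only: the repository name and OSD id are assumptions, not values taken from this environment, and the repo/install commands are left commented out so the sketch is safe to run anywhere.

```shell
# Hypothetical sketch of the workaround above. The repo name and OSD id are
# illustrative assumptions; substitute values from your own environment.
OSD_ID=13  # example id of a failed OSD

# 1) re-enable the OSD repository (repo name assumed; adjust to your subscription)
# subscription-manager repos --enable=rhel-7-server-rhceph-2-osd-rpms

# 2) reinstall the ceph-osd rpm
# yum install -y ceph-osd

# 3) build the foreground ceph-osd invocation from the report and print it
build_osd_cmd() {
  local id="$1"
  printf 'ceph-osd -f -i %s --osd-data /var/lib/ceph/osd/ceph-%s --osd-journal /var/lib/ceph/osd/ceph-%s/journal\n' \
    "$id" "$id" "$id"
}
build_osd_cmd "$OSD_ID"
```

Printing the command before running it makes it easy to double-check the data and journal paths against the actual mount points on the node.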

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.rc10.el7cp.noarch
openstack-tripleo-common-containers-8.6.3-10.el7ost.noarch
python-tripleoclient-9.2.3-4.el7ost.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-8.0.4-20.el7ost.noarch
openstack-tripleo-validations-8.4.2-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.1-1.el7ost.noarch
openstack-tripleo-common-8.6.3-10.el7ost.noarch
puppet-tripleo-8.3.4-5.el7ost.noarch

How reproducible:
40% of the time

Steps to Reproduce:
1. Deploy RHOSP 10 with Ceph 2.x, with multiple OSDs on each ceph storage
2. Run FFU with local 

Actual results:
The FFU fails

Expected results:
The ceph cluster should be upgraded with all of its osds running

Additional info:

Comment 1 Yogev Rabl 2018-09-05 20:46:21 UTC
Created attachment 1481165 [details]
sos report from the node with the failed OSDs

Comment 3 John Fulton 2018-09-06 14:11:01 UTC
It seems that during the run of switch-from-non-containerized-to-containerized-ceph-daemons.yml [1], the playbook migrated ceph-1 from non-containerized OSDs to containerized OSDs. That task must have succeeded, as the FFU then proceeded to its next part, removing the packages and repository for the OSDs as designed.

It seems that some of the OSDs then failed, but instead of getting the containerized OSDs working, partial rollback steps were taken to get them running on baremetal again, including re-enabling the repository and reinstalling the OSD package. Unfortunately, this seems to have corrupted some OSDs [2], because the containerized OSDs were still trying to manage the disks (note that /etc/systemd/system/ceph-osd\@.service has Restart=always) while baremetal OSDs were started against the same disks. This will prevent us from figuring out why some of the OSDs failed after the upgrade in the current environment. I therefore request that you reproduce this again, take no rollback steps, and instead ping me so we can continue debugging why some of the OSDs failed.
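Given the Restart=always behavior described above, a cautious first step before any manual recovery would be to confirm the unit's restart policy and stop/disable the containerized unit so systemd cannot respawn it against a disk a baremetal OSD is about to use. The helper below is a hypothetical illustration, not part of any Ceph or ceph-ansible tooling; it parses a unit fragment from stdin rather than touching the live system.

```shell
# Hypothetical helper: print the Restart= policy of a systemd unit file read
# on stdin. If it prints "always", stop and disable the containerized unit
# before starting a baremetal OSD on the same disk, e.g.:
#   systemctl stop 'ceph-osd@13.service' && systemctl disable 'ceph-osd@13.service'
restart_policy() {
  awk -F= '/^Restart=/ { print $2 }'
}

# Example with an inline fragment mirroring the unit cited above:
printf '[Service]\nRestart=always\n' | restart_policy
```

On a real node you would feed it the actual unit, e.g. `restart_policy < /etc/systemd/system/ceph-osd@.service`.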

[1] https://github.com/ceph/ceph-ansible/blob/v3.1.0rc10/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml

[2] 
# /usr/share/ceph-osd-run.sh vde 
...
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD uuid is 28d91121-afd0-4fe9-925a-d92a2ddbc920
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD id is 13
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: Marking with init system none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: ceph osd.13 data dir is ready at /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: move_mount: Moving mount to final location...
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdf1 /var/lib/ceph/osd/ceph-13
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38  /entrypoint.sh: SUCCESS
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: exec: PID 162851: spawning /usr/bin/ceph-osd --cluster ceph -f -i 13 --setuser ceph --setgroup disk
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: starting osd.13 at - osd_data /var/lib/ceph/osd/ceph-13 /var/lib/ceph/osd/ceph-13/journal
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:    -23> 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:    -21> 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:     -1> 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:      0> 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:      0> 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: docker_exec.sh: line 56: 162851 Aborted                 "$@"

Comment 8 Yogev Rabl 2018-10-01 14:23:53 UTC
I haven't managed to reproduce it yet.

Comment 12 Red Hat Bugzilla 2023-09-15 00:11:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

