Bug 1625778

Summary: [FFU] - Ceph upgrade fails to bring OSDs back during ceph upgrade
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Ansible
Version: 3.1
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: high
Reporter: Yogev Rabl <yrabl>
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Yogev Rabl <yrabl>
CC: anharris, aschoen, ceph-eng-bugs, gfidente, gmeno, johfulto, mburns, nthomas, pasik, pgrist, sankarshan, sostapov, tserlin, yrabl
Target Milestone: z2
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2019-03-01 08:35:38 UTC
Bug Blocks: 1578730
Attachments:
  ceph-install-workflow.log
  sos report from the node with the failed OSDs

Description Yogev Rabl 2018-09-05 20:45:22 UTC
Created attachment 1481164 [details]
ceph-install-workflow.log

Description of problem:
The Ceph upgrade fails, with all of the OSDs of a node failing after the ceph-osd rpm has been removed.
The OSD docker containers did not start during the upgrade, and without the rpm and its repository, recovery is impossible.

It seems that the ceph docker image was downloaded to the node successfully.
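
(A quick check, not from the report, to confirm the image is present and whether any OSD containers were created at all; the name filter assumes ceph-ansible's default container naming:)

# Was the Ceph container image pulled to this node?
docker images | grep -i ceph
# Were any OSD containers created/started at all?
docker ps -a --filter "name=ceph-osd"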

As a workaround:
1) re-enable the ceph-osd repository
2) install the ceph-osd rpm
3) restart the OSDs with the following command (a consolidated sketch of these steps follows):
 ceph-osd -f -i {OSD_ID} --osd-data /var/lib/ceph/osd/ceph-{OSD_ID} --osd-journal /var/lib/ceph/osd/ceph-{OSD_ID}/journal
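
For convenience, a rough shell sketch of the whole workaround (not an official procedure; the repository ID below is an assumption, substitute the OSD repository this node was actually subscribed to):

# Hedged sketch of the workaround steps above; the repo ID is an assumption.
subscription-manager repos --enable=rhel-7-server-rhceph-2-osd-rpms
yum install -y ceph-osd
# Then restart each failed OSD on the node:
ceph-osd -f -i {OSD_ID} --osd-data /var/lib/ceph/osd/ceph-{OSD_ID} --osd-journal /var/lib/ceph/osd/ceph-{OSD_ID}/journal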

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.rc10.el7cp.noarch
openstack-tripleo-common-containers-8.6.3-10.el7ost.noarch
python-tripleoclient-9.2.3-4.el7ost.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-8.0.4-20.el7ost.noarch
openstack-tripleo-validations-8.4.2-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.1-1.el7ost.noarch
openstack-tripleo-common-8.6.3-10.el7ost.noarch
puppet-tripleo-8.3.4-5.el7ost.noarch

How reproducible:
40% of the time

Steps to Reproduce:
1. Deploy RHOSP 10 with Ceph 2.x, with multiple OSDs on each Ceph Storage node
2. Run FFU with local 

Actual results:
The FFU fails

Expected results:
The Ceph cluster should be upgraded with all of its OSDs running

Additional info:

Comment 1 Yogev Rabl 2018-09-05 20:46:21 UTC
Created attachment 1481165 [details]
sos report from the node with the failed OSDs

Comment 3 John Fulton 2018-09-06 14:11:01 UTC
It seems that during the run of switch-from-non-containerized-to-containerized-ceph-daemons.yml [1], ceph-1 was migrated from non-containerized OSDs to containerized OSDs. That task must have succeeded, as the FFU then proceeded to its next part, which is to remove the packages and repository for the OSDs, as designed.

It seems then that some of the OSDs failed, but instead of getting the containerized OSDs working, some half-rollback steps were taken to get them running on baremetal again, including re-enabling the repository and reinstalling the OSD package. Unfortunately, this seems to have corrupted some OSDs [2], because the container OSDs were still trying to manage the disks (note that /etc/systemd/system/ceph-osd\@.service has Restart=always) while baremetal OSDs were started pointing at the same disks. This will prevent us from figuring out why some of the OSDs failed after the upgrade in the current environment. I therefore ask that you reproduce this again, take no rollback steps, and instead ping me so we can continue debugging why some of the OSDs failed. (Some quick checks for the dual-ownership condition follow the log excerpt below.)

[1] https://github.com/ceph/ceph-ansible/blob/v3.1.0rc10/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml

[2] 
# /usr/share/ceph-osd-run.sh vde 
...
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD uuid is 28d91121-afd0-4fe9-925a-d92a2ddbc920
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: activate: OSD id is 13
Sep 06 13:37:37 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: Marking with init system none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.LFLMuq/none
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: activate: ceph osd.13 data dir is ready at /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: move_mount: Moving mount to final location...
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdf1 /var/lib/ceph/osd/ceph-13
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.LFLMuq
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38  /entrypoint.sh: SUCCESS
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: exec: PID 162851: spawning /usr/bin/ceph-osd --cluster ceph -f -i 13 --setuser ceph --setgroup disk
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: starting osd.13 at - osd_data /var/lib/ceph/osd/ceph-13 /var/lib/ceph/osd/ceph-13/journal
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:    -23> 2018-09-06 13:37:38.467612 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:    -21> 2018-09-06 13:37:38.467640 7f9cad371d80 -1 journal do_read_entry(70213632): bad header magic
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:     -1> 2018-09-06 13:37:38.483071 7f9cad371d80 -1 osd.13 0 failed to load OSD map for epoch 225, got 0 bytes
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:      0> 2018-09-06 13:37:38.485234 7f9cad371d80 -1 /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9cad371d80 time 2018-09-06 13:37:38.483105
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: /builddir/build/BUILD/ceph-12.2.4/src/osd/OSD.h: 976: FAILED assert(ret)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55cde81d72d0]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:      0> 2018-09-06 13:37:38.488076 7f9cad371d80 -1 *** Caught signal (Aborted) **
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  in thread 7f9cad371d80 thread_name:ceph-osd
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  ceph version 12.2.4-42.el7cp (f73642baacccbf2a3c254d1fb5f0317b933b28cf) luminous (stable)
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  1: (()+0xa3c941) [0x55cde8198941]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  2: (()+0xf680) [0x7f9caaa39680]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  3: (gsignal()+0x37) [0x7f9ca9a5a207]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  4: (abort()+0x148) [0x7f9ca9a5b8f8]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55cde81d7444]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  6: (OSDService::get_map(unsigned int)+0x3d) [0x55cde7cbdc7d]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  7: (OSD::init()+0x2072) [0x55cde7c6efb2]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  8: (main()+0x2d07) [0x55cde7b73427]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  9: (__libc_start_main()+0xf5) [0x7f9ca9a463d5]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  10: (()+0x4b5ae3) [0x55cde7c11ae3]
Sep 06 13:37:38 ceph-1 dockerd-current[2059]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: 
Sep 06 13:37:38 ceph-1 dockerd-current[2059]: docker_exec.sh: line 56: 162851 Aborted                 "$@"
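
For anyone who hits a node in this state: before attempting any manual recovery, confirm that only one OSD process owns each disk. A rough sketch of what to check (generic commands, not from the report; the unit and container names assume ceph-ansible's defaults on this node):

# Is the container wrapper unit set to respawn? (Restart=always means it
# will keep fighting a baremetal OSD for the same disk.)
grep Restart= /etc/systemd/system/ceph-osd@.service
# Which OSD containers are currently running?
docker ps --filter "name=ceph-osd"
# Is a baremetal ceph-osd also running against the same device?
ps -ef | grep '[c]eph-osd'
# Stop the container side first so it cannot respawn onto the disk, e.g.:
systemctl stop ceph-osd@vde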

Comment 8 Yogev Rabl 2018-10-01 14:23:53 UTC
I haven't managed to reproduce it yet.

Comment 12 Red Hat Bugzilla 2023-09-15 00:11:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.