Red Hat Bugzilla – Bug 1472409
ceph: not all OSDs are up when ceph node starts
Last modified: 2018-06-26 19:45:38 EDT
ceph: not all OSDs are up when a ceph node is rebooted during a major upgrade.

Environment:
python-cephfs-10.2.7-28.el7cp.x86_64
ceph-osd-10.2.7-28.el7cp.x86_64
ceph-common-10.2.7-28.el7cp.x86_64
ceph-selinux-10.2.7-28.el7cp.x86_64
puppet-ceph-2.3.0-5.el7ost.noarch
ceph-mon-10.2.7-28.el7cp.x86_64
libcephfs1-10.2.7-28.el7cp.x86_64
ceph-base-10.2.7-28.el7cp.x86_64
ceph-radosgw-10.2.7-28.el7cp.x86_64
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-21.el7ost.noarch
instack-undercloud-5.3.0-1.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch

Steps to reproduce:
1. Follow the procedure to upgrade OSP9 to OSP10 and reach the following stage:
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Ceph
2. Reboot a ceph node, log in to it after the reboot, and check the ceph status.

Result:

[root@overcloud-cephstorage-1 ~]# ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            823 pgs degraded
            823 pgs stuck degraded
            823 pgs stuck unclean
            823 pgs stuck undersized
            823 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e227: 24 osds: 21 up, 24 in; 823 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20481: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1341 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1417 active+clean
                 823 active+undersized+degraded

[root@overcloud-cephstorage-1 ~]# systemctl|grep -i fail
● ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
● ceph-disk@dev-sdb3.service loaded failed failed Ceph disk activation: /dev/sdb3
● ceph-disk@dev-sdb4.service loaded failed failed Ceph disk activation: /dev/sdb4
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4
● ceph-disk@dev-sdd1.service loaded failed failed Ceph disk activation: /dev/sdd1
● ceph-disk@dev-sde1.service loaded failed failed Ceph disk activation: /dev/sde1
● ceph-disk@dev-sdf1.service loaded failed failed Ceph disk activation: /dev/sdf1
● ceph-disk@dev-sdh1.service loaded failed failed Ceph disk activation: /dev/sdh1
● ceph-disk@dev-sdj1.service loaded failed failed Ceph disk activation: /dev/sdj1
● ceph-disk@dev-sdk1.service loaded failed failed Ceph disk activation: /dev/sdk1
● ceph-osd@14.service loaded failed failed Ceph object storage daemon
● ceph-osd@17.service loaded failed failed Ceph object storage daemon
● ceph-osd@22.service loaded failed failed Ceph object storage daemon

[root@overcloud-cephstorage-1 ~]# journalctl -u ceph-disk@dev-sdb2.service
-- Logs begin at Mon 2017-07-17 17:08:21 UTC, end at Tue 2017-07-18 16:16:54 UTC. --
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Starting Ceph disk activation: /dev/sdb2...
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdb2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', fu
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/init --version
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command_check_call: Running command: /usr/bin/chown ceph:ceph /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/blkid -o udev -p /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/blkid -o udev -p /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: main_trigger: trigger /dev/sdb2 parttype 45b0969e-9b03-4f30-b4c6-b4b80ceff106 uuid 461c3e2f-ccf0-43c8-9e2e-9d218ab2f66c
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/ceph-disk --verbose activate-journal /dev/sdb2
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: ceph-disk@dev-sdb2.service: main process exited, code=exited, status=124/n/a
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Failed to start Ceph disk activation: /dev/sdb2.
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Unit ceph-disk@dev-sdb2.service entered failed state.
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: ceph-disk@dev-sdb2.service failed.

Workaround:

Running:

systemctl start ceph-disk@dev-sdb2.service
systemctl start ceph-disk@dev-sdb3.service
systemctl start ceph-disk@dev-sdb4.service
systemctl start ceph-disk@dev-sdc2.service
systemctl start ceph-disk@dev-sdc4.service
systemctl start ceph-disk@dev-sdd1.service
systemctl start ceph-disk@dev-sde1.service
systemctl start ceph-disk@dev-sdf1.service
systemctl start ceph-disk@dev-sdj1.service
systemctl start ceph-disk@dev-sdk1.service
systemctl start ceph-disk@dev-sdh1.service

resolved the situation:

[root@overcloud-cephstorage-1 ~]# ceph status
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e236: 24 osds: 24 up, 24 in
            flags noout,norebalance,require_jewel_osds
      pgmap v20518: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1353 MB used, 22331 GB / 22333 GB avail
                2240 active+clean
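For reference, the per-device commands above can also be scripted. This is only a sketch based on the failed-unit listing above (it assumes the unit name appears in the second column of the piped systemctl output, as it does in the paste):

# Restart every failed ceph-disk activation unit, then re-check the cluster.
for unit in $(systemctl --failed --no-legend | awk '/ceph-disk@/ {print $2}'); do
    systemctl start "$unit"
done
ceph -s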
The issue reproduced on all 3 ceph nodes. Exactly 3 OSDs were down after rebooting each node ("3/24 in osds are down"):

[heat-admin@overcloud-cephstorage-0 ~]$ sudo ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            808 pgs degraded
            808 pgs stuck degraded
            808 pgs stuck unclean
            808 pgs stuck undersized
            808 pgs undersized
            recovery 11/57 objects degraded (19.298%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e201: 24 osds: 21 up, 24 in; 808 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20349: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1273 MB used, 22331 GB / 22333 GB avail
            11/57 objects degraded (19.298%)
                1432 active+clean
                 808 active+undersized+degraded

[root@overcloud-cephstorage-1 ~]# ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            823 pgs degraded
            823 pgs stuck degraded
            823 pgs stuck unclean
            823 pgs stuck undersized
            823 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e227: 24 osds: 21 up, 24 in; 823 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20481: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1341 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1417 active+clean
                 823 active+undersized+degraded

[heat-admin@overcloud-cephstorage-2 ~]$ sudo ceph status
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            844 pgs degraded
            844 pgs stuck degraded
            844 pgs stuck unclean
            844 pgs stuck undersized
            844 pgs undersized
            recovery 10/57 objects degraded (17.544%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e253: 24 osds: 21 up, 24 in; 844 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20615: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1361 MB used, 22331 GB / 22333 GB avail
            10/57 objects degraded (17.544%)
                1396 active+clean
                 844 active+undersized+degraded
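For completeness, a quick generic check (not part of the original comment) to list exactly which OSD ids are down:

# Drop the grep to see the down OSDs grouped by host in the CRUSH tree.
ceph osd tree | grep -w down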
It could be that after a while the osds come up.
Sasha, are you proposing that we wait, then check the status, then run systemctl start ceph-disk@... on the disks that are not up yet, then check the results of these, and then complete?
Hi Arkady,

I was hoping that the OSDs would come up if we waited longer (something I thought I saw on one machine), but while trying to prove that, I verified that they don't (waited for more than 1 hour):

[root@overcloud-cephstorage-0 ~]# uptime
 22:13:49 up 1:06, 1 user, load average: 0.03, 0.03, 0.05
[root@overcloud-cephstorage-0 ~]# ceph -s
    cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
     health HEALTH_WARN
            612 pgs degraded
            612 pgs stuck degraded
            612 pgs stuck unclean
            612 pgs stuck undersized
            612 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            2/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
            election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
     osdmap e208: 24 osds: 22 up, 24 in; 612 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v13321: 2368 pgs, 6 pools, 45659 kB data, 19 objects
            1313 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1756 active+clean
                 612 active+undersized+degraded

So then I ran:

[root@overcloud-cephstorage-0 ~]# for i in `systemctl|awk '/ceph-disk/ {print $2}'`; do echo $i; systemctl start $i; done
ceph-disk@dev-sdb1.service
ceph-disk@dev-sdb2.service
ceph-disk@dev-sdb3.service
ceph-disk@dev-sdc1.service
ceph-disk@dev-sdc4.service
ceph-disk@dev-sdd1.service
ceph-disk@dev-sdf1.service
ceph-disk@dev-sdg1.service
ceph-disk@dev-sdh1.service
ceph-disk@dev-sdj1.service
ceph-disk@dev-sdk1.service

Checking the status again - all OSDs are up:

[root@overcloud-cephstorage-0 ~]# ceph -s
    cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
     health HEALTH_WARN
            65 pgs peering
            65 pgs stuck unclean
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
            election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
     osdmap e214: 24 osds: 24 up, 24 in
            flags noout,norebalance,require_jewel_osds
      pgmap v13335: 2368 pgs, 6 pools, 45659 kB data, 19 objects
            1320 MB used, 22331 GB / 22333 GB avail
                2303 active+clean
                  65 peering

So comment #3 can be disregarded.
Dup of: https://bugzilla.redhat.com/show_bug.cgi?id=1457231

Not a puppet-ceph bug. Unfortunately, as Alfredo mentioned, this is well known. It is taken care of in ceph-disk, so I suspect we can close this one and leave it to Ceph itself.
Ian may have already tracked this down, but it appears to have been fixed in (not before) 2.3 per https://github.com/ceph/ceph/pull/12147/files. @Loic, is there any plan to backport this into 1.3.X?
I don't know that there are plans to do that.
The problem has already been reproduced twice this week with a regular deployment of OSP11 (RH7-RHOS-11.0 2017-08-22.2).
FYI - ceph --version on a ceph node shows "ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)"
Federico, can you escalate it? Thanks
Hi Wayne,

Engineering believes the fix is likely: http://tracker.ceph.com/issues/18007. That is merged upstream but not yet available downstream.

A manual fix [1] until it is available downstream is to set the following variable in systemd/ceph-disk@.service (the default is 300):

Environment=CEPH_DISK_TIMEOUT=10000

[1] https://github.com/ceph/ceph/pull/17133/files

Can you give it a try?

Sean
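For anyone looking for the exact file: the packaged unit ships as /usr/lib/systemd/system/ceph-disk@.service, and one standard way to apply the override without editing that file is a systemd drop-in. This is only a sketch, assuming the downstream ExecStart honors $CEPH_DISK_TIMEOUT as described above:

# Create a drop-in that raises the activation timeout for all ceph-disk@ instances.
mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat > /etc/systemd/system/ceph-disk@.service.d/timeout.conf <<'EOF'
[Service]
Environment=CEPH_DISK_TIMEOUT=10000
EOF
# Reload systemd so the drop-in takes effect on the next activation.
systemctl daemon-reload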
Hi Loic,

Can you please elaborate on the workaround? I have the one from comment 16, and they came back with the following:

"I want to try out your suggestion, but the instructions and the links you provide are not specific enough. I don't know where the file(s) I should change reside. Can you point to more specifics?"

Thanks,
Sean
Loic, Sean, I was able to test this simple work-around (having found the target files) and it appears to work fine in a single-node reboot scenario. I am testing a reboot-all-ceph-nodes (ipmi-soft) scenario now. Will let you know. Seems hopeful.
Re: #16 - Rebooting all ceph nodes at once with this work-around installed also resulted in all OSDs successfully returning to "up" status.
Just as an FYI, this also occurs on OSP10 using unlocked bits.
Loic, could we get a backport of the fix?
@tserlin this is done at 5e20864e136ea532431b05de24f0e78f59b63c41
Do we have a patch for RHEL?
Loic, I was going to see if there is any tuning guidance we should add to the documents with respect to disk count or sizes and how they might interact with an appropriate timeout value.
@Mike I think the timeout does not need tuning; it is large enough.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2903
Hi, we have set the timeout as noted in Comment 16 and rebooted the Ceph nodes, but there is still one OSD down on each storage node. I'm running an OSP9 upgraded to OSP10, and the ceph version is ceph version 10.2.7-48.el7cp (cf7751bcd460c757e596d3ee2991884e13c37b96).
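A couple of generic checks that might help narrow this down (a sketch, not part of the original comment; the ceph-disk@dev-sdb2 instance is just an example device):

# Confirm the timeout override is actually in effect for the activation units.
systemctl show ceph-disk@dev-sdb2.service -p Environment
# Identify which OSD id is still down, then look at that device's activation log
# for another "status=124" (timeout) exit.
ceph osd tree | grep -w down
journalctl -u ceph-disk@dev-sdb2.service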