Bug 1472409
| Summary: | ceph: not all OSDs are up when ceph node starts | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Alexander Chuzhoy <sasha> |
| Component: | Ceph-Disk | Assignee: | Loic Dachary <ldachary> |
| Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | urgent | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | unspecified | | |
| Version: | 1.3.3 | CC: | arkady_kanevsky, audra_cooper, bniver, dcain, federico, gael_rehault, gfidente, goneri, icolle, jdurgin, johfulto, John_walsh, kdreyer, kurt_hey, ldachary, lhh, mburns, morazi, nlevine, Paul_Dardeau, rajini.karthik, randy_perryman, sasha, seb, smerrow, srevivo, tserlin, vakulkar |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | 2.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-10.2.7-37.el7cp; Ubuntu: ceph_10.2.7-36redhat1 | Doc Type: | Bug Fix |
| Doc Text: | **OSDs now wait up to three hours for other OSDs to complete their initialization sequence.** At boot time, an OSD daemon could fail to start if it waited more than five minutes for another OSD to complete its initialization sequence. As a consequence, such OSDs had to be started manually. With this update, OSDs wait for up to three hours. As a result, OSDs no longer fail to start when the initialization sequence of other OSDs takes too long. | Story Points: | --- |
| Clone Of: | | | |
| | 1532775 (view as bug list) | Environment: | |
| Last Closed: | 2017-10-17 18:12:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1335596, 1356451, 1473436, 1479701 | | |
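A side note on the numbers in the doc text (hedged, not part of the original report): the five-minute window matches the 300-second default of the CEPH_DISK_TIMEOUT variable discussed in comment 16, and three hours is 10800 seconds; the exact value shipped in the fixed packages is not stated in this bug.

```sh
# Hedged arithmetic only; the actual shipped timeout value is not given in this bug.
echo $((5 * 60))       # 300   seconds -> the old five-minute wait
echo $((3 * 60 * 60))  # 10800 seconds -> the new three-hour wait
```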
Description
Alexander Chuzhoy
2017-07-18 16:19:49 UTC
The issue reproduced on all 3 Ceph nodes; exactly 3 OSDs were down after rebooting each node:
3/24 in osds are down
[heat-admin@overcloud-cephstorage-0 ~]$ sudo ceph -s
cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
health HEALTH_WARN
808 pgs degraded
808 pgs stuck degraded
808 pgs stuck unclean
808 pgs stuck undersized
808 pgs undersized
recovery 11/57 objects degraded (19.298%)
3/24 in osds are down
noout,norebalance flag(s) set
monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
osdmap e201: 24 osds: 21 up, 24 in; 808 remapped pgs
flags noout,norebalance,require_jewel_osds
pgmap v20349: 2240 pgs, 6 pools, 45659 kB data, 19 objects
1273 MB used, 22331 GB / 22333 GB avail
11/57 objects degraded (19.298%)
1432 active+clean
808 active+undersized+degraded
[root@overcloud-cephstorage-1 ~]# ceph -s
cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
health HEALTH_WARN
823 pgs degraded
823 pgs stuck degraded
823 pgs stuck unclean
823 pgs stuck undersized
823 pgs undersized
recovery 6/57 objects degraded (10.526%)
3/24 in osds are down
noout,norebalance flag(s) set
monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
osdmap e227: 24 osds: 21 up, 24 in; 823 remapped pgs
flags noout,norebalance,require_jewel_osds
pgmap v20481: 2240 pgs, 6 pools, 45659 kB data, 19 objects
1341 MB used, 22331 GB / 22333 GB avail
6/57 objects degraded (10.526%)
1417 active+clean
823 active+undersized+degraded
[heat-admin@overcloud-cephstorage-2 ~]$ sudo ceph status
cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
health HEALTH_WARN
844 pgs degraded
844 pgs stuck degraded
844 pgs stuck unclean
844 pgs stuck undersized
844 pgs undersized
recovery 10/57 objects degraded (17.544%)
3/24 in osds are down
noout,norebalance flag(s) set
monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
osdmap e253: 24 osds: 21 up, 24 in; 844 remapped pgs
flags noout,norebalance,require_jewel_osds
pgmap v20615: 2240 pgs, 6 pools, 45659 kB data, 19 objects
1361 MB used, 22331 GB / 22333 GB avail
10/57 objects degraded (17.544%)
1396 active+clean
844 active+undersized+degraded
It could be that after a while the OSDs come up.

Sasha, are you proposing that we wait, check the status, run systemctl start on the ceph-disk units that are not up yet, check the results, and then finish?

Hi Arkady,
I was hoping that the OSDs would come up if we waited longer (something I thought I saw on one machine), but when I tried to prove that, I verified that they don't (I waited for more than 1 hour):
[root@overcloud-cephstorage-0 ~]# uptime
22:13:49 up 1:06, 1 user, load average: 0.03, 0.03, 0.05
[root@overcloud-cephstorage-0 ~]# ceph -s
cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
health HEALTH_WARN
612 pgs degraded
612 pgs stuck degraded
612 pgs stuck unclean
612 pgs stuck undersized
612 pgs undersized
recovery 6/57 objects degraded (10.526%)
2/24 in osds are down
noout,norebalance flag(s) set
monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
osdmap e208: 24 osds: 22 up, 24 in; 612 remapped pgs
flags noout,norebalance,require_jewel_osds
pgmap v13321: 2368 pgs, 6 pools, 45659 kB data, 19 objects
1313 MB used, 22331 GB / 22333 GB avail
6/57 objects degraded (10.526%)
1756 active+clean
612 active+undersized+degraded
So then I ran:
[root@overcloud-cephstorage-0 ~]# for i in `systemctl|awk '/ceph-disk/ {print $2}'`; do echo $i; systemctl start $i; done
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
ceph-disk
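For readability, the same workaround can be written as the short script below. This is a sketch based on the loop above, not a transcript from the bug; it assumes the instantiated units follow the ceph-disk@<device>.service naming of the ceph-disk@.service template referenced later in this bug.

```sh
# Start every instantiated ceph-disk trigger unit, then re-check the OSD state.
# Assumption: unit names match the ceph-disk@.service template
# (e.g. ceph-disk@dev-sdb1.service); the exact instance names are not shown above.
for unit in $(systemctl list-units --all --no-legend | grep -o 'ceph-disk@[^[:space:]]*'); do
    echo "starting $unit"
    sudo systemctl start "$unit"
done

# All OSDs should then report as up, as in the status output below.
sudo ceph osd stat
```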
Checking the status again, all OSDs are up:
[root@overcloud-cephstorage-0 ~]# ceph -s
cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
health HEALTH_WARN
65 pgs peering
65 pgs stuck unclean
noout,norebalance flag(s) set
monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
osdmap e214: 24 osds: 24 up, 24 in
flags noout,norebalance,require_jewel_osds
pgmap v13335: 2368 pgs, 6 pools, 45659 kB data, 19 objects
1320 MB used, 22331 GB / 22333 GB avail
2303 active+clean
65 peering
So comment #3 can be disregarded.
Dup of: https://bugzilla.redhat.com/show_bug.cgi?id=1457231

Not a puppet-ceph bug. Unfortunately, as Alfredo mentioned, this is well known. This is taken care of in ceph-disk, so I suspect we can close this and leave it in Ceph itself.

Ian may have already tracked this down, but it appears to have been fixed in (not before) 2.3 per https://github.com/ceph/ceph/pull/12147/files. @Loic, is there any plan to backport this into 1.3.X?

I don't know that there are plans to do that.

The problem has already been reproduced twice this week with a regular deployment of OSP11 (RH7-RHOS-11.0 2017-08-22.2). FYI, ceph --version on a ceph node shows "ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)". Federico, can you escalate it? Thanks

Hi Wayne,
Engineering believes the fix is likely http://tracker.ceph.com/issues/18007. That is merged upstream but not yet available downstream. A manual fix [1], until it is available downstream, is to set the following variable in systemd/ceph-disk@.service (the default is 300):
Environment=CEPH_DISK_TIMEOUT=10000
[1] https://github.com/ceph/ceph/pull/17133/files
Can you give it a try?
Sean

Hi Loic,
Can you please elaborate on the workaround? I shared the one from comment 16 and they came back with the following: "I want to try out your suggestion, but the instructions and the links you provide are not specific enough. I don't know where the file(s) I should change reside. Can you point to more specifics?"
Thanks,
Sean

Loic, Sean, I was able to test this simple workaround (having found the target files) and it appears to work fine in a single-node reboot scenario. I am testing a reboot-all-ceph-nodes (ipmi-soft) scenario now. Will let you know. Seems hopeful.

Re: #16 - Rebooting all ceph nodes at once with this workaround installed also resulted in all OSDs successfully returning to "up" status.

Just as an FYI, this also occurs on OSP10 using unlocked bits.

Loic, could we get a backport of the fix?

@tserlin this is done at 5e20864e136ea532431b05de24f0e78f59b63c41

Do we have a patch for RHEL?

Loic, I was going to see if there is any tuning guidance we should add to the documents with regard to disk counts or sizes and how they might interact with an appropriate timeout value.

@Mike I think the timeout does not need tuning; it is large enough.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:2903

Hi, we have set the timeout as noted in comment 16 and rebooted the Ceph nodes, but there is still one OSD down on each storage node. I'm running an OSP9 upgraded to OSP10, and the ceph version is ceph version 10.2.7-48.el7cp (cf7751bcd460c757e596d3ee2991884e13c37b96).
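For anyone applying the workaround from comment 16, below is a minimal sketch of setting the variable through a systemd drop-in rather than editing the packaged unit file. The drop-in path and file name are assumptions; only the Environment line itself comes from this bug.

```sh
# Create a drop-in override for the ceph-disk template unit (path and file name assumed).
sudo mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat << 'EOF' | sudo tee /etc/systemd/system/ceph-disk@.service.d/timeout.conf
[Service]
Environment=CEPH_DISK_TIMEOUT=10000
EOF

# Reload systemd so the override is picked up on the next boot.
sudo systemctl daemon-reload
```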