Bug 1472409 - ceph: not all OSDs are up when ceph node starts
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Disk
Version: 1.3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.4
Assigned To: Loic Dachary
QA Contact: ceph-qe-bugs
Docs Contact: Bara Ancincova
Keywords: ZStream
Depends On:
Blocks: 1335596 1356451 1473436 1479701
Reported: 2017-07-18 12:19 EDT by Alexander Chuzhoy
Modified: 2017-10-18 14:13 EDT
CC List: 29 users

See Also:
Fixed In Version: RHEL: ceph-10.2.7-37.el7cp Ubuntu: ceph_10.2.7-36redhat1
Doc Type: Bug Fix
Doc Text:
.OSDs now wait up to three hours for other OSDs to complete their initialization sequences

At boot time, an OSD daemon could fail to start when it had to wait more than five minutes for another OSD to complete its initialization sequence. As a consequence, such OSDs had to be started manually. With this update, OSDs wait up to three hours. As a result, OSDs no longer fail to start when the initialization sequence of other OSDs takes too long.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-17 14:12:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker                    Tracker ID  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker   18007       None      None    None     2017-08-28 10:52 EDT
Ceph Project Bug Tracker   20229       None      None    None     2017-09-13 15:59 EDT

Description Alexander Chuzhoy 2017-07-18 12:19:49 EDT
ceph: not all OSDs are up when ceph node is rebooted during major upgrade.

Environment:
python-cephfs-10.2.7-28.el7cp.x86_64
ceph-osd-10.2.7-28.el7cp.x86_64
ceph-common-10.2.7-28.el7cp.x86_64
ceph-selinux-10.2.7-28.el7cp.x86_64
puppet-ceph-2.3.0-5.el7ost.noarch
ceph-mon-10.2.7-28.el7cp.x86_64
libcephfs1-10.2.7-28.el7cp.x86_64
ceph-base-10.2.7-28.el7cp.x86_64
ceph-radosgw-10.2.7-28.el7cp.x86_64

openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-21.el7ost.noarch
instack-undercloud-5.3.0-1.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch

Steps to reproduce:

1. Follow the procedure to upgrade OSP9 to OSP10 and reach the following stage:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Ceph

2. Reboot a Ceph node, log in to it after the reboot, and check the Ceph status.

Result:
[root@overcloud-cephstorage-1 ~]# ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            823 pgs degraded
            823 pgs stuck degraded
            823 pgs stuck unclean
            823 pgs stuck undersized
            823 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e227: 24 osds: 21 up, 24 in; 823 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20481: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1341 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1417 active+clean
                 823 active+undersized+degraded


[root@overcloud-cephstorage-1 ~]# systemctl|grep -i fail
● ceph-disk@dev-sdb2.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdb2
● ceph-disk@dev-sdb3.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdb3
● ceph-disk@dev-sdb4.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdb4
● ceph-disk@dev-sdc2.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc4.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdc4
● ceph-disk@dev-sdd1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdd1
● ceph-disk@dev-sde1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sde1
● ceph-disk@dev-sdf1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdf1
● ceph-disk@dev-sdh1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdh1
● ceph-disk@dev-sdj1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdj1
● ceph-disk@dev-sdk1.service                                                                         loaded failed failed    Ceph disk activation: /dev/sdk1
● ceph-osd@14.service                                                                                loaded failed failed    Ceph object storage daemon
● ceph-osd@17.service                                                                                loaded failed failed    Ceph object storage daemon
● ceph-osd@22.service                                                                                loaded failed failed    Ceph object storage daemon

[root@overcloud-cephstorage-1 ~]# journalctl -u ceph-disk@dev-sdb2.service
-- Logs begin at Mon 2017-07-17 17:08:21 UTC, end at Tue 2017-07-18 16:16:54 UTC. --
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Starting Ceph disk activation: /dev/sdb2...
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdb2', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', fu
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/init --version
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command_check_call: Running command: /usr/bin/chown ceph:ceph /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/blkid -o udev -p /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/blkid -o udev -p /dev/sdb2
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: main_trigger: trigger /dev/sdb2 parttype 45b0969e-9b03-4f30-b4c6-b4b80ceff106 uuid 461c3e2f-ccf0-43c8-9e2e-9d218ab2f66c
Jul 18 15:46:53 overcloud-cephstorage-1.fv1dci.org sh[1511]: command: Running command: /usr/sbin/ceph-disk --verbose activate-journal /dev/sdb2
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: ceph-disk@dev-sdb2.service: main process exited, code=exited, status=124/n/a
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Failed to start Ceph disk activation: /dev/sdb2.
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: Unit ceph-disk@dev-sdb2.service entered failed state.
Jul 18 15:48:53 overcloud-cephstorage-1.fv1dci.org systemd[1]: ceph-disk@dev-sdb2.service failed.
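
Note on the status=124 above: that is the exit code timeout(1) returns when the command it wraps exceeds its time limit, and the ceph-disk@ activation unit runs the ceph-disk trigger under timeout(1) (the CEPH_DISK_TIMEOUT discussed later in this bug). In other words, the activation is being killed for taking too long rather than failing on an error. A quick generic illustration, not from this system:

     timeout 2 sleep 10; echo $?    # prints 124 once the 2-second limit expires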




Workaround:
Running the following (an equivalent loop over the failed units is sketched after the list):
     systemctl start ceph-disk@dev-sdb2.service
     systemctl start ceph-disk@dev-sdb3.service
     systemctl start ceph-disk@dev-sdb4.service
     systemctl start ceph-disk@dev-sdc2.service
     systemctl start ceph-disk@dev-sdc4.service
     systemctl start ceph-disk@dev-sdd1.service
     systemctl start ceph-disk@dev-sde1.service
     systemctl start ceph-disk@dev-sdf1.service
     systemctl start ceph-disk@dev-sdj1.service
     systemctl start ceph-disk@dev-sdk1.service
     systemctl start ceph-disk@dev-sdh1.service
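
The same workaround as a loop over whatever ceph-disk units failed on a given node (a sketch; the explicit per-device list above is what was actually run, and device names differ per node):

     # start every ceph-disk activation unit currently in the failed state
     for unit in $(systemctl --state=failed | awk '/ceph-disk@/ {print $2}'); do
         systemctl start "$unit"
     done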

This resolved the situation:
[root@overcloud-cephstorage-1 ~]# ceph status
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e236: 24 osds: 24 up, 24 in
            flags noout,norebalance,require_jewel_osds
      pgmap v20518: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1353 MB used, 22331 GB / 22333 GB avail
                2240 active+clean
Comment 1 Alexander Chuzhoy 2017-07-18 12:39:15 EDT
The issue was reproduced on all 3 Ceph nodes.

Exactly 3 OSDs were down after rebooting each node:

3/24 in osds are down


[heat-admin@overcloud-cephstorage-0 ~]$ sudo ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            808 pgs degraded
            808 pgs stuck degraded
            808 pgs stuck unclean
            808 pgs stuck undersized
            808 pgs undersized
            recovery 11/57 objects degraded (19.298%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e201: 24 osds: 21 up, 24 in; 808 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20349: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1273 MB used, 22331 GB / 22333 GB avail
            11/57 objects degraded (19.298%)
                1432 active+clean
                 808 active+undersized+degraded


[root@overcloud-cephstorage-1 ~]# ceph -s
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            823 pgs degraded
            823 pgs stuck degraded
            823 pgs stuck unclean
            823 pgs stuck undersized
            823 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e227: 24 osds: 21 up, 24 in; 823 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20481: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1341 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1417 active+clean
                 823 active+undersized+degraded




[heat-admin@overcloud-cephstorage-2 ~]$ sudo ceph status
    cluster 1289fdf6-6b11-11e7-b06e-5254002376d6
     health HEALTH_WARN
            844 pgs degraded
            844 pgs stuck degraded
            844 pgs stuck unclean
            844 pgs stuck undersized
            844 pgs undersized
            recovery 10/57 objects degraded (17.544%)
            3/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.122:6789/0}
            election epoch 32, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e253: 24 osds: 21 up, 24 in; 844 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v20615: 2240 pgs, 6 pools, 45659 kB data, 19 objects
            1361 MB used, 22331 GB / 22333 GB avail
            10/57 objects degraded (17.544%)
                1396 active+clean
                 844 active+undersized+degraded
Comment 3 Alexander Chuzhoy 2017-07-18 14:17:18 EDT
It could be that the OSDs come up after a while.
Comment 4 arkady kanevsky 2017-07-19 09:44:13 EDT
Sasha,
Are you proposing that we wait, check the status, run systemctl start ceph-disk@... on the disks that are not up yet, check the results of those, and then complete?
Comment 5 Alexander Chuzhoy 2017-07-20 18:20:02 EDT
Hi Arkady, 
I was hoping that the OSDs would come up if we waited longer (something I thought I saw on one machine), but when trying to prove that, I verified that they don't (I waited for more than 1 hour):

[root@overcloud-cephstorage-0 ~]# uptime
 22:13:49 up  1:06,  1 user,  load average: 0.03, 0.03, 0.05


[root@overcloud-cephstorage-0 ~]# ceph -s
    cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
     health HEALTH_WARN
            612 pgs degraded
            612 pgs stuck degraded
            612 pgs stuck unclean
            612 pgs stuck undersized
            612 pgs undersized
            recovery 6/57 objects degraded (10.526%)
            2/24 in osds are down
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
            election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
     osdmap e208: 24 osds: 22 up, 24 in; 612 remapped pgs
            flags noout,norebalance,require_jewel_osds
      pgmap v13321: 2368 pgs, 6 pools, 45659 kB data, 19 objects
            1313 MB used, 22331 GB / 22333 GB avail
            6/57 objects degraded (10.526%)
                1756 active+clean
                 612 active+undersized+degraded


So then I ran:
[root@overcloud-cephstorage-0 ~]# for i in `systemctl|awk '/ceph-disk/ {print $2}'`; do echo $i; systemctl start $i; done
ceph-disk@dev-sdb1.service
ceph-disk@dev-sdb2.service
ceph-disk@dev-sdb3.service
ceph-disk@dev-sdc1.service
ceph-disk@dev-sdc4.service
ceph-disk@dev-sdd1.service
ceph-disk@dev-sdf1.service
ceph-disk@dev-sdg1.service
ceph-disk@dev-sdh1.service
ceph-disk@dev-sdj1.service
ceph-disk@dev-sdk1.service



Checking the status again, all OSDs are up:
[root@overcloud-cephstorage-0 ~]# ceph -s
    cluster 9d071b3c-6d0d-11e7-91c2-525400141c5e
     health HEALTH_WARN
            65 pgs peering
            65 pgs stuck unclean
            noout,norebalance flag(s) set
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.123:6789/0,overcloud-controller-2=192.168.170.126:6789/0}
            election epoch 34, quorum 0,1,2 overcloud-controller-1,overcloud-controller-2,overcloud-controller-0
     osdmap e214: 24 osds: 24 up, 24 in
            flags noout,norebalance,require_jewel_osds
      pgmap v13335: 2368 pgs, 6 pools, 45659 kB data, 19 objects
            1320 MB used, 22331 GB / 22333 GB avail
                2303 active+clean
                  65 peering


So comment #3 can be disregarded.
Comment 6 seb 2017-08-02 10:56:30 EDT
Dup of: https://bugzilla.redhat.com/show_bug.cgi?id=1457231
Not a puppet-ceph bug.

Unfortunately, as Alfredo mentioned, this is well known.

This is being taken care of in ceph-disk, so I suspect we can close this here and leave it to Ceph itself.
Comment 9 Brett Niver 2017-08-09 11:36:14 EDT
Ian may have already tracked this down, but it appears to have been fixed in (not before) 2.3 per https://github.com/ceph/ceph/pull/12147/files.  @Loic, is there any plan to backport this into 1.3.X?
Comment 10 Loic Dachary 2017-08-10 10:24:51 EDT
I don't know that there are plans to do that.
Comment 11 Gonéri Le Bouder 2017-08-24 13:20:39 EDT
The problem has already been reproduced twice this week with a regular deployment of OSP11 (RH7-RHOS-11.0 2017-08-22.2).
Comment 12 Wayne Allen 2017-08-24 15:00:27 EDT
FYI - 
ceph --version 

on a ceph node shows "ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)"
Comment 15 arkady kanevsky 2017-08-28 15:50:37 EDT
Federico,
can you escalate it?
Thanks
Comment 16 Sean Merrow 2017-08-29 08:49:00 EDT
Hi Wayne

Engineering believes the fix is likely:

http://tracker.ceph.com/issues/18007. 

That is merged upstream but not yet available downstream. A manual fix [1], until it is available downstream, is to set the following variable in systemd/ceph-disk@.service (the default is 300):

Environment=CEPH_DISK_TIMEOUT=10000

[1] https://github.com/ceph/ceph/pull/17133/files

Can you give it a try?

Sean
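
One way to apply that without editing the packaged unit file directly is a systemd drop-in; a sketch, assuming the stock template unit is /usr/lib/systemd/system/ceph-disk@.service and that the installed build honours CEPH_DISK_TIMEOUT as described above:

     # create a drop-in directory for the ceph-disk@ template unit
     mkdir -p /etc/systemd/system/ceph-disk@.service.d
     # the drop-in overrides only the timeout environment variable
     printf '[Service]\nEnvironment=CEPH_DISK_TIMEOUT=10000\n' \
         > /etc/systemd/system/ceph-disk@.service.d/timeout.conf
     # pick up the new configuration
     systemctl daemon-reload

The drop-in applies to every ceph-disk@<device> instance, so it only needs to be created once per node, and the RPM-owned unit file is left untouched.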
Comment 17 Sean Merrow 2017-08-29 17:02:48 EDT
Hi Loic,

Can you please elaborate on the workaround? I have the one in comment 16 and they came back with the following:

"I want to try out your suggestion, but the instructions and the links you provide are not specific enough. I don’t know where the file(s) I should change resides..  Can you point to more specifics?"

Thanks,
Sean
Comment 18 Wayne Allen 2017-08-30 18:11:57 EDT
Loic, Sean,

I was able to test this simple work-around (having found the target files) and it appears to work fine in a single-node reboot scenario.  I am testing a reboot-all-ceph-nodes (ipmi-soft) scenario now.  Will let you know.

Seems hopeful.
Comment 19 Wayne Allen 2017-08-30 18:53:34 EDT
Re: #16 - A reboot of all Ceph nodes at once with this work-around installed also resulted in the OSDs successfully returning to "up" status.
Comment 20 Kurt Hey 2017-09-01 09:49:44 EDT
Just as an FYI, this also occurs on OSP10 using unlocked bits.
Comment 21 Gonéri Le Bouder 2017-09-06 14:50:57 EDT
Loic, could we get a backport of the fix?
Comment 24 Loic Dachary 2017-09-19 12:45:15 EDT
@tserlin this is done at 5e20864e136ea532431b05de24f0e78f59b63c41
Comment 27 arkady kanevsky 2017-09-19 15:26:30 EDT
Do we have a patch for RHEL?
Comment 29 Mike Orazi 2017-09-28 14:32:46 EDT
Loic,

I was going to see if there is any tuning guidance we should add to the documents regarding disk count or sizes and how that might interact with an appropriate timeout value.
Comment 30 Loic Dachary 2017-10-02 09:48:47 EDT
@Mike I think the timeout does not need tuning; it is large enough.
Comment 36 errata-xmlrpc 2017-10-17 14:12:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2903
