rhel-osp-director: 9.0 After minor update (includes rhel7.2->rhel7.3 switch) + reboot of overcloud nodes, ceph OSDs are down.

Environment:
openstack-tripleo-heat-templates-2.0.0-35.el7ost.noarch
instack-undercloud-4.0.0-14.el7ost.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch

Steps to reproduce:
1. Deploy overcloud (rhel7.2).
2. Minor update the setup and reboot all nodes (switched to rhel7.3 with the update).
3. Run against the overcloud:
   glance image-create --name cirros --disk-format qcow2 --container-format bare --file /home/stack/cirros-0.3.3-x86_64-disk.img

Result:
Error finding address for http://10.19.184.180:9292/v1/images: Unable to establish connection to http://10.19.184.180:9292/v1/images

[heat-admin@overcloud-cephstorage-0 ~]$ sudo -i
[root@overcloud-cephstorage-0 ~]# ceph status
    cluster be987e18-713b-11e6-bf5c-5254003ec993
     health HEALTH_WARN
            192 pgs degraded
            192 pgs stale
            192 pgs stuck degraded
            192 pgs stuck stale
            192 pgs stuck unclean
            192 pgs stuck undersized
            192 pgs undersized
            2/2 in osds are down
     monmap e1: 3 mons at {overcloud-controller-0=10.19.95.15:6789/0,overcloud-controller-1=10.19.95.13:6789/0,overcloud-controller-2=10.19.95.11:6789/0}
            election epoch 10, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e16: 2 osds: 0 up, 2 in
      pgmap v86: 192 pgs, 5 pools, 0 bytes data, 0 objects
            9717 MB used, 826 GB / 873 GB avail
                 192 stale+active+undersized+degraded

Note: manually running /etc/init.d/ceph start on the ceph nodes resolves the situation (a sketch of this workaround follows below).

Expected result:
The OSDs should be UP upon reboot.
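For illustration, the workaround from the Note could be applied from the undercloud along these lines; the node names are placeholders for the actual cephstorage hosts in the deployment:

    # From the undercloud, as the stack user; node names are illustrative.
    for node in overcloud-cephstorage-0 overcloud-cephstorage-1; do
        ssh "heat-admin@${node}" "sudo /etc/init.d/ceph start"
    done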
Reproduced.
Hi Alexander, is 'sudo chkconfig --list ceph' reporting ceph as enabled on boot on the ceph storage nodes?
Hi Giulio, yes - it's enabled.

[stack@undercloud72 ~]$ ssh heat-admin.0.8 "sudo chkconfig --list ceph"

Note: This output shows SysV services only and does not include native
      systemd services. SysV configuration data might be overridden by
      native systemd configuration.

      If you want to list systemd services use 'systemctl list-unit-files'.
      To see services enabled on particular target use
      'systemctl list-dependencies [target]'.

ceph            0:off   1:off   2:on    3:on    4:on    5:on    6:off
From my tests, rebooting a cephstorage node after it has been upgraded to RHEL 7.3 is not an issue: the ceph-osd is started on boot and re-joins the cluster as long as the Ceph monitors remain available.

On restart, ceph-osd tries to reach one of the monitors for 1 minute; if it cannot, it terminates itself. My understanding is that if the controller and cephstorage nodes are rebooted at roughly the same time, it is possible that none of the Ceph monitors is available while the Ceph OSDs are attempting to start, causing them to terminate.

Alexander, can you confirm that by rebooting only the cephstorage nodes all the OSDs are brought back up? I will also check whether it is possible and feasible to increase the wait time of the OSDs.
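For reference, a minimal way to confirm the monitors are in quorum before (re)starting the OSDs would be something along these lines, run as root on any controller node (assuming the usual heat-admin login plus sudo):

    ceph quorum_status --format json-pretty   # all three mons should appear in "quorum_names"
    ceph -s                                   # the monmap line should show quorum 0,1,2

Once quorum is confirmed, the cephstorage nodes can be rebooted (or their OSDs started) without hitting the 1 minute startup timeout.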
Might be documentation only - reboot in a particular order (following the same process as the update steps). Sasha will verify, and if it works, please send to doc_text.
This could be hit if all the nodes went down at the same time, in which case, if the Ceph OSDs start before the MONs, they will terminate themselves after a 1 minute timeout.

It should be sufficient to start the ceph-osd systemd units manually after the MONs are available. For example, to restart the osd.0 instance, log in on the node hosting osd.0 and run the following as root:

  systemctl restart ceph-osd@0

As an alternative, the cephstorage nodes can be rebooted after the MONs are available.
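To cover nodes hosting more than one OSD, a small sketch along these lines should work, assuming the default /var/lib/ceph/osd/ceph-<id> data directory layout:

    # Run as root on each cephstorage node once the MONs are reachable.
    for dir in /var/lib/ceph/osd/ceph-*; do
        id="${dir##*-}"                   # numeric OSD id from the directory name
        systemctl restart "ceph-osd@${id}"
    done
    ceph osd tree                         # confirm the OSDs now report as "up"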
(In reply to Jaromir Coufal from comment #10)
> Might be documentation only - reboot in a particular order (following the
> same process as the update steps). Sasha will verify, and if it works,
> please send to doc_text.

(In reply to Giulio Fidente from comment #12)
According to comment #12, in case of a power outage we can manually start the ceph-osd systemd units. This issue is going to be documented. I'm removing the blocker flag and lowering the Priority/Severity.
Yeah, so I can confirm that after seeing this issue, I simply rebooted the ceph osd nodes and upon return all the OSDs were up.