Description of problem:

While testing RHOS survivability after a power crash, I ran into Ceph coming up with a clock skew initially:

[heat-admin@overcloud-cephstorage-0 ~]$ sudo ceph status
    cluster cc5d0c9c-1d28-11e6-be4e-525400ab0cdd
     health HEALTH_WARN
            clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-2
            Monitor clock skew detected
     monmap e1: 3 mons at {overcloud-controller-0=192.168.110.13:6789/0,overcloud-controller-1=192.168.110.11:6789/0,overcloud-controller-2=192.168.110.17:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
     osdmap e15: 3 osds: 3 up, 3 in
      pgmap v49: 160 pgs, 4 pools, 0 bytes data, 0 objects
            11237 MB used, 111 GB / 122 GB avail
                 160 active+clean

After a while, NTP fixes the clock skew, but Ceph shows all OSDs as down:

[heat-admin@overcloud-cephstorage-0 ~]$ sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*clock01.util.ph .CDMA.           1 u   83  128  377   45.450    2.211   1.084

[heat-admin@overcloud-cephstorage-2 ~]$ sudo ceph status
    cluster cc5d0c9c-1d28-11e6-be4e-525400ab0cdd
     health HEALTH_WARN
            160 pgs stale
            160 pgs stuck stale
            3/3 in osds are down
     monmap e1: 3 mons at {overcloud-controller-0=192.168.110.13:6789/0,overcloud-controller-1=192.168.110.11:6789/0,overcloud-controller-2=192.168.110.17:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
     osdmap e16: 3 osds: 0 up, 3 in
      pgmap v50: 160 pgs, 4 pools, 0 bytes data, 0 objects
            11237 MB used, 111 GB / 122 GB avail
                 160 stale+active+clean

Version-Release number of selected component (if applicable):

openstack-swift-2.5.0-2.el7ost.noarch
openstack-manila-share-1.0.1-3.el7ost.noarch
openstack-ceilometer-collector-5.0.2-2.el7ost.noarch
openstack-neutron-common-7.0.1-15.el7ost.noarch
openstack-swift-object-2.5.0-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-dashboard-8.0.1-2.el7ost.noarch
openstack-glance-11.0.1-4.el7ost.noarch
openstack-heat-api-5.0.1-5.el7ost.noarch
openstack-nova-api-12.0.2-5.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.3.8-1.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
openstack-swift-container-2.5.0-2.el7ost.noarch
python-django-openstack-auth-2.0.1-1.2.el7ost.noarch
openstack-neutron-7.0.1-15.el7ost.noarch
openstack-nova-compute-12.0.2-5.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-5.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
openstack-ceilometer-central-5.0.2-2.el7ost.noarch
openstack-swift-proxy-2.5.0-2.el7ost.noarch
openstack-nova-console-12.0.2-5.el7ost.noarch
openstack-nova-novncproxy-12.0.2-5.el7ost.noarch
openstack-neutron-metering-agent-7.0.1-15.el7ost.noarch
openstack-neutron-bigswitch-agent-2015.3.8-1.el7ost.noarch
openstack-selinux-0.6.58-1.el7ost.noarch
openstack-nova-common-12.0.2-5.el7ost.noarch
openstack-ceilometer-common-5.0.2-2.el7ost.noarch
openstack-heat-common-5.0.1-5.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-heat-engine-5.0.1-5.el7ost.noarch
openstack-ceilometer-compute-5.0.2-2.el7ost.noarch
openstack-swift-account-2.5.0-2.el7ost.noarch
openstack-nova-scheduler-12.0.2-5.el7ost.noarch
openstack-manila-1.0.1-3.el7ost.noarch
python-openstackclient-1.7.2-1.el7ost.noarch
openstack-ceilometer-notification-5.0.2-2.el7ost.noarch
openstack-ceilometer-polling-5.0.2-2.el7ost.noarch
openstack-dashboard-theme-8.0.1-2.el7ost.noarch
openstack-cinder-7.0.1-8.el7ost.noarch
openstack-heat-api-cfn-5.0.1-5.el7ost.noarch
openstack-nova-conductor-12.0.2-5.el7ost.noarch
openstack-swift-plugin-swift3-1.9-1.el7ost.noarch
openstack-neutron-ml2-7.0.1-15.el7ost.noarch
openstack-keystone-8.0.1-1.el7ost.noarch
openstack-ceilometer-api-5.0.2-2.el7ost.noarch
openstack-ceilometer-alarm-5.0.2-2.el7ost.noarch
openstack-nova-cert-12.0.2-5.el7ost.noarch
ceph-osd-0.94.5-9.el7cp.x86_64
ceph-common-0.94.5-9.el7cp.x86_64
ceph-0.94.5-9.el7cp.x86_64
ceph-mon-0.94.5-9.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. deploy RHOS 8 with 3x Ceph hosts
2. shut down everything and start it up again
3. monitor ceph status on the Ceph hosts (a simple watch loop for this is sketched after the report)

Actual results:
see above

Expected results:
Ceph should return to HEALTH_OK

Additional info:
Initial state of Ceph before restart was HEALTH_OK
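One convenient way to do step 3 is to poll the cluster state in a loop once the nodes come back. A minimal sketch, assuming passwordless sudo on one of the controller or Ceph nodes:

# poll cluster health and the OSD map every 30 seconds until things settle
watch -n 30 "sudo ceph status | grep -E 'health|osdmap'"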
Moving to Ceph, as this seems completely unrelated to OpenStack.
I don't really understand. Did you also restart the OSD node? Can you attach an OSD log? I need more information here. I think the clock skew is a red herring.
I am running a negative test in a virtual environment: basically, I pull the plug on all the machines running my entire environment, which includes Ceph, wait for all the services to start back up again, and verify that each one of them is healthy and has recovered after the simulated power outage. So far, this is the only warning/error I got out of this particular test. I'm no Ceph expert, but I don't think the OSDs are used by OpenStack. However, before the test the cluster was showing OK, and after... well, you saw the output. If you need a specific log, please let me know where I can find it. Or ping me on IRC ('dyasny' on #rhos-mgt) and I'll let you into the system to take a look directly.
The simplest thing would be for you to reproduce it and then leave the VMs in that state for me to look at.
[stack@instack ~]$ ssh heat-admin@192.0.2.8
Last login: Wed May 18 21:18:16 2016 from 192.0.2.1

[heat-admin@overcloud-cephstorage-2 ~]$ sudo systemctl status 'ceph*'
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/rc.d/init.d/ceph)
   Active: failed (Result: exit-code) since Wed 2016-05-18 20:16:22 UTC; 1 day 23h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1146 ExecStart=/etc/rc.d/init.d/ceph start (code=exited, status=1/FAILURE)

May 18 20:16:09 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:09.523755 7f7a28451700 0 -- :/1002337 >> 192.168.110.13:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...79b0).fault
May 18 20:16:12 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:12.524921 7f7a28350700 0 -- :/1002337 >> 192.168.110.17:6789/0 pipe(0x7f7a18000c00 sd=3 :0 s=1 pgs=0 c...1120).fault
May 18 20:16:15 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:15.525163 7f7a28451700 0 -- :/1002337 >> 192.168.110.11:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...c520).fault
May 18 20:16:18 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:18.525428 7f7a28350700 0 -- :/1002337 >> 192.168.110.17:6789/0 pipe(0x7f7a18000c00 sd=4 :0 s=1 pgs=0 c...13a0).fault
May 18 20:16:21 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:21.526002 7f7a28451700 0 -- :/1002337 >> 192.168.110.11:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...c520).fault
May 18 20:16:21 overcloud-cephstorage-2.localdomain ceph[1146]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd cru...ot=default'
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: ceph.service: control process exited, code=exited status=1
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: Failed to start LSB: Start Ceph distributed file system daemons at boot time.
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: Unit ceph.service entered failed state.
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: ceph.service failed.

Hint: Some lines were ellipsized, use -l to show in full.
sudo systemctl start 'ceph*' works
I guess the init script wraps its ceph calls in a 30-second timeout (the 'timeout 30 /usr/bin/ceph ...' line above) and the OSD startup couldn't reach the mons in that time (which makes sense: the mons were also restarted and probably weren't up yet). I'm not really sure what the right behavior here is.
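One possibility (just a sketch of the idea, not the shipped init script) would be to wait until a mon answers before the OSD startup gives up, along these lines:

# rough illustration: block until the mons are reachable, then start the OSDs
# (assumes the ceph CLI can reach the cluster with the default admin keyring,
#  and that the SysV script from the log above is installed)
until timeout 30 ceph -c /etc/ceph/ceph.conf -s >/dev/null 2>&1; do
    echo "mons not reachable yet, retrying in 30s"
    sleep 30
done
/etc/init.d/ceph start osd

The obvious downside is that an unbounded wait can stall the boot sequence, which is presumably why the script fails fast instead.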
Marking as an RFE and assigning to branto.
Access to the host provided for debugging.
Hi all, we override the systemd defaults in RHCEPH 2, so this should no longer be an issue there -- we restart the Ceph daemons on failure, albeit in a limited fashion (3 times). We did not do anything like that in 1.3, since we did not support systemd yet and used SysV init scripts instead. @Dan: Can you please retest with OSP 10?
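For anyone who wants similar behaviour on an existing install, a systemd drop-in is one way to get it. A sketch only -- the path and limits below are illustrative, not necessarily what RHCEPH 2 ships, and it assumes the ceph-osd@.service units that come with RHCEPH 2:

# give each OSD unit a bounded restart-on-failure policy (illustrative values)
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
sudo tee /etc/systemd/system/ceph-osd@.service.d/override.conf <<'EOF' >/dev/null
[Service]
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=3
EOF
sudo systemctl daemon-reload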
In OSP 10, we are currently seeing https://bugzilla.redhat.com/show_bug.cgi?id=1374465, which stops the MONs from starting, so the OSDs never come up either.
I don't see this issue with OSP 11 and ceph-10.2.7-16. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1497