Bug 1337579 - [RFE] ceph osd init script should retry indefinitely (at a configurable interval) if the mons cannot be reached
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2.3
Assignee: Boris Ranto
QA Contact: shilpa
Docs Contact: Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-19 14:18 UTC by Dan Yasny
Modified: 2017-07-30 15:13 UTC
CC List: 16 users

Fixed In Version: RHEL: ceph-10.2.7-2.el7cp Ubuntu: ceph_10.2.7-3redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-19 13:25:48 UTC
Embargoed:




Links
Red Hat Product Errata RHBA-2017:1497 (Status: SHIPPED_LIVE, Priority: normal): Red Hat Ceph Storage 2.3 bug fix and enhancement update. Last Updated: 2017-06-19 17:24:11 UTC

Description Dan Yasny 2016-05-19 14:18:29 UTC
Description of problem:
While testing RHOS survivability after a power outage, I found that Ceph initially came up with a clock skew warning:
[heat-admin@overcloud-cephstorage-0 ~]$ sudo ceph status
    cluster cc5d0c9c-1d28-11e6-be4e-525400ab0cdd
     health HEALTH_WARN
            clock skew detected on mon.overcloud-controller-0, mon.overcloud-controller-2
            Monitor clock skew detected
     monmap e1: 3 mons at {overcloud-controller-0=192.168.110.13:6789/0,overcloud-controller-1=192.168.110.11:6789/0,overcloud-controller-2=192.168.110.17:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
     osdmap e15: 3 osds: 3 up, 3 in
      pgmap v49: 160 pgs, 4 pools, 0 bytes data, 0 objects
            11237 MB used, 111 GB / 122 GB avail
                 160 active+clean

After a while, NTP fixes the clock skew, but Ceph shows all OSDs as down:
[heat-admin@overcloud-cephstorage-0 ~]$ sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*clock01.util.ph .CDMA.           1 u   83  128  377   45.450    2.211   1.084

[heat-admin@overcloud-cephstorage-2 ~]$ sudo ceph status
    cluster cc5d0c9c-1d28-11e6-be4e-525400ab0cdd
     health HEALTH_WARN
            160 pgs stale
            160 pgs stuck stale
            3/3 in osds are down
     monmap e1: 3 mons at {overcloud-controller-0=192.168.110.13:6789/0,overcloud-controller-1=192.168.110.11:6789/0,overcloud-controller-2=192.168.110.17:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
     osdmap e16: 3 osds: 0 up, 3 in
      pgmap v50: 160 pgs, 4 pools, 0 bytes data, 0 objects
            11237 MB used, 111 GB / 122 GB avail
                 160 stale+active+clean


Version-Release number of selected component (if applicable):
openstack-swift-2.5.0-2.el7ost.noarch
openstack-manila-share-1.0.1-3.el7ost.noarch
openstack-ceilometer-collector-5.0.2-2.el7ost.noarch
openstack-neutron-common-7.0.1-15.el7ost.noarch
openstack-swift-object-2.5.0-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-dashboard-8.0.1-2.el7ost.noarch
openstack-glance-11.0.1-4.el7ost.noarch
openstack-heat-api-5.0.1-5.el7ost.noarch
openstack-nova-api-12.0.2-5.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.3.8-1.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
openstack-swift-container-2.5.0-2.el7ost.noarch
python-django-openstack-auth-2.0.1-1.2.el7ost.noarch
openstack-neutron-7.0.1-15.el7ost.noarch
openstack-nova-compute-12.0.2-5.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-5.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
openstack-ceilometer-central-5.0.2-2.el7ost.noarch
openstack-swift-proxy-2.5.0-2.el7ost.noarch
openstack-nova-console-12.0.2-5.el7ost.noarch
openstack-nova-novncproxy-12.0.2-5.el7ost.noarch
openstack-neutron-metering-agent-7.0.1-15.el7ost.noarch
openstack-neutron-bigswitch-agent-2015.3.8-1.el7ost.noarch
openstack-selinux-0.6.58-1.el7ost.noarch
openstack-nova-common-12.0.2-5.el7ost.noarch
openstack-ceilometer-common-5.0.2-2.el7ost.noarch
openstack-heat-common-5.0.1-5.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-heat-engine-5.0.1-5.el7ost.noarch
openstack-ceilometer-compute-5.0.2-2.el7ost.noarch
openstack-swift-account-2.5.0-2.el7ost.noarch
openstack-nova-scheduler-12.0.2-5.el7ost.noarch
openstack-manila-1.0.1-3.el7ost.noarch
python-openstackclient-1.7.2-1.el7ost.noarch
openstack-ceilometer-notification-5.0.2-2.el7ost.noarch
openstack-ceilometer-polling-5.0.2-2.el7ost.noarch
openstack-dashboard-theme-8.0.1-2.el7ost.noarch
openstack-cinder-7.0.1-8.el7ost.noarch
openstack-heat-api-cfn-5.0.1-5.el7ost.noarch
openstack-nova-conductor-12.0.2-5.el7ost.noarch
openstack-swift-plugin-swift3-1.9-1.el7ost.noarch
openstack-neutron-ml2-7.0.1-15.el7ost.noarch
openstack-keystone-8.0.1-1.el7ost.noarch
openstack-ceilometer-api-5.0.2-2.el7ost.noarch
openstack-ceilometer-alarm-5.0.2-2.el7ost.noarch
openstack-nova-cert-12.0.2-5.el7ost.noarch
ceph-osd-0.94.5-9.el7cp.x86_64
ceph-common-0.94.5-9.el7cp.x86_64
ceph-0.94.5-9.el7cp.x86_64
ceph-mon-0.94.5-9.el7cp.x86_64


How reproducible:
Always

Steps to Reproduce:
1. deploy RHOS 8 with 3x ceph hosts
2. shut everything down and start it back up
3. monitor ceph status on the ceph hosts

Actual results:
see above

Expected results:
ceph should return to HEALTH_OK

Additional info:

Initial state of ceph before restart was HEALTH_OK

Comment 2 Mike Burns 2016-05-19 14:26:38 UTC
Moving to Ceph, as this seems completely unrelated to OpenStack.

Comment 5 Samuel Just 2016-05-19 16:36:37 UTC
I don't really understand.  Did you also restart the OSD node?  Can you attach an OSD log?  I need more information here.  I think the clock skew is a red herring.

Comment 6 Dan Yasny 2016-05-19 20:39:50 UTC
I am running a negative test in a virtual environment: I pull the plug on all the machines running my entire environment, which includes Ceph, wait for all the services to come back up, and verify that each one is healthy and has recovered from the simulated power outage. So far, this is the only warning/error I have seen in this particular test.
I'm no Ceph expert, but I don't think the OSDs are used directly by OpenStack. However, before the test the cluster was reporting OK, and afterwards... well, you saw the output.

If you need a specific log, please let me know where I can find it, or ping me on IRC ('dyasny' on #rhos-mgt) and I'll let you into the system to take a look directly.

Comment 7 Samuel Just 2016-05-20 14:54:03 UTC
The simplest thing would be for you to reproduce it and then leave the vms in that state for me to look at.

Comment 8 Samuel Just 2016-05-20 19:27:52 UTC
[stack@instack ~]$ ssh heat-admin.2.8
Last login: Wed May 18 21:18:16 2016 from 192.0.2.1
[heat-admin@overcloud-cephstorage-2 ~]$ sudo systemctl status 'ceph*'
● ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/rc.d/init.d/ceph)
   Active: failed (Result: exit-code) since Wed 2016-05-18 20:16:22 UTC; 1 day 23h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1146 ExecStart=/etc/rc.d/init.d/ceph start (code=exited, status=1/FAILURE)

May 18 20:16:09 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:09.523755 7f7a28451700  0 -- :/1002337 >> 192.168.110.13:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...79b0).fault
May 18 20:16:12 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:12.524921 7f7a28350700  0 -- :/1002337 >> 192.168.110.17:6789/0 pipe(0x7f7a18000c00 sd=3 :0 s=1 pgs=0 c...1120).fault
May 18 20:16:15 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:15.525163 7f7a28451700  0 -- :/1002337 >> 192.168.110.11:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...c520).fault
May 18 20:16:18 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:18.525428 7f7a28350700  0 -- :/1002337 >> 192.168.110.17:6789/0 pipe(0x7f7a18000c00 sd=4 :0 s=1 pgs=0 c...13a0).fault
May 18 20:16:21 overcloud-cephstorage-2.localdomain ceph[1146]: 2016-05-18 20:16:21.526002 7f7a28451700  0 -- :/1002337 >> 192.168.110.11:6789/0 pipe(0x7f7a18008280 sd=3 :0 s=1 pgs=0 c...c520).fault
May 18 20:16:21 overcloud-cephstorage-2.localdomain ceph[1146]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd cru...ot=default'
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: ceph.service: control process exited, code=exited status=1
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: Failed to start LSB: Start Ceph distributed file system daemons at boot time.
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: Unit ceph.service entered failed state.
May 18 20:16:22 overcloud-cephstorage-2.localdomain systemd[1]: ceph.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Comment 9 Samuel Just 2016-05-20 19:28:31 UTC
sudo systemctl start 'ceph*'

works

Comment 10 Samuel Just 2016-05-20 19:30:18 UTC
I guess the init script has a 30s timeout around the mon-dependent command, and the OSD script couldn't reach the mons in that time (which makes sense; the mons were also restarted and probably weren't up yet). I'm not really sure what the right behavior here is.
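For illustration, the behaviour this RFE asks for could look roughly like the following in a SysV-style init script: retry the mon-dependent step indefinitely at a configurable interval instead of failing after a single 30-second attempt. This is only a sketch; OSD_MON_RETRY_INTERVAL and the osd.1 weight/location arguments are hypothetical placeholders, not existing ceph or sysconfig options.

# Sketch only: retry the mon-dependent CRUSH registration until the mons
# respond. OSD_MON_RETRY_INTERVAL, the weight (1.0) and the CRUSH location
# are illustrative placeholders.
interval="${OSD_MON_RETRY_INTERVAL:-30}"
until timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf \
        --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring \
        osd crush create-or-move -- 1 1.0 host="$(hostname -s)" root=default
do
    echo "osd.1: mons not reachable yet, retrying in ${interval}s" >&2
    sleep "$interval"
done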

Comment 11 Samuel Just 2016-05-20 19:40:12 UTC
Marking as an RFE and assigning to branto.

Comment 12 Dan Yasny 2016-05-20 19:51:12 UTC
access to host provided for debugging

Comment 19 Boris Ranto 2016-09-22 09:26:29 UTC
Hi all,

We override the systemd defaults in RHCEPH 2, so this should no longer be an issue there -- we restart the ceph daemons on failure, albeit in a limited fashion (3 times). We did not do anything like that in 1.3, since we did not support systemd yet -- 1.3 used SysV init scripts.
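For reference, a limited restart-on-failure policy of that kind can be expressed as a systemd drop-in. The sketch below is illustrative only (hypothetical file name and example values); the units actually shipped in RHCEPH 2 may use different settings.

# Hypothetical drop-in for the OSD unit template; values are examples only.
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
sudo tee /etc/systemd/system/ceph-osd@.service.d/restart.conf <<'EOF'
[Service]
Restart=on-failure
RestartSec=30s
StartLimitInterval=30min
StartLimitBurst=3
EOF
sudo systemctl daemon-reload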

@Dan: Can you please retest with OSP 10?

Comment 21 Dan Yasny 2016-09-23 13:36:14 UTC
In OSP10, we are currently seeing https://bugzilla.redhat.com/show_bug.cgi?id=1374465, which prevents the MONs from starting, so the OSDs never come up either.

Comment 29 shilpa 2017-05-25 06:59:15 UTC
I don't see this issue with OSP 11 and ceph-10.2.7-16. Moving to verified.

Comment 32 errata-xmlrpc 2017-06-19 13:25:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1497

