Bug 1391197 - OSD's not starting after a reboot in RHCS 2.0
Summary: OSD's not starting after a reboot in RHCS 2.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Disk
Version: 2.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: rc
: 2.2
Assignee: Loic Dachary
QA Contact: Vasishta
Bara Ancincova
URL:
Whiteboard:
: 1392455 (view as bug list)
Depends On:
Blocks: 1412948
TreeView+ depends on / blocked
 
Reported: 2016-11-02 18:40 UTC by jquinn
Modified: 2017-11-01 20:03 UTC (History)
13 users (show)

Fixed In Version: RHEL: ceph-10.2.5-17.el7cp Ubuntu: ceph_10.2.5-10redhat1xenial
Doc Type: Bug Fix
Doc Text:
.OSD nodes no longer fail to start after a reboot When the `ceph-osd` service was enabled for a given OSD device, a race condition in some cases occurred between the `ceph-osd` and the `ceph-disk` services at boot time. As a consequence, the OSD did not start after a reboot. With this update, the `ceph-disk` utility now calls the `systemctl enable` and `disable` commands with the `--runtime` option so that the `ceph-osd` units are lost after a reboot. As result, OSD nodes start as expected after a reboot.
Clone Of:
Environment:
Last Closed: 2017-03-14 15:46:29 UTC


Attachments (Terms of Use)
sosreport (14.19 MB, application/x-xz)
2016-11-02 18:40 UTC, jquinn
no flags Details
Messages file from a fresh reboot that displays issue (34.34 KB, application/x-gzip)
2016-11-09 20:49 UTC, jquinn
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0514 normal SHIPPED_LIVE Red Hat Ceph Storage 2.2 bug fix and enhancement update 2017-03-21 07:24:26 UTC
Ceph Project Bug Tracker 17889 None None None 2016-11-14 07:36:22 UTC
Red Hat Knowledge Base (Solution) 2774741 None None None 2016-11-21 20:21:24 UTC

Description jquinn 2016-11-02 18:40:40 UTC
Created attachment 1216704 [details]
sosreport

Description of problem:Customer opened a case the seems to be related to bug 1300703.  Customer has Self Encrypted Drives that do not always mount/Start the OSD after a reboot.  See the errors listed below [1] where it's unable to open the OSD superblock.  Customer is able to manually mount the OSD and start the service or many times a second reboot will start the OSD's. 

There is a known issue documented in RHCS 1.3 related to this, but it's no longer documented in 2.0 but appears to still be an issue.  In 1.3 docs there is a workaround listed to add the activate to the , we tried this workaround but then the OSD's would not start no matter how many times the node is rebooted. 

If possible customer would like a temporary workaround until issue is resolved. 

I have the sosreports attached to the case and can place them

[1]
2016-10-28 13:14:09.894599 7f75d3ee5800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:09.894630 7f75d3ee5800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 7444
2016-10-28 13:14:09.894758 7f75d3ee5800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory^[[0m
2016-10-28 13:14:10.383318 7fd105efa800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:10.383422 7fd105efa800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 7713
2016-10-28 13:14:10.383550 7fd105efa800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory^[[0m
2016-10-28 13:14:10.870065 7f0a96b5a800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:10.870096 7f0a96b5a800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8380
2016-10-28 13:14:10.870238 7f0a96b5a800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory^[[0m
2016-10-28 13:14:11.401274 7f6990010800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:11.401310 7f6990010800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8634
2016-10-28 13:14:11.401445 7f6990010800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory^[[0m
2016-10-28 13:14:11.858758 7f922c97d800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:11.858786 7f922c97d800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8894
2016-10-28 13:14:11.858929 7f922c97d800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory^[[0m
2016-10-28 13:14:12.431798 7fdefbcc5800  0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:12.431824 7fdefbcc5800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 9075
2016-10-28 13:14:12.433434 7fdefbcc5800  0 pidfile_write: ignore empty --pid-file
2016-10-28 13:14:12.457641 7fdefbcc5800  0 filestore(/var/lib/ceph/osd/ceph-lc-38) backend xfs (magic 0x58465342)



[2] This is the workaround listed in the 1.3 known issues: 
https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3.2/release-notes/chapter-4-known-issues


Version-Release number of selected component (if applicable):

ceph version 10.2.2-41.el7cp
Kernel 3.10.0-327.36.2.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Loic Dachary 2016-11-03 15:32:42 UTC
The errors in the sosreport ( var/log/messages ) and quoted here are expected. They are often confused with genuine errors although they are transient and only happen while the machine is booting.

It would be most helpful to have the content of var/log/message after a boot where part of the OSDs did not come up. Maybe there is such a log in the var/log/messages archives of the sosreport and I'd be grateful if you could let me know where they are.

Comment 4 Loic Dachary 2016-11-04 08:33:55 UTC
@Brad I found the sosreport, thanks. Reading the syslog reports I only see parts where all OSD are booted as expected. But there are many reboots and it is entirely possible that I'm missing the one that shows the osd did not come up. It would be most helpful to know which one I should look at.

Comment 5 jquinn 2016-11-07 22:00:51 UTC
Hello, 

Thanks for the feedback.  The reason for not including the times tamps was because I thought the errors I referenced were what was causing the OSD's not to mount and they were easy to locate.  I have asked customer for another set of logs while re-producing the issue so that we will have a timestamp to reference. I will review the logs and update the bug with my findings.  

Thanks, 
Joe

Comment 6 jquinn 2016-11-09 20:49:43 UTC
Created attachment 1219060 [details]
Messages file from a fresh reboot that displays issue

Comment 7 jquinn 2016-11-09 20:54:50 UTC
@Loic, 

The messages file I attached displays the issue of not mounting/starting the OSD's after a reboot.  Please note that the customer runs the sosreport command after logging in and it mounts the OSD and starts the service, I verified this in the lab as well. Up until the sosreport command is run the OSD's are not mounted.  Below is an outline of what I see, and the same can be found in many of the messages logs from the sosreport as well, the one attached is from another server that displays the same symptoms.   



Nov  4 12:42:28 <<<<<<<<<<<-----------Boot time


*** Trying to start the Ceph osd daemons** 

Nov  4 12:42:52 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...


Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Collectd statistics daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started NRPE.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Enable periodic update of entitlement certificates..
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Dynamic System Tuning Daemon.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613903 7fe03f9d1700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-5/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613912 7fe03f9d1700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613914 7fe03f9d1700  0 librados: osd.5 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614783 7f90c43e6700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-4/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614791 7f90c43e6700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614792 7f90c43e6700  0 librados: osd.4 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615367 7f648a740700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-1/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615375 7f648a740700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615377 7f648a740700  0 librados: osd.1 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615879 7f8d99e9c700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-0/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615887 7f8d99e9c700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615888 7f8d99e9c700  0 librados: osd.0 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614529 7fa676661700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-3/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614538 7fa676661700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614539 7fa676661700  0 librados: osd.3 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614834 7f6e2bcaf700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-7/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614851 7f6e2bcaf700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614853 7f6e2bcaf700  0 librados: osd.7 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614752 7f0bfacda700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-6/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614760 7f0bfacda700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614761 7f0bfacda700  0 librados: osd.6 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614868 7fd8c35bb700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-2/keyring: (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614878 7fd8c35bb700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614881 7fd8c35bb700  0 librados: osd.2 initialization error (2) No such file or directory
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting ceph target allowing to start/stop all ceph-osd@.service instances at once.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting ceph target allowing to start/stop all ceph*@.service instances at once.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.879697 7fc502892800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-5: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@5.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@5.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@5.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908618 7f107a08d800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-0: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@0.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@0.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@0.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908835 7fbc2fadf800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-6: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@6.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@6.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@6.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908836 7f6bda18c800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-3: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@3.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@3.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@3.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.892622 7f8a87ca1800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-1: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908789 7f6142f83800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-4: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@1.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@1.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@1.service failed.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@4.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@4.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@4.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908783 7f9d8e549800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-2: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@2.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@2.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@2.service failed.
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.958293 7f02718b8800 -1 #033[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-7: (2) No such file or directory#033[0m
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@7.service: main process exited, code=exited, status=1/FAILURE
Nov  4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd@7.service entered failed state.
Nov  4 12:42:53 lcpcephosd1n1 systemd: ceph-osd@7.service failed.
Nov  4 12:42:54 lcpcephosd1n1 systemd: Started Postfix Mail Transport Agent.
Nov  4 12:42:54 lcpcephosd1n1 systemd: ceph-osd@3.service holdoff time over, scheduling restart.
Nov  4 12:42:54 lcpcephosd1n1 systemd: ceph-osd@0.service holdoff time over, scheduling restart.
Nov  4 12:42:54 lcpcephosd1n1 systemd: ceph-osd@5.service holdoff time over, scheduling restart.
Nov  4 12:42:54 lcpcephosd1n1 systemd: ceph-osd@6.service holdoff time over, scheduling restart.




*** The system continued to retry the OSD start with the same pattern of messages from above being repeated*** 

Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@7.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@7.service failed.
Nov  4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd@4.service
Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@4.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@4.service failed.
Nov  4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd@2.service
Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@2.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@2.service failed.
Nov  4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd@1.service
Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@1.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@1.service failed.
Nov  4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd@7.service
Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@7.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@7.service failed.
Nov  4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd@4.service
Nov  4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov  4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd@4.service entered failed state.
Nov  4 12:42:55 lcpcephosd1n1 systemd: ceph-osd@4.service failed.


*** We then see the customer log in *** 


Nov  4 12:42:55 lcpcephosd1n1 sshd[4954]: Accepted publickey for dmorris from 10.1.73.107 port 65316 ssh2: RSA 5f:4f:b5:dd:ea:3b:d9:f9:ea:0f:68:3e:e1:00:51:8e



** The ceph disk activation then begins, this was triggered when he ran the sosreport*** 

Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting system-ceph\x2ddisk.slice.
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb1...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb4...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb2...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb3...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc2...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc1...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc3...
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc4...
Nov  4 12:44:17 lcpcephosd1n1 kernel: sdf: sdf1
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdd1...
Nov  4 12:44:17 lcpcephosd1n1 kernel: sdg: sdg1
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sde1...



Please let me know if I can gather any further information. 

Thanks, 
Joe

Comment 8 Loic Dachary 2016-11-14 07:36:22 UTC
It looks like it could be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=1392455 because /var is also mounted on a lvm volume 

/dev/mapper/rootvg-varlv on /var type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

Comment 9 Loic Dachary 2016-11-14 07:47:46 UTC
Hum, reading the logs once more it looks like this is not even udev racing but systemd trying to activate the OSDs before /var is mounted which would explain the 

/var/lib/ceph/osd/ceph-lc-5/keyring: (2) No such file or directory

Does that sound like a plausible explanation to you ?

Comment 10 jquinn 2016-11-15 15:10:41 UTC
@Loic, 

I agree that it appears to try to activate them before /var is mounted, which may explain why sometimes the OSD's get mounted and start properly when /var is mounted while the retries are still taking place. 

Thanks, 
Joe

Comment 12 Vikhyat Umrao 2016-11-21 17:06:52 UTC
*** Bug 1392455 has been marked as a duplicate of this bug. ***

Comment 17 Loic Dachary 2016-11-23 06:57:45 UTC
I believe https://github.com/ceph/ceph/pull/12147/commits/49f40665e71d45222ee2458ab9c81bb86063fca2 fixes the issue. Could you confirm ?

Comment 18 Vikhyat Umrao 2016-11-23 16:10:51 UTC
(In reply to Loic Dachary from comment #17)
> I believe
> https://github.com/ceph/ceph/pull/12147/commits/
> 49f40665e71d45222ee2458ab9c81bb86063fca2 fixes the issue. Could you confirm ?

Thank you Loic. 

Is it okay to take a backup of:

1. src/ceph-disk/ceph_disk/main.py and then apply the patch:

+    if get_ceph_user() == 'ceph':
+        command_check_call(['chown', 'ceph:ceph', args.dev])

2. systemd/ceph-disk@.service and then apply the patch:

+After=local-fs.target
+Wants=local-fs.target

and then reboot the server and see if it is reproducible ?

Comment 19 Loic Dachary 2016-11-23 21:19:22 UTC
I think you can apply the patches above to a live system to try it, yes. You don't need the main.py patch, I think. Only the ceph-disk@.service patch.

Comment 29 Loic Dachary 2016-11-30 16:27:08 UTC
It would be useful to try to add

RestartSec=5min

to the [Service] section of /lib/systemd/system/ceph-osd@.service as a workaround. Instead of retrying three times with a 100ms pause, it would retry starting the OSD 5 minutes apart, three times. If that helps it confirms ceph-disk@.service races with ceph-osd@.service. It is not a proper fix but it will tell us we're going in the right direction.

Comment 30 Loic Dachary 2016-12-01 07:48:07 UTC
I believe the proper fix for this problem is https://github.com/ceph/ceph/pull/12241/commits/b9534c7811f6aec65f39607e5ee42be32c6861ce and I'm testing it and trying to assert the border cases where it could either fail or introduce a non backward compatible behavior. The test with RestartSec=5min is still very valuable because it is a workaround that should have the same effect.

Comment 41 Loic Dachary 2017-01-09 15:11:37 UTC
It's not merged in jewel yet, targetting 2.3.

Comment 75 Loic Dachary 2017-02-24 15:50:22 UTC
Your addition is fine

Comment 76 Loic Dachary 2017-02-24 16:00:34 UTC
Your addition is fine (with my corporate email ;-)

Comment 81 errata-xmlrpc 2017-03-14 15:46:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html


Note You need to log in before you can comment on or make changes to this bug.