Created attachment 1216704 [details]
sosreport

Description of problem:

Customer opened a case that seems to be related to bug 1300703. The customer has Self-Encrypting Drives that do not always mount/start the OSDs after a reboot. See the errors listed below [1], where ceph-osd is unable to open the OSD superblock. The customer is able to manually mount the OSD and start the service, and many times a second reboot will start the OSDs. There is a known issue documented in RHCS 1.3 related to this; it is no longer documented in 2.0 but appears to still be an issue. In the 1.3 docs there is a workaround listed [2] to add the activate to the ; we tried this workaround, but then the OSDs would not start no matter how many times the node was rebooted. If possible the customer would like a temporary workaround until the issue is resolved. I have the sosreports attached to the case and can place them

[1]
2016-10-28 13:14:09.894599 7f75d3ee5800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:09.894630 7f75d3ee5800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 7444
2016-10-28 13:14:09.894758 7f75d3ee5800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory
2016-10-28 13:14:10.383318 7fd105efa800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:10.383422 7fd105efa800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 7713
2016-10-28 13:14:10.383550 7fd105efa800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory
2016-10-28 13:14:10.870065 7f0a96b5a800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:10.870096 7f0a96b5a800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8380
2016-10-28 13:14:10.870238 7f0a96b5a800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory
2016-10-28 13:14:11.401274 7f6990010800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:11.401310 7f6990010800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8634
2016-10-28 13:14:11.401445 7f6990010800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory
2016-10-28 13:14:11.858758 7f922c97d800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:11.858786 7f922c97d800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 8894
2016-10-28 13:14:11.858929 7f922c97d800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-38: (2) No such file or directory
2016-10-28 13:14:12.431798 7fdefbcc5800 0 set uid:gid to 167:167 (ceph:ceph)
2016-10-28 13:14:12.431824 7fdefbcc5800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 9075
2016-10-28 13:14:12.433434 7fdefbcc5800 0 pidfile_write: ignore empty --pid-file
2016-10-28 13:14:12.457641 7fdefbcc5800 0 filestore(/var/lib/ceph/osd/ceph-lc-38) backend xfs (magic 0x58465342)

[2] This is the workaround listed in the 1.3 known issues:
https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3.2/release-notes/chapter-4-known-issues

Version-Release number of selected component (if applicable):
ceph version 10.2.2-41.el7cp
Kernel 3.10.0-327.36.2.el7.x86_64

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The errors in the sosreport (var/log/messages) quoted here are expected. They are often confused with genuine errors, but they are transient and only happen while the machine is booting. It would be most helpful to have the content of var/log/messages after a boot where some of the OSDs did not come up. Maybe there is such a log in the var/log/messages archives of the sosreport; I'd be grateful if you could let me know where it is.
@Brad I found the sosreport, thanks. Reading the syslog reports I only see parts where all OSDs boot as expected. But there are many reboots, and it is entirely possible that I'm missing the one that shows the OSDs did not come up. It would be most helpful to know which one I should look at.
Hello,

Thanks for the feedback. The reason for not including the timestamps was that I thought the errors I referenced were what was causing the OSDs not to mount, and they were easy to locate. I have asked the customer for another set of logs taken while reproducing the issue, so that we will have a timestamp to reference. I will review the logs and update the bug with my findings.

Thanks,
Joe
Created attachment 1219060 [details] Messages file from a fresh reboot that displays issue
@Loic,

The messages file I attached displays the issue of the OSDs not mounting/starting after a reboot. Please note that the customer runs the sosreport command after logging in, and that mounts the OSDs and starts the service; I verified this in the lab as well. Until the sosreport command is run, the OSDs are not mounted. Below is an outline of what I see; the same can be found in many of the messages logs from the sosreport as well. The one attached is from another server that displays the same symptoms.

Nov 4 12:42:28 <<<<<<<<<<<----------- Boot time

*** Trying to start the Ceph osd daemons ***

Nov 4 12:42:52 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Collectd statistics daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started NRPE.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Enable periodic update of entitlement certificates..
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Dynamic System Tuning Daemon.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613903 7fe03f9d1700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-5/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613912 7fe03f9d1700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.613914 7fe03f9d1700 0 librados: osd.5 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614783 7f90c43e6700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-4/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614791 7f90c43e6700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614792 7f90c43e6700 0 librados: osd.4 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615367 7f648a740700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-1/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615375 7f648a740700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615377 7f648a740700 0 librados: osd.1 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615879 7f8d99e9c700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-0/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615887 7f8d99e9c700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.615888 7f8d99e9c700 0 librados: osd.0 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614529 7fa676661700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-3/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614538 7fa676661700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614539 7fa676661700 0 librados: osd.3 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614834 7f6e2bcaf700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-7/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614851 7f6e2bcaf700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614853 7f6e2bcaf700 0 librados: osd.7 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614752 7f0bfacda700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-6/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614760 7f0bfacda700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614761 7f0bfacda700 0 librados: osd.6 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614868 7fd8c35bb700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-lc-2/keyring: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614878 7fd8c35bb700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: 2016-11-04 12:42:53.614881 7fd8c35bb700 0 librados: osd.2 initialization error (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd-prestart.sh: Error connecting to cluster: ObjectNotFound
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Started Ceph object storage daemon.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting ceph target allowing to start/stop all ceph-osd@.service instances at once.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Nov 4 12:42:53 lcpcephosd1n1 systemd: Starting ceph target allowing to start/stop all ceph*@.service instances at once.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.879697 7fc502892800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-5: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908618 7f107a08d800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-0: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908835 7fbc2fadf800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-6: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908836 7f6bda18c800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-3: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.892622 7f8a87ca1800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-1: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908789 7f6142f83800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-4: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.908783 7f9d8e549800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-2: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:53 lcpcephosd1n1 ceph-osd: 2016-11-04 12:42:53.958293 7f02718b8800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-7: (2) No such file or directory
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd: main process exited, code=exited, status=1/FAILURE
Nov 4 12:42:53 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:53 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:54 lcpcephosd1n1 systemd: Started Postfix Mail Transport Agent.
Nov 4 12:42:54 lcpcephosd1n1 systemd: ceph-osd holdoff time over, scheduling restart.
Nov 4 12:42:54 lcpcephosd1n1 systemd: ceph-osd holdoff time over, scheduling restart.
Nov 4 12:42:54 lcpcephosd1n1 systemd: ceph-osd holdoff time over, scheduling restart.
Nov 4 12:42:54 lcpcephosd1n1 systemd: ceph-osd holdoff time over, scheduling restart.

*** The system continued to retry the OSD start with the same pattern of messages from above being repeated ***

Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd
Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd
Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd
Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd
Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
Nov 4 12:42:55 lcpcephosd1n1 systemd: start request repeated too quickly for ceph-osd
Nov 4 12:42:55 lcpcephosd1n1 systemd: Failed to start Ceph object storage daemon.
Nov 4 12:42:55 lcpcephosd1n1 systemd: Unit ceph-osd entered failed state.
Nov 4 12:42:55 lcpcephosd1n1 systemd: ceph-osd failed.
*** We then see the customer log in ***

Nov 4 12:42:55 lcpcephosd1n1 sshd[4954]: Accepted publickey for dmorris from 10.1.73.107 port 65316 ssh2: RSA 5f:4f:b5:dd:ea:3b:d9:f9:ea:0f:68:3e:e1:00:51:8e

*** The ceph disk activation then begins; this was triggered when he ran the sosreport ***

Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting system-ceph\x2ddisk.slice.
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb1...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb4...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb2...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb3...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc2...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc1...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc3...
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdc4...
Nov 4 12:44:17 lcpcephosd1n1 kernel: sdf: sdf1
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdd1...
Nov 4 12:44:17 lcpcephosd1n1 kernel: sdg: sdg1
Nov 4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sde1...

Please let me know if I can gather any further information.

Thanks,
Joe
It looks like it could be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=1392455 because /var is also mounted on an LVM volume:

/dev/mapper/rootvg-varlv on /var type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
Hmm, reading the logs once more, it looks like this is not even udev racing but systemd trying to activate the OSDs before /var is mounted, which would explain the "/var/lib/ceph/osd/ceph-lc-5/keyring: (2) No such file or directory" errors. Does that sound like a plausible explanation to you?
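A quick way to sanity-check this against the attached messages file is to pull out just the OSD start attempts, the superblock errors, and the later disk-activation events and compare the timestamps. A sketch (the demo-messages file below is a hand-copied stand-in for the real log; on the real node you would grep var/log/messages from the sosreport instead):

```shell
# Stand-in excerpt of the attached messages file so the sketch runs anywhere;
# on the real node, grep /var/log/messages instead of demo-messages.
cat > demo-messages <<'EOF'
Nov  4 12:42:53 lcpcephosd1n1 systemd: Starting Ceph object storage daemon...
Nov  4 12:42:53 lcpcephosd1n1 ceph-osd: ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-lc-5: (2) No such file or directory
Nov  4 12:44:17 lcpcephosd1n1 systemd: Starting Ceph disk activation: /dev/sdb1...
EOF
# Interleave the start attempts with the activation events; if the OSD start
# attempts all appear before the first "Ceph disk activation" line, that is
# consistent with the ordering hypothesis.
grep -nE 'Ceph (object storage daemon|disk activation)|OSD superblock' demo-messages
```

In the attached log the OSD starts at 12:42:53 and the disk activation only begins at 12:44:17, after login, which is what this filter makes visible at a glance.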
@Loic,

I agree that it appears to try to activate them before /var is mounted. That may explain why the OSDs sometimes do get mounted and start properly: /var happens to be mounted while the retries are still taking place.

Thanks,
Joe
*** Bug 1392455 has been marked as a duplicate of this bug. ***
I believe https://github.com/ceph/ceph/pull/12147/commits/49f40665e71d45222ee2458ab9c81bb86063fca2 fixes the issue. Could you confirm?
(In reply to Loic Dachary from comment #17)
> I believe
> https://github.com/ceph/ceph/pull/12147/commits/49f40665e71d45222ee2458ab9c81bb86063fca2
> fixes the issue. Could you confirm ?

Thank you Loic. Is it okay to take a backup of:

1. src/ceph-disk/ceph_disk/main.py and then apply the patch:

+    if get_ceph_user() == 'ceph':
+        command_check_call(['chown', 'ceph:ceph', args.dev])

2. systemd/ceph-disk@.service and then apply the patch:

+After=local-fs.target
+Wants=local-fs.target

and then reboot the server and see if it is reproducible?
I think you can apply the patches above to a live system to try it, yes. You don't need the main.py patch, I think. Only the ceph-disk@.service patch.
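For reference, a rough sketch of applying just the ceph-disk@.service change on a live system. The unit body below is a minimal stand-in so the commands run unprivileged; on a real RHEL 7 node the packaged file would typically be /usr/lib/systemd/system/ceph-disk@.service (and a copy under /etc/systemd/system would survive package updates) — those paths are assumptions here, not taken from this bug:

```shell
# Minimal stand-in for ceph-disk@.service so the sketch runs unprivileged;
# on a real node, operate on the packaged unit file instead.
unit="demo-ceph-disk@.service"
printf '[Unit]\nDescription=Ceph disk activation: %%f\n' > "$unit"
cp "$unit" "$unit.bak"   # keep a backup before editing
# Insert the two lines from the referenced patch right after the [Unit] header:
sed -i -e '/^\[Unit\]/a After=local-fs.target' \
       -e '/^\[Unit\]/a Wants=local-fs.target' "$unit"
cat "$unit"
# On the real system, follow up with: systemctl daemon-reload
```

After=local-fs.target makes systemd delay the unit until local filesystems (including /var on LVM) are mounted, and Wants= pulls that target in, which is exactly the ordering this bug is missing.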
It would be useful to try adding RestartSec=5min to the [Service] section of /lib/systemd/system/ceph-osd@.service as a workaround. Instead of retrying three times with a 100 ms pause, it would retry starting the OSD three times, 5 minutes apart. If that helps, it confirms that ceph-disk@.service races with ceph-osd@.service. It is not a proper fix, but it will tell us we're going in the right direction.
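That workaround can also be carried as a systemd drop-in so the packaged unit file stays untouched. A sketch — the drop-in directory is the standard override location, but it is written under a demo directory here so the commands run unprivileged:

```shell
# Sketch of the RestartSec=5min workaround as a drop-in override.
# Real target directory: /etc/systemd/system/ceph-osd@.service.d/
# (demo-etc used here so the sketch needs no root).
unit_d="demo-etc/systemd/system/ceph-osd@.service.d"
mkdir -p "$unit_d"
cat > "$unit_d/restart-workaround.conf" <<'EOF'
[Service]
RestartSec=5min
EOF
cat "$unit_d/restart-workaround.conf"
# On the real system: systemctl daemon-reload, then reboot to test.
```

A drop-in has the same effect as editing /lib/systemd/system/ceph-osd@.service directly but is not overwritten when the ceph package is updated.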
I believe the proper fix for this problem is https://github.com/ceph/ceph/pull/12241/commits/b9534c7811f6aec65f39607e5ee42be32c6861ce and I'm testing it, trying to identify the corner cases where it could either fail or introduce non-backward-compatible behavior. The test with RestartSec=5min is still very valuable because it is a workaround that should have the same effect.
It's not merged in jewel yet, targeting 2.3.
Your addition is fine
Your addition is fine (with my corporate email ;-)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0514.html