Description of problem:

1) 3-node cluster with 1.3.2
   Node1: mon.a, osd.1, osd.2, osd.3
   Node2: mon.b
   Node3: osd-3, osd-4

2) Upgraded node1 to 2.0. I couldn't restart the services due to bz 1324589, so I just rebooted the node.

3) After the reboot, the services failed to start:

● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdb1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdb2
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdc1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdc2
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdd1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdd2
● ceph-osd    loaded failed failed   Ceph object storage daemon
● ceph-osd    loaded failed failed   Ceph object storage daemon
● ceph-osd    loaded failed failed   Ceph object storage daemon

[root@magna095 ~]# sudo systemctl status ceph-disk
● ceph-disk - Ceph disk activation: /dev/sdb1
   Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-04-15 23:11:55 UTC; 8min ago
 Main PID: 753 (code=exited, status=1/FAILURE)

Apr 15 23:11:55 magna095 sh[753]: main(sys.argv[1:])
Apr 15 23:11:55 magna095 sh[753]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4915, in main
Apr 15 23:11:55 magna095 sh[753]: args.func(args)
Apr 15 23:11:55 magna095 sh[753]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4352, in main_trigger
Apr 15 23:11:55 magna095 sh[753]: raise Error('return code ' + str(ret))
Apr 15 23:11:55 magna095 sh[753]: ceph_disk.main.Error: Error: return code 1
Apr 15 23:11:55 magna095 systemd[1]: ceph-disk: main process exited, code=exited, status=1/FAILURE
Apr 15 23:11:55 magna095 systemd[1]: Failed to start Ceph disk activation: /dev/sdb1.
Apr 15 23:11:55 magna095 systemd[1]: Unit ceph-disk entered failed state.
Apr 15 23:11:55 magna095 systemd[1]: ceph-disk failed.

[root@magna095 ~]# date
Fri Apr 15 23:21:42 UTC 2016

[root@magna095 ~]# sudo systemctl status ceph-disk
● ceph-disk - Ceph disk activation: /dev/sdb2
   Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-04-15 23:08:54 UTC; 12min ago
 Main PID: 728 (code=exited, status=1/FAILURE)

Apr 15 23:08:54 magna095 sh[728]: main(sys.argv[1:])
Apr 15 23:08:54 magna095 sh[728]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4915, in main
Apr 15 23:08:54 magna095 sh[728]: args.func(args)
Apr 15 23:08:54 magna095 sh[728]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4352, in main_trigger
Apr 15 23:08:54 magna095 sh[728]: raise Error('return code ' + str(ret))
Apr 15 23:08:54 magna095 sh[728]: ceph_disk.main.Error: Error: return code 1
Apr 15 23:08:54 magna095 systemd[1]: ceph-disk: main process exited, code=exited, status=1/FAILURE
Apr 15 23:08:54 magna095 systemd[1]: Failed to start Ceph disk activation: /dev/sdb2.
Apr 15 23:08:54 magna095 systemd[1]: Unit ceph-disk entered failed state.
Apr 15 23:08:54 magna095 systemd[1]: ceph-disk failed.
[root@magna095 ~]# sudo systemctl status ceph-osd
● ceph-osd - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Fri 2016-04-15 23:11:55 UTC; 10min ago
  Process: 3986 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=TERM)

Apr 15 23:11:46 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:46.826544 7f0f842c3700 0 -- 10.8.128.95:0/3484574614 >> 10.8.128.95:6789/0 pipe(0x7f0f7400...60).fault
Apr 15 23:11:49 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:49.826601 7f0f84ac4700 1 -- 10.8.128.95:0/3484574614 mark_down 0x7f0f7400e660 -- 0x7f0f7400d3a0
Apr 15 23:11:49 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:49.826656 7f0f84ac4700 1 -- 10.8.128.95:0/3484574614 --> 10.8.128.93:6789/0 -- auth(proto ...f74014de0
Apr 15 23:11:52 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:52.826788 7f0f84ac4700 1 -- 10.8.128.95:0/3484574614 mark_down 0x7f0f74014de0 -- 0x7f0f7400fc40
Apr 15 23:11:52 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:52.826850 7f0f84ac4700 1 -- 10.8.128.95:0/3484574614 --> 10.8.128.95:6789/0 -- auth(proto ...f7400e660
Apr 15 23:11:52 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:52.826912 7f0f841c2700 0 -- 10.8.128.95:0/3484574614 >> 10.8.128.95:6789/0 pipe(0x7f0f7400...60).fault
Apr 15 23:11:55 magna095 systemd[1]: ceph-osd start-pre operation timed out. Terminating.
Apr 15 23:11:55 magna095 systemd[1]: Failed to start Ceph object storage daemon.
Apr 15 23:11:55 magna095 systemd[1]: Unit ceph-osd entered failed state.
Apr 15 23:11:55 magna095 systemd[1]: ceph-osd failed.
Hint: Some lines were ellipsized, use -l to show in full.

Version-Release number of selected component (if applicable):
ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae) and
ceph version 0.94.5-9.el7cp (deef183a81111fa5e128ec88c90a32c9587c615d)

How reproducible:
1/1
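Additional info: for triage, the non-ellipsized status and the boot journal of the failed units can be collected like this (a rough sketch; the ceph-osd@2 instance name is only an example, substitute the instances actually reported on the node):

# units that failed to come up after the reboot
systemctl --failed --no-pager | grep ceph

# full (non-ellipsized) status of one failed unit
systemctl status -l ceph-osd@2

# everything the ceph-disk/ceph-osd units logged since boot
journalctl -b --no-pager | grep -E 'ceph-disk|ceph-osd'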
Ken, feel free to reassign this to the right owner; I don't find an appropriate dropdown for ceph-disk related issues.
tl;dr:

- should start the monitor first:
  * systemctl daemon-reload
  * systemctl start ceph-mon
- should read http://docs.ceph.com/docs/master/release-notes/#upgrading-from-hammer for the instructions on fixing the data ownership during the upgrade

so I guess it's a non-issue?

long story:

  Process: 3986 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=TERM)
> Apr 15 23:11:46 magna095 ceph-osd-prestart.sh[3986]: 2016-04-15 23:11:46.826544 7f0f842c3700 0 -- 10.8.128.95:0/3484574614 >> 10.8.128.95:6789/0 pipe(0x7f0f7400...60).fault

so it failed to connect to 10.8.128.95:6789, which is apparently a mon.

# systemctl list-unit-files --type=service | grep ceph
ceph-create-keys@.service    static
ceph-disk@.service           static
ceph-mds@.service            disabled
ceph-mon@.service            disabled
ceph-osd@.service            enabled
ceph-radosgw@.service        disabled

so the ceph-mon service was disabled.

# systemctl --no-page -t service -a | grep ceph
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdb1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdb2
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdc1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdc2
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdd1
● ceph-disk   loaded failed failed   Ceph disk activation: /dev/sdd2
● ceph-osd    loaded failed failed   Ceph object storage daemon
● ceph-osd    loaded failed failed   Ceph object storage daemon
● ceph-osd    loaded failed failed   Ceph object storage daemon

and the ceph-mon unit file was not loaded at all. so I ran

[root@magna095 ubuntu]# systemctl daemon-reload

and now it's loaded:

[root@magna095 ubuntu]# systemctl status ceph-mon@service
● ceph-mon - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

let's start one of the monitors:

[root@magna095 ubuntu]# systemctl start ceph-mon

but it failed to do so:

[root@magna095 ubuntu]# systemctl status ceph-mon
● ceph-mon - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2016-04-18 07:14:30 UTC; 1min 49s ago
  Process: 12412 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 12412 (code=exited, status=1/FAILURE)

Apr 18 07:14:30 magna095 systemd[1]: Started Ceph cluster monitor daemon.
Apr 18 07:14:30 magna095 systemd[1]: Starting Ceph cluster monitor daemon...
Apr 18 07:14:30 magna095 ceph-mon[12412]: IO error: /var/lib/ceph/mon/ceph-magna095/store.db/LOCK: Permission denied
Apr 18 07:14:30 magna095 ceph-mon[12412]: 2016-04-18 07:14:30.709231 7f92359ce4c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-magna095': (22) Invalid argument
Apr 18 07:14:30 magna095 systemd[1]: ceph-mon: main process exited, code=exited, status=1/FAILURE
Apr 18 07:14:30 magna095 systemd[1]: Unit ceph-mon entered failed state.
Apr 18 07:14:30 magna095 systemd[1]: ceph-mon failed.

please note, we have

> Apr 18 07:14:30 magna095 ceph-mon[12412]: IO error: /var/lib/ceph/mon/ceph-magna095/store.db/LOCK: Permission denied

and

> Process: 12412 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)

so I checked the permissions under the mon's data directory:

[root@magna095 ubuntu]# ls -l /var/lib/ceph/mon/ceph-magna095/store.db/
total 6860
-rw-r--r--. 1 root root     787 Apr 15 22:36 000005.sst
-rw-r--r--. 1 root root 4493913 Apr 15 22:40 000008.sst
-rw-r--r--. 1 root root 1388072 Apr 15 22:58 000011.sst
-rw-r--r--. 1 root root 1112329 Apr 15 23:00 000012.log
-rw-r--r--. 1 root root      16 Apr 15 22:58 CURRENT
-rw-r--r--. 1 root root       0 Apr 15 22:36 LOCK
-rw-r--r--. 1 root root     313 Apr 15 22:40 LOG
-rw-r--r--. 1 root root     309 Apr 15 22:36 LOG.old
-rw-r--r--. 1 root root     185 Apr 15 23:00 MANIFEST-000010

so all files are owned by root, and ceph:ceph cannot acquire the exclusive write lock for the monitor. let's change it!

[root@magna095 ubuntu]# chown -R ceph:ceph /var/lib/ceph/mon/ceph-magna095

and the monitor is up and running:

[root@magna095 ubuntu]# systemctl start ceph-mon
[root@magna095 ubuntu]# systemctl status ceph-mon
● ceph-mon - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-04-18 07:18:54 UTC; 3s ago
 Main PID: 12528 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon
           └─12528 /usr/bin/ceph-mon -f --cluster ceph --id magna095 --setuser ceph --setgroup ceph

Apr 18 07:18:54 magna095 systemd[1]: Started Ceph cluster monitor daemon.
Apr 18 07:18:54 magna095 systemd[1]: Starting Ceph cluster monitor daemon...
Apr 18 07:18:54 magna095 ceph-mon[12528]: starting mon.magna095 rank 1 at 10.8.128.95:6789/0 mon_data /var/lib/ceph/mon/ceph-magna095 fsid 3cc10402-b0cd-456e-9119-b7ae9ceccff6

and the "ceph" CLI is able to talk to the cluster now!

[root@magna095 ubuntu]# ceph -s
    cluster 3cc10402-b0cd-456e-9119-b7ae9ceccff6
     health HEALTH_WARN
            clock skew detected on mon.magna095
            42 pgs degraded
            42 pgs stuck degraded
            64 pgs stuck unclean
            42 pgs stuck undersized
            42 pgs undersized
            Monitor clock skew detected
     monmap e1: 2 mons at {magna093=10.8.128.93:6789/0,magna095=10.8.128.95:6789/0}
            election epoch 12, quorum 0,1 magna093,magna095
     osdmap e48: 5 osds: 2 up, 2 in; 22 remapped pgs
      pgmap v84: 64 pgs, 1 pools, 0 bytes data, 0 objects
            70752 kB used, 1852 GB / 1852 GB avail
                  42 active+undersized+degraded
                  22 active+remapped

let's try again with the ceph-disk service:

[root@magna095 ubuntu]# systemctl start ceph-disk
[root@magna095 ubuntu]# systemctl status ceph-disk
● ceph-disk - Ceph disk activation: /dev/sdb1
   Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; vendor preset: disabled)
   Active: inactive (dead)

Apr 18 07:30:04 magna095 sh[12616]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Apr 18 07:30:04 magna095 sh[12616]: activate: Marking with init system systemd
Apr 18 07:30:04 magna095 sh[12616]: activate: ceph osd.2 data dir is ready at /var/lib/ceph/tmp/mnt.FS09xt
Apr 18 07:30:04 magna095 sh[12616]: mount_activate: ceph osd.2 already mounted in position; unmounting ours.
Apr 18 07:30:04 magna095 sh[12616]: unmount: Unmounting /var/lib/ceph/tmp/mnt.FS09xt
Apr 18 07:30:04 magna095 sh[12616]: command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.FS09xt
Apr 18 07:30:04 magna095 sh[12616]: start_daemon: Starting ceph osd.2...
Apr 18 07:30:04 magna095 sh[12616]: command_check_call: Running command: /usr/bin/systemctl enable ceph-osd@2
Apr 18 07:30:04 magna095 sh[12616]: command_check_call: Running command: /usr/bin/systemctl start ceph-osd@2
Apr 18 07:30:04 magna095 systemd[1]: Started Ceph disk activation: /dev/sdb1.

so ceph-disk starts just fine, but osd.2 is still dead!
since the disk is ready, I tried to start osd.2 directly:

[root@magna095 ubuntu]# systemctl start ceph-osd@2
[root@magna095 ubuntu]# systemctl status ceph-osd@2
● ceph-osd - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2016-04-18 07:34:04 UTC; 3s ago
  Process: 12922 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 12873 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
 Main PID: 12922 (code=exited, status=1/FAILURE)

Apr 18 07:34:04 magna095 ceph-osd-prestart.sh[12873]: 2016-04-18 07:34:04.570472 7f3521a7f700 1 -- 10.8.128.95:0/1024745295 mark_down_all
Apr 18 07:34:04 magna095 ceph-osd-prestart.sh[12873]: 2016-04-18 07:34:04.570554 7f3521a7f700 1 -- 10.8.128.95:0/1024745295 shutdown complete.
Apr 18 07:34:04 magna095 systemd[1]: Started Ceph object storage daemon.
Apr 18 07:34:04 magna095 ceph-osd[12922]: starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
Apr 18 07:34:04 magna095 ceph-osd[12922]: 2016-04-18 07:34:04.641784 7f9865f8d800 -1 filestore(/var/lib/ceph/osd/ceph-2) FileStore::mount: unable to access basedir '/var/lib/ceph/osd/ceph-2': (13) Permission denied
Apr 18 07:34:04 magna095 ceph-osd[12922]: 2016-04-18 07:34:04.641797 7f9865f8d800 -1 osd.2 0 OSD:init: unable to mount object store
Apr 18 07:34:04 magna095 ceph-osd[12922]: 2016-04-18 07:34:04.641801 7f9865f8d800 -1 ** ERROR: osd init failed: (13) Permission denied
Apr 18 07:34:04 magna095 systemd[1]: ceph-osd: main process exited, code=exited, status=1/FAILURE
Apr 18 07:34:04 magna095 systemd[1]: Unit ceph-osd entered failed state.
Apr 18 07:34:04 magna095 systemd[1]: ceph-osd failed.

still no luck, but we have more error info now:

> Apr 18 07:34:04 magna095 ceph-osd[12922]: 2016-04-18 07:34:04.641784 7f9865f8d800 -1 filestore(/var/lib/ceph/osd/ceph-2) FileStore::mount: unable to access basedir '/var/lib/ceph/osd/ceph-2': (13) Permission denied

and

> Process: 12846 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)

feel like déjà vu? this is actually documented in doc/release-notes.rst, see http://docs.ceph.com/docs/master/release-notes/#upgrading-from-hammer
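to spell out what those release notes suggest, roughly (a sketch, not the exact wording of the notes; it assumes the default /var/lib/ceph layout and that the node's ceph daemons are stopped before the chown):

# Option 1: fix the data ownership so the daemons can run as ceph:ceph
chown -R ceph:ceph /var/lib/ceph

# Option 2: keep running the daemons as root wherever the data is still
# root-owned, by adding this to ceph.conf:
#   setuser match path = /var/lib/ceph/$type/$cluster-$id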
Thanks Kefu :), I actually checked only /var/lib/ceph and was surprised to see the ownership set to ceph:ceph right after the upgrade, but I didn't realize that the files deeper inside the mon/osd directories still needed a recursive chown (-R).
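For the record, a quick way to spot such leftovers (a sketch, assuming the daemons are meant to run as ceph:ceph as in Jewel):

# list anything under /var/lib/ceph that is not owned by ceph:ceph
find /var/lib/ceph \( ! -user ceph -o ! -group ceph \) -ls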