Bug 1328239

Summary: [RH Ceph 2.0 - 10.1.1] One osd fails to start after upgrade: ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fe2b857f285]
Product: Red Hat Ceph Storage
Reporter: Vasu Kulkarni <vakulkar>
Component: RADOS
Assignee: Kefu Chai <kchai>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, dzafman, kchai
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-20 04:02:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vasu Kulkarni 2016-04-18 19:48:06 UTC
Description of problem:
1. 3-node cluster
  node1   - [mon.a, osd.0, osd.1, osd.2]
  node2   - [mon.b, client.0]
  node3   - [osd.3, osd.4, client.1]


node1 was successfully upgraded to 2.0 (0.94.5 to 10.1.1). There were some permission issues with ceph:ceph ownership on /var/lib/ceph, which Kefu resolved, and all the OSDs/MONs are now active on node1.

After upgrading node3 with its 2 OSDs, 1 OSD fails to start and I see the crash below. For some reason the time is out of sync by 2 or 3 seconds; I did sync the clocks, but the monitors still show a clock-skew warning between node1 and node2. There is no activity from any client, so no I/O operations are in progress.

Version-Release number of selected component (if applicable):
10.1.1


[ubuntu@magna094 ~]$ sudo systemctl status ceph-osd@1
● ceph-osd - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Mon 2016-04-18 18:11:20 UTC; 1h 27min ago
  Process: 3188 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 3121 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
 Main PID: 3188 (code=killed, signal=ABRT)

Apr 18 18:11:20 magna094 ceph-osd[3188]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7fed2cc68467]
Apr 18 18:11:20 magna094 ceph-osd[3188]: 6: (OSDService::get_map(unsigned int)+0x3d) [0x7fed2c649e3d]
Apr 18 18:11:20 magna094 ceph-osd[3188]: 7: (OSD::init()+0x17c2) [0x7fed2c5fcec2]
Apr 18 18:11:20 magna094 ceph-osd[3188]: 8: (main()+0x2c45) [0x7fed2c561a05]
Apr 18 18:11:20 magna094 ceph-osd[3188]: 9: (__libc_start_main()+0xf5) [0x7fed29153b15]
Apr 18 18:11:20 magna094 ceph-osd[3188]: 10: (()+0x352e29) [0x7fed2c5abe29]
Apr 18 18:11:20 magna094 ceph-osd[3188]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:11:20 magna094 systemd[1]: ceph-osd: main process exited, code=killed, status=6/ABRT
Apr 18 18:11:20 magna094 systemd[1]: Unit ceph-osd entered failed state.
Apr 18 18:11:20 magna094 systemd[1]: ceph-osd failed.
[ubuntu@magna094 ~]$ sudo ls -lt /var/lib/ceph/
total 32
drwxr-x---. 2 ceph ceph 4096 Apr 18 18:11 tmp
drwxr-x---. 2 ceph ceph 4096 Apr  8 09:59 bootstrap-mds
drwxr-x---. 2 ceph ceph 4096 Apr  8 09:59 bootstrap-osd
drwxr-x---. 2 ceph ceph 4096 Apr  8 09:59 bootstrap-rgw
drwxr-x---. 2 ceph ceph 4096 Apr  8 09:59 mds
drwxr-x---. 2 ceph ceph 4096 Apr  8 09:59 mon
drwxr-x---. 4 ceph ceph 4096 Apr  8 09:59 osd
drwxr-xr-x. 2 ceph ceph 4096 Apr  8 09:59 radosgw
[ubuntu@magna094 ~]$ sudo ls -lt /var/lib/ceph/osd
total 0
drwxr-xr-x. 3 ceph ceph 217 Apr 18 18:11 ceph-1
drwxr-xr-x. 3 ceph ceph 217 Apr 18 18:11 ceph-0
[ubuntu@magna094 ~]$ sudo ls -lt /var/lib/ceph/osd/ceph-1
total 44
-rw-r--r--.  1 root root    0 Apr 18 18:11 systemd
drwxr-xr-x. 68 ceph ceph 1075 Apr 18 18:04 current
-rw-r--r--.  1 ceph ceph    3 Apr 15 22:37 active
-rw-------.  1 ceph ceph   56 Apr 15 22:37 keyring
-rw-r--r--.  1 ceph ceph    6 Apr 15 22:37 ready
-rw-r--r--.  1 ceph ceph   53 Apr 15 22:37 superblock
-rw-r--r--.  1 ceph ceph    4 Apr 15 22:37 store_version
-rw-r--r--.  1 ceph ceph  342 Apr 15 22:37 activate.monmap
-rw-r--r--.  1 ceph ceph    2 Apr 15 22:37 whoami
-rw-r--r--.  1 ceph ceph   21 Apr 15 22:37 magic
-rw-r--r--.  1 ceph ceph   37 Apr 15 22:37 journal_uuid
-rw-r--r--.  1 ceph ceph   37 Apr 15 22:37 fsid
-rw-r--r--.  1 ceph ceph   37 Apr 15 22:37 ceph_fsid
lrwxrwxrwx.  1 ceph ceph   58 Apr 15 22:37 journal -> /dev/disk/by-partuuid/b296d58b-882e-449a-8508-8aa62492661e


Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fe2b857f285]
Apr 18 18:04:58 magna094 ceph-osd: 2: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 3: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 4: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 5: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 6: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:04:58 magna094 ceph-osd: 2016-04-18 18:04:58.102523 7fe2b7b51800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fe2b7b51800 time 2016-04-18 18:04:58.100909
Apr 18 18:04:58 magna094 ceph-osd: osd/OSD.h: 885: FAILED assert(ret)
Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fe2b857f285]
Apr 18 18:04:58 magna094 ceph-osd: 2: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 3: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 4: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 5: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 6: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:04:58 magna094 ceph-osd: 0> 2016-04-18 18:04:58.102523 7fe2b7b51800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fe2b7b51800 time 2016-04-18 18:04:58.100909
Apr 18 18:04:58 magna094 ceph-osd: osd/OSD.h: 885: FAILED assert(ret)
Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fe2b857f285]
Apr 18 18:04:58 magna094 ceph-osd: 2: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 3: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 4: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 5: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 6: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:04:58 magna094 ceph-osd: *** Caught signal (Aborted) **
Apr 18 18:04:58 magna094 ceph-osd: in thread 7fe2b7b51800 thread_name:ceph-osd
Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (()+0x90f0ba) [0x7fe2b847f0ba]
Apr 18 18:04:58 magna094 ceph-osd: 2: (()+0xf100) [0x7fe2b64bb100]
Apr 18 18:04:58 magna094 ceph-osd: 3: (gsignal()+0x37) [0x7fe2b4a7e5f7]
Apr 18 18:04:58 magna094 ceph-osd: 4: (abort()+0x148) [0x7fe2b4a7fce8]
Apr 18 18:04:58 magna094 ceph-osd: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7fe2b857f467]
Apr 18 18:04:58 magna094 ceph-osd: 6: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 7: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 8: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 9: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 10: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: 2016-04-18 18:04:58.104345 7fe2b7b51800 -1 *** Caught signal (Aborted) **
Apr 18 18:04:58 magna094 ceph-osd: in thread 7fe2b7b51800 thread_name:ceph-osd
Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (()+0x90f0ba) [0x7fe2b847f0ba]
Apr 18 18:04:58 magna094 ceph-osd: 2: (()+0xf100) [0x7fe2b64bb100]
Apr 18 18:04:58 magna094 ceph-osd: 3: (gsignal()+0x37) [0x7fe2b4a7e5f7]
Apr 18 18:04:58 magna094 ceph-osd: 4: (abort()+0x148) [0x7fe2b4a7fce8]
Apr 18 18:04:58 magna094 ceph-osd: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7fe2b857f467]
Apr 18 18:04:58 magna094 ceph-osd: 6: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 7: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 8: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 9: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 10: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:04:58 magna094 ceph-osd: 0> 2016-04-18 18:04:58.104345 7fe2b7b51800 -1 *** Caught signal (Aborted) **
Apr 18 18:04:58 magna094 ceph-osd: in thread 7fe2b7b51800 thread_name:ceph-osd
Apr 18 18:04:58 magna094 ceph-osd: ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
Apr 18 18:04:58 magna094 ceph-osd: 1: (()+0x90f0ba) [0x7fe2b847f0ba]
Apr 18 18:04:58 magna094 ceph-osd: 2: (()+0xf100) [0x7fe2b64bb100]
Apr 18 18:04:58 magna094 ceph-osd: 3: (gsignal()+0x37) [0x7fe2b4a7e5f7]
Apr 18 18:04:58 magna094 ceph-osd: 4: (abort()+0x148) [0x7fe2b4a7fce8]
Apr 18 18:04:58 magna094 ceph-osd: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7fe2b857f467]
Apr 18 18:04:58 magna094 ceph-osd: 6: (OSDService::get_map(unsigned int)+0x3d) [0x7fe2b7f60e3d]
Apr 18 18:04:58 magna094 ceph-osd: 7: (OSD::init()+0x17c2) [0x7fe2b7f13ec2]
Apr 18 18:04:58 magna094 ceph-osd: 8: (main()+0x2c45) [0x7fe2b7e78a05]
Apr 18 18:04:58 magna094 ceph-osd: 9: (__libc_start_main()+0xf5) [0x7fe2b4a6ab15]
Apr 18 18:04:58 magna094 ceph-osd: 10: (()+0x352e29) [0x7fe2b7ec2e29]
Apr 18 18:04:58 magna094 ceph-osd: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 18 18:04:58 magna094 systemd: ceph-osd: main process exited, code=killed, status=6/ABRT
Apr 18 18:04:58 magna094 systemd: Unit ceph-osd entered failed state.
Apr 18 18:04:58 magna094 systemd: ceph-osd failed.

Comment 2 Vasu Kulkarni 2016-04-18 19:56:42 UTC
Feel free to use magna094.ceph.redhat.com

[ubuntu@magna094 ~]$ sudo ceph osd tree
ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 4.49997 root default                                        
-2 1.79999     host magna094                                   
 0 0.89999         osd.0          up  1.00000          1.00000 
 1 0.89999         osd.1        down        0          1.00000 
-3 2.69998     host magna095                                   
 2 0.89999         osd.2          up  1.00000          1.00000 
 3 0.89999         osd.3          up  1.00000          1.00000 
 4 0.89999         osd.4          up  1.00000          1.00000 
[ubuntu@magna094 ~]$ sudo ceph -s
    cluster 3cc10402-b0cd-456e-9119-b7ae9ceccff6
     health HEALTH_WARN
            clock skew detected on mon.magna095
            23 pgs degraded
            23 pgs stuck degraded
            24 pgs stuck unclean
            23 pgs stuck undersized
            23 pgs undersized
            Monitor clock skew detected 
     monmap e1: 2 mons at {magna093=10.8.128.93:6789/0,magna095=10.8.128.95:6789/0}
            election epoch 12, quorum 0,1 magna093,magna095
     osdmap e73: 5 osds: 4 up, 4 in; 1 remapped pgs
      pgmap v172: 64 pgs, 1 pools, 0 bytes data, 0 objects
            136 MB used, 3704 GB / 3704 GB avail
                  40 active+clean
                  23 active+undersized+degraded
                   1 active+remapped

Comment 3 Kefu Chai 2016-04-20 04:02:57 UTC
Vasu, 

Since /var/log/ceph/ceph-osd.1.log does not show anything interesting, you can launch the OSD daemon manually:


# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph  --debug-osd=10 --debug-filestore=10 --log-to-stderr=1

- --debug-osd and --debug-filestore give more verbose logging, since we are debugging the OSD
- --log-to-stderr prints the log to stderr.

This gives me the following log messages:


    -7> 2016-04-20 03:58:23.982224 7f4f95866800 10 register_cxx_method refcount.get flags 3 0x7f4f824d6f30
    -6> 2016-04-20 03:58:23.982226 7f4f95866800 10 register_cxx_method refcount.put flags 3 0x7f4f824d6270
    -5> 2016-04-20 03:58:23.982227 7f4f95866800 10 register_cxx_method refcount.set flags 3 0x7f4f824d5250
    -4> 2016-04-20 03:58:23.982228 7f4f95866800 10 register_cxx_method refcount.read flags 1 0x7f4f824d68a0
    -3> 2016-04-20 03:58:23.982229 7f4f95866800 10 _load_class refcount success
    -2> 2016-04-20 03:58:23.982266 7f4f95866800 10 filestore(/var/lib/ceph/osd/ceph-1) error opening file /var/lib/ceph/osd/ceph-1/current/meta/osdmap.57__0_64172C0C__none with flags=2: (13) Permission denied
    -1> 2016-04-20 03:58:23.982273 7f4f95866800 10 filestore(/var/lib/ceph/osd/ceph-1) FileStore::read(meta/#-1:3034e826:::osdmap.57:0#) open error: (13) Permission denied
     0> 2016-04-20 03:58:23.983951 7f4f95866800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f4f95866800 time 2016-04-20 03:58:23.982283
osd/OSD.h: 885: FAILED assert(ret)


Does it sound familiar to you?

>    -2> 2016-04-20 03:58:23.982266 7f4f95866800 10 filestore(/var/lib/ceph/osd/ceph-1) error opening file /var/lib/ceph/osd/ceph-1/current/meta/osdmap.57__0_64172C0C__none with flags=2: (13) Permission denied

It does ring a bell with me, so I checked the permissions:

# ls -l /var/lib/ceph/osd/ceph-1/current/meta/osdmap.57__0_64172C0C__none
-rw-r--r--. 1 root root 4150 Apr 18 18:02 /var/lib/ceph/osd/ceph-1/current/meta/osdmap.57__0_64172C0C__none

and there are more of them:

[root@magna094 ceph]# ls -l /var/lib/ceph/osd/ceph-1/current/meta/ | grep root
-rw-r--r--. 1 root root  179 Apr 18 18:02 inc\uosdmap.56__0_F4E9E663__none
-rw-r--r--. 1 root root  198 Apr 18 18:02 inc\uosdmap.57__0_F4E9E733__none
-rw-r--r--. 1 root root 4150 Apr 18 18:02 osdmap.56__0_64172F7C__none
-rw-r--r--. 1 root root 4150 Apr 18 18:02 osdmap.57__0_64172C0C__none

You might want to enable the logging and double-check the permissions next time =)
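The spot check above (ls piped through grep root) can be generalized. As a hedged sketch, `find` can list every entry under the OSD state directory whose owner is not the expected `ceph` user; the `TARGET` and `EXPECTED_USER` variables are illustration-only assumptions, with defaults taken from this report:

```shell
# Sketch: list everything under a Ceph state directory that is NOT owned by
# the expected user. TARGET and EXPECTED_USER are assumptions for illustration;
# on the node in this report they would be /var/lib/ceph and ceph.
TARGET="${TARGET:-/var/lib/ceph}"
EXPECTED_USER="${EXPECTED_USER:-ceph}"

# Print each path whose owner differs; empty output means ownership is clean.
find "$TARGET" ! -user "$EXPECTED_USER" -print
```

Run it as root so `find` can descend into the 0700 directories; any path it prints (such as the osdmap.57 file above) is a candidate for the same "Permission denied" failure.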

Comment 4 Vasu Kulkarni 2016-04-21 00:02:50 UTC
Thanks Kefu. I did use chown -R on /var/lib/ceph this time. I am still not sure how that folder's ownership was missed when it did work for 1 OSD, but I am going to keep an eye on all the subfolders now :)
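The recursive ownership fix Vasu describes can be sketched as below. This is a hedged outline, not a verified procedure from this report: the unit name ceph-osd@1 matches the failing OSD here, and stopping the daemon before the chown is a precaution so FileStore does not race the ownership change; adjust names and paths for other deployments.

```shell
# Sketch of the recursive ownership fix described above; run as root on the
# affected node. ceph-osd@1 is the failing OSD instance from this report.
systemctl stop ceph-osd@1

# Recursively hand the whole state tree (including current/meta/) to ceph:ceph,
# so the OSD can open the osdmap.* files that were left owned by root.
chown -R ceph:ceph /var/lib/ceph

systemctl start ceph-osd@1
```

After restarting, the root-owned inc_osdmap.* and osdmap.* files listed in comment 3 should no longer trigger the FAILED assert(ret) in OSDService::get_map.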