Bug 1491634 - caught signal (Illegal instruction) when activation bluestore OSD
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.0
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: Boris Ranto
QA Contact: Persona non grata
Reported: 2017-09-14 10:18 UTC by Boris Ranto
Modified: 2017-12-05 23:42 UTC (History)

Clone Of: 1476453
Last Closed: 2017-12-05 23:42:56 UTC

Links
System: Red Hat Product Errata
ID: RHBA-2017:3387
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Ceph Storage 3.0 bug fix and enhancement update
Last Updated: 2017-12-06 03:03:45 UTC

Description Boris Ranto 2017-09-14 10:18:31 UTC
We should be hitting this in 3.0 as well.


+++ This bug was initially created as a clone of Bug #1476453 +++

Description of problem:
ceph-disk activate fails with illegal instruction:

# ceph-disk -v activate --reactivate /dev/mapper/mpathg1

main_activate: path = /dev/mapper/mpathg1
get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid path is /sys/dev/block/253:13/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid is part1-mpath-3600508b300954b90aa418f385e820016

get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid path is /sys/dev/block/253:13/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid is part1-mpath-3600508b300954b90aa418f385e820016

command: Running command: /usr/sbin/blkid -o udev -p /dev/mapper/mpathg1
get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid path is /sys/dev/block/253:13/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/mpathg1 uuid is part1-mpath-3600508b300954b90aa418f385e820016

command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathg1
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
mount: Mounting /dev/mapper/mpathg1 on /var/lib/ceph/tmp/mnt.WJ9Mm4 with options noatime,inode64
command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,inode64 -- /dev/mapper/mpathg1 /var/lib/ceph/tmp/mnt.WJ9Mm4
command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.WJ9Mm4
activate: Cluster uuid is 40216c8a-ba3c-4cec-9e80-212396e214a1
command: Running command: /usr/bin/ceph-osd --cluster=toad --show-config-value=fsid
command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
activate: Cluster name is ceph
activate: OSD uuid is 25efb560-300e-48d3-9156-2f4c0746a313
activate: OSD id is 13
activate: Initializing OSD...
command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/tmp/mnt.WJ9Mm4/activate.monmap
got monmap epoch 8
command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 13 --monmap /var/lib/ceph/tmp/mnt.WJ9Mm4/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.WJ9Mm4 --osd-uuid 25efb560-300e-48d3-9156-2f4c0746a313 --keyring /var/lib/ceph/tmp/mnt.WJ9Mm4/keyring --setuser ceph --setgroup ceph
mount_activate: Failed to activate
unmount: Unmounting /var/lib/ceph/tmp/mnt.WJ9Mm4
command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.WJ9Mm4
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 11, in <module>
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5731, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5682, in main
    args.func(args)
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3755, in main_activate
    reactivate=args.reactivate,
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3512, in mount_activate
    (osd_id, cluster) = activate(path, activate_key_template, init)
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3688, in activate
    keyring=keyring,
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3100, in mkfs
    '--setgroup', get_ceph_group(),
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3064, in ceph_osd_mkfs
    raise Error('%s failed : %s' % (str(arguments), error))
ceph_disk.main.Error: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'13', '--monmap', '/var/lib/ceph/tmp/mnt.WJ9Mm4/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.WJ9Mm4', '--osd-uuid', u'25efb560-300e-48d3-9156-2f4c0746a313', '--keyring', '/var/lib/ceph/tmp/mnt.WJ9Mm4/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed :

*** Caught signal (Illegal instruction) **
 in thread 7f0c68b4bd00 thread_name:ceph-osd
 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x55a55ca1a458]
 2: (()+0x122c0) [0x7f0c6620d2c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x55a55ce0c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x55a55cd0787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x55a55ccd3ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x55a55ccd5941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x55a55ccd6fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x55a55c9646cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x55a55c965e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x55a55c8f48e3]
 11: (BlueStore::mkfs()+0x8e8) [0x55a55c920a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x55a55c48f8b9]
 13: (main()+0xe29) [0x55a55c3e78b9]
 14: (__libc_start_main()+0xea) [0x7f0c6517a4da]
 15: (_start()+0x2a) [0x55a55c46d98a]
2017-07-29 09:15:16.982152 7f0c68b4bd00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f0c68b4bd00 thread_name:ceph-osd

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x55a55ca1a458]
 2: (()+0x122c0) [0x7f0c6620d2c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x55a55ce0c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x55a55cd0787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x55a55ccd3ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x55a55ccd5941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x55a55ccd6fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x55a55c9646cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x55a55c965e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x55a55c8f48e3]
 11: (BlueStore::mkfs()+0x8e8) [0x55a55c920a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x55a55c48f8b9]
 13: (main()+0xe29) [0x55a55c3e78b9]
 14: (__libc_start_main()+0xea) [0x7f0c6517a4da]
 15: (_start()+0x2a) [0x55a55c46d98a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2017-07-29 09:15:16.982152 7f0c68b4bd00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f0c68b4bd00 thread_name:ceph-osd
ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x55a55ca1a458]
 2: (()+0x122c0) [0x7f0c6620d2c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x55a55ce0c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x55a55cd0787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x55a55ccd3ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x55a55ccd5941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x55a55ccd6fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x55a55c9646cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x55a55c965e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x55a55c8f48e3]
 11: (BlueStore::mkfs()+0x8e8) [0x55a55c920a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x55a55c48f8b9]
 13: (main()+0xe29) [0x55a55c3e78b9]
 14: (__libc_start_main()+0xea) [0x7f0c6517a4da]
 15: (_start()+0x2a) [0x55a55c46d98a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/usr/bin/timeout: the monitored command dumped core
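
For anyone comparing frames like the ones in the crash output above with debugger output, here is a minimal Python sketch that splits one backtrace line into frame number, symbol, offset and address. The regular expression and the `parse_frame` helper are assumptions inferred only from the lines pasted in this report; they are not part of Ceph and may not cover every frame format.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from frames pasted in this report, for example
# " 11: (BlueStore::mkfs()+0x8e8) [0x55a55c920a38]".
# This is an assumption, not a documented Ceph log format.
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+):\s+\((?P<symbol>.*)\+0x(?P<offset>[0-9a-fA-F]+)\)\s+"
    r"\[0x(?P<address>[0-9a-fA-F]+)\]\s*$"
)


class Frame(NamedTuple):
    index: int    # frame number as printed
    symbol: str   # demangled symbol, "()" when unresolved
    offset: int   # byte offset into the symbol
    address: int  # absolute address in the crashed process


def parse_frame(line: str) -> Optional[Frame]:
    """Return a Frame for a backtrace line, or None if the line is not a frame."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return Frame(
        int(m["index"]),
        m["symbol"],
        int(m["offset"], 16),
        int(m["address"], 16),
    )


frame = parse_frame(
    " 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9)"
    " [0x55a55ce0c939]"
)
print(frame.symbol, frame.offset)
```

Frame 3's offset, 0x7c9, is 1993 in decimal, which matches the `SaveTo(...)+1993` location that GDB reports later in this thread.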


Version-Release number of selected component (if applicable):
ceph-osd-12.1.1-3.fc27.x86_64

How reproducible:
Always, on two of my machines (one with a Pentium 4 CPU, the other with an AMD Opteron 8354)

Steps to Reproduce:
1. ceph-disk prepare /dev/xxx
2. ceph-disk activate /dev/xxx

Actual results:
The traceback.

Expected results:
Running OSD

Additional info:

--- Additional comment from Loic Dachary on 2017-07-29 09:30:12 CEST ---

Hi !

It would be very useful to have detailed steps to reproduce. Would you mind explaining how multipath was set up? Also, I'm curious why you used the --reactivate flag:

    ceph-disk -v activate --reactivate /dev/mapper/mpathg1

Thanks !

--- Additional comment from Tomasz Torcz on 2017-07-29 09:40:29 CEST ---

I'm using --reactivate because I'm recreating btrfs-based OSDs with bluestore. I'm following the steps documented at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd , but with '--reactivate' because '--osd-id' from step 3 seems to no longer exist.

Multipath is irrelevant: I get the same backtrace on an OSD using plain partitions. For the sake of completeness:

# multipath -ll /dev/mapper/mpathg
mpathg (3600508b300954b90aa418f385e820016) dm-6 COMPAQ  ,MSA1000 VOLUME  
size=279G features='1 queue_if_no_path' hwhandler='1 hp_sw' wp=rw
|-+- policy='service-time 0' prio=2 status=enabled
| `- 4:0:0:5 sdl        8:176  active ghost running
`-+- policy='service-time 0' prio=4 status=active
  `- 2:0:0:5 sde        8:64   active ready running

lsblk:
sde              8:64   0 279.4G  0 disk  
└─mpathg       253:6    0 279.4G  0 mpath 
  ├─mpathg1    253:13   0   100M  0 part  
  └─mpathg2    253:14   0 279.3G  0 part  
sdl              8:176  0 279.4G  0 disk  
└─mpathg       253:6    0 279.4G  0 mpath 
  ├─mpathg1    253:13   0   100M  0 part  
  └─mpathg2    253:14   0 279.3G  0 part  

And no special configuration, just multipath defaults.



To rule out multipath, here's the backtrace from the other machine. It has sda3 as the mountable OSD store and sda6 as the block store.


# ceph-disk activate /dev/sda3 1>&2 2>/tmp/out.txt
got monmap epoch 8
mount_activate: Failed to activate
ceph-disk: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'8', '--monmap', '/var/lib/ceph/tmp/mnt.eD_OyV/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.eD_OyV', '--osd-uuid', u'7671ea65-8cdd-407f-963b-fa4ad85ba9b1', '--keyring', '/var/lib/ceph/tmp/mnt.eD_OyV/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed : *** Caught signal (Illegal instruction) **
 in thread 7f0029f4cd00 thread_name:ceph-osd
 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x22fe05a458]
 2: (()+0x12720) [0x7f002761b720]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x22fe44c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x22fe34787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x22fe313ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x22fe315941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x22fe316fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x22fdfa46cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x22fdfa5e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x22fdf348e3]
 11: (BlueStore::mkfs()+0x8e8) [0x22fdf60a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x22fdacf8b9]
 13: (main()+0xe29) [0x22fda278b9]
 14: (__libc_start_main()+0xea) [0x7f002653400a]
 15: (_start()+0x2a) [0x22fdaad98a]
2017-07-29 09:38:28.304361 7f0029f4cd00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f0029f4cd00 thread_name:ceph-osd

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x22fe05a458]
 2: (()+0x12720) [0x7f002761b720]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x22fe44c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x22fe34787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x22fe313ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x22fe315941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x22fe316fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x22fdfa46cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x22fdfa5e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x22fdf348e3]
 11: (BlueStore::mkfs()+0x8e8) [0x22fdf60a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x22fdacf8b9]
 13: (main()+0xe29) [0x22fda278b9]
 14: (__libc_start_main()+0xea) [0x7f002653400a]
 15: (_start()+0x2a) [0x22fdaad98a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2017-07-29 09:38:28.304361 7f0029f4cd00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f0029f4cd00 thread_name:ceph-osd
 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x22fe05a458]
 2: (()+0x12720) [0x7f002761b720]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x22fe44c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x22fe34787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x22fe313ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x22fe315941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x22fe316fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x22fdfa46cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x22fdfa5e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x22fdf348e3]
 11: (BlueStore::mkfs()+0x8e8) [0x22fdf60a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x22fdacf8b9]
 13: (main()+0xe29) [0x22fda278b9]
 14: (__libc_start_main()+0xea) [0x7f002653400a]
 15: (_start()+0x2a) [0x22fdaad98a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/usr/bin/timeout: the monitored command dumped core

--- Additional comment from Loic Dachary on 2017-07-29 13:49:34 CEST ---

It's interesting that it can be reproduced without multipath (because multipath adds a level of complexity that may make it more difficult to diagnose the problem). 

Would it be possible for me to reproduce the same problem ? Should I first create an OSD with a btrfs file system, then try to migrate that OSD to bluestore ? Ideally I would run a series of commands on my own machine and run into the same problem. Could you send me such a series of commands ?

Thanks !

--- Additional comment from Tomasz Torcz on 2017-07-30 11:19:20 CEST ---

There's no need to start from a btrfs OSD. This is 100% reproducible for me on clean disks when creating a new OSD with 12.1.1. The following steps should let you reproduce it:

1. Get Fedora rawhide installed.
2. The Rawhide repositories seem not to have been updated for the past few days, so get the ceph 12.1.1-3 build manually:

koji download-build --arch=x86_64 923148

3. Install the downloaded RPMs.

4. Prepare disk with two partitions. For me it would be sda3 and sda6:

# wipefs -a /dev/sda6
# wipefs -a /dev/sda3
/dev/sda3: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42

5. Prepare the disk with ceph-disk:
# ceph-disk prepare /dev/sda3
set_data_partition: incorrect partition UUID: 0x83, expected ['4fbd7e29-9d25-41b8-afd0-5ec00ceff05d', '4fbd7e29-9d25-41b8-afd0-062c0ceff05d', '4fbd7e29-8ae0-4982-bf9d-5a8d867af560', '4fbd7e29-9d25-41b8-afd0-35865ceff05d']
meta-data=/dev/sda3              isize=2048   agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


6. More preparation: point to the block storage partition. It should be a symlink to the partuuid, but for clarity I've used the short name:

# mount /dev/sda3 /mnt/tmp
# ls -l /mnt/tmp
total 16
-rw-r--r--. 1 ceph ceph 37 Jul 30 11:06 ceph_fsid
-rw-r--r--. 1 ceph ceph 37 Jul 30 11:06 fsid
-rw-r--r--. 1 ceph ceph 21 Jul 30 11:06 magic
-rw-r--r--. 1 ceph ceph 10 Jul 30 11:06 type
# ln -s /dev/sda6 /mnt/tmp/block
# chown ceph:ceph /dev/sda6
# umount /mnt/tmp


7. Activate the new OSD and receive the backtrace:

# ceph-disk activate /dev/sda3
got monmap epoch 8
mount_activate: Failed to activate
ceph-disk: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'17', '--monmap', '/var/lib/ceph/tmp/mnt.qG3_2R/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.qG3_2R', '--osd-uuid', u'2e70309e-12dd-4a54-9547-ab68a3f842de', '--keyring', '/var/lib/ceph/tmp/mnt.qG3_2R/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed : *** Caught signal (Illegal instruction) **
 in thread 7f2653804d00 thread_name:ceph-osd
 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x906fb14458]
 2: (()+0x12720) [0x7f2650ed3720]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x906ff06939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x906fe0187f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x906fdcded3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x906fdcf941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x906fdd0fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x906fa5e6cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x906fa5fe14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x906f9ee8e3]
 11: (BlueStore::mkfs()+0x8e8) [0x906fa1aa38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x906f5898b9]
 13: (main()+0xe29) [0x906f4e18b9]
 14: (__libc_start_main()+0xea) [0x7f264fdec00a]
 15: (_start()+0x2a) [0x906f56798a]
2017-07-30 11:09:10.793415 7f2653804d00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f2653804d00 thread_name:ceph-osd

[… snipped … ]

--- Additional comment from Loic Dachary on 2017-07-31 08:27:46 CEST ---

Thanks for the detailed instructions, this is very helpful. Preparing each partition individually is uncommon but it should not crash the way it does. I'll reproduce this and figure out what to do.

Unrelated question: is there a reason why you do not simply run ceph-disk prepare /dev/sda and let it partition the disk itself?

--- Additional comment from Tomasz Torcz on 2017-07-31 15:33:34 CEST ---

This is my experimental Ceph cluster, running the latest code to catch bugs like this one early. As such, it is not production-ready: it is built from mostly decommissioned hardware, and this particular node has only one HDD. The drive is shared between the operating system and the OSD, hence the partitions.

--- Additional comment from Jan Kurik on 2017-08-15 08:43:15 CEST ---

This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.

--- Additional comment from Tomasz Torcz on 2017-09-12 12:00:21 CEST ---

I was able to run ceph-osd under GDB. The result is the following:

Thread 1 "ceph-osd" received signal SIGILL, Illegal instruction.
0x00005555563ad449 in std::__sort<__gnu_cxx::__normal_iterator<rocksdb::FileMetaData**, std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> > >, __gnu_cxx::__ops::_Iter_comp_iter<rocksdb::VersionBuilder::Rep::FileComparator> > (__comp=..., __last=..., __first=...) at /usr/include/c++/7/bits/stl_algo.h:1966
1966          if (__first != __last)
(gdb) x/i 0x00005555563ad449

=> 0x5555563ad449 <rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+1993>: pinsrd $0x0,%ecx,%xmm1


"pinsrd" is an SSE4.1 instruction (AVX adds a VEX-encoded form). My servers' CPUs support neither SSE4.1 nor AVX.
I'm not sure how RocksDB in ceph-osd ended up compiled with this instruction, but it's clearly the cause of the crash.
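
To confirm whether a given machine can execute this instruction, one option is to look for the `sse4_1` flag in /proc/cpuinfo. The Python sketch below assumes a Linux x86 host, where the kernel exposes CPU features on a `flags` line; the helper names `cpu_flags` and `supports_sse4_1` are made up for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_sse4_1(cpuinfo_text: str) -> bool:
    """True if the Linux x86 flag for SSE4.1 ('sse4_1') is present."""
    return "sse4_1" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # Assumption: Linux x86 host. On other systems the file may be missing
    # or use a different layout, in which case this reports False.
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""
    print("sse4_1 supported:", supports_sse4_1(text))
```

Note that `sse4a`, an AMD-specific extension, is a separate flag and does not imply `sse4_1`, so the check compares whole flag names rather than prefixes.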

--- Additional comment from Tomasz Torcz on 2017-09-12 12:05:00 CEST ---

See: https://github.com/facebook/rocksdb/issues/690

"Right now the 'default' build is to build with -march=native. […] The issue with this is that my build box CPU has instructions that my cluster CPU's do not support."

Also: https://github.com/ceph/ceph/pull/11677
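
One rough way to check whether a binary contains instructions beyond the baseline is to scan its disassembly for known SSE4.1 mnemonics. The Python sketch below is a heuristic under stated assumptions: the mnemonic list is deliberately small and not exhaustive, `find_sse41` is a name invented here, and the first sample line is made up for illustration (only the `pinsrd` line comes from the GDB output in this thread).

```python
# A small, deliberately non-exhaustive set of SSE4.1 mnemonics.
SSE41_MNEMONICS = frozenset({
    "pinsrb", "pinsrd", "pinsrq",
    "pextrb", "pextrd", "pextrq",
    "ptest", "pmulld", "pblendw",
    "roundss", "roundsd", "insertps",
})


def find_sse41(disassembly: str):
    """Yield (line_number, mnemonic, line) for lines using a listed mnemonic.

    Heuristic: a line matches when any whitespace-separated token equals
    one of the mnemonics, so it works on GDB 'x/i' or objdump-style text.
    """
    for number, line in enumerate(disassembly.splitlines(), start=1):
        for token in line.split():
            if token in SSE41_MNEMONICS:
                yield number, token, line.strip()
                break


# First line is hypothetical; second line is the GDB output quoted above.
sample = """\
   0x5555563ad440 <example+1984>: mov %eax,%ecx
=> 0x5555563ad449 <rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+1993>: pinsrd $0x0,%ecx,%xmm1
"""
hits = list(find_sse41(sample))
for number, mnemonic, line in hits:
    print(number, mnemonic)
```

In practice the input would come from disassembling the executable, for example with the `objdump -rdS <executable>` command that the crash output itself suggests.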

--- Additional comment from Boris Ranto on 2017-09-12 17:10:39 CEST ---

The commit you referenced is already in the 12.x packages. I was able to find a reference to march=native in the sources for the case where we do a DPDK-enabled build. Maybe that is one more suspect to look at.

Alternatively, we might want to add PORTABLE=1 before calling 'make' in the spec file to see if that helps.
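
When auditing a build log for this kind of problem, a quick filter over the compiler command lines can help. The Python sketch below flags arguments such as `-march=native`, the option quoted earlier in this thread; the other prefixes in the list, the `risky_flags` helper, and both sample command lines are assumptions made for illustration and are not taken from the actual Ceph or RocksDB build.

```python
import shlex

# Options that allow GCC/Clang to emit instructions the target CPU may lack.
# -march=native is the one quoted in this thread; the remaining prefixes are
# an assumption about what else might be worth flagging.
RISKY_PREFIXES = ("-march=native", "-msse4.1", "-msse4.2", "-mavx")


def risky_flags(command_line: str) -> list:
    """Return the arguments of one compiler command that match a risky prefix."""
    return [arg for arg in shlex.split(command_line) if arg.startswith(RISKY_PREFIXES)]


# Hypothetical compiler invocations, not copied from a real build log.
portable = "g++ -O2 -g -c version_builder.cc -o version_builder.o"
native = "g++ -O2 -march=native -c version_builder.cc -o version_builder.o"

print(risky_flags(portable))
print(risky_flags(native))
```

An empty result does not prove the binary is portable, since flags can also be injected by the build system in ways a simple text scan misses.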

--- Additional comment from Boris Ranto on 2017-09-12 19:35:13 CEST ---

There is an upstream issue for this:

http://tracker.ceph.com/issues/20529

The upstream PR that should fix this is still open/in review:

https://github.com/ceph/ceph/pull/17388

--- Additional comment from Boris Ranto on 2017-09-13 00:38:51 CEST ---

Can you test this build?

https://koji.fedoraproject.org/koji/taskinfo?taskID=21828197

Comment 2 Ken Dreyer (Red Hat) 2017-09-19 16:20:07 UTC
Will be fixed in the rebase to v12.2.1.

Comment 6 Vasu Kulkarni 2017-10-25 16:32:40 UTC
Sanity verified with a normal disk (none of our labs has mpath devices).

Comment 9 errata-xmlrpc 2017-12-05 23:42:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

