Bug 1476453
| Summary: | caught signal (Illegal instruction) when activating bluestore OSD | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Tomasz Torcz <tomek> |
| Component: | ceph | Assignee: | Boris Ranto <branto> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 27 | CC: | bhubbard, branto, david, fedora, loic, ramkrsna, steve |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1491634 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-30 17:56:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tomasz Torcz, 2017-07-29 07:21:32 UTC
Hi! It would be very useful to have detailed steps to reproduce. Would you mind explaining how multipath was set up? I'm also curious why you used the --reactivate flag:

```
ceph-disk -v activate --reactivate /dev/mapper/mpathg1
```

Thanks!

---

I'm using --reactivate because I'm recreating btrfs-based OSDs with bluestore. I'm following the steps documented at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd , but with '--reactivate' because '--osd-id' from step 3 seems to no longer exist.

Multipath is irrelevant; I get the same backtrace on an OSD using plain partitions, but for the sake of completeness:

```
# multipath -ll /dev/mapper/mpathg
mpathg (3600508b300954b90aa418f385e820016) dm-6 COMPAQ ,MSA1000 VOLUME
size=279G features='1 queue_if_no_path' hwhandler='1 hp_sw' wp=rw
|-+- policy='service-time 0' prio=2 status=enabled
| `- 4:0:0:5 sdl 8:176 active ghost running
`-+- policy='service-time 0' prio=4 status=active
  `- 2:0:0:5 sde 8:64  active ready running
```

```
# lsblk
sde           8:64   0 279.4G  0 disk
└─mpathg    253:6    0 279.4G  0 mpath
  ├─mpathg1 253:13   0   100M  0 part
  └─mpathg2 253:14   0 279.3G  0 part
sdl           8:176  0 279.4G  0 disk
└─mpathg    253:6    0 279.4G  0 mpath
  ├─mpathg1 253:13   0   100M  0 part
  └─mpathg2 253:14   0 279.3G  0 part
```

And no special configuration, just multipath defaults.

To rule out multipath, here's the backtrace from ANOTHER machine. It has sda3 as the mountable OSD store and sda6 as the block store.
```
# ceph-disk activate /dev/sda3 1>&2 2>/tmp/out.txt
got monmap epoch 8
mount_activate: Failed to activate
ceph-disk: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'8', '--monmap', '/var/lib/ceph/tmp/mnt.eD_OyV/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.eD_OyV', '--osd-uuid', u'7671ea65-8cdd-407f-963b-fa4ad85ba9b1', '--keyring', '/var/lib/ceph/tmp/mnt.eD_OyV/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed
*** Caught signal (Illegal instruction) **
 in thread 7f0029f4cd00 thread_name:ceph-osd
 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
 1: (()+0x9b3458) [0x22fe05a458]
 2: (()+0x12720) [0x7f002761b720]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x22fe44c939]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x22fe34787f]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x22fe313ed3]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x22fe315941]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x22fe316fde]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x22fdfa46cd]
 9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x22fdfa5e14]
 10: (BlueStore::_open_db(bool)+0x4f3) [0x22fdf348e3]
 11: (BlueStore::mkfs()+0x8e8) [0x22fdf60a38]
 12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x22fdacf8b9]
 13: (main()+0xe29) [0x22fda278b9]
 14: (__libc_start_main()+0xea) [0x7f002653400a]
 15: (_start()+0x2a) [0x22fdaad98a]
2017-07-29 09:38:28.304361 7f0029f4cd00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f0029f4cd00 thread_name:ceph-osd
[… identical backtrace repeated twice more in the dump, snipped …]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```
```
/usr/bin/timeout: the monitored command dumped core
```

---

It's interesting that it can be reproduced without multipath (because multipath adds a level of complexity that may make it more difficult to diagnose the problem). Would it be possible for me to reproduce the same problem? Should I first create an OSD with a btrfs file system, then try to migrate that OSD to bluestore? Ideally I would run a series of commands on my own machine and run into the same problem. Could you send me such a series of commands? Thanks!

---

There's no need to start from a btrfs OSD. This is 100% reproducible for me on clean disks, when creating a new OSD with 12.1.1. The following steps should let you reproduce it:

1. Get Fedora rawhide installed.

2. The rawhide repositories seem not to have been updated for the past few days, so get the ceph 12.1.1-3 build manually:

   ```
   koji download-build --arch=x86_64 923148
   ```

3. Install the downloaded RPMs.

4. Prepare a disk with two partitions. For me these are sda3 and sda6:

   ```
   # wipefs -a /dev/sda6
   # wipefs -a /dev/sda3
   /dev/sda3: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42
   ```

5. ceph-disk prepare the data partition:

   ```
   # ceph-disk prepare /dev/sda3
   set_data_partition: incorrect partition UUID: 0x83, expected ['4fbd7e29-9d25-41b8-afd0-5ec00ceff05d', '4fbd7e29-9d25-41b8-afd0-062c0ceff05d', '4fbd7e29-8ae0-4982-bf9d-5a8d867af560', '4fbd7e29-9d25-41b8-afd0-35865ceff05d']
   meta-data=/dev/sda3    isize=2048   agcount=4, agsize=1310720 blks
            =             sectsz=512   attr=2, projid32bit=1
            =             crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
   data     =             bsize=4096   blocks=5242880, imaxpct=25
            =             sunit=0      swidth=0 blks
   naming   =version 2    bsize=4096   ascii-ci=0 ftype=1
   log      =internal log bsize=4096   blocks=2560, version=2
            =             sectsz=512   sunit=0 blks, lazy-count=1
   realtime =none         extsz=4096   blocks=0, rtextents=0
   ```

6. More preparation: point to the block storage partition. It should be a symlink to the partuuid, but for clarity I've used the short name:

   ```
   # mount /dev/sda3 /mnt/tmp
   # ls -l /mnt/tmp
   total 16
   -rw-r--r--. 1 ceph ceph 37 Jul 30 11:06 ceph_fsid
   -rw-r--r--. 1 ceph ceph 37 Jul 30 11:06 fsid
   -rw-r--r--. 1 ceph ceph 21 Jul 30 11:06 magic
   -rw-r--r--. 1 ceph ceph 10 Jul 30 11:06 type
   # ln -s /dev/sda6 /mnt/tmp/block
   # chown ceph:ceph /dev/sda6
   # umount /mnt/tmp
   ```

7. Activate the new OSD and receive the backtrace:

   ```
   # ceph-disk activate /dev/sda3
   got monmap epoch 8
   mount_activate: Failed to activate
   ceph-disk: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'17', '--monmap', '/var/lib/ceph/tmp/mnt.qG3_2R/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.qG3_2R', '--osd-uuid', u'2e70309e-12dd-4a54-9547-ab68a3f842de', '--keyring', '/var/lib/ceph/tmp/mnt.qG3_2R/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed
   *** Caught signal (Illegal instruction) **
    in thread 7f2653804d00 thread_name:ceph-osd
    ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
    1: (()+0x9b3458) [0x906fb14458]
    2: (()+0x12720) [0x7f2650ed3720]
    3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x7c9) [0x906ff06939]
    4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x15cf) [0x906fe0187f]
    5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x113) [0x906fdcded3]
    6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xf31) [0x906fdcf941]
    7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x66e) [0x906fdd0fde]
    8: (RocksDBStore::do_open(std::ostream&, bool)+0x62d) [0x906fa5e6cd]
    9: (RocksDBStore::create_and_open(std::ostream&)+0x174) [0x906fa5fe14]
    10: (BlueStore::_open_db(bool)+0x4f3) [0x906f9ee8e3]
    11: (BlueStore::mkfs()+0x8e8) [0x906fa1aa38]
    12: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x259) [0x906f5898b9]
    13: (main()+0xe29) [0x906f4e18b9]
    14: (__libc_start_main()+0xea) [0x7f264fdec00a]
    15: (_start()+0x2a) [0x906f56798a]
   2017-07-30 11:09:10.793415 7f2653804d00 -1 *** Caught signal (Illegal instruction) **
    in thread 7f2653804d00 thread_name:ceph-osd
   [… snipped …]
   ```

---

Thanks for the detailed instructions, this is very helpful. Preparing each partition individually is uncommon, but it should not crash the way it does. I'll reproduce this and figure out what to do. Unrelated question: is there a reason why you do not simply `ceph-disk prepare /dev/sda` and let it partition the disk itself?

---

This is my experimental Ceph cluster, running the latest code to catch bugs like this one early. As such, it is not production ready, is built from generally decommissioned hardware, and this particular node has only one HDD. That drive is shared between the operating system and the OSD, hence the partitions.

---

This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle. Changing version to '27'.

---

I was able to run ceph-osd under GDB. The result is the following:

```
Thread 1 "ceph-osd" received signal SIGILL, Illegal instruction.
0x00005555563ad449 in std::__sort<__gnu_cxx::__normal_iterator<rocksdb::FileMetaData**, std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> > >, __gnu_cxx::__ops::_Iter_comp_iter<rocksdb::VersionBuilder::Rep::FileComparator> > (__comp=..., __last=..., __first=...) at /usr/include/c++/7/bits/stl_algo.h:1966
1966      if (__first != __last)
(gdb) x/i 0x00005555563ad449
=> 0x5555563ad449 <rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+1993>: pinsrd $0x0,%ecx,%xmm1
```

"pinsrd" seems to be an SSE4.1/AVX instruction. My servers don't have CPUs with SSE4.1/AVX.
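Whether a given machine advertises those extensions can be checked from the CPU flags the kernel reports in /proc/cpuinfo. A minimal sketch; the `check_ext` helper is just an illustration for this bug report, not part of any Ceph or Fedora tooling:

```shell
#!/bin/sh
# check_ext FLAGS EXT: print "present" if the word EXT occurs in the
# space-separated FLAGS string, "MISSING" otherwise.
check_ext() {
  case " $1 " in
    *" $2 "*) echo "present" ;;
    *)        echo "MISSING" ;;
  esac
}

# Flags of the running CPU; on the reporter's hardware sse4_1 and avx
# would both come back MISSING, explaining the SIGILL on pinsrd.
flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2-)
for ext in sse4_1 sse4_2 avx; do
  printf '%s: %s\n' "$ext" "$(check_ext "$flags" "$ext")"
done
```

A MISSING sse4_1 on a host where the packaged ceph-osd crashes with this backtrace would be consistent with the miscompilation theory below.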
I'm not sure how RocksDB in ceph-osd got miscompiled to include this instruction, but it's clearly the cause of the crash. See https://github.com/facebook/rocksdb/issues/690 : "Right now the 'default' build is to build with -march=native. […] The issue with this is that my build box CPU has instructions that my cluster CPU's do not support." Also: https://github.com/ceph/ceph/pull/11677

---

The commit you referenced is already in the 12.x packages. I was able to find a reference to march=native in the sources if we are doing a dpdk-enabled build. Maybe that is one more suspect to look at. Alternatively, we might want to add PORTABLE=1 before calling 'make' in the spec file to see if that helps.

---

There is an upstream issue for this: http://tracker.ceph.com/issues/20529

---

The upstream PR that should fix this is still open/in review: https://github.com/ceph/ceph/pull/17388

---

Can you test this build? https://koji.fedoraproject.org/koji/taskinfo?taskID=21828197

---

It didn't work; it failed with an illegal instruction again. Unfortunately, I cannot check with gdb which instruction it is, because I just lost access to this cluster. Merits of giving notice, I guess :)

I think this could be reproduced by creating a virtual machine with baseline x86_64 CPU emulation (no AVX, no SSE higher than 2, etc.) and installing ceph-osd inside. I haven't checked the idea, though.

---

This message is a reminder that Fedora 27 is nearing its end of life. On 2018-Nov-30, Fedora will stop maintaining and issuing updates for Fedora 27. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 27 is end of life.
If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

---

Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.