Bug 1619090 - nvme journal: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected c325f439-6849-47ef-ac43-439d9909d391
Summary: nvme journal: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match e...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z2
Target Release: 3.3
Assignee: Sébastien Han
QA Contact: Vasishta
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1584264 1629656
 
Reported: 2018-08-20 04:22 UTC by Vasu Kulkarni
Modified: 2019-10-03 15:12 UTC
CC: 12 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.When putting a dedicated journal on an NVMe device, installation can fail
When the `dedicated_devices` setting contains an NVMe device that has partitions or signatures on it, the Ansible installation might fail with an error like the following:
----
journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected c325f439-6849-47ef-ac43-439d9909d391, invalid (someone else's?) journal
----
To work around this issue, ensure there are no partitions or signatures on the NVMe device.
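For illustration only, and assuming `/dev/nvme0n1` is the dedicated journal device, existing partitions and signatures might be removed with standard disk tools such as `wipefs` and `sgdisk` (both commands destroy all data on the device):
----
# erase all filesystem, RAID, and partition-table signatures
wipefs --all /dev/nvme0n1
# destroy any GPT and MBR partition tables
sgdisk --zap-all /dev/nvme0n1
----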
Clone Of:
Environment:
Last Closed: 2019-10-03 15:12:28 UTC
Embargoed:



Description Vasu Kulkarni 2018-08-20 04:22:42 UTC
Description of problem:


I am not sure whether this is a ceph-ansible issue or a core Ceph OSD issue; please feel free to change the component after a first-level analysis.

Specify a dedicated journal on an NVMe device, with the rest of the configuration as follows:

[clients]
pluto005.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'

[mdss]
pluto008.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'

[mgrs]
pluto004.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'

[mons]
pluto004.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'
pluto009.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'

[osds]
pluto005.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'
pluto006.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'
pluto010.ceph.redhat.com dedicated_devices='["/dev/nvme0n1"]' devices='["/dev/sdb"]' monitor_interface='eno1' public_network='10.8.128.0/21' radosgw_interface='eno1'


Running the ceph-ansible playbook, the following issue is seen:

2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:got monmap epoch 1
2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:2018-08-18 21:04:24.837653 7f4463512d80 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:2018-08-18 21:04:24.837685 7f4463512d80 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected c325f439-6849-47ef-ac43-439d9909d391, invalid (someone else's?) journal
2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:2018-08-18 21:04:24.837727 7f4463512d80 -1 filestore(/var/lib/ceph/tmp/mnt.7k5fVX) mkjournal(1068): error creating journal on /var/lib/ceph/tmp/mnt.7k5fVX/journal: (22) Invalid argument
2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:2018-08-18 21:04:24.837783 7f4463512d80 -1 OSD::mkfs: ObjectStore::mkfs failed with error (22) Invalid argument
2018-08-18T17:04:26.885 INFO:teuthology.orchestra.run.pluto009.stdout:2018-08-18 21:04:24.837855 7f4463512d80 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.7k5fVX: (22) Invalid argument
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:mount_activate: Failed to activate
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:Traceback (most recent call last):
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/sbin/ceph-disk", line 9, in <module>
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5735, in run
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:    main(sys.argv[1:])
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5688, in main
2018-08-18T17:04:26.886 INFO:teuthology.orchestra.run.pluto009.stdout:    main_catch(args.func, args)
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5713, in main_catch
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:    func(args)
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3776, in main_activate
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:    reactivate=args.reactivate,
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3539, in mount_activate
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:    (osd_id, cluster) = activate(path, activate_key_template, init)
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3716, in activate
2018-08-18T17:04:26.887 INFO:teuthology.orchestra.run.pluto009.stdout:    keyring=keyring,
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3183, in mkfs
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:    '--setgroup', get_ceph_group(),
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 566, in command_check_call
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:    return subprocess.check_call(arguments)
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:    raise CalledProcessError(retcode, cmd)
2018-08-18T17:04:26.888 INFO:teuthology.orchestra.run.pluto009.stdout:subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '-i', u'0', '--monmap', '/var/lib/ceph/tmp/mnt.7k5fVX/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.7k5fVX', '--osd-journal', '/var/lib/ceph/tmp/mnt.7k5fVX/journal', '--osd-uuid', u'c325f439-6849-47ef-ac43-439d9909d391', '--setuser', 'ceph', '--setgroup', 'ceph']' returned non-zero exit status 1

Full logs:

http://magna002.ceph.redhat.com/rakesh-2018-08-17_07:29:56-smoke-luminous-distro-basic-pluto/306626/teuthology.log

Comment 3 Sébastien Han 2018-08-20 10:23:32 UTC
This error is reported by Ceph itself when doing mkfs. Was the journal device purged correctly, and was any old Ceph metadata removed from it?
Thanks.

This does not seem like a ceph-ansible issue, although I suppose we could add checks for this.
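For reference, a quick way to check whether the journal device still carries partitions or signatures might be the following read-only commands (assuming /dev/nvme0n1, as in the inventory above):

# show the device and any partitions on it
lsblk /dev/nvme0n1
# list existing filesystem/partition-table signatures (without --all, nothing is erased)
wipefs /dev/nvme0n1
# low-level probe for on-disk metadata
blkid -p /dev/nvme0n1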

Comment 4 Vasu Kulkarni 2018-08-21 02:40:41 UTC
osd_auto_discovery is false, so I believe the devices specified in the inventory should be cleaned up by Ansible. teuthology also does its own cleanup at the beginning; I need to check that.

Comment 7 Vasu Kulkarni 2018-10-01 21:17:30 UTC
John,

I was not using the lv_create option in ceph-ansible at that time. I think we can document that, if an admin hits this issue, they should manually clean up any old partitions and retry; see the sketch below.
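A rough sketch of that manual cleanup, assuming /dev/nvme0n1 is the dedicated journal device (this destroys all data on it):

# zap the device with ceph-disk: removes its partition table and contents, then re-run the playbook
ceph-disk zap /dev/nvme0n1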

Thanks

Comment 8 John Brier 2018-10-02 20:53:00 UTC
Vasu, is the Doc Text I added accurate/good?

Comment 10 Vasu Kulkarni 2018-10-05 21:41:42 UTC
That looks good to me, thanks.

Comment 11 John Brier 2018-10-05 22:33:04 UTC
Thanks Vasu.

I need to rebuild the Release Notes to pull in this Doc Text.

Comment 14 Giridhar Ramaraju 2019-08-20 07:17:10 UTC
Setting the severity of this defect to "High" with a bulk update. Please
refine it to a more accurate value, as defined by the severity definitions in
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity

