Description of problem:
Cluster creation task in USM fails due to a timeout. The root cause is a ceph-installer task stuck for nearly ten hours.

# date
Tue May 10 10:21:27 CEST 2016

# ceph-installer task 797f6111-3f77-4906-bc0c-61227d9cf514
--> endpoint: /api/osd/configure
--> succeeded: False
--> stdout: None
--> started: 2016-05-10 01:10:02.380725
--> exit_code: None
--> ended: None
--> command: /bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/osd-configure.yml -i /tmp/797f6111-3f77-4906-bc0c-61227d9cf514_NN4aJM --extra-vars {"raw_journal_devices": ["/dev/vdb"], "devices": ["/dev/vde"], "cluster": "ceph", "ceph_stable_rh_storage_cdn_install": true, "public_network": "10.70.44.0/22", "fetch_directory": "/var/lib/ceph-installer/fetch", "cluster_network": "10.70.44.0/22", "journal_size": 5120, "raw_multi_journal": true, "fsid": "cce29303-f65d-4124-a7da-fdc286e1cebe", "ceph_stable_rh_storage": true} --skip-tags package-install
--> stderr: None
--> identifier: 797f6111-3f77-4906-bc0c-61227d9cf514

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-5.el7scon.noarch
ceph-installer-1.0.6-1.el7scon.noarch
ceph-base-10.2.0-1.el7cp.x86_64
ceph-common-10.2.0-1.el7cp.x86_64
ceph-osd-10.2.0-1.el7cp.x86_64
ceph-selinux-10.2.0-1.el7cp.x86_64

How reproducible:
Not sure.

Steps to Reproduce:
1. Run cluster creation from USM.

Actual results:
The ceph-installer task is stuck for hours.

Expected results:
The ceph-installer task finishes in a reasonable time, or times out with a proper explanation of the problem.
Daniel to try to reproduce.
Closing this bug, because I wasn't able to reproduce it again. I'll reopen it if I find a similar issue.
Reopening this bug, as I saw the same issue on two clusters today. The symptoms are similar to those mentioned in the description. I still have the cluster running, so I'll try to debug it, gather as much data as I can, and compare it with the logs from comment 2. I'll try to post an update here shortly.
Created attachment 1161165 [details] /var/log/messages
To avoid races, ceph-disk acquires a global lock; this includes ceph-disk list. If one ceph-disk invocation hangs for some reason, the others will just wait for it to complete. You just need to find which ceph-disk process is the oldest and what it is blocked on (strace or something similar), and you'll find the cause of the problem.
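For context on what that global lock is: a plain exclusive flock on a shared lock file (/var/lock/ceph-disk in the process listings later in this bug). A minimal sketch of the pattern in Python, assuming that path; this is an illustration, not the actual ceph-disk code:

import fcntl
import subprocess

LOCK_PATH = '/var/lock/ceph-disk'  # path as seen in the process listings below

def run_locked(cmd):
    # Every invocation takes an exclusive lock on the same file, so if one
    # holder hangs, every later invocation blocks right here, indefinitely.
    with open(LOCK_PATH, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # blocks until the current holder lets go
        try:
            return subprocess.call(cmd)       # e.g. ['/usr/sbin/ceph-disk', 'list']
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)  # also released when the fd is closed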
It seems to be related only to OSD nodes with more data disks (OSDs) per node. (It passed twice on the same cluster, on the same nodes, where it failed twice yesterday. The only change was using only 1 Ceph data disk per OSD node.)
As can be seen in comment 8, the problem began with this flock process (pid 19721):

> root 19721 0.0 0.0 107892 596 ? Ss 11:30 0:00 flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vdc1
> root 19724 0.0 0.6 211460 11556 ? S 11:30 0:00 /usr/bin/python /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vdc1

This is the subprocess tree:

flock,19721 /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vdc1
  └─ceph-disk,19724 /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vdc1
      └─ceph-disk,19752 /usr/sbin/ceph-disk --verbose activate /dev/vdc1
          └─ceph-osd,19877 --cluster ceph --mkfs --mkkey -i 14 --monmap /var/lib/ceph/tmp/mnt.lsSCAp/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.lsSCAp --osd-journal /var/lib/ceph/tmp/mnt.lsSCAp/journal --osd-uuid fb9716ea-19c6-40d3-9380-3cc14824fbbd --keyring /var/lib/ceph/tmp/mnt.lsSCAp/keyring --setuser ceph --setgroup ceph
              └─{ceph-osd},19881

It seems the 'flock ...' command was launched by the related ceph-disk@....service unit (ceph-disk in this example). I am not sure when or how, but the command is defined in /usr/lib/systemd/system/ceph-disk@.service. When is the ceph-disk@...service instance created for a particular disk?
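For reference, the flock command above is exactly what one would expect in the ExecStart of that template unit. The sketch below is only a reconstruction from the process listing, not the shipped file (check /usr/lib/systemd/system/ceph-disk@.service on the node for the real contents); the section headers and Type are assumptions, only the ExecStart line is taken from the observed process. The template itself is static; a per-disk instance such as ceph-disk@<escaped-device>.service is started on demand (as far as I can tell, from the udev rules that fire when the OSD partition shows up), and that is the moment the flock / 'ceph-disk trigger --sync' process appears.

# Reconstructed approximation of /usr/lib/systemd/system/ceph-disk@.service
[Unit]
Description=Ceph disk activation: %f

[Service]
Type=oneshot
ExecStart=/usr/bin/flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f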
You need to strace -p 19881, which is apparently blocked on something. You could also gdb -p 19881 to find out where it is, and lsof -p 19881 to show which files are open, in case that helps. It's blocked on something, but on what...
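In case it helps with finding the culprit quickly: below is a small, hypothetical helper (not part of ceph) that lists ceph-disk/ceph-osd/flock processes oldest first, so the right PID to attach strace/gdb/lsof to is easy to spot. It only assumes a Linux /proc filesystem.

#!/usr/bin/env python
import os

INTERESTING = ('ceph-disk', 'ceph-osd', 'flock')

def candidates():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                cmdline = f.read().replace('\0', ' ').strip()
            with open('/proc/%s/stat' % pid) as f:
                stat = f.read()
            # Field 22 (1-based) of /proc/<pid>/stat is the start time in clock
            # ticks; split after ')' because the command name may contain spaces.
            starttime = int(stat.rsplit(')', 1)[1].split()[19])
        except (IOError, OSError, ValueError, IndexError):
            continue  # process exited or is unreadable
        if any(word in cmdline for word in INTERESTING):
            yield starttime, pid, cmdline

for starttime, pid, cmdline in sorted(candidates()):
    print('%8s  %s' % (pid, cmdline))  # oldest first: start with the one at the top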
# strace -p 19877
Process 19877 attached
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fa8a7438820, FUTEX_WAIT_PRIVATE, 2, {0, 149352576}) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fa8a7438820, FUTEX_WAIT_PRIVATE, 2, {0, 244311408}) = -1 ETIMEDOUT (Connection timed out)
futex(0x7fa8a7438820, FUTEX_WAIT_PRIVATE, 2, {0, 164106848}) = -1 ETIMEDOUT (Connection timed out)

# lsof -p 19877
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ceph-osd 19877 ceph cwd DIR 253,2 4096 128 /
ceph-osd 19877 ceph rtd DIR 253,2 4096 128 /
ceph-osd 19877 ceph txt REG 253,2 14520192 1896129 /usr/bin/ceph-osd
ceph-osd 19877 ceph mem REG 253,2 753232 8417152 /usr/lib64/libsqlite3.so.0.8.6
ceph-osd 19877 ceph mem REG 253,2 261352 8663466 /usr/lib64/libsoftokn3.so
ceph-osd 19877 ceph mem REG 253,2 61928 9441099 /usr/lib64/libnss_files-2.17.so
ceph-osd 19877 ceph mem REG 253,2 11112 8428079 /usr/lib64/liburcu-common.so.1.0.0
ceph-osd 19877 ceph mem REG 253,2 28216 8428077 /usr/lib64/liburcu-cds.so.1.0.0
ceph-osd 19877 ceph mem REG 253,2 28448 8428075 /usr/lib64/liburcu-bp.so.1.0.0
ceph-osd 19877 ceph mem REG 253,2 40552 9735207 /usr/lib64/liblttng-ust-tracepoint.so.0.0.0
ceph-osd 19877 ceph mem REG 253,2 68192 8416251 /usr/lib64/libbz2.so.1.0.6
ceph-osd 19877 ceph mem REG 253,2 20032 8416169 /usr/lib64/libuuid.so.1.3.0
ceph-osd 19877 ceph mem REG 253,2 15768 8415975 /usr/lib64/libplds4.so
ceph-osd 19877 ceph mem REG 253,2 20016 8415974 /usr/lib64/libplc4.so
ceph-osd 19877 ceph mem REG 253,2 182056 8415976 /usr/lib64/libnssutil3.so
ceph-osd 19877 ceph mem REG 253,2 48672 8624411 /usr/lib64/libunwind.so.8.0.1
ceph-osd 19877 ceph mem REG 253,2 2112384 8389556 /usr/lib64/libc-2.17.so
ceph-osd 19877 ceph mem REG 253,2 88720 8388745 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
ceph-osd 19877 ceph mem REG 253,2 1141560 8389562 /usr/lib64/libm-2.17.so
ceph-osd 19877 ceph mem REG 253,2 999936 8415978 /usr/lib64/libstdc++.so.6.0.19
ceph-osd 19877 ceph mem REG 253,2 106448 8624396 /usr/lib64/libboost_iostreams-mt.so.1.53.0
ceph-osd 19877 ceph mem REG 253,2 252704 8390765 /usr/lib64/libblkid.so.1.1.0
ceph-osd 19877 ceph mem REG 253,2 44096 9461559 /usr/lib64/librt-2.17.so
ceph-osd 19877 ceph mem REG 253,2 11384 8624398 /usr/lib64/libboost_random-mt.so.1.53.0
ceph-osd 19877 ceph mem REG 253,2 15592 8513788 /usr/lib64/libboost_system-mt.so.1.53.0
ceph-osd 19877 ceph mem REG 253,2 94304 8584522 /usr/lib64/libboost_thread-mt.so.1.53.0
ceph-osd 19877 ceph mem REG 253,2 19520 8389559 /usr/lib64/libdl-2.17.so
ceph-osd 19877 ceph mem REG 253,2 142304 9461549 /usr/lib64/libpthread-2.17.so
ceph-osd 19877 ceph mem REG 253,2 251784 8415972 /usr/lib64/libnspr4.so
ceph-osd 19877 ceph mem REG 253,2 1220152 9008119 /usr/lib64/libnss3.so
ceph-osd 19877 ceph mem REG 253,2 255104 8428049 /usr/lib64/libfuse.so.2.9.2
ceph-osd 19877 ceph mem REG 253,2 23800 8790630 /usr/lib64/libsnappy.so.1.1.4
ceph-osd 19877 ceph mem REG 253,2 348928 9270607 /usr/lib64/libleveldb.so.1.0.7
ceph-osd 19877 ceph mem REG 253,2 90632 8416074 /usr/lib64/libz.so.1.2.7
ceph-osd 19877 ceph mem REG 253,2 6264 8584488 /usr/lib64/libaio.so.1.0.1
ceph-osd 19877 ceph mem REG 253,2 305136 8624415 /usr/lib64/libtcmalloc.so.4.2.6
ceph-osd 19877 ceph mem REG 253,2 164440 8663471 /usr/lib64/ld-2.17.so
ceph-osd 19877 ceph mem REG 253,2 34742 997567 /etc/ld.so.cache
ceph-osd 19877 ceph 0r CHR 1,3 0t0 1028 /dev/null
ceph-osd 19877 ceph 1w FIFO 0,8 0t0 634603 pipe
ceph-osd 19877 ceph 2w FIFO 0,8 0t0 634604 pipe
ceph-osd 19877 ceph 3r REG 0,18 0 544896 /run/lock/ceph-disk
ceph-osd 19877 ceph 4w REG 253,2 0 117481929 /var/lib/ceph/tmp/ceph-disk.activate.lock
ceph-osd 19877 ceph 5w REG 253,2 0 41974958 /var/log/ceph/ceph-osd.14.log
ceph-osd 19877 ceph 7r CHR 1,9 0t0 1033 /dev/urandom
ceph-osd 19877 ceph 8r CHR 1,9 0t0 1033 /dev/urandom
ceph-osd 19877 ceph 9r REG 253,2 2246 25176371 /usr/share/zoneinfo/Europe/Prague

gdb
(gdb) thread apply all bt

Thread 2 (Thread 0x7fa8a30ab700 (LWP 19881)):
#0 0x00007fa8a5facf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa8a5fa8d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2 0x00007fa8a5fa8c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fa8a7442029 in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fa8a7201849 in GetStackTrace_libunwind(void**, int, int) () from /lib64/libtcmalloc.so.4
#5 0x00007fa8a72020be in GetStackTrace(void**, int, int) () from /lib64/libtcmalloc.so.4
#6 0x00007fa8a71f3314 in tcmalloc::PageHeap::GrowHeap(unsigned long) () from /lib64/libtcmalloc.so.4
#7 0x00007fa8a71f3633 in tcmalloc::PageHeap::New(unsigned long) () from /lib64/libtcmalloc.so.4
#8 0x00007fa8a71f1f64 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#9 0x00007fa8a71f2148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#10 0x00007fa8a71f21dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#11 0x00007fa8a71f5235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#12 0x00007fa8a7203d5b in tc_malloc () from /lib64/libtcmalloc.so.4
#13 0x00007fa8a45eb3d4 in __tzfile_read () from /lib64/libc.so.6
#14 0x00007fa8a45ea3c4 in tzset_internal () from /lib64/libc.so.6
#15 0x00007fa8a45ead73 in __tz_convert () from /lib64/libc.so.6
#16 0x00007fa8a7fbb93b in ceph::log::Log::_flush(ceph::log::EntryQueue*, ceph::log::EntryQueue*, bool) ()
#17 0x00007fa8a7fbbf49 in ceph::log::Log::flush() ()
#18 0x00007fa8a7fbc1be in ceph::log::Log::entry() ()
#19 0x00007fa8a5fa6dc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fa8a4631ced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa8a764a800 (LWP 19877)):
#0 0x00007fa8a7201573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1 0x00007fa8a7201447 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2 0x00007fa8a71f2078 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#3 0x00007fa8a71f2148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#4 0x00007fa8a71f21dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#5 0x00007fa8a71f5235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#6 0x00007fa8a7203d5b in tc_malloc () from /lib64/libtcmalloc.so.4
#7 0x00007fa8a3cedb0c in PL_HashTableRawAdd () from /lib64/libplds4.so
#8 0x00007fa8a410a8ec in SECOID_Init () from /lib64/libnssutil3.so
#9 0x00007fa8a245e2ed in sftk_startup_tests () from /lib64/libsoftokn3.so
#10 0x00007fa8a74503a3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fa8a7454ab6 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fa8a74501b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fa8a74541ab in _dl_open () from /lib64/ld-linux-x86-64.so.2
#14 0x00007fa8a5d9c02b in dlopen_doit () from /lib64/libdl.so.2
#15 0x00007fa8a74501b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#16 0x00007fa8a5d9c62d in _dlerror_run () from /lib64/libdl.so.2
#17 0x00007fa8a5d9c0c1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#18 0x00007fa8a61cfb8b in PR_LoadLibraryWithFlags () from /lib64/libnspr4.so
#19 0x00007fa8a410a41f in loader_LoadLibInReferenceDir () from /lib64/libnssutil3.so
#20 0x00007fa8a410a47d in PORT_LoadLibraryFromOrigin () from /lib64/libnssutil3.so
#21 0x00007fa8a642fc2e in softoken_LoadDSO () from /lib64/libnss3.so
#22 0x00007fa8a61d5b9a in PR_CallOnce () from /lib64/libnspr4.so
#23 0x00007fa8a6437817 in secmod_LoadPKCS11Module () from /lib64/libnss3.so
#24 0x00007fa8a64432ef in SECMOD_LoadModule () from /lib64/libnss3.so
#25 0x00007fa8a64124cb in nss_Init () from /lib64/libnss3.so
#26 0x00007fa8a6412ea1 in NSS_InitContext () from /lib64/libnss3.so
#27 0x00007fa8a8091cb8 in ceph::crypto::init(CephContext*) ()
#28 0x00007fa8a808b229 in CephContext::init_crypto() ()
#29 0x00007fa8a809ae39 in common_init_finish(CephContext*) ()
#30 0x00007fa8a796c001 in main ()
It looks like this could be another instance of http://tracker.ceph.com/issues/13522
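Reading the backtraces above, it does look like that class of problem: thread 1 (LWP 19877) is inside dlopen() of the NSS softoken, so it holds the glibc dynamic-loader lock while waiting for tcmalloc's central free-list spinlock, and thread 2 (LWP 19881, the log flush thread) holds the tcmalloc lock inside GrowHeap() and, through libunwind's TLS access, waits for the loader lock. A purely illustrative sketch of that lock-order inversion in Python (nothing ceph-specific, just the pattern):

import threading
import time

loader_lock = threading.Lock()   # stands in for the glibc dynamic-loader lock
malloc_lock = threading.Lock()   # stands in for tcmalloc's central free-list lock

def dlopen_like_thread():
    with loader_lock:            # dlopen() holds the loader lock...
        time.sleep(0.1)
        with malloc_lock:        # ...and then allocates memory
            pass

def log_flush_like_thread():
    with malloc_lock:            # the allocator holds its own lock...
        time.sleep(0.1)
        with loader_lock:        # ...and then needs the loader lock (TLS access)
            pass

a = threading.Thread(target=dlopen_like_thread)
b = threading.Thread(target=log_flush_like_thread)
a.daemon = b.daemon = True       # let the interpreter exit despite the deadlock
a.start(); b.start()
a.join(2); b.join(2)
print('deadlocked: %s' % (a.is_alive() and b.is_alive()))  # True: neither thread can progress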
Daniel, how frequently do you run into this problem? Was it just once (that's what I would expect) or more than once?
Daniel, I see you may have run into this problem twice. I write "may" because it is unclear whether the problem that prompted the creation of this bug entry was indeed the same problem, since the root cause was not found at the time.
Loic, I checked the logs from the first occurrence of this issue (comment 2, attachment 1155623 [details] - usm5-node3/machine_information) and it looks like the same issue:

|-flock,30745 /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vde1
|   `-ceph-disk,30746 /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vde1
|       `-ceph-disk,30762 /usr/sbin/ceph-disk --verbose activate /dev/vde1
|           `-ceph-osd,30921 --cluster ceph --mkfs --mkkey -i 10 --monmap /var/lib/ceph/tmp/mnt.lHmAb6/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.lHmAb6 --osd-journal /var/lib/ceph/tmp/mnt.lHmAb6/journal --osd-uuid 7732940b-4c2b-482e-a32f-becf6a9b6633 --keyring /var/lib/ceph/tmp/mnt.lHmAb6/keyring --setuser ceph --setgroup ceph
|               `-{ceph-osd},30923
|-flock,30768 /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vde1
|-flock,30895 /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vde1

So I saw it once a few weeks ago, and then a few times this week (at least 4 or 5 times on at least 2 different sets of VMs in different environments). Yesterday and today I unsuccessfully tried 3 times to reproduce it on VMs in an OpenStack environment with 1 data disk + 1 journal disk per OSD VM. Today I was able to reproduce it again on the same VMs, but with 2 Ceph data disks (+ 1 journal disk) per OSD VM. I launched the installation again on the configuration where it was reproducible; feel free to ping me on IRC if it helps you to watch it directly there.
Daniel, thanks for the quick feedback. That's a lot more frequent than I expected. I'm inclined to suggest this should be marked as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1339710 because I can't think of a viable workaround.
This is a blocker; we need more data (frequency and number of tests) to decide whether it remains one.
Daniel,

Would you please give an estimate of how many runs you have done without seeing this issue, and how many times you have seen it?

cheers,
G
Gregory,

On one specific configuration in OpenStack (2 OSDs per node), it happened about 5-6 times out of about 7-8 attempts in the last week. On the same cluster but with only 1 OSD per node, it did not happen in any of about 5-7 attempts.

On other clusters (different environments) I saw it only once or twice, if I remember correctly, but I don't know out of how many attempts. It is also true that most of the clusters my colleagues are testing on have only 1 OSD per node (because there were other issues with more OSDs per node in USM, and they needed a slightly more stable environment for testing other things).
@Daniel it would be great if you could try the following workaround: https://github.com/ceph/ceph/pull/9343/files
Loic, I'll try to test it, but it will take some time.
Thanks Daniel. I've tested that it works, so it's not holding up the merge or anything. Whenever you can confirm that it improves things on your side, that would be most welcome.
Please see Comment 32. This is a blocker for Beta 1 and needs to be resolved ASAP.
I'm working on it, but it takes quite a lot of time and resources, both of which I have in rather limited supply.
It seems the occurrence of this issue also depends on external conditions of the environment where it is tested: while I was previously able to reproduce it 5 times out of 6 attempts, now the score is the opposite (1 occurrence out of 4 attempts), all on the same environment with the same configuration as initially... In this state it's hard to confirm that the workaround works as expected, because it is hard to tell whether the issue didn't happen because of the workaround or simply because it would not have occurred even without the patch.
Daniel, thanks for trying :-) It would be good enough if you can at least confirm that you ran X times with the patch included and never ran into the issue. It does not mean the workaround actually works, but it is a step forward in confirming that it does.
https://github.com/ceph/ceph/pull/9343/commits has a revised commit that includes a test. The code itself is slightly different from 21caf964c05c41324f04dcc06e4ad7350e5cdf38, but only to the extent that it helps with testing. The logic is unchanged.
I'd still like to get some confidence that this code actually fixes the problem or, at the very least, that it doesn't introduce any regressions.
It has been unit tested and is now in master. It will now be used in all integration tests and as time passes we'll be more certain that it does not introduce a regression.
I ran our cluster creation tests on a patched version of ceph_disk, and later on the ceph_disk from ceph-base-10.2.1-13.el7cp.x86_64 (where it is already patched), and I can confirm that I did not notice this issue again.
Merged into jewel. https://github.com/ceph/ceph/pull/9427
I've run similar scenarios multiple times in the last few weeks and this issue didn't occur, so I consider this tested and verified. The latest run was on ceph 10.2.2-16.el7cp.x86_64. >> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html