When preparing the OSD device with --mkfs, the ceph-osd binary tries to acquire an exclusive lock on the device (soon to become an OSD). Unfortunately, when running in containers, we have seen cases where there is a race between ceph-osd and systemd-udevd to acquire a lock on the device. Sometimes systemd-udevd gets the lock and releases it soon enough for ceph-osd to acquire it, but sometimes the lock is still held, and because ceph-osd uses LOCK_NB the command fails immediately instead of waiting. Before running --mkfs, ceph-volume calls ceph-bluestore-tool, which opens the device looking for a bluestore label and then closes it; this is likely where udev triggers its scan of the block device. This commit retries if the lock cannot be acquired, up to 5 times with a 5-second delay, which should be more than enough to acquire the lock and proceed with the OSD mkfs. Unfortunately, the race is so transient that locking earlier from ceph-volume would not help. Ceph tracker bug: https://tracker.ceph.com/issues/47010
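For illustration, the retry-on-LOCK_NB pattern described above can be sketched in Python with fcntl.flock. This is a hypothetical sketch, not the actual ceph-osd C++ code; the function name, retry count, and delay are assumptions modeled on the commit description (5 attempts, 5 seconds apart).

```python
import errno
import fcntl
import time


def acquire_exclusive_lock(path, retries=5, delay=5):
    """Try to take an exclusive non-blocking flock(2) on `path`,
    retrying if another process (e.g. systemd-udevd) transiently
    holds the lock. Hypothetical sketch of the retry logic."""
    f = open(path, "rb")
    for attempt in range(1, retries + 1):
        try:
            # LOCK_NB makes flock fail with EAGAIN instead of blocking
            fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f  # lock held; caller can proceed with mkfs
        except OSError as e:
            if e.errno not in (errno.EAGAIN, errno.EACCES):
                f.close()
                raise
            if attempt < retries:
                time.sleep(delay)  # give udev time to release the lock
    f.close()
    raise RuntimeError("could not acquire exclusive lock on %s" % path)
```

Without the retry loop, a single EAGAIN from udev's transient scan would fail the whole --mkfs run, which is exactly the failure mode this bug describes.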
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.
This race impacts OCS.
Moving to z3 since we worked around it in Rook and it seems too close to make it for z2.
Moving this to 4.2. There doesn't seem to be a rush, as there is already a workaround in Rook and an upstream fix.
Andrew, there is no workaround in Rook, just the fix upstream.
We cannot deliberately introduce such a race, so just verify the deployment is successful. Also, the code now retries, so by looking at the prepare job logs you might see it retrying (meaning we are hitting the issue and recovering from it).
Checking for regression is fine too.
As per comments #11, #12, and #13, verified using Ceph Version 14.2.11-95.el8cp and Ceph Ansible Version 4.0.41-1.el8cp.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0081