Description of problem:
If rbd-mirror cannot keep up with the creation rate of new mirror snapshots, a mirror snapshot can incorrectly end up with an empty set of peer mirror uuids. This unexpected state can lead to a recursive stack overflow in the MGR, the rbd CLI, or other librbd clients that attempt to prune the invalid snapshot.

Version-Release number of selected component (if applicable):
4.2

How reproducible:
100% if mirroring cannot keep up

Steps to Reproduce:
1. Overload rbd-mirror snapshot-based replication
   1a. Create hundreds of images and use the MGR mirror snapshot schedule
   1b. Run "rbd mirror image snapshot" quickly

Actual results:
Eventually you will get a mirror snapshot with an empty mirror peer uuid (SNAPID 7836 below). Once you get to 4 total mirror snapshots, it will fail to remove the corrupt snapshot and will crash:

$ rbd --cluster cluster1 --pool mirror snap ls image0001 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                NAMESPACE
  4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  1 MiB             Wed Dec 9 19:05:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
  7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  1 MiB             Wed Dec 9 19:10:00 2020  mirror (primary peer_uuids:[])
 11348  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.507b7ec7-472e-4ad9-ad9a-225db0af7e67  1 MiB             Wed Dec 9 19:15:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 14113  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.1a4350ca-1a95-4144-beb7-34f1d52f5a4f  1 MiB             Wed Dec 9 19:21:29 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])

Expected results:
The system never gets into a state where the corrupt snapshot exists (and even if it did, it shouldn't result in a stack overflow).

Additional info:
The MGR can also potentially crash when all of the following hold: more than 1000 images are being mirrored, a mirror schedule exists, and mirroring is concurrently being enabled or disabled on one or more images while the MGR is gathering the list of mirroring images.
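The corrupt state described in this report shows up in the "rbd snap ls --all" listing as a row whose namespace column ends in "peer_uuids:[]". A minimal Python sketch for spotting such rows follows. It assumes the listing has the same textual layout as the output pasted in this report; the function name find_empty_peer_snapshots and the regular expression are illustrative and are not part of Ceph.

```python
import re

# Matches one mirror-snapshot row as printed in this report: a numeric SNAPID,
# the snapshot name, and a trailing "peer_uuids:[...]" list.
# Assumption: the text layout of `rbd snap ls --all` matches the report.
ROW_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s.*peer_uuids:\[([^\]]*)\]")


def find_empty_peer_snapshots(snap_ls_output):
    """Return (snap_id, name) for every mirror snapshot with no peer uuids."""
    corrupt = []
    for line in snap_ls_output.splitlines():
        match = ROW_RE.match(line)
        # An empty (or whitespace-only) uuid list marks the invalid snapshot.
        if match and not match.group(3).strip():
            corrupt.append((int(match.group(1)), match.group(2)))
    return corrupt


# Two rows copied from the listing in this report.
SAMPLE = """\
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
4336 .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151 1 MiB Wed Dec 9 19:05:00 2020 mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
7836 .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea 1 MiB Wed Dec 9 19:10:00 2020 mirror (primary peer_uuids:[])
"""

if __name__ == "__main__":
    for snap_id, name in find_empty_peer_snapshots(SAMPLE):
        print(snap_id, name)
```

Matching on the printed text is fragile if the CLI layout changes, so this is only suitable as a quick check while reproducing the problem.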
QA verified.

1. Create a pool and an image, and enable mirroring.
2. Write data to the image.
3. Schedule snapshot mirroring.
4. Continuously write data.
5. In a loop, run:
   # rbd mirror image snapshot data/bug3
   # rbd --pool data snap ls bug3 --all
6. Make sure that no mirror snapshot with an empty mirror peer uuid occurs.
7. In the background, run mirroring for hundreds of images.

Example (no empty mirror peer uuid seen):
728064 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.f12104fe-b051-4a51-a92f-560f624ef6cb 50 GiB Sun Jun 6 12:36:01 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700])
728129 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.4639be51-7465-4044-8da6-af2ebae70f43 50 GiB Sun Jun 6 12:36:06 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700])
728194 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.0fc748fd-802b-4c1d-adb0-b0d12a94a200 50 GiB Sun Jun 6 12:36:11 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700])
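The loop in steps 5 and 6 of the verification above could be scripted roughly as follows. This is a sketch under stated assumptions, not the script QA actually used: it calls only the two rbd commands quoted in the steps, and the helper names run_rbd and verify_no_empty_peers, the iteration count, and the delay are hypothetical. Running it for real requires the rbd CLI and a reachable cluster.

```python
import subprocess
import time


def run_rbd(args):
    """Run the rbd CLI with the given arguments and return its stdout."""
    result = subprocess.run(
        ["rbd", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def verify_no_empty_peers(pool, image, iterations, delay=1.0, run=run_rbd):
    """Repeat steps 5 and 6 of the verification.

    Returns the first listing line containing an empty peer uuid list,
    or None if none appears within the given number of iterations.
    `run` is injectable so the loop can be exercised without a cluster.
    """
    for _ in range(iterations):
        # Step 5: create a mirror snapshot, then list all snapshots.
        run(["mirror", "image", "snapshot", f"{pool}/{image}"])
        listing = run(["--pool", pool, "snap", "ls", image, "--all"])
        # Step 6: look for the corrupt marker described in this bug.
        for line in listing.splitlines():
            if "peer_uuids:[]" in line:
                return line
        time.sleep(delay)
    return None
```

With the pool and image names from the steps, a call such as verify_no_empty_peers("data", "bug3", iterations=100) would return None on a fixed build and the offending listing line otherwise.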
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2445