Bug 1906262 - Various potential crashes are possible if snapshot mirroring is behind
Summary: Various potential crashes are possible if snapshot mirroring is behind
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2z2
Assignee: Ilya Dryomov
QA Contact: Harish Munjulur
Reported: 2020-12-10 03:52 UTC by Jason Dillaman
Modified: 2021-06-15 17:13 UTC

Fixed In Version: ceph-14.2.11-162.el8cp, ceph-14.2.11-162.el7cp
Doc Type: If docs needed, set a value
Last Closed: 2021-06-15 17:13:09 UTC


Attachments


Links
- Ceph Project Bug Tracker 48522 (last updated 2020-12-18 13:23:36 UTC)
- Ceph Project Bug Tracker 48525 (last updated 2020-12-10 03:53:31 UTC)
- Ceph Project Bug Tracker 48527 (last updated 2020-12-10 03:53:46 UTC)
- Github ceph/ceph pull 38517 (closed): rbd-mirror: bad state and crashes in snapshot-based mirroring (last updated 2020-12-15 07:41:26 UTC)
- Github ceph/ceph pull 38613 (open): librbd/api: avoid retrieving more than max mirror image info records (last updated 2020-12-18 13:32:44 UTC)
- Red Hat Product Errata RHSA-2021:2445 (last updated 2021-06-15 17:13:33 UTC)

Description Jason Dillaman 2020-12-10 03:52:01 UTC
Description of problem:
If rbd-mirror cannot keep up with the creation rate of new mirroring snapshots, it is possible for a mirror snapshot to incorrectly have an empty set of peer mirror uuids. This unexpected situation can lead to a recursive stack overflow in the MGR, rbd CLI, or other librbd clients that attempt to prune the invalid snapshot.

Version-Release number of selected component (if applicable):
4.2

How reproducible:
100% if mirroring cannot keep up

Steps to Reproduce:
1. Overload rbd-mirror snapshot-based replication
  1a. Create hundreds of images and use the MGR mirror snapshot schedule
  1b. Run "rbd mirror image snapshot" rapidly
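Step 1b can be sketched as a loop. This is a dry-run sketch, not a command from the original report: `RBD="echo rbd"` only prints the commands so nothing touches a cluster; on a real deployment set `RBD=rbd`. The cluster, pool, and image names are the examples used elsewhere in this report.

```shell
# Dry run: print the snapshot commands instead of executing them.
# On a real cluster, set RBD=rbd to actually fire the mirror snapshots
# fast enough that rbd-mirror falls behind.
RBD="echo rbd"
for i in $(seq 1 5); do
    $RBD --cluster cluster1 mirror image snapshot mirror/image0001
done
```
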

Actual results:
Eventually you will get a mirror snapshot with an empty mirror peer uuid. Once there are 4 total mirror snapshots, the client will fail to remove the corrupt snapshot and will crash:

$ rbd --cluster cluster1 --pool mirror snap ls image0001 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  1 MiB             Wed Dec  9 19:05:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
  7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  1 MiB             Wed Dec  9 19:10:00 2020  mirror (primary peer_uuids:[])                                    
 11348  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.507b7ec7-472e-4ad9-ad9a-225db0af7e67  1 MiB             Wed Dec  9 19:15:00 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
 14113  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.1a4350ca-1a95-4144-beb7-34f1d52f5a4f  1 MiB             Wed Dec  9 19:21:29 2020  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
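The corrupt snapshot can be spotted mechanically by its empty peer_uuids list. A minimal check, as a sketch: the `snap_ls` variable here holds two lines copied (abbreviated) from the listing above, standing in for live `rbd snap ls --all` output.

```shell
# Count snapshots whose mirror peer_uuids list is empty; a non-zero
# count indicates the corrupt state described above.
snap_ls='4336  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.de0af227-363a-43a9-ac48-9737d2578151  mirror (primary peer_uuids:[61a576e8-a1cc-46d7-befc-3b7c82ebc12f])
7836  .mirror.primary.d08a8d0b-cbde-4606-aa53-62faf2c6f6d8.10a07443-37d6-4b58-a13c-f3171d6d2cea  mirror (primary peer_uuids:[])'
printf '%s\n' "$snap_ls" | grep -c 'peer_uuids:\[\]'
```

Here the count printed is 1, matching the single corrupt snapshot (SNAPID 7836) in the sample.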

Expected results:
The system never gets into a state where the corrupt snapshot exists (and even if it did, it shouldn't result in a stack overflow)

Additional info:

Comment 1 Jason Dillaman 2020-12-18 13:33:17 UTC
The MGR can also potentially crash when more than 1000 images are being mirrored, a mirror schedule exists, and mirroring is concurrently enabled or disabled on one or more images while the MGR is gathering the list of mirrored images.

Comment 9 Harish Munjulur 2021-06-06 12:48:31 UTC
QA verified. 

1. Create a pool and an image and enable mirroring.
2. Write data to the image.
3. Schedule snapshot mirroring.
4. Continuously write data.
5. In a loop, run:
# rbd mirror image snapshot data/bug3
# rbd --pool data snap ls bug3 --all
6. Verify that no mirror snapshot with an empty mirror peer uuid occurs.
7. In the background, run mirroring for hundreds of images.

Example: no empty mirror peer uuid seen:
728064 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.f12104fe-b051-4a51-a92f-560f624ef6cb 50 GiB           Sun Jun  6 12:36:01 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700]) 
728129 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.4639be51-7465-4044-8da6-af2ebae70f43 50 GiB           Sun Jun  6 12:36:06 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700]) 
728194 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.0fc748fd-802b-4c1d-adb0-b0d12a94a200 50 GiB           Sun Jun  6 12:36:11 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700])
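The pass/fail check in step 6 can be sketched as a small function. This is an assumption, not the QA team's actual script: `snap_ls` stands in for live output of `rbd --pool data snap ls bug3 --all`, seeded here with one healthy line from the example above.

```shell
# Print FAIL if any listed mirror snapshot has an empty peer_uuids
# list, PASS otherwise.
verify_no_empty_peers() {
    if printf '%s\n' "$1" | grep -q 'peer_uuids:\[\]'; then
        echo FAIL
    else
        echo PASS
    fi
}

snap_ls='728064 .mirror.primary.d259f97a-8b76-419f-940a-756139109cb0.f12104fe-b051-4a51-a92f-560f624ef6cb 50 GiB Sun Jun 6 12:36:01 2021 mirror (primary peer_uuids:[1c73d2b5-6f8c-452f-9586-fc02b3d07700])'
verify_no_empty_peers "$snap_ls"   # prints PASS
```

Running the same function over output containing an empty `peer_uuids:[]` entry prints FAIL, flagging the regression.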

Comment 11 errata-xmlrpc 2021-06-15 17:13:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445

