Description of problem:
-----------------------
System hangs: a Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.
1. kernel 2.6.32-504.3.3.el6.x86_64
-----------------------------------
crash> epython storage/dmshow
NUMBER NAME MAPPED_DEVICE FIELDS
dm-0 vg00-root 0xffff8810710d0c00 flags: 0x40
dm-1 vg00-swap 0xffff881071bda000 flags: 0x40
...
dm-39 vgDisk4-disk1 0xffff880870e1f000 flags: 0x40
dm-40 vgDisk2-disk1 0xffff880870f14000 flags: 0x40
dm-41 vgIndexCache1-disk1 0xffff880872adc400 flags: 0x40
dm-42 vgddb01-ddb1 0xffff88086d5be400 flags: 0x43 <--
dm-43 vgddb01-DDBSnap_1456149639_988734 0xffff88085fe28800 flags: 0x43 <--
dm-44 vgddb01-ddb1-real 0xffff880779ca7000 flags: 0x43 <--
dm-45 vgddb01-DDBSnap_1456149639_988734-cow 0xffff880860379c00 flags: 0x43 <--
2. kernel 2.6.32-573.el6.x86_64
-------------------------------
crash> epython storage/dmshow
NUMBER NAME MAPPED_DEVICE FIELDS
dm-0 vg00-root 0xffff88086b247400 flags: 0x40
dm-1 vg00-swap 0xffff88086f22bc00 flags: 0x40
...
dm-45 vgDisk6-disk1 0xffff88086c7c9c00 flags: 0x40
dm-46 vgDisk12-disk1 0xffff88086bc30000 flags: 0x40
dm-47 vgddb01-ddb1 0xffff88086f239400 flags: 0x43 <--
dm-48 vgddb01-DDBSnap_1454738487_314455 0xffff88086ffc1400 flags: 0x40
dm-49 vgddb01-DDBSnap_1455199246_9407 0xffff88086b202800 flags: 0x40
dm-50 vgddb01-DDBSnap_1457013649_54682 0xffff88086b985000 flags: 0x40
dm-51 vgddb01-ddb1-real 0xffff88081df5a800 flags: 0x43 <--
dm-52 vgddb01-DDBSnap_1457618434_921759-cow 0xffff88086ffcb400 flags: 0x43 <--
dm-53 vgddb01-DDBSnap_1457618434_921759 0xffff8806b598e000 flags: 0x43 <--
Interpreting the flags value 0x43:
crash> eval -b 0x43 | grep bits
bits set: 6 1 0
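The eval -b result can be checked by hand. 0x43 is the sum of three powers of two. 0x40, the value on every unaffected device, is bit 6 alone:

```latex
\begin{aligned}
\texttt{0x43} &= \texttt{0x40} + \texttt{0x02} + \texttt{0x01} = 2^{6} + 2^{1} + 2^{0} = 67,\\
\texttt{0x40} &= 2^{6} = 64.
\end{aligned}
```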
So the following flags are set:
drivers/md/dm.c
/*
* Bits for the md->flags field.
*/
#define DMF_BLOCK_IO_FOR_SUSPEND 0 <---
#define DMF_SUSPENDED 1 <---
#define DMF_FROZEN 2
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6 <---
#define DMF_DEFERRED_REMOVE 7
#define DMF_SUSPENDED_INTERNALLY 8
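Bit 6 (DMF_MERGE_IS_OPTIONAL, 0x40) is also set on the healthy devices, so bits 0 and 1 are what distinguish the stuck devices. DMF_BLOCK_IO_FOR_SUSPEND means new I/O to the device is held back for a suspend. DMF_SUSPENDED is also set, so the suspend has completed and no I/O flows until the device is resumed. We need to figure out why the devices are left in this state for a prolonged time, perhaps indefinitely.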
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Seen on at least two configurations:
Kernel: 2.6.32-504.3.3.el6.x86_64
LVM2: lvm2-2.02.111-2.el6_6.1.x86_64
lvm2-libs-2.02.111-2.el6_6.1.x86_64
Kernel: 2.6.32-573.el6.x86_64
LVM2: lvm2-2.02.118-2.el6.x86_64
lvm2-libs-2.02.118-2.el6.x86_64
How reproducible:
-----------------
Fairly regularly in the customer's environment
Steps to Reproduce:
-------------------
- Commvault (the backup software) is backing up its internal database (repository)
- It uses an LVM snapshot for this backup
- Initially the customer used 4 GB snapshots, which hung, most likely because the snapshot filled up
- The vendor later provided a procedure to increase the initial snapshot size to 8 GB
- In addition, Red Hat Technical Account Management suggested that the customer enable LVM snapshot autoextension in the activation section of /etc/lvm/lvm.conf:
snapshot_autoextend_threshold = 70
snapshot_autoextend_percent = 50
- According to the last sosreport, the snapshot was auto-extended on March 10 at 16:22, and hang messages start to appear in the messages file about half an hour after the extension.
Actual results:
---------------
The customer noticed that backup jobs had not been making progress for some time. He logged into the server and could not access the Commvault database file system.
Any operation on it would hang: ls, df, etc.
The affected volumes remain suspended, I/O flow to them stops.
Expected results:
-----------------
Suspended LVOLs get resumed; nothing hangs.
Additional info:
----------------
Two crash dump images are available. Details of their analysis will follow.