Description of problem:
-----------------------
System hangs, Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.

1. kernel 2.6.32-504.3.3.el6.x86_64
-----------------------------------
crash> epython storage/dmshow
NUMBER  NAME                                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root                              0xffff8810710d0c00  flags: 0x40
dm-1    vg00-swap                              0xffff881071bda000  flags: 0x40
...
dm-39   vgDisk4-disk1                          0xffff880870e1f000  flags: 0x40
dm-40   vgDisk2-disk1                          0xffff880870f14000  flags: 0x40
dm-41   vgIndexCache1-disk1                    0xffff880872adc400  flags: 0x40
dm-42   vgddb01-ddb1                           0xffff88086d5be400  flags: 0x43 <--
dm-43   vgddb01-DDBSnap_1456149639_988734      0xffff88085fe28800  flags: 0x43 <--
dm-44   vgddb01-ddb1-real                      0xffff880779ca7000  flags: 0x43 <--
dm-45   vgddb01-DDBSnap_1456149639_988734-cow  0xffff880860379c00  flags: 0x43 <--

2. kernel 2.6.32-573.el6.x86_64
-------------------------------
crash> epython storage/dmshow
NUMBER  NAME                                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root                              0xffff88086b247400  flags: 0x40
dm-1    vg00-swap                              0xffff88086f22bc00  flags: 0x40
...
dm-45   vgDisk6-disk1                          0xffff88086c7c9c00  flags: 0x40
dm-46   vgDisk12-disk1                         0xffff88086bc30000  flags: 0x40
dm-47   vgddb01-ddb1                           0xffff88086f239400  flags: 0x43 <--
dm-48   vgddb01-DDBSnap_1454738487_314455      0xffff88086ffc1400  flags: 0x40
dm-49   vgddb01-DDBSnap_1455199246_9407        0xffff88086b202800  flags: 0x40
dm-50   vgddb01-DDBSnap_1457013649_54682       0xffff88086b985000  flags: 0x40
dm-51   vgddb01-ddb1-real                      0xffff88081df5a800  flags: 0x43 <--
dm-52   vgddb01-DDBSnap_1457618434_921759-cow  0xffff88086ffcb400  flags: 0x43 <--
dm-53   vgddb01-DDBSnap_1457618434_921759      0xffff8806b598e000  flags: 0x43 <--

Interpreting the flags setting of 0x43

crash> eval -b 0x43 | grep bits
   bits set: 6 1 0

So we have following flags set:

drivers/md/dm.c
/*
 * Bits for the md->flags field.
 */
#define DMF_BLOCK_IO_FOR_SUSPEND 0   <---
#define DMF_SUSPENDED 1              <---
#define DMF_FROZEN 2
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6      <---
#define DMF_DEFERRED_REMOVE 7
#define DMF_SUSPENDED_INTERNALLY 8

Clearly the DMF_BLOCK_IO_FOR_SUSPEND flag says to block I/O while the DMF_SUSPENDED flag is set. And that flag _is_ set, so no I/O flows. We need to figure out why the device is left in this state for a prolonged length of time, perhaps indefinitely.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Seen on at least 2 configurations:

Kernel: 2.6.32-504.3.3.el6.x86_64
LVM2:   lvm2-2.02.111-2.el6_6.1.x86_64
        lvm2-libs-2.02.111-2.el6_6.1.x86_64

Kernel: 2.6.32-573.el6.x86_64
LVM2:   lvm2-2.02.118-2.el6.x86_64
        lvm2-libs-2.02.118-2.el6.x86_64

How reproducible:
-----------------
Fairly regularly in the customer environment

Steps to Reproduce:
-------------------
- Commvault (the backup software) is backing up its internal database (repository)
- It is using an LVM snapshot for this backup
- Initially the customer used 4 GB snapshots, which hung, most likely due to the snapshot filling up
- They later got a procedure from the vendor on how to increase the initial snapshot size to 8 GB
- In addition, Red Hat technical account management suggested that the customer use LVM snapshot autoextend:
    snapshot_autoextend_threshold = 70
    snapshot_autoextend_percent = 50
- As far as we can tell from the last sosreport, the snapshot was auto-extended (March 10 16:22), and hang messages start to appear in the messages file about half an hour after the extension.

Actual results:
---------------
Customer noticed backup jobs were not making progress for some time. He logged into the server and couldn't access the Commvault database file system. Any operation would hang: ls, df, etc.
The affected volumes remain suspended and I/O to them stops.

Expected results:
-----------------
Suspended LVOLs get resumed, no hang.

Additional info:
----------------
2 crash dump images are available. Details of their analysis will follow.
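
For readers without a crash session at hand, the flag interpretation in the description (0x43 versus the healthy 0x40) can be reproduced with a small standalone sketch. The bit positions are copied from the DMF_* defines quoted from drivers/md/dm.c above; the helper name decode_md_flags is hypothetical and is not part of crash or its epython extension.

```python
# Bit positions as quoted from drivers/md/dm.c in the description above.
# They may differ in other kernel versions; treat this table as an assumption.
DMF_BITS = {
    0: "DMF_BLOCK_IO_FOR_SUSPEND",
    1: "DMF_SUSPENDED",
    2: "DMF_FROZEN",
    3: "DMF_FREEING",
    4: "DMF_DELETING",
    5: "DMF_NOFLUSH_SUSPENDING",
    6: "DMF_MERGE_IS_OPTIONAL",
    7: "DMF_DEFERRED_REMOVE",
    8: "DMF_SUSPENDED_INTERNALLY",
}


def decode_md_flags(flags):
    """Return the names of the DMF_* bits set in an md->flags value (hypothetical helper)."""
    return [name for bit, name in sorted(DMF_BITS.items()) if flags & (1 << bit)]


if __name__ == "__main__":
    # 0x40 is the value seen on the unaffected devices, 0x43 on the hung ones.
    for value in (0x40, 0x43):
        print(hex(value), decode_md_flags(value))
```

With this table, 0x40 decodes to DMF_MERGE_IS_OPTIONAL only, while 0x43 additionally carries DMF_BLOCK_IO_FOR_SUSPEND and DMF_SUSPENDED, matching the "bits set: 6 1 0" output above.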
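
To put the autoextend settings from the steps to reproduce into numbers, here is a rough sketch of when an 8 GB snapshot would be extended and to what size. It assumes the lvm.conf semantics in which the snapshot grows by snapshot_autoextend_percent of its current size once usage exceeds snapshot_autoextend_threshold; it ignores extent rounding and the monitoring poll interval, and the function name autoextend_step is hypothetical.

```python
def autoextend_step(size_gb, threshold_pct, extend_pct):
    """Return (usage in GB that triggers an extension, snapshot size in GB after it).

    Simplified model: no extent rounding, no polling delay (assumption).
    """
    trigger_gb = size_gb * threshold_pct / 100.0
    new_size_gb = size_gb * (100 + extend_pct) / 100.0
    return trigger_gb, new_size_gb


if __name__ == "__main__":
    size_gb = 8  # initial snapshot size reported in the steps to reproduce
    for step in range(1, 3):
        trigger_gb, size_after = autoextend_step(size_gb, 70, 50)
        print("extension %d: triggers above %.1f GB used, grows %.1f GB -> %.1f GB"
              % (step, trigger_gb, size_gb, size_after))
        size_gb = size_after
```

Under these assumptions the first extension triggers once more than 5.6 GB of the 8 GB snapshot is used and grows it to 12 GB.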
Moving to POST as issue is likely already resolved according to Comment 11.