| Summary: | LVM volumes left in suspended state leading to I/O hang | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Stanislav Saner <ssaner> |
| Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> |
| lvm2 sub component: | Snapshots (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | ||
| Priority: | urgent | CC: | agk, heinzm, jbrassow, loberman, mgandhi, msnitzer, mthacker, prajnoha, prockai, redhat-bugzilla, zkabelac |
| Version: | 6.7 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-10 08:07:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1269194, 1324930 | ||
Moving to POST as the issue is likely already resolved according to Comment 11.

Description of problem:
-----------------------
The system hangs: a Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.

1. kernel 2.6.32-504.3.3.el6.x86_64
-----------------------------------
    crash> epython storage/dmshow
    NUMBER  NAME                                   MAPPED_DEVICE       FIELDS
    dm-0    vg00-root                              0xffff8810710d0c00  flags: 0x40
    dm-1    vg00-swap                              0xffff881071bda000  flags: 0x40
    ...
    dm-39   vgDisk4-disk1                          0xffff880870e1f000  flags: 0x40
    dm-40   vgDisk2-disk1                          0xffff880870f14000  flags: 0x40
    dm-41   vgIndexCache1-disk1                    0xffff880872adc400  flags: 0x40
    dm-42   vgddb01-ddb1                           0xffff88086d5be400  flags: 0x43 <--
    dm-43   vgddb01-DDBSnap_1456149639_988734      0xffff88085fe28800  flags: 0x43 <--
    dm-44   vgddb01-ddb1-real                      0xffff880779ca7000  flags: 0x43 <--
    dm-45   vgddb01-DDBSnap_1456149639_988734-cow  0xffff880860379c00  flags: 0x43 <--

2. kernel 2.6.32-573.el6.x86_64
-------------------------------
    crash> epython storage/dmshow
    NUMBER  NAME                                   MAPPED_DEVICE       FIELDS
    dm-0    vg00-root                              0xffff88086b247400  flags: 0x40
    dm-1    vg00-swap                              0xffff88086f22bc00  flags: 0x40
    ...
    dm-45   vgDisk6-disk1                          0xffff88086c7c9c00  flags: 0x40
    dm-46   vgDisk12-disk1                         0xffff88086bc30000  flags: 0x40
    dm-47   vgddb01-ddb1                           0xffff88086f239400  flags: 0x43 <--
    dm-48   vgddb01-DDBSnap_1454738487_314455      0xffff88086ffc1400  flags: 0x40
    dm-49   vgddb01-DDBSnap_1455199246_9407        0xffff88086b202800  flags: 0x40
    dm-50   vgddb01-DDBSnap_1457013649_54682       0xffff88086b985000  flags: 0x40
    dm-51   vgddb01-ddb1-real                      0xffff88081df5a800  flags: 0x43 <--
    dm-52   vgddb01-DDBSnap_1457618434_921759-cow  0xffff88086ffcb400  flags: 0x43 <--
    dm-53   vgddb01-DDBSnap_1457618434_921759      0xffff8806b598e000  flags: 0x43 <--

Interpreting the flags value 0x43:

    crash> eval -b 0x43 | grep bits
       bits set: 6 1 0

So the following flags are set:

    drivers/md/dm.c

    /*
     * Bits for the md->flags field.
     */
    #define DMF_BLOCK_IO_FOR_SUSPEND 0   <---
    #define DMF_SUSPENDED 1              <---
    #define DMF_FROZEN 2
    #define DMF_FREEING 3
    #define DMF_DELETING 4
    #define DMF_NOFLUSH_SUSPENDING 5
    #define DMF_MERGE_IS_OPTIONAL 6      <---
    #define DMF_DEFERRED_REMOVE 7
    #define DMF_SUSPENDED_INTERNALLY 8

The DMF_BLOCK_IO_FOR_SUSPEND flag tells the device to block I/O while the DMF_SUSPENDED flag is set. That flag _is_ set, so no I/O flows. The devices that show 0x40 have only DMF_MERGE_IS_OPTIONAL set and are not suspended.

We need to figure out why the devices are left in this state for a prolonged length of time, perhaps indefinitely.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Seen on at least 2 configurations:

    Kernel: 2.6.32-504.3.3.el6.x86_64
    LVM2:   lvm2-2.02.111-2.el6_6.1.x86_64
            lvm2-libs-2.02.111-2.el6_6.1.x86_64

    Kernel: 2.6.32-573.el6.x86_64
    LVM2:   lvm2-2.02.118-2.el6.x86_64
            lvm2-libs-2.02.118-2.el6.x86_64

How reproducible:
-----------------
Fairly regularly in the customer environment.

Steps to Reproduce:
-------------------
- Commvault (the backup software) is backing up its internal database (repository).
- It uses an LVM snapshot for this backup.
- Initially the customer used 4 GB snapshots, which hung, most likely because the snapshot filled up.
- They later received a procedure from the vendor for increasing the initial snapshot size to 8 GB.
- In addition, Red Hat technical account management suggested that the customer enable LVM snapshot auto-extension:
      snapshot_autoextend_threshold = 70
      snapshot_autoextend_percent = 50
- As far as we can tell from the last sosreport, the snapshot was auto-extended (March 10 16:22), and hang messages start to appear in the messages file about half an hour after the extension.

Actual results:
---------------
The customer noticed that backup jobs had not been making progress for some time. He logged into the server and could not access the Commvault database file system. Any operation would hang: ls, df, etc.
The affected volumes remain suspended, and I/O to them stops.

Expected results:
-----------------
Suspended LVOLs get resumed; no hang.

Additional info:
----------------
2 crash dump images are available. Details of their analysis will follow.
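
For reference, the flag decoding used in the description can be reproduced outside of crash. The following is only a minimal sketch: the bit positions are copied from the drivers/md/dm.c excerpt quoted above and may differ in other kernel versions, and the helper names (decode_flags, is_suspended, suspended_devices) are hypothetical and not part of crash or of the dmshow script. It assumes each dmshow row has the form "dm-N NAME ADDRESS flags: 0xNN".

```python
# Bit positions as quoted from drivers/md/dm.c in this report (assumption:
# they match the kernel that produced the dump being examined).
DMF_BITS = {
    0: "DMF_BLOCK_IO_FOR_SUSPEND",
    1: "DMF_SUSPENDED",
    2: "DMF_FROZEN",
    3: "DMF_FREEING",
    4: "DMF_DELETING",
    5: "DMF_NOFLUSH_SUSPENDING",
    6: "DMF_MERGE_IS_OPTIONAL",
    7: "DMF_DEFERRED_REMOVE",
    8: "DMF_SUSPENDED_INTERNALLY",
}


def decode_flags(flags):
    """Return the names of all set bits, lowest bit first.

    Bits without a known name are reported as BIT_<n>.
    """
    names = []
    bit = 0
    while flags >> bit:
        if flags & (1 << bit):
            names.append(DMF_BITS.get(bit, "BIT_%d" % bit))
        bit += 1
    return names


def is_suspended(flags):
    """True if the DMF_SUSPENDED bit (bit 1) is set."""
    return bool(flags & (1 << 1))


def suspended_devices(dmshow_lines):
    """Return the NAME column of every dmshow row whose flags mark it suspended.

    Rows without a 'flags:' token (headers, '...' elisions) are skipped.
    """
    result = []
    for line in dmshow_lines:
        tokens = line.split()
        if "flags:" not in tokens or len(tokens) < 2:
            continue
        idx = tokens.index("flags:")
        if idx + 1 >= len(tokens):
            continue
        flags = int(tokens[idx + 1], 16)
        if is_suspended(flags):
            result.append(tokens[1])
    return result


if __name__ == "__main__":
    sample = [
        "dm-41 vgIndexCache1-disk1 0xffff880872adc400 flags: 0x40",
        "dm-42 vgddb01-ddb1 0xffff88086d5be400 flags: 0x43 <--",
    ]
    print(decode_flags(0x43))
    print(suspended_devices(sample))
```

Feeding the dmshow rows from either dump to suspended_devices() should list only the devices marked with "<--" above, since those are the rows whose flags include bit 1.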