Bug 1322896 - LVM volumes left in suspended state leading to I/O hang
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1269194 1324930
Reported: 2016-03-31 14:56 UTC by Stanislav Saner
Modified: 2019-12-16 05:35 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-10 08:07:21 UTC
Target Upstream Version:



Description Stanislav Saner 2016-03-31 14:56:41 UTC
Description of problem:
-----------------------

The system hangs: a Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.


1. kernel 2.6.32-504.3.3.el6.x86_64  
-----------------------------------


crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff8810710d0c00  flags: 0x40      
dm-1    vg00-swap              0xffff881071bda000  flags: 0x40      
...     
dm-39   vgDisk4-disk1          0xffff880870e1f000  flags: 0x40      
dm-40   vgDisk2-disk1          0xffff880870f14000  flags: 0x40      
dm-41   vgIndexCache1-disk1    0xffff880872adc400  flags: 0x40      
dm-42   vgddb01-ddb1           0xffff88086d5be400  flags: 0x43       <--
dm-43   vgddb01-DDBSnap_1456149639_988734 0xffff88085fe28800  flags: 0x43  <--     
dm-44   vgddb01-ddb1-real      0xffff880779ca7000  flags: 0x43      <--
dm-45   vgddb01-DDBSnap_1456149639_988734-cow 0xffff880860379c00  flags: 0x43      <--



2. kernel 2.6.32-573.el6.x86_64
-------------------------------

crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff88086b247400  flags: 0x40      
dm-1    vg00-swap              0xffff88086f22bc00  flags: 0x40      
...
dm-45   vgDisk6-disk1          0xffff88086c7c9c00  flags: 0x40      
dm-46   vgDisk12-disk1         0xffff88086bc30000  flags: 0x40      
dm-47   vgddb01-ddb1           0xffff88086f239400  flags: 0x43   <--    
dm-48   vgddb01-DDBSnap_1454738487_314455 0xffff88086ffc1400  flags: 0x40      
dm-49   vgddb01-DDBSnap_1455199246_9407 0xffff88086b202800  flags: 0x40      
dm-50   vgddb01-DDBSnap_1457013649_54682 0xffff88086b985000  flags: 0x40      
dm-51   vgddb01-ddb1-real      0xffff88081df5a800  flags: 0x43      <--
dm-52   vgddb01-DDBSnap_1457618434_921759-cow 0xffff88086ffcb400  flags: 0x43      <--
dm-53   vgddb01-DDBSnap_1457618434_921759 0xffff8806b598e000  flags: 0x43  <--


Interpreting the flags setting of 0x43

crash> eval -b 0x43 | grep bits
   bits set: 6 1 0 


So we have the following flags set:

drivers/md/dm.c

/*
 * Bits for the md->flags field.
 */
#define DMF_BLOCK_IO_FOR_SUSPEND 0     <---
#define DMF_SUSPENDED 1                <---
#define DMF_FROZEN 2
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6        <---
#define DMF_DEFERRED_REMOVE 7
#define DMF_SUSPENDED_INTERNALLY 8


The DMF_BLOCK_IO_FOR_SUSPEND flag tells the device to hold incoming I/O while it is being suspended, and DMF_SUSPENDED confirms the device is in the suspended state. Both flags _are_ set, so no I/O flows. We need to figure out why the devices are left in this state for a prolonged length of time, perhaps indefinitely.
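For reference, the decoding of the flags value can be sketched in a few lines of Python. The bit table is copied from the drivers/md/dm.c excerpt quoted above; the helper names decode_dm_flags and is_suspended are made up for this illustration and are not part of crash or any LVM tool.

```python
# Bit positions copied from the drivers/md/dm.c excerpt quoted in this report.
DMF_BITS = {
    0: "DMF_BLOCK_IO_FOR_SUSPEND",
    1: "DMF_SUSPENDED",
    2: "DMF_FROZEN",
    3: "DMF_FREEING",
    4: "DMF_DELETING",
    5: "DMF_NOFLUSH_SUSPENDING",
    6: "DMF_MERGE_IS_OPTIONAL",
    7: "DMF_DEFERRED_REMOVE",
    8: "DMF_SUSPENDED_INTERNALLY",
}


def decode_dm_flags(flags):
    """Return the names of the bits set in an md->flags value, lowest bit first."""
    return [name for bit, name in sorted(DMF_BITS.items()) if flags & (1 << bit)]


def is_suspended(flags):
    """True if the DMF_SUSPENDED bit (bit 1) is set."""
    return bool(flags & (1 << 1))


if __name__ == "__main__":
    # 0x40 is what the healthy devices show, 0x43 is what the hung ones show.
    for value in (0x40, 0x43):
        print(hex(value), decode_dm_flags(value), "suspended:", is_suspended(value))
```

The only difference between the healthy devices (0x40) and the hung ones (0x43) is bits 0 and 1, i.e. the two suspend-related flags.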


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

Seen on at least 2 configurations:

Kernel: 2.6.32-504.3.3.el6.x86_64
LVM2:   lvm2-2.02.111-2.el6_6.1.x86_64                             
        lvm2-libs-2.02.111-2.el6_6.1.x86_64        

Kernel: 2.6.32-573.el6.x86_64
LVM2:   lvm2-2.02.118-2.el6.x86_64      
        lvm2-libs-2.02.118-2.el6.x86_64 

How reproducible: 
-----------------
Fairly regularly in the customer environment


Steps to Reproduce:
-------------------

- Commvault (the backup software) backs up its internal database (repository)
- It uses an LVM snapshot for this backup
- Initially the customer used 4 GB snapshots, which hung, most likely because the snapshot filled up
- They later got a procedure from the vendor for increasing the initial snapshot size to 8 GB
- In addition, Red Hat technical account management suggested that the customer use LVM snapshot autoextend:
     snapshot_autoextend_threshold = 70
     snapshot_autoextend_percent = 50
- As far as we can tell from the last sosreport, the snapshot was auto-extended (March 10 16:22), and hang messages start to appear in the messages file about half an hour after the extension.
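
To make the autoextend settings concrete, here is a rough sketch of the arithmetic for the 8 GB snapshot. It assumes the usual lvm.conf meaning of the two settings (extend once usage exceeds the threshold percentage, by the given percentage of the current size) and ignores extent rounding and the monitoring interval; the function name autoextend is hypothetical.

```python
def autoextend(size_gb, used_gb, threshold_pct=70, extend_pct=50):
    """Return the snapshot size after one autoextend check.

    Simplified model: if usage exceeds threshold_pct of the current size,
    grow the snapshot by extend_pct of its current size; otherwise leave it.
    """
    if used_gb * 100 > size_gb * threshold_pct:
        return size_gb + size_gb * extend_pct / 100
    return size_gb


if __name__ == "__main__":
    # With an 8 GB snapshot the threshold is 5.6 GB of COW usage.
    print(autoextend(8, 5.0))   # below the threshold, size unchanged
    print(autoextend(8, 5.7))   # above the threshold, grows by 4 GB
```

Under these assumptions an 8 GB snapshot is extended to 12 GB once more than 5.6 GB of it is in use.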



Actual results:
---------------
The customer noticed that backup jobs had not been making progress for some time. He logged into the server and couldn't access the Commvault database file system.
Any operation would hang: ls, df, etc.

The affected volumes remain suspended and I/O to them stops.
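
On a live system (as opposed to a crash dump) the same condition should be visible in the State field of "dmsetup info". The following is only a sketch of how one might list the suspended devices; the function name suspended_devices is made up here, and the parsing assumes the usual "Name:" / "State:" lines of the default dmsetup info output.

```python
def suspended_devices(dmsetup_info_text):
    """Return the names of devices whose State line contains SUSPENDED.

    Expects the multi-line text printed by `dmsetup info`, where each
    device block has a "Name:" line followed by a "State:" line.
    """
    names = []
    current = None
    for line in dmsetup_info_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between device blocks
        key, value = key.strip(), value.strip()
        if key == "Name":
            current = value
        elif key == "State" and current is not None and "SUSPENDED" in value:
            names.append(current)
    return names


if __name__ == "__main__":
    import subprocess

    # Needs root and the dmsetup binary; run on the affected host.
    out = subprocess.run(["dmsetup", "info"], capture_output=True,
                         text=True, check=True).stdout
    for name in suspended_devices(out):
        print("suspended:", name)
```

In the two dumps above this would be expected to report the origin, its -real device, and the newest snapshot with its -cow device.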


Expected results:
-----------------
Suspended LVOLs get resumed; no hang.


Additional info:
----------------

2 crash dump images available. Details of their analysis will follow.

Comment 12 Zdenek Kabelac 2016-10-06 14:22:23 UTC
Moving to POST as issue is likely already resolved according to Comment 11.

