Bug 1322896

Summary: LVM volumes left in suspended state leading to I/O hang
Product: Red Hat Enterprise Linux 6 Reporter: Stanislav Saner <ssaner>
Component: lvm2 Assignee: Zdenek Kabelac <zkabelac>
lvm2 sub component: Snapshots (RHEL6) QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: urgent CC: agk, heinzm, jbrassow, loberman, mgandhi, msnitzer, mthacker, prajnoha, prockai, redhat-bugzilla, zkabelac
Version: 6.7   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-10 08:07:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1269194, 1324930    

Description Stanislav Saner 2016-03-31 14:56:41 UTC
Description of problem:
-----------------------

The system hangs: a Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.


1. kernel 2.6.32-504.3.3.el6.x86_64  
-----------------------------------


crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff8810710d0c00  flags: 0x40      
dm-1    vg00-swap              0xffff881071bda000  flags: 0x40      
...     
dm-39   vgDisk4-disk1          0xffff880870e1f000  flags: 0x40      
dm-40   vgDisk2-disk1          0xffff880870f14000  flags: 0x40      
dm-41   vgIndexCache1-disk1    0xffff880872adc400  flags: 0x40      
dm-42   vgddb01-ddb1           0xffff88086d5be400  flags: 0x43       <--
dm-43   vgddb01-DDBSnap_1456149639_988734 0xffff88085fe28800  flags: 0x43  <--     
dm-44   vgddb01-ddb1-real      0xffff880779ca7000  flags: 0x43      <--
dm-45   vgddb01-DDBSnap_1456149639_988734-cow 0xffff880860379c00  flags: 0x43      <--



2. kernel 2.6.32-573.el6.x86_64
-------------------------------

crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff88086b247400  flags: 0x40      
dm-1    vg00-swap              0xffff88086f22bc00  flags: 0x40      
...
dm-45   vgDisk6-disk1          0xffff88086c7c9c00  flags: 0x40      
dm-46   vgDisk12-disk1         0xffff88086bc30000  flags: 0x40      
dm-47   vgddb01-ddb1           0xffff88086f239400  flags: 0x43   <--    
dm-48   vgddb01-DDBSnap_1454738487_314455 0xffff88086ffc1400  flags: 0x40      
dm-49   vgddb01-DDBSnap_1455199246_9407 0xffff88086b202800  flags: 0x40      
dm-50   vgddb01-DDBSnap_1457013649_54682 0xffff88086b985000  flags: 0x40      
dm-51   vgddb01-ddb1-real      0xffff88081df5a800  flags: 0x43      <--
dm-52   vgddb01-DDBSnap_1457618434_921759-cow 0xffff88086ffcb400  flags: 0x43      <--
dm-53   vgddb01-DDBSnap_1457618434_921759 0xffff8806b598e000  flags: 0x43  <--


Interpreting the flags setting of 0x43

crash> eval -b 0x43 | grep bits
   bits set: 6 1 0 


So we have the following flags set:

drivers/md/dm.c

/*
 * Bits for the md->flags field.
 */
#define DMF_BLOCK_IO_FOR_SUSPEND 0     <---
#define DMF_SUSPENDED 1                <---
#define DMF_FROZEN 2
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6        <---
#define DMF_DEFERRED_REMOVE 7
#define DMF_SUSPENDED_INTERNALLY 8


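The flag DMF_BLOCK_IO_FOR_SUSPEND makes the device-mapper core hold back I/O while the device is suspended, and DMF_SUSPENDED _is_ set, so no I/O flows. Bit 6 (DMF_MERGE_IS_OPTIONAL) is also set on the devices showing 0x40, so it does not distinguish the affected devices. We need to figure out why the devices are left in this state for a prolonged length of time, perhaps indefinitely.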
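
The same decoding can be done outside of crash. The following is a minimal Python sketch, assuming the bit positions are exactly those in the DMF_* defines quoted above; the helper name decode_md_flags is hypothetical and not part of any crash or LVM tooling.

```python
# Bit positions copied from the DMF_* defines quoted from drivers/md/dm.c.
DMF_BITS = {
    0: "DMF_BLOCK_IO_FOR_SUSPEND",
    1: "DMF_SUSPENDED",
    2: "DMF_FROZEN",
    3: "DMF_FREEING",
    4: "DMF_DELETING",
    5: "DMF_NOFLUSH_SUSPENDING",
    6: "DMF_MERGE_IS_OPTIONAL",
    7: "DMF_DEFERRED_REMOVE",
    8: "DMF_SUSPENDED_INTERNALLY",
}


def decode_md_flags(flags):
    """Return the names of the DMF_* bits set in an md->flags value.

    Hypothetical helper for illustration only.
    """
    return [name for bit, name in sorted(DMF_BITS.items()) if flags & (1 << bit)]


if __name__ == "__main__":
    # 0x43 is the value on the hung devices, 0x40 on the others.
    for value in (0x43, 0x40):
        print(hex(value), decode_md_flags(value))
```

The only difference between the two values is bits 0 and 1, that is, the two suspend-related flags.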


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

Seen on at least 2 configurations:

Kernel: 2.6.32-504.3.3.el6.x86_64
LVM2:   lvm2-2.02.111-2.el6_6.1.x86_64                             
        lvm2-libs-2.02.111-2.el6_6.1.x86_64        

Kernel: 2.6.32-573.el6.x86_64
LVM2:   lvm2-2.02.118-2.el6.x86_64      
        lvm2-libs-2.02.118-2.el6.x86_64 

How reproducible: 
-----------------
Fairly regularly in the customer environment


Steps to Reproduce:
-------------------

- Commvault (the backup software) is backing up its internal database (repository)
- It is using lvm snapshot for this backup
- Initially the customer used 4 GB snapshots, which hung, most likely because the snapshot filled up
- They later got a procedure from the vendor for increasing the initial snapshot size to 8 GB
- In addition, Red Hat technical account management suggested that the customer use LVM snapshot autoextend:
     snapshot_autoextend_threshold = 70
     snapshot_autoextend_percent = 50
- In the last sosreport, the snapshot was auto-extended (March 10 16:22), and hang messages start to appear in the messages file about half an hour after the extension.
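
For reference, a rough sketch of what the autoextend settings above imply. It assumes the 8 GB initial snapshot size mentioned in the steps, and a simplified model in which each extension is triggered once usage exceeds the threshold percentage and adds the configured percentage of the current size. Rounding to physical extents is ignored and the function name is hypothetical.

```python
def autoextend_steps(size_gb, threshold_pct, extend_pct, steps):
    """List (current size, usage that triggers extension, size after extension).

    Simplified model of snapshot_autoextend_threshold / snapshot_autoextend_percent;
    extent rounding and free space in the VG are not considered.
    """
    result = []
    for _ in range(steps):
        trigger_gb = size_gb * threshold_pct / 100.0
        new_size_gb = size_gb * (1 + extend_pct / 100.0)
        result.append((size_gb, trigger_gb, new_size_gb))
        size_gb = new_size_gb
    return result


if __name__ == "__main__":
    # Values from the report: 8 GB snapshot, threshold 70, percent 50.
    for size, trigger, new_size in autoextend_steps(8, 70, 50, 3):
        print("size %.1f GB: extend at %.1f GB used -> %.1f GB" % (size, trigger, new_size))
```

Under these assumptions the first extension would occur at about 5.6 GB of COW usage and grow the snapshot to 12 GB.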



Actual results:
---------------
The customer noticed that backup jobs had not been making progress for some time. He logged into the server and could not access the Commvault database file system.
Any operation would hang: ls, df, etc.

The affected volumes remain suspended and I/O to them stops.
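
The affected volumes can be picked out of the dmshow listings in the description mechanically. The following is a minimal sketch that filters on bit 1 (DMF_SUSPENDED); the sample lines are copied from the second vmcore listing, and the function name and regular expression are assumptions for illustration.

```python
import re

DMF_SUSPENDED = 1 << 1  # bit 1 in md->flags, per the defines quoted in the description

# Matches lines such as:
# dm-47   vgddb01-ddb1   0xffff88086f239400  flags: 0x43   <--
LINE_RE = re.compile(
    r"^(dm-\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+flags:\s+(0x[0-9a-fA-F]+)"
)


def suspended_devices(dmshow_text):
    """Return (dm number, name) for every listed device with DMF_SUSPENDED set."""
    result = []
    for line in dmshow_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match.group(4), 16) & DMF_SUSPENDED:
            result.append((match.group(1), match.group(2)))
    return result


SAMPLE = """\
dm-46   vgDisk12-disk1         0xffff88086bc30000  flags: 0x40
dm-47   vgddb01-ddb1           0xffff88086f239400  flags: 0x43   <--
dm-48   vgddb01-DDBSnap_1454738487_314455 0xffff88086ffc1400  flags: 0x40
dm-51   vgddb01-ddb1-real      0xffff88081df5a800  flags: 0x43      <--
"""

if __name__ == "__main__":
    for dev, name in suspended_devices(SAMPLE):
        print(dev, name)
```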


Expected results:
-----------------
Suspended logical volumes get resumed and there is no hang.


Additional info:
----------------

Two crash dump images are available. Details of their analysis will follow.

Comment 12 Zdenek Kabelac 2016-10-06 14:22:23 UTC
Moving to POST, as the issue is likely already resolved according to Comment 11.