
Bug 1322896

Summary: LVM volumes left in suspended state leading to I/O hang
Product: Red Hat Enterprise Linux 6 Reporter: Stan Saner <ssaner>
Component: lvm2 Assignee: Zdenek Kabelac <zkabelac>
lvm2 sub component: Snapshots (RHEL6) QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: urgent CC: agk, heinzm, jbrassow, loberman, mgandhi, msnitzer, mthacker, prajnoha, prockai, redhat-bugzilla, zkabelac
Version: 6.7   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-10 08:07:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1269194, 1324930    

Description Stan Saner 2016-03-31 14:56:41 UTC
Description of problem:
-----------------------

System hangs, Commvault backup using LVM snapshots leaves the origin, COW and snapshot volumes suspended.


1. kernel 2.6.32-504.3.3.el6.x86_64  
-----------------------------------


crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff8810710d0c00  flags: 0x40      
dm-1    vg00-swap              0xffff881071bda000  flags: 0x40      
...     
dm-39   vgDisk4-disk1          0xffff880870e1f000  flags: 0x40      
dm-40   vgDisk2-disk1          0xffff880870f14000  flags: 0x40      
dm-41   vgIndexCache1-disk1    0xffff880872adc400  flags: 0x40      
dm-42   vgddb01-ddb1           0xffff88086d5be400  flags: 0x43       <--
dm-43   vgddb01-DDBSnap_1456149639_988734 0xffff88085fe28800  flags: 0x43  <--     
dm-44   vgddb01-ddb1-real      0xffff880779ca7000  flags: 0x43      <--
dm-45   vgddb01-DDBSnap_1456149639_988734-cow 0xffff880860379c00  flags: 0x43      <--



2. kernel 2.6.32-573.el6.x86_64
-------------------------------

crash> epython storage/dmshow 
NUMBER  NAME                   MAPPED_DEVICE       FIELDS
dm-0    vg00-root              0xffff88086b247400  flags: 0x40      
dm-1    vg00-swap              0xffff88086f22bc00  flags: 0x40      
...
dm-45   vgDisk6-disk1          0xffff88086c7c9c00  flags: 0x40      
dm-46   vgDisk12-disk1         0xffff88086bc30000  flags: 0x40      
dm-47   vgddb01-ddb1           0xffff88086f239400  flags: 0x43   <--    
dm-48   vgddb01-DDBSnap_1454738487_314455 0xffff88086ffc1400  flags: 0x40      
dm-49   vgddb01-DDBSnap_1455199246_9407 0xffff88086b202800  flags: 0x40      
dm-50   vgddb01-DDBSnap_1457013649_54682 0xffff88086b985000  flags: 0x40      
dm-51   vgddb01-ddb1-real      0xffff88081df5a800  flags: 0x43      <--
dm-52   vgddb01-DDBSnap_1457618434_921759-cow 0xffff88086ffcb400  flags: 0x43      <--
dm-53   vgddb01-DDBSnap_1457618434_921759 0xffff8806b598e000  flags: 0x43  <--


Interpreting the flags setting of 0x43

crash> eval -b 0x43 | grep bits
   bits set: 6 1 0 


So we have the following flags set:

drivers/md/dm.c

/*
 * Bits for the md->flags field.
 */
#define DMF_BLOCK_IO_FOR_SUSPEND 0     <---
#define DMF_SUSPENDED 1                <---
#define DMF_FROZEN 2
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6        <---
#define DMF_DEFERRED_REMOVE 7
#define DMF_SUSPENDED_INTERNALLY 8


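DMF_SUSPENDED marks the device as suspended, and DMF_BLOCK_IO_FOR_SUSPEND causes incoming I/O to be held back while the suspend is in effect. Both flags _are_ set, so no I/O flows to these devices. Bit 6 (DMF_MERGE_IS_OPTIONAL) is also set on the unaffected devices showing 0x40, so it does not distinguish the hung ones. We need to figure out why the devices are left in this state for a prolonged length of time, perhaps indefinitely.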
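
To make the flag interpretation above repeatable across both vmcores, here is a minimal Python sketch. It assumes the bit positions are exactly those in the drivers/md/dm.c excerpt quoted in this report, and that each device line of the dmshow output has the layout "dm-N NAME ADDRESS flags: 0xNN". The names decode_flags and suspended_devices are hypothetical helpers, not part of the crash utility or the dmshow extension.

```python
# Bit positions copied from the drivers/md/dm.c excerpt quoted in this report.
DM_FLAG_BITS = {
    0: "DMF_BLOCK_IO_FOR_SUSPEND",
    1: "DMF_SUSPENDED",
    2: "DMF_FROZEN",
    3: "DMF_FREEING",
    4: "DMF_DELETING",
    5: "DMF_NOFLUSH_SUSPENDING",
    6: "DMF_MERGE_IS_OPTIONAL",
    7: "DMF_DEFERRED_REMOVE",
    8: "DMF_SUSPENDED_INTERNALLY",
}


def decode_flags(flags):
    """Return the names of all known bits set in an md->flags value."""
    return [name for bit, name in sorted(DM_FLAG_BITS.items())
            if flags & (1 << bit)]


def suspended_devices(dmshow_text):
    """Hypothetical helper: names of devices whose flags have DMF_SUSPENDED set.

    Assumes each device line looks like 'dm-N  NAME  0xADDR  flags: 0xNN';
    header lines, '...' lines and anything else are skipped.
    """
    names = []
    for line in dmshow_text.splitlines():
        fields = line.split()
        if (len(fields) >= 5 and fields[0].startswith("dm-")
                and fields[3] == "flags:"):
            if int(fields[4], 16) & (1 << 1):  # bit 1 = DMF_SUSPENDED
                names.append(fields[1])
    return names


if __name__ == "__main__":
    print(decode_flags(0x43))
    print(decode_flags(0x40))
```

Feeding the saved dmshow output of either dump to suspended_devices() should list only the origin, -real, -cow and snapshot devices of vgddb01 that are marked with arrows above.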


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

Seen on at least 2 configurations:

Kernel: 2.6.32-504.3.3.el6.x86_64
LVM2:   lvm2-2.02.111-2.el6_6.1.x86_64                             
        lvm2-libs-2.02.111-2.el6_6.1.x86_64        

Kernel: 2.6.32-573.el6.x86_64
LVM2:   lvm2-2.02.118-2.el6.x86_64      
        lvm2-libs-2.02.118-2.el6.x86_64 

How reproducible: 
-----------------
Fairly regularly in the customer environment


Steps to Reproduce:
-------------------

- Commvault (the backup software) is backing up its internal database (repository)
- It is using lvm snapshot for this backup
- Initially the customer used 4 GB snapshots, which hung, most likely because the snapshot filled up
- They later got a procedure from the vendor for increasing the initial snapshot size to 8 GB
- In addition, Red Hat technical account management suggested that the customer use LVM snapshot autoextend:
     snapshot_autoextend_threshold = 70
     snapshot_autoextend_percent = 50
- As far as we can tell from the last sosreport, the snapshot was auto-extended (March 10 16:22), and hang messages start to appear in the messages file about half an hour after the extension.
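
For orientation, the arithmetic implied by the two autoextend settings above can be sketched as follows. This is an approximation under stated assumptions: the snapshot starts at 8 GB (8192 MiB) as in the vendor procedure, each extension adds snapshot_autoextend_percent of the current size once usage exceeds snapshot_autoextend_threshold, and rounding to physical extents is ignored. The actual snapshot size at the time of the March 10 extension is not stated in this report, and autoextend_plan is a hypothetical helper, not an LVM interface.

```python
def autoextend_plan(size_mib, threshold_pct, extend_pct, steps):
    """Approximate successive snapshot autoextend steps.

    Returns a list of (current_size, trigger_usage, new_size) tuples in MiB.
    Integer arithmetic only; LVM's rounding to extents is ignored.
    """
    plan = []
    for _ in range(steps):
        # Usage level (in MiB) above which an extension is requested.
        trigger_mib = size_mib * threshold_pct // 100
        # The extension is a percentage of the current snapshot size.
        new_size_mib = size_mib + size_mib * extend_pct // 100
        plan.append((size_mib, trigger_mib, new_size_mib))
        size_mib = new_size_mib
    return plan


if __name__ == "__main__":
    # Values from this report: threshold 70, percent 50; 8192 MiB is assumed.
    for size, trigger, new_size in autoextend_plan(8192, 70, 50, 2):
        print(f"{size} MiB snapshot: extend above ~{trigger} MiB used, "
              f"new size {new_size} MiB")
```

Under these assumptions the first extension would be requested at roughly 5.6 GB of COW usage and would grow the snapshot to 12 GB.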



Actual results:
---------------
The customer noticed that backup jobs had not been making progress for some time. He logged into the server and could not access the Commvault database file system.
Any operation on it would hang: ls, df, etc.

The affected volumes remain suspended and I/O to them stops.


Expected results:
-----------------
suspended LVOLs get resumed, no hang


Additional info:
----------------

2 crash dump images available. Details of their analysis will follow.

Comment 12 Zdenek Kabelac 2016-10-06 14:22:23 UTC
Moving to POST as issue is likely already resolved according to Comment 11.