Bug 845048

Summary: LVM startup hangs when starting snapshot
Product: Red Hat Enterprise Linux 6
Reporter: Nate Straz <nstraz>
Component: lvm2
Assignee: LVM and device-mapper development team <lvm-team>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 6.3
CC: agk, dwysocha, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, prockai, rpeterso, thornber, webmaster, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ---
Doc Type: Bug Fix
Last Closed: 2015-08-24 08:49:37 UTC
Type: Bug
Regression: ---
Mount Type: ---

Description Nate Straz 2012-08-01 15:20:20 UTC
Description of problem:

We have a system on a SAN which hangs during boot with a fresh install of RHEL6.3.  On the SAN is a volume group with a snapshot.  The backtrace of the blocked process points to activating this LV.

Volumes found by dracut:

dracut: Scanning devices sda2 sdc sdc1 sdd1  for LVM logical volumes vg_roth06/lv_swap vg_roth06/lv_root 
dracut: inactive '/dev/roth_vg/roth_lv' [100.00 GiB] inherit
dracut: inactive Snapshot '/dev/roth_vg/roth_sparse' [500.00 GiB] inherit
dracut: inactive '/dev/vg_roth06/lv_root' [50.00 GiB] inherit
dracut: inactive '/dev/vg_roth06/lv_home' [20.07 GiB] inherit
dracut: inactive '/dev/vg_roth06/lv_swap' [3.94 GiB] inherit
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: 
dracut: Mounted root filesystem /dev/mapper/vg_roth06-lv_root
dracut: Loading SELinux policy

Startup messages and sysrq-W:

Setting hostname roth-06.lab.msp.redhat.com:  [  OK  ]
Setting up Logical Volume Management: [halt sent]
SysRq : Show Blocked State
  task                        PC stack   pid father
lvm           D 0000000000000001     0   922    484 0x00000000
 ffff880037e5d9f8 0000000000000082 0000000000000000 ffffffffa0009eba
 ffff880037e5d968 ffffffffa0009a90 ffffffffa0009a00 000000008109cd39
 ffff88007d3b7ab8 ffff880037e5dfd8 000000000000fb88 ffff88007d3b7ab8
Call Trace:
 [<ffffffffa0009eba>] ? dispatch_io+0x28a/0x2c0 [dm_mod]
 [<ffffffffa0009a90>] ? vm_get_page+0x0/0x70 [dm_mod]
 [<ffffffffa0009a00>] ? vm_next_page+0x0/0x30 [dm_mod]
 [<ffffffff814fdfc3>] io_schedule+0x73/0xc0
 [<ffffffffa0009f85>] sync_io+0x95/0x110 [dm_mod]
 [<ffffffff81054c30>] ? __dequeue_entity+0x30/0x50
 [<ffffffffa000a227>] dm_io+0x1b7/0x1c0 [dm_mod]
 [<ffffffffa0009a90>] ? vm_get_page+0x0/0x70 [dm_mod]
 [<ffffffffa0009a00>] ? vm_next_page+0x0/0x30 [dm_mod]
 [<ffffffffa041eba1>] chunk_io+0x111/0x120 [dm_snapshot]
 [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa041baf0>] ? dm_add_exception+0x0/0x60 [dm_snapshot]
 [<ffffffffa041ec76>] area_io+0x26/0x30 [dm_snapshot]
 [<ffffffffa041f2c2>] persistent_read_metadata+0x222/0x3a0 [dm_snapshot]
 [<ffffffffa041dd5e>] snapshot_ctr+0x6fe/0x800 [dm_snapshot]
 [<ffffffffa0005f2f>] dm_table_add_target+0x13f/0x3b0 [dm_mod]
 [<ffffffffa00085b9>] table_load+0xc9/0x340 [dm_mod]
 [<ffffffffa0009514>] ctl_ioctl+0x1b4/0x270 [dm_mod]
 [<ffffffffa00084f0>] ? table_load+0x0/0x340 [dm_mod]
 [<ffffffffa00095e3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
 [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0
 [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580
 [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


Version-Release number of selected component (if applicable):
kernel-2.6.32-279.el6
lvm2-libs-2.02.95-10.el6.x86_64
lvm2-2.02.95-10.el6.x86_64
lvm2-cluster-2.02.95-10.el6.x86_64


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Alasdair Kergon 2012-08-01 16:14:22 UTC
This is one where I think we may need to examine the actual Logical Volume contents.

IOW whichever LV holds the COW area for the snapshot might need to be checked.

- Is the device completely readable?
- Is there any corruption?
(A read-only version based on http://people.gnome.org/~markmc/code/merge-dm-snapshot.c perhaps.)
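
For the first question (basic readability), one simple check is just to read the whole device and watch for I/O errors - a minimal sketch only, with a hypothetical device path; substitute whichever PV or LV actually holds the COW area:

# read every block once and discard the data; an I/O error or a stall points at a bad region
dd if=/dev/sdX of=/dev/null bs=1M iflag=direct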

Comment 3 Nate Straz 2012-08-01 19:14:43 UTC
(In reply to comment #2)
> - Is the device completely readable?

I've seen no indications that the device is having issues since it's being used by another node on the SAN for a test.

> - Is there any corruption?
> (A read-only version based on
> http://people.gnome.org/~markmc/code/merge-dm-snapshot.c perhaps.)

I'm running that now with the call to merge_chunk() commented out.  There aren't many checks in the code for corruption.  So far the output looks sane, but the COW device is 500G (2T virtual size).

Comment 4 Nate Straz 2012-08-01 21:06:33 UTC
I reinstalled the hanging system with RHEL6.2 and that system hangs for a while, but eventually continues.

Going back to RHEL6.3, the system hangs for much longer.  After a while sysrq-W doesn't produce a backtrace anymore; it shows the lvm command as runnable and using a lot of CPU time.

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R            lvm   926    200754.808642    118199   102    200754.808642   3170374.196015    620922.885992 /


I am able to recreate this on another system to a certain extent.  I think the length of the hang is proportional to the amount of space used in the COW volume.

This is the original volume that causes the hang:

  LV          VG        Attr   LSize   Origin                Snap%  
  roth_sparse roth_vg   swi-a- 500.00G [roth_sparse_vorigin]  25.45

Comment 5 Olivier D 2012-08-13 09:40:56 UTC
I think this bug is the same one I encountered.
I detailed what was happening on the linux-lvm mailing list:
https://www.redhat.com/archives/linux-lvm/2012-August/msg00018.html
There is additional information for this bug report, such as:
- Stall is with vgchange -a y
- Last line before hang is 
#ioctl/libdm-iface.c:1628         dm reload   (253:5) NF   [16384]

Volume is read, but very slowly.

Comment 6 Zdenek Kabelac 2012-08-30 13:07:21 UTC
(In reply to comment #5)
> I think this bug is the same one I encountered.
> I detailed what was happening on the linux-lvm mailing list:
> https://www.redhat.com/archives/linux-lvm/2012-August/msg00018.html
> There is additional information for this bug report, such as:
> - Stall is with vgchange -a y
> - Last line before hang is 
> #ioctl/libdm-iface.c:1628         dm reload   (253:5) NF   [16384]
> 
> Volume is read, but very slowly.


lvs shows volumes correctly  :
  LV                    VG     Attr     LSize   Pool Origin
Data%  Move Log Copy%  Convert
  backup_filer          backup owi-i-s-   4.00t
  backup_filer_20120611 backup swi-a- 450.00g backup_filer  63.40
  backup_filer_20120703 backup swi-a- 400.00g backup_filer  31.87
  backup_filer_20120804 backup swi-a- 400.00g backup_filer   0.89


Hmmm - using 283GB of exceptions - this is not really going to work well with old snapshots. The exception list must take a lot of your RAM, and reading the list of exception blocks from disk must be very slow (though 1 hour looks weird?? - unless a lot of swapping was involved - how much memory do you have?)

Old-style snapshots were not designed for multi-GB lists of exceptions - and you have 3 such backups....

My best advice here is to use old snapshots the way they were designed - i.e. as short-term, small, static snapshots of a filesystem.

If you want long-term backups, I suggest experimenting with thin provisioning, which handles such workloads much more efficiently (on the other hand, it may have numerous problems, especially in recovery scenarios).

Comment 7 Zdenek Kabelac 2012-08-30 13:12:53 UTC
> runnable tasks:
>             task   PID         tree-key  switches  prio     exec-runtime    
> sum-exec        sum-sleep
> -----------------------------------------------------------------------------
> -----------------------------
> R            lvm   926    200754.808642    118199   102    200754.808642  
> 3170374.196015    620922.885992 /
> 
> 
> I am able to recreate this on another system to a certain extent.  I think
> the length of the hang is proportional to the amount of space used in the COW
> volume.
> 
> This is the original volume that causes the hang:
> 
>   LV          VG        Attr   LSize   Origin                Snap%  
>   roth_sparse roth_vg   swi-a- 500.00G [roth_sparse_vorigin]  25.45


Hmm, also a lot of exceptions - 125GB.

How much RAM do you have in your system?

I'm afraid using such a huge snapshot is not going to give good performance.

Comment 8 Nate Straz 2012-08-30 13:18:27 UTC
(In reply to comment #7)
> >   LV          VG        Attr   LSize   Origin                Snap%  
> >   roth_sparse roth_vg   swi-a- 500.00G [roth_sparse_vorigin]  25.45
> 
> Hmm, also a lot of exceptions - 125GB.
> 
> How much RAM do you have in your system?
> 
> I'm afraid using such a huge snapshot is not going to give good performance.

I think the system we were having trouble with only has 2GB of RAM.  We were using a sparse device, which I believe uses the same copy-on-write layer underneath as a snapshot does.

Comment 9 Olivier D 2012-08-30 15:50:09 UTC
We have 16GB of memory on the server. As I understand it, this hang is not a bug, but the result of how snapshots are implemented today, right?
I understand that the exception list is stored in memory, is that correct?

You talk about "old-style snapshots". What is the new style? Using thin provisioning? But the tools are far from ready (forced to use dmsetup), right?

Comment 10 Zdenek Kabelac 2012-08-31 09:06:26 UTC
(In reply to comment #9)
> We have 16GB of memory on the server. As I understand it, this hang is not a
> bug, but the result of how snapshots are implemented today, right?
> I understand that the exception list is stored in memory, is that correct?

The old snapshots were designed to keep GBs, but not hundreds of GBs.


> 
> You talk about "old-style snapshots". What is the new style? Using thin
> provisioning? But the tools are far from ready (forced to use dmsetup), right?

If you have sufficiently new lvm tools, you should be able to create a thin pool, thin volumes, and thin snapshots. But yes, for now it's considered experimental (especially since recovery when something goes wrong is not addressed well yet).

Basic usage goes like this - more details in the man pages.

lvcreate -T vg/thinpoolname -Lsize
lvcreate -Vthinvolsize -n thinvolname  -T vg/thinpoolname
lvcreate -s  -n thinsnapname  vg/thinvolname
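
For example (hypothetical names and sizes, purely to illustrate the option order):

lvcreate -T backup/pool -L 500G                # create a 500G thin pool 'pool' in VG 'backup'
lvcreate -V 4T -n filer_thin -T backup/pool    # create a 4T thin volume backed by that pool
lvcreate -s -n filer_snap backup/filer_thin    # create a thin snapshot of the thin volume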

Comment 11 Zdenek Kabelac 2012-10-19 08:29:39 UTC
Do we have any regression here in terms of the 'startup hang' compared to some earlier version?  If not, I'm going to close this BZ, since we cannot dramatically improve multiple snapshots XXX GB in size - it's annoyingly slow with old-style snapshots, but they were never designed for such operations.

Comment 12 Marian Csontos 2012-11-09 11:16:40 UTC
Is this a documentation bug?

The memory implications, though "obvious", are not documented anywhere.

We should document a sane guideline for expected maximum snapshot size / chunksize vs. available memory.

Workaround: use lvconvert to increase the chunksize of existing snapshots, and use a much larger chunksize (1024) when creating snapshots. Cons: negatively impacts write performance until the snapshot "stabilizes" and stops growing.
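
For the creation-time half of the workaround, a sketch with hypothetical names (exact chunk-size limits and accepted units depend on the lvcreate version; check the man page):

# create the snapshot with a much larger chunk size than the 4KiB default
lvcreate -s -n mysnap -L 400G -c 512k vg/origin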

Comment 13 Marian Csontos 2012-11-09 12:45:45 UTC
Tried it out, and though the man page suggests it might be possible, it does not work for existing snapshots, only when converting a normal LV into one.

Comment 14 Alasdair Kergon 2012-11-13 13:46:07 UTC
1. To activate an old-style snapshot, a significant fraction of the device has to be read and processed (every Nth chunk) to generate the block index, which is stored entirely in memory (see the rough estimate after this list).
2. You can't change the chunk size of an existing snapshot because the on-disk layout depends on the chunk size, so the whole device would have to be rewritten.
3. We could add a sentence to the lvcreate man page about the limitations.
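
A rough back-of-the-envelope for point 1, using the 125GB of exceptions from comment #7. Both inputs below are assumptions, not measured values: the default 4 KiB chunk size, and roughly 32 bytes per in-kernel exception record.

  125 GiB / 4 KiB per chunk       ~= 33 million exceptions to read and index at activation
  33 million records x ~32 bytes  ~= ~1 GiB of kernel memory for the index alone

On a machine with 2GB of RAM this alone would account for much of the long activation time (and the swapping suspicion in comment #6).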