Bug 948001
| Field | Value |
|---|---|
| Summary | lvm thin pool will hang system when full |
| Product | Red Hat Enterprise Linux 6 |
| Component | lvm2 |
| Version | 6.4 |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Reporter | Milos Vyletel <milos.vyletel> |
| Assignee | Zdenek Kabelac <zkabelac> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | agk, dsulliva, dustymabe, dwysocha, heinzm, jbrassow, jkulesa, msnitzer, nperic, prajnoha, prockai, rmarti, thornber, zkabelac |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1059771 (view as bug list) |
| Bug Blocks | 994246, 1056252 |
| Last Closed | 2014-04-30 13:47:06 UTC |
| Attachments | 888625: 3e1a0699095803e53072699a4a1485af7744601d upstream commit (patch) |
Description (Milos Vyletel, 2013-04-03 18:24:11 UTC)
There is another way: add/extend a new PV in the VG and resize the thin pool volume to a bigger size (a rough sketch of that recovery path is shown below), but yes, we want to improve policy handling. Thanks.

I had plenty of space available in the VG; I could have created or resized the pool at a bigger size, but the point is that even though I started at a small size (128M) with a virtual size of 12G, the system hang is not cool. Here's how I got to this point. We have an in-house GUI that talks to libvirt and creates guests. When creating a guest we had been using the capacity XML tag to pass the size to libvirt, with the allocation XML tag set to 0, because the LV pool in libvirt did not support sparse volumes. RHEL 6.4, however, adds sparse LV support, so by using an allocation of 0 we tell libvirt to create a sparse volume. This is obviously something we will fix in our code, but I went ahead and started to test it. The default snapshot-based sparse LVs are useless: once they are filled to 100%, the snapshot is, by design, invalidated and all data are gone. Since we create the snapshot at 32M (1 extent) in size, we fill it up very quickly, because dmeventd is not able to keep up with the autoextend. That led me to thin volumes, where we see the same behaviour (dmeventd not able to resize fast enough), and once we fill the thin volume the system hangs.

Seems you need to be more conservative with the amount of space you add to the thin pool. I understand your concern, but you're not using the thin-pool device as designed. What exactly are you saying you want to happen? I understand that "system hang is not cool", but it isn't a system hang; it is a hang of IO being issued to the thin pool. Now, if this leads to a system-wide deadlock due to interdependent writeback needed to free memory in the VM (as in the memory-management VM ;) then yes, that certainly isn't cool. Again, what would be the ideal response you're looking for? Do you just want the thin pool's metadata to transition to read-only mode? That means writes will fail with -EIO.

Ideally I would like to see behavior similar (or identical) to regular LVs. While we're running out of space and the currently allocated size is <= the virtual size, I would expect delayed IO or EBUSY until the pool is enlarged. When we hit the virtual size boundary, I would expect an ENOSPC error while the LV stays mounted read-write, so that one can clean up when necessary.

Agreed with Milos. If we encounter a situation where we inadvertently run out of space, I can see "hanging" being an answer until the pool is expanded; however, most folks would expect an ENOSPC error to kick back. Indeed, it's hard to unwedge the situation, since the LVM tools also lock up when they stat the thin volume that's hung. Once you run a pool out of space, you are SOL. That's not Enterprise Linux, sorry :-)
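For reference, the manual recovery path mentioned at the top of this thread (grow the VG, then grow the pool) would look roughly like the following. This is a minimal sketch, not taken from this report; the device and the vg/thinpool names are hypothetical, and exact option support varies by lvm2 version.

```
# Hypothetical names: /dev/sdb1, vg, thinpool (not from this report)

# Add a new physical volume to the volume group backing the pool
pvcreate /dev/sdb1
vgextend vg /dev/sdb1

# Grow the thin pool's data area so queued IO can complete
lvextend -L +10G vg/thinpool

# If the pool's metadata is also close to full, grow it too
lvextend --poolmetadatasize +128M vg/thinpool
```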
I was excited about thin volumes until I discovered this. Pretty much unusable until this is fixed.
For example, here is an strace of the lvs command:
```
ioctl(3, DM_TABLE_STATUS, 0x1813fa0) = 0
stat("/dev/vg-local-test01/thin", {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 5), ...}) = 0
open("/dev/vg-local-test01/thin", O_RDONLY|O_DIRECT|O_NOATIME
[hang]
```
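The autoextend behaviour that dmeventd reportedly could not keep up with is controlled by the thin pool monitoring settings in lvm.conf. The snippet below is a generic illustration with example values, not the reporter's configuration; monitoring must also be enabled for dmeventd to act on these thresholds.

```
# /etc/lvm/lvm.conf -- example values only, not from this report
activation {
    # dmeventd must be monitoring the pool for autoextend to trigger
    monitoring = 1
    # Begin extending once the pool's data usage crosses 70%...
    thin_pool_autoextend_threshold = 70
    # ...growing the pool by 20% of its current size each time
    thin_pool_autoextend_percent = 20
}
```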
Hi, I've noticed that this BZ was moved to needinfo. I'm not sure whether I should provide any additional information, but I've taken a look at the code and found an upstream commit (3e1a0699095803e53072699a4a1485af7744601d) that seems to enhance error handling in this particular case. I hope I'll find some time to test it in the coming days. I'm attaching the patch.

Created attachment 888625 [details]
3e1a0699095803e53072699a4a1485af7744601d upstream commit
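As an aside not stated in this report: newer lvm2 releases expose a per-pool choice between queueing and erroring writes when the pool's data space is exhausted, which addresses the queue-forever behaviour discussed above. Availability on RHEL 6 is not guaranteed, and the vg/thinpool names below are hypothetical.

```
# Show whether the pool queues or errors IO when full (field: lv_when_full)
lvs -o lv_name,lv_when_full vg/thinpool

# Fail writes immediately with an IO error instead of queueing them
lvchange --errorwhenfull y vg/thinpool
```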
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.