Bug 559959 - kernel hang with LVM mirrored swap under heavy stress
Summary: kernel hang with LVM mirrored swap under heavy stress
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-01-29 13:30 UTC by starlight
Modified: 2012-06-20 15:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 15:58:50 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg output (21.19 KB, text/plain)
2010-01-29 13:30 UTC, starlight
vmstat output (1.67 KB, text/plain)
2010-01-29 13:55 UTC, starlight

Description starlight 2010-01-29 13:30:58 UTC
Created attachment 387563
dmesg output

Description of problem:

Complete system kernel hang with LVM mirrored swap space
under high stress.  Hugepage allocation is present.

Version-Release number of selected component (if applicable):

2.6.9-89.0.19.ELsmp

How reproducible:

Allocate all but 1GB of memory to the huge page pool.

Configure an LVM2 mirrored swap logical volume with
either a core log or a disk log residing on the
mirror drives.  A log on a third drive was not tried,
but the failure with a core log makes that seem pointless.

With ICC 11.2.064, compile several large programs with
'make -j2' (or a higher -j value equal to the core count).

# cat /proc/swaps
Filename              Type      Size    Used   Priority
/dev/mapper/vg00-lv01 partition 2097144 436164 -1      

$ cat /proc/meminfo
MemTotal:      4041068 kB
MemFree:         64740 kB
Buffers:          1436 kB
Cached:          37816 kB
SwapCached:     182020 kB
Active:         546040 kB
Inactive:        26312 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4041068 kB
LowFree:         64740 kB
SwapTotal:     2097144 kB
SwapFree:      1411000 kB
Dirty:             140 kB
Writeback:         292 kB
Mapped:         524396 kB
Slab:           183280 kB
CommitLimit:   2544812 kB
Committed_AS:  1071380 kB
PageTables:       4640 kB
VmallocTotal: 536870911 kB
VmallocUsed:      5604 kB
VmallocChunk: 536865127 kB
HugePages_Total:  1536
HugePages_Free:   1536
Hugepagesize:     2048 kB
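
As a quick sanity check on the figures above, the huge page pool size follows from HugePages_Total times Hugepagesize. The sketch below only redoes that arithmetic with values copied from the listing; the function and variable names are illustrative and not part of any tool.

```python
# Values copied from the /proc/meminfo listing in this report (all in kB).
MEM_TOTAL_KB = 4041068
HUGEPAGES_TOTAL = 1536
HUGEPAGE_SIZE_KB = 2048


def hugepage_pool_kb(pages, page_size_kb):
    """Memory reserved for the huge page pool, in kB."""
    return pages * page_size_kb


def remaining_kb(mem_total_kb, pool_kb):
    """Memory left for ordinary pages once the pool is reserved, in kB."""
    return mem_total_kb - pool_kb


pool_kb = hugepage_pool_kb(HUGEPAGES_TOTAL, HUGEPAGE_SIZE_KB)
left_kb = remaining_kb(MEM_TOTAL_KB, pool_kb)

print(f"huge page pool: {pool_kb} kB ({pool_kb / 1024**2:.2f} GiB)")
print(f"left for everything else: {left_kb} kB ({left_kb / 1024:.0f} MiB)")
```

The pool works out to 3 GiB, leaving somewhat under 1 GiB of the reported MemTotal for the kernel, page cache, and the compile jobs, which is why swapping starts quickly.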

Actual results:

System hangs shortly after swapping starts.  100% reproducible.

Expected results:

No hang.  The system works fine if the swap volume is
not mirrored.  Tried with swap located on each drive
individually and it works either way.

Additional info:

Tyan S2866-A2NRF
  Athlon 64 X2 Dual Core Processor 4800+ 
  Phoenix BIOS v6.00PG 8/23/06
Kingston KVR400X64C3AK2/2G
  2GB 400MHz DDR Non-ECC CL3 (3-3-3) DIMM (Kit of 2)
Corsair CMPSU-620HX

sda *SAMSUNG SP2504C [250GB]
sdb *WDC WD20EARS-00S8B1 [2TB]
sdc  Hitachi HDS721010KLA330 [1TB]
sdd  Hitachi HDS721010KLA330 [1TB]
* used for system and swap partitions

# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda2  vg00 lvm2 a-   232.81G 195.31G
  /dev/sdb2  vg00 lvm2 a-    93.00G  53.69G
  /dev/sdb3  vg01 lvm2 a-     1.73T      0
  /dev/sdc   vg02 lvm2 a-   931.50G      0
  /dev/sdd   vg02 lvm2 a-   931.50G      0

# lvs
  LV   VG   Attr   LSize   Origin Snap%  Move Log       Copy%  Convert
  lv00 vg00 mwi-ao   5.00G                    lv00_mlog 100.00
  lv01 vg00 -wi-ao   2.00G
  lv02 vg00 mwi-ao  32.06G                    lv02_mlog 100.00
  lv03 vg00 mwi-ao 256.00M                    lv03_mlog 100.00
  lv10 vg01 -wc-ao   1.73T
  lv20 vg02 -wc-ao   1.82T

# df -HP
Filesystem             Size   Used  Avail Use% Mounted on
/dev/mapper/vg00-lv00   5.3G   2.8G   2.3G  55% /
/dev/sda1               32M    11M    20M  36% /boot
/dev/mapper/vg00-lv03   260M    33M   228M  13% /d
/dev/mapper/vg00-lv02    34G    20G    15G  59% /w
/dev/mapper/vg01-lv10   1.9T   812G   1.1T  43% /ww
/dev/mapper/vg02-lv20   2.1T   806G   1.2T  41% /wx

# mount
/dev/mapper/vg00-lv00 on / type ext3 (rw,noreservation)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw,noreservation)
/dev/mapper/vg00-lv03 on /d type ext3 (rw,noreservation)
/dev/mapper/vg00-lv02 on /w type ext3 (rw,noreservation)
/dev/mapper/vg01-lv10 on /ww type ext3 (rw,noreservation)
/dev/mapper/vg02-lv20 on /wx type ext2 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

# fdisk -l

Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1           4       32098+  83  Linux
/dev/sda2               5       30401   244163902+  8e  Linux LVM

Disk /dev/sdb: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          13      104391   83  Linux
/dev/sdb2              14       12157    97546680   8e  Linux LVM
/dev/sdb3           12158      243201  1855860930   8e  Linux LVM

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdd doesn't contain a valid partition table

Comment 1 starlight 2010-01-29 13:55:35 UTC
Created attachment 387571
vmstat output

Representative vmstat output taken with the swap mirror
disabled (hang not exhibited).

Comment 2 starlight 2010-01-30 21:51:43 UTC
Mirroring the system logical volumes was done as an 
afterthought to upgrading a 500GB hard drive to the 2TB Western 
Digital Caviar Green WD20EARS.

The newer Caviar Green drives have a 4KB physical sector size 
with 512 byte sector size emulation.  Realized today that default 
'fdisk' partitions are misaligned so that two physical sectors 
must be accessed every time a 4KB ext3 access or page file access
occurs.

This misalignment could be a factor.

Planning to repartition the drive with 'parted' 1.9.0 as GPT and 
set up aligned partitions.  Will retry mirroring the swap LV to 
see if proper alignment corrects the issue (or re-submerges the
bug, whatever the case may be).
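
The misalignment can be checked by hand from the 'fdisk -l' listing in the description. A minimal sketch, assuming the classic DOS-compatible layout (255 heads, 63 sectors per track, first partition starting at sector 63, later partitions starting on cylinder boundaries) and a 4096-byte physical sector; the helper names are made up for illustration.

```python
LOGICAL_SECTOR_BYTES = 512        # emulated sector size reported by the drive
PHYSICAL_SECTOR_BYTES = 4096      # real sector size of the WD20EARS
SECTORS_PER_CYLINDER = 255 * 63   # 16065, from the 'fdisk -l' geometry


def start_sector(start_cylinder, first_partition=False):
    """Start sector of a partition under the assumed DOS-compatible layout."""
    if first_partition:
        return 63  # assumption: first partition skips the first track
    return (start_cylinder - 1) * SECTORS_PER_CYLINDER


def is_aligned(sector):
    """True if the byte offset of this sector is a multiple of 4096."""
    return (sector * LOGICAL_SECTOR_BYTES) % PHYSICAL_SECTOR_BYTES == 0


# Start cylinders copied from the /dev/sdb rows of the fdisk listing.
partitions = {
    "/dev/sdb1": start_sector(1, first_partition=True),
    "/dev/sdb2": start_sector(14),
    "/dev/sdb3": start_sector(12158),
}

for name, sector in partitions.items():
    state = "aligned" if is_aligned(sector) else "MISALIGNED"
    print(f"{name}: start sector {sector} -> {state}")
```

Because 16065 sectors per cylinder is an odd number, a cylinder-aligned start lands on a 4KB boundary only when the number of preceding cylinders is a multiple of 8. Starting each partition at a multiple of 8 logical sectors avoids the problem.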

Comment 3 starlight 2010-02-01 02:12:31 UTC
Correcting partition alignment improved drive performance 
dramatically.  As an example a 'lvconvert -m1' runs at 60MB/s 
instead of 10MB/s.

However the LVM mirrored swap hang is still present.

Without the swap LV mirror, everything works fine.

With the mirror, the system locks up under heavy Intel compiler 
memory paging and CPU load.

One difference is that, although all terminal sessions were hung 
for over an hour, the system continued to respond to ICMP echo 
and the ntp daemon continued to respond to remote 'ntpq -p' 
queries.  With misaligned partitions, all aspects of system 
operation froze.

Comment 4 Mike Snitzer 2010-02-01 15:16:45 UTC
(In reply to comment #3)
> Correcting partition alignment improved drive performance 
> dramatically.  As an example a 'lvconvert -m1' runs at 60MB/s 
> instead of 10MB/s.
> 
> However the LVM mirrored swap hang is still present.
> 
> No swap LV mirror--works fine.
> 
> With mirror system locks under heavy Intel compiler memory 
> paging and CPU load.
> 
> One difference is that, although all terminal sessions were hung 
> for over an hour, the system continued to respond to ICMP echo 
> and the ntp daemon continued to respond to remote 'ntpq -p' 
> queries.  With misaligned partitions, all aspects of system 
> operation froze.

Thanks for reporting.  This speaks to the DM raid1 code doing memory allocation in the writeback path.  Such allocations on a swap device are inherently prone to deadlock (nfs, iscsi, and other complex IO layers all have this risk).

The DM layer has been engineered in such a way that we attempt to avoid such deadlock (using mempools, GFP_NOIO, etc) but it would seem we're clearly missing something in DM's raid1 code.

Would you be willing to test swap on a normal "linear" LV?  Using a linear LV for swap should _not_ be cause for deadlock.  Using a mirrored swap is dubious at best, but it does illustrate that DM's raid1 code is making allocations in the writeback path that require further freeing of memory.
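
The deadlock described here can be illustrated with a toy model. This is not DM raid1 code and does not use any kernel API; it is a sketch that assumes every page written to swap needs one scratch page of bookkeeping memory while the write is in flight. All class and function names are hypothetical. A pre-allocated reserve stands in for the mempool idea mentioned above.

```python
class OutOfMemory(Exception):
    """Raised when the toy allocator has no free pages left."""


class ToyMemory:
    """A counter of free pages standing in for the page allocator."""

    def __init__(self, free_pages):
        self.free = free_pages

    def alloc(self):
        if self.free == 0:
            raise OutOfMemory()
        self.free -= 1

    def release(self):
        self.free += 1


def swap_out(memory, dirty_pages, reserve=0):
    """Write out dirty pages; return how many were written before stalling.

    Each write needs one scratch page. It comes from the pre-allocated
    reserve if one exists, otherwise from the general allocator.
    """
    pool = reserve  # pages set aside before memory pressure began
    written = 0
    for _ in range(dirty_pages):
        if pool > 0:
            pool -= 1
            from_pool = True
        else:
            try:
                memory.alloc()
            except OutOfMemory:
                return written  # stuck: freeing memory needs memory
            from_pool = False
        written += 1
        memory.release()  # the page just written out is now free
        if from_pool:
            pool += 1  # scratch page goes back to the reserve
        else:
            memory.release()  # scratch page goes back to the allocator
    return written


print("no reserve:  ", swap_out(ToyMemory(free_pages=0), dirty_pages=4))
print("with reserve:", swap_out(ToyMemory(free_pages=0), dirty_pages=4, reserve=1))
```

With no free pages and no reserve, the first write cannot obtain its scratch page, so nothing is ever freed. With a reserve of a single page, every write completes. An allocation in the writeback path that falls outside such a reserve reintroduces the first case.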

Comment 5 starlight 2010-02-01 16:55:28 UTC
>Would you be willing to test swap on a normal "linear" LV?

It appears I used incorrect terminology.  Where "No swap LV 
mirror" appears above, a "linear LV" was meant, as obtained 
with the 'lvconvert -m0' command.

In every test run a linear swap space was verified to work fine 
and a mirrored swap space determined to deadlock the kernel.  No 
mistaking the difference.  In the linear scenario the system 
remains responsive at all times and the compile completes 
successfully while in the mirrored scenario the system deadlocks 
and becomes unresponsive.  The compile job and other system 
configuration parameters are exactly identical.

So for now the swap volume has been left as a linear LV and 
everything works.  It would be nice to mirror the swap 
space so that a drive failure will not take out the system, 
though this particular situation is not production critical. 
Otherwise hardware or 'mdadm' mirroring would be used.  LVM 
mirrors are handy in our development environment for their 
flexibility and capacity for dynamic reconfiguration.

>using a mirrored swap is dubious at best

This is not something that seems obvious from a use-case 
perspective.  The idea behind mirroring is fault tolerance, and 
it seems likely that the failure of a drive holding a linear 
swap volume can crash a system.  So one would be inclined
to configure a mirror for swap LVs in addition to the
mirroring of system image volumes.

Comment 6 Jiri Pallich 2012-06-20 15:58:50 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

