Bug 559959
Summary: kernel hang with LVM mirrored swap under heavy stress

Field | Value | Field | Value
---|---|---|---
Product: | Red Hat Enterprise Linux 4 | Reporter: | starlight
Component: | kernel | Assignee: | Jonathan Earl Brassow <jbrassow>
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe>
Severity: | high | Docs Contact: |
Priority: | low | |
Version: | 4.8 | CC: | coughlan, heinzm, msnitzer
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-06-20 15:58:50 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Created attachment 387571 [details]
vmstat output

Representative vmstat output taken with the swap mirror disabled
(the hang was not exhibited).
Mirroring the system logical volumes was done as an afterthought to upgrading a 500 GB hard drive to the 2 TB Western Digital Caviar Green WD20EARS. The newer Caviar Green drives have a 4 KB physical sector size with 512-byte sector emulation. Realized today that the default 'fdisk' partitions are misaligned, so two physical sectors must be accessed on every 4 KB ext3 or page file access. This misalignment could be a factor. Planning to repartition the drive with 'parted' 1.9.0 as GPT and set up aligned partitions. Will retry mirroring the swap LV to see if proper alignment corrects the issue (or re-submerges the bug, whatever the case may be).

Correcting partition alignment improved drive performance dramatically. As an example, a 'lvconvert -m1' runs at 60 MB/s instead of 10 MB/s.

However, the LVM mirrored swap hang is still present.

No swap LV mirror--works fine.

With the mirror, the system locks up under heavy Intel compiler memory paging and CPU load.

One difference is that, although all terminal sessions were hung for over an hour, the system continued to respond to ICMP echo and the ntp daemon continued to respond to remote 'ntpq -p' queries. With misaligned partitions, all aspects of system operation froze.

(In reply to comment #3)
> Correcting partition alignment improved drive performance
> dramatically. As an example, a 'lvconvert -m1' runs at 60 MB/s
> instead of 10 MB/s.
>
> However, the LVM mirrored swap hang is still present.
>
> No swap LV mirror--works fine.
>
> With the mirror, the system locks up under heavy Intel compiler
> memory paging and CPU load.
>
> One difference is that, although all terminal sessions were hung
> for over an hour, the system continued to respond to ICMP echo
> and the ntp daemon continued to respond to remote 'ntpq -p'
> queries. With misaligned partitions, all aspects of system
> operation froze.

Thanks for reporting. This speaks to the DM raid1 code doing memory allocation in the writeback path.
Such allocations on a swap device are inherently prone to deadlock (nfs, iscsi, and other complex IO layers all share this risk). The DM layer has been engineered to avoid such deadlocks (using mempools, GFP_NOIO, etc.), but it would seem we're clearly missing something in DM's raid1 code.

Would you be willing to test swap on a normal "linear" LV? Using a linear LV for swap should _not_ be cause for deadlock. Using a mirrored swap is dubious at best, but it does illustrate the fact that the raid1 code is making allocations in the writeback path that require further freeing of memory.

> Would you be willing to test swap on a normal "linear" LV?

It appears I applied incorrect terminology. Where I said "no swap LV mirror" above, a "linear LV" was the intent, as obtained with the 'lvconvert -m0' command.

In every test run a linear swap space was verified to work fine and a mirrored swap space was determined to deadlock the kernel. No mistaking the difference: in the linear scenario the system remains responsive at all times and the compile completes successfully, while in the mirrored scenario the system deadlocks and becomes unresponsive. The compile job and other system configuration parameters are exactly identical.

So for now the swap volume has been left as a linear LV and everything works. It would be nice to mirror the swap space so that a drive failure will not take out the system, though this particular situation is not production critical; otherwise hardware or 'mdadm' mirroring would be used. LVM mirrors are handy in our development environment for their flexibility and capacity for dynamic reconfiguration.

> Using a mirrored swap is dubious at best

This is not something that seems obvious from a use-case perspective. The idea behind mirroring is fault tolerance, and it seems likely that the failure of a drive holding a linear swap volume can crash a system.
So one would be inclined to configure a mirror for swap LVs in addition to mirroring the system image volumes.

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.
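For reference, the linear-versus-mirrored toggling described in the thread corresponds to commands along these lines. This is a hypothetical sketch, not a transcript: the VG/LV names (vg00/lv01) come from the /proc/swaps output in the report, while the --corelog flag and the swapoff/swapon steps are assumptions about the procedure.

```
# Hypothetical sketch of the test procedure; requires root and a real VG.
swapoff /dev/vg00/lv01              # stop swapping on the LV first
lvconvert -m1 --corelog vg00/lv01   # mirrored swap: deadlocks under load
# ...run the ICC compile workload, observe the hang...
lvconvert -m0 vg00/lv01             # back to linear: works fine
swapon /dev/vg00/lv01
```

With a disk log, the --corelog flag would be dropped; the report states the hang occurs with either log type.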
Created attachment 387563 [details]
dmesg output

Description of problem:

Complete system kernel hang with LVM mirrored swap space under high stress. Hugepage allocation is present.

Version-Release number of selected component (if applicable):

2.6.9-89.0.19.ELsmp

How reproducible:

Allocate all but 1GB of memory to the huge page pool. Configure an LVM2 mirrored swap logical volume with either a core log or a disk log residing on the mirror drives. (Didn't try with a third drive, but the core log failure makes that seem pointless.) With ICC 11.2.064, compile several large programs with 'make -j2' (or a higher -j set equal to the core count).

# cat /proc/swaps
Filename                  Type       Size     Used    Priority
/dev/mapper/vg00-lv01     partition  2097144  436164  -1

$ cat /proc/meminfo
MemTotal:      4041068 kB
MemFree:         64740 kB
Buffers:          1436 kB
Cached:          37816 kB
SwapCached:     182020 kB
Active:         546040 kB
Inactive:        26312 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4041068 kB
LowFree:         64740 kB
SwapTotal:     2097144 kB
SwapFree:      1411000 kB
Dirty:             140 kB
Writeback:         292 kB
Mapped:         524396 kB
Slab:           183280 kB
CommitLimit:   2544812 kB
Committed_AS:  1071380 kB
PageTables:       4640 kB
VmallocTotal: 536870911 kB
VmallocUsed:      5604 kB
VmallocChunk: 536865127 kB
HugePages_Total:  1536
HugePages_Free:   1536
Hugepagesize:     2048 kB

Actual results:

System hangs shortly after swapping starts. 100% reproducible.

Expected results:

Works fine if the swap volume is not mirrored. Tried with swap located on each drive and it works either way.
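As a cross-check of the /proc/meminfo numbers above: the hugepage pool consumes almost all of the 4 GB of RAM, leaving roughly the "all but 1GB" mentioned in the reproduction recipe. A back-of-the-envelope sketch using the reported values:

```shell
# Cross-check the hugepage reservation from the /proc/meminfo output above.
hugepages=1536        # HugePages_Total
hugepagesize_kb=2048  # Hugepagesize
memtotal_kb=4041068   # MemTotal

reserved_kb=$(( hugepages * hugepagesize_kb ))
remaining_kb=$(( memtotal_kb - reserved_kb ))
echo "hugepage pool: ${reserved_kb} kB, left for everything else: ${remaining_kb} kB"
# → hugepage pool: 3145728 kB, left for everything else: 895340 kB
```

Roughly 874 MB remains pageable, which is why heavy compiler jobs push the system into swap almost immediately.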
Additional info:

Tyan S2866-A2NRF
Athlon 64 X2 Dual Core Processor 4800+
Phoenix BIOS v6.00PG 8/23/06
Kingston KVR400X64C3AK2/2G 2GB 400MHz DDR Non-ECC CL3 (3-3-3) DIMM (Kit of 2)
Corsair CMPSU-620HX

sda *SAMSUNG SP2504C [250GB]
sdb *WDC WD20EARS-00S8B1 [2TB]
sdc  Hitachi HDS721010KLA330 [500GB]
sdd  Hitachi HDS721010KLA330 [500GB]
* used for system and swap partitions

# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda2  vg00 lvm2 a-   232.81G 195.31G
  /dev/sdb2  vg00 lvm2 a-    93.00G  53.69G
  /dev/sdb3  vg01 lvm2 a-     1.73T       0
  /dev/sdc   vg02 lvm2 a-   931.50G       0
  /dev/sdd   vg02 lvm2 a-   931.50G       0

# lvs
  LV   VG   Attr   LSize   Origin Snap% Move Log       Copy% Convert
  lv00 vg00 mwi-ao   5.00G                   lv00_mlog 100.00
  lv01 vg00 -wi-ao   2.00G
  lv02 vg00 mwi-ao  32.06G                   lv02_mlog 100.00
  lv03 vg00 mwi-ao 256.00M                   lv03_mlog 100.00
  lv10 vg01 -wc-ao   1.73T
  lv20 vg02 -wc-ao   1.82T

# df -HP
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vg00-lv00  5.3G  2.8G  2.3G  55% /
/dev/sda1               32M   11M   20M  36% /boot
/dev/mapper/vg00-lv03  260M   33M  228M  13% /d
/dev/mapper/vg00-lv02   34G   20G   15G  59% /w
/dev/mapper/vg01-lv10  1.9T  812G  1.1T  43% /ww
/dev/mapper/vg02-lv20  2.1T  806G  1.2T  41% /wx

# mount
/dev/mapper/vg00-lv00 on / type ext3 (rw,noreservation)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw,noreservation)
/dev/mapper/vg00-lv03 on /d type ext3 (rw,noreservation)
/dev/mapper/vg00-lv02 on /w type ext3 (rw,noreservation)
/dev/mapper/vg01-lv10 on /ww type ext3 (rw,noreservation)
/dev/mapper/vg02-lv20 on /wx type ext2 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

# fdisk -l
Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot  Start    End      Blocks  Id System
/dev/sda1   *       1      4      32098+  83 Linux
/dev/sda2           5  30401  244163902+  8e Linux LVM

Disk /dev/sdb: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot  Start     End      Blocks  Id System
/dev/sdb1           1      13      104391  83 Linux
/dev/sdb2          14   12157    97546680  8e Linux LVM
/dev/sdb3       12158  243201  1855860930  8e Linux LVM

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table
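The fdisk output above uses legacy CHS units; DOS-style partitioning conventionally starts the first partition at logical sector 63, which is not a multiple of the WD20EARS's 4 KiB physical sector. The misalignment can be verified with a little arithmetic (a sketch: sector 63 is the conventional fdisk default, sector 2048 the 1 MiB-aligned start that GPT partitioning tools such as parted produce, neither read from the actual device):

```shell
# Check whether a partition's start is aligned to a 4 KiB physical sector.
logical=512      # logical (emulated) sector size in bytes
physical=4096    # physical sector size of the drive

for start_sector in 63 2048; do
    offset=$(( start_sector * logical % physical ))
    if [ "$offset" -eq 0 ]; then
        echo "sector $start_sector: aligned"
    else
        echo "sector $start_sector: misaligned by $offset bytes"
    fi
done
# → sector 63: misaligned by 3584 bytes
# → sector 2048: aligned
```

Every misaligned 4 KiB write forces the drive into a read-modify-write of two physical sectors, which is consistent with the 10 MB/s versus 60 MB/s 'lvconvert' throughput difference reported in the thread.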