Bug 114675
Summary: | LVM snapshots fail on machine with 8GB RAM | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Andrew Rechenberg <arechenberg> |
Component: | kernel | Assignee: | Heinz Mauelshagen <heinzm> |
Status: | CLOSED CANTFIX | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | agk, coldwell, coughlan, cperry, dustin.tennill, dwysocha, hugh_caley, jkeating, mbroz, petrides, sct, tao |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-09-12 15:37:12 UTC | Type: | --- |
Description
Andrew Rechenberg
2004-01-31 00:38:34 UTC
Created attachment 97381
Kernel BUG when creating snapshots with 2.4.18-27
Works just fine for me on an 8GB quad-CPU system using 2.4.21-4.EL. I'm wondering if there are other modules loaded that might be gobbling up the vmalloc space if we have >8GB memory. What modules are present on the systems where you see the problems?

Background: it is the allocation of the snapshot's exception table which fails in vmalloc(). We need more input, because this is not reproducible on Stephen's 8GB box.

Created attachment 97554
lsmod and lspci for server in question

Here is lsmod and lspci for the server in question. I wouldn't be surprised if the megaraid2 module was the culprit. We've had fits with the Dell PERC3 cards and have stopped using them for data arrays in favor of Linux software RAID.
FYI, we still use the Dell PERC cards for the OS drives. That's why the megaraid2 module is being loaded.

This duplicates for me on a 3ware based system with 8 gigs of RAM. I'm using a test kernel based on 2.4.21-9.EL called 2.4.21-9.EL.noaffine2smp that dledford put together. It's exactly the same kernel but without the affine patch.

    [root@kickstation-120 root]# lvcreate --size 10M --name snapper -s /dev/test/foo
    lvcreate -- rounding size up to physical extent boundary
    lvcreate -- WARNING: the snapshot will be automatically disabled once it gets full
    lvcreate -- INFO: using default snapshot chunk size of 64 KB for "/dev/test/snapper"
    lvcreate -- ERROR "Cannot allocate memory" creating VGDA for "/dev/test/snapper" in kernel

2.4.21-9.EL works perfectly for me, but I've been using the hugemem kernel. I'll try again with the smp one. Andrew, which exact kernel variant were you using?

Nope, 2.4.21-9.EL-smp works just as well as -hugemem. Jesse, what other modules do you have loaded?

FYI, I managed to provoke this one on a 1.5GB RAM dual Athlon system with QLogic 2200 cards once on mainline 2.4. Not reproducible after a reboot. My guess is that this is a general allocation problem on Intel and compatible architectures.

Idea: could folks who can provoke that behaviour try loading the dummy module (code below) directly after the snapshot allocation failure? It just vmallocs and frees memory, so we can see whether that fails (change the size value while testing).

    /*
     * Copyright (C) 2004 Red Hat, Inc.
     *
     * This file is released under the GPL.
     */

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/vmalloc.h>

    /* Change for testing */
    size_t size = 1024 * 1024;

    int __init dummy_init(void)
    {
        /* Try the allocation and release it again; the result is not reported. */
        void *p = vmalloc(size);

        if (p)
            vfree(p);

        return 0;
    }

    void __exit dummy_exit(void)
    {
    }

    module_init(dummy_init);
    module_exit(dummy_exit);

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Dummy vmalloc() test module");
    MODULE_AUTHOR("Heinz Mauelshagen <Mauelshagen>");

Stephen, I'm using the -hugemem kernel.

How large should we change the size variable?

That depends on the size of the snapshot you tried to create. lvm-snap.c allocates space in lvm_snapshot_alloc() for kiovec etc., and the vmalloc for the exception table hash, which is the one failing, happens as the last step. For your 1MB snapshot, 1MB is more than sufficient.

So if I interpret your comment correctly, then size = 1024 * 1024 should be fine? If this is the case, the module code above loads just fine. I've tried it up to 1024 * 32768 and it's loaded just fine.

Yes, you interpreted correctly. Wondering if we've got a bug leading to vmalloc(0). Can you printk() the vmalloc() size value in lvm-snap.c and try to provoke the failure again, please? If it is 0, we can drill it down from there. BTW: creating a larger snapshot than 1MB might work.

I've tried creating a number of snapshot sizes from 1M to 5G and they all receive the same error. The only reason I posted the 1M above is that, since I was receiving a vmalloc error, I tried making the snapshot size small. I will modify lvm-snap.c, try the printk() and post the results. Thanks for your help.

Someone confiscated my test hardware so it may be about 3-5 days before I can re-test the modules. Just wanted to update. Thanks again.

I added the following at line 567 in lvm-snap.c (inside lvm_snapshot_alloc_hash_table):

    printk("LVM snap size: %d\n", size);

This is what was returned in dmesg:

    LVM snap size: 0

So the size value (if I did what you wanted me to do correctly) is 0. Let me know what's next.
Andrew, can you please try this patch, which fixes a wrong sector calculation in lvm_snapshot_alloc:

    --- linux-2.4.21/drivers/md/lvm-snap.c.orig 2004-03-10 11:54:01.000000000 +0100
    +++ linux-2.4.21/drivers/md/lvm-snap.c 2004-03-10 11:54:37.000000000 +0100
    @@ -583,15 +583,14 @@
     
     int lvm_snapshot_alloc(lv_t * lv_snap)
     {
    -    int ret, max_sectors;
    +    int ret;
     
         /* allocate kiovec to do chunk io */
         ret = alloc_kiovec(1, &lv_snap->lv_iobuf);
         if (ret) goto out;
     
    -    max_sectors = KIO_MAX_SECTORS << (PAGE_SHIFT-9);
    -
    -    ret = lvm_snapshot_alloc_iobuf_pages(lv_snap->lv_iobuf, max_sectors);
    +    ret = lvm_snapshot_alloc_iobuf_pages(lv_snap->lv_iobuf,
    +                                         KIO_MAX_SECTORS);
         if (ret) goto out_free_kiovec;
     
         /* allocate kiovec to do exception table io */

I tried to apply the patch with patch -p1 < linux-2.4.21-lvm-snap.patch and it was rejected for some reason (this is against the kernel-source RPM). I manually patched lvm-snap.c and I receive the same "Cannot allocate memory" error. BTW, the printk still returns 0. Here is the exact output from the lvcreate command:

    [root@cinshrinft1 ~]# !lvcreate
    lvcreate -s -L1M -n snap /dev/vg00/lv00
    lvcreate -- rounding size up to physical extent boundary
    lvcreate -- WARNING: the snapshot will be automatically disabled once it gets full
    lvcreate -- INFO: using default snapshot chunk size of 64 KB for "/dev/vg00/snap"
    lvcreate -- ERROR "Cannot allocate memory" creating VGDA for "/dev/vg00/snap" in kernel

I'm getting the same error on a Dell 6450 with 16 gig of RAM. The size of the snapshot doesn't seem to matter.

I'm sorry, typo, it's a 6650.

I can confirm that adding mem=16000M to my kernel parameters and rebooting allowed me to take a snapshot. Dell 6650, Megaraid card.

A fix for this problem has just been committed to the RHEL3 U3 patch pool (in kernel version 2.4.21-15.1.EL).

I know it will be ready "when it's ready," but is there a timeline for the release of 2.4.21-15.1.EL? Will it possibly be included in an erratum, or will it just be released in the U3 update? Can you give me an idea of what the root cause was, as well as what needed to be patched to correct the problem? Also, will there be a specific patch (or patches) in the kernel SRPM that refers to this issue so that it can be backported if necessary? I have some boxes not on RHEL3 and I would still like to have this issue resolved on those boxes as well. Thanks for your help, Andy.

Hello, Andrew. That fix went into the first build of RHEL3 U3, and thus it will be a few months before U3 is officially released (since we're just at the beginning of the 3rd Update cycle). It's not likely to be released in a security erratum. There won't be a separate patch in the kernel-source* RPM that solely contains this fix. (I incorporated the fix into linux-2.4.20-lvm-updates.patch.) But if you wish to build your own (unsupported) kernels from source, I will append the fix to this bugzilla. Did I mention that the resultant kernel would be unsupported in this case? :) Cheers. -ernie

Created attachment 99967
fix max sectors argument in lvm_snapshot_alloc() (Heinz Mauelshagen)
Thanks for the update Ernie. The patch that you attached is the same one that Heinz put in this bug in Comment #21 and that did not work for me. Are you saying that this patch is the only one that fixes the snapshot issue, because that patch didn't work at all for me? Thanks, Andy.

Andrew, I apologize for not reading this bugzilla report thoroughly before changing its state to "modified" on Monday night. I have put it back into "assigned" state until the problem you reported is fully resolved. Heinz, please follow up on this issue, and when/if you post an additional patch (internally), please note that it definitively resolves this bugzilla id. (To be fair to Heinz, he only wrote that the bug fixed by the prior patch was likely to be causing this problem, and it was my fault for not "closing the loop".) Cheers. -ernie

Any update on this bug?

No, we weren't able to reproduce your flaw here at any time.

You can reproduce the bug by using a hardware RAID card that uses the megaraid2 driver in a Dell PowerEdge server. There are two other reports of the same problem in this bug report (see Comment #9 and Comment #24).

Still having the same problem on RHEL3-U4 with kernel 2.4.21-27.ELsmp. I will attach lsmod, lspci, and the error. This error is occurring on another Dell PowerEdge server model (PE4600). What needs to be done to have Red Hat get Dell hardware in to reproduce this issue?

Created attachment 111599
Information (lspci, lsmod, etc) for PE4600 on which bug occurs
Contains the following:
uname -a
lsmod
lspci
lvmdiskscan
lvdisplay
lvcreate showing error
free
dmidecode
We have the same issue, found it this weekend during an upgrade.

PowerEdge 6600 - 8G RAM
PowerEdge 6400 - 8G RAM

Both are affected by the snapshot creation problem. I added mem=6000M and I can now create snapshots. These are the only two I have tested, but we have ten more servers that could potentially have the same problem.

Created attachment 125852
Sun V40z, 4 x AMD Opterons, 4G RAM: lsmod, lspci output
Adding a "me too" here, as it seems important since my hardware is ENTIRELY different from anything listed so far: Sun V40z, quad AMD Opterons, 4G RAM, RHEL 3, smp kernel, LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07) card. 3 x 72G RAID 5 configured, one 72G logical volume on the hardware RAID. The machine was patched an hour before trying this experiment. The mem=4000M workaround worked for me as well. uname -a:

    Linux diamondback 2.4.21-37.0.1.ELsmp #1 SMP Wed Jan 11 18:35:45 EST 2006 i686 athlon i386 GNU/Linux

I've attached lsmod and lspci. Thanks,

Any progress here? :)

Wasn't able to reproduce it here so I'm afraid no.

Do you have access to Dell 6xxx series hardware with 8GB of RAM and a Dell PERC RAID card? That's all you need to reproduce the issue. :)

Sorry for the late answer: no, I haven't. I only tried on a different 8GB machine.

Haha ... really late :) I believe the problem only occurs on machines with a hardware RAID card, in particular those that use the megaraid2 kernel module. I'm not sure what kernel module the LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT cards use (as in comment #46), but I've reproduced it on 3 different Dell machines and there are 3 other users in this bug who have reproduced it. I *guarantee* you can reproduce it if you use Dell 46xx or 66xx series servers with a Dell PERC RAID card (that uses the megaraid2 module) and at least 8GB of RAM. Comment #9 and comment #46 also indicate hardware that can reproduce the issue.

I should refine my Steps to Reproduce to include a Dell PERC hardware RAID card.

Closing now w/o possibility to reproduce and fix.