Bug 117400
| Field | Value |
| --- | --- |
| Summary | dentunusd reported as around 4 billion |
| Product | Red Hat Enterprise Linux 2.1 |
| Component | kernel |
| Version | 2.1 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED INSUFFICIENT_DATA |
| Severity | medium |
| Priority | medium |
| Reporter | Angus Thomas <athomas> |
| Assignee | Don Howard <dhoward> |
| QA Contact | Brian Brock <bbrock> |
| CC | bnocera, burks, jlayton |
| Doc Type | Bug Fix |
| Bug Blocks | 143573 |
| Last Closed | 2007-01-31 00:36:51 UTC |
Description
Angus Thomas
2004-03-03 16:58:41 UTC
Machine is a PowerEdge 6650. lsmod looks like this:

```
Module                  Size  Used by    Tainted: P
aic7xxx               127264   0  (unused)
ide-tape               61120   0  (autoclean)
lp                      8224   0  (autoclean)
parport                38144   0  (autoclean) [lp]
sr_mod                 17592   0  (autoclean) (unused)
st                     30548   0  (autoclean) (unused)
ide-cd                 35328   0  (autoclean)
cdrom                  35520   0  (autoclean) [sr_mod ide-cd]
sg                     35076   0  (autoclean)
nfs                    92192 112  (autoclean)
lockd                  61184   1  (autoclean) [nfs]
sunrpc                 86128   1  (autoclean) [nfs lockd]
esm                    78310   1
autofs4                12644   9  (autoclean)
bcm5700                73156   1
ext3                   74240   5
jbd                    55176   5  [ext3]
megaraid               28288   7
sd_mod                 13888   6
scsi_mod              125980   6  [aic7xxx sr_mod st sg megaraid sd_mod]
```

Not a sar bug necessarily. Sar is failing to handle negative values reported from /proc. Specifically:

```
[thomasaa@xldn0145pap bin]$ cat /proc/sys/fs/dentry-state
337518 -953 45 0 0 0
```

That negative value for unused dentry objects is illogical, so really this is a kernel bug in the 2.4.9-e.25enterprise kernel.

Has a fix for this bug been generated? I too am running 2.1AS (with kernel 2.4.9-e.27enterprise) and experiencing the same dentunusd issue. After running the system for months at a time with 2.5GB of RAM, we recently added more memory (now at 6GB total) and have been having lots of I/O problems since then (and uptime hasn't made it past 4 days without requiring a reboot). Process counts get up into the hundreds, most waiting on I/O, with loadavg over 50 (and I've seen it as high as 80 or 90 sustained). In looking at sar output to try to help determine the problem, I noticed this same dentunusd output as reported in this bug. Any suggestions? The machine is a Compaq DL-380 with dual 2.8GHz P4s. I can provide more details as needed. It runs an internal SourceForge instance, so disk I/O is critical (lots and lots of CVS repositories served). I have no idea if I'm barking up the wrong tree here or not, but any help you could provide would be appreciated.

It looks like a negative value for dentry_stat->nr_unused would cause a vicious cycle of dcache pruning (note that all of this is signed int arithmetic):

```c
int shrink_dcache_memory(int priority, unsigned int gfp_mask)
{
	...
	if (priority)
		count = dentry_stat.nr_unused / priority;
	prune_dcache(count);
	...

void prune_dcache(int count)
{
	...
		dentry_stat.nr_unused--;

		/* Unused dentry with a count? */
		if (atomic_read(&dentry->d_count))
			BUG();

		prune_one_dentry(dentry);
		if (!--count)
			break;
	...
```

If I'm interpreting this correctly, I'd expect that this would bring a machine with much disk IO activity to its knees.
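To make that failure mode concrete: dentunusd is the kernel's dentry_stat.nr_unused (the second field of /proc/sys/fs/dentry-state), and shrink_dcache_memory() divides it by the scan priority to pick a prune count. Below is a minimal userspace simulation of the quoted exit test, assuming the 2.4 default priority of 6; it mimics only the loop shape and is a sketch, not the kernel source:

```c
#include <stdio.h>

/*
 * Userspace simulation of the exit test in prune_dcache().
 * With signed int arithmetic, a negative initial count means
 * "--count" steps further away from zero, so "if (!--count)"
 * never fires; the loop only stops when the LRU list itself
 * is exhausted, i.e. every unused dentry gets pruned.
 */
int main(void)
{
	int lru_entries = 1000000;	/* pretend-size of the dentry LRU */
	int count = -953 / 6;		/* nr_unused / priority, values as in the report */
	int pruned = 0;

	while (lru_entries > 0) {
		lru_entries--;		/* stand-in for prune_one_dentry() */
		pruned++;
		if (!--count)		/* never true while count < 0 */
			break;
	}

	printf("count started at %d; pruned %d entries\n", -953 / 6, pruned);
	return 0;
}
```

With any negative count, each shrink pass would empty the whole unused-dentry list instead of pruning a fraction of it, which would match the sustained I/O stalls reported above.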
Yes, my system does get brought to its knees when the disk IO gets to even moderate levels. Is there any plan to provide a fix for this issue?

Jason Burks
GE Global Research

*** Bug 169635 has been marked as a duplicate of this bug. ***

Can folks who are experiencing this bug please test with the following kernel? http://people.redhat.com/dhoward/bz117400/ Please note that this kernel is provided for debug/testing purposes, and is not officially supported.

Don, I'll have to see whether I can figure out a way to test that kernel. This is happening on a production machine and not any of my test machines. I'm not sure I can duplicate the production load on my test boxes, but I'll see what I can do. By the way, I've also noticed a similar signed/unsigned problem with the inode-sz field in my sar output. It generally corresponds to when my dentunusd number gets way up there, but not always.

For instance, my machine crashed around 2:15 am or so today, and here is the relevant sar output from that time window:

```
12:00:01 AM   dentunusd  file-sz  %file-sz  inode-sz  super-sz  %super-sz  dquot-sz  %dquot-sz  rtsig-sz  %rtsig-sz
01:00:00 AM        3117      600      0.60      3629         0       0.00         0       0.00         0       0.00
01:10:00 AM  4294966953      638      0.64       776         0       0.00         0       0.00         0       0.00
01:30:10 AM  4294966355      405      0.41       660         0       0.00         0       0.00         1       0.10
01:40:00 AM  4294966461      657      0.66       671         0       0.00         0       0.00         2       0.20
01:50:00 AM  4294966443      555      0.85       731         0       0.00         0       0.00         4       0.39
02:00:01 AM  4294966236      569      0.87       653         0       0.00         0       0.00         4       0.39
02:10:01 AM  4294966199      500      0.76       733         0       0.00         0       0.00         4       0.39
03:20:00 AM        8000       43      0.07      4731         0       0.00         0       0.00         0       0.00
03:30:00 AM      157474      114      0.17    134079         0       0.00         0       0.00         0       0.00
03:40:00 AM       49440      210      0.32     32106         0       0.00         0       0.00         0       0.00
03:50:00 AM       15827      130      0.20     14193         0       0.00         0       0.00         0       0.00
```

As you can see, sar read the counters at 2:10 am, and then not again until after 3:00 am. I had to powercycle the box after waiting for shutdown -r to do its magic for nearly 45 minutes. So I guess, while you're looking into the dentry cache values, you might look to see if this is related. I'll let you know about testing that kernel.

Thanks,
Jason Burks
GE Global Research
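The four-billion readings are the wraparound of small negative counters: a minimal sketch, assuming sar displays the kernel's signed counter through a 32-bit unsigned conversion (which is what the numbers above imply):

```c
#include <stdio.h>

/*
 * A negative signed counter reinterpreted as unsigned 32-bit
 * lands just below 2^32 (4294967296). The 4294966953 in the
 * 01:10 sar sample above corresponds to nr_unused == -343.
 */
int main(void)
{
	int nr_unused = -343;			/* signed value from the kernel */
	unsigned int shown = (unsigned int)nr_unused;

	printf("%d -> %u\n", nr_unused, shown);	/* prints: -343 -> 4294966953 */
	return 0;
}
```

The same conversion would account for the similar behavior Jason reports for inode-sz.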
Don, I was finally able to deploy your patched kernel to my production system (unfortunately the only place I could reproduce the issue). It's now been up for 12 days with no hiccups whatsoever. The system load numbers are much more stable, old processes waiting on I/O finish up and go away properly, and the problem seems to have gone away. With the old kernel, running a full backup of the system would almost always kill the box. We've run the backup 12 nights in a row with the new kernel and haven't had any problems. Until the system has been up a full month, I won't call the patch a success, but for now, it sure looks like it is.

Regards,
Jason Burks
GE Global Research

Jason - Thanks for the feedback, I look forward to hearing how the patch fares over the next few weeks. I'd also be interested to know if you see any kernel warning messages like this: "prune_dcache(): negative count"?

Don, I don't see any kernel warnings like that in anything in /var/log. Where would that warning appear if it showed up? -Jason

Jason - If they appear, I'd expect to see them in /var/log/messages. BTW: do you use any third party filesystems? What does /proc/modules look like on this system?

Don, Okay, I checked /var/log/messages and don't see any messages about prune_dcache.

```
[root@openge log]# cat /proc/modules
cpqci                  28232   0  (unused)
tg3                    61984   2
lpfcdfc               150016   2
vxportal-3.4.3          3152   0  (autoclean) (unused)
qlog-3.4.3             53632   0  (autoclean) (unused)
vxfs-3.4.3           1054972   1  (autoclean) [vxportal-3.4.3 qlog-3.4.3]
vxspec                  5796   2
vxio                  659168   2
vxdmp                 121176  59
usb-ohci               23488   0  (unused)
usbcore                68896   1  [usb-ohci]
ext3                   71264  10
jbd                    55700  10  [ext3]
lpfc                  224768 112  [lpfcdfc]
cciss                  47648  12
sd_mod                 14048 112
scsi_mod              128508   2  [lpfc cciss sd_mod]
```

As you can see, yes, we use Veritas vxfs for the filesystem on our SAN. -Jason

Don, Just wanted to update you on my production server. We're currently at 33+ days of uptime and counting. There have been no load problems at all, and I have not seen any log messages about negative cache counts. I think your patch has solved my problem. -Jason

An updated test kernel is available at http://people.redhat.com/dhoward/bz117400/ The included patch addresses racy updates to dentry_stat. Testing and feedback requested.

Don, Please provide more details on what this latest patch does. The previously provided test kernel has been running very well for us. I'd like to know what symptoms I should be looking for to see if I'm experiencing the race condition. As my machine is a large production environment and I have not been able to reproduce the problems in my test environment, I'm reluctant to start playing with other kernels unless I know there is a specific need.

Thanks,
Jason

Hi Jason - The previous patch treated the symptom: it reset the dentry_stat counters when they became negative. The updated kernel contains a patch that arranges for dentry_stat updates to happen only when the caller is holding dcache_lock. This should keep the counters from being decremented below zero.

Created attachment 128808 [details]
Ensure that dcache_lock is held when updating dentry_stat.
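The attachment's actual hunks are not reproduced in this thread. What follows is only a hedged sketch of the locking pattern Don describes, assuming the 2.4-era dcache_lock spinlock and the dentry_stat/d_lru names from fs/dcache.c; example_remove_from_lru() is a hypothetical helper, not a function from the patch:

```c
#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Sketch only -- the real change is in attachment 128808.
 * dentry_stat.nr_unused is a plain int shared across CPUs, so an
 * unlocked read-modify-write on two CPUs can lose updates and
 * drive the counter negative. Holding dcache_lock around every
 * increment/decrement serializes the updates.
 */
static void example_remove_from_lru(struct dentry *dentry)
{
	spin_lock(&dcache_lock);
	if (!list_empty(&dentry->d_lru)) {
		list_del_init(&dentry->d_lru);
		dentry_stat.nr_unused--;	/* now serialized by dcache_lock */
	}
	spin_unlock(&dcache_lock);
}
```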
Are folks still seeing negative dentry values? There's been no feedback on the updated test kernel. I need some positive testing feedback before the patch can be included in official builds.

It's not been something that the client has found to be readily reproducible, and they're loath to deploy a test kernel widely for something along these lines, so I'm not likely to be able to confirm the fix as such.

I'm sorry I did not reply earlier. We've been running this latest kernel (2.4.9-e.70.bz117400enterprise #1 SMP) since July without any issues. I only have one box where the original error was occurring, so I can't say for sure that the problem is gone, only that it is fixed on my one box that was having the problem.

Regards,
Jason

Interest in this issue appears to have waned. I'm closing it now, due to lack of firm feedback. Eric Sandeen pointed out this patch as a possible fix: http://linus.bkbits.net:8080/linux-2.5/?PAGE=gnupatch&REV=1.1722.49.23 If anyone is experiencing this issue, please re-open the ticket with details, and I can provide a test kernel. Thanks.