Bug 147933

| Field | Value |
|---|---|
| Summary | Mem leak in 2.6.10-1.760_FC3 |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 3 |
| Hardware | i386 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Reporter | Mike Bird <mgb> |
| Assignee | Dave Jones <davej> |
| QA Contact | Brian Brock <bbrock> |
| CC | davej, davem, paul+rhbugz, pfrields, wtogami, zing |
| Doc Type | Bug Fix |
| Bug Blocks | 147461 |
| Last Closed | 2005-05-22 06:11:46 UTC |
Description
Mike Bird, 2005-02-13 14:06:58 UTC
Two servers running sub-DS1 loads seem to be leaking both "bio" and "biovec-1" at a rate of just under one million of each type of object per day - approx 110MB per day. As yet, unable to develop a simple test case. Have tried process creation, file creation and deletion, user quotas. Leak seems to be related to setting up or tearing down TCP connections. Problem seems to be unrelated to volume of TCP data passed. Assuming you have sshd configured to allow localhost to log in to itself without user interaction, this will demonstrate the leak:

while true; do ssh localhost echo -n .; done

Monitor with slabtop.

davem, does this ring a bell? ISTR some talk of networking related leaks post 2.6.10, so if something went into 2.6.11rc, it's possible we need that backported until we rebase to 2.6.11.

Can we get a /proc/slabinfo snapshot? That will help me figure out what is leaking exactly. There were some fixes post-2.6.11rc, but they were related to ipv6, which I'm not sure these folks are using. Hmmm...

Created attachment 111058 [details]
/proc/slabinfo from server up 39 hours
Attachment shows 1.7M leaked of each of biovec-1 and bio resulting from normal
use of a server for 39 hours. This server was not used for the various tests
described in this log.
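As a rough sanity check (not from the original report; a sketch assuming the counts were near zero at boot and that the second /proc/slabinfo column is the active object count), the ~1M objects/day figure can be re-derived from a snapshot like this one:

awk -v up_hrs=39 '/^(bio|biovec-1) /{printf "%s: ~%.0fk objects/day\n", $1, $2*24/up_hrs/1000}' /proc/slabinfo

With 1.7M active objects at 39 hours of uptime, that works out to roughly 1.0M per cache per day, matching the rate in the original description.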
Running the following on system `A' results in leakage only on system `B'.
This suggests `accept(2)' or something close to it is leaking.
while true; do ssh B echo -n; done
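If slabtop is too noisy, the two suspect caches can be watched directly on `B' while the loop runs; a minimal sketch, assuming the usual /proc/slabinfo layout where the second field is the active object count:

watch -n 5 'grep -E "^(bio|biovec-1) " /proc/slabinfo'

On an affected box the active counts climb steadily even though the loop transfers essentially no data.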
Leaked BIO objects resulting from a networking leak are painfully mysterious to me. I thought you were saying that if you run the "ssh" loop, and that's the entire test, you get leaks. This may not be my department after all. :-)

Created attachment 111098 [details]
/proc/slabinfo, same server, uptime now 65 hours
For comparison purposes, /proc/slabinfo with uptime now 65 hours. Another 1M
each of leaked "bio" and "biovec-1" objects. Remember this is one of several
servers hanging off a DS1. A mail or news server handling DS3 loads with this
O/S is going to leak RAM at around 2-4GB per day - i.e. this O/S is unusable
for any but the smallest ISPs.
We'll have to reboot again in a day or two. If you need more info before we
reboot, let me know soon.
BTW, bug is still assigned to davem who has disavowed knowledge of kernel
networking. How do we get this reassigned?
I have not disavowed knowledge of kernel networking; I am in fact the person for those issues. I have disavowed that this is even a networking problem. What is leaking is filesystem I/O buffers, not networking resources. Therefore I do not think it is a networking bug at all. There have been a couple of BIO leaks in the 2.6.x kernel fixed over the past month or two, so depending upon your configuration (using device mapper, LVM, RAID, or similar) you could be hitting one of those problems. Therefore someone more well versed in this area should take hold of this bug.

I'll take it back for now and have a trawl through the post-2.6.10 changes to see if I can spot anything obvious. Thanks for taking a look davem, feel free to drop off the Cc:. Mike, can you tell me some more info about the IO subsystem in use? IDE? SCSI? MD? Raid?

Apologies to davem for misunderstanding his comment. The servers are each a mix of EXT3/LVM/RAID1/IDE and EXT3/RAID1/IDE. Most of the regular files (/var, /tmp, /usr etc) are on EXT3/RAID1/IDE without LVM. Looks like the problem doesn't occur on EXT3/IDE systems without RAID1. Will verify, investigate some more, and get back to you in an hour or two. Some kind of syslog-RAID1 leak, or just a coincidence? Simply copying/deleting large file trees doesn't seem to cause leaks.

Leak occurs on EXT3/RAID1/IDE filesystems with syslog running. Leak does not occur with syslog off (i.e. minilogd running). We use LVM on some of these systems but not /tmp, /usr, /var. Leak does not occur on systems A or B (see below), which are EXT3/IDE. Leak doesn't occur on system R (see below), which is EXT3/RAID5/SCSI. I don't have RAID1/SCSI or RAID5/IDE for comparison.

Stats on some systems (EXT3/RAID1/IDE except where noted otherwise):

| System | Uptime | Est Daily Syslogs | Est Syslogs Since Boot | Active bio objs | Significant Services (3,6) |
|---|---|---|---|---|---|
| A (5) | 49 hrs | 12k | 25k | 0k | none |
| B (5) | 49 hrs | 1k | 2k | 0k | Workstation |
| D | 56 hrs | 568k | 1325k | 2071k | QMail/Apache |
| H | 49 hrs | 15k | 31k | 497k | Samba |
| L | 48 hrs | 146k | 292k | 649k | Squid/QMail |
| N (1) | 68 hrs | 429k | 1215k | 2937k | INN/QMail |
| N (2) | 68 hrs | 696k | 1972k | 2937k | INN/QMail |
| R (4) | 57 hrs | 58k | 138k | 132k | Samba/Squid |
| S | 56 hrs | 5k | 12k | 160k | Apache |
| U | 68 hrs | 6k | 17k | 140k | Workstation |

Notes:
(1) Counting only syslogged logs on EXT3/RAID1/IDE filesystems.
(2) Counting all syslogged logs.
(3) Apache/Samba/Squid logs not counted as they're direct logs.
(4) EXT3/RAID5/SCSI
(5) EXT3/IDE
(6) news.{crit,err,notice} counted but not INND direct logs.

I just built a test kernel fixing up a bio leak in md. Can you try the -767 kernels at http://people.redhat.com/davej/kernels/Fedora/FC3/ and see if that's any better?

One thing syslog does (that perhaps minilogd does not) is call fsync() on the log every time it writes a new message there. This might help narrow down the true cause. (A crude way to exercise that write-plus-fsync path directly is sketched below, after these comments.)

767 fixes the problem in the test case. Will roll out to the news server tonight and let you know real-world results. Have to shake it down a bit before rolling it out to other servers. Thanks.

Just checked 766, which appeared on mirrors last night. The problem exists in 766. 767 is still looking good. Early days.

Confirming 767 fixes the bug with no undesirable side-effects on the eight systems we've tried it on thus far. We'll be rolling it out to our other systems tonight. Thanks guys.

Thanks for testing. This will go out in the next official errata, though there's a bunch of other things I'd like to get fixed before I push something else out.

Machine here hit this bug too.
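Following up on the fsync() observation above (not part of the original thread; a sketch assuming syslogd is writing user.notice messages to a log file on one of the affected EXT3/RAID1/IDE filesystems, as in a stock /etc/syslog.conf), the suspected write-plus-fsync path can be driven hard without waiting for real traffic:

while true; do logger -p user.notice "bio leak probe"; done

Each message makes syslogd write and fsync() the log file, so the bio/biovec-1 counts should climb quickly on an affected kernel and stay flat on a fixed one.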
770 has been running for a day now and the problem appears alleviated (the bio slabs actually shrank by half from an earlier peak today). Previously it took about a week to become really painfully obvious, but it looks good so far.

Moving this to PROD_READY based on the positive testing feedback.