Bug 66818
Summary: | ldt allocation failed | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | william ewing <ewingw> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.1 | CC: | bcrl, jakub, jeff, redhat |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2003-04-05 01:13:58 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
william ewing
2002-06-17 11:30:53 UTC
1) What kernel version are you using?
2) Do you have any idea how many threads there are from OpenMail when things start to stop working?

1. I have tried kernels 2.4.2, 2.4.9-31gz and 2.4.9-34 in SMP and enterprise mode.
2. No; how can I tell what threads are being used? Could this be a glibc restriction on the number of threads? Would I need to generate a libpthread.so? As the system is crashing every day with over 1200 users on, any help or guidance would be appreciated.

ps -waux | wc -l will give an approximation of the number of threads. I doubt it's a glibc limitation (adding our glibc maintainer to CC to confirm), but each thread will use on the order of 16 KB of unswappable kernel memory, plus whatever else the threads do. The maximum amount of unswappable kernel memory is limited to some 860 MB. You could try the kernel from the Advanced Server product; there is code in that kernel to keep extra "unswappable kernel memory" available for situations like yours.

glibc has a limitation on the number of threads in one process (1024), but that can be changed by recompiling glibc. How many threads is HP OpenMail using? The reason the LDT allocation fails is more likely that there are no more 64 KB contiguous chunks of physical memory, which are needed for the LDT. But it would be much more likely to trigger if HP OpenMail created 1200 separate processes, each linked with -lpthread; each of those 1200 processes would then require allocation of 64 KB of physically contiguous memory.

> But it would much more likely trigger if HP OpenMail created 1200 separate
> processes, each linked with -lpthread, thus each of those 1200 processes would
> require allocation of 64K physically contiguous memory.

How can I tell if this is the case, and if so, what do I need to do to resolve the problem?
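One quick way to test Jakub's hypothesis is to check whether the per-user binary actually links against libpthread, since on these kernels only pthread-linked processes force the 64 KB LDT allocation. A minimal sketch, demonstrated on /bin/sh because the real OpenMail binary path is not known from this report:

```shell
# check_pthread: report whether a binary pulls in libpthread.
# Substitute the OpenMail per-user binary for /bin/sh below.
check_pthread() {
    if ldd "$1" 2>/dev/null | grep -q libpthread; then
        echo "linked"
    else
        echo "not-linked"
    fi
}

check_pthread /bin/sh

# Rough count of processes/threads, as suggested above:
ps -waux | wc -l
```

If every per-user process reports "linked", each of them carries its own 64 KB physically contiguous LDT on these 2.4 kernels, which matches the failure mode described here.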
If Jakub's hypothesis is correct, then HP needs to fix their program. 64 KB extra (plus the normal 16 KB per process) means the unswappable memory gets exhausted fast. The Advanced Server kernel (2.4.9-e.3) will extend the limit a bit, but not to infinity. The correct way (and tons faster) to do this would be to not make 1200 threads or processes, but just a few that each handle several connections at once.

Is there any way that I can prove that this is the problem? I.e. show the number of threads and how much unswappable kernel memory they use. Am I right in thinking that there is 860 MB of unswappable kernel memory? How much does the system use, as 860 MB / 1200 users = 700+ KB per user! As my active user population is 1600, would the Advanced Server allow me to get these additional 300 to 400 users on? Is there any way to decrease the size of the LDT? Can I run the Advanced Server kernel on Red Hat 7.x? If so, where would I get it from, and would it need to be recompiled?

> As my active user population is 1600, would the advanced server allow me to get
> these addition 300 to 400 users on ?

300 to 400 sounds possible, depending a bit on how the memory is used exactly.

ftp://ftp.redhat.com/pub/redhat/linux/enterprise/2.1AS/en/os/i386/SRPMS/ has the src.rpm of the kernel; it's easy to rebuild with rpm --rebuild --target=i686 kernel-2.4.9-e.3.src.rpm (you'll need one or two other packages as well, from the same directory).

It would help me if you could answer the questions below:
1. > Is there any way that I can prove that this is the problem?
> I.e. show the number of threads and how much unswappable kernel memory they use.
2. > Am I right in thinking that there is 860 MB of unswappable kernel memory?
> How much does the system use, as 860 MB / 1200 users = 700+ KB per user!
Can you explain why my maths has gone wrong above?
3. > Is there any way to decrease the size of the LDT?
4. > Can I run the Advanced Server kernel on Red Hat 7.x?
1) cat /proc/meminfo combined with cat /proc/slabinfo will give me enough info so that I can say a "most probably" or a "no".
2) The 860 MB is used for other things too, so it's not THAT simple. Also, things like open files and administration data for allocated memory scale with the number of processes and come from this pool.
3) Not that I know of currently.
4) It works. If it works, it might be a plan to go to AS completely, though.

Taking into account that I have a dual processor server with up to 4 x 1.8 GB swap devices and 1 to 6 GB of RAM, what do you think is the best combination to resolve my problem? Am I likely to hit any VM or big-memory problems? Is there an Advanced Server kernel for 2.4.18? Many thanks for the info so far.

As I am a novice: when I do rpm -ivh --test kernel.... I get

initscripts < 6.41 conflicts with kernel-enterprise-2.4.9-e.3
dev < 3.2-9 conflicts with kernel-enterprise-2.4.9-e.3

I take it I get these files from the Red Hat site?

meminfo and slabinfo details as requested. With 1159 users on, cat /proc/meminfo is:

        total:      used:      free:    shared:  buffers:   cached:
Mem:  2108424192 2099486720   8937472    4304896  34455552 802594816
Swap: 1846616064  596975616 1249640448
MemTotal:     2059008 kB
MemFree:         8728 kB
MemShared:       4204 kB
Buffers:        33648 kB
Cached:        249364 kB
SwapCached:    534420 kB
Active:        195712 kB
Inact_dirty:   624160 kB
Inact_clean:     1764 kB
Inact_target:  524284 kB
HighTotal:    1179632 kB
HighFree:        3116 kB
LowTotal:      879376 kB
LowFree:         5612 kB
SwapTotal:    1803336 kB
SwapFree:     1220352 kB

cat /proc/slabinfo is:

slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    244    5    5    1 :  252  126
ip_fib_hash           10    226     32    2    2    1 :  252  126
ip_conntrack           0      0    384    0    0    1 :  124   62
clip_arp_cache         0      0    128    0    0    1 :  252  126
ip_mrt_cache           0      0     96    0    0    1 :  252  126
tcp_tw_bucket          4     30    128    1    1    1 :  252  126
tcp_bind_bucket      147    226     32    2    2    1 :  252  126
tcp_open_request       0     40     96    0    1    1 :  252  126
inet_peer_cache        4     59     64    1    1    1 :  252  126
ip_dst_cache        2105   2200    192  110  110    1 :  252  126
arp_cache            471    630    128   21   21    1 :  252  126
blkdev_requests    44352  44360     96 1109 1109    1 :  252  126
dnotify cache          0      0     20    0    0    1 :  252  126
file lock cache      168    168     92    4    4    1 :  252  126
fasync cache           0      0     16    0    0    1 :  252  126
uid_cache           1214   1356     32   12   12    1 :  252  126
skbuff_head_cache   1340   1560    160   65   65    1 :  252  126
sock                1263   1284   1312  428  428    1 :   60   30
sigqueue              58     58    132    2    2    1 :  252  126
kiobuf                 0      0   8768    0    0    4 :    0    0
cdev_cache            64    177     64    3    3    1 :  252  126
bdev_cache            11    177     64    3    3    1 :  252  126
mnt_cache             18    118     64    2    2    1 :  252  126
inode_cache        33728  42570    448 4730 4730    1 :  124   62
dentry_cache       27765  36600    128 1220 1220    1 :  252  126
dquot                  0      0    128    0    0    1 :  252  126
filp              106082 106120     96 2653 2653    1 :  252  126
names_cache            7      7   4096    7    7    1 :   60   30
buffer_head        82051 109880     96 2747 2747    1 :  252  126
mm_struct           1416   1416    160   59   59    1 :  252  126
vm_area_struct    113512 118413     64 2007 2007    1 :  252  126
fs_cache            1416   1416     64   24   24    1 :  252  126
files_cache         1346   1359    416  151  151    1 :  124   62
signal_act          1268   1332   1312  444  444    1 :   60   30
pae_pgd             1335   1469     32   13   13    1 :  252  126
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             2      2  65536    2    2   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             4      4  32768    4    4    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384             8      8  16384    8    8    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              4      4   8192    4    4    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096             57     58   4096   57   58    1 :   60   30
size-2048(DMA)         4      4   2048    2    2    1 :   60   30
size-2048           1120   1138   2048  565  569    1 :   60   30
size-1024(DMA)         1      4   1024    1    1    1 :  124   62
size-1024           1350   1352   1024  338  338    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512             712    712    512   89   89    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256             300    300    256   20   20    1 :  252  126
size-128(DMA)         68    120    128    4    4    1 :  252  126
size-128             840    870    128   29   29    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64             2419   2419     64   41   41    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32             5505   6215     32   55   55    1 :  252  126

Latest info:
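To see where the low-memory pool is going, the per-cache totals in a 2.4-format slabinfo can be summed as object count times object size. A rough sketch, fed with a few of the biggest caches from the dump above (it skips the handful of 2.4 cache names that contain spaces, such as "file lock cache"):

```shell
# slab_usage: print approximate kB held by each slab cache, largest
# first, from 2.4-style slabinfo rows (name, active, num-objs,
# objsize, ...). On a live 2.4 machine, pipe /proc/slabinfo in.
slab_usage() {
    awk 'NF >= 4 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
        printf "%-18s %8.0f kB\n", $1, $3 * $4 / 1024
    }' | sort -k2 -rn
}

slab_usage <<'EOF'
filp 106082 106120 96 2653 2653 1 : 252 126
inode_cache 33728 42570 448 4730 4730 1 : 124 62
dentry_cache 27765 36600 128 1220 1220 1 : 252 126
vm_area_struct 113512 118413 64 2007 2007 1 : 252 126
buffer_head 82051 109880 96 2747 2747 1 : 252 126
EOF
```

On the rows above, inode_cache (roughly 18 MB of low memory) comes out on top, with buffer_head and filp close behind, consistent with a very large number of open files eating the unswappable pool.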
I have a test server (dual processor, 4 GB of RAM, 6 GB of swap) with Red Hat 7.3 installed. I have installed the Red Hat 2.4.18-4 Big Memory kernel and the Advanced Server kernel. Greg Royle from HP OpenMail supplied me with a memory-grabbing and process-using C program (each process allocates itself more memory every so many seconds, and a new process is generated every second). Using a batch file I requested the program to add 100 KB every 60 seconds to each process, and then ran it on both kernels. For the Advanced Server kernel, using top, I got 1335 processes. For the Big Memory kernel, using top, I got 1342 processes. In both cases, although the program continued to run, no more processes were generated, and in each case I still had memory left over. If required I can supply the C program. Can you supply me with any more information/suggestions?

Programs can grab swappable memory just as well... but your system has over 100000 files open during the problem time, for example, and each open file takes unswappable memory. The vm_area_struct number is high too, but we fixed some things in the AS kernel for that.

Sorry, I'm not sure how you would like me to proceed. Do you want me to install the Advanced Server kernel on the live server and test? If so, are there any other values that I need to change in either the kernel, /proc or anywhere else (glibc, ...)? Does the Advanced Server kernel support large memory? Or do you think that this application is unsuitable for Red Hat?

The test program doesn't look like it has anything to do with the problem. I think the only way to be sure that the Advanced Server kernel fixes this particular problem for you, without deploying live, would be to set up a test environment that duplicates your live environment and do something like use fetchmail running on other machines to try to duplicate the problem, then switch kernels and test again. The problem you are seeing shows up when there is contention for low memory. We did several things in Advanced Server to decrease demand for low memory. Exactly how much help it can be in this case is not something that we can determine precisely here, but it should be an improvement.

One note: if you wish to deploy this in a supported environment, the Advanced Server kernel is only officially supported by Red Hat as part of Red Hat Linux Advanced Server. We've made it available for testing, but aren't recommending a deployment of that kernel that is not on the Red Hat Linux Advanced Server platform.

If this works I have no problem buying the Advanced Server package. What is the latest kernel version in the Advanced Server range, as I believe that 2.4.18 and above has far better big-memory support? Would there be any gain in using Red Hat 7.3 with the 2.4.18-4 kernel?

The AS kernel is called 2.4.9-e.3, but it has a LOT of changes from later kernels included.

Is the following relevant, as I am running Red Hat 7.1 on a dual processor server: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=43742 ? If so, was there a fix apart from going to Red Hat 7.2? Jakub?

There was an errata kernel for 7.1 (well, at least two, one including security fixes). The latest one is http://rhn.redhat.com/errata/RHBA-2002-056.html. But I don't think #43742 has anything to do with this bug report.

With regard to this:
> I have a test server (dual processor, 4 GB of RAM, 6 GB of swap) with Red Hat
> 7.3 installed. I have installed the Red Hat 2.4.18-4 Big Memory kernel and the
> Advanced Server kernel. Greg Royle from HP OpenMail supplied me with a
> memory-grabbing and process-using C program (each process allocates itself
> more memory every so many seconds, and a new process is generated every second).
> Using a batch file I requested the program to add 100 KB every 60 seconds to
> each process, and then ran it on both kernels.
> For the Advanced Server kernel, using top, I got 1335 processes.
> For the Big Memory kernel, using top, I got 1342 processes.
> In both cases, although the program continued to run, no more processes were
> generated, and in each case I still had memory left over.
I have now recompiled the C program WITHOUT -lpthread.
Rerunning the tests I get about 1577 processes before the program generates a
segmentation fault. If I run it again I get 3076 processes before the next
segmentation fault and so on ....
As the application uses -lpthread, my test would seem to suggest that I can only
generate 1342 processes in total, so 1342 minus (system and HP OpenMail
processes) = about 1200 users allowed access!
Is this correct ?
Is there a limit on the number of forks or processes that a parent process can
generate ?
C program can be supplied.
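On the general fork/process-limit question: there are ordinary ceilings worth ruling out before blaming the LDT. A sketch that prints them; /proc/sys/kernel/threads-max is the 2.4 system-wide task limit (derived from RAM size at boot), and ulimit -u (where the shell supports it) is the per-user cap. Neither of these normally produces a wall at ~1300, which points back at the 64 KB LDT allocations.

```shell
# Show the ordinary process-count ceilings, to rule them out before
# blaming LDT allocation failures.
echo "per-user process limit (ulimit -u): $(ulimit -u 2>/dev/null || echo unknown)"
if [ -r /proc/sys/kernel/threads-max ]; then
    echo "system-wide task limit: $(cat /proc/sys/kernel/threads-max)"
fi
```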
As each user accesses the HP OpenMail system, they are allocated a UAL.REMOTE process. This program was compiled with -lpthread. When this program is not compiled with -lpthread, the number of users allowed access has increased to the number requiring access. The previous C program I mentioned was able to duplicate this no matter what kernel I used. Can you explain why?

This bug will be fixed when NPTL is included in a future release.

It seems there's now an OpenMail version that does NOT link to pthreads.

I have seen similar problems when running both kernel-summit-2.4.9-e.3 and kernel-summit-2.4.9-e.10 on an IBM x440 with 4 processors and 4 GB RAM. We are running into the problem when running Apache, and Apache forks out over 1200 processes. You can reproduce easily enough by writing a small script that forks off a bunch of sleeps. Here's how I reproduce:

for ((i=1;i<1000;i++)); do ./fork.sh; done

fork.sh looks like this:

#!/bin/sh
sleep 3600 &

I should note that to reproduce it you'll have to run that command a few times. Normally I get about 1296 sleeps before I hit the wall. This is also reproducible with the uniprocessor and SMP 2.4.18-19 kernels on the IBM x335 series. I would expect this behavior on these smaller machines, but I would sure like to be able to run more Apache children on my x440 if possible. Is this likely to be better in an upcoming version of the Advanced Server kernel?

I get this same problem on RH 7.3 with Apache configured to allow 2048 processes: after about 1300 are created, they start failing with the "ldt allocation failed" messages. On a test machine I upgraded to a stock 2.4.20 kernel and the problem went away, but I would rather use RH-approved kernels. I can verify that 2.4.18-27.7.x has the problem with ldt allocation.

This problem is fixed in the 2.4.20 kernel released as part of Red Hat Linux 9.