Tim, I work in Pierre Salkazanov's team at Bull. I am contacting you to provide all the information necessary to reproduce and fix the "low memory" bug discussed during the meeting between Red Hat and Bull held at Westford on March 17-18. In short, the bug behaves as follows: running the test program dbgen on a Tiger machine with 32 GB of memory, after a while (about 2 hours) the server is no longer accessible; only ping answers, and we cannot log in, ssh, or do anything else. On the console we see messages such as: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again. Out of Memory Killed process 6170 (nohup). ... The only thing to do is to reboot the machine. Attached are some documents, such as a readme to start the testcase and information gathered after the "system hang", allowing you to reproduce and debug the problem. I hope all this information is sufficient to help you fix the bug; nevertheless, feel free to ask me if you need more details. I have also sent you the 32 GB of memory needed to quickly reproduce the bug (16 DIMMs of 2 GB).
Created attachment 99475 [details] Attached "readme" showing system config info
Created attachment 99476 [details] test program
Created attachment 99477 [details] syslog output
Created attachment 99478 [details] meminfo
Here's the contact the original problem description came from: Serge CHABUEL, LINUX & HPC Project Manager, Bull S.A., 1, Rue de Provence, 38432 ECHIROLLES, Frec B1-247, Tel: +33 (0)4 76 29 77 36 / fax: 04 76 29 75 18
Oh, his email is serge.chaubel. Another Bull person on the original mail is pierre.salkazanov. The sales person is fmeyer. I tried to add all these people to the cc list, but apparently none of them have established bugzilla accounts.
Bull claims that this problem can be reproduced on a stock Intel Tiger4. So we don't need to wait for a complete Bull system to test it out; rather we can run it on our existing systems. (Once the memory arrives.)
We really should first try this with Larry's change to move the page structs out of the DMA zone.
Sure. Good idea.
ok, i've run this test on a 32GB tiger system, saw no -ENOMEM in the logs. Also, the dbgen processes are still running, going on 3 days now...
Pierre, the kernel that includes the change which allocates the physical pages for the virtual mem_map from the normal zone instead of the DMA zone can be downloaded from here: http://people.redhat.com/~lwoodman/.IA64/ BTW, this change will save ~250MB of DMA zone memory, so it should postpone the ENOMEM failures and the OOM killing of processes to the point where you never see them. Can you try out that kernel and let us know if it fixes your problem? Thanks, Larry Woodman
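For anyone following along, a minimal sketch of installing such a test kernel alongside the running one (the exact rpm filename below is an assumption taken from a later comment; adjust to whatever is actually posted at the URL):

# download the test kernel (filename assumed from a later comment)
wget http://people.redhat.com/~lwoodman/.IA64/kernel-2.4.21-15.5.EL.ia64.rpm
# install with -i rather than -U so the current kernel remains as a fallback
rpm -ivh kernel-2.4.21-15.5.EL.ia64.rpm
# then add/verify a boot entry for the new kernel and reboot into it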
The kernel that Larry recommends has made a big difference for other customers who were seeing memory exhaustion. Can you please evaluate this kernel? Adding Tim to the cc list so that we can make some progress on this... it seems like Bull is not seeing this bug...
If we don't get moving on this issue in the next week or so, a resolution for U3 is in serious jeopardy.
Test is currently running to reproduce the original problem and to verify that the RHEL3 Update 3 pre-beta kernel fixes it. More details on configuration and test results (hopefully good !) should be posted by tomorrow.
Unfortunately, the test fails! The test was started at 9:30 am and was still running at 11:30 am, but at 12:30 the machine hung. With the new kernel (rpm: kernel-2.4.21-15.5.EL.ia64.rpm), the machine cannot boot if the SCSI controller is an LSI one, but it is OK with an Adaptec. (LSI: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07); Adaptec: Adaptec AHA-39160D / AIC-7899A U160/m (rev 01).)
-----Forwarded Message-----
From: Pierre Fumery <pierre.fumery>
To: Susan S. Denham <sdenham>
Cc: bnocera
Subject: (121029 good and ... bad) Re: RHN access keys (product IDs) for RHEL 4 alpha and beta access
Date: 03 Jun 2004 17:33:38 +0200

Hi Sue,
. . .
About 121029, to let you know directly: there are improvements. The machine hangs after about three hours (RHEL3 Update 3) instead of one hour previously (same configuration with RHEL3 Update 1). Memory doesn't seem to be exhausted as it was previously. So things obviously have changed, but the change doesn't seem to address the whole problem? Or there is another one. I have already asked the team to post more details directly in 121029; it should be done very soon. We'll need Red Hat's help to figure out the best way to give you enough information to go further with this investigation. On our side, we'll try to set up a Tiger configuration with 32Gb here, to reproduce this problem in a configuration similar to yours at Red Hat. The current test was done on an available NovaScale machine which already had 32Gb installed/available. Pierre.
Could you please explain which kind of fix has been integrated? Not much explanation above, and http://people.redhat.com/~lwoodman/.IA64/ looks empty to me. Also, any recommendation on how to get further traces or logs would be helpful. Then we could provide them to you to help the investigation.
the change was to not allocate the 'page structs' out of lowmem. This should leave a much larger pool of lowmem to work with. So in this hang there should be plenty of lowmem; Alt-SysRq-m should provide this data. However, it's been mentioned that without the fancy iommu patch these tests do not hang... so it's a bit confusing to me where to attack this problem. I guess if we also got an Alt-SysRq-t, and the time of the hang, that would be helpful.
So the first thing we need to determine is whether this hang is still a lowmem issue. This can be determined by doing:
1. echo 1 > /proc/sys/kernel/sysrq
2. echo m > /proc/sysrq-trigger
This will give us a good breakdown of the free memory on the system. You can put this in a script.
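A minimal sketch of such a script, assuming a 5-minute sampling interval (a fuller version, which also records the HighFree/LowFree counters, appears in a later comment):

# enable the sysrq trigger once
echo 1 > /proc/sys/kernel/sysrq
while :
do
    # dump a free-memory breakdown into the kernel log every 5 minutes
    echo m > /proc/sysrq-trigger
    sleep 300
done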
The test was launched yesterday evening. Unfortunately, a bad configuration made it fail! ... and we found nothing this morning, as the test had aborted by itself. The test is currently being relaunched, and we have set up several traces (memory, top, ...) to track pertinent information in order to catch what goes wrong.
BTW, I asked our kernel guys to have a look if/when the machine hangs. They asked for a kernel debugger or other tools available on RHEL3. We use KDB on our systems, but our kernel guys ported it to 2.6 as we're mainly using that kernel level. Which kernel debugger/tools are you using on RHEL3 to debug such problems? Any information (other than the memory traces described above) would be helpful. Thanks.
As mentioned on the call today, a useful tool for gathering system information is sysreport. It's just /usr/sbin/sysreport; it creates a .bz2 file containing the /proc filesystem, system logs, etc.
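Running it is a single command as root; per the description above, it produces a compressed archive:

# run as root; sysreport reports the path of the .bz2 archive it writes
/usr/sbin/sysreport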
As mentioned on the call today, the test is still running on an 8-way system with 32GB. We have passed 7 hours now, and we'll let it run some more time. On this same system, the same test (dbgen) crashed RHEL3 Update 1 in about one hour. On this same system, dbgen crashed RHEL3 Update 3 once in about 3 hours, but without the same memory problems that were seen with RHEL3 Update 1. I will post more information in a later note. We plan to let it run until tomorrow (our time), and we will then restart it on this same machine but running as a 16-way with 64 GB of memory.
With RHEL3 Update 1, dbgen was crashing our machine (8-way, 32GB) after about one hour. The following traces were kept at the beginning and at the end (machine hang) while running:

# sample HighFree and LowFree from /proc/meminfo every 5 minutes
> meminfo.txt
while :
do
    mois=`date | awk '{print $2}'`
    jour=`date | awk '{print $3}'`
    heure=`date | awk '{print $4}'`
    HighFree=`cat /proc/meminfo | grep HighFree | awk '{print $2}'`
    LowFree=`cat /proc/meminfo | grep LowFree | awk '{print $2}'`
    echo "$mois $jour $heure $HighFree $LowFree" >> meminfo.txt
    sleep 300
done

Jun 2 17:10:03 31109424 101136
...
Jun 2 18:06:31 3817472 20272
With RHEL3 Update 3, dbgen crashed our machine (8-way, 32GB) once after about three hours. The same script (see previous note) was used to keep traces, and we got the following traces at the beginning and at the end (machine hang):

Jun 3 09:28:15 30906672 1567648
...
Jun 3 12:33:30 29187264 1354816

Obviously, the HighFree and LowFree values were still high, and it seems there was another problem we still do not understand.
Today, with RHEL3 Update 3 again, dbgen is still running on our machine (8-way, 32GB) after more than 7 hours. The trace script has been improved per Jason's recommendation (thanks!) as follows:

echo 1 > /proc/sys/kernel/sysrq
> meminfo.txt
while :
do
    mois=`date | awk '{print $2}'`
    jour=`date | awk '{print $3}'`
    heure=`date | awk '{print $4}'`
    HighFree=`cat /proc/meminfo | grep HighFree | awk '{print $2}'`
    LowFree=`cat /proc/meminfo | grep LowFree | awk '{print $2}'`
    echo "$mois $jour $heure $HighFree $LowFree" >> meminfo.txt
    echo m > /proc/sysrq-trigger
    sleep 300
done
Today, with RHEL3 Update 3 again, dbgen is still running on our machine (8-way, 32GB) after more than 7 hours. The corresponding traces are the following:

Jun 8 10:43:32 30906656 1084320
...
Jun 8 13:44:05 29071696 753744
...
Jun 8 17:44:53 27146880 724208

Trace in /var/log/messages:

Jun 8 10:43:32 eroski kernel: SysRq : Show Memory
Jun 8 10:43:32 eroski kernel: Mem-info:
Jun 8 10:43:32 eroski kernel: Zone:DMA freepages: 67778 min: 1279 low: 3050 high: 4063
Jun 8 10:43:32 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 8 10:43:32 eroski kernel: Zone:HighMem freepages:1931655 min: 255 low: 30718 high: 46077
Jun 8 10:43:32 eroski kernel: Free pages: 1999433 (1931655 HighMem)
Jun 8 10:43:32 eroski kernel: ( Active: 17229/13431, inactive_laundry: 2527, inactive_clean: 1829, free: 1999433 )
Jun 8 10:43:32 eroski kernel: aa:0 ac:2107 id:11463 il:1911 ic:1824 fr:67778
Jun 8 10:43:32 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 8 10:43:32 eroski kernel: aa:8852 ac:6270 id:1968 il:616 ic:5 fr:1931655
Jun 8 10:43:32 eroski kernel: 13968*16kB 11263*32kB 5259*64kB 813*128kB 50*256kB 6*512kB 5*1024kB 3*2048kB 2*4096kB 1*8192kB 1*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB = 1084448kB)
Jun 8 10:43:32 eroski kernel: 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB 0*8192kB 0*16384kB 1*32768kB 1*65536kB 1*131072kB 117*262144kB = 30906480kB)
Jun 8 10:43:32 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 8 10:43:32 eroski kernel: 18538 pages of slabcache
Jun 8 10:43:32 eroski kernel: 252 pages of kernel stacks
Jun 8 10:43:32 eroski kernel: 0 lowmem pagetables, 404 highmem pagetables
Jun 8 10:43:32 eroski kernel: Free swap: 2040176kB
Jun 8 10:43:33 eroski kernel: 2095752 pages of RAM
Jun 8 10:43:33 eroski kernel: 14993 reserved pages
Jun 8 10:43:33 eroski kernel: 35383 pages shared
Jun 8 10:43:33 eroski kernel: 0 pages swap cached
Jun 8 10:43:33 eroski kernel: 245 pages in page table cache
Jun 8 10:43:33 eroski kernel: Buffer memory: 33712kB
Jun 8 10:43:33 eroski kernel: Cache memory: 389152kB
Jun 8 10:43:33 eroski kernel: CLEAN: 754 buffers, 3016 kbyte, 72 used (last=754), 0 locked, 0 dirty 0 delay
Jun 8 10:43:33 eroski kernel: LOCKED: 1 buffers, 4 kbyte, 1 used (last=1), 0 locked, 0 dirty 0 delay
Jun 8 10:43:33 eroski kernel: DIRTY: 51 buffers, 204 kbyte, 51 used (last=51), 0 locked, 37 dirty 0 delay
Jun 8 10:44:12 eroski kernel: keyboard.c: can't emulate rawmode for keycode 272
Jun 8 10:45:25 eroski last message repeated 4 times
Jun 8 10:46:28 eroski last message repeated 14 times
Jun 8 10:47:30 eroski last message repeated 20 times
Jun 8 10:48:06 eroski last message repeated 21 times
Jun 8 10:48:33 eroski kernel: SysRq : Show Memory
...
Jun 8 17:04:46 eroski kernel: SysRq : Show Memory
Jun 8 17:04:46 eroski kernel: Mem-info:
Jun 8 17:04:46 eroski kernel: Zone:DMA freepages: 45851 min: 1279 low: 3050 high: 4063
Jun 8 17:04:46 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 8 17:04:46 eroski kernel: Zone:HighMem freepages:1717724 min: 255 low: 30718 high: 46077
Jun 8 17:04:46 eroski kernel: Free pages: 1763577 (1717724 HighMem)
Jun 8 17:04:46 eroski kernel: ( Active: 38960/166207, inactive_laundry: 48360, inactive_clean: 1813, free: 1763577 )
Jun 8 17:04:46 eroski kernel: aa:0 ac:10784 id:11463 il:1882 ic:1808 fr:45851
Jun 8 17:04:46 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 8 17:04:46 eroski kernel: aa:10456 ac:17708 id:154744 il:46478 ic:5 fr:1717724
Jun 8 17:04:46 eroski kernel: 519*16kB 6954*32kB 5282*64kB 817*128kB 51*256kB 6*512kB 5*1024kB 3*2048kB 2*4096kB 1*8192kB 1*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB = 733616kB)
Jun 8 17:04:46 eroski kernel: 214*16kB 41*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB 0*8192kB 1*16384kB 0*32768kB 1*65536kB 1*131072kB 104*262144kB = 27483584kB)
Jun 8 17:04:46 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 8 17:04:47 eroski kernel: 29698 pages of slabcache
Jun 8 17:04:47 eroski kernel: 310 pages of kernel stacks
Jun 8 17:04:47 eroski kernel: 0 lowmem pagetables, 541 highmem pagetables
Jun 8 17:04:47 eroski kernel: Free swap: 2040176kB
Jun 8 17:04:47 eroski kernel: 2095752 pages of RAM
Jun 8 17:04:47 eroski kernel: 14993 reserved pages
Jun 8 17:04:48 eroski kernel: 201630 pages shared
Jun 8 17:04:48 eroski kernel: 0 pages swap cached
Jun 8 17:04:48 eroski kernel: 815 pages in page table cache
Jun 8 17:04:48 eroski kernel: Buffer memory: 174256kB
Jun 8 17:04:48 eroski kernel: Cache memory: 3607024kB
Jun 8 17:07:54 eroski sshd(pam_unix)[15071]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=129.183.160.3 user=root
Jun 8 17:08:01 eroski sshd(pam_unix)[15071]: session opened for user root by (uid=0)
Jun 8 17:09:27 eroski sshd(pam_unix)[28237]: session opened for user root by (uid=0)
Jun 8 17:09:47 eroski kernel: SysRq : Show Memory
Today, with RHEL3 Update 3 again, dbgen is still running on our machine (8-way, 32GB) after more than 7 hours. HighFree is still going down slowly, from 30906656 to 27146880 (vs. from 31109424 to 3817472 with RHEL3 Update 1). LowFree is still going down slowly, from 1084320 to 724208 (vs. from 101136 to 20272 with RHEL3 Update 1). However, with the current RHEL3 Update 3: Zone:DMA freepages went down (after more than 7 hours) from 67778 to 45851, and Zone:HighMem freepages went down from 1931655 to 1717724.
We'll see where we'll be tomorrow morning (our time) on these values. But please let us know whether these values slowly going down is normal behaviour. We're not sure we're looking at the right indicators, and any advice will be welcome. Thanks in advance for your analysis and your feedback.
Pierre and team: Since 16-way support in U3 is a goal for Bull, may I suggest that you also simultaneously kick off this testing (that is, immediately, please!) on your 16-way system. Running these tests in parallel on the 8-way and 16-way -- and hopefully, they'll run successfully -- will be the most efficient way to gain confidence before our 15 June U3 code freeze that the problem has been addressed. Thanks very much, Sue
This is currently the same machine; we don't have both an 8-way and a 16-way available right now. Our machine will be upgraded from an 8-way/32GB to a 16-way/64GB machine tomorrow morning. But in the meantime, I'm working to get another machine, an NS4040 (4-way), upgraded to 32GB too, and we will start another campaign on that configuration. To summarize, I expect to be running dbgen with RHEL3 Update 3 on both one NS5160 (16-way/64GB) and one NS4040 (4-way/32GB) by tomorrow.
The freepages count going down simply means the caches are growing: the page cache and buffer caches grow as the test performs I/O. No abnormal behavior there.
dbgen is a test that performs I/O. But I've been told it restarts tests when the previous ones complete. So we could expect all system resources to be cleaned up by the kernel before being re-acquired for the next loop. If I understood correctly, we should not see freepages going down in this case. Our 8-way system is still running, and we decided this morning to let it run a little longer, as freepages are still going down. We'd like to see if something happens (a daemon killing other processes?) when a minimum threshold is reached.
In the meantime, we're setting up another NS4040 machine with 32GB (done) and with enough disks (> 200 GB, in progress) to run dbgen on such a configuration for several days too.
The buffer and page caches are 'independent' of any specific process, so even if dbgen cleans itself up, the caches are not completely flushed. The system will only begin to aggressively scrub these caches when there is memory pressure.
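One way to watch this is to sample the cache counters alongside the free ones; a minimal sketch, assuming the Buffers/Cached field names reported by /proc/meminfo on these kernels:

# log buffer and page cache sizes every 5 minutes
while :
do
    date >> cacheinfo.txt
    grep -E 'Buffers|Cached' /proc/meminfo >> cacheinfo.txt
    sleep 300
done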
OK, I got it. The test is currently running, and we'll see if there is a problem when "the system begins to aggressively scrub these caches" under memory pressure. If there is such a problem (to be confirmed!), we will need to test on a 16-way system with 64GB to analyze if and when it occurs. On an 8-way with 32GB of memory, no problem so far (> 24 hours). Good thing!
Our 8-way system is still running, and LowFree memory is going up and down, which confirms your explanation. It seems to work well now. So we will upgrade our system to a 16-way; I have just asked people to do so, but it involves specific settings. Anyway, find below the latest trace results:

Jun 9 14:57:12 20229648 33440
Jun 9 15:02:15 20197088 29008
Jun 9 15:07:16 20181040 25568
Jun 9 15:12:18 20169440 23472
Jun 9 15:17:20 20158288 37712
Jun 9 15:22:22 20152496 36304
Jun 9 15:27:26 20146304 39056
Jun 9 15:32:29 20123056 34656
Jun 9 15:37:33 20117200 33504

Jun 9 15:37:33 eroski kernel: SysRq : Show Memory
Jun 9 15:37:33 eroski kernel: Mem-info:
Jun 9 15:37:33 eroski kernel: Zone:DMA freepages: 2101 min: 1279 low: 3050 high: 4063
Jun 9 15:37:33 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 9 15:37:33 eroski kernel: Zone:HighMem freepages:1257325 min: 255 low: 30718 high: 46077
Jun 9 15:37:33 eroski kernel: Free pages: 1259427 (1257325 HighMem)
Jun 9 15:37:33 eroski kernel: ( Active: 68914/516749, inactive_laundry: 150760, inactive_clean: 4308, free: 1259427 )
Jun 9 15:37:33 eroski kernel: aa:0 ac:37209 id:10673 il:1584 ic:1618 fr:2102
Jun 9 15:37:33 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 9 15:37:33 eroski kernel: aa:13363 ac:18347 id:506076 il:149176 ic:2690 fr:1257325
Jun 9 15:37:33 eroski kernel: 530*16kB 70*32kB 2*64kB 14*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB 0*8192kB 1*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB = 33632kB)
Jun 9 15:37:33 eroski kernel: 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB 1*8192kB 1*16384kB 1*32768kB 0*65536kB 1*131072kB 76*262144kB = 20117200kB)
Jun 9 15:37:33 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 9 15:37:35 eroski kernel: 49259 pages of slabcache
Jun 9 15:37:35 eroski kernel: 310 pages of kernel stacks
Jun 9 15:37:37 eroski kernel: 0 lowmem pagetables, 541 highmem pagetables
Jun 9 15:37:37 eroski kernel: Free swap: 2040176kB
Jun 9 15:37:37 eroski kernel: 2095752 pages of RAM
Jun 9 15:37:37 eroski kernel: 14993 reserved pages
Jun 9 15:37:37 eroski kernel: 575762 pages shared
Jun 9 15:37:37 eroski kernel: 0 pages swap cached
Jun 9 15:37:37 eroski kernel: 1389 pages in page table cache
Jun 9 15:37:37 eroski kernel: Buffer memory: 596864kB
Jun 9 15:37:38 eroski kernel: Cache memory: 10903008kB
Just to repeat: claiming victory on the 8-way with 32Gb. So the problem is considered resolved on the 8-way for U3. On to the 16-way testing!
to further clarify: This is where Bull is right now: Per Pierre: To summarize, we will now run dbgen with RHEL3 Update 3 on both one NS5160 (16-way/64GB) and one NS4040 (4-way/32GB).
Pierre expects to have the NS5160 testing underway today, with results we hope by tomorrow.
Following our teleconf., Jason will provide us with the latest available RHEL3 Update 3 kernel. We may want to use it for the NS5160 testing.
let's just stick with the 15.5 kernel for now, since the later ones might introduce additional variables at this point.
As we were waiting for a new kernel, we didn't stop our 8-way machine yet, and it ran through another night without big problems. Only swap problems seem to appear, which prevent us from getting full traces when reading the trace files. But here is the trace information from this morning:

Jun 10 08:29:06 17440080 23920
Jun 10 08:36:02 17440880 23840
Jun 10 08:42:30 17439040 23504
Jun 10 08:49:46 17435584 24944
Jun 10 08:56:11 17434624 24496

Jun 10 03:05:40 eroski kernel: SysRq : Show Memory
Jun 10 03:05:40 eroski kernel: Mem-info:
Jun 10 03:05:40 eroski kernel: Zone:DMA freepages: 2401 min: 1279 low: 3050 high: 4063
Jun 10 03:05:40 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 10 03:05:40 eroski kernel: Zone:HighMem freepages:1090496 min: 255 low: 30718 high: 46077
Jun 10 03:05:40 eroski kernel: Free pages: 1092895 (1090496 HighMem)
Jun 10 03:05:40 eroski kernel: ( Active: 70710/644199, inactive_laundry: 183090, inactive_clean: 10195, free: 1092895 )
Jun 10 03:05:40 eroski kernel: aa:0 ac:37850 id:10673 il:1584 ic:1618 fr:2399
Jun 10 03:05:40 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 10 03:05:40 eroski kernel: aa:14516 ac:18346 id:633526 il:181506 ic:8577 fr:1090496
Jun 10 03:05:41 eroski kernel: 891*16kB 40*32kB 1*64kB 14*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB 0*8192kB 1*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB = 38384kB)
Jun 10 03:05:41 eroski kernel: 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB 1*8192kB 0*16384kB 0*32768kB 0*65536kB 1*131072kB 66*262144kB = 17447936kB)
Jun 10 03:05:41 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 10 03:05:46 eroski kernel: 48565 pages of slabcache
Jun 10 03:05:49 eroski kernel: 310 pages of kernel stacks
Jun 10 03:05:53 eroski kernel: 0 lowmem pagetables, 554 highmem pagetables
Jun 10 03:05:55 eroski kernel: Free swap: 2040176kB
Jun 10 03:05:58 eroski kernel: 2095752 pages of RAM
Jun 10 03:06:04 eroski kernel: 14993 reserved pages
Jun 10 03:06:09 eroski kernel: 703255 pages shared
Jun 10 03:06:14 eroski kernel: 0 pages swap cached
Jun 10 03:06:16 eroski kernel: 286 pages in page table cache
Jun 10 03:06:20 eroski kernel: Buffer memory: 607120kB
If we're not expecting a new kernel, let's move on to setting up our machine as a 16-way/64GB machine right now.
Results on the NS4040 (4-way, 32 GB): Still running since yesterday afternoon, i.e. about a 17-hour run, without all traces enabled. The HighFree and LowFree values are still going down, and we expect them to reach a stable point as we got on our 8-way/32GB machine. Then the values should go up and down around a median value, without hanging the machine. Let's see ...

Jun 9 17:51:45 30921008 1301648
Jun 9 17:52:15 30542112 1298704
Jun 9 17:52:45 30127408 1297824
Jun 9 17:53:15 29720000 1296976
Jun 9 17:53:45 29317440 1296016
Jun 10 10:16:26 24384 25280
Jun 10 10:19:16 29216 24912
Jun 10 10:21:18 28688 23472
Jun 10 10:23:37 28096 23296
Jun 10 10:25:21 27728 22784
Jun 10 10:27:38 31664 22144
Jun 10 10:29:22 31664 20880
Jun 10 10:32:14 26560 21808
Jun 10 10:34:51 26560 20992
Jun 10 10:36:49 24736 20592
After some problems upgrading our system to 16 CPUs (an encrypted validation key had to be renewed and downloaded), we succeeded in starting dbgen with RHEL3 Update 3 (release 15.5, as mentioned by Jason).
Results on the NS5160 (16-way, 64GB): Started yesterday and still running, i.e. a run of more than 18 hours, with traces on.

Jun 10 16:59:21 64206496 1412272
Jun 10 17:04:23 64069088 1338640
Jun 10 17:09:24 64002208 1296800
Jun 10 17:14:26 63951952 1288128
Jun 10 17:19:28 63905200 1287472
Jun 11 10:27:09 55825088 72976
Jun 11 10:32:12 55778992 58560
Jun 11 10:37:15 55736608 48112
Jun 11 10:42:17 55733200 64320
Jun 11 10:47:19 55679680 54144
Jun 11 10:52:22 55653808 49696
Jun 11 10:57:26 55616496 48784
Jun 11 11:02:28 55610176 51664
Jun 11 11:07:33 55588816 53760
Jun 11 11:12:37 55552464 48208

Jun 10 16:59:21 eroski kernel: SysRq : Show Memory
Jun 10 16:59:21 eroski kernel: Mem-info:
Jun 10 16:59:21 eroski kernel: Zone:DMA freepages: 88275 min: 1279 low: 3050 high: 4063
Jun 10 16:59:21 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 10 16:59:21 eroski kernel: Zone:HighMem freepages:4012883 min: 255 low: 63486 high: 95229
Jun 10 16:59:21 eroski kernel: Free pages: 4101158 (4012883 HighMem)
Jun 10 16:59:21 eroski kernel: ( Active: 16571/1961, inactive_laundry: 629, inactive_clean: 0, free: 4101158 )
Jun 10 16:59:21 eroski kernel: aa:0 ac:2102 id:0 il:9 ic:0 fr:88275
Jun 10 16:59:21 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 10 16:59:21 eroski kernel: aa:8393 ac:6076 id:1961 il:620 ic:0 fr:4012883
Jun 10 16:59:21 eroski kernel: 1*16kB 3*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB 0*8192kB 0*16384kB 1*32768kB 1*65536kB 0*131072kB 5*262144kB = 1412400kB)
Jun 10 16:59:21 eroski kernel: 1*16kB 1*32kB 0*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB 1*8192kB 0*16384kB 1*32768kB 1*65536kB 3*131072kB 243*262144kB = 64206128kB)
Jun 10 16:59:21 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 10 16:59:21 eroski kernel: 5645 pages of slabcache
Jun 10 16:59:21 eroski kernel: 284 pages of kernel stacks
Jun 10 16:59:21 eroski kernel: 0 lowmem pagetables, 397 highmem pagetables
Jun 10 16:59:21 eroski kernel: Free swap: 2040176kB
Jun 10 16:59:23 eroski kernel: 4192882 pages of RAM
Jun 10 16:59:23 eroski kernel: 29463 reserved pages
Jun 10 16:59:23 eroski kernel: 24006 pages shared
Jun 10 16:59:23 eroski kernel: 0 pages swap cached
Jun 10 16:59:23 eroski kernel: 148 pages in page table cache
Jun 10 16:59:23 eroski kernel: Buffer memory: 34032kB
Jun 10 16:59:23 eroski kernel: Cache memory: 142448kB
Jun 10 16:59:23 eroski kernel: CLEAN: 723 buffers, 2892 kbyte, 60 used (last=723), 0 locked, 0 dirty 0 delay
Jun 10 16:59:23 eroski kernel: DIRTY: 171 buffers, 684 kbyte, 171 used (last=171), 0 locked, 150 dirty 0 delay

Jun 11 11:27:44 eroski kernel: SysRq : Show Memory
Jun 11 11:27:44 eroski kernel: Mem-info:
Jun 11 11:27:44 eroski kernel: Zone:DMA freepages: 4294 min: 1279 low: 3050 high: 4063
Jun 11 11:27:44 eroski kernel: Zone:Normal freepages: 0 min: 0 low: 0 high: 0
Jun 11 11:27:44 eroski kernel: Zone:HighMem freepages:3468949 min: 255 low: 63486 high: 95229
Jun 11 11:27:44 eroski kernel: Free pages: 3473243 (3468949 HighMem)
Jun 11 11:27:44 eroski kernel: ( Active: 41523/419463, inactive_laundry: 123401, inactive_clean: 2475, free: 3473245 )
Jun 11 11:27:44 eroski kernel: aa:0 ac:23471 id:4992 il:702 ic:820 fr:4296
Jun 11 11:27:44 eroski kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:0
Jun 11 11:27:44 eroski kernel: aa:11000 ac:7064 id:414471 il:122699 ic:1655 fr:3468949
Jun 11 11:27:44 eroski kernel: 1070*16kB 420*32kB 36*64kB 2*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB 0*8192kB 0*16384kB 1*32768kB 0*65536kB 0*131072kB 0*262144kB = 68704kB)
Jun 11 11:27:44 eroski kernel: 1347*16kB 149*32kB 2*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 1*32768kB 0*65536kB 1*131072kB 211*262144kB = 55503184kB)
Jun 11 11:27:44 eroski kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 11 11:27:44 eroski kernel: 57970 pages of slabcache
Jun 11 11:27:44 eroski kernel: 348 pages of kernel stacks
Jun 11 11:27:44 eroski kernel: 0 lowmem pagetables, 562 highmem pagetables
Jun 11 11:27:44 eroski kernel: Free swap: 2040176kB
Jun 11 11:27:45 eroski kernel: 4192882 pages of RAM
Jun 11 11:27:45 eroski kernel: 29463 reserved pages
Jun 11 11:27:45 eroski kernel: 465186 pages shared
Jun 11 11:27:45 eroski kernel: 0 pages swap cached
Jun 11 11:27:45 eroski kernel: 82 pages in page table cache
Jun 11 11:27:45 eroski kernel: Buffer memory: 455824kB
Jun 11 11:27:45 eroski kernel: Cache memory: 8760672kB
Results on the NS4040 (4-way, 32 GB): The test started two days ago and is still running, i.e. a run of more than 41 hours, without all traces enabled. The HighFree and LowFree values are now stabilized and go up and down. System performance has slowed down, but the system seems stable.

Jun 9 17:51:45 30921008 1301648
Jun 9 17:52:15 30542112 1298704
Jun 9 17:52:45 30127408 1297824
Jun 9 17:53:15 29720000 1296976
Jun 9 17:53:45 29317440 1296016
Jun 11 11:04:44 25392 22464
Jun 11 11:06:53 27056 21616
Jun 11 11:09:01 27488 21344
Jun 11 11:11:08 27488 21392
Jun 11 11:13:06 26688 21792
Jun 11 11:15:45 26608 20944
Jun 11 11:17:42 24400 33424
Jun 11 11:20:18 24640 31936
Jun 11 11:22:41 28192 29952
Jun 11 11:24:53 26912 30016
As a reminder, results on the NS5160 (configured as 8-way, 32GB): dbgen ran for more than 48 hours without a crash.
Good and bad news:
****** GOOD ******
NS4040 (4-way, 32GB): still running (6 days), few traces on.
NS5160 (16-way, 64GB): still running (about 21 hours), few traces on.
****** BAD ******
NS5160 (16-way, 64GB) (same as above): crashed/hung over last weekend after a 5-hour run, few traces on.
Jun 11 15:18:42 63983232 1145136
...
Jun 11 20:23:50 60436976 932480
This hang/crash looks like the same problem we first got on 2004-06-08.
We definitely have another *hidden* problem somewhere. But this problem no longer seems to be directly linked to the *well-known low memory* bug. I'd like to close this defect and open another one in which we will focus on re-creating these unpredictable hangs/crashes.
ok, changing state to MODIFIED. Also, it's important that we try to reproduce the hang on the latest U3 candidate kernel, as the issue might already be addressed.
Where can we get this latest U3 kernel? The previous one was picked up from the ftp partners site (2.4.21-15.5.EL). Please give me a new pointer to get it. Also, we just received our NS6160, and it's currently being unpacked at our facilities. Hopefully, we should be able to use it soon.
here is a link to the latest U3 ia64 kernel, please test with this and let me know if there are any problems. thanks. http://people.redhat.com/~jbaron/.private/u3/ia64/
Got it. We will restart the dbgen test with this kernel on our NS5160 (16-way, 64GB). FYI, we got another crash/hang last night after (apparently) only 10 minutes ... It was with your previous kernel. We'll see with this new version.
Bad and very bad news! Directly from Claude, who performed the tests and sent the results to me (translated from French):
- kernel 2.4.21-15.5.EL: KO twice after 10 minutes of testing.
- kernel 2.4.21-15.11.EL: KO after 10 minutes of testing.
- I'm leaving the machine in that state until tomorrow morning.
It seems the settings were incomplete during the first test campaign, so fewer files than expected were created by dbgen. Claude restored them, a fully stressful test is now running, and ... the machine crashes/hangs in about 10 minutes each time. Same bad news with both kernels that were provided. Until now, our traces gave information every 5 minutes (the machine ran for several hours), but that's obviously not enough when the machine crashes/hangs after only 10 minutes. We'll set more frequent traces (about every 30 seconds) to try to better figure out what's going on.
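For reference, a sketch of the tightened trace loop, assuming the same script as before with only the interval reduced to 30 seconds:

echo 1 > /proc/sys/kernel/sysrq
> meminfo.txt
while :
do
    HighFree=`grep HighFree /proc/meminfo | awk '{print $2}'`
    LowFree=`grep LowFree /proc/meminfo | awk '{print $2}'`
    echo "`date` $HighFree $LowFree" >> meminfo.txt
    echo m > /proc/sysrq-trigger
    # 30-second interval so the last samples land close to the failure
    sleep 30
done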
: ( Okay, we'll wait for additional info. Can you also please send us (if this is a hang) the Alt-SysRq-t data? And anything else you can throw at us (/var/log/messages, etc.). Any chance there's a firmware component to what you're seeing? What SCSI adapters are you using?
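For what it's worth, if the machine is still reachable for a moment, the task dump can also be triggered from a shell rather than the console keyboard; a minimal sketch:

# enable the sysrq trigger, then dump all task backtraces to the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
# the output lands in the kernel log and /var/log/messages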
Created attachment 101210 [details] /var/log/messages after machine HANG (def 121029)
Some more information:
- the /var/log/messages file given in comment #63 comes from our NS5160 (16-way, 64GB) after the HANG using the dbgen test
- SCSI adapter: Adaptec
Content of /proc/pci:
Bus 4, device 1, function 0:
SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 1).
IRQ 54.
Master Capable. Latency=64. Min Gnt=40. Max Lat=25.
I/O at 0xc400 [0xc4ff].
Non-prefetchable 64 bit memory at 0xfa6fe000 [0xfa6fefff].
Bus 4, device 1, function 1:
SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (#2) (rev 1).
IRQ 55.
Master Capable. Latency=64. Min Gnt=40. Max Lat=25.
I/O at 0xc800 [0xc8ff].
Non-prefetchable 64 bit memory at 0xfa6ff000 [0xfa6fffff].
Created attachment 101217 [details] trace files after HANG (with "echo m > /proc/sysrq-trigger")
The attached compressed tar file contains trace files from the dbgen HANG on an NS5160. This last test was done with traces taken every 30s, including the "echo m > /proc/sysrq-trigger". The machine "broke" after 40 minutes. The tar file includes:
- meminfo.sh: the script that takes the traces
- meminfo.txt: output from meminfo.sh
- top.txt: result of the "top" command run during the test
- messages: /var/log/messages saved after rebooting the machine
Did you get a chance to look at the previous traces? Did you find anything wrong? As already mentioned, we have reproduced this problem several times now.
This bug can be set to "closed"; the remaining problem is tracked in bug 126998.
Thanks Pierre, closing now.
Thanks for confirming that the original problem has been resolved. The fix has already been committed to the RHEL3 U3 patch pool, but the bug should remain in MODIFIED state until the U3 errata is "pushed" (released) on RHN (at which time it will be set automatically to CLOSED/ERRATA).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html