Description of problem:
This problem was seen on a 64-CPU ia64 system running the HP proprietary "hazard" test suite. The storage was ~80 72GB LUNs on MSA1000 arrays spread across 8 QLogic FC controllers. The default QLogic driver shipped with RHEL was used. No data corruption or other issues causing the test to fail were seen, but the "Badness" debug message appeared approximately once every 15 minutes. I have found 14 unique stack traces; all give the same error but took different paths to get there. I will attach a text file containing all of the unique traces.

Version-Release number of selected component (if applicable):
kernel-2.6.9-34.EL

How reproducible:
I ran this for a 48-hour test and saw the message regularly. The storage I am using is currently borrowed, so I may not be able to recreate the configuration needed to reproduce. If this is needed, please let me know and I will see if I can borrow storage again.

Steps to Reproduce:
1. Obtain a 64-CPU system.
2. Obtain the HP proprietary "hazard" test suite.
3. Obtain a TON of storage.
4. Run hazard with the -c3 option (filesystem only).

Actual results:

Expected results:

Additional info:
Created attachment 126054 [details] various unique stack traces seen
FYI, I am now able to reproduce this on a much smaller system. I have a 4-CPU ia64 system in my private rack in the Red Hat lab connected to a single MSA1000. I am able to hit these stack traces (although not nearly as often as on the 64-CPU system with 8 MSA1000s).
I filed this quite some time back when I was the only one seeing it; however, we are now seeing this more often in other testing inside HP. It is no longer seen only on massive systems like the one I originally reported it on, so I am increasing the severity. It has been reported to be easily reproduced on a 2-socket dual-core ia64 system. Here is a stack trace as seen on the RHEL4 U4 partner beta:

VFS: brelse: Trying to free free buffer
Badness in __brelse at fs/buffer.c:1372

Call Trace:
 [<a000000100016da0>] show_stack+0x80/0xa0
                                sp=e00000003d997940 bsp=e00000003d991058
 [<a000000100016df0>] dump_stack+0x30/0x60
                                sp=e00000003d997b10 bsp=e00000003d991040
 [<a000000100129990>] __brelse+0xd0/0x100
                                sp=e00000003d997b10 bsp=e00000003d991020
 [<a0000002001de770>] __try_to_free_cp_buf+0x1b0/0x220 [jbd]
                                sp=e00000003d997b10 bsp=e00000003d990ff0
 [<a0000002001de930>] __journal_clean_checkpoint_list+0x150/0x180 [jbd]
                                sp=e00000003d997b10 bsp=e00000003d990f98
 [<a0000002001d9090>] journal_commit_transaction+0x6d0/0x3080 [jbd]
                                sp=e00000003d997b10 bsp=e00000003d990ea0
 [<a0000002001e18d0>] kjournald+0x170/0x580 [jbd]
                                sp=e00000003d997d80 bsp=e00000003d990e38
 [<a000000100018c70>] kernel_thread_helper+0x30/0x60
                                sp=e00000003d997e30 bsp=e00000003d990e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e00000003d997e30 bsp=e00000003d990e1
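For anyone triaging: the "Badness" line comes from a sanity check in `__brelse()` in fs/buffer.c. A buffer_head carries a reference count (`b_count`), and releasing a buffer whose count is already zero is a double-release, which is exactly what the jbd checkpoint path in the trace above appears to be doing. Below is a minimal user-space model of that check, a sketch only (the names `buffer_head_model` and `model_brelse` are made up for illustration; the real kernel uses `atomic_t` and `printk`):

```c
#include <stdio.h>

/* Simplified model of the buffer_head reference count and the
 * __brelse() sanity check, paraphrased from 2.6-era fs/buffer.c.
 * Not the real kernel structure or function. */
struct buffer_head_model {
    int b_count;            /* reference count (atomic_t in the kernel) */
};

/* Returns 1 on a valid release (count was > 0 and is decremented),
 * 0 on a double-release, which is the condition that triggers
 * "VFS: brelse: Trying to free free buffer" plus the Badness dump. */
int model_brelse(struct buffer_head_model *bh)
{
    if (bh->b_count > 0) {
        bh->b_count--;
        return 1;
    }
    fprintf(stderr, "VFS: brelse: Trying to free free buffer\n");
    return 0;
}
```

So the traces do not point at where the bug is, only at where it is detected: some earlier path dropped its reference (or freed the buffer) without the checkpoint code knowing, and the second release trips the check.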
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this enhancement by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This enhancement is not yet committed for inclusion in an Update release.
Er... the obvious question: can you reproduce it with a different controller? I.e., is this a memory corruptor in SCSI that happens to hit the buffer cache under that specific load, or is it a bug in fs/buffer.c and/or VM and/or fs code? And is it dependent on the filesystem type, while we are at it?
Alexander,

The one common card in all of the systems we have seen this on is a QLogic 4GB Fibre Channel card. I have asked people back in HP to see if they can reproduce this with other cards.

Did you intend to remove the issue tracker link when you updated this BZ? You removed IT 96777 on your last update.
Looks like a duplicate of bug 168301.

*** This bug has been marked as a duplicate of 168301 ***
Pulling in the ack from bug 168301.