+++ This bug was initially created as a clone of Bug #363961 +++

-- Additional comment from dzickus on 2008-01-14 15:05 EST --
in 2.6.18-68.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

-- Additional comment from nick.cheng.tw on 2008-01-14 21:10 EST --
Hi Tomas,
About Comment #39, I found that if I use pci_alloc_consistent and then open the Areca AP, the memory usage increases gradually. But if I use kmalloc, it does not. This was previously reported by Falconstor and Pogolinux. Especially for Falconstor, if their AP is open, it will use up the memory overnight and the system will hang.
We thought it only happened on a specific version, and since pci_alloc_consistent() had also been proposed by the kernel.org developers, we saw no need to change it; but to my surprise it happened again. I checked the kernel source for pci_alloc_consistent and found it has been unchanged for a long time, therefore we guess the malfunction is in memory management or somewhere else in the kernel code.
This is why I decided this time to submit the change to Red Hat and kernel.org. Most importantly, it will hang the system in long-term running.

-- Additional comment from nick.cheng.tw on 2008-01-14 23:38 EST --
Tomas,
Thanks for your patience. I will explain the changes one by one and let you decide which are acceptable and which are not.
(1). fix the portability problems
     [Description] This fixes the endian issue, which hangs big-endian systems during I/O.
(2). fix the iomem release on type B
     [Description] This fixes the io-memory allocation and release issue on Type B. It could cause a memory leak on a heavily loaded system and make the system unstable.
(3). add return -ENOMEM in case of ioremap() failing
     [Description] This fixes system instability when io-memory allocation fails.
(4). modify acb->devstate[i][j] as ARECA_RAID_GONE in the initial stage instead of ARECA_RAID_GOOD in arcmsr_alloc_ccb_pool()
     [Description] This fixes the existing volumes' initial states. The wrong setting could lead to a system hang.
(5). fix the assignment of arcmsr_cdb->Context as (unsigned long)arcmsr_cdb
     [Description] This fixes the scsi command allocation to a predefined structure. It could make the system unstable when the AP accesses this address.
(6). add the checking state of (outbound_intstatus & ARCMSR_MU_OUTBOUND_HANDLE_INT) == 0 in arcmsr_handle_hba_isr()
     [Description] This fixes the interrupt routine's handling of abnormal controller interrupts. Without it, the driver cannot handle a shared IRQ, which leads to a system crash.
(7). fix the scsi error handling in arcmsr_polling_hbb_ccbdone()
     [Description] This fixes the host scsi error handling callback in case of I/O on failed volumes. Without it, continual I/O on a failed volume could hang the system.

-- Additional comment from thenzl on 2008-01-15 04:47 EST --
Nick, thanks for the explanation. (No more questions now.)
Please do not forget to create a bugzilla for 4.7, even without having the patch right now.

-- Additional comment from thenzl on 2008-01-15 11:06 EST --
Nick,
I've made minor changes to your previous patches; it would help me if you could create the patch against our latest sources. I'm going to send the files to you separately asap.
Btw. I could swear that I've seen this part somewhere, but now it has magically vanished from the patch:
-	ver_addr = pci_alloc_consistent(acb->pdev, 1032, &buf_handle);
-	if (!ver_addr) {
+	tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
+	ver_addr = (unsigned long *)tmp;
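For readers following item (6) in Nick's list above, here is a minimal user-space sketch of the idea: on a shared IRQ line the handler first checks whether any of its own status bits are set, and reports "not mine" if none are. It is not the arcmsr code. The struct, the function name sketch_handle_hba_isr() and the bit values are hypothetical stand-ins; the real mask is ARCMSR_MU_OUTBOUND_HANDLE_INT, and a real handler would read the status from the controller register and return IRQ_NONE or IRQ_HANDLED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. In the driver the
 * combined mask is ARCMSR_MU_OUTBOUND_HANDLE_INT. */
#define SKETCH_DOORBELL_INT  0x04u
#define SKETCH_POSTQUEUE_INT 0x08u
#define SKETCH_HANDLE_INT    (SKETCH_DOORBELL_INT | SKETCH_POSTQUEUE_INT)

/* Stand-in for one controller; outbound_intstatus models the
 * interrupt status register that the real ISR reads from hardware. */
struct sketch_hba {
    uint32_t outbound_intstatus;
    unsigned doorbell_events;
    unsigned postqueue_events;
};

/* Returns false when none of our bits are set, so another device
 * sharing the IRQ line gets a chance (the IRQ_NONE case); returns
 * true after servicing our own bits (the IRQ_HANDLED case). */
static bool sketch_handle_hba_isr(struct sketch_hba *hba)
{
    uint32_t status;

    assert(hba != NULL);
    status = hba->outbound_intstatus;

    /* The check from item (6): nothing for us, leave it untouched. */
    if ((status & SKETCH_HANDLE_INT) == 0)
        return false;

    /* Acknowledge only the bits we are about to service. */
    hba->outbound_intstatus &= ~(status & SKETCH_HANDLE_INT);

    if (status & SKETCH_DOORBELL_INT)
        hba->doorbell_events++;
    if (status & SKETCH_POSTQUEUE_INT)
        hba->postqueue_events++;

    return true;
}
```

Without the early return, a handler on a shared line would treat every interrupt as its own, including those raised by other devices.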
This is a continuation of the discussion from bug 363961. The proposed patch should be posted in this bugzilla.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Tomas, I'm assuming there is a version bump in this as well? If so could you update the summary line to reflect that?
Need the usual bullet items for an exception request post beta kernel:
1) business justification
   So it seems there is a real bug hanging the system, but this request is for a complete rebase, and looking at the patch pointed to by bug 363961, a wc -l shows some 3000+ lines of code.
2) patch status: is the patch upstream?
3) the final patch attached to the bz, or a clear pointer to it
4) testing status
5) test plan
QE can't ack without this data.
Martin, Based on discussions Tomas had with the developer, I do believe that these fixes are important, and that we should take them. Especially if we are able to get this in the beta kernel. Chip has acked the patch on rhkernel-list. Thanks for the qa_ack, and thanks to all for moving quickly on this very late patch. Tom
QE nack to bringing in these changes during beta. There's too much change involved to bring in after feature-complete date. Would be a tremendous validation burden on QE.
OK, maybe I was a bit hasty in comment 7. The latest 5.2-candidate kernel (-72.el5) includes arcmsr-1.20.00.15.RH . . . which appears to cause legitimate system hangs. What's the scope of the patch being proposed here to fix those? If this request is really "fix some bugs present in 1.20.00.15.RH" then I'm fine with that coming in during beta . . . that's the whole point of beta. A complete rebase of the driver? That's a different matter entirely.
(In reply to comment #8)
> OK, maybe I was a bit hasty in comment 7. The latest 5.2-candidate kernel
> (-72.el5) includes arcmsr-1.20.00.15.RH . . . which appears to cause legitimate
> system hangs. What's the scope of the patch being proposed here to fix those?
> If this request is really "fix some bugs present in 1.20.00.15.RH" then I'm fine
> with that coming in during beta . . . that's the whole point of beta. A complete
> rebase of the driver? That's a different matter entirely.

I'm not aware that the previous patch is causing system hangs etc., so this patch doesn't fix problems in our latest version. It has more than 1000 lines, but most of them are simple type changes, so I think we shouldn't talk about a complete rebase.
in 2.6.18-74.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Nick, now it's your turn; please install and test the patched kernel (2.6.18-74.el5) thoroughly.
Tomas, I have run the test for a whole week. So far so good.
Nick, thanks for the cooperation. I hope this is finished for now.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html