Bug 429877

Summary: [Areca 5.2 bug] Update arcmsr to 1.20.00.15.RH 2007/12/24 (refreshed)
Product: Red Hat Enterprise Linux 5 Reporter: Andrius Benokraitis <andriusb>
Component: kernel Assignee: Tomas Henzl <thenzl>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 5.2 CC: andriusb, coughlan, eriley, jturner, nick.cheng
Target Milestone: rc Keywords: OtherQA
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2008-0314 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-05-21 15:07:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 363961    
Bug Blocks: 262201    

Description Andrius Benokraitis 2008-01-23 15:57:03 UTC
+++ This bug was initially created as a clone of Bug #363961 +++

-- Additional comment from dzickus on 2008-01-14 15:05 EST --
in 2.6.18-68.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

-- Additional comment from nick.cheng.tw on 2008-01-14 21:10 EST --
Hi Tomas,
About Comment #39: I found that if I use pci_alloc_consistent and then open the
Areca AP, memory usage increases gradually. If I use kmalloc, it does not.
This was previously reported by Falconstor and Pogolinux.
For Falconstor in particular, if their AP is left open, it uses up the memory
overnight and the system hangs.
We thought it only happened on a specific version, and since
pci_alloc_consistent() was also proposed by the kernel.org people we saw no
need to change it, but to my surprise it has happened again.

I checked the kernel source for pci_alloc_consistent and found that it has been
unchanged for a long time, so we guess the malfunction is in the memory
management or somewhere else in the kernel code.

This is why I decided this time to submit the change to Red Hat and kernel.org.
Most importantly, the problem hangs the system in long-term running.


 

-- Additional comment from nick.cheng.tw on 2008-01-14 23:38 EST --
Tomas,
Thanks for your patience.
I will explain the changes one by one and let you decide which are acceptable and which are not.
(1). fix the portability problems
[Description]
This fixes the endian issue. Without it, big-endian systems hang during I/O.

(2). fix the iomem release on type B
[Description]
This fixes the io-memory allocation and release issue on Type B. It could cause
a memory leak on a heavily loaded system and make the system unstable.

(3). add return -ENOMEM in case of ioremap() failing
[Description]
It fixes the system instability that occurs when the iomemory allocation fails.

(4). modify acb->devstate[i][j] as ARECA_RAID_GONE in the initial stage instead
of ARECA_RAID_GOOD in arcmsr_alloc_ccb_pool()
[Description]
This fixes the existing volumes' initial states. The wrong setting could lead to
a system hang.

(5). fix the assignment of arcmsr_cdb->Context as (unsigned long)arcmsr_cdb
[Description]
This fixes the SCSI command allocation to a predefined structure. It could make
the system unstable when the AP accesses this address.

(6). add the checking state of (outbound_intstatus &
ARCMSR_MU_OUTBOUND_HANDLE_INT) == 0 in arcmsr_handle_hba_isr()
[Description]
This fixes the interrupt routine's handling of abnormal controller interrupts.
Without it, the driver could not handle a shared IRQ, which leads to a system crash.

(7). fix the scsi error handling in arcmsr_polling_hbb_ccbdone()
[Description]
It fixes the host SCSI error handling callback for I/O on failed volumes.
Without it, continual I/O on a failed volume could hang the system.
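
Items (1), (3) and (6) above follow patterns that can be sketched in plain
userspace C. The sketch below is only an illustration and is not taken from the
arcmsr patch: every demo_ name, the DEMO_OUTBOUND_HANDLE_INT value, and the
DEMO_IRQ_ result codes are assumptions made for the example. The real mask
ARCMSR_MU_OUTBOUND_HANDLE_INT is defined in the driver headers, and in the
kernel the roles shown here are played by cpu_to_le32(), ioremap() and the
IRQ_NONE/IRQ_HANDLED return values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask value, chosen only for this example. */
#define DEMO_OUTBOUND_HANDLE_INT 0x0Du

enum demo_irq_result { DEMO_IRQ_NONE = 0, DEMO_IRQ_HANDLED = 1 };

/*
 * Item (6): on a shared IRQ line the handler is also called for interrupts
 * raised by other devices. If none of the bits this controller cares about
 * are set, report "not ours" instead of processing anything.
 */
enum demo_irq_result demo_handle_isr(uint32_t outbound_intstatus)
{
    if ((outbound_intstatus & DEMO_OUTBOUND_HANDLE_INT) == 0)
        return DEMO_IRQ_NONE;

    /* A real driver would acknowledge and service the interrupt here. */
    return DEMO_IRQ_HANDLED;
}

/*
 * Item (3): if mapping the register window failed (modelled here as a NULL
 * base address), stop with -ENOMEM rather than continuing with a bad pointer.
 */
int demo_check_mapping(const void *mapped_base)
{
    if (mapped_base == NULL)
        return -ENOMEM;
    return 0;
}

/*
 * Item (1): write a 32-bit value in little-endian byte order regardless of
 * the host CPU's endianness, by extracting bytes with shifts.
 */
void demo_put_le32(uint8_t out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}
```

Because demo_put_le32() never reinterprets the integer's in-memory layout, it
produces the same bytes on little-endian and big-endian hosts.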


-- Additional comment from thenzl on 2008-01-15 04:47 EST --
Nick,
thanks for the explanation. (No more questions now.)

Please do not forget to create a bugzilla for 4.7, even if you do not have the
patch right now.




-- Additional comment from thenzl on 2008-01-15 11:06 EST --
Nick,
I've made minor changes to your previous patches; it would help me if you could
create the patch against our latest sources. I'm going to send the files to you
separately asap.
Btw. I could swear that I've seen this part somewhere, but now it has magically
vanished from the patch:
-		ver_addr = pci_alloc_consistent(acb->pdev, 1032, &buf_handle);
-		if (!ver_addr) {
+		tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
+		ver_addr = (unsigned long *)tmp;
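
The hunk quoted above swaps pci_alloc_consistent() for kmalloc() on a 1032-byte
buffer. As a userspace analogy only, with malloc()/free() standing in for
kmalloc()/kfree(), the usual shape of such a path is: allocate, check for
failure, use the buffer, and release it on every exit. The function name
demo_stage_message and its error codes are hypothetical and do not come from
the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUF_LEN 1032 /* buffer size quoted in the diff */

/* Copy a payload into a freshly allocated staging buffer, then release it. */
int demo_stage_message(const void *payload, size_t len)
{
    unsigned char *buf;

    assert(payload != NULL || len == 0);

    if (len > DEMO_BUF_LEN)
        return -EINVAL;          /* payload would not fit */

    buf = malloc(DEMO_BUF_LEN);  /* stands in for kmalloc(1032, ...) */
    if (buf == NULL)
        return -ENOMEM;          /* allocation failed: stop here */

    memset(buf, 0, DEMO_BUF_LEN);
    if (len > 0)
        memcpy(buf, payload, len);

    /* ... a driver would hand buf to its consumer here ... */

    free(buf);                   /* stands in for kfree(); pairs with malloc */
    return 0;
}
```

Pairing every successful allocation with exactly one release is what keeps
memory usage flat when a path like this runs repeatedly.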

Comment 1 Andrius Benokraitis 2008-01-23 16:00:28 UTC
This is a continuation of the discussion from bug 363961. The proposed patch
should be posted in this bugzilla.

Comment 2 RHEL Program Management 2008-01-23 16:17:11 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Andrius Benokraitis 2008-01-23 16:31:40 UTC
Tomas, I'm assuming there is a version bump in this as well? If so could you
update the summary line to reflect that?

Comment 4 Martin Jenner 2008-01-23 18:25:38 UTC
Need the usual bullet items for an exception request post beta kernel:

1) business justification

So it seems there is a real bug hanging the system, but this request is for a
complete rebase, and for the patch pointed to by bug 363961 a wc -l shows
some 3000+ lines of code.

2) patch status, is patch upstream
3) the final patch attached to the bz or a clear pointer to said patch
4) testing status
5) test plan

QE can't ack without this data.

Comment 6 Tom Coughlan 2008-01-23 22:48:07 UTC
Martin,

Based on discussions Tomas had with the developer, I do believe that these fixes
are important, and that we should take them. Especially if we are able to get
this in the beta kernel. Chip has acked the patch on rhkernel-list. Thanks for
the qa_ack, and thanks to all for moving quickly on this very late patch. 

Tom 

Comment 7 Jay Turner 2008-01-24 09:43:35 UTC
QE nack to bringing in these changes during beta.  There's too much change
involved to bring in after feature-complete date.  Would be a tremendous
validation burden on QE.

Comment 8 Jay Turner 2008-01-24 09:56:32 UTC
OK, maybe I was a bit hasty in comment 7.  The latest 5.2-candidate kernel
(-72.el5) includes arcmsr-1.20.00.15.RH . . . which appears to cause legitimate
system hangs.  What's the scope of the patch being proposed here to fix those?
If this request is really "fix some bugs present in 1.20.00.15.RH" then I'm fine
with that coming in during beta . . . that's the whole point of beta.  A complete
rebase of the driver?  That's a different matter entirely.

Comment 10 Tomas Henzl 2008-01-24 10:28:47 UTC
(In reply to comment #8)
> OK, maybe I was a bit hasty in comment 7.  The latest 5.2-candidate kernel
> (-72.el5) includes arcmsr-1.20.00.15.RH . . . which appears to cause legitimate
> system hangs.  What's the scope of the patch being proposed here to fix those?
> If this request is really "fix some bugs present in 1.20.00.15.RH" then I'm fine
> with that coming in during beta . . . that's the whole point of beta.  A complete
> rebase of the driver?  That's a different matter entirely.
I'm not aware that the previous patch is causing system hangs etc., so this
patch doesn't fix problems in our latest version.
It has more than 1000 lines, but most of them are simple type changes,
so I think we shouldn't talk about a complete rebase.






Comment 11 Don Zickus 2008-01-24 16:09:10 UTC
in 2.6.18-74.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 Tomas Henzl 2008-01-24 16:57:31 UTC
Nick,
now it's your turn: please install and test the patched kernel (2.6.18-74.el5)
thoroughly.

Comment 14 Nick Cheng 2008-02-01 10:02:17 UTC
Tomas,
I have run the test for a whole week.
So far so good.


Comment 15 Tomas Henzl 2008-02-01 14:01:42 UTC
Nick,
thanks for the cooperation; I hope this is finished for now.

Comment 18 errata-xmlrpc 2008-05-21 15:07:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html