Bug 128788
Field | Value
---|---
Summary | RHEL3 U6: Diskdump support for Compaq Smart Array Controllers (cciss)
Product | Red Hat Enterprise Linux 3
Component | kernel
Version | 3.0
Status | CLOSED ERRATA
Severity | medium
Priority | medium
Hardware | All
OS | Linux
Reporter | Robert Proffitt <rproffit>
Assignee | Tom Coughlan <coughlan>
CC | dff, dgregor, ltroan, lwang, masanari_iida, mike.miller, ntachino, peterm, petrides, rperkins, rproffit, sabdelg, tao, tburke
Keywords | FutureFeature
URL | IT_48801(HP-Proliant), IT_48773(HP-GSE)
Whiteboard | Kernel
Fixed In Version | RHSA-2005-663
Doc Type | Enhancement
Last Closed | 2005-09-28 14:25:02 UTC
Bug Blocks | 156320
Description
Robert Proffitt
2004-07-29 16:14:19 UTC
My note to Brian Baker dated Wed, 08 Sep 2004 14:54:35:
Brian,
Rick Beldin from HP has requested diskdump functionality on RHEL3.
RH Engineering asks if HP Engineering can provide a patch for the cciss
driver. See Issue Tracker 48801 for details.
Larry
On Wed, 2004-09-08 at 13:49, Robert Perkins wrote:
> Hi Larry,[edited]
> Tim asks if HP would be willing to assist Red Hat in coding up support
> for diskdump for the CCISS driver.
>
> Can you please ask HP if they are interested, and if so, which
> architectures are most important to them?
>
> Thanks,
> Rob
My note to Brian and Mike dated 09/15/2004 at 04:20:17 PM:
Mike, Brian,
Engineering needs to know if HP plans to provide the code to support
diskdump on the cciss adapter.
We can offer guidance. RHEL3 U3 has some adaptec drivers supporting this
function.
Larry
> RHEL3 U4: Diskdump support for Compaq Smart Array Controllers (cciss)
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128788
> Requested by: rprofitt
> Assigned to: ltroan
> Status: NEEDINFO
> Upshot: Larry was going to ask HP to help on it; consequently kernel team has
> not done any work on it; U4 kernel code freeze next week!
From Brian at HP:
Mike is investigating this now.
WHO CAN MIKE CONTACT WITH QUESTIONS ON THIS ???
From email exchanges.....
From Mike Miller:
Any guidance is welcomed. It will speed up the process.
I have an idea where to look, I think the adaptec driver is supposed to support
diskdump, right?
mikem
> -----Original Message----
> From: Baker, Brian (ISS - Houston)
> Sent: Wednesday, September 15, 2004 3:55 PM
> To: 'Larry Troan'; Miller, Mike (OS Dev)
> Cc: Hoffert, Maureen B (WW Linux Engr LPMO); Pherigo, Suzanne S
> Subject: RE: RHEL3 U4: Diskdump support for Compaq Smart Array
> Controllers(cciss) - ref: IT_48801
>
>
> Mike is investigating this now.
>
> -----Original Message-----
> From: Larry Troan [ltroan]
> Sent: Wednesday, September 15, 2004 3:54 PM
> To: Baker, Brian (ISS - Houston); Miller, Mike (OS Dev)
> Cc: Hoffert, Maureen B (WW Linux Engr LPMO); Pherigo, Suzanne S
> Subject: RHEL3 U4: Diskdump support for Compaq Smart Array
> Controllers(cciss) - ref: IT_48801
>
> Mike, Brian,
>
> Engineering needs to know if HP plans to provide the code to support
> diskdump on the cciss adapter.
>
> We can offer guidance. RHEL3 U3 has some adaptec drivers
> supporting this function.
>
> Larry
>
>
> RHEL3 U4: Diskdump support for Compaq Smart Array
> Controllers (cciss)
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128788
> > Requested by: rprofitt
> > Assigned to: ltroan
> > Status: NEEDINFO
> > Upshot: Larry was going to ask HP to help on it; consequently
> > kernel team has not done any work on it; U4 kernel code freeze
> > next week!
> >
Supported adapters in RHEL3 U3 are aic7xxx, aic79xx and mptfusion. We suggest that HP follow the examples of the other diskdump-enabled drivers listed above and apply the same approach to the driver they are interested in. The disk dump interfaces are very well isolated and easy to identify in the driver. After doing that, conduct extensive testing and have convincing test results accompany the patch proposal.

Reopening for U5 consideration.

My response to Brian Baker at HP:
Tom is the contact point for Mike -- Tom volunteered :-)
Larry
On Wed, 20 Oct 2004 15:48, Brian Baker wrote:
Hi,
Sorry, I was waiting on a response from Mike on this. Larry offered up Tom Coughlin (I believe) to aid in getting this work done. We are targeting Update 5 for this to be included.
Thanks,
Brian.
-----Original Message-----
From: Don Langley [dlangley]
Sent: Tuesday, October 19, 2004 6:44 PM
To: Fernandez, Michael
Cc: Baker, Brian (ISS - Houston); Don Langley; ltroan; Elder,Scott C (Linux Americas); Hansen, Eric; Robert Proffitt
Subject: Diskdump driver for Cisco
Mike,
We have not heard back from Brian Baker yet, and were wondering if you can help. Please read the email below. (We also appreciate that it is end of fiscal year for you on 10/31.)
Mike Miller of HP is the driver maintainer. We need diskdump support in the cciss driver. They are ready to buy 600 more ProLiants!
Thanks for what you can find out!
Don
---
Larry Troan referred me to you regarding Cisco's request to directly contact HP engineers regarding addition of the RHEL 3 diskdump feature to HP's cciss driver.
In July, Phil Wolf and Brian Long of Cisco (my account) requested the feature be added to cciss drivers in a near-term update to RHEL 3. Cisco is aware that feature development has been reassigned to HP and has asked for HP contacts in your team. Phil and Brian have characterized diskdump as a "critical" requirement to support their 600+ ProLiant servers deployed to run RHEL 3.
May I refer Phil and Brian to you, Mike Miller, or any other HP Engineering contact?
--
Robert Proffitt, RHCE
Red Hat, Inc.
Enterprise Solutions Architect
719.487.9236 Office
Red Hat, Inc.
719.331.3708 Cell
------------> http://www.redhat.com/ <------------

cciss driver code from HP needs to be at Red Hat on 11/17 for inclusion in U5 -- per Rob Perkins.
Setting status to NEEDINFO in anticipation of new driver code from Mike at HP.

Reassigning this to Kevin Anderson in anticipation he will assign to Tom or other kernel engineer.

Larry, it is on you to get HP to get the code to us by the deadline, right?

Yes. What is the deadline? I have a tentative date of 11/17......

HP, I need a patch by COB Friday. Otherwise, it's a U6 candidate.

From Issue Tracker:
-----Event posted 11-18-2004 05:56pm by mike.miller with duration of 0.00
This will have to be a U6 candidate. We are working on the support, but it cannot be completely tested in time to make the U5 schedule.
------Event posted 12-09-2004 09:30am by brian.b with duration of 0.00
Is Cisco the only one asking for this? We're trying to understand best how to approach release and support of this.
Status set to: Waiting on Tech
------Event posted 12-09-2004 12:30pm by cww with duration of 0.10
I have at least 3 customers in the HP-GSE queue asking for diskdump support for cciss drivers, particularly on Itanium servers where there is no netdump. It would make GSS's life much easier.

Require updated code from HP 4/20 for U6. Please let me know if this is a problem.

PM ACK for U6

Eng. ACK, assuming PM is okay with the risk that it may not be in RHEL 4 in time for some customer migrations.

*** Bug 156989 has been marked as a duplicate of this bug. ***

Converting from FZ to BZ. Propagating devel/pm/qa acks from bug 156989.

Which patch is HP testing, their original patch or my version? When will they finish the test?

Created attachment 116028
cciss block dump patches
This is what HP sent me, after we rejected their first two attempts because
they broke KABI.
I was not aware of another version.
I have built this and have started some basic tests. Please review and let me
know if this version is correct.
Created attachment 116029
linux-2.4.21-diskdump-cciss.patch
I received the review request of BZ#156989 from Peter Martuccelli. I found that it is the same as the patch which you sent me. I thought I could solve the kABI problem, so I created a new patch. The source, kernel and utilities are placed on http://people.redhat.com/aimamura/.hp to be tested by HP. I expect Larry Troan passed them to HP.
Please note that I don't have a cciss adapter and have not tested the patch. If the patch does not work, HP must fix it.

Okay then, we need confirmation from HP immediately that Nobuhiro's patch is acceptable, and that HP has completed the disk dump test plan using his patch. We are running out of time for U6.

Per comment #43 above, Engineering is requesting that HP verify the revised patch which doesn't break kABI.
Note that comments 43, 44, 45 are in reverse order due to the fact that Issue Tracker displays in reverse order by default and this is therefore the way the IT comments are posted to Bugzilla.
Setting status to NEEDINFO.

> Per comment #43 above, Engineering is requesting that HP verify the revised
> patch which doesn't break kABI.
>
> Note that comments 43, 44, 45 are in reverse order due to the fact that Issue
> Tracker displays in reverse order by default and this is therefore the way the
> IT comments are posted to Bugzilla.
I am assigned to this BZ#, but I cannot see private comments 43, 44, 45 because I am not a Red Hat employee. If 43, 44, 45 have information which I should know, please make them public. Thank you.

Comments 43 and 44 are requesting HP test results from your IT-generated post (comment #45).

These changes made by ntachino.
Bugzilla comment added:
Created an attachment (id=116029)
linux-2.4.21-diskdump-cciss.patch
I received the review request of BZ#156989 from Peter Martuccelli. I found that it is the same as the patch which you sent me. I thought I could solve the kABI problem, so I created a new patch. The source, kernel and utilities are placed on http://people.redhat.com/aimamura/.hp to be tested by HP. I expect Larry Troan passed them to HP.
Please note that I don't have a cciss adapter and have not tested the patch. If the patch does not work, HP must fix it.

From User-Agent: XML-RPC
1. The cciss driver is now dependent on diskdump being present for it to load. The purpose behind having a mid-level driver was to offload these dependencies from the low-level drivers, similar to how scsi_dump does for the scsi drivers.
2. The mid-level driver has been changed to only do the registration of the cciss driver dump operations. At this point, if the dependencies are already in the low-level driver, there is no need for a mid-level driver at all.
3. The sources appear to be based on an early beta patch with elements of the newer patch thrown in. The patches to be used should be the ones attached to issue 71038, which are also the same as the ones sent to Larry Troan. These patches remove the dependencies from cciss and any other block driver that wishes to use them, and they do not break the KABI. I am attaching the patches to this issue tracker as well.
Internal Status set to 'Waiting on Customer'
File uploaded: release10.zip
This event sent from IssueTracker by ltroan
issue 48801
it_file 42143

Per above response from Brian Baker at HP....

(In reply to comment #49)
> 1. The cciss driver is now dependent on diskdump being present for it to
> load. The purpose behind having a mid-level driver was to off load these
> dependencies from the low-level drivers similar to how scsi_dump does for
> the scsi drivers.
>
> 2. The mid-level driver has been changed to only do the registration of
> the cciss driver dump operations. At this point if the dependencies are
> already in the low-level driver there is no need for a mid-level driver at
> all.
How does the cciss driver depend on diskdump?
Even with my patch, the cciss driver should work without the diskdump module if the customer does not choose to use diskdump. My patch introduces a new module, cciss_dump, which works as the bridge between the cciss driver and the diskdump module. It depends on both the cciss driver and the diskdump module. In effect, the cciss_dump module has the same role as the block_dump module.
> 3. The sources appear to be based on an early beta patch with elements of
> the newer patch thrown in. The patches to be used should be the ones
> attached to issue 71038
> which are also the same as the ones sent to Larry
> Troan. These patches remove the dependencies from cciss and any other
> block driver that wishes to use them as well as not breaking the KABI. I
> am attaching the patches to this issue tracker as well.
Yes, I agree. The newer patch seems to have no kABI problem.
I think the two patches do the same thing in different ways. HP's patch is more generic because the block_dump module can be used by other block devices, but block_dump assumes the cciss driver has a hidden data structure for diskdump after block_device_operations, and that is a little bit ugly. My patch is not generic, but it is more compact than HP's patch. I don't have a strong opinion on which to choose. I need comments from other developers.

In my opinion, the more generic approach is probably not worth the price. It touches more places in the cciss driver, so it will be more difficult to port to future versions of the cciss driver. The only other driver in block that is a candidate to use the generic approach is DAC960. It is not likely that we will need disk dump for that in RHEL 3. The generic approach might be good to revisit for RHEL 5, when we can change KABI, if we are still doing disk dump the same way in RHEL 5.
Also, I like Nobuhiro's patch better because it follows the Linux coding conventions more closely.

Any agreement at HP on this?

*** Bug 159938 has been marked as a duplicate of this bug. ***

I have been testing the patch that HP provided. After I forced a crash, I see the stack trace on the console. Then I see these messages:
CPU frozen: #1#2#3#4#5#6#7
CPU#0 is executing diskdump.
start dumping
<4>cciss cciss0: sendcmd Error 1
<4>cciss cciss0: sendcmd offensive info size 0 num 0 value 0
<1>cciss0: Error flushing cache
check dump partition...
dumping memory..
<0>halt
In at least one case, I believe that the dump was written, but these cciss error messages raise concern. Has HP seen these errors during their cciss disk dump testing?

Opening up Bugzilla so HP can see it.

Patch posted for review and inclusion in RHEL3 U6.

Diskdump-over-CCISS support has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-33.EL).

Created attachment 117062
Latest HP patch
The latest "patch" is binary data?

Has anyone successfully tested diskdump using the U6 beta kernel (2.4.21-34.EL)? I get everything set up, apparently in a fashion that _should_ work, but it fails to work at all. After starting the diskdump service, the device and diskdump are in this state:
saias11 / 12# cat /proc/diskdump
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# total_blocks: 655286
#
/dev/cciss/c0d0p6 23052001 12280799
saias11 / 13# /etc/init.d/diskdump devicestatus
/dev/cciss/c0d0p6
status : formatted
block size : 1535099
version : 1-1.1.7
sample rate: 8
block order: 2
format size: 655286
note : parameters satisfied
However, when I panic the machine ("echo c > /proc/sysrq-trigger"), I get this on the console:
CPU frozen: #0
CPU#1 is executing diskdump.
start dumping
<4>cciss cciss0: SendCmd Invalid command list address returned! (9f5c0000)
<1>cciss: read capacity failed
<4>cciss cciss0: SendCmd Invalid command list address returned! (9f5c024c)
<1>cciss: read capacity failed
<3>disk_dump: No sane dump device found
And that's all.

It worked for me:
# cat /proc/diskdump
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# total_blocks: 130980
#
/dev/cciss/c1d0p1 32 106577728
# echo c > /proc/sysrq-trigger
CPU frozen: #0#1#2#4#5#6#7
CPU#3 is executing diskdump.
start dumping
check dump partition...
dumping memory..
<0>halt
(reboot)
...
Saving panic dump: /dev/cciss/c1d0p1: [100.0%] [ OK ]
Formatting dump device: /dev/cciss/c1d0p1: [100.0%] [ OK ]
Starting diskdump: [ OK ]
-----
This was with 2.4.21-34.ELsmp, and the following cciss device:
kernel: HP CISS Driver (v 2.4.56.RH1)
kernel: blk: queue c0502020, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
kernel: blocks= 106577760 block_size= 512
kernel: heads= 255, sectors= 32, cylinders= 13061 RAID ADG
kernel:
kernel: blocks= 106577760 block_size= 512
kernel: heads= 255, sectors= 32, cylinders= 13061 RAID 0
kernel:
kernel: blk: queue c05020f0, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
kernel: cciss/c1d0: p1
kernel: cciss/c1d1: unknown partition table

I'll have to check on the cciss hardware details, if needed.

Do you get any errors when you try to do normal I/O to the cciss partition you want to use for dump? I'll have to dig into those "SendCmd Invalid" errors you are getting.

In my lab, I have encountered a similar failure with a SmartArray 5i.
# uname -r
2.4.21-34.ELsmp
# dmesg | grep CISS
HP CISS Driver (v 2.4.56.RH1)
# cat /proc/diskdump
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# total_blocks: 262052
#
/dev/cciss/c0d1p1 32 2203168
# /etc/init.d/diskdump devicestatus
/dev/cciss/c0d1p1
status : formatted
block size : 275396
version : 1-1.1.7
sample rate: 8
block order: 2
format size: 262052
note : parameters satisfied
------------ Failed pattern 1 ------------------
CPU frozen: #0
CPU#1 is executing diskdump.
start dumping
<4>cciss cciss0: SendCmd Invalid command list address returned! (37940000)
<1>cciss: read capacity failed
check dump partition...
<4>cciss cciss0: sendcmd Error 4
<4>cciss cciss0: sendcmd offensive info size 0 num 0 value 0
<3>disk_dump: bad signature in block 3
<3>disk_dump: check partition failed.
<0>halt
-------- Failed pattern 2 -------------------
CPU frozen: #1
CPU#0 is executing diskdump
start dumping
<4>cciss cciss0: SendCmd Invalid command list address returned! (37940000)
<1>cciss: read capacity failed
<4>cciss cciss0: SendCmd Invalid command list address returned! (37940030)
<1>cciss: read capacity failed
<3>disk_dump: No sane dump device found
<0> kernel panic : Fatal exception
-----------------

Additional information.
Now diskdump succeeds intermittently.
It works if I use a kernel module to make the system panic. (The kernel module calls panic().)
It doesn't work if I use #echo c > sysrq-trigger.
Thanks
Masanari

Similarly, it works if I crash the system with the crash.c module compiled and insmodded (/usr/share/doc/netdump-0.6.11/crash.c), but not through /proc/sysrq-trigger.

I'm putting this bug into FAILS_QA state until Tom's latest fix is committed to the next RHEL3 U6 respin (which will occur this week).

A fix for the problem found during Q/A has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-35.EL).

Was there a respin of the RHEL3 U6 beta ISOs? Md5sums on RHN still match the ISOs I downloaded on 8/11.

Don Fischer, when will Friday's RHEL3 U6 beta kernel respin appear in the RHN beta channels?

Dennis Gregorovic, when will Friday's RHEL3 U6 beta kernel respin appear in the RHN beta channels?

2.4.21-35.EL kernel is now in the RHEL 3 Beta channels on RHN

Tested with 2.4.21-35; it worked as expected.
Magic key "c" from the keyboard crashes the system and dumps successfully.
#echo c > sysrq-trigger also works now.

cciss.c has:
#include "cciss_diskdump.c"
I did not need to manually load any modules.
Please try this:
1. Make sure that a cciss device is listed in /etc/sysconfig/diskdump. E.g.:
DEVICE=/dev/cciss/c0d1p1
2. Do "service diskdump restart".
See if that solves the problem.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2005-663.html
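
The setup check suggested near the end of the thread (a cciss device listed in /etc/sysconfig/diskdump, and the same device visible in /proc/diskdump after the service restarts) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the /proc/diskdump layout is taken from the outputs quoted in the comments above, the meaning of the two integers after the device name is not stated in the thread, and a single DEVICE= entry is assumed.

```python
import re

# Sample text laid out like the /proc/diskdump output quoted in this thread.
SAMPLE_PROC = """\
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# total_blocks: 262052
#
/dev/cciss/c0d1p1 32 2203168
"""

# Sample config line as suggested in the thread.
SAMPLE_SYSCONFIG = "DEVICE=/dev/cciss/c0d1p1\n"


def parse_proc_diskdump(text):
    """Return (params, devices) from text shaped like the quoted output.

    params maps names from '# name: value' lines to integers.
    devices maps a device path to the two integers that follow it
    (their meaning is an assumption left open here).
    """
    params = {}
    devices = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            match = re.match(r"#\s*(\w+):\s*(\d+)\s*$", line)
            if match:
                params[match.group(1)] = int(match.group(2))
            continue  # a bare '#' separator line is skipped
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            devices[fields[0]] = (int(fields[1]), int(fields[2]))
    return params, devices


def configured_device(sysconfig_text):
    """Return the value of the first DEVICE= line, or None if absent."""
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if line.startswith("DEVICE="):
            value = line.split("=", 1)[1].strip().strip("\"'")
            return value or None
    return None


def dump_device_registered(sysconfig_text, proc_text):
    """True if the configured device also appears in the /proc text."""
    device = configured_device(sysconfig_text)
    _, devices = parse_proc_diskdump(proc_text)
    return device is not None and device in devices


if __name__ == "__main__":
    print(dump_device_registered(SAMPLE_SYSCONFIG, SAMPLE_PROC))
```

On a real system the two strings would come from reading /etc/sysconfig/diskdump and /proc/diskdump; a False result would correspond to the case where "service diskdump restart" is worth trying.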