Bug 1557659

Summary: aacraid: Host adapter abort request
Product: [Fedora] Fedora
Reporter: lnie <lnie>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 28
CC: airlied, awilliam, bskeggs, ewk, fzatlouk, hdegoede, ichavero, itamar, jarodwilson, jforbes, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, lnie, mchehab, mjg59, robatino, rrenukun, sgallagh, steved
Target Milestone: ---
Flags: jforbes: needinfo?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: RejectedBlocker
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-29 15:16:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
console output of the f28 installation beaker job
none
console output of the "success" f28 installation beaker job(with Fedora-28-20180315.n.0 Server x86_64,too)
none
screenshot1
none
journal after boot into the 4.15 kernel,just in case it's useful
none
dmesg of the successful boot
none
journal of the successful boot
none
last picture before dracut messags flowing(with reset_devices)
none
picture6
none
picture of the RAID controller
none

Description lnie 2018-03-17 12:52:27 UTC
Created attachment 1409136 [details]
console output of the f28 installation beaker job

Description of problem:
I tried to install F28 on an FCoE server that uses the aacraid driver, but the installation failed.
I've tried three times and every attempt failed, but installing F27/RHEL on the same server succeeds.

Version-Release number of selected component (if applicable):
kernel-4.16.0-0.rc4.git0.1.fc28.x86_64.rpm    
Fedora-28-20180315.n.0 Server x86_64

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 lnie 2018-03-17 14:48:18 UTC
Created attachment 1409183 [details]
console output of the "success" f28 installation beaker job(with Fedora-28-20180315.n.0 Server x86_64,too)

Comment 2 lnie 2018-03-17 14:49:58 UTC
The installation on the server with the megaraid_sas driver succeeded.

Comment 3 Fedora Blocker Bugs Application 2018-03-17 14:52:00 UTC
Proposed as a Blocker for 28-beta by Fedora user lnie using the blocker tracking app because:

 This seems to affect "The installer must be able to detect and install to hardware or firmware RAID storage devices"

Comment 4 Adam Williamson 2018-03-19 21:22:36 UTC
Discussed at 2018-03-19 Fedora 28 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2018-03-19/f28-blocker-review.2018-03-19-16.02.html . We agreed that if this only affects a RAID set accessed via FCoE - effectively stacking two relatively unusual use cases that are blockers alone - this is too much of a corner case to constitute a blocker. However, if this can be reproduced with a *local* Adaptec RAID controller, we will re-evaluate it.

Lili, can you test with the RAID controller accessed as a regular local device? Thanks!

Comment 5 lnie 2018-03-20 06:51:35 UTC
Hi Adam, I have done a manual installation on a local aacraid server (Dell PowerEdge T110). I found the bug on an HP ProLiant DL120 G7, so it seems that this bug affects a wide range of hardware.

Comment 6 lnie 2018-03-20 06:55:51 UTC
Er, I mean the manual installation on the local server failed due to this bug, of course. I forgot to type "but failed".

Comment 7 lnie 2018-03-20 09:38:20 UTC
 The two guilty lines might be: 
 dracut-pre-udev[489]: modprobe: ERROR: could not insert 'floppy': No such device 
 dracut-pre-udev[489]: modprobe: ERROR: could not insert 'sha256_mb': No such device 
I didn't see these lines in the logs of the successful installations.

Comment 8 Adam Williamson 2018-03-20 16:09:55 UTC
Nah, I don't think so; it's odd that there's a difference, but those are just kernel modules for floppy disk support and SHA-256, I don't think they relate to aacraid.

This is definitely a problem, though :( Laura, Justin, how can lili debug this further?

I do see various module params for tweaking the module's behaviour:

parm:           aac_sync_mode:Force sync. transfer mode 0=off, 1=on (int)
parm:           aac_convert_sgl:Convert non-conformable s/g list 0=off, 1=on (int)
parm:           nondasd:Control scanning of hba for nondasd devices. 0=off, 1=on (int)
parm:           cache:Disable Queue Flush commands:
	bit 0 - Disable FUA in WRITE SCSI commands
	bit 1 - Disable SYNCHRONIZE_CACHE SCSI command
	bit 2 - Disable only if Battery is protecting Cache (int)
parm:           dacmode:Control whether dma addressing is using 64 bit DAC. 0=off, 1=on (int)
parm:           commit:Control whether a COMMIT_CONFIG is issued to the adapter for foreign arrays.
This is typically needed in systems that do not have a BIOS. 0=off, 1=on (int)
parm:           msi:IRQ handling. 0=PIC(default), 1=MSI, 2=MSI-X) (int)
parm:           startup_timeout:The duration of time in seconds to wait for adapter to have it's kernel up and
running. This is typically adjusted for large systems that do not have a BIOS. (int)
parm:           aif_timeout:The duration of time in seconds to wait for applications to pick up AIFs before
deregistering them. This is typically adjusted for heavily burdened systems. (int)
parm:           aac_fib_dump:Dump controller fibs prior to IOP_RESET 0=off, 1=on (int)
parm:           numacb:Request a limit to the number of adapter control blocks (FIB) allocated. Valid values are 512 and down. Default is to use suggestion from Firmware. (int)
parm:           acbsize:Request a specific adapter control block (FIB) size. Valid values are 512, 2048, 4096 and 8192. Default is to use suggestion from Firmware. (int)
parm:           update_interval:Interval in seconds between time sync updates issued to adapter. (int)
parm:           check_interval:Interval in seconds between adapter health checks. (int)
parm:           check_reset:If adapter fails health check, reset the adapter. a value of -1 forces the reset to adapters programmed to ignore it. (int)
parm:           expose_physicals:Expose physical components of the arrays. -1=protect 0=off, 1=on (int)
parm:           reset_devices:Force an adapter reset at initialization. (int)

Could you try playing with any of those that might affect things? You might need help from someone who knows what values might sensibly be used, though, I honestly have no idea (I've never worked with one of these adapters).
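
For anyone who wants to sift through a parameter listing like the one above, here is a minimal sketch, assuming the text follows the `parm:` layout shown in this comment. The function name `parse_parm_lines` and the shortened sample are illustrative only and are not part of any kernel tooling.

```python
import re

# Matches lines such as "parm:           name:description".
PARM_RE = re.compile(r"^parm:\s+(\w+):(.*)$")


def parse_parm_lines(text):
    """Return {parameter_name: description} from 'parm:' style lines."""
    params = {}
    current = None
    for line in text.splitlines():
        match = PARM_RE.match(line)
        if match:
            current = match.group(1)
            params[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # A non-empty line without the "parm:" prefix continues the
            # description of the most recent parameter.
            params[current] += " " + line.strip()
    return params


# Shortened sample copied from the listing in this comment.
sample = """\
parm:           aac_sync_mode:Force sync. transfer mode 0=off, 1=on (int)
parm:           cache:Disable Queue Flush commands:
\tbit 0 - Disable FUA in WRITE SCSI commands
parm:           reset_devices:Force an adapter reset at initialization. (int)
"""

params = parse_parm_lines(sample)
for name in sorted(params):
    print(f"{name}: {params[name]}")
```

Multi-line descriptions (for example `cache`) are folded into a single string so that each parameter can be reviewed on one line.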

Comment 9 Justin M. Forbes 2018-03-20 20:41:35 UTC
Just out of curiosity, as I am trying to track this down, does this system function as expected with F27 and the latest 4.15 kernel update?  Entirely likely that this was introduced in 4.16, as there were many changes, but I want to rule out earlier kernels.

Comment 10 lnie 2018-03-21 07:00:11 UTC
I have installed F27 on the Dell PowerEdge T110 and updated the kernel to 4.15; it works fine.

Comment 11 Adam Williamson 2018-03-21 07:07:53 UTC
lnie: just to confirm, can you also try installing a 4.16 kernel - an fc28 or f29 one, from koji - and see if *then* you see the bug? thanks!

Comment 12 lnie 2018-03-21 08:08:48 UTC
Adam, the system failed to boot after I updated the kernel to kernel-4.16.0-0.rc6.git0.2.fc28. screenshot1 is the last picture before the dracut messages started flowing.

Comment 13 lnie 2018-03-21 08:09:35 UTC
Created attachment 1411013 [details]
screenshot1

Comment 14 lnie 2018-03-21 08:43:21 UTC
I got that picture by removing rhgb quiet, of course.

Comment 15 lnie 2018-03-21 08:44:37 UTC
Created attachment 1411035 [details]
journal after boot into the 4.15 kernel,just in case it's useful

Comment 16 lnie 2018-03-21 14:05:15 UTC
(In reply to Adam Williamson from comment #8)

> Could you try playing with any of those that might affect things? You might
> need help from someone who knows what values might sensibly be used, though,
> I honestly have no idea (I've never worked with one of these adapters).

I'm going to see tomorrow whether the reset_devices param makes any difference :)

Comment 17 Justin M. Forbes 2018-03-21 15:48:30 UTC
One last thing to try and narrow things down, can you try to boot that F27 install with  the rc4 kernel? https://koji.fedoraproject.org/koji/buildinfo?buildID=1053469

Comment 18 Stephen Gallagher 2018-03-21 17:25:09 UTC
The criteria says we block on failures to install to hardware RAID, but I can also see us handwaving this if it was the last blocker at the Go/No-Go meeting and treating it as a Final blocker. I'm not going to +1 blocker this unless I hear evidence that it's going to hit a meaningful percentage of our users.

I'm definitely +1 FE in the meantime.

Comment 19 lnie 2018-03-22 05:12:46 UTC
Adam, I've played with reset_devices and aac_sync_mode. aac_sync_mode works: the system boots successfully with kernel-4.16.0-0.rc6.git0.2.fc28.x86_64, yay~~
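
The comment does not say how the parameter was set or which value was used. As a rough sketch, assuming the value 1 ("on" in the parameter description) and the usual Linux conventions for module parameters (`module.param=value` on the kernel command line, or an `options` line in a file under /etc/modprobe.d/), the two forms could be generated like this. The helper names are hypothetical.

```python
def kernel_cmdline_arg(module, param, value):
    """Format a module parameter for the kernel command line."""
    return f"{module}.{param}={value}"


def modprobe_options_line(module, param, value):
    """Format a module parameter as a modprobe.d 'options' line."""
    return f"options {module} {param}={value}"


# Assumption: 1 enables sync transfer mode; the thread does not state the value used.
print(kernel_cmdline_arg("aacraid", "aac_sync_mode", 1))
print(modprobe_options_line("aacraid", "aac_sync_mode", 1))
```

The command-line form is the one that can be typed at the boot prompt of an installer image, which is why it is listed first.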

Comment 20 lnie 2018-03-22 05:13:34 UTC
Created attachment 1411599 [details]
dmesg of the successful boot

Comment 21 lnie 2018-03-22 05:14:46 UTC
Created attachment 1411600 [details]
journal of the successful boot

Comment 22 lnie 2018-03-22 05:16:50 UTC
Created attachment 1411601 [details]
last picture before dracut messags flowing(with reset_devices)

Comment 23 lnie 2018-03-22 05:18:33 UTC
(In reply to Justin M. Forbes from comment #17)
> One last thing to try and narrow things down, can you try to boot that F27
> install with  the rc4 kernel?
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1053469

The rc4 kernel also doesn't work; the system just hangs there (picture6).

Comment 24 lnie 2018-03-22 05:19:07 UTC
Created attachment 1411602 [details]
picture6

Comment 25 Adam Williamson 2018-03-22 07:48:52 UTC
So if this fails with default options on 4.16 but passes with default options on 4.15, that's clearly a bug, but if a non-default option can make it work, that's probably a sufficient workaround at least for Beta. On that basis I'm -1 blocker for Beta, not sure about FE, would depend on the fix.

Thanks for the testing, Lili. Presumably RC4 kernel on F27 also works if you use aac_sync_mode ?

Comment 26 Justin M. Forbes 2018-03-22 12:29:41 UTC
This would make sense. What I was looking for with the rc4 test isn't directly related to the aacraid code, but to patches that upstream hasn't picked up yet. I will see what upstream has as a fix.  With a known workaround, I am -1 to blocker as well.

Comment 27 Stephen Gallagher 2018-03-22 12:37:32 UTC
With the information above, I'm firmly -1 blocker.

I'm also going to go for -1 FE; if the fix is going to require a kernel rebase, I'm not comfortable with that level of risk during Freeze. Let's document the workaround in Known Issues and fix it as soon after Freeze lifts as possible.

Comment 28 František Zatloukal 2018-03-22 18:29:05 UTC
Discussed during blocker review [1]:

RejectedBlocker (Final) - this is specific to one particular type of RAID adapter, and we have a reasonable workaround, so we don't consider this a serious enough violation to be a Beta blocker

[1] https://meetbot-raw.fedoraproject.org/fedora-meeting-1/2018-03-22/

Comment 29 František Zatloukal 2018-03-22 18:36:42 UTC
Discussed during blocker review [1]:

RejectedBlocker (Beta) - this is specific to one particular type of RAID adapter, and we have a reasonable workaround, so we don't consider this a serious enough violation to be a Beta blocker

[1] https://meetbot-raw.fedoraproject.org/fedora-meeting-1/2018-03-22

Comment 30 lnie 2018-03-23 05:47:08 UTC
(In reply to Adam Williamson from comment #25)
> So if this fails with default options on 4.16 but passes with default
> options on 4.15, that's clearly a bug, but if a non-default option can make
> it work, that's probably a sufficient workaround at least for Beta. On that
> basis I'm -1 blocker for Beta, not sure about FE, would depend on the fix.
> 
> Thanks for the testing, Lili. Presumably RC4 kernel on F27 also works if you
> use aac_sync_mode ?

  Yeah, checked.

Comment 31 Raghava Aditya Renukunta 2018-03-23 15:03:09 UTC
Hi lnie,
Which card and system did you reproduce this on, and what is the firmware version of the card?

Regards,
Raghava Aditya

Comment 32 lnie 2018-03-27 04:54:20 UTC
Created attachment 1413479 [details]
picture of the RAID controller

Comment 33 lnie 2018-03-27 04:55:31 UTC
Hi,
  please feel free to ask if more information is needed.

Comment 34 Justin M. Forbes 2018-07-23 15:11:34 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 35 Justin M. Forbes 2018-08-29 15:16:23 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.