735261 – Anaconda doesn't support installation on host with 160 disks ( 20 LUN via 8 paths)

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	anaconda
Sub Component:
Version:	19
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Anaconda Maintenance Team
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	756082

Reported:	2011-09-02 03:16 UTC by Gris Ge
Modified:	2014-03-17 19:51 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-03-17 19:51:34 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Anaconda log for installation failure on storageqe-03 (3.34 MB, application/x-bzip2) 2011-09-06 02:29 UTC, Gris Ge

Description Gris Ge 2011-09-02 03:16:23 UTC

Description of problem:
Got this message when trying to install:
====
Running anaconda 13.21.134, the Red Hat Enterprise Linux system installer - please wait.
Anaconda died after receiving signal 9.
install exited abnormally [1/1] 
The system will be rebooted when you press Ctrl-C or Ctrl-Alt-Delete.
udevd[133]: worker [10936] failed while handling '/devices/pci0000:00/0000:00:07.0/0000:0d:00.1/host4/rport-4:0-4/target4:0:2/4:0:2:15/block/sdiw/sdiw1'

udevd[133]: worker [10987] unexpectedly returned with status 0x0100
====

I investigated this earlier (but forget exactly how). This is what I can recall:
1. I found many lvm pvs processes, about 400, each scanning one disk or one partition.
2. These disks should be assembled by multipath, so no scan is needed on the individual disks.

These lvm processes might cause an out-of-memory condition, so that the anaconda process gets killed by the OOM killer.

I know nothing about the anaconda storage code, so please forgive my wild guess.

After I unmapped LUNs on the storage array to reduce the LUN count to 10 (80 disks), installation worked well.
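
For illustration only (not part of the original report), the disk counts quoted above follow from each LUN appearing once per path as a separate SCSI disk. The helper name `path_device_count` is hypothetical, and the sketch assumes every LUN is visible on every path:

```python
def path_device_count(luns: int, paths: int) -> int:
    """Number of sdX disks the installer sees before multipath assembles them.

    Assumption: every LUN is presented on every path, so each LUN
    shows up once per path as its own SCSI disk.
    """
    return luns * paths


# Figures from the report: the failing setup and the reduced one that worked.
for luns, paths in [(20, 8), (10, 8)]:
    print(f"{luns} LUNs x {paths} paths = {path_device_count(luns, paths)} disks")
```

If one pvs process is started per disk and per partition, the process count grows with this product rather than with the number of LUNs.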

Version-Release number of selected component (if applicable):
RHEL6.2-20110823.1
RHEL 6.1 GA also has the same problem.

How reproducible:
100%

Steps to Reproduce:
1. Mask/map 20+ LUNs to a host via 8 paths.
2. Install OS

Actual results:
Installation failed.

Expected results:
Installation completes correctly.

Additional info:

Comment 1 Ales Kozumplik 2011-09-05 10:42:10 UTC

Please attach /tmp/*log.

Comment 2 Gris Ge 2011-09-05 11:24:52 UTC

Ales,

I tried the sshd kernel option, but still had no chance to get a shell.

Will the syslog=<IP> option provide sufficient information?

Comment 3 Ales Kozumplik 2011-09-05 11:32:38 UTC

Please switch to tty2, cd into /tmp and then copy out all the *log files there (using scp for instance).

Comment 4 Gris Ge 2011-09-05 12:02:01 UTC

There is no tty2 on the beaker console.

I will try KVM and provide the info to you.

Comment 5 Ales Kozumplik 2011-09-05 12:12:58 UTC

ok thanks. I'll keep this in needinfo until then.

Comment 6 Gris Ge 2011-09-06 02:29:45 UTC

Created attachment 521561
Anaconda log for installation failure on storageqe-03

Storageqe-04 is busy with HBA driver auto-testing.

This is the log for storageqe-03; it only has 50 LUNs via 4 paths (200 disks).

Not sure why we got so many I/O errors (maybe LVM tries to access the passive link).

I found 3 issues:
1. There are 380+ lvm pvs processes, each scanning one partition.
2. The storage log shows that you try to scan for multipath on /dev/sda1. That might be incorrect, since multipath is not based on partitions. For sda1, kpartx will handle it after the mpath device is created.
3. syslog and the other logs show that it was suddenly killed, as the last log line is only partial.

My wild guess is that it's out of memory.

Comment 7 Ales Kozumplik 2011-09-06 06:28:52 UTC

The syslog is full of errors like this:

22:14:12,133 ERR kernel:end_request: I/O error, dev sdcr, sector 4063216
22:14:12,133 INFO kernel:sd 4:0:1:46: [sdcr] Unhandled error code
22:14:12,133 INFO kernel:sd 4:0:1:46: [sdcr] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
22:14:12,133 INFO kernel:sd 4:0:1:46: [sdcr] CDB: Read(10): 28 00 00 00 08 08 00 00 08 00

I think it is either a hardware error or some kernel problem.
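
To see whether such errors cluster on a few devices or are spread across all paths, the syslog can be tallied per device. This is a minimal sketch, assuming the log lines keep the `end_request: I/O error, dev <name>, sector <n>` shape quoted above; `count_io_errors` is a hypothetical helper, not an anaconda or kernel tool:

```python
import re
from collections import Counter

# Matches the kernel message format shown in the excerpt above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The first two lines of the excerpt; only the first is an I/O error line.
sample = [
    "22:14:12,133 ERR kernel:end_request: I/O error, dev sdcr, sector 4063216",
    "22:14:12,133 INFO kernel:sd 4:0:1:46: [sdcr] Unhandled error code",
]
print(count_io_errors(sample))
```

Errors confined to devices on one set of paths would be consistent with reads hitting passive paths. Errors on every device would point more toward hardware or the kernel.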

Comment 8 Gris Ge 2011-09-07 04:31:08 UTC

I don't think that is a storage array hardware issue.
Storageqe-03 and -04 are using different storage arrays. It is unlikely that both have a problem at the same time.
HBA driver testing shows that both storage arrays work well.

I will investigate these I/O errors, but I think you might need to find a better way to do the lvm pvs scan. I think kicking off a large number of lvm pvs processes is what this bug is supposed to fix.

LVM should not touch any /dev/sdX before multipath has started, because accessing a passive link will cause the storage array to perform a controller transition. (That might be why the I/O errors occur.) Customers will get angry if their storage array keeps bouncing because you access all links (passive and active).
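
As a rough sketch of the idea (not anaconda's actual logic), path devices that report the same WWID belong to one LUN, so a scan would only be needed once per multipath device rather than once per /dev/sdX. The function name and the sample WWIDs below are hypothetical:

```python
from collections import defaultdict


def group_paths_by_wwid(path_wwids):
    """Group path devices (sdX) by the WWID of the LUN they lead to.

    path_wwids: dict mapping device name -> WWID string.
    Returns a dict mapping WWID -> sorted list of path devices.
    """
    groups = defaultdict(list)
    for dev, wwid in path_wwids.items():
        groups[wwid].append(dev)
    return {wwid: sorted(devs) for wwid, devs in groups.items()}


# Hypothetical sample: 2 LUNs, each visible over 2 paths.
sample = {"sda": "wwid-A", "sdb": "wwid-B", "sdc": "wwid-A", "sdd": "wwid-B"}
groups = group_paths_by_wwid(sample)
print(len(sample), "path devices ->", len(groups), "LUNs to scan")
```

Scanning one assembled device per group, instead of every member, would also avoid reading from passive paths.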

Comment 9 Gris Ge 2011-09-07 05:34:26 UTC

This is an anaconda or initramfs issue. Changing back to the anaconda component.

Comment 10 Ales Kozumplik 2011-09-07 05:57:31 UTC

clearing the needinfo.

Comment 19 David Cantrell 2012-07-03 16:46:42 UTC

This one has been skipped through several update releases. Moving it over to rawhide so we can address it upstream and consider backports to RHEL releases after that.

Comment 20 Fedora End Of Life 2013-04-03 19:13:17 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 21 Brian Lane 2013-05-06 23:04:34 UTC

Is anyone able to test this with F18 or F19-Beta-TC3? There has been a considerable amount of change in the storage since F13.
