Bug 691606 - Panic due to OOM on 32-bit RHEL 6.1 with FCoE MPIO of 256 LUNs
Summary: Panic due to OOM on 32-bit RHEL 6.1 with FCoE MPIO of 256 LUNs
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 6.1
Assignee: Ben Marzinski
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 635490
 
Reported: 2011-03-28 23:50 UTC by Yi Zou
Modified: 2011-04-26 13:38 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-26 13:38:38 UTC
Target Upstream Version:


Attachments
oom panic in mpio 32bit RH (85.74 KB, application/x-compressed-tar)
2011-03-28 23:50 UTC, Yi Zou
no flags

Description Yi Zou 2011-03-28 23:50:18 UTC
Created attachment 488304 [details]
oom panic in mpio 32bit RH

FCoE MPIO on 32-bit with 256 LUNs ends up in an OOM panic:

This has been observed on both 32-bit RHEL 6.0 and 6.1 with the setup below, using both an Intel 82599 with the Open-FCoE.org FCoE protocol stack and QLogic CNAs. The kernel is 2.6.32-122.el6 and the package is device-mapper-multipath-0.4.9-39.el6.

Steps to Reproduce:
1. Present 256 LUNs on each port of the dual-port HBA, using an EMC V-MAX array
2. Use the default MPIO configuration
3. Start the multipathd service
  
Actual results: panic on OOM

Expected results: at least fail gracefully

Additional info: This seems to be related to a pretty big memory footprint from multipathd while udevd is spawned per LUN (see the attached syslog). Since this is 32-bit, even though there is enough highmem, multipathd does not appear to be able to make use of it.
I have noticed that multipathd by default sets its oom score (oom_adj) to -17, i.e., not killable. This causes the OOM panic every time the system boots, since MPIO gets started right after boot and the OOM killer ends up killing everything else instead. When the oom score of multipathd is changed to be much higher (after disabling multipathd on boot) to allow it to be killed when the OOM killer kicks in, at least the system does not panic.

Please see the attached syslogs for roughly the same OOM behavior on both the QLogic and Intel CNAs.
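
A minimal sketch of the workaround described above, assuming the stock RHEL 6 multipathd init service and the 2.6.32 /proc/<pid>/oom_adj interface (the value 0 written below is only illustrative):

  # Keep multipathd from starting automatically at boot, then start it by hand.
  chkconfig multipathd off
  service multipathd start

  # multipathd sets its oom_adj to -17 at startup, which exempts it from the OOM killer.
  cat /proc/$(pidof multipathd)/oom_adj

  # Raise the value so the OOM killer can pick multipathd instead of killing
  # everything else and then panicking.
  echo 0 > /proc/$(pidof multipathd)/oom_adj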

Comment 2 Milan Broz 2011-03-30 12:55:45 UTC
There seem to be other bugs covering the multipathd memory footprint; maybe this is just a duplicate of bug #623644?

Comment 3 Ben Marzinski 2011-03-31 16:43:16 UTC
The fix for bug #623644 should already be in that package. I just set up 512 iSCSI paths on an x86_64 system, and multipathd is using less than one tenth of the memory that it is using in your logs.

I haven't done much testing on 32-bit systems and none with FCoE, so that's where I'm guessing the problem is.
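
For comparison, a rough sketch of how the multipathd footprint can be read on either machine (assuming a single multipathd instance; the fields are those of the standard /proc/<pid>/status file):

  # Locked, resident, and virtual sizes of multipathd, in kB.
  grep -E 'VmLck|VmRSS|VmSize' /proc/$(pidof multipathd)/status

  # Roughly the same numbers via ps (RSS/VSZ in kB).
  ps -o pid,rss,vsz,cmd -C multipathd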

Comment 4 Yi Zou 2011-03-31 16:58:04 UTC
(In reply to comment #3)
> the fix for bug #623644 should already be in that package.  I just setup 512
> iscsi paths to a x86_64 bit system, and multipathd is using less than one tenth
> of the memory that it is using in your logs.
> I haven't done much testing on 32bit systems and none with FCoE, so that's
> where I'm guessing the problem is.

I believe this is already device-mapper-multipath-0.4.9-39.el6 from the RHEL 6.1 beta, which, as you mentioned, has the fix from bug #623644, but the problem still happens on 32-bit. Let me double-check the rpm package on the testing box where this was reported. Meanwhile, has any similar issue been found on 32-bit?

Since I have reproduced this using a QLogic CNA HBA, this is not an issue in Open-FCoE's FCoE protocol stack.

Comment 5 Yi Zou 2011-04-01 01:07:18 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > the fix for bug #623644 should already be in that package.  I just setup 512
> > iscsi paths to a x86_64 bit system, and multipathd is using less than one tenth
> > of the memory that it is using in your logs.
> > I haven't done much testing on 32bit systems and none with FCoE, so that's
> > where I'm guessing the problem is.
> 
> I believe this is already device-mapper-multipath-0.4.9-39.el6 from rhel6.1
> beta, which as you mentioned has the fix from bug #623644 but still happens on
> 32bit. Let me double-check on the rpm pkg on the testing box where this was
> reported.

OK, I have double-checked: 0.4.9-39 has the fix from bug #623644, but the problem is still there on 32-bit.

Comment 6 RHEL Program Management 2011-04-04 01:48:18 UTC
Since the RHEL 6.1 External Beta has begun and this bug remains
unresolved, it has been rejected, as it is not proposed as an
exception or a blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 7 Ben Marzinski 2011-04-04 16:30:02 UTC
I'll try to set up an x86 system to test this on. I'm not sure what sort of resources you have available, but if you would be able to test this on an x86_64 system, so that we can confirm that this is an x86 problem, that would be really helpful.

Comment 8 Yi Zou 2011-04-04 20:12:47 UTC
(In reply to comment #7)
> I'll try to setup a x86 system to test this on.  I'm not sure what sort of
> resources you have available, bu if you would be able to test this on an x86_64
> bit system, so that we can confirm that this is a x86 problem that would be
> really helpful.
OK, I will ask our validation folks to run the same test on x86_64 and get back to you. Also, at this point I am not sure whether this is going to be a storage qualification blocker; I doubt it, as it's unlikely people would use 32-bit for MPIO with that many LUNs, but if it is, we will have to have the fix targeted for RHEL 6.1.

Comment 9 Jack Morgan 2011-04-13 19:53:38 UTC
I cannot reproduce this bug on RHEL6.1-Snapshot2 on x86_64 or i386. The multipathd service starts up properly. I can run multipath -v2 ; multipath -ll and see all 256 dm-XXX devices and their corresponding paths. I suggest closing this bug.
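
A rough way to double-check those counts from the shell (the grep pattern below is illustrative and depends on the exact multipath -ll output format):

  # One "dm-N" map line per multipath device; 256 LUNs should give 256 maps.
  multipath -ll | grep -c 'dm-'

  # Device-mapper nodes actually created (this also counts any kpartx partition maps).
  ls /dev/dm-* | wc -l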

Comment 10 Russell Doty 2011-04-20 21:03:42 UTC
From comment 9, this bug has been verified by Intel as working in Snapshot 2.

