Bug 459969
| Summary: | [NetApp 4.8 bug] OOM gets triggered while discovering a large number of LUNs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Tanvi <tanvi> |
| Component: | udev | Assignee: | Harald Hoyer <harald> |
| Status: | CLOSED ERRATA | Severity: | urgent |
| Priority: | urgent | Target Milestone: | rc |
| Version: | 4.7 | Target Release: | 4.8 |
| Hardware: | All | OS: | Linux |
| Keywords: | OtherQA, ZStream | Doc Type: | Bug Fix |
| CC: | andriusb, babu.moger, bmarzins, coughlan, cward, dkovalsk, dmair, marting, michael.hagmann, pknirsch, rajashekhar.a, rsarraf, syeghiay, tanvi, xdl-redhat-bugzilla, ykopkova | | |
| Last Closed: | 2009-05-18 20:28:46 UTC | | |
| Bug Depends On: | 487502 | | |
| Bug Blocks: | 445585, 450897, 481277 | | |
| Attachments: | pstree -p output (attachment 331825) | | |
Description (reported by Tanvi, 2008-08-25 12:35:33 UTC)
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

---

I tested with 512 devices (256 LUNs * 2 paths). The OOM issue is still present.

---

Can you do:

```
# killall udevd
# export UDEVD_MAX_CHILDS=10
# export UDEVD_MAX_CHILDS_RUNNING=10
# udevd &
```

and then retrigger:

```
# iscsi-rescan
```

---

After following the above steps, OOM was not seen, but it took around 3 hours to add 1024 devices.

---

OK, it seems I have to lower the child limits. Can you try to find out your maximum limit? How much memory does the machine have?

---

Following is the memory information of the machine:

```
[root@localhost ~]# free
             total       used       free     shared    buffers     cached
Mem:       2055296     520644    1534652          0      35596     127512
-/+ buffers/cache:     357536    1697760
Swap:      4095992      70944    4025048
```

---

When you set udev_log="yes" in /etc/udev/udev.conf and do:

```
# unset UDEVD_MAX_CHILDS
# unset UDEVD_MAX_CHILDS_RUNNING
# killall udevd
# udevd &
```

you will see in /var/log/messages:

```
Feb  4 14:23:21 rhel4 udevd[3333]: initialize max_childs to 190
Feb  4 14:23:21 rhel4 udevd[3333]: initialize max_childs_running to 16
```

Can you report what udev sets for your system?

---

Following is what I get on my machine:

```
Feb  4 19:19:05 localhost udevd[11761]: initialize max_childs to 629
Feb  4 19:19:05 localhost udevd[11761]: initialize max_childs_running to 24
```

---

So I might suggest trying lower UDEVD_MAX_CHILDS values until you find an acceptable one which does not OOM (try e.g. 400, 300, 200...):

```
# killall udevd
# export UDEVD_MAX_CHILDS=xyz
# udevd &
```

and then retrigger:

```
# iscsi-rescan
```

Oh, and you might disable logging again in /etc/udev/udev.conf.

---

I tried with various values. For UDEVD_MAX_CHILDS=100, I did not see any OOM. But it still took around 1 hour to add all the SCSI (/dev/sd*) and multipath (/dev/mapper/*) entries for 1024 devices (512 LUNs * 2 paths).

---

Oh, wow. 100 is very low; is this the highest you can get? I hope you disabled logging, which may have slowed down the process.

---

Yes, I did disable the logging. I tried with 600, 500, 400, 300, 200 and 100. Saw OOM for 200. Will try with 150 and will update.

---

OOM triggered with 150 as well.

---

So, I will fix the formula to give you ~100 max_childs. Additionally, I will add a kernel command line parameter to set the maximum value manually.
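The actual udev-039 formula is not shown in this thread, so as a hypothetical illustration only, a memory-scaled cap of the kind described — with a scaling divisor chosen purely so that the reporter's 2 GB machine lands near the quoted ~100 — might be computed like this:

```sh
#!/bin/sh
# Hypothetical sketch, NOT the real udev-039 code: derive max_childs from
# installed RAM and clamp it near 100, as the fixed formula is said to do.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

# Assumed scaling: one child per ~20 MB of RAM (2 GB -> ~100 children).
max_childs=$((mem_kb / 20480))

# Clamp to the ~100 ceiling mentioned above.
[ "$max_childs" -gt 100 ] && max_childs=100

echo "initialize max_childs to $max_childs"
```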
---

I did not see any OOM if the iscsi service is already running and we simply rescan the sessions after mapping all the LUNs. But there is one scenario in which OOM gets triggered, i.e., when all the devices are already discovered and the iscsi service is restarted. Again, if we stop the iscsi service and wait for some time before starting it again, OOM does not happen. Following are the relevant values:

```
Feb  7 12:14:40 localhost udevd[30513]: initialize max_childs to 99
Feb  7 12:14:40 localhost udevd[30513]: initialize max_childs_running to 24
```

Do we have to live with the long time taken to discover the devices and the sluggishness of the host during that period?

---

Does it help to lower the UDEVD_MAX_CHILDS limit for an iscsi restart?

And yes, I guess you will have to live with the slow process to prevent OOM.

---

(In reply to comment #35)
> Does it help to lower the UDEVD_MAX_CHILDS limit for an iscsi restart?
>
> And yes, I guess you will have to live with the slow process to prevent OOM.

That would end up making the OS startup time (1+ hr of device discovery + fsck time) very large.

---

(In reply to comment #35)
> Does it help to lower the UDEVD_MAX_CHILDS limit for an iscsi restart?

I decreased the value of UDEVD_MAX_CHILDS to 80. Did not see any OOM for iscsi restart.

---

Last question: if you raise UDEVD_MAX_CHILDS_RUNNING (the other limit), does it get faster without OOM?

---

I raised the value of UDEVD_MAX_CHILDS_RUNNING to 80. The OOM did not happen, but it did not help in decreasing the time either.

---

OK, hopefully this runs out of the box without parameter tweaking. Please test: http://people.redhat.com/harald/downloads/udev/udev-039-10.27.el4

What also interests me is which processes bring your system down. Can you provide a task list when the iscsis are triggered, with e.g.:

```
# pstree -p
```
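Besides a one-shot `pstree -p` snapshot, one could watch how many udev workers are alive while the rescan runs; a small sketch (the `udev` process-name match is an assumption based on this system's process names):

```sh
#!/bin/sh
# Sample the live udev process count once per second during discovery;
# run this in a second terminal while iscsi-rescan is in progress.
while sleep 1; do
    echo "$(date '+%T') udev processes: $(ps -e | grep -c udev)"
done
```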
---

Created attachment 331825 [details]: pstree -p output

No OOM was seen for iscsi restart. Following are the values which udev sets on the machine:

```
Feb 13 18:18:56 localhost udevd[6950]: initialize max_childs to 78
Feb 13 18:18:56 localhost udevd[6950]: initialize max_childs_running to 24
```

Attached the pstree -p output.

---

Harald,

In RHEL4's udev, is this necessary?

```sh
# Run "multipath" on each newly added block device (except dm-* devices)
# whenever the dm_multipath module is loaded and the control device is OK.
if /sbin/lsmod | /bin/grep "^dm_multipath" > /dev/null 2>&1 && \
   [ ${mpath_ctl_fail} -eq 0 ] && \
   [ "${ACTION}" = "add" -a "${DEVPATH:7:3}" != "dm-" ] ; then
    /sbin/multipath -v0 ${DEVNAME} > /dev/null 2>&1
fi
```

This is part of the multipath.dev script.

There was a similar situation in RHEL 5.x (Bugzilla #460301). This looks similar to what has been commented in Comment #6 there.

If disabling the above helps, we could just document it in the Rel. Notes.
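One way to test whether this callout is the expensive part would be to time a rescan with the hook disabled. This is a sketch only — the rename trick assumes udev runs only `/etc/dev.d` hooks ending in `.dev`, so verify that on your system first:

```sh
#!/bin/sh
# Sketch: compare discovery time with the per-device multipath callout
# disabled, then rebuild all multipath maps in a single pass afterwards.
rpm -qf /etc/dev.d/block/multipath.dev        # confirm the owning package

mv /etc/dev.d/block/multipath.dev /etc/dev.d/block/multipath.dev.disabled
time iscsi-rescan                             # rescan without the callout
/sbin/multipath -v0                           # create the maps in one pass
mv /etc/dev.d/block/multipath.dev.disabled /etc/dev.d/block/multipath.dev
```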
---

(In reply to comment #45)
> Harald,
>
> In RHEL4's udev, is this necessary?

It is not _udev_ ... this is device-mapper-multipath:

```
$ rpm -qf /etc/dev.d/block/multipath.dev
device-mapper-multipath-....
```

> This is part of the multipath.dev script.
>
> There was a similar situation in RHEL 5.x (Bugzilla #460301). This looks
> similar to what has been commented in Comment #6 there.
>
> If disabling the above helps, we could just document it in the Rel. Notes.

Does it help? If yes, please open a bug against device-mapper-multipath.

---

Okay!! I've cloned the RHEL5 bugzilla to target RHEL4. This is for the delayed discovery issue. https://bugzilla.redhat.com/show_bug.cgi?id=487502

---

NetApp, could you please test the package builds available at the URL below as well? This package fixes this bug for 4.7.z (EUS z-stream), bug 481277. If you need additional arch packages to test, other than what's provided, please let me know. Thanks.

http://people.redhat.com/cward/4.7.z/udev/

Ritesh, can you get this tested in both 4.8 and 4.7.z, in both BZs?

---

~~ Attention Partners! ~~

RHEL 4.8 Beta has been released on partners.redhat.com. There should be a fix present which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with your test result details, including which arches were tested, the package version, and any applicable logs.

- Red Hat QE Partner Management

---

I tested on both RHEL 4.7 and RHEL 4.8. The OOM is no longer seen. 1024 devices (512 LUNs * 2 paths) were mapped to the host. It took almost 1 hour to make all the SCSI and multipath entries. A point to be noted is that it takes around 20 mins to add all the /dev/sd* entries; the rest of the time is taken to create the multipath entries.

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1004.html
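---

For anyone stuck on a pre-errata udev, the workaround established earlier in this thread condenses to restarting udevd with a reduced child limit before triggering discovery — a sketch, with 100 being the value that proved stable on the reporter's 2 GB host:

```sh
#!/bin/sh
# Workaround recap from this bug: cap udevd's forked children, then
# retrigger LUN discovery. Adjust the limit to your memory size.
killall udevd
export UDEVD_MAX_CHILDS=100
udevd &
sleep 2
time iscsi-rescan    # measure how long discovery takes at this limit
```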