Bug 902642
| Summary: | rhev-h node cannot join the rhevm, udevd[4196]: worker [63881] unexpectedly returned with status 0x0100. |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | udev |
| Version: | 6.3 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | davidyangyi <davidyangyi> |
| Assignee: | Harald Hoyer <harald> |
| QA Contact: | qe-baseos-daemons |
| CC: | bazulay, bmcclain, cpelland, cshao, davidyangyi, fdeutsch, fsimonce, gouyang, hadong, harald, iheim, jboggs, jkt, jmunilla, leiwang, lpeer, lyarwood, mkalinin, ovirt-maint, rbalakri, udev-maint-list, ycui |
| Target Milestone: | rc |
| Target Release: | --- |
| Flags: | harald: needinfo- |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Bug Blocks: | 1002699 |
| Last Closed: | 2014-12-19 08:39:21 UTC |
Description by davidyangyi, 2013-01-22 06:54:09 UTC
Can you provide the exact RHEV-H version which you tested? Please also add the /var/log/ovirt.log, dmesg, and messages files in their whole length, and finally the steps for how you reproduced this problem. Setting needinfo for the above requests.

Created attachment 686397 [details]
ovirt.log

Created attachment 686398 [details]
dmesg
rhevm v3.1, rhevh-6.3-20121212.0.el6_3.iso (kernel: 2.6.32-279.19.1.el6.x86_64). Because the bridge 'rhevm' goes down automatically, I manually execute "service network restart" to bring it up, so the log appears as below:

Shutting down interface rhevm:  [  OK  ]
Shutting down interface em1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface em1:  [  OK  ]
Bringing up interface rhevm:  [  OK  ]
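(Editor's aside, not from the original report: a quick way to confirm the state of the bridge before restarting networking. 'rhevm' is the bridge name used above; the commands are illustrative.)

```
# Show link state of the rhevm bridge and the ports attached to it
ip link show rhevm
brctl show rhevm
```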
Created attachment 686418 [details]
there are over 6000 logical volumes on the storage
How I reproduced this problem: when the rhev-h node starts up, the connection is OK, but after a while the error log appears and the bridge 'rhevm' is down. There are over 6000 logical volumes on the storage, and the boot process takes about 30 minutes. Executing "multipath -ll" takes 12 seconds to display the result, while a good rhev-h node needs just 2 seconds.
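(A minimal sketch for quantifying that comparison on an affected node versus a healthy one; the redirection is illustrative and not from the original report:)

```
# Time how long multipath takes to enumerate all paths; the report
# above saw ~12 s on affected nodes versus ~2 s on healthy ones
time multipath -ll > /dev/null
```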
RHEV-H QE are trying to reproduce this bug in QE's environment, but so far we cannot reproduce it. We will continue to analyze and test this bug; any update, we will post it here.

David, could you also add /var/log/messages, thanks.

Okay, it's in comment 9.

Harald, have you got an idea regarding the udev error messages which can be seen in attachment 686418 [details]?
Thank you all. There are 22 rhev-h nodes and only 1 rhev-m in my environment. There are 53 1T LUNs on the SAN storage, over 6000 LVs, and over 1000 VMs (100 VMs on each rhev-h; 256G memory and 64-core CPU). Now only 5 rhev-h nodes are left that cannot be added to rhev-m management, and all 5 nodes report the same errors. It appears that, no matter which nodes they are, the last few nodes being added to the rhev-m report the same errors.

(In reply to comment #14)
> Harald,
>
> have you got an idea regarding the udev error messages which can be seen in
> attachment 686418 [details]?

Yes: too many disks for LVM. LVM's udev rules do not scale linearly but quadratically in I/O and CPU usage.

How can I solve the problem?

(In reply to comment #17)
> How can I solve the problem?

IIUIC a short-term solution is to try to reduce the number of LVs or disks each host (RHEV-H) sees. Harald, can something be done on the LVM/udev side to allow the usage of such a high number of LVs and disks?
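(A sketch of one way to make a host see fewer devices, as suggested above: an LVM device filter. The patterns are purely illustrative, they assume PVs live only on multipath devices, and whether this helps depends on how RHEV-H drives LVM:)

```
# /etc/lvm/lvm.conf (illustrative excerpt)
devices {
    # accept device-mapper/multipath devices, reject everything else,
    # so LVM does not rescan thousands of individual /dev/sd* paths
    filter = [ "a|^/dev/mapper/|", "r|.*|" ]
}
```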
Seems to be more a thing about the LVM rules for udev; adding this after talking with Harald on IRC:

rhev-h specific, or rhel as well?

(In reply to comment #21)
> rhev-h specific, or rhel as well?

IIUIC, and I haven't confirmed it, but as it's an LVM udev-rule limit I'd expect this bug to also be present in RHEL. Maybe Tom Coughlan knows more about this.

Moving to vdsm -- it's likely something that will impact both RHEL and RHEV-H and needs to be evaluated from an overall architecture perspective.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

> (In reply to comment #14)
> > have you got an idea regarding the udev error messages which can be seen in
> > attachment 686418 [details]?
>
> Yes: too many disks for LVM. LVM's udev rules do not scale linearly but
> quadratically in I/O and CPU usage.

This is a udev issue, not vdsm.

(In reply to comment #9)
> Created attachment 686418 [details]
> there are over 6000 logical volumes on the storage

Well, that says it all. LVM scans all of them.

Please test this udev version: http://people.redhat.com/harald/downloads/udev/udev-147-2.47.hh.1.el6_4/

(In reply to Raul Cheleguini from comment #35)
> === In Red Hat Customer Portal Case 00936679 ===
> --- Comment by Cheleguini, Raul on 23/09/2013 14:40 ---
>
> Harald,
>
> The behavior is the same even passing the 'udevchilds' proposed value.
>
> rhevhprd01-2013092315141379949241 $ cat proc/cmdline
> root=live:LABEL=Root ro rootfstype=auto rootflags=ro crashkernel=128M
> elevator=deadline quiet rd_NO_LVM max_loop=256 rhgb rd_NO_LUKS rd_NO_MD
> rd_NO_DM udevchilds=168

So, this machine has 80 CPUs? If yes, does lowering the value have any effect?

Jan 22 06:50:57 N2-8 udevd[4196]: worker [63924] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/block/sdfh'

According to the log it also has a _lot_ of disks. So, an additional piece of advice is to look for HW problems or firmware updates.

(In reply to Harald Hoyer from comment #41)
> [same exchange as quoted above]

see also: https://bugzilla.redhat.com/show_bug.cgi?id=1016303#c18

Also, please try out: http://people.redhat.com/harald/downloads/udev/udev-147-2.54.el6/

Ok, please add "udevdebug udevlog" to the kernel command line and attach /dev/.udev/udev.log
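(For completeness, a sketch of collecting that debug log on a RHEL 6 style setup; the grub entry is illustrative, and only the "udevdebug udevlog" arguments and the log path come from the comment above:)

```
# 1. Append the debug options to the kernel line in /boot/grub/grub.conf,
#    e.g. (illustrative entry):
#    kernel /vmlinuz-2.6.32-279.19.1.el6.x86_64 ro root=... udevdebug udevlog
# 2. Reboot, reproduce the failure, then save the log for attachment:
cp /dev/.udev/udev.log /tmp/udev-$(hostname).log
```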