Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat.

Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 902642

Summary: rhev-h node cannot join the rhevm, udevd[4196]: worker [63881] unexpectedly returned with status 0x0100.
Product: Red Hat Enterprise Linux 6
Reporter: davidyangyi <davidyangyi>
Component: udev
Assignee: Harald Hoyer <harald>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: qe-baseos-daemons
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: bazulay, bmcclain, cpelland, cshao, davidyangyi, fdeutsch, fsimonce, gouyang, hadong, harald, iheim, jboggs, jkt, jmunilla, leiwang, lpeer, lyarwood, mkalinin, ovirt-maint, rbalakri, udev-maint-list, ycui
Target Milestone: rc
Flags: harald: needinfo-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-19 08:39:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1002699    
Attachments (description / flags):
ovirt.og (none)
dmesg (none)
there are over 6000 logical volumes on the storage (none)

Description davidyangyi 2013-01-22 06:54:09 UTC
Description of problem:
The RHEV-H node cannot join RHEV-M; it stays in "Connecting" status and cannot be activated or put into maintenance.

rhev-m is v3.1

/var/log/messages
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63881] unexpectedly returned with status 0x0100
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63881] failed while handling '/devices/virtual/bdi/130:32'
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63893] unexpectedly returned with status 0x0100
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63893] failed while handling '/devices/virtual/net/Vlan2211'
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63895] unexpectedly returned with status 0x0100
Jan 22 06:50:56 N2-8 udevd[4196]: worker [63895] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:2/block/sdfg'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63906] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63906] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/bsg/2:0:1:3'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63912] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63912] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/scsi_generic/sg164'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63916] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63916] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/scsi_device/2:0:1:3'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63919] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63919] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/scsi_disk/2:0:1:3'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63923] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63923] failed while handling '/devices/virtual/bdi/130:48'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63924] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63924] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/block/sdfh'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63925] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63925] failed while handling '/devices/virtual/net/em4.2211'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63926] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63926] failed while handling '/devices/virtual/net/em4.2212'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63927] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63927] failed while handling '/devices/virtual/net/Vlan2212'
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63931] unexpectedly returned with status 0x0100
Jan 22 06:50:57 N2-8 udevd[4196]: worker [63931] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:4/bsg/2:0:1:4'
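For anyone decoding these messages: the 0x0100 that udevd reports is the raw wait status from waitpid(). In the standard POSIX layout, the low 7 bits hold a terminating signal and the next byte holds the exit code, so 0x0100 means each worker exited normally with code 1 (a generic failure) rather than crashing on a signal. A minimal decoding sketch, using the status value from the log above:

```shell
# Decode a raw wait status as reported by udevd (e.g. 0x0100).
status=$((0x0100))
exit_code=$(( (status >> 8) & 0xff ))  # byte 1: the worker's exit code
signal=$(( status & 0x7f ))            # low 7 bits: fatal signal, 0 if none
echo "exit_code=$exit_code signal=$signal"  # exit_code=1 signal=0
```

Since the signal bits are zero, the workers are failing cleanly, which points at rule processing (timeouts or rule programs failing) rather than a udevd crash.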

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Ying Cui 2013-01-22 09:13:26 UTC
Can you provide the exact RHEV-H version which you tested?

Comment 3 Fabian Deutsch 2013-01-22 14:06:40 UTC
Please also attach the full /var/log/ovirt.log, dmesg, and messages files, and finally the steps for how you reproduced this problem.

Comment 4 Mike Burns 2013-01-22 15:34:37 UTC
Set needinfo for the above requests

Comment 5 davidyangyi 2013-01-24 01:22:52 UTC
Created attachment 686397 [details]
ovirt.og

ovirt.log

Comment 6 davidyangyi 2013-01-24 01:23:41 UTC
Created attachment 686398 [details]
dmesg

dmesg

Comment 7 davidyangyi 2013-01-24 01:24:39 UTC
rhevm v3.1
rhevh-6.3-20121212.0.el6_3.iso  (kernel: 2.6.32-279.19.1.el6.x86_64)

Comment 8 davidyangyi 2013-01-24 01:34:35 UTC
Because the bridge interface 'rhevm' goes down automatically, I manually ran "service network restart" to bring it back up, so the log shows:

Shutting down interface rhevm:  [  OK  ]
Shutting down interface em1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface em1:  [  OK  ]
Bringing up interface rhevm:  [  OK  ]

Comment 9 davidyangyi 2013-01-24 02:19:39 UTC
Created attachment 686418 [details]
there are over 6000 logical volumes on the storage

Comment 10 davidyangyi 2013-01-24 02:20:00 UTC
How I reproduce this problem:
When the RHEV-H node starts up, the connection is OK, but after a while the error log appears and the bridge interface 'rhevm' goes down.

There are over 6000 logical volumes on the storage, and booting takes about 30 minutes.
Running "multipath -ll" takes 12 seconds to display its result, while on a healthy RHEV-H it takes just 2 seconds.

Comment 11 Ying Cui 2013-01-24 09:03:18 UTC
RHEV-H QE is trying to reproduce this bug in QE's environment, but so far we cannot reproduce it. We will continue to analyze and test, and will post any updates here.

Comment 12 Fabian Deutsch 2013-01-24 12:27:18 UTC
David,

could you also add /var/log/messages, thanks.

Comment 13 Fabian Deutsch 2013-01-24 12:28:09 UTC
Okay, it's in comment 9.

Comment 14 Fabian Deutsch 2013-01-24 12:43:50 UTC
Harald,

have you got an idea regarding the udev error messages which can be seen in attachment 686418 [details]?

Comment 15 davidyangyi 2013-01-24 15:17:01 UTC
Thank you all.
There are 22 RHEV-H nodes and only 1 RHEV-M in my environment.
There are 53 1 TB LUNs on the SAN storage.
There are over 6000 LVs.
There are over 1000 VMs, about 100 VMs on each RHEV-H.

256 GB memory and a 64-core CPU.

Now only 5 RHEV-H nodes remain that cannot be added to RHEV-M management. All 5 nodes report the same errors.

It appears that, no matter which nodes they are, the last few nodes trying to join RHEV-M report the same errors.

Comment 16 Harald Hoyer 2013-01-25 14:40:15 UTC
(In reply to comment #14)
> Harald,
> 
> have you got an idea regarding udev the error messages which can be seen in 
> attachment 686418 [details] ?

Yes: too many disks for LVM. LVM's udev rules do not scale linearly but quadratically in I/O and CPU usage.
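To make the quadratic claim concrete: if rule-processing cost grows with the square of the LV count, going from a few hundred LVs to the 6000 reported here multiplies the work by orders of magnitude. A back-of-the-envelope sketch in relative units (an illustration of the scaling argument, not a measurement of this system):

```shell
# Relative cost if udev/LVM rule processing scales ~ n^2 in the LV count.
for n in 100 1000 6000; do
  echo "$n LVs -> relative cost $(( n * n ))"
done
```

At 6000 LVs the relative cost is 3600x that of 100 LVs, which is consistent with the 30-minute boot and the stalled udev workers reported in comment 10.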

Comment 17 davidyangyi 2013-01-28 05:35:55 UTC
How can I solve the problem?

Comment 18 Fabian Deutsch 2013-01-28 09:00:43 UTC
(In reply to comment #17)
> How can I solve the problem?

IIUIC a short term solution is to try to reduce the number of LVs or disks each host (RHEV-H) sees.
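One way to sketch this on the host side is an LVM device filter in /etc/lvm/lvm.conf so that LVM only scans the devices it actually needs. This is an illustrative fragment, not a tested configuration for this environment; the accept pattern below is a placeholder and must match the site's actual multipath device names:

```
# /etc/lvm/lvm.conf (excerpt): accept only multipath devices, reject the rest
devices {
    filter = [ "a|^/dev/mapper/mpath|", "r|.*|" ]
}
```

Reducing the number of LUNs zoned or mapped to each host achieves the same effect at the storage layer, without touching host configuration.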

Comment 19 Fabian Deutsch 2013-01-28 09:04:10 UTC
Harald,
can something be done on the LVM/udev side to allow the usage of such a high number of LVs and disks?

Comment 20 Fabian Deutsch 2013-01-28 14:42:57 UTC
Seems to be more a thing about the LVM rules for udev; adding this after talking with Harald on IRC:

Comment 21 Itamar Heim 2013-01-28 15:20:05 UTC
rhev-h specific or rhel as well?

Comment 22 Fabian Deutsch 2013-01-28 15:44:28 UTC
(In reply to comment #21)
> rhev-h specific or rhel as well?

IIUIC (and I haven't confirmed it), as it's an LVM udev-rule limit I'd expect this bug to also be present in RHEL.

Maybe Tom Coughlan knows more about this.

Comment 23 Mike Burns 2013-01-28 16:38:46 UTC
Moving to vdsm -- it's likely something that will impact both RHEL and RHEV-H and needs to be evaluated from an overall architecture perspective.

Comment 24 RHEL Program Management 2013-02-02 06:47:24 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 25 Ayal Baron 2013-02-03 16:15:49 UTC
(In reply to comment #16)
> (In reply to comment #14)
> > Harald,
> > 
> > have you got an idea regarding udev the error messages which can be seen in 
> > attachment 686418 [details] ?
> 
> yes.. too many disks for LVM. LVM's udev rules do not scale linearly but
> quadratic in I/O and CPU usage.

This is a udev issue, not vdsm.

Comment 26 Harald Hoyer 2013-02-12 15:01:02 UTC
(In reply to comment #9)
> Created attachment 686418 [details]
> there are over 6000 logicl volumes on the storage

Well, that says it all. LVM scans all of them.

Comment 27 Harald Hoyer 2013-04-02 07:12:53 UTC
Please test this udev version:

http://people.redhat.com/harald/downloads/udev/udev-147-2.47.hh.1.el6_4/

Comment 41 Harald Hoyer 2014-01-29 09:07:08 UTC
(In reply to Raul Cheleguini from comment #35)
> === In Red Hat Customer Portal Case 00936679 ===
> --- Comment by Cheleguini, Raul on 23/09/2013 14:40 ---
> 
> Harald,
> 
> The behavior is same even passing the 'udevchilds' proposed value.
> 
>  rhevhprd01-2013092315141379949241 $ cat proc/cmdline 
> root=live:LABEL=Root ro rootfstype=auto rootflags=ro crashkernel=128M
> elevator=deadline quiet rd_NO_LVM max_loop=256 rhgb rd_NO_LUKS rd_NO_MD
> rd_NO_DM udevchilds=168

So, this machine has 80 CPUs?

If yes, does lowering the value have any effect?

Jan 22 06:50:57 N2-8 udevd[4196]: worker [63924] failed while handling '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/2:0:1:3/block/sdfh'

According to the log it also has a _lot_ of disks. So, as additional advice, look for HW problems or firmware updates.
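Two quick checks relevant to the 80-CPU question, using standard procfs/sysfs paths (nothing specific to this system): count the logical CPUs the kernel sees and the block devices udev has to handle. The 80-CPU guess presumably comes from udevd deriving its default worker count from the CPU count; the exact formula varies by udev version, so treat that link as an assumption.

```shell
# Count logical CPUs and block devices visible to the kernel.
cpus=$(grep -c '^processor' /proc/cpuinfo)
blockdevs=$(ls /sys/block | wc -l)
echo "cpus=$cpus block_devices=$blockdevs"
```

If the device count is far larger than the LUN count suggests, stale or flapping SCSI devices would support the hardware/firmware suspicion above.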

Comment 42 Harald Hoyer 2014-01-29 09:10:06 UTC
(In reply to Harald Hoyer from comment #41)
> (In reply to Raul Cheleguini from comment #35)
> > === In Red Hat Customer Portal Case 00936679 ===
> > --- Comment by Cheleguini, Raul on 23/09/2013 14:40 ---
> > 
> > Harald,
> > 
> > The behavior is same even passing the 'udevchilds' proposed value.
> > 
> >  rhevhprd01-2013092315141379949241 $ cat proc/cmdline 
> > root=live:LABEL=Root ro rootfstype=auto rootflags=ro crashkernel=128M
> > elevator=deadline quiet rd_NO_LVM max_loop=256 rhgb rd_NO_LUKS rd_NO_MD
> > rd_NO_DM udevchilds=168
> 
> So, this machine has 80 CPUs?
> 
> If yes, does lowering the value have any effect?
> 
> Jan 22 06:50:57 N2-8 udevd[4196]: worker [63924] failed while handling
> '/devices/pci0000:20/0000:20:0b.0/0000:23:00.1/host2/rport-2:0-3/target2:0:1/
> 2:0:1:3/block/sdfh'
> 
> According to the log it also has a _lot_ of disks. So, an additional advice
> is to look for HW problems or firmware updates.

see also: https://bugzilla.redhat.com/show_bug.cgi?id=1016303#c18

Comment 46 Harald Hoyer 2014-06-20 08:05:10 UTC
Also, please try out:
http://people.redhat.com/harald/downloads/udev/udev-147-2.54.el6/

Comment 48 Harald Hoyer 2014-12-16 11:44:21 UTC
Ok, please add "udevdebug udevlog" to the kernel command line and attach /dev/.udev/udev.log