Bug 855107

Summary: udev race condition - event loop when hotplug is enabled and vlan interface is put down and up
Product: Red Hat Enterprise Linux 6 Reporter: Milos Vyletel <milos.vyletel>
Component: initscripts Assignee: David Kaspar // Dee'Kej <deekej>
Status: CLOSED WONTFIX QA Contact: qe-baseos-daemons
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2 CC: a15y87, deekej, harald, joseph.keller, jrieden, jzhenyon, milos.vyletel, pletisan, psedlak, vlad
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Release Note
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-31 13:57:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1075802, 1159926, 1172231, 1269194, 1356047, 1356056    
Attachments:
Description       Flags
eth0.280 config   none
eth0 config       none
patch proposal    none
patch             deekej: review-

Description Milos Vyletel 2012-09-06 17:40:07 UTC
Created attachment 610440 [details]
eth0.280 config

Description of problem:
We've discovered a race condition: when ifdown and ifup are called on a VLAN interface without any delay between them, udevd ends up in an endless loop. Udevd itself does not seem to be the problem; it just does whatever the kernel tells it to. The problem lies in the interaction between initscripts and the udev rules. Here's what's going on on the server:

ifdown eth0.280          |
KERNEL remove event      |   ifup eth0.280
UDEV remove event        |   KERNEL add event
net.hotplug calls ifdown |   UDEV add event
KERNEL remove event      |   net.hotplug calls ifup
UDEV remove event        |   KERNEL add event
...                      |   UDEV add event
                         |   ...

I have not seen any race when using a physical interface or a bridge; this seems to be isolated to VLAN interfaces. Disabling hotplug (HOTPLUG=no) for VLAN interfaces eliminates the race condition, as does putting sleep 1 between ifdown and ifup to allow net.hotplug to finish before ifup is called again.
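
For reference, the HOTPLUG=no workaround is a one-line change in the interface's ifcfg file. The fragment below is only a sketch: apart from HOTPLUG=no, the values are illustrative assumptions and are not taken from the attached configs.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0.280
# Illustrative values only; this is NOT the attached eth0.280 config.
DEVICE=eth0.280
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
# Workaround from this report: keep net.hotplug from calling ifup/ifdown
# for this interface in response to udev events.
HOTPLUG=no
```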

I was trying to find a fix, but I'm not really sure what the proper fix is. I was thinking about adding some kind of locking to the if{up,down} scripts to limit execution to one instance at a time. But this may be a bit too complicated, and maybe simply defaulting to HOTPLUG=no for VLANs would be sufficient.

Version-Release number of selected component (if applicable):
kernel-2.6.32-220.el6.x86_64
initscripts-9.03.27-1.el6.x86_64
udev-147-2.40.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. create vlan interface (see attached ifcfg-eth0(.280))
2. service network restart
3. ifdown eth0.280; ifup eth0.280
4. udevadm monitor (to see the actual udev loop)
  
Actual results:
(while running udevadm monitor in background, HOTPLUG=yes (default))
KERNEL[1346945864.946898] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945864.946925] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1346945864.947003] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945864.947080] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1346945864.947195] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945864.974109] add      /devices/virtual/net/eth0.280 (net)
KERNEL[1346945864.974233] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945864.974252] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945865.086871] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945865.086900] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945865.086924] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1346945866.102280] remove   /devices/virtual/net/eth0.280 (net)
KERNEL[1346945866.137424] add      /devices/virtual/net/eth0.280 (net)
KERNEL[1346945866.137546] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945866.137564] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1346945866.212688] add      /devices/virtual/net/eth0.280 (net)
UDEV  [1346945866.213016] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1346945866.213045] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1346945866.213064] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1346945866.213077] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945866.350869] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945866.350994] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945866.351101] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1346945867.115054] remove   /devices/virtual/net/eth0.280 (net)
KERNEL[1346945867.150201] add      /devices/virtual/net/eth0.280 (net)
<snip>
loop continues until udevd is killed

Expected results:
(while running udevadm monitor in background, HOTPLUG=no)
[root@localhost network-scripts]# ifdown eth0.280 && ifup eth0.280
KERNEL[1346945676.096862] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945676.096921] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1346945676.097102] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1346945676.097348] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1346945676.097383] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1346945676.123261] remove   /devices/virtual/net/eth0.280 (net)
KERNEL[1346945676.136736] add      /devices/virtual/net/eth0.280 (net)
KERNEL[1346945676.136764] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1346945676.136784] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1346945676.169711] add      /devices/virtual/net/eth0.280 (net)
UDEV  [1346945676.169972] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1346945676.170128] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
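
To tell a looping capture from a healthy one without reading every line, a small helper can count how often the kernel re-created the device. This is only a sketch: the function name `count_kernel_adds` is hypothetical, and it assumes the `udevadm monitor` line layout shown in the captures above.

```shell
# Count KERNEL "add" events for one virtual network device in a
# `udevadm monitor` capture read from standard input.
count_kernel_adds() {
    local dev="$1"   # interface name, e.g. eth0.280
    awk -v path="/devices/virtual/net/${dev}" '
        # $1 is the "KERNEL[timestamp]" tag, $2 the action, $3 the device path.
        # UDEV lines and the queues/rx-0, queues/tx-0 sub-devices do not match.
        $1 ~ /^KERNEL\[/ && $2 == "add" && $3 == path { n++ }
        END { print n + 0 }
    '
}
```

Fed a capture of a single `ifdown eth0.280; ifup eth0.280`, for example with `count_kernel_adds eth0.280 < capture.log` (the file name is an assumption), a count of 1 matches the HOTPLUG=no output above, while a count that keeps growing indicates the loop.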

Additional info:

Comment 2 Milos Vyletel 2012-09-06 17:40:59 UTC
Created attachment 610441 [details]
eth0 config

Comment 3 Milos Vyletel 2012-09-07 13:02:24 UTC
Forgot to mention hardware specs:

System Information
        Manufacturer: HP
        Product Name: ProLiant BL460c G1
BIOS Information
        Vendor: HP
        Version: I15
        Release Date: 10/25/2010

[root@localhost ~]# ethtool -i eth0
driver: bnx2
version: 2.1.11
firmware-version: bc 4.4.1
bus-info: 0000:03:00.0
[root@localhost ~]# modinfo bnx2
filename:       /lib/modules/2.6.32-220.el6.x86_64/kernel/drivers/net/bnx2.ko
firmware:       bnx2/bnx2-rv2p-09ax-6.0.17.fw
firmware:       bnx2/bnx2-rv2p-09-6.0.17.fw
firmware:       bnx2/bnx2-mips-09-6.2.1a.fw
firmware:       bnx2/bnx2-rv2p-06-6.0.15.fw
firmware:       bnx2/bnx2-mips-06-6.2.1.fw
version:        2.1.11
license:        GPL
description:    Broadcom NetXtreme II BCM5706/5708/5709/5716 Driver
author:         Michael Chan <mchan>
srcversion:     61BD2699C6587068253C2BB
alias:          pci:v000014E4d0000163Csv*sd*bc*sc*i*
alias:          pci:v000014E4d0000163Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000163Asv*sd*bc*sc*i*
alias:          pci:v000014E4d00001639sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias:          pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends:
vermagic:       2.6.32-220.el6.x86_64 SMP mod_unload modversions
parm:           disable_msi:Disable Message Signaled Interrupt (MSI) (int)

Not sure if it's important but I'm including it just in case. If you have more questions don't hesitate to ask.

Comment 4 Lukáš Nykrýn 2012-09-10 13:29:52 UTC
Thanks for the report, I was able to reproduce this.
The main issue here is that if we just call ifup eth0.280, ifup is started twice:
ifup eth0.280 -> kernel event -> udev reaction -> net.hotplug -> ifup eth0.280 (which is definitely bad behavior), and the same thing happens with ifdown.

I don't think that locking would help, so we have two options:
1) Ignore hotplug's calls of ifup and ifdown for vlans (but I am not sure whether this would break something)
2) Reassign this to kernel or maybe udev, as they might be able to solve this better at their level.

Comment 5 Milos Vyletel 2012-09-10 14:07:23 UTC
I've tried option 1) and it did not work:

# Ethernet 802.1Q VLAN support
-if [ "${VLAN}" = "yes" ] && [ "$ISALIAS" = "no" ]; then
+if [ "${VLAN}" = "yes" ] && [ "$ISALIAS" = "no" ] && [ -z "$IN_HOTPLUG" ]; then

Not only did the vlan end up in the down state, I could still see one unnecessary round of kernel/udev events, although they do not loop forever. Also, as you've said, it may actually break even more things...

[root@localhost ~]# ifdown eth0.280; ifup eth0.280
KERNEL[1347284812.857702] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1347284812.857771] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1347284812.857884] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1347284812.857980] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1347284812.858008] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1347284812.885215] add      /devices/virtual/net/eth0.280 (net)
KERNEL[1347284812.885254] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1347284812.885270] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1347284813.000878] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
KERNEL[1347284813.000930] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
KERNEL[1347284813.001009] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1347284814.016270] remove   /devices/virtual/net/eth0.280 (net)
UDEV  [1347284814.100786] add      /devices/virtual/net/eth0.280 (net)
UDEV  [1347284814.100987] add      /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1347284814.101012] add      /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1347284814.101143] remove   /devices/virtual/net/eth0.280/queues/tx-0 (queues)
UDEV  [1347284814.101160] remove   /devices/virtual/net/eth0.280/queues/rx-0 (queues)
UDEV  [1347284814.202761] remove   /devices/virtual/net/eth0.280 (net)


Having said that, I'm fine with reassigning to kernel/udev if you think they are the ones who should be fixing it. However, I personally think that initscripts are still responsible. Kernel/udev may have limited ways of knowing whether ifup/ifdown was the trigger for the event they received. In the end it's your call.

Comment 6 Milos Vyletel 2012-12-13 22:18:23 UTC
Created attachment 663210 [details]
patch proposal

Wait for udev to process all current events before exiting. This fixes race condition we've been having with vlan interfaces. All comments are appreciated.
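
The attachment itself is not reproduced in this report, so the following is only a sketch of the idea, based on the `udevadm settle --timeout=5` call quoted later in the thread. The function name `settle_udev` and the overridable timeout argument are assumptions for illustration, not part of the patch.

```shell
# Block until udev has drained its current event queue, or until the
# timeout expires. Intended to be called just before ifup/ifdown exits.
settle_udev() {
    local timeout="${1:-5}"   # seconds; 5 mirrors the value quoted in this report
    # Do nothing if udevadm is not available on this system.
    command -v udevadm >/dev/null 2>&1 || return 0
    udevadm settle --timeout="${timeout}"
}
```

Waiting here gives net.hotplug a chance to finish handling the events from one call before the next ifup or ifdown starts, which is the same effect the sleep 1 workaround achieves by guesswork.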

Comment 7 Milos Vyletel 2012-12-13 22:20:33 UTC
Comment on attachment 663210 [details]
patch proposal

swapped filenames in diff

Comment 8 Milos Vyletel 2012-12-13 22:21:19 UTC
Created attachment 663211 [details]
patch

Comment 9 Milos Vyletel 2013-01-30 19:18:20 UTC
Hi, any update? Has anyone had time to look at the proposed patch?

Comment 10 Lukáš Nykrýn 2013-01-31 08:30:18 UTC
This patch looks quite sane. We will consider including it in the next release.

Comment 11 Milos Vyletel 2013-01-31 13:25:04 UTC
Great. Thanks.

Comment 12 Harald Hoyer 2013-03-18 10:39:01 UTC
udevadm settle --timeout=5

So, timeout=5 is hardcoded? I don't think that is a good idea.

Comment 13 Milos Vyletel 2013-03-18 11:41:14 UTC
Fair enough. I don't like that hardcoded value either, but I could not come up with a better solution. What do you suggest?

Comment 14 Harald Hoyer 2013-03-20 10:39:42 UTC
And I also think that ifup/ifdown should somehow take a file lock (see flock(1) for shell).

Concurrently operating on routing tables, interface settings, etc. does not seem to be a way to get consistent settings.
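
A minimal sketch of what per-interface serialization with flock(1) could look like in shell. The wrapper name `with_iface_lock`, the `LOCK_DIR` variable, the lock file naming, and the 30-second wait are all assumptions for illustration; nothing like this is confirmed to exist in initscripts.

```shell
# Run a command while holding an exclusive lock for one interface, so that
# two invocations for the same interface cannot run concurrently.
with_iface_lock() {
    local iface="$1"; shift
    local lockfile="${LOCK_DIR:-/var/lock}/ifupdown-${iface}.lock"
    (
        # File descriptor 9 refers to the lock file opened by the redirection
        # below. Wait up to 30 seconds for any other holder to finish.
        flock -w 30 9 || exit 1
        "$@"
    ) 9>"${lockfile}"
}
```

It would be used as, for example, `with_iface_lock eth0.280 ifup eth0.280`. The lock only protects callers that go through the wrapper; anything that runs the underlying scripts directly bypasses it.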

Comment 15 pletisan 2014-07-02 15:03:31 UTC
I can confirm this is happening on Red Hat Enterprise Linux Server release 6.4 (Santiago), x86_64 arch.

Sleeping for 1 second between ifdown and ifup works around the issue.

Comment 16 David Kaspar // Dee'Kej 2016-10-29 14:56:17 UTC
*** Bug 952538 has been marked as a duplicate of this bug. ***

Comment 18 David Kaspar // Dee'Kej 2016-10-31 13:57:57 UTC
According to Lukas, this BZ has already been fixed:
https://github.com/fedora-sysv/initscripts/commit/0c78d0c

The locking mechanism for ifup/ifdown is a nice-to-have feature, but it would still not work correctly if someone called different networking subscripts manually, or from some other non-RHEL scripts.

Therefore, I'm closing this BZ as WONTFIX. In case anyone still faces this issue, please use the workaround:
> HOTPLUG=no
as mentioned in comment #0.

Best regards,

David

Comment 19 David Kaspar // Dee'Kej 2016-11-30 09:47:06 UTC
*** Bug 1398326 has been marked as a duplicate of this bug. ***