Description of problem:
There is a problem with RHEL6 as a virtualization guest under Microsoft Hyper-V which is 100% reproducible for the customer: when irqbalance is enabled in the guest, its networking stops shortly after the network service is started.

Version-Release number of selected component (if applicable):
Guest: Red Hat Enterprise Linux 6
Host: Windows Server 2008 R2

How reproducible:
100% reproducible

Steps to Reproduce:
Install RHEL6 in Hyper-V with 2 or more vCPUs and start the irqbalance daemon.

Actual results:
I ran the tests using the legacy network drivers and reproduced the behavior in my lab. I got the following errors after starting irqbalance:

--snip--
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
--/snip--

Probably, once irqbalance is started, the network interrupts no longer reach the hypervisor correctly, so the buffer fills up.

==> irqbalance stopped:

[asilva@localhost ~]$ ping www.redhat.com
PING e86.b.akamaiedge.net (96.6.144.112) 56(84) bytes of data.
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=1 ttl=55 time=124 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=2 ttl=55 time=123 ms
------
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=3 ttl=55 time=126 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=4 ttl=55 time=131 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=5 ttl=55 time=125 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=6 ttl=55 time=125 ms
^C
--- e86.b.akamaiedge.net ping statistics ---
96 packets transmitted, 96 received, 0% packet loss, time 95725ms

==> Starting irqbalance:

[root@localhost ~]# service irqbalance start
Starting irqbalance:  [  OK  ]

[asilva@localhost ~]$ ping www.redhat.com
PING e86.b.akamaiedge.net (96.6.144.112) 56(84) bytes of data.
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=1 ttl=55 time=123 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=2 ttl=55 time=260 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=3 ttl=55 time=137 ms
64 bytes from a96-6-144-112.deploy.akamaitechnologies.com (96.6.144.112): icmp_seq=4 ttl=55 time=126 ms
ping: sendmsg: No buffer space available
^C
--- e86.b.akamaiedge.net ping statistics ---
22 packets transmitted, 8 received, 63% packet loss, time 40614ms
rtt min/avg/max/mdev = 123.494/172.594/310.905/67.992 ms

==> Monitoring interrupts from eth0 on vCPU0:

[root@localhost ~]# while [ 1 ] ; do grep eth0 /proc/interrupts >> test; sleep 3; done
^C
[root@localhost ~]# cat test
  9:       1631          0   IO-APIC-fasteoi   acpi, eth0
  9:       1641          0   IO-APIC-fasteoi   acpi, eth0
  9:       1651          0   IO-APIC-fasteoi   acpi, eth0
  9:       1663          0   IO-APIC-fasteoi   acpi, eth0
  9:       1694          0   IO-APIC-fasteoi   acpi, eth0
  9:       1697          0   IO-APIC-fasteoi   acpi, eth0
  9:       1705          0   IO-APIC-fasteoi   acpi, eth0
  9:       1711          0   IO-APIC-fasteoi   acpi, eth0
  9:       1720          0   IO-APIC-fasteoi   acpi, eth0
  9:       1726          0   IO-APIC-fasteoi   acpi, eth0
  9:       1784          0   IO-APIC-fasteoi   acpi, eth0
  9:       1791          0   IO-APIC-fasteoi   acpi, eth0
  9:       1799          0   IO-APIC-fasteoi   acpi, eth0
  9:       1808          0   IO-APIC-fasteoi   acpi, eth0
  9:       1814          0   IO-APIC-fasteoi   acpi, eth0
  9:       1821          0   IO-APIC-fasteoi   acpi, eth0
  9:       1827          0   IO-APIC-fasteoi   acpi, eth0
  9:       1834          0   IO-APIC-fasteoi   acpi, eth0
  9:       1843          0   IO-APIC-fasteoi   acpi, eth0
  9:       1849          0   IO-APIC-fasteoi   acpi, eth0
  9:       1855          0   IO-APIC-fasteoi   acpi, eth0
  9:       1862          0   IO-APIC-fasteoi   acpi, eth0
  9:       1869          0   IO-APIC-fasteoi   acpi, eth0
  9:       1877          0   IO-APIC-fasteoi   acpi, eth0
  9:       1883          0   IO-APIC-fasteoi   acpi, eth0
  9:       1889          0   IO-APIC-fasteoi   acpi, eth0
  9:       1909          0   IO-APIC-fasteoi   acpi, eth0
  9:       1916          0   IO-APIC-fasteoi   acpi, eth0
  9:       1920          0   IO-APIC-fasteoi   acpi, eth0
  9:       1922          0   IO-APIC-fasteoi   acpi, eth0
  9:       1924          0   IO-APIC-fasteoi   acpi, eth0
  9:       1925          0   IO-APIC-fasteoi   acpi, eth0
  9:       1935          0   IO-APIC-fasteoi   acpi, eth0
  9:       1943          0   IO-APIC-fasteoi   acpi, eth0
==> from here (irqbalance started) the processor no longer responds to interrupts
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0
  9:       1948          0   IO-APIC-fasteoi   acpi, eth0

Expected results:
irqbalance works correctly and networking keeps functioning.

Additional info:
I also tested RHEL 6 with irqbalance on RHEV 2.2 and did not see any network issues.
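For anyone trying to confirm the same correlation, a minimal sketch (assuming eth0 shares IRQ 9 as in the output above; check /proc/interrupts on your own guest) is to watch the interrupt counter and the affinity mask side by side:

# Watch the eth0 interrupt counter and its current CPU affinity mask.
# In this guest eth0 is on IRQ 9 (shared with acpi); adjust as needed.
while true; do
    grep eth0 /proc/interrupts
    cat /proc/irq/9/smp_affinity
    sleep 3
done

If the counter stops climbing at the same moment the mask changes, that points at the affinity rewrite rather than at the irqbalance daemon itself.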
Neil, any guess here?
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. This request has been proposed for the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
Anton, sorry it's taken me so long to see this. In regards to your question, I'm not entirely sure what's going on here. What I can say definitively is that it's not (strictly speaking) an irqbalance problem. The irqbalance daemon moves processor affinity around by writing to /proc/irq/<irqn>/smp_affinity. So even if you turned off irqbalance, you could (or should), if such affinity movements are the root of this problem, be able to trigger the issue with a manual echo of an affinity mask to that same file. As such, it would seem to me to be the kernel's job to prevent such behavior. Comparing Hyper-V and Xen, it would seem to me that the Hyper-V kernel code needs to remap the irqs that get requested in the guest to a new irq_chip structure so that it has control over the set_affinity method when user space changes affinity (see bind_evtchn_to_irq for an example). Alternatively, Hyper-V could add some arch-specific code to flag all irq chips as not supporting affinity movement, so that userspace can't make any changes. Or the customer could use KVM, which already has this hashed out in qemu.
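To make that concrete, here is a minimal sketch of the manual reproducer described above, assuming eth0 sits on IRQ 9 as in the original report (adjust the IRQ number and CPU mask for your guest):

# With irqbalance stopped, perform the same affinity write it would make.
service irqbalance stop

# Move IRQ 9 (eth0 here) from CPU0 (mask 1) to CPU1 (mask 2); this is
# the same /proc interface irqbalance itself uses.
echo 2 > /proc/irq/9/smp_affinity

# If affinity movement is the root cause, the eth0 counter should stop
# increasing and ping should start failing, even with irqbalance off.
grep eth0 /proc/interrupts

If the manual write hangs networking the same way, that would confirm the problem sits below irqbalance.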
Closing the bug as WONTFIX, which here means we are unable to fix it on our side. This is an arch-specific issue that must be addressed in the Hyper-V kernel code.
Still the same issue with 6.2. Will this be fixed?
This issue is discussed in <https://access.redhat.com/kb/docs/DOC-49132>, "Network stops functioning on RHEL6 guest under Hyper-V when the irqbalance service is started". The Hyper-V hypervisor is a product of Microsoft. Fixing its limitations is outside Red Hat's scope. If you would like to see this limitation addressed, please contact your Microsoft support representative.
I thought that virtualization works this way: 1) Guest OSes are "unaware" of being virtualized. 2) The hypervisor is called only when needed, to facilitate simultaneous operation of the OSes and to protect access to SHARED system resources. The link above seems useless; how can you address Hyper-V as being the root cause here? I reproduced the same issue with Red Hat EL 6.1 / 6.2 with only 1 vCPU.
From my point of view, /etc/rc.d/init.d/irqbalance should have a section like this at the beginning:

if [ -x /usr/sbin/virt-what ]; then
    if [ "$(/usr/sbin/virt-what)" = "hyperv" ]; then
        exit 1
    fi
fi
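One caveat with the exact string compare: virt-what prints one fact per line and may print more than one, so a substring match is a bit more robust. A variant of the same idea, still assuming virt-what reports the "hyperv" fact for Hyper-V guests:

# Skip starting irqbalance when virt-what reports a Hyper-V guest.
# virt-what can emit more than one fact, so match anywhere in its output.
if [ -x /usr/sbin/virt-what ]; then
    case "$(/usr/sbin/virt-what 2>/dev/null)" in
        *hyperv*) exit 1 ;;
    esac
fi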
I have opened case 00598829 in the Red Hat customer portal to request this workaround or something similar or better.
*** Bug 788607 has been marked as a duplicate of this bug. ***
Robert, you have the right idea, but look at the description of the bug - RHEL is the guest OS here. When we alter irq affinity, we do so without any knowledge of the fact that we are running under a Hyper-V hypervisor. It is the responsibility of the hypervisor to trap such affinity changes and prevent them from occurring if applying them would stop delivery of needed interrupts to the guest. The Hyper-V hypervisor is a Microsoft product, hence this is their problem to fix.