Bug 506590 - libvirt should ignore NUMA cells with missing topology
Summary: libvirt should ignore NUMA cells with missing topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 11
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Daniel Berrangé
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F11VirtTarget
Reported: 2009-06-17 21:00 UTC by erikj
Modified: 2009-09-04 04:10 UTC
CC List: 8 users

Fixed In Version: 0.6.2-17.fc11
Clone Of:
Environment:
Last Closed: 2009-09-04 04:10:06 UTC
Type: ---
Embargoed:


Attachments
output from strace -o /tmp/foo2 -f libvirtd -v (457.81 KB, text/plain)
2009-06-17 21:00 UTC, erikj
Ignore NUMA cells with missing topology (970 bytes, patch)
2009-06-18 11:59 UTC, Daniel Berrangé
Ignore NUMA initialization failures (5.01 KB, patch)
2009-08-13 11:32 UTC, Daniel Berrangé

Description erikj 2009-06-17 21:00:55 UTC
Created attachment 348347
output from strace -o /tmp/foo2 -f libvirtd -v

When trying to start up virt-manager, I get a failure to connect to libvirtd.

When I start libvirtd in verbose mode, I get a QEMU memory error.

[root@cct201 tmp]# libvirtd -v
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
libvir: QEMU error : out of memory
13:59:05.168: info : Received unexpected signal 17
13:59:05.168: info : Received unexpected signal 17
13:59:05.168: info : Received unexpected signal 17


I did some searches on this problem.  I didn't find bugzilla bugs, but I did
find this thread:

http://fedoraforum.org/forum/showthread.php?p=1228012

I see this problem on a 2-socket Nehalem system.  I do not see this problem on
two other non-Nehalem 2-socket systems.

The system is an SGI XE270.
 - Supermicro mainboard X8DTN, 2-socket Nehalem
 - 8 GB memory, 2048 DIMMS, 1066 MHz Micron, DDR3, part number 18JSF25672PY-1G1D1


I did some research into this problem and looked at the strace output.
It appears that libvirt and its call chain assume that the 2nd node
on the multi-socket Nehalem system is number 1.  However, it is really 2.

# ls -ld /sys/devices/system/node/node*
drwxr-xr-x 2 root root 0 2009-06-17 14:57 /sys/devices/system/node/node0
drwxr-xr-x 2 root root 0 2009-06-17 14:57 /sys/devices/system/node/node2

It appears this mismatch -- where the tools assume node1 -- is likely
causing both the /sys message and the QEMU error.
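
The gap in the directory listing above can be detected mechanically. The
following is a minimal Python sketch, not code from libvirt or numactl: it
lists the nodeN directories under a sysfs root and reports any id below the
highest one that has no directory. The function names and the temporary
directory used in demo() are assumptions made for illustration.

```python
import os
import re
import tempfile

NODE_DIR = re.compile(r"^node(\d+)$")


def list_numa_node_ids(sysfs_node_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids that actually exist under sysfs."""
    ids = []
    for name in os.listdir(sysfs_node_root):
        match = NODE_DIR.match(name)
        if match and os.path.isdir(os.path.join(sysfs_node_root, name)):
            ids.append(int(match.group(1)))
    return sorted(ids)


def missing_node_ids(ids):
    """Return the ids in 0..max(ids) that have no sysfs directory."""
    if not ids:
        return []
    present = set(ids)
    return [n for n in range(max(ids) + 1) if n not in present]


def demo():
    # Recreate the node0/node2 layout from the report in a scratch directory.
    with tempfile.TemporaryDirectory() as root:
        for name in ("node0", "node2"):
            os.mkdir(os.path.join(root, name))
        ids = list_numa_node_ids(root)
        return ids, missing_node_ids(ids)


if __name__ == "__main__":
    print(demo())
```

A tool that iterates over 0..max instead of over the directories that exist
would try to open node1 here, which is consistent with the "No such file or
directory" warning shown earlier.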

Although the error is only printed when libvirtd runs in verbose mode, the
failure occurs either way: you cannot connect to libvirtd with virt-manager
and you cannot create new machines.

Over in BZ 499633, this was stated:

 Comment #23 From Daniel Berrange (berrange) 2009-06-17 16:44:54 EDT

  Yeah if numactl can't cope with non-contiguous NUMA node numbering and
  returns an error, then the libvirt QEMU driver will shut itself down,
  which would explain your problems there. Feel free to file a bug against
  Libvirt for this - the failure to query NUMA topology should not cause
  libvirt to stop working, it should simply continue without NUMA topology
  info
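
The behaviour asked for in the quoted comment, continuing without NUMA
information instead of shutting the driver down, can be sketched as follows.
This is an illustration in Python, not libvirt's C code: NumaQueryError,
build_capabilities and broken_query are hypothetical names, and the
dictionary stands in for whatever structure the driver really fills in.

```python
import logging

log = logging.getLogger("numa-sketch")


class NumaQueryError(Exception):
    """Hypothetical error raised when the host NUMA topology cannot be read."""


def build_capabilities(query_topology):
    """Build a capabilities dict in which NUMA data is optional, never fatal.

    query_topology is a hypothetical callable that returns a list of cells
    or raises NumaQueryError.
    """
    caps = {"numa_cells": None}
    try:
        caps["numa_cells"] = query_topology()
    except NumaQueryError as err:
        # Degrade gracefully instead of aborting driver startup.
        log.warning("NUMA topology unavailable, continuing without it: %s", err)
    return caps


def broken_query():
    """Stand-in for a topology query that fails on a host like the one reported."""
    raise NumaQueryError("node 1 has no sysfs entry")


if __name__ == "__main__":
    print(build_capabilities(broken_query))
```

The point of the design is that the caller always gets a usable result and
only the optional topology field is left empty.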

Some package versions:

# rpm -q libvirt virt-manager qemu-kvm kernel
libvirt-0.6.2-11.fc11.x86_64
virt-manager-0.7.0-5.fc11.x86_64
qemu-kvm-0.10.5-2.fc11.x86_64
kernel-2.6.29.4-167.fc11.x86_64

CPU info. Note: I tried with Hyperthreading on and off for fun, it's off
right now.

[root@cct201 libvirt-0.6.2]# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 6499.43
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.76
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.76
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.76
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 1
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 16
initial apicid	: 16
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.80
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 1
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.81
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 1
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.80
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 1
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 22
initial apicid	: 22
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5865.80
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Comment 1 erikj 2009-06-17 21:02:41 UTC
I forgot to mention that, on Nehalem systems, each socket is a separate node.
This is a difference from non-Nehalem systems.

Comment 2 erikj 2009-06-17 21:37:47 UTC
Thanks to Simon Phatigaraphong, here is some interesting output from the
numactl command on this problem system.

Kannan Somangili told us that RHEL5.3 showed sequential node numbers instead
of skipping node 1 like it does here. 

[root@cct201 tmp]# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 4087 MB
node 0 free: 3410 MB
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
node 1 cpus:
node 1 size: <not available>
node 1 free: <not available>
node 2 cpus:
node 2 size: 4096 MB
node 2 free: 3962 MB
No distance information available.
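
For anyone who wants to check other hosts for the same symptom, the output
above can be scanned for nodes whose size is reported as "<not available>".
This is a small Python sketch that only assumes the text layout shown in
this comment; unavailable_nodes and SAMPLE are illustrative names, and
SAMPLE is copied from the output above.

```python
import re

# Matches lines such as "node 1 size: <not available>".
UNAVAILABLE = re.compile(r"^node (\d+) size: <not available>\s*$", re.MULTILINE)


def unavailable_nodes(numactl_output):
    """Return the node ids whose size line reads '<not available>'."""
    return [int(m.group(1)) for m in UNAVAILABLE.finditer(numactl_output)]


SAMPLE = """\
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 4087 MB
node 0 free: 3410 MB
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
node 1 cpus:
node 1 size: <not available>
node 1 free: <not available>
node 2 cpus:
node 2 size: 4096 MB
node 2 free: 3962 MB
No distance information available.
"""

if __name__ == "__main__":
    print(unavailable_nodes(SAMPLE))
```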

Comment 3 erikj 2009-06-17 21:39:26 UTC
I will file a separate issue on numactl.  That's because I don't think
libvirtd should become useless if the topology isn't right.

So bug on numactl to follow.

Comment 4 erikj 2009-06-17 21:43:57 UTC
I think I'll hold off on a separate bug as I think it would just be a dupe
of 499633.  I'm open to advice.

Comment 5 Daniel Berrangé 2009-06-18 11:59:40 UTC
Created attachment 348421
Ignore NUMA cells with missing topology

Could you try applying this patch and see if it solves the problems you have? It'll make libvirt just ignore NUMA cells with missing topology.
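
The idea of ignoring individual cells can be sketched as below. This is not
the attached patch (which changes libvirt's C code and is not reproduced in
this thread); it is a Python illustration in which collect_cells and
fake_lookup are hypothetical, and the CPU lists in fake_lookup follow the
numactl output from comment 2.

```python
def collect_cells(max_node, cpus_for_node):
    """Collect per-node CPU lists, skipping nodes whose lookup fails.

    cpus_for_node is a hypothetical callable mapping a node id to a list of
    CPU ids; it raises OSError when the node has no topology information.
    """
    cells = []
    for node in range(max_node + 1):
        try:
            cpus = cpus_for_node(node)
        except OSError:
            continue  # missing topology: ignore this cell, keep the rest
        cells.append({"id": node, "cpus": cpus})
    return cells


def fake_lookup(node):
    """Mimic the reported host: node 1 has no entry, node 2 lists no CPUs."""
    table = {0: list(range(8)), 2: []}
    if node not in table:
        raise OSError(f"no topology for node {node}")
    return table[node]


if __name__ == "__main__":
    print(collect_cells(2, fake_lookup))
```

With the fake lookup, cells 0 and 2 are returned and the failing node 1 is
dropped without aborting the loop.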

Comment 6 Daniel Berrangé 2009-06-18 12:00:51 UTC
Oh, and if this works, can you attach the libvirt 'virsh capabilities' XML that results, just so we can make sure it's still correct.

Comment 7 erikj 2009-06-18 18:21:21 UTC
I filed BZ 506795 as a request to change numactl to the upstream version that
doesn't tip over for non-sequential nodes.

I filed BZ 506805 from the kernel angle, with links to a community discussion/patches.

I'm going to run the test for this PV shortly.

Comment 8 erikj 2009-06-18 18:33:55 UTC
Yes!  I rebuilt libvirt with the patch from comment #5 and it worked.

Note: I'm losing access to one of the few systems I can duplicate this on
late today.  Please let me know if there is anything else you want me to run
on this.  I'm not sure when I'll have access again :(

Per comment #6, here is the output:

[root@cct201 ~]# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <topology>
      <cells num='3'>
        <cell id='0'>
          <cpus num='8'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='64'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
            <cpu id='8'/>
            <cpu id='9'/>
            <cpu id='10'/>
            <cpu id='11'/>
            <cpu id='12'/>
            <cpu id='13'/>
            <cpu id='14'/>
            <cpu id='15'/>
            <cpu id='16'/>
            <cpu id='17'/>
            <cpu id='18'/>
            <cpu id='19'/>
            <cpu id='20'/>
            <cpu id='21'/>
            <cpu id='22'/>
            <cpu id='23'/>
            <cpu id='24'/>
            <cpu id='25'/>
            <cpu id='26'/>
            <cpu id='27'/>
            <cpu id='28'/>
            <cpu id='29'/>
            <cpu id='30'/>
            <cpu id='31'/>
            <cpu id='32'/>
            <cpu id='33'/>
            <cpu id='34'/>
            <cpu id='35'/>
            <cpu id='36'/>
            <cpu id='37'/>
            <cpu id='38'/>
            <cpu id='39'/>
            <cpu id='40'/>
            <cpu id='41'/>
            <cpu id='42'/>
            <cpu id='43'/>
            <cpu id='44'/>
            <cpu id='45'/>
            <cpu id='46'/>
            <cpu id='47'/>
            <cpu id='48'/>
            <cpu id='49'/>
            <cpu id='50'/>
            <cpu id='51'/>
            <cpu id='52'/>
            <cpu id='53'/>
            <cpu id='54'/>
            <cpu id='55'/>
            <cpu id='56'/>
            <cpu id='57'/>
            <cpu id='58'/>
            <cpu id='59'/>
            <cpu id='60'/>
            <cpu id='61'/>
            <cpu id='62'/>
            <cpu id='63'/>
          </cpus>
        </cell>
        <cell id='2'>
          <cpus num='0'>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu</emulator>
      <machine>pc</machine>
      <machine>isapc</machine>
      <domain type='qemu'>
      </domain>
      <domain type='kvm'>
        <emulator>/usr/bin/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <pae/>
      <nonpae/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/bin/qemu-system-x86_64</emulator>
      <machine>pc</machine>
      <machine>isapc</machine>
      <domain type='qemu'>
      </domain>
      <domain type='kvm'>
        <emulator>/usr/bin/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='mips'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-mips</emulator>
      <machine>mips</machine>
      <domain type='qemu'>
      </domain>
    </arch>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='mipsel'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-mipsel</emulator>
      <machine>mips</machine>
      <domain type='qemu'>
      </domain>
    </arch>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='sparc'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-sparc</emulator>
      <machine>sun4m</machine>
      <domain type='qemu'>
      </domain>
    </arch>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='ppc'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-ppc</emulator>
      <machine>g3bw</machine>
      <machine>mac99</machine>
      <machine>prep</machine>
      <domain type='qemu'>
      </domain>
    </arch>
  </guest>

</capabilities>
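
One way to review the topology section of output like the above is to
compare each cell's declared CPU count with the CPUs it lists and with the
host's processor count (8 in the /proc/cpuinfo dump in the description).
The Python sketch below does that; summarize_cells, _cell and SAMPLE are
illustrative names, and SAMPLE is a condensed stand-in for the XML above,
not the full document. The "suspicious" flag only marks cells worth a second
look; the thread does not say which values are expected.

```python
import xml.etree.ElementTree as ET


def summarize_cells(capabilities_xml, host_cpu_count):
    """Return (cell id, declared num, listed count, suspicious) per cell."""
    root = ET.fromstring(capabilities_xml)
    report = []
    for cell in root.findall("./host/topology/cells/cell"):
        cpus = cell.find("cpus")
        declared = int(cpus.get("num"))
        listed = len(cpus.findall("cpu"))
        # Flag empty cells, count mismatches, and cells larger than the host.
        suspicious = (
            declared != listed or declared == 0 or declared > host_cpu_count
        )
        report.append((int(cell.get("id")), declared, listed, suspicious))
    return report


def _cell(cell_id, n):
    """Build one <cell> element listing CPU ids 0..n-1."""
    cpus = "".join(f"<cpu id='{i}'/>" for i in range(n))
    return f"<cell id='{cell_id}'><cpus num='{n}'>{cpus}</cpus></cell>"


# Same cell sizes as the output above: 8, 64 and 0 CPUs.
SAMPLE = (
    "<capabilities><host><topology><cells num='3'>"
    + _cell(0, 8) + _cell(1, 64) + _cell(2, 0)
    + "</cells></topology></host></capabilities>"
)

if __name__ == "__main__":
    for row in summarize_cells(SAMPLE, host_cpu_count=8):
        print(row)
```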

Comment 9 Mark McLoughlin 2009-06-22 16:26:53 UTC
Thanks for testing, Erik.

Comment 10 Daniel Veillard 2009-06-26 20:09:38 UTC
Okay, I applied the patch upstream; it will be in 0.6.5 next week.
I'm not sure it really needs to be backported to F-11, though.

  thanks !

Daniel

Comment 11 Mark McLoughlin 2009-07-03 08:53:22 UTC
Erik: I think the issue is resolved in F-11 for you by numactl-2.0.3-1.fc11 ... please re-open if you think we should backport the libvirt fix to F-11.

Comment 12 erikj 2009-07-03 15:56:51 UTC
Just an FYI that I still use a patched libvirt because, even with the numactl
fix, I was still seeing the QEMU Memory error issue that prevented proper
startup.

Is that scary?  Maybe I should provide more information on this if it's
important.

Comment 13 Jason Tibbitts 2009-07-22 01:07:33 UTC
For the record, I'm having the same problem (dual socket Nehalem, Supermicro X8DTT-F motherboard, 48GB DDR3), fully updated F11, libvirt-0.6.2-12.fc11.x86_64, numactl-2.0.3-1.fc11.x86_64.  This bug seems pretty serious (no functionality at all, and no workaround) but I guess the number of folks who would want to do virtualization on this level of hardware under F11 is small so I can understand why a backport might not be forthcoming.

Fortunately I can build my own packages; anyone who happens across this ticket is welcome to grab my scratch build at
http://koji.fedoraproject.org/koji/taskinfo?taskID=1491137 (at least until it expires).

Comment 14 Mark McLoughlin 2009-08-07 15:04:49 UTC
(In reply to comment #12)
> Just an FYI that I still use a patched libvirt because, even with the numactl
> fix, I was still seeing the QEMU Memory error issue that prevented proper
> startup.
> 
> Is that scary?

Yes, it certainly is.

I don't really follow exactly what's going on here, but if the libvirt patch fixes it, we should just backport it.

Comment 15 Mark McLoughlin 2009-08-13 10:24:35 UTC
Jes points out related kernel bug #507033 and bug #506805

Comment 16 Daniel Berrangé 2009-08-13 11:32:31 UTC
Created attachment 357304
Ignore NUMA initialization failures

Comment 17 Fedora Update System 2009-08-13 16:34:05 UTC
libvirt-0.6.2-15.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/libvirt-0.6.2-15.fc11

Comment 18 Fedora Update System 2009-08-15 08:21:34 UTC
libvirt-0.6.2-15.fc11 has been pushed to the Fedora 11 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update libvirt'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-8598

Comment 19 Fedora Update System 2009-08-19 16:35:09 UTC
libvirt-0.6.2-16.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/libvirt-0.6.2-16.fc11

Comment 20 Fedora Update System 2009-08-20 20:58:35 UTC
libvirt-0.6.2-17.fc11 has been pushed to the Fedora 11 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update libvirt'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-8790

Comment 21 Fedora Update System 2009-09-04 04:09:34 UTC
libvirt-0.6.2-17.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.

