
Bug 688162

Summary: failure to create domains on NUMA machines?
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: low
Keywords: Regression
Target Milestone: rc
Reporter: Paolo Bonzini <pbonzini>
Assignee: Paolo Bonzini <pbonzini>
QA Contact: Virtualization Bugs <virt-bugs>
CC: leiwang, minovotn, mrezanin, mshao, qwan, xen-maint, yuzhang
Fixed In Version: xen-3.0.3-127.el5
Doc Type: Bug Fix
Clone Of: 669388
Bug Depends On: 669388
Bug Blocks: 514499
Last Closed: 2011-07-21 09:18:13 UTC

Comment 2 Michal Novotny 2011-03-22 12:14:51 UTC
Well, I've added logging to the XendDomainInfo.py file and I've been able to
isolate the code where it happens. The affected code is the affinity-setting
code, which is used only on NUMA systems.

The following VmError comes directly from the xc.vcpu_setaffinity() call,
which is the source of the failure:

[2011-03-22 12:50:35 xend.XendDomainInfo 5578] ERROR (XendDomainInfo:243) Domain construction failed
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 236, in create
    vm.initDomain()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2205, in initDomain
    raise VmError(str(exn))
VmError: (3, 'No such process')
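
For reference, this is a rough sketch of the kind of affinity-setting path in
initDomain() that produces the traceback above. It is simplified and is not
the exact RHEL 5 xend source; set_numa_affinity is an illustrative name:

    # Simplified sketch (not the shipped xend code) of how the NUMA affinity
    # setting ends up raising the VmError seen in the traceback.
    from xen.xend.XendError import VmError
    import xen.lowlevel.xc

    xc = xen.lowlevel.xc.xc()

    def set_numa_affinity(domid, vcpus, cpumask):
        try:
            for v in range(vcpus):
                # The DOMCTL issued here fails with (3, 'No such process')
                # once the domain has been destroyed underneath us.
                xc.vcpu_setaffinity(domid, v, cpumask)
        except Exception, exn:
            raise VmError(str(exn))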

The xc.vcpu_setaffinity() call basically issues a DOMCTL via the libxc
xc_vcpu_setaffinity() function. According to my investigation, error 3 ('No
such process') is -ESRCH, which the hypervisor returns either when the domain
cannot be found or when the specified VCPU has a NULL vcpu pointer, i.e.
doesn't exist:

        ret = -ESRCH;
        if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL )
            goto vcpuaffinity_out;
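
For clarity, errno 3 maps to ESRCH ('No such process') on Linux; a quick
standalone check, independent of Xen:

    import errno, os

    print errno.ESRCH               # prints 3 on Linux
    print os.strerror(errno.ESRCH)  # prints 'No such process'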

I also added logging to the hypervisor and I've been able to see:

(XEN) domctl.c:438:d0 Looking up for domain ffff8300af0fa080 ... found
(XEN) domctl.c:444:d0 Looking up for affinity for the VCPU done ... found
(XEN) domctl.c:438:d0 Looking up for domain 0000000000000000 ... not found

This is the place where the error occurs. The 0000000000000000 and
ffff8300af0fa080 values are just pointers printed with the "%p" format in the
gdprintk() calls, so the domain pointer is obviously invalid at the time of
failure.

I'll investigate this further, but I'd rather add debugging to the libxc side
first to confirm that everything is fine there.

Michal

Comment 3 Paolo Bonzini 2011-03-22 12:54:54 UTC
The ESRCH is a red herring; the error happens earlier. In find_relaxed_node() the call to XendDomain.instance().list() tries to skip the current domain:

                from xen.xend import XendDomain
                doms = XendDomain.instance().list()
                for dom in filter (lambda d: d.domid != self.domid, doms):

but it fails to do so. This is because XendDomain calls XendDomainInfo.recreate(), which fails because the domain info hasn't been initialized yet, so self.info['memory'] is zero. XendDomain then decides things are broken beyond repair, and destroys the domain.

So, before find_relaxed_node() returns, the domain is destroyed, and the following hypercall fails (as it should).
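
Annotated, the sequence looks roughly like this (a simplified sketch that
follows the snippet above, not the full RHEL 5 xend source):

    from xen.xend import XendDomain

    def find_relaxed_node(self):
        # XendDomain.instance().list() refreshes xend's view of all domains.
        # During that refresh the half-constructed domain is "recreated";
        # since self.info['memory'] is still 0 at this point, the recreate
        # fails and XendDomain destroys the very domain we are building.
        doms = XendDomain.instance().list()
        for dom in filter(lambda d: d.domid != self.domid, doms):
            pass  # account for memory already in use on each NUMA node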

Comment 4 Michal Novotny 2011-03-22 13:07:07 UTC
(In reply to comment #3)
> The ESRCH is a red herring; the error happens earlier. In find_relaxed_node()
> the call to XendDomain.instance().list() tries to skip the current domain:
> 
>                 from xen.xend import XendDomain
>                 doms = XendDomain.instance().list()
>                 for dom in filter (lambda d: d.domid != self.domid, doms):
> 
> but it fails to do so. This is because XendDomain calls
> XendDomainInfo.recreate(), which fails because the domain info hasn't been
> initialized yet, so self.info['memory'] is zero. XendDomain then decides
> things are broken beyond repair, and destroys the domain.
> 
> So, before find_relaxed_node() returns, the domain is destroyed, and the
> following hypercall fails (as it should).

Thanks for your investigation, Paolo. This is a good place to start. You're referring to some hypercall now, and I guess this is the hypercall being called from XendDomain.instance().list(), right?

I'll investigate it there.

Thanks again,
Michal

Comment 5 Paolo Bonzini 2011-03-22 13:21:56 UTC
No, the hypercall that fails is (as you had correctly found) vcpu_setaffinity.  But that's just the first piece that sees the destroyed domain.  The fix is simply not to refresh the list when calling list().
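
Sketched as code, the idea is roughly the following; the refresh keyword
argument is a hypothetical illustration, not necessarily what the actual
patch uses:

    # Hypothetical sketch of the fix described above: use xend's cached
    # domain list instead of triggering a refresh while the new domain is
    # still being constructed.  The 'refresh=False' argument is illustrative.
    doms = XendDomain.instance().list(refresh=False)
    for dom in filter(lambda d: d.domid != self.domid, doms):
        pass  # the half-built domain is no longer destroyed by a refresh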

I'm curious if you can make xend fail by calling "xm list" quickly in a loop, while running "xm create", even with the patch...  That would be another bug though.

Comment 9 Qixiang Wan 2011-04-01 12:36:48 UTC
Verified with xen-3.0.3-127.el5.

Reproduced with xen-3.0.3-126.el5 with host NUMA enabled:

$ xm create test.cfg 
Using config file "./test.cfg".
Using <class 'grub.GrubConf.GrubConfigFile'> to parse /grub/menu.lst
Error: (3, 'No such process')

With the -127 build, both HVM and PV guests can be created successfully and work well.

Comment 10 errata-xmlrpc 2011-07-21 09:18:13 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html
