I've added logging to XendDomainInfo.py and isolated the code where this happens. The affected code is the affinity-setting code, which is only used on NUMA systems. The following VmError comes directly from the xc.vcpu_setaffinity() call, which is the source of the failure:

    [2011-03-22 12:50:35 xend.XendDomainInfo 5578] ERROR (XendDomainInfo:243) Domain construction failed
    Traceback (most recent call last):
      File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 236, in create
        vm.initDomain()
      File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2205, in initDomain
        raise VmError(str(exn))
    VmError: (3, 'No such process')

xc.vcpu_setaffinity() basically issues a DOMCTL via the xc_vcpu_setaffinity() function. According to my investigation, error 3 ("No such process") is -ESRCH, which the hypervisor returns either when the domain cannot be found or when the VCPU specified for the affinity change is NULL, i.e. doesn't exist:

    ret = -ESRCH;
    if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL )
        goto vcpuaffinity_out;

I also added logging to the hypervisor and saw:

    (XEN) domctl.c:438:d0 Looking up for domain ffff8300af0fa080 ... found
    (XEN) domctl.c:444:d0 Looking up for affinity for the VCPU done ... found
    (XEN) domctl.c:438:d0 Looking up for domain 0000000000000000 ... not found

which is where the error occurs. The 0000000000000000 and ffff8300af0fa080 values are just pointers printed with the "%p" format by the gdprintk() calls, so the domain pointer is obviously invalid at the time of failure. I'll investigate this further, but I'd rather add debugging to the libxc part first to confirm that everything is fine on that side.

Michal
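For readers following the traceback: the lines below are a simplified sketch, not the actual XendDomainInfo code, of how such a VmError is produced. The helper name set_affinity_or_raise() is invented for illustration; the real logic lives inside initDomain() and builds the CPU mask from the NUMA placement computed earlier.

    from xen.xend.XendError import VmError
    import xen.lowlevel.xc

    xc = xen.lowlevel.xc.xc()

    def set_affinity_or_raise(domid, vcpu_count, cpumask):
        # Illustrative only: pin each VCPU of the domain to the given CPUs.
        # xc.vcpu_setaffinity() issues the DOMCTL; the hypervisor answers
        # -ESRCH when the domain or the requested VCPU does not exist, and
        # that errno surfaces as the (3, 'No such process') tuple above.
        try:
            for v in range(vcpu_count):
                xc.vcpu_setaffinity(domid, v, cpumask)
        except Exception, exn:
            # Corresponds to the "raise VmError(str(exn))" frame at
            # XendDomainInfo.py line 2205 in the traceback.
            raise VmError(str(exn))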
The ESRCH is a red herring; the error happens earlier. In find_relaxed_node(), the call to XendDomain.instance().list() tries to skip the current domain:

    from xen.xend import XendDomain
    doms = XendDomain.instance().list()
    for dom in filter(lambda d: d.domid != self.domid, doms):

but it fails to do so. This is because XendDomain calls XendDomainInfo.recreate(), which fails because the domain info hasn't been initialized yet, so self.info['memory'] is zero. XendDomain then decides things are broken beyond repair and destroys the domain.

So, before find_relaxed_node() returns, the domain has already been destroyed, and the following hypercall fails (as it should).
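To make the ordering hazard concrete, here is a minimal, self-contained Python illustration. This is not xend code; the Registry class and its fields are invented for the example. The point is that a list() call with a destructive "refresh" side effect removes the half-built entry, so the caller's filter never even gets a chance to matter.

    class Registry(object):
        def __init__(self):
            self.items = {}

        def list(self):
            # The refresh validates every entry; a half-built entry
            # (memory still zero) is treated as broken and destroyed.
            for key, item in list(self.items.items()):
                if item['memory'] == 0:
                    del self.items[key]
            return self.items.values()

    registry = Registry()
    registry.items[1] = {'domid': 1, 'memory': 0}   # still being constructed

    # The caller only wanted to skip itself...
    others = [d for d in registry.list() if d['domid'] != 1]

    # ...but the refresh already destroyed the new entry, so any later
    # operation on domain 1 fails, just like the vcpu_setaffinity DOMCTL.
    print 1 in registry.items    # prints False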
(In reply to comment #3)
> The ESRCH is a red herring; the error happens earlier. In find_relaxed_node(),
> the call to XendDomain.instance().list() tries to skip the current domain:
>
>     from xen.xend import XendDomain
>     doms = XendDomain.instance().list()
>     for dom in filter(lambda d: d.domid != self.domid, doms):
>
> but it fails to do so. This is because XendDomain calls
> XendDomainInfo.recreate(), which fails because the domain info hasn't been
> initialized yet, so self.info['memory'] is zero. XendDomain then decides
> things are broken beyond repair and destroys the domain.
>
> So, before find_relaxed_node() returns, the domain has already been
> destroyed, and the following hypercall fails (as it should).

Thanks for your investigation, Paolo. This is a good place to start. You're referring to a hypercall here; I guess you mean the hypercall issued from XendDomain.instance().list(), right? I'll investigate at that point.

Thanks again,
Michal
No, the hypercall that fails is (as you had correctly found) vcpu_setaffinity. But that's just the first piece of code that sees the destroyed domain. The fix is simply not to refresh the list when calling list().

I'm curious whether you can make xend fail by calling "xm list" quickly in a loop while running "xm create", even with the patch... That would be another bug, though.
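One way to sidestep the refresh entirely is sketched below, purely for illustration; this is not necessarily what the shipped patch does, and the other_domains() helper name is invented here. The idea is for the NUMA placement code to enumerate the other domains straight from the hypervisor via xc.domain_getinfo(), which never touches xend's recreate/destroy bookkeeping.

    import xen.lowlevel.xc

    xc = xen.lowlevel.xc.xc()

    def other_domains(self_domid):
        # domain_getinfo() returns one dict per domain ('domid', 'mem_kb', ...)
        # read directly from the hypervisor, so listing them cannot trigger
        # XendDomainInfo.recreate() on the domain we are still constructing.
        return [d for d in xc.domain_getinfo(0, 1024) if d['domid'] != self_domid]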
Verified with xen-3.0.3-127.el5.

Reproduced with xen-3.0.3-126.el5 with host NUMA enabled:

    $ xm create test.cfg
    Using config file "./test.cfg".
    Using <class 'grub.GrubConf.GrubConfigFile'> to parse /grub/menu.lst
    Error: (3, 'No such process')

With the 127 build, both HVM and PV guests can be created successfully and work well.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html