Bug 1150585

Summary: numad dies after a number of "Could not write 1 to /cgroup/cpuset/libvirt/qemu/vm_name/emulator/cpuset.mems -- errno: 13" errors
Product: Red Hat Enterprise Linux 6
Reporter: Alexandros Gkesos <agkesos>
Component: numad
Assignee: Bill Gray <bgray>
Status: CLOSED ERRATA
QA Contact: qe-baseos-daemons
Severity: medium
Docs Contact:
Priority: medium
Version: 6.5
CC: bgray, jprokes, jscotka, jsynacek, juergen_thomann, ppostler, psklenar, tlavigne
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1235164
Environment:
Last Closed: 2015-07-22 07:46:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Hypervisor's sosreport none

Description Alexandros Gkesos 2014-10-08 13:08:15 UTC
Description of problem:

After upgrading from numad-0.5-9.20130814git.el6.x86_64 to numad-0.5-10.20140620git.el6_5.x86_64 on a KVM hypervisor, we see the following errors in numad.log, and numad later crashes.

...
Tue Oct  7 15:00:02 2014: Advising pid 7952 (qemu-kvm) move from nodes (1) to nodes (1)
Tue Oct  7 15:00:02 2014: Could not write 1 to /cgroup/cpuset/libvirt/qemu/zevis2/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:00:02 2014: Could not write 0-1 to /cgroup/cpuset/libvirt/qemu/zevis2/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:00:02 2014: PID 7952 moved to node(s) 1 in 0.0 seconds
Tue Oct  7 15:01:37 2014: Advising pid 8033 (qemu-kvm) move from nodes (1) to nodes (1)
Tue Oct  7 15:01:37 2014: Could not write 1 to /cgroup/cpuset/libvirt/qemu/tcn1/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:01:37 2014: Could not write 0-1 to /cgroup/cpuset/libvirt/qemu/tcn1/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:01:37 2014: PID 8033 moved to node(s) 1 in 0.0 seconds
Tue Oct  7 15:02:57 2014: Advising pid 11520 (qemu-kvm) move from nodes (1) to nodes (1)
Tue Oct  7 15:02:57 2014: Could not write 1 to /cgroup/cpuset/libvirt/qemu/ntvmeuc02/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:02:57 2014: Could not write 0-1 to /cgroup/cpuset/libvirt/qemu/ntvmeuc02/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:02:57 2014: PID 11520 moved to node(s) 1 in 0.0 seconds
Tue Oct  7 15:03:02 2014: Could not open stat file: /proc/1068/stat
Tue Oct  7 15:03:02 2014: Advising pid 26413 (qemu-kvm) move from nodes (1) to nodes (1)
Tue Oct  7 15:03:02 2014: Could not write 1 to /cgroup/cpuset/libvirt/qemu/zevisk1/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:03:02 2014: Could not write 0-1 to /cgroup/cpuset/libvirt/qemu/zevisk1/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:03:02 2014: PID 26413 moved to node(s) 1 in 0.0 seconds
Tue Oct  7 15:03:07 2014: Could not get node meminfo


Behaviour with the previous numad version:

Tue Oct  7 08:37:53 2014: Advising pid 16553 (qemu-kvm) move from nodes (0) to nodes (1)
Tue Oct  7 08:37:53 2014: Including task: 16553
Tue Oct  7 08:37:53 2014: Including task: 16632
Tue Oct  7 08:37:53 2014: Including task: 16633
Tue Oct  7 08:37:53 2014: Including task: 16634
Tue Oct  7 08:37:53 2014: Including task: 16635
Tue Oct  7 08:37:53 2014: PID 16553 moved to node(s) 1 in 0.0 seconds
Tue Oct  7 08:37:58 2014: Advising pid 11520 (qemu-kvm) move from nodes (0) to nodes (1)
Tue Oct  7 08:37:58 2014: Including task: 11520
Tue Oct  7 08:37:58 2014: Including task: 11592
Tue Oct  7 08:37:58 2014: Including task: 11593
Tue Oct  7 08:37:58 2014: PID 11520 moved to node(s) 1 in 0.1 seconds


Version-Release number of selected component (if applicable): numad-0.5-10.20140620git.el6_5.x86_64


How reproducible: Every time (by customer)


Steps to Reproduce:
1. Upgrade from numad-0.5-9.20130814git.el6.x86_64 to numad-0.5-10.20140620git.el6_5.x86_64

Actual results:

Tue Oct  7 15:00:02 2014: Could not write 1 to /cgroup/cpuset/libvirt/qemu/zevis2/emulator/cpuset.mems -- errno: 13
Tue Oct  7 15:00:02 2014: Could not write 0-1 to /cgroup/cpuset/libvirt/qemu/zevis2/emulator/cpuset.mems -- errno: 13

and numad crashes after some time.

Expected results:

No errors (the normal logs from the previous version are included in the description above).
numad does not crash.

Comment 4 Alexandros Gkesos 2014-10-09 07:21:25 UTC
Created attachment 945210
Hypervisor's sosreport

Comment 6 Jürgen Thomann 2015-03-13 10:04:33 UTC
We have the same problem and it is caused by too many open file descriptors. I'm now testing the following patch:

--- a/numad.c
+++ b/numad.c
@@ -1111,6 +1111,7 @@ int write_to_cpuset_file(char *fname, char *s) {
     numad_log(LOG_DEBUG, "Writing %s to: %s\n", s, fname);
     if (write(fd, s, strlen(s)) <= 0) {
         numad_log(LOG_CRIT, "Could not write %s to %s -- errno: %d\n", s, fname, errno);
+        close(fd);
         return -1;
     }
     close(fd);
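
For readers following the hunk above, here is a minimal standalone sketch of the same idea: route every exit after a successful open() through a single close(), so a failed write cannot leak a descriptor. This is not the numad source. The function name write_to_cpuset_file_sketch, the open() flags, and the use of fprintf in place of numad_log are assumptions made for illustration; only the write() check and the log message format are taken from the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for write_to_cpuset_file() from the patch.
 * Returns 0 on success and -1 on failure. Every path that opened
 * the file also closes it, including the failed-write path. */
int write_to_cpuset_file_sketch(const char *fname, const char *s) {
    /* The open flags are an assumption; the hunk does not show the open() call. */
    int fd = open(fname, O_WRONLY | O_TRUNC);
    if (fd < 0) {
        fprintf(stderr, "Could not open %s -- errno: %d\n", fname, errno);
        return -1;
    }

    int rc = 0;
    /* Same condition as in the patch: zero or negative counts as failure. */
    if (write(fd, s, strlen(s)) <= 0) {
        fprintf(stderr, "Could not write %s to %s -- errno: %d\n", s, fname, errno);
        rc = -1;
    }

    /* Single exit point: the descriptor is released on success and on error. */
    close(fd);
    return rc;
}
```

Without the close() on the error path, each failed write leaves one descriptor open. With the write failing for every guest on every scan, as in the log in the description, the process would eventually hit its open-file limit, at which point later open() calls fail too.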

Comment 7 Bill Gray 2015-06-01 12:33:55 UTC
It looks like the patch in Comment 6 would have prevented numad from crashing: it closes the file on the error path when a write to the cpuset file fails, which the original code failed to do. Thanks.

The new version of numad no longer uses cpusets, entirely eliminating the writes to the cpuset control files.

Comment 14 errata-xmlrpc 2015-07-22 07:46:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1441.html