Bug 813228

Summary: Add workaround for kernel breaking cpuset cgroup on hibernate/suspend
Product: Fedora
Reporter: Daniel Berrangé <berrange>
Component: systemd
Assignee: systemd-maint
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 17
CC: cfergeau, crobinso, harald, johannbg, mails.bugzilla.redhat.com, marcandre.lureau, metherid, mschmidt, notting, plautrba, srivatsa.bhat, systemd-maint, yashin.vladimir
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-05-04 11:20:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 714271, 752653
Attachments:
  pm-utils hook to preserve cpuset affinity across suspend/hibernate (Flags: none)
  simplified script (Flags: none)

Description Daniel Berrangé 2012-04-17 09:28:37 UTC
Description of problem:
Due to the following kernel bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=714271

when a host goes into the suspend/hibernate state, the kernel clears out all CPUs from the cpuset cgroups. This means that upon resume, all processes in any cpuset cgroup are pinned to just one physical CPU core.

Since this kernel bug has existed for so long and the kernel developers show no signs of being able to fix it in the foreseeable future, we really need a temporary userspace workaround for the breakage.

Using a pm-utils hook, we can save the current cgroup cpuset affinity before suspend/hibernate & restore it upon resume.
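For reference, pm-utils runs executable hooks from /etc/pm/sleep.d/ with an argument of "suspend" or "hibernate" before sleeping and "resume" or "thaw" afterwards. Below is a minimal sketch of such a hook; the file name, the save-file path and the exact find invocation are illustrative only, the real script is in the attachment referenced in comment 1.

    #!/bin/sh
    # Hypothetical /etc/pm/sleep.d/01cpusets: save cpuset masks before
    # sleeping, write them back after resume.
    SAVEFILE=/var/run/saved_cpusets.txt

    case "$1" in
        suspend|hibernate)
            # Record "path value" pairs for every cpuset group; find lists
            # parents before children, which is the order needed on restore.
            find -L /sys/fs/cgroup/cpuset -mindepth 1 -type d | while read d; do
                echo "$d/cpuset.cpus $(cat $d/cpuset.cpus)"
            done > "$SAVEFILE"
            ;;
        resume|thaw)
            [ -f "$SAVEFILE" ] || exit 0
            while read cpuset_path value; do
                echo "$value" > "$cpuset_path"
            done < "$SAVEFILE"
            rm -f "$SAVEFILE"
            ;;
    esac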

We really need this for libvirt, but the fix we came up with is general purpose and applies to any usage of the cpuset cgroup controller. Thus I think the workaround ought to be done by systemd, or perhaps the libcgroup RPM.

Version-Release number of selected component (if applicable):
systemd-44-4.fc17

How reproducible:
Always

Steps to Reproduce:
1. cd /sys/fs/cgroup/cpuset
2. mkdir foo
3. cd foo
4. echo "0-1" > cpuset.cpus
5. pm-suspend
6. cat cpuset.cpus
  
Actual results:
0

Expected results:
0-1

Additional info:

Comment 1 Daniel Berrangé 2012-04-17 09:30:46 UTC
Created attachment 577965 [details]
pm-utils hook to preserve cpuset affinity across suspend/hibernate

The attached patch was developed by Srivatsa S. Bhat as part of this libvirt discussion on the problem

https://www.redhat.com/archives/libvir-list/2012-April/msg00777.html

Comment 2 Harald Hoyer 2012-04-17 13:09:46 UTC
oh 
+-^

	while read line
	do
		cpuset_path=`echo $line | cut -d' ' -f1`
		value=`echo $line | cut -d' ' -f2`
		echo "$value" > $cpuset_path
	done < saved_cpusets.txt

Comment 3 Harald Hoyer 2012-04-17 13:11:20 UTC
(In reply to comment #2)
> oh 
> +-^
> 
>  while read line
>  do
>   cpuset_path=`echo $line | cut -d' ' -f1`
>   value=`echo $line | cut -d' ' -f2`
>   echo "$value" > $cpuset_path
>  done < saved_cpusets.txt

This should really be:

 while read cpuset_path value
 do
  echo "$value" > $cpuset_path
 done < saved_cpusets.txt


Someone does not really know shell...

Comment 4 Michal Schmidt 2012-04-17 13:36:36 UTC
Created attachment 578035 [details]
simplified script

I noticed that too. Here's an improved version of the script.
I'm expressing no opinion yet on adding it to a package.

Comment 5 Matthias Hensler 2012-04-24 13:59:31 UTC
To make the script in attachment 578035 [details] actually work, you have to change the mindepth in the find command from "2" to "1".

Otherwise restoring cpusets will not work, because when run with depth 2 the result is this:

/sys/fs/cgroup/cpuset/libvirt/lxc/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-3


After a suspend/resume cycle you cannot restore the cpusets from the above configuration, because /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus is still set to "0" and you will get a permission denied error. Changing the depth to "1", however, produces the following result:

/sys/fs/cgroup/cpuset/libvirt/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/lxc/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-3

That configuration can be restored on resume without a problem.

After placing the modified script (with depth 1) into /etc/pm/sleep.d/01cpusets.sh, libvirtd and all machines work as expected after a suspend/resume cycle.
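To make the change concrete, this is the kind of adjustment being described (a sketch, assuming the attachment's save step walks the cgroup directories with find, as the script in comment 8 below does):

    # original: -mindepth 2 skips /sys/fs/cgroup/cpuset/libvirt itself
    find -L /sys/fs/cgroup/cpuset -mindepth 2 -type d
    # fixed: -mindepth 1 includes the parent group, and find emits parents
    # before children, so the parents are widened first on restore
    find -L /sys/fs/cgroup/cpuset -mindepth 1 -type d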

Comment 6 Cole Robinson 2012-04-25 14:10:29 UTC
Proposing as an f17 nice-to-have

Comment 7 Michal Schmidt 2012-04-25 17:43:33 UTC
I am -1 on NTH for the following reasons:
 - The bug is not at all related to the release criteria.
 - Not many users are affected.
 - Those who really care can place the workaround script on their systems.
 - It can be fixed in a post-release update.
 - Srivatsa S. Bhat is making progress on a proper fix in the kernel:
http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289

Comment 8 Vladimir Yashin 2012-05-03 21:40:52 UTC
During suspend, when cpuset.cpus is flushed, all processes from the LXC containers are moved into the sysdefault cgroup and they are not restored afterwards.
So effectively, on my system, all cgroups are empty except sysdefault.

Here is a small modification to the workaround script:

SAVEDIR=/run/ugly-hack-for-bz813228-saved_cpusets
save_cpusets()
{
    mkdir -p $SAVEDIR
    find -L /sys/fs/cgroup/cpuset -mindepth 1 -type d | while read cspath; do
      mkdir -p $SAVEDIR/$cspath
      cp $cspath/cpuset.cpus $cspath/tasks $SAVEDIR/$cspath/
    done
}
restore_cpusets()
{
    cd $SAVEDIR
    find -L . -type d | while read cspath; do
        [ -f $cspath/cpuset.cpus ] && cp $cspath/cpuset.cpus /$cspath/cpuset.cpus
        [ -f $cspath/tasks ] && while read pid; do echo $pid > /$cspath/tasks; done < $cspath/tasks
    done
    rm -rf $SAVEDIR
}

Unfortunately, right after a suspend some of the processes inside the cgroups might already be dead, so this error shows up in the logs: "echo: write error: No such process".
Also, if new processes were spawned in the meantime, they are not placed into the corresponding cgroup.
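The "No such process" noise could presumably be reduced by skipping PIDs that no longer exist before writing them back (a hypothetical tweak, not part of the script above, and still racy):

    [ -f $cspath/tasks ] && while read pid; do
        # skip tasks that died across the suspend/resume cycle
        [ -d /proc/$pid ] && echo $pid > /$cspath/tasks
    done < $cspath/tasks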

Comment 9 Michal Schmidt 2012-05-04 11:20:09 UTC
(In reply to comment #8)
> Unfortunately, right after a suspend some of the processes inside the cgroups
> might already be dead, so this error shows up in the logs: "echo: write error:
> No such process".
> Also, if new processes were spawned in the meantime, they are not placed into
> the corresponding cgroup.

Which demonstrates the futility of trying to work around kernel bugs in userspace.