Bug 813228 - Add workaround for kernel breaking cpuset cgroup on hibernate/suspend
Summary: Add workaround for kernel breaking cpuset cgroup on hibernate/suspend
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 714271 F17-accepted, F17FinalFreezeExcept
 
Reported: 2012-04-17 09:28 UTC by Daniel Berrangé
Modified: 2012-05-04 11:20 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-04 11:20:09 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
pm-utils hook to preserve cpuset affinity across suspend/hibernate (2.28 KB, text/plain)
2012-04-17 09:30 UTC, Daniel Berrangé
simplified script (1.42 KB, text/plain)
2012-04-17 13:36 UTC, Michal Schmidt

Description Daniel Berrangé 2012-04-17 09:28:37 UTC
Description of problem:
Due to the following kernel bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=714271

when a host goes into the suspend/hibernate state, the kernel clears out all CPUs from the cpuset cgroup. This means that upon resume, all processes in any cpuset cgroup are pinned to just one physical CPU core.

Since this kernel bug has existed for so long & the kernel devs show no signs of being able to fix it in the foreseeable future, we really need a temporary userspace workaround for the brokenness.

Using a pm-utils hook, we can save the current cgroup cpuset affinity before suspend/hibernate & restore it upon resume.

We really need this for libvirt, but the fix we came up with is general purpose to apply to any usage of cgroups cpuset. Thus I think the workaround ought to be done by systemd, or perhaps the libcgroup RPM.
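To illustrate the approach (this is a sketch of the idea, not the exact attachment; the file locations and function names are made up), a pm-utils hook would save each cgroup's cpuset.cpus before sleep and write the values back on resume:

```shell
#!/bin/sh
# Hypothetical sketch of the pm-utils hook approach: dump every
# cpuset cgroup's cpuset.cpus before suspend, restore it afterwards.

CGROOT=/sys/fs/cgroup/cpuset
STATE=/var/run/saved_cpusets.txt

save_cpusets() {
    : > "$STATE"
    # find emits parents before children, so on restore the parent's
    # cpuset.cpus is widened before its children are written
    find "$CGROOT" -mindepth 1 -type d | while read -r d; do
        printf '%s %s\n' "$d/cpuset.cpus" "$(cat "$d/cpuset.cpus")" >> "$STATE"
    done
}

restore_cpusets() {
    while read -r path value; do
        echo "$value" > "$path"
    done < "$STATE"
}

case "$1" in
    hibernate|suspend) save_cpusets ;;
    thaw|resume)       restore_cpusets ;;
esac
```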

Version-Release number of selected component (if applicable):
systemd-44-4.fc17

How reproducible:
Always

Steps to Reproduce:
1. cd /sys/fs/cgroup/cpuset
2. mkdir foo
3. cd foo
4. echo "0-1" > cpuset.cpus
5. pm-suspend
6. cat cpuset.cpus
  
Actual results:
0

Expected results:
0-1

Additional info:

Comment 1 Daniel Berrangé 2012-04-17 09:30:46 UTC
Created attachment 577965 [details]
pm-utils hook to preserve cpuset affinity across suspend/hibernate

The attached patch was developed by Srivatsa S. Bhat as part of this libvirt discussion of the problem:

https://www.redhat.com/archives/libvir-list/2012-April/msg00777.html

Comment 2 Harald Hoyer 2012-04-17 13:09:46 UTC
oh 
+-^

	while read line
	do
		cpuset_path=`echo $line | cut -d' ' -f1`
		value=`echo $line | cut -d' ' -f2`
		echo "$value" > $cpuset_path
	done < saved_cpusets.txt

Comment 3 Harald Hoyer 2012-04-17 13:11:20 UTC
(In reply to comment #2)
> oh 
> +-^
> 
>  while read line
>  do
>   cpuset_path=`echo $line | cut -d' ' -f1`
>   value=`echo $line | cut -d' ' -f2`
>   echo "$value" > $cpuset_path
>  done < saved_cpusets.txt

This should really be:

 while read cpuset_path value
 do
  echo "$value" > $cpuset_path
 done < saved_cpusets.txt


Someone does not really know shell...
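For comparison, a minimal sketch (the path is made up) showing that `read` itself splits each line on whitespace, so both fields come out in one pass with no `cut` subprocesses:

```shell
#!/bin/sh
# read splits its input line on IFS (whitespace by default), so the
# path and the value can be captured directly into two variables.
printf '/sys/fs/cgroup/cpuset/foo/cpuset.cpus 0-1\n' |
while read -r cpuset_path value; do
    echo "path=$cpuset_path value=$value"
done
```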

Comment 4 Michal Schmidt 2012-04-17 13:36:36 UTC
Created attachment 578035 [details]
simplified script

I noticed that too. Here's an improved version of the script.
I'm expressing no opinion yet on adding it to a package.

Comment 5 Matthias Hensler 2012-04-24 13:59:31 UTC
To make the script in attachment 578035 [details] actually work, you have to change the -mindepth argument in the find command from "2" to "1".

Otherwise restoring cpusets will not work, because when run with a depth of 2 the saved list looks like this:

/sys/fs/cgroup/cpuset/libvirt/lxc/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-3


After a suspend/resume cycle you cannot restore the cpusets from the above configuration, as /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus is still set to "0" and you will get a permission denied error. Changing the depth to "1", however, produces the following list:

/sys/fs/cgroup/cpuset/libvirt/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/lxc/cpuset.cpus 0-3
/sys/fs/cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-3

That configuration can be restored on resume without a problem.

After placing the modified script with depth 1 into /etc/pm/sleep.d/01cpusets.sh, libvirtd and all machines work as expected after a suspend/resume cycle.
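The listing step being discussed can be sketched like this (the function name list_cpusets is made up; the attachment's script differs in detail). With -mindepth 1, find emits every level below the root in pre-order, so each parent's cpuset.cpus appears in the saved list before its children and can be widened first on restore:

```shell
#!/bin/sh
# Sketch: list every cpuset below the given root, parent before child.
# -mindepth 1 includes intermediate directories such as .../libvirt,
# which -mindepth 2 would skip, breaking the restore order.
list_cpusets() {
    find -L "$1" -mindepth 1 -type d | while read -r d; do
        printf '%s %s\n' "$d/cpuset.cpus" "$(cat "$d/cpuset.cpus")"
    done
}
```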

Comment 6 Cole Robinson 2012-04-25 14:10:29 UTC
Proposing as an f17 nice-to-have

Comment 7 Michal Schmidt 2012-04-25 17:43:33 UTC
I am -1 on NTH for the following reasons:
 - The bug is not at all related to the release criteria.
 - Not many users are affected.
 - Those who really care can place the workaround script on their systems.
 - It can be fixed in a post-release update.
 - Srivatsa S. Bhat is making progress on a proper fix in the kernel:
http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289

Comment 8 Vladimir Yashin 2012-05-03 21:40:52 UTC
During suspend, when cpuset.cpus is flushed, all processes from LXC containers are placed into the sysdefault cgroup and are not restored afterwards.
So effectively, on my system all cgroups are empty except sysdefault.

Here is a small modification to the workaround script:

SAVEDIR=/run/ugly-hack-for-bz813228-saved_cpusets

save_cpusets()
{
    mkdir -p $SAVEDIR
    find -L /sys/fs/cgroup/cpuset -mindepth 1 -type d | while read cspath; do
        mkdir -p $SAVEDIR/$cspath
        cp $cspath/cpuset.cpus $cspath/tasks $SAVEDIR/$cspath/
    done
}

restore_cpusets()
{
    cd $SAVEDIR
    find -L . -type d | while read cspath; do
        [ -f $cspath/cpuset.cpus ] && cp $cspath/cpuset.cpus /$cspath/cpuset.cpus
        [ -f $cspath/tasks ] && while read pid; do echo $pid > /$cspath/tasks; done < $cspath/tasks
    done
    rm -rf $SAVEDIR
}

Unfortunately, right after suspend some of the processes inside cgroups might be dead, so this error shows up in the logs: "echo: write error: No such process"
Also, if new processes were spawned, they are not placed into the corresponding cgroup.
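One hedged way to silence the dead-process errors (assumption: losing pids that exited during suspend is acceptable, since they cannot be re-attached anyway; the function name restore_tasks is made up):

```shell
#!/bin/sh
# Write each saved pid back into a cgroup's tasks file, ignoring
# failures: a pid that died during suspend makes the write fail
# with "No such process", which is harmless here.
restore_tasks() {
    saved=$1 tasks=$2
    while read -r pid; do
        echo "$pid" > "$tasks" 2>/dev/null || :
    done < "$saved"
}
```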

Comment 9 Michal Schmidt 2012-05-04 11:20:09 UTC
(In reply to comment #8)
> Unfortunately right after suspend some of processes inside cgroups might be
> dead, so this error is shown up in logs: "echo: write error: No such process"
> Also if new processes were spawned they are not placed into corresponding
> cgroup.

Which demonstrates the futility of trying to work around kernel bugs in userspace.

