Bug 1020555 - oo-cgroup-reclassify fails silently
Summary: oo-cgroup-reclassify fails silently
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 1.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: mfisher
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-10-17 22:18 UTC by Stefanie Forrester
Modified: 2014-01-24 03:25 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-24 03:25:13 UTC
Target Upstream Version:



Description Stefanie Forrester 2013-10-17 22:18:21 UTC
Description of problem:
oo-cgroup-reclassify exits without displaying an error, after failing to add pid(s) to cgroup.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.15.11-1.el6oso.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Attempt to manually place pid back into cgroup. Memory allocation will fail.

$ sudo bash -c 'echo 20222 >> /cgroup/all/openshift/52602d16e0b8cd3cb50000b9/cgroup.procs'
bash: line 0: echo: write error: Cannot allocate memory

2. Run the equivalent command with oo-cgroup-reclassify. It also fails, but doesn't display an error message.

$ sudo oo-cgroup-reclassify -c 52602d16e0b8cd3cb50000b9

3. Confirm that the pid has not been added to cgroup.

$ sudo oo-accept-node -v
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 20222 cgroups controller: all
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 20227 cgroups controller: all
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 27563 cgroups controller: all


Actual results:
Fails without displaying an error.

Expected results:
Should report an error if unsuccessful.

Additional info:

Comment 1 Rob Millner 2013-10-18 00:32:50 UTC
The "write error: Cannot allocate memory" comes from trying to put a zombie process into a cgroup.  Zombies can't be classified; they live in the root cgroup.

oo-cgroup-reclassify deliberately ignores the error for that reason.


The cgroup detection logic in oo-accept-node is intended to ignore zombie processes as well, but that logic had to change due to a really pernicious text-processing bug.  It looks like that filter is broken in oo-accept-node.
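A zombie filter amounts to reading the state character from /proc/&lt;pid&gt;/stat (the field after the parenthesized command name). This is a hypothetical sketch, not the actual oo-accept-node code; the `procfs:` parameter exists only to make it testable:

```ruby
# Hypothetical zombie filter; not the actual oo-accept-node code.
# /proc/<pid>/stat looks like: "26139 (ruby) Z 26137 ..." -- the comm field
# may itself contain spaces or parentheses, so split on the LAST ')'.
def zombie?(pid, procfs: "/proc")
  stat = File.read(File.join(procfs, pid.to_s, "stat"))
  state = stat.rpartition(")").last.split.first
  state == "Z"
rescue Errno::ENOENT, Errno::ESRCH
  true # process already gone; nothing to classify either way
end
```

The text-processing trap mentioned above is exactly why this splits on the last ')' rather than on whitespace: a command name containing spaces or parentheses would otherwise shift the state field.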

Comment 2 Rob Millner 2013-10-18 00:39:33 UTC
Are those PIDs still around?  Could you do the following:

ps -p 20222,20227,27563 -l -L


I can't reproduce the problem where oo-accept-node is showing zombies - we may need to add other process states to the filter.  Thanks!

[root@ip-10-181-203-105 ~]# oo-accept-node
PASS
[root@ip-10-181-203-105 ~]# ps -p 26139 -l -L
F S   UID   PID  PPID   LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1 Z  1001 26139 26137 26139  0  80   0 -     0 exit   pts/1    00:00:00 ruby <defunct>

Comment 3 Matt Woodson 2013-10-18 15:46:36 UTC
This is another gear that is exhibiting the same issue:

# echo 670 >> /cgroup/all/openshift/5252485d5004469af300007c/cgroup.procs 
bash: echo: write error: Cannot allocate memory

# ps -p 670 -l -L                                                                                                   
F S   UID   PID  PPID   LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  4418   670     1   670  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1   671  0  80   0 - 338333 futex_ ?       00:00:01 java
1 S  4418   670     1   672  0  80   0 - 338333 futex_ ?       00:00:02 java
1 S  4418   670     1   673  0  80   0 - 338333 futex_ ?       00:00:02 java
1 S  4418   670     1   674  0  80   0 - 338333 futex_ ?       00:00:28 java
1 S  4418   670     1   675  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1   676  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1   680  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1   681  0  80   0 - 338333 futex_ ?       00:00:11 java
1 S  4418   670     1   682  0  80   0 - 338333 futex_ ?       00:00:11 java
1 S  4418   670     1   683  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1   685  0  80   0 - 338333 futex_ ?       00:09:24 java
1 S  4418   670     1   774  0  80   0 - 338333 futex_ ?       00:00:00 java
1 S  4418   670     1  1098  0  80   0 - 338333 futex_ ?       00:00:19 java
1 S  4418   670     1  1100  0  80   0 - 338333 futex_ ?       00:00:19 java
1 S  4418   670     1 30535  0  80   0 - 338333 futex_ ?       00:00:00 java

Comment 4 Kenny Woodson 2013-10-18 16:13:00 UTC
I'm debugging an existing problem that exhibits this behavior, from mwoodson's comment above.  The user has 1 process out of cgroups:
UUID                     PPID  PID
5252485d5004469af300007c 1      670


Here is the proc:
4418       670     1  1 Oct17 ?        00:12:40 /usr/bin/java -Djava.util.logging.config.file=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.prefs.userRoot=/var/lib/openshift/5252485d5004469af300007c/app-root/data/ -Djava.util.prefs.userRoot=/var/lib/openshift/5252485d5004469af300007c/app-root/data//prefs -Djava.endorsed.dirs=/var/lib/openshift/5252485d5004469af300007c/app-root/
data/tomcat/endorsed -classpath /var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/bin/bootstrap.jar:/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat -Dcatalina.home=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat -Djava.io.tmpdir=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/temp org.apache.catalina.startup.Bootstrap start         

Notice this process is not defunct, but the gear is out of memory: placing the process back into the cgroup would put the application above its memory limit.

cat /cgroup/all/openshift/5252485d5004469af300007c/memory.memsw.failcnt 
2708490834

cat /cgroup/all/openshift/5252485d5004469af300007c/memory.memsw.max_usage_in_bytes 
641728512

cat /cgroup/all/openshift/5252485d5004469af300007c/memory.stat
cache 0
rss 15355904
mapped_file 0
pgpgin 7590333
pgpgout 7586584
swap 626372608
inactive_anon 7909376
active_anon 7446528
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 536870912
hierarchical_memsw_limit 641728512
total_cache 0
total_rss 15355904
total_mapped_file 0
total_pgpgin 7590333
total_pgpgout 7586584
total_swap 626372608
total_inactive_anon 7909376
total_active_anon 7446528
total_inactive_file 0
total_active_file 0
total_unevictable 0
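The numbers above line up exactly: total_rss + total_swap from memory.stat equals hierarchical_memsw_limit, so the gear sits right at its memory+swap ceiling, which is why every write into cgroup.procs fails with "Cannot allocate memory". A quick check:

```ruby
# Figures copied from memory.stat and the memsw limit above.
total_rss   = 15_355_904
total_swap  = 626_372_608
memsw_limit = 641_728_512  # hierarchical_memsw_limit

usage = total_rss + total_swap
usage == memsw_limit  # => true: the gear is exactly at its memory+swap limit
```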


There are a few possible solutions.
1. Somehow fix cgroups to not allow processes to escape their group.
2. Restart the gear (crude, but it does the job).  Downside: it causes downtime for the gear.
3. Ignore the problem.  This isn't ideal, but this gear isn't causing an issue beyond taking more memory than it should.  That still needs to be constrained, though; it could be catastrophic if a process consumed all the memory.  Likelihood: not very likely.
4. Raise the memory limit by enough to cover the missing cgroup proc, which would then allow us to insert the process into its proper group.  The good news is there is no downtime and accept-node would be happy.  The bad news is the user gets a somewhat increased memory limit for the running time of the app.  This isn't ideal, but it saves restarting the application, which causes downtime.  If we then resized the memory back to its initial value, we could trigger an OOM kill event.
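Option 4 is just arithmetic on the cgroup limits. A minimal sketch, with a hypothetical helper name and an assumed slack margin (neither is shipped OpenShift tooling):

```ruby
# Hypothetical helper for option 4: compute a temporarily raised memsw limit
# with enough headroom to admit the stray process, so it can be classified.
SLACK = 16 * 1024 * 1024  # assumed safety margin, 16 MiB

def bumped_limit(current_limit_bytes, stray_rss_bytes)
  current_limit_bytes + stray_rss_bytes + SLACK
end

# For the gear above: bumped_limit(641_728_512, 15_355_904)
```

The new value would be written to memory.memsw.limit_in_bytes before reclassifying, then (carefully) lowered again.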

Comment 5 Kenny Woodson 2013-10-18 16:15:31 UTC
/usr/bin/oo-cgroup-reclassify -c 5252485d5004469af300007c -d

Added debugging here to see the error messages:

File.open(File.join(path, "tasks"), File::WRONLY | File::SYNC) do |t|
  procs.each do |pid|
    begin
      t.syswrite "#{pid}\n"
    rescue Exception => e
      puts e.message
    end
  end
end



Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
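The two alternating errors make sense write-by-write: ENOMEM when the cgroup has no headroom for another task, ESRCH when the tid being written has already exited. A hypothetical sketch of translating the raw exceptions into readable messages (close to the wording Comment 14 later adopts):

```ruby
# Hypothetical mapping from syswrite failures to user-facing messages.
def classify_error(pid, err)
  case err
  when Errno::ENOMEM then "#{pid}: The cgroup is full"
  when Errno::ESRCH  then "#{pid}: No such process (already exited or zombie)"
  else                    "#{pid}: #{err.message}"
  end
end
```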

Comment 6 Rob Millner 2013-10-18 20:33:52 UTC
Logs were gathered for 5252485d5004469af300007c, pid 670 above.


A few things stand out:
0. It's a DIY app that runs Tomcat.

1. The process date is Oct 17th; however, the app was idled and restored several times since then.  The process should have been shut down on idle.

2. Idle and restore events have errors from CGRE (likely cgrulesengd) related to not being able to put processes into the cgroup.

3. The footprint of the java process is most of the gear; it's mostly swapped out.

4. The process is in sleep, and all threads are on a futex call.


Here's the hypothesis:

The process is not being terminated because it's stuck on a futex call.  Our various normal entry points for classifying processes into cgroups (pam, cgrulesengd) are emitting error messages but are allowing processes to be started.

I'm going to see if I can reproduce this failure on a devenv so we can start exploring options.

Try a SIGKILL on the process.

Comment 7 Rob Millner 2013-10-18 21:49:56 UTC
Actually, that theory only works if process 670 was in the cgroup, not outside of it.

The entry point into cgroups for most gear processes should be pam_cgroups.  I'll have a look at that to see what happens if the cgroup is already full.

We need to know more about the gear.  Are there other processes, any of them similarly stuck?

Comment 8 Rob Millner 2013-10-21 20:49:26 UTC
It appears as though this gear is doing something to wedge its processes such that the idler cannot stop them.  Restorer is starting new processes and the gear eventually fills up.

Any cartridge (e.g. a downloadable cart we have no control over) can break the idler in the same way by not properly handling stop.

So, there are three underlying bugs:
1. Launching gear processes is a soft failure when the gear is full.
2. Idler/restorer is subject to failing to idle/restore due to idiosyncrasies in the application or cartridge.
3. No notification of why processes could not be placed in their cgroups.

The fix to #1 is to make pam_cgroup no longer optional and propagate the failure back out to the node code (e.g. gear restore fails).  This requires testing before we can ship it.

The fix for #2 sounds simple, but it's actually complicated.  If we SIGKILL the gear on gear stop, then we risk smashing cron and ssh login processes.  This needs more research to see what is doable.

The fix for #3 is straightforward but I'm hesitant to go further and SIGKILL processes in the cgroups code.
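The shape of fix #3 (report, don't kill) can be sketched as: attempt each write, collect the failures, and surface only the non-zombie ones. A hypothetical outline, not the actual pull request:

```ruby
# Hypothetical outline of fix #3: surface classification failures for
# non-zombie processes instead of silently swallowing them.
def reclassify(tasks_path, procs, zombie_pids = [])
  failures = {}
  File.open(tasks_path, File::WRONLY | File::SYNC) do |t|
    procs.each do |pid|
      begin
        t.syswrite("#{pid}\n")
      rescue SystemCallError => e
        failures[pid] = e.message unless zombie_pids.include?(pid)
      end
    end
  end
  unless failures.empty?
    puts "Errors classifying processes"
    failures.each { |pid, msg| puts "    #{pid}: #{msg}" }
  end
  failures
end
```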


I'm moving this bug off the release blocker list due to #1 and #2 being fairly high risk.

If you run across a gear in this state, then the following steps are recommended as a workaround:
1. Stop the gear.
2. Sigkill all remaining processes owned by the gear.
  - We're living with the possible side effects.
3. Start the gear.
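Step 2 of that workaround can be sketched as a /proc scan for the gear's uid followed by SIGKILL. These are hypothetical helpers, not oo-admin-ctl-gears code:

```ruby
# Hypothetical sketch of workaround step 2: SIGKILL every remaining process
# owned by the gear's uid. Not actual OpenShift tooling.
def pids_for_uid(uid, procfs: "/proc")
  Dir.children(procfs).grep(/\A\d+\z/).map(&:to_i).select do |pid|
    begin
      File.stat(File.join(procfs, pid.to_s)).uid == uid
    rescue Errno::ENOENT
      false  # raced with a process exiting
    end
  end
end

def kill_gear_procs(uid)
  pids_for_uid(uid).each do |pid|
    begin
      Process.kill("KILL", pid)
    rescue Errno::ESRCH
      # already gone between the scan and the kill
    end
  end
end
```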

Comment 9 Rob Millner 2013-10-23 22:56:24 UTC
Pull request to have idle kill off the remainder of the gear processes that are not part of an API command, cron, or ssh session.

https://github.com/openshift/origin-server/pull/3977

Comment 10 Rob Millner 2013-10-24 01:19:29 UTC
The above pull request modified to include the "forcestopgear" and "forcestopall" options to oo-admin-ctl-gears which will completely stop a gear.

Note, these calls will terminate ssh and API calls in the gear as well.

Comment 11 openshift-github-bot 2013-10-24 03:26:14 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a771d1efbd08d73a499a7f374a90dfc60bbe3a50
Bug 1020555 - Add better process termination control to kill_procs and include it as the last step in idle.

Comment 13 Rob Millner 2013-10-24 19:19:41 UTC
Reviewed the pam_cgroup code.  It will return failure in the following situations:

1. Cgroups cannot be initialized.
2. /etc/cgrules.conf is corrupt.
3. The process cannot place itself into the cgroup.


If a user is not in cgrules.conf, then there's no attempt to classify the process and success is returned.


We want failure #3 in gear processes, but its more important to never see failures #1 or #2 prevent root logins on the node.  I don't think we should switch pam_cgroup from optional to required in the pam configuration files.

Looking for another option.
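For reference, the optional-vs-required distinction is the control flag on the module line in the PAM stack, e.g. (illustrative fragment; the exact file under /etc/pam.d/ varies by distribution):

```
# Illustrative /etc/pam.d fragment; libcgroup ships pam_cgroup.so.
# "optional": a failure of the module is ignored and the session proceeds.
session    optional    pam_cgroup.so
# "required" would make the session fail on module failure, which is exactly
# the risk to root logins from failure modes #1 and #2 above:
# session    required    pam_cgroup.so
```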

Comment 14 Rob Millner 2013-10-24 21:37:03 UTC
Updated the pull request to report back when non-zombie processes could not be classified.

# oo-cgroup-reclassify  -c 52698f75905a6bf4c5000011
Errors classifying processes
    5139: The cgroup is full

Comment 15 Rob Millner 2013-10-25 19:22:52 UTC
Experiments with pam_cgroup show that setting it to required doesn't seem to make a difference.  It's unclear why; a reading of the source shows that it should be returning failure.

I tested scrambling /etc/cgrules.conf (overwriting it with 1MB of /dev/urandom); it did result in not being able to classify processes, but did not stop ssh logins or starting a gear.

The memory limits of a gear were set to zero to simulate the original failure.  Gear processes could still start, but they would still fail to be placed in the cgroup.

I guess the good news is that setting it to required seems to have no risk of breaking root logins; and the bad news is that it doesn't do any good.

Comment 16 Rob Millner 2013-11-04 23:12:52 UTC
Trello card filed to continue to explore the solution.
https://trello.com/c/82ZHdxls/344-r-d-enter-cgroup-or-die-trying

For Q/E, double-check that "oo-cgroup-reclassify" is doing its thing with the following steps:

1. Create an application

2. Force stop the gear
# oo-admin-ctl-gears forcestopgear  5278228a878c9c9c30000001

3. Verify there are no more processes running for the gear:
# ps -u 1000

4. Force the gear memory to empty
# echo 1 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.force_empty

5. Wait till the cgroup is no longer using memory
# cat /cgroup/all/openshift/5278228a878c9c9c30000001/memory.{memsw.,}usage_in_bytes
0
0

6. Set the gear limits to zero
# echo 0 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.limit_in_bytes
# echo 0 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.memsw.limit_in_bytes

7. Start the gear
# oo-admin-ctl-gears startgear  5278228a878c9c9c30000001

8. Attempt to classify gear processes
# oo-cgroup-reclassify -c 5278228a878c9c9c30000001
Errors classifying processes
    12486: The cgroup is full


Steps 1 through 7 should have no failures.  Step 8 should return an error.

Comment 17 chunchen 2013-11-05 06:52:59 UTC
Tested it by following the steps in Comment 16 on devenv_3986; the results are as expected and it's fixed, so marking it as VERIFIED.

Please refer to the following results.
At step 8, it returns an error as below:
# oo-cgroup-reclassify -c 52788df77cc10deb69000099
Errors classifying processes
    9972: The cgroup is full
    10100: The cgroup is full
    10163: The cgroup is full
    10165: The cgroup is full
    10166: The cgroup is full
    10167: The cgroup is full

