Description of problem:
oo-cgroup-reclassify exits without displaying an error after failing to add pid(s) to a cgroup.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.15.11-1.el6oso.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Attempt to manually place a pid back into its cgroup. The memory allocation will fail.
$ sudo bash -c 'echo 20222 >> /cgroup/all/openshift/52602d16e0b8cd3cb50000b9/cgroup.procs'
bash: line 0: echo: write error: Cannot allocate memory
2. Run the equivalent command with oo-cgroup-reclassify. It also fails, but displays no error message.
$ sudo oo-cgroup-reclassify -c 52602d16e0b8cd3cb50000b9
3. Confirm that the pid has not been added to the cgroup.
$ sudo oo-accept-node -v
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 20222 cgroups controller: all
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 20227 cgroups controller: all
FAIL: 52602d16e0b8cd3cb50000b9 has a process missing from cgroups: 27563 cgroups controller: all

Actual results:
Fails without displaying an error.

Expected results:
Should display an error if unsuccessful.

Additional info:
The "write error: Cannot allocate memory" comes from trying to put a zombie process into a cgroup. Zombies can't be classified; they live in the root cgroup. oo-cgroup-reclassify deliberately ignores the error for that reason. The cgroup detection logic in oo-accept-node is intended to ignore zombie processes as well, but that logic had to change due to a really pernicious text-processing bug. It looks like that filter is broken in oo-accept-node.
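The zombie filter described above could be sketched roughly like this (a hypothetical helper for illustration, not oo-accept-node's actual code): feed it the output of `ps -u UID -o pid=,state=` and it keeps only the pids that can legally be written into cgroup.procs.

```ruby
# Given ps output in "PID STATE" form, return the pids a cgroup check
# should care about. Zombies (state "Z", "Z+", etc.) are skipped because
# the kernel keeps them in the root cgroup and they can never be
# classified into a gear's cgroup.
def classifiable_pids(ps_output)
  ps_output.each_line.filter_map do |line|
    pid, state = line.split
    pid.to_i unless state.to_s.start_with?("Z")
  end
end
```

Usage would be something like `classifiable_pids(`ps -u 1001 -o pid=,state=`)`; any pid missing from the cgroup after this filter is a genuine failure rather than a zombie false positive.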
Are those PIDs still around? Could you do the following:

ps -p 20222,20227,27563 -l -L

I can't reproduce the problem where oo-accept-node is showing zombies - we may need to add other process states to the filter. Thanks!

[root@ip-10-181-203-105 ~]# oo-accept-node
PASS
[root@ip-10-181-203-105 ~]# ps -p 26139 -l -L
F S   UID   PID  PPID   LWP  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1 Z  1001 26139 26137 26139  0  80   0 -     0 exit   pts/1    00:00:00 ruby <defunct>
This is another gear that is exhibiting the same issue:

# echo 670 >> /cgroup/all/openshift/5252485d5004469af300007c/cgroup.procs
bash: echo: write error: Cannot allocate memory
# ps -p 670 -l -L
F S   UID   PID  PPID   LWP  C PRI  NI ADDR SZ     WCHAN  TTY       TIME CMD
0 S  4418   670     1   670  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1   671  0  80   0 - 338333 futex_ ?        00:00:01 java
1 S  4418   670     1   672  0  80   0 - 338333 futex_ ?        00:00:02 java
1 S  4418   670     1   673  0  80   0 - 338333 futex_ ?        00:00:02 java
1 S  4418   670     1   674  0  80   0 - 338333 futex_ ?        00:00:28 java
1 S  4418   670     1   675  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1   676  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1   680  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1   681  0  80   0 - 338333 futex_ ?        00:00:11 java
1 S  4418   670     1   682  0  80   0 - 338333 futex_ ?        00:00:11 java
1 S  4418   670     1   683  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1   685  0  80   0 - 338333 futex_ ?        00:09:24 java
1 S  4418   670     1   774  0  80   0 - 338333 futex_ ?        00:00:00 java
1 S  4418   670     1  1098  0  80   0 - 338333 futex_ ?        00:00:19 java
1 S  4418   670     1  1100  0  80   0 - 338333 futex_ ?        00:00:19 java
1 S  4418   670     1 30535  0  80   0 - 338333 futex_ ?        00:00:00 java
I'm debugging an existing problem that exhibits this behavior. From mwoodson's comment above:

User has 1 process out of cgroups:
UUID                     PPID PID
5252485d5004469af300007c 1    670

Here is the proc:
4418 670 1 1 Oct17 ? 00:12:40 /usr/bin/java -Djava.util.logging.config.file=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.prefs.userRoot=/var/lib/openshift/5252485d5004469af300007c/app-root/data/ -Djava.util.prefs.userRoot=/var/lib/openshift/5252485d5004469af300007c/app-root/data//prefs -Djava.endorsed.dirs=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/endorsed -classpath /var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/bin/bootstrap.jar:/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat -Dcatalina.home=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat -Djava.io.tmpdir=/var/lib/openshift/5252485d5004469af300007c/app-root/data/tomcat/temp org.apache.catalina.startup.Bootstrap start

Notice this process is not defunct, but the gear is out of memory, and placing the process back into cgroups would put the application above its memory limit.
# cat /cgroup/all/openshift/5252485d5004469af300007c/memory.memsw.failcnt
2708490834
# cat /cgroup/all/openshift/5252485d5004469af300007c/memory.memsw.max_usage_in_bytes
641728512
# cat /cgroup/all/openshift/5252485d5004469af300007c/memory.stat
cache 0
rss 15355904
mapped_file 0
pgpgin 7590333
pgpgout 7586584
swap 626372608
inactive_anon 7909376
active_anon 7446528
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 536870912
hierarchical_memsw_limit 641728512
total_cache 0
total_rss 15355904
total_mapped_file 0
total_pgpgin 7590333
total_pgpgout 7586584
total_swap 626372608
total_inactive_anon 7909376
total_active_anon 7446528
total_inactive_file 0
total_active_file 0
total_unevictable 0

There are a few possible solutions:

1. Somehow fix cgroups to not allow processes to escape their group.
2. Restart the gear (very cruel, but it does the job). The downside is that it causes downtime for the gear.
3. Ignore the problem. This isn't ideal, but this gear really isn't causing an issue other than taking more memory than it should. The process does need to be constrained, though: it could be catastrophic if it consumed all the memory on the node. Likelihood: not very likely.
4. Raise the memory limit by enough to cover the missing cgroup proc, which would then allow us to insert the process into its proper group. The good news is there is no downtime, and oo-accept-node would be happy. The bad news is the user has a somewhat increased memory limit for the running time of the app. This isn't ideal, but it saves restarting the application, which causes downtime. If we then resize the limit back to its initial value, we could cause an OOM kill event.
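Option 4 could be scripted roughly as below (a sketch with hypothetical helper names; the paths match the "all" hierarchy used above). The key guard is never writing a limit lower than current usage, which is exactly how shrinking back could trigger the OOM kill event mentioned in option 4.

```ruby
CG = "/cgroup/all/openshift"

# Never set the limit below what the gear is currently using, or writing
# the limit back can trigger an OOM kill of the gear.
def safe_limit(desired_bytes, usage_bytes)
  [desired_bytes, usage_bytes].max
end

# Hypothetical workaround helper: temporarily raise the memsw limit so the
# escaped pid fits, reclassify, then shrink back only as far as is safe.
def bump_and_reclassify(uuid, extra_bytes)
  limit_file = "#{CG}/#{uuid}/memory.memsw.limit_in_bytes"
  old_limit  = File.read(limit_file).to_i
  File.write(limit_file, old_limit + extra_bytes)   # make room for the pid
  system("oo-cgroup-reclassify", "-c", uuid)        # re-add escaped processes
  usage = File.read("#{CG}/#{uuid}/memory.memsw.usage_in_bytes").to_i
  File.write(limit_file, safe_limit(old_limit, usage))
end
```

With the numbers above, shrinking back to the original 641728512-byte memsw limit would be safe only once usage drops below it; until then the gear keeps the padded limit.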
/usr/bin/oo-cgroup-reclassify -c 5252485d5004469af300007c -d

Added a debug rescue here to see the same error message:

File.open(File.join(path, "tasks"), File::WRONLY | File::SYNC) do |t|
  procs.each do |pid|
    begin
      t.syswrite "#{pid}\n"
    rescue Exception => e
      puts e.message
    end
  end
end

Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Cannot allocate memory - /cgroup/all/openshift/5252485d5004469af300007c/tasks
No such process - /cgroup/all/openshift/5252485d5004469af300007c/tasks
Logs were gathered for 5252485d5004469af300007c, pid 670 above. A few things stand out:

0. It's a DIY app that runs Tomcat.
1. The process date is Oct 17th; however, the app was idled and restored several times since then. The process should have been shut down on idle.
2. Idle and restore events have errors from CGRE (likely cgrulesengd) related to not being able to put processes into the cgroup.
3. The footprint of the java process is most of the gear; it's mostly swapped out.
4. The process is in sleep, and all threads are on a futex call.

Here's the hypothesis: the process is not being terminated because it's stuck on a futex call. Our various normal entry points for classifying processes into cgroups (pam, cgrulesengd) are emitting error messages but are still allowing processes to be started. I'm going to see if I can reproduce this failure on a devenv so we can start exploring options. Try a SIGKILL on the process.
Actually, that theory only works if process 670 were in the cgroup, not outside of it. The entry point into cgroups for most gear processes should be pam_cgroups. I'll have a look at that to see what happens if the cgroup is already full. We need to know more about the gear. Are there other processes, any of them similarly stuck?
It appears as though this gear is doing something to wedge its processes such that the idler cannot stop them. The restorer is starting new processes and the gear eventually fills up. Any cartridge (e.g. a downloadable cart we have no control over) can break the idler in the same way by not properly handling stop.

So, there are three underlying bugs:

1. Launching gear processes is a soft failure when the gear is full.
2. The idler/restorer is subject to failing to idle/restore due to idiosyncrasies in the application or cartridge.
3. There is no notification of why processes could not be placed in their cgroups.

The fix for #1 is to make pam_cgroup no longer optional and propagate this failure back out to the node code (e.g. gear restore fails). This requires testing before we can ship it. The fix for #2 sounds simple, but it's actually complicated: if we SIGKILL the gear on gear stop, then we risk smashing cron and ssh login processes. This needs more research to see what is doable. The fix for #3 is straightforward, but I'm hesitant to go further and SIGKILL processes in the cgroups code.

I'm moving this bug off the release blocker list due to #1 and #2 being fairly high risk. If you run across a gear in this state, the following steps are recommended as a workaround:

1. Stop the gear.
2. SIGKILL all remaining processes owned by the gear. (We're living with the possible side effects.)
3. Start the gear.
Pull request to have idle kill off the remainder of the gear processes that are not part of an API command, cron, or ssh session. https://github.com/openshift/origin-server/pull/3977
The above pull request has been modified to include the "forcestopgear" and "forcestopall" options to oo-admin-ctl-gears, which will completely stop a gear. Note: these calls will terminate ssh and API calls in the gear as well.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a771d1efbd08d73a499a7f374a90dfc60bbe3a50
Bug 1020555 - Add better process termination control to kill_procs and include it as the last step in idle.
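The "better process termination control" in that commit presumably follows the usual TERM-then-KILL escalation; a minimal sketch of the idea (not the actual kill_procs code from the commit):

```ruby
# Signal 0 probes for existence without delivering a signal. Note that an
# unreaped zombie still answers this probe.
def alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false          # no such process
rescue Errno::EPERM
  true           # exists, but owned by another user
end

# Ask each process to exit with SIGTERM, wait up to `grace` seconds for
# them to go away, then SIGKILL anything still running.
def kill_procs(pids, grace: 5)
  pids.each do |pid|
    begin
      Process.kill("TERM", pid)
    rescue Errno::ESRCH      # already gone
    end
  end
  deadline = Time.now + grace
  sleep 0.1 while Time.now < deadline && pids.any? { |p| alive?(p) }
  pids.each do |pid|
    begin
      Process.kill("KILL", pid) if alive?(pid)
    rescue Errno::ESRCH
    end
  end
end
```

The grace period is what lets well-behaved gear processes shut down cleanly while still guaranteeing that wedged ones (like the futex-stuck java above) are removed.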
Reviewed the pam_cgroup code. It will return failure in the following situations:

1. Cgroups cannot be initialized.
2. /etc/cgrules.conf is corrupt.
3. The process cannot place itself into the cgroup.

If a user is not in cgrules.conf, then there's no attempt to classify the process and success is returned. We want failure #3 for gear processes, but it's more important that failures #1 and #2 never prevent root logins on the node. I don't think we should switch pam_cgroup from optional to required in the pam configuration files. Looking for another option.
Updated the pull request to report back when a non-zombie process cannot be classified.

# oo-cgroup-reclassify -c 52698f75905a6bf4c5000011
Errors classifying processes
5139: The cgroup is full
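The new output amounts to translating the raw write errors seen in the earlier debug run into readable causes. A hedged sketch of that mapping (a hypothetical helper, not the pull request's exact code):

```ruby
# Map the exceptions raised when writing a pid into cgroup.procs/tasks to
# a human-readable cause. ENOMEM means the cgroup's memory limit leaves no
# room for the process ("the cgroup is full"); ESRCH means the pid exited
# (or is a zombie) before the write and should be silently skipped.
def classify_failure(error)
  case error
  when Errno::ENOMEM then "The cgroup is full"
  when Errno::ESRCH  then nil
  else error.message
  end
end
```

Collecting the non-nil results per pid and printing them under an "Errors classifying processes" header would produce output like the example above.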
Experiments with pam_cgroup show that setting it to required doesn't seem to make a difference. It's unclear why; a reading of the source shows that it should be returning failure. I tested scrambling /etc/cgrules.conf (overwriting it with 1MB of /dev/urandom), and while that did result in processes not being classified, it did not stop ssh login or starting a gear. The memory limits of a gear were then set to zero to simulate the original failure: gear processes could still start, though they would still fail to be placed in the cgroup. I guess the good news is that setting it to required seems to carry no risk of breaking root logins; the bad news is that it doesn't do any good.
Trello card filed to continue exploring the solution. https://trello.com/c/82ZHdxls/344-r-d-enter-cgroup-or-die-trying

For Q/E, double-check that "oo-cgroup-reclassify" is doing its thing with the following steps:

1. Create an application.
2. Force stop the gear:
# oo-admin-ctl-gears forcestopgear 5278228a878c9c9c30000001
3. Verify there are no more processes running for the gear:
# ps -u 1000
4. Force the gear memory to empty:
# echo 1 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.force_empty
5. Wait until the cgroup is no longer using memory:
# cat /cgroup/all/openshift/5278228a878c9c9c30000001/memory.{memsw.,}usage_in_bytes
0
0
6. Set the gear limits to zero:
# echo 0 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.limit_in_bytes
# echo 0 > /cgroup/all/openshift/5278228a878c9c9c30000001/memory.memsw.limit_in_bytes
7. Start the gear:
# oo-admin-ctl-gears startgear 5278228a878c9c9c30000001
8. Attempt to classify gear processes:
# oo-cgroup-reclassify -c 5278228a878c9c9c30000001
Errors classifying processes
12486: The cgroup is full

Steps 1 through 7 should have no failures. Step 8 should return an error.
Tested by following the steps in Comment 16 on devenv_3986; the results are as expected, so this is fixed. Marking as VERIFIED. Please refer to the following results.

At step 8, it returns an error as below:

# oo-cgroup-reclassify -c 52788df77cc10deb69000099
Errors classifying processes
9972: The cgroup is full
10100: The cgroup is full
10163: The cgroup is full
10165: The cgroup is full
10166: The cgroup is full
10167: The cgroup is full