Bug 1117004

Summary: Use of "pgrep -F" in the bash SDK is unreliable
Product: OpenShift Container Platform Reporter: Brenton Leanhardt <bleanhar>
Component: Containers Assignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED ERRATA QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.1.0 CC: adellape, agrimm, anli, cryan, gpei, jhonce, jialiu, jokerman, libra-bugs, libra-onpremise-devel, lmeyer, mmasters, mmccomas
Target Milestone: --- Keywords: Upstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rubygem-openshift-origin-node-1.23.9.12-1 Doc Type: Bug Fix
Doc Text:
Often when a cartridge starts a runtime in a gear, the cartridge stores the pid of the runtime's process in a pidfile. Later, the cartridge may use the process_running function to determine whether that process is still running in the gear by checking whether any running process has a pid matching the pid saved in the pidfile. However, if the runtime's process had terminated and the operating system had subsequently assigned the same pid to a new process, the process_running function could return a false positive, interfering with cartridge control actions. This bug fix updates the process_running function to use the pgrep command with the -u option to restrict its search to processes belonging to the gear. As a result, the process_running function now has a much lower probability of returning a false positive.
Story Points: ---
Clone Of: 1116135 Environment:
Last Closed: 2014-08-04 13:27:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1116135    
Bug Blocks:    

Description Brenton Leanhardt 2014-07-07 19:30:35 UTC
+++ This bug was initially created as a clone of Bug #1116135 +++

Description of problem:

The process_running function in node/misc/usr/lib/cartridge_sdk/bash/sdk uses "pgrep -F" to determine whether a cartridge's processes are running.  The problem is that with this option, pgrep checks for these PIDs by traversing /proc.  It turns out that if another gear has a process with the pid being checked, pgrep -F will find it.  As a result, if gear A has a stale pidfile containing a pid matching a long-running process belonging to gear B, gear B will effectively prevent gear A from running, unless the owner knows to go remove the stale pid file.
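
The failure mode described above can be illustrated without pgrep. What follows is a minimal sketch, not code from the SDK: the helper names pid_exists and pid_owned_by_me are hypothetical, and it assumes a Linux /proc filesystem and a stat that accepts -c. It contrasts "some process on the host has this PID" with "a process owned by the current user has this PID".

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of the cartridge SDK.

# Succeeds if ANY process on the host has this PID (the owner is not considered).
pid_exists() {
    [ -n "$1" ] && [ -d "/proc/$1" ]
}

# Succeeds only if the PID exists AND its /proc entry is owned by the current user.
pid_owned_by_me() {
    local owner
    [ -n "$1" ] || return 1
    owner=$(stat -c %u "/proc/$1" 2>/dev/null) || return 1
    [ "$owner" = "$(id -u)" ]
}
```

With a stale pidfile whose PID has since been reused by a process in another gear, a check along the lines of pid_exists would succeed while pid_owned_by_me would fail.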

Version-Release number of selected component (if applicable):

rubygem-openshift-origin-node-1.26.8-1.el6oso.noarch

How reproducible:

Easily

Steps to Reproduce:
1. rhc app create bztest nodejs-0.10 postgresql-9.2
2. rhc app stop
3. rhc ssh bztest
4. look in /proc for a process belonging to another gear (referred to below as $PID)
5. echo $PID > postgresql/pid/postgres.pid
6. gear start
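
Step 4 can be scripted. The sketch below is one possible way to do it, assuming a Linux /proc and a stat that accepts -c; the function name find_foreign_pid is hypothetical. It reports the first PID owned by any other UID, which may be a system process rather than another gear's process, so the result should be checked by hand before it is written to the pidfile.

```shell
#!/bin/bash
# Hypothetical helper: print the first PID under /proc that is not owned
# by the current user. Returns non-zero if no such PID is visible.
find_foreign_pid() {
    local me dir owner
    me=$(id -u)
    for dir in /proc/[0-9]*; do
        # The process may exit between the glob and the stat; skip it if so.
        owner=$(stat -c %u "$dir" 2>/dev/null) || continue
        if [ "$owner" != "$me" ]; then
            echo "${dir#/proc/}"
            return 0
        fi
    done
    return 1
}

# Example use inside the gear shell:
#   PID=$(find_foreign_pid) && echo "$PID" > postgresql/pid/postgres.pid
```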

Actual results:

start will fail for postgres with code 70

Expected results:

start should succeed

--- Additional comment from Jhon Honce on 2014-07-07 13:04:34 EDT ---

Fixed in https://github.com/openshift/origin-server/pull/5575

--- Additional comment from openshift-github-bot on 2014-07-07 13:54:29 EDT ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/16eb8a6e98def5a8c830757ad9fa9c0a3a3b4afe
Bug 1116135 - Add -u to bash sdk pgrep calls

* Since gears can "see" another gears pid files in the /proc filesystem,
  a stale pid file could block a cartridge from starting via the check
  in sdk#process_running()
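
The change named in the commit title can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: the two function names are hypothetical, and the exact pgrep invocation in the SDK may differ. It assumes the procps pgrep, where -F reads a PID from a file and -u restricts matches to the given effective user ID.

```shell
#!/bin/bash
# Hypothetical sketch; not the actual cartridge SDK source.

# Behaviour as described in the bug: the pidfile's PID matches a process
# owned by any user on the host.
process_running_any_user() {
    local pidfile=$1
    [ -f "$pidfile" ] && pgrep -F "$pidfile" >/dev/null 2>&1
}

# Behaviour as described in the fix: the PID must also belong to a process
# whose effective UID is the current user's (the gear's) UID.
process_running_own_user() {
    local pidfile=$1
    [ -f "$pidfile" ] && pgrep -F "$pidfile" -u "$(id -u)" >/dev/null 2>&1
}
```

pgrep combines its selection options with a logical AND, so in the second form a PID that has been reused by another user's process no longer counts as running.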

Comment 1 Miciah Dashiel Butler Masters 2014-07-11 22:26:14 UTC
PR: https://github.com/openshift/enterprise-server/pull/320

Comment 5 Anping Li 2014-07-21 09:28:18 UTC
Verified; passes in puddle-2-1-2014-07-18.

The bug can be reproduced in puddle-2014-05-29.3:
[bztest-hanli1dom.example.com 53ccdba3d42d02f3a70d3f50]\> gear start
Starting gear...
Could not start Postgres
An error occurred executing 'gear start' (exit code: 70)
Error message: CLIENT_ERROR: Failed to execute: 'control start' for /var/lib/openshift/53ccdba3d42d02f3a70d3f50/postgresql


After executing the same steps in puddle-2-1-2014-07-18, no error was reported and the app started:
[bztest-hanli1dom.example.com 53ccda324cfeff7254000015]\> echo 20400 >> postgresql/pid/postgres.pid
[bztest-hanli1dom.example.com 53ccda324cfeff7254000015]\> gear start
Starting gear...
Starting Postgres cartridge
Postgres started
Starting NodeJS cartridge
Mon Jul 21 2014 05:27:17 GMT-0400 (EDT): Starting application 'bztest' ...

Comment 7 errata-xmlrpc 2014-08-04 13:27:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html