Bug 965317 - Parallel creation of applications runs into concurrency issues
Parallel creation of applications runs into concurrency issues
Status: CLOSED CURRENTRELEASE
Product: OpenShift Origin
Classification: Red Hat
Component: Containers (Show other bugs)
2.x
Unspecified Unspecified
high Severity high
: ---
: ---
Assigned To: Rob Millner
libra bugs
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-05-20 18:12 EDT by Abhishek Gupta
Modified: 2015-05-14 19:11 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 23:59:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
creates 25 apps in parallel and performs some checking. (851 bytes, application/x-shellscript)
2013-05-24 14:21 EDT, Rob Millner
no flags Details
Destroys the 25 apps. (75 bytes, application/x-shellscript)
2013-05-24 14:21 EDT, Rob Millner
no flags Details

  None (edit)
Description Abhishek Gupta 2013-05-20 18:12:11 EDT
Description of problem:
Trying to create multiple single-gear applications for a user in parallel runs into various issues causing a couple of application creations to fail.

Version-Release number of selected component (if applicable):


How reproducible:
Every second attempt to create the parallel applications runs into one of these failures.


Steps to Reproduce:
1. Create 25 applications in parallel using:
  rhc app create <app-name> php-5.3 --no-git --no-dns --no-keys
2. Wait for completion
3.

Actual results:
At the end of step #1, one or two applications fail to be created.

Expected results:
All applications should be consistently created successfully.

Additional info:


Some of the failures encountered thus far are listed below. These are all individual errors in different individual cases/attempts.

[#<MCollective::RPC::Result:0x00000005cd02d0 @agent="openshift", @action="cartridge_do", @results={:sender=>"ip-10-147-172-204", :statuscode=>1, :statusmsg=>"cartridge_do_action failed 157. Output Shell command '/sbin/runuser -m -s /bin/sh 372532160485125408161792 -c \"exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c0,c501' /bin/sh -c \\\"/var/lib/openshift/372532160485125408161792/php/bin/install --version 5.3\\\"\"' returned an error. rc=1\nCLIENT_MESSAGE: ERROR: The default config file is not a valid config file or is corrupted.\n", :data=>{:time=>nil, :output=>"Shell command '/sbin/runuser -m -s /bin/sh 372532160485125408161792 -c \"exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c0,c501' /bin/sh -c \\\"/var/lib/openshift/372532160485125408161792/php/bin/install --version 5.3\\\"\"' returned an error. rc=1\nCLIENT_MESSAGE: ERROR: The default config file is not a valid config file or is corrupted.\n", :exitcode=>157}}>]


D, [2013-05-20T16:08:43.031070 #1112] DEBUG -- : Shell command '/sbin/runuser -m -s /bin/sh 771883394006838740516864 -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c0,c515' /bin/sh -c \"/var/lib/openshift/771883394006838740516864/php/bin/install --version 5.3\""' ran. rc=1 out=Could not create "/var/lib/openshift/771883394006838740516864//var/lib/openshift/771883394006838740516864//.pearrc"


D, [2013-05-17T18:17:52.727346 #4970] DEBUG -- : oo_spawn buffer(10/) [Fri May 17 18:17:52 2013] [warn] module php5_module is already loaded, skipping
(13)Permission denied: make_sock: could not bind to address 127.13.155.1:8080
no listening sockets available, shutting down
Unable to open logs


D, [2013-05-17T18:17:16.302559 #4970] DEBUG -- : oo_spawn buffer(24/) [Fri May 17 18:17:16 2013] [warn] module php5_module is already loaded, skipping
httpd: bad user name 187250427401952271794176187250427401952271794176
Comment 1 Abhishek Gupta 2013-05-20 18:12:50 EDT
Assigning this bug to Rob as he was helping me debug the issues I ran into.
Comment 2 Rob Millner 2013-05-21 16:07:09 EDT
The second and third issue in the Description are examples of something I've seen in testing:

An instance of OPENSHIFT_PHP_PORT containing "80808080"

Several instances of geardb.json containing a duplicate copy of the JSON data.

An inspection of the underlying code shows nothing which should yield this behaviour, its all basically the following...

File.open(path, File::WRONLY|File::TRUNC|File::CREAT) do |file|
   file.write(...)
end

Grasping at straws, one thought is that there's a bug in File where if it has buffered data to flush at the end it could do so twice.
Comment 3 openshift-github-bot 2013-05-21 17:03:32 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/160ddc7eab6029b7c24c72be88a99ce478edbfdd
Bug 965317 - The mutexes must be created as globals which evaluate ahead of any multithreaded operations.
Comment 4 Rob Millner 2013-05-23 19:50:01 EDT
After some tests and examining the C code, it appears as though the following happens:

File.open(...) do |file|
  file.write(...)
end

Has a chance of the string passed to write being duplicated in io.c.  There's even some code and a comment in io.c pertaining to the issue.


Setting File::SYNC on open doesn't help.  Nor does calling sync, or fsync after the write.

Surprisingly, File::SYNC does not appear to toggle sync on the IO object though it does cause the underlying system open to use O_SYNC.

File.open("foo",  File::RDWR|File::CREAT|File::TRUNC|File::SYNC) do |f| 
  puts "#{f.sync.inspect}"
end

...returns "false".

Setting f.sync=true does appear to help.
Comment 5 Rob Millner 2013-05-23 19:56:10 EDT
This has been observed to affect the following file operations:
1. Creating gear variables.
2. Writing Apache databases in FrontendHttpServer
3. NodeLogger

It is suspected that the problem has also happened during ERB file processing.

I believe that this problem potentially affects any file write orchestrated by the node code that happens with high concurrency or under a load.

This includes:
1. Anything done through mcollective
2. oo-admin-ctl-gears

Because of the pervasive nature of the problem, it is proposed that a module be created to monkey patch File to set sync=true on the resulting IO object in open and new.  This module should be included in openshift-origin-node.
Comment 6 Rob Millner 2013-05-23 19:57:44 EDT
Correction, the module will only be included in mcollective, oo-admin-ctl-gears and any other program that's likely to be affected by this issue.
Comment 7 openshift-github-bot 2013-05-24 03:04:01 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/e5a0493e65acab7e61949da95abb3c2b453452e1
Bug 965317 - Add way to patch File class so all files have sync enabled.
Comment 8 Rob Millner 2013-05-24 14:20:08 EDT
Pull request completed.

Q/E on this ticket may be a bit complicated.  Here's a few things that should help test on devenv.

0. Use a fresh devenv.

1. Set maximum gears to over 25.
oo-admin-ctl-user -l rmillner@redhat.com --setmaxgears=99

2. Change the Apache proxy timeout to 600.
vi /etc/httpd/conf.d/000000_default.conf

...
  ProxyTimeout 600


service httpd restart

3. Ignore "No nodes available" errors.  If that happens, delete your apps and try again.

4. Create 25 php apps in parallel.
  - If there are failures, examine the cause to see if its related to a variable being corrupted.

5. run oo-accept-node
  - should pass if there were no errors in step 4.

6. Make sure that geardb.json is correct.


Steps 4 through 6 are contained in a script I will attach.
Comment 9 Rob Millner 2013-05-24 14:21:06 EDT
Created attachment 752829 [details]
creates 25 apps in parallel and performs some checking.
Comment 10 Rob Millner 2013-05-24 14:21:44 EDT
Created attachment 752830 [details]
Destroys the 25 apps.
Comment 11 Meng Bo 2013-05-27 06:39:13 EDT
Test this case by following Rob's comment on a large node,

After step 4, all the 25 apps are created,
After step 5, oo-accept-node get Passed.
After step 6, all the apps are wrote into geardb.json.

Move bug to verified.

Note You need to log in before you can comment on or make changes to this bug.