Created attachment 604666 [details]
broker log

Description of problem:
The operations team runs a nagios check of an automated app creation. We find that this is failing about 2-3 times a day, and we are currently unaware of any reason why. We have gone through the logs and have come to a dead end. The broker logs appear normal as they dispatch the correct configure call. The mcollective client receives and performs the configure call, but we don't see any output for its return result. The broker log generally says that it failed to execute the build on the nodes due to a node failure, and then states that it could not parse the results. We need another set of eyes on this to figure out why this app create fails.

Version-Release number of selected component (if applicable):
Current. This has been happening for a few months now.

How reproducible:
This happens randomly, but it does occur 2-3 times a day in production. We have an automated script that does an application creation.

Steps to Reproduce:
1.
2.
3.

Actual results:
Fails to create the application.

Expected results:
Should create the application.

Additional info:
Attaching some logs.
Created attachment 604667 [details]
client log

Attached are the broker and client logs.
From the logs, it looks like the operation on the node end is hanging and then timing out, which is why no result is sent back. The timings seem to indicate a timeout of 180 seconds (as set in the broker production config).

So rchopra and I traced through the code, and there's really not enough info to find out where it's hanging -- the suspicion points to the unix_user_observer.rb code: after_unix_user_create and, on a similar note, before_unix_user_destroy(user) when destroying the unix user. The two calls without timeouts are specifically the ones that do a:

  service libra-cgroups startuser $uuid  (or: service libra-cgroups stopuser $uuid)
  service libra-tc startuser $uuid       (or: service libra-tc stopuser $uuid)

So for now, I am going to add a timeout of 60 seconds to each of those calls. 60 feels high, but if it's hanging it will at least tell us that it took that long. I am also going to add some log messages to the code.

From the ops end, if we do happen to catch the failure, could you validate that hypothesis by checking the node, or even trying out/timing one of the startuser ops on an existing uuid? Thanks.
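This is not the actual patch -- just a minimal sketch, assuming coreutils `timeout` is available on the node, of how capping and timing those service calls might look (the wrapper name and uuid are illustrative):

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a hard 60-second cap and log
# how long it actually took, so a hang shows up in the node logs.
run_capped() {
    start=$(date +%s)
    timeout 60 "$@"
    rc=$?
    end=$(date +%s)
    echo "ran: $* rc=$rc elapsed=$((end - start))s"
    return $rc
}

# Example usage (uuid is illustrative):
#   run_capped service libra-cgroups startuser "$uuid"
run_capped true
```

Note that `timeout` returns exit status 124 when the command is killed for exceeding the limit, which makes a hang distinguishable from a normal failure in the logs.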
Bah, update: the shellCmd timeout is nondeterministic -- sometimes it times out and sometimes it runs on, not consistent at all, so looking at alternatives.
*** Bug 847123 has been marked as a duplicate of this bug. ***
Created attachment 605781 [details]
lockfile preventing processes from finishing
The lockfile manages to queue up a whole lot of processes for stickshift-proxy setproxy.
Rajat was a big help today debugging this issue. We arrived at a point where all of our applications were failing to be deleted. We were looking at two specific nodes where the applications were stopped.

We narrowed the search down when we ran ss-app-destroy by hand and looked at the related strace. The strace showed it was blocked and waiting. When we did a ps -aef | grep stickshift-proxy we found 9300 processes. We then looked at the child process of our manual destroy and found it was the following:

  /bin/bash /etc/init.d/stickshift-proxy setproxy 58751 delete 58752 delete 58753 delete 58754 delete 58755 delete

but this process was also blocked waiting on the lock file:

  lockfile /var/lib/stickshift/.stickshift-proxy.d/stickshift-proxy.cfg.lock

Rajat and I think that the mcollective call comes into the node and blocks waiting on this lock file. After 3 minutes, the broker times out and returns a node execution failure with no return statement from the node ("could not parse result"), because the node never finished its cleanup.

We determined that once we removed the lock file manually, processes began to free up and the delete was successful. We have experienced this on more than one node and would like to get this resolved. There are a few solutions that can be debated, but we definitely need a way of cleaning up this lock file, or of using flock or an alternative.
lockfile should not wait indefinitely; one could use a sleeptime/retries combo up to a limit -- lockfile -10 -r 6 (for roughly 60 secs). flock would allow you to do the same. And ensure the lockfile is not "stale" (creation time has a max lifetime of n seconds -- say 300 odd; pick a number).
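The two options above can be sketched as follows -- a hedged example, not the actual stickshift-proxy code; the lock path here is illustrative:

```shell
#!/bin/sh
# Illustrative lock path, not the real stickshift-proxy one.
LOCK=/tmp/stickshift-proxy.cfg.lock

# Option 1: procmail's lockfile -- sleep 10s between tries, retry 6
# times, and force (treat as stale) any lock older than 300 seconds:
#   lockfile -10 -r 6 -l 300 "$LOCK"
# (commented out here since procmail may not be installed)

# Option 2: flock with a 60-second wait instead of blocking forever.
(
    flock -w 60 9 || { echo "could not acquire lock in 60s" >&2; exit 1; }
    echo "lock held; editing config"
    # ... config edits would go here ...
) 9>"$LOCK"
```

With flock, the lock is tied to the open file descriptor, so it is released automatically when the holding process exits -- which avoids the stale-lockfile problem entirely.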
The lock must cover both the configuration file edits and the entire reload of stickshift-proxy in order for it to be stable. We could, though, convert the stickshift-proxy-cfg script to ruby, use a real flock, and implement the same logic that was done for Apache graceful to aggregate reload requests.
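For illustration only -- this is not the code from the pull requests, just a minimal sketch of the reload-coalescing idea under flock, with all paths and names hypothetical:

```shell
#!/bin/sh
# Sketch of reload coalescing: many writers may request a reload, but
# concurrent requests collapse into the one reload already in flight.
# All names here are illustrative, not the actual stickshift-proxy code.
LOCK=/tmp/proxy-reload.lock
PENDING=/tmp/proxy-reload.pending

request_reload() {
    touch "$PENDING"          # mark that a reload is wanted
    (
        flock -n 9 || exit 0  # someone else holds the lock; they will
                              # see our pending flag and reload for us
        while [ -e "$PENDING" ]; do
            rm -f "$PENDING"  # consume all requests made so far at once
            echo "reloading proxy"  # stand-in for the real reload command
        done
    ) 9>"$LOCK"
}

request_reload
```

The point is that N near-simultaneous requests cost one reload, not N, because whoever holds the lock keeps looping until no pending flag remains.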
A quick workaround was done that allows the backlog to clear. https://github.com/openshift/li/pull/286 https://github.com/openshift/crankcase/pull/419
Redid the locks with flock and implemented the reload coalescing that we did for httpd graceful. https://github.com/openshift/li/pull/289 https://github.com/openshift/crankcase/pull/420
Pull requests accepted.
Tested and verified on devenv-stage_233.

Steps:
1. Run automation jobs to create and destroy apps.
2. Check development.log and mcollective-client.log to see if the same problem ever happens again.

No failures occurred for app creation/deletion, and no error messages were found in those log files, so moving this bug to verified.