Created attachment 604666 [details]
broker log

Description of problem:
The operations team runs a nagios check of an automated app creation. We find that this is failing about 2-3 times a day, and we are currently unaware of any reason why. We have gone through the logs and have come to a dead end. The broker logs appear normal as they dispatch the correct configure call. The mcollective client receives and performs the configure call, but we don't see any output for its return result. The broker log generally says that it failed to execute the build on the nodes due to a node failure, and then states that it could not parse the results. We need another set of eyes on this to figure out why this app create fails.

Version-Release number of selected component (if applicable):
Current. This has been happening for a few months now.

How reproducible:
This happens randomly, but it does occur 2-3 times a day in production. We have an automated script that does an application creation.

Steps to Reproduce:
1.
2.
3.

Actual results:
Fails to create the application.

Expected results:
Should create the application.

Additional info:
Attaching some logs.
Created attachment 604667 [details]
client log

Attached are the broker and client logs.
From the logs, it looks like the operation on the node end is hanging and then timing out, which is why no result is sent back. The timings seem to indicate a timeout of 180 seconds (as set in the broker production config).

So rchopra and I traced through the code, and there's really not enough info to find out where it's hanging -- the suspicion points to the unix_user_observer.rb code: after_unix_user_create and, on a similar note, before_unix_user_destroy(user) when destroying the unix user. The two calls without timeouts are specifically the ones that do a:

  service libra-cgroups startuser $uuid  (or: service libra-cgroups stopuser $uuid)
  service libra-tc startuser $uuid       (or: service libra-tc stopuser $uuid)

So for now, I am going to add a timeout of 60 seconds to each of those calls. 60 feels high, but if it's hanging it will at least tell us that it took that long. I am also going to add some log messages to the code.

From the ops end, if we do happen to catch the failure, could you validate that hypothesis by checking the node, or even trying out/timing one of the startuser ops on an existing uuid? Thanks.
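This is not the actual patch -- just a minimal sketch, assuming coreutils `timeout` is available on the node, of how capping and timing those service calls might look (the wrapper name and uuid are illustrative):

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a hard 60-second cap and log
# how long it actually took, so a hang shows up in the node logs.
run_capped() {
    start=$(date +%s)
    timeout 60 "$@"
    rc=$?
    end=$(date +%s)
    echo "ran: $* rc=$rc elapsed=$((end - start))s"
    return $rc
}

# Example usage (uuid is illustrative):
#   run_capped service libra-cgroups startuser "$uuid"
run_capped true
```

Note that `timeout` returns exit status 124 when the command is killed for exceeding the limit, which makes a hang distinguishable from a normal failure in the logs.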
Bah, update: the shellCmd timeout is nondeterministic -- sometimes it times out and sometimes it runs on, not consistent at all, so looking at alternatives.
*** Bug 847123 has been marked as a duplicate of this bug. ***
Created attachment 605781 [details]
lockfile preventing processes from finishing
The lockfile manages to queue up a whole lot of processes for stickshift-proxy setproxy.
Rajat was a big help today debugging this issue. We arrived at a point where all of our applications were failing to be deleted. We were looking at two specific nodes where the applications were stopped.

We narrowed the search down when we ran ss-app-destroy by hand and looked at the related strace. The strace showed it was blocked and waiting. When we did a ps -aef | grep stickshift-proxy we found 9300 processes. We then looked at the child process of our manual destroy and found it was the following:

  /bin/bash /etc/init.d/stickshift-proxy setproxy 58751 delete 58752 delete 58753 delete 58754 delete 58755 delete

but this process was also blocked waiting on the lock file:

  lockfile /var/lib/stickshift/.stickshift-proxy.d/stickshift-proxy.cfg.lock

Rajat and I think that the mcollective call comes into the node and blocks waiting on this lock file. After 3 minutes, the broker times out and returns a node execution failure with no return statement from the node ("could not parse result"), because the node never finished its cleanup.

We determined that once we removed the lock file manually, processes began to free up and the delete was successful. We have experienced this on more than one node and would like to get this resolved. There are a few solutions that can be debated, but we definitely need a way of cleaning up this lock file, or of using flock or an alternative.
lockfile should not wait indefinitely; one could use a sleeptime/retries combo up to a limit -- lockfile -10 -r 6 (for roughly 60 secs). flock would allow you to do the same. And ensure the lockfile is not "stale" (creation time has a max lifetime of n seconds -- say 300 odd; pick a number).
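The two options above can be sketched as follows -- a hedged example, not the actual stickshift-proxy code; the lock path here is illustrative:

```shell
#!/bin/sh
# Illustrative lock path, not the real stickshift-proxy one.
LOCK=/tmp/stickshift-proxy.cfg.lock

# Option 1: procmail's lockfile -- sleep 10s between tries, retry 6
# times, and force (treat as stale) any lock older than 300 seconds:
#   lockfile -10 -r 6 -l 300 "$LOCK"
# (commented out here since procmail may not be installed)

# Option 2: flock with a 60-second wait instead of blocking forever.
(
    flock -w 60 9 || { echo "could not acquire lock in 60s" >&2; exit 1; }
    echo "lock held; editing config"
    # ... config edits would go here ...
) 9>"$LOCK"
```

With flock, the lock is tied to the open file descriptor, so it is released automatically when the holding process exits -- which avoids the stale-lockfile problem entirely.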
The lock must cover both the configuration file edits and the entire reload of stickshift-proxy in order for it to be stable. We could, though, convert the stickshift-proxy-cfg script to ruby, use a real flock, and implement the same logic that was done for Apache graceful to aggregate reload requests.
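For illustration only -- this is not the code from the pull requests, just a minimal sketch of the reload-coalescing idea under flock, with all paths and names hypothetical:

```shell
#!/bin/sh
# Sketch of reload coalescing: many writers may request a reload, but
# concurrent requests collapse into the one reload already in flight.
# All names here are illustrative, not the actual stickshift-proxy code.
LOCK=/tmp/proxy-reload.lock
PENDING=/tmp/proxy-reload.pending

request_reload() {
    touch "$PENDING"          # mark that a reload is wanted
    (
        flock -n 9 || exit 0  # someone else holds the lock; they will
                              # see our pending flag and reload for us
        while [ -e "$PENDING" ]; do
            rm -f "$PENDING"  # consume all requests made so far at once
            echo "reloading proxy"  # stand-in for the real reload command
        done
    ) 9>"$LOCK"
}

request_reload
```

The point is that N near-simultaneous requests cost one reload, not N, because whoever holds the lock keeps looping until no pending flag remains.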
A quick workaround was done that allows the backlog to clear. https://github.com/openshift/li/pull/286 https://github.com/openshift/crankcase/pull/419
Redid the locks with flock and implemented the reload coalescing that we did for httpd graceful. https://github.com/openshift/li/pull/289 https://github.com/openshift/crankcase/pull/420
Pull requests accepted.
Tested and verified on devenv-stage_233.

Steps:
1. Run automation jobs to create and destroy apps.
2. Check development.log and mcollective-client.log to see if the same problem ever happens again.

No failures occurred for app creation/deletion, and no error messages were found in those log files, so moving this bug to verified.