Bug 965285

Summary: [oo-accept-node] httpd config references UUID without associated gear
Product: OpenShift Online
Component: Containers
Version: 2.x
Status: CLOSED UPSTREAM
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: Russell Harrison <rharriso>
Assignee: Rob Millner <rmillner>
QA Contact: libra bugs <libra-bugs>
CC: jgoulding, kwoodson, mfisher, rharriso, twiest
Keywords: Reopened
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-06-10 17:40:55 UTC

Description Russell Harrison 2013-05-20 20:07:20 UTC
Description of problem:

oo-accept-node detects the following issue and reports the error:
FAIL: httpd config references UUID without associated gear: '519155be5004466472000232'


The gear has been deleted in mongo, and the user account and gear directory are not present, but some frontend configuration is left behind:
sudo grep -r -l 519155be5004466472000232 /var/lib/openshift/.httpd.d
/var/lib/openshift/.httpd.d/geardb.json
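
For context, the check behind this failure can be sketched as follows. This is a hypothetical reconstruction, not the actual oo-accept-node code; it assumes gear UUIDs are 24 hex characters and that every live gear owns a directory under /var/lib/openshift/<uuid>:

  # Sketch: flag any UUID referenced in the frontend httpd configs
  # that has no matching gear home directory on the node.
  for uuid in $(grep -rhoE '[0-9a-f]{24}' /var/lib/openshift/.httpd.d/ | sort -u); do
      if [ ! -d "/var/lib/openshift/${uuid}" ]; then
          echo "FAIL: httpd config references UUID without associated gear: '${uuid}'"
      fi
  done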

Comment 1 Rob Millner 2013-05-20 23:59:44 UTC
Please send me any mcollective and broker logs pertaining to that gear for analysis.  Thanks!

Comment 2 Rob Millner 2013-05-24 18:37:53 UTC
Got the logs - thanks!

Comment 3 Rob Millner 2013-05-24 19:14:47 UTC
Ok, I'm not seeing a cause for this gear to have a lingering front-end configuration.

The new rhc-fix-stale-frontend script can accept a UUID as an argument.  It will go to production at the end of this sprint, but it should work if it's copied to the ex node early to get rid of this gear.

I believe there's nothing else that can be done for this ticket, so I'm closing it as errata (use the new rhc-fix-stale-frontend).  Please re-open if you would like it explored further.
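
A hypothetical invocation, based only on the description above: the script is said to accept a UUID as an argument, and anything beyond that (flags, output) is an assumption.

  # Clear the stale frontend entries for the orphaned UUID from this report
  sudo rhc-fix-stale-frontend 519155be5004466472000232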

Comment 4 Thomas Wiest 2013-05-26 15:29:10 UTC
We're seeing around 4 new instances of this daily in PROD.

Please take a look again at what could possibly be causing this.

It seems to be happening sporadically on gear destroy. Also, the stale frontend configuration is the only thing left of the gear.

Re-opening.

Comment 5 Rob Millner 2013-05-28 17:11:56 UTC
Let's pick the most gear that exhibits this problem and send all of the following:

1. Any broker logs pertaining to that application.
2. The mcollective logs from that ex node.
3. /var/log/messages, /var/log/audit/audit.log, /var/log/secure from the ex node.
4. The complete contents of /var/log/openshift from the ex node (one way to bundle the node-side items is sketched below).
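
One way to gather the node-side logs (items 2-4) into a single archive for upload; the mcollective log location is an assumption about these nodes, and the archive name is arbitrary:

  # Bundle the requested node logs (paths from the list above;
  # /var/log/mcollective.log is an assumed location for item 2)
  sudo tar czf gear-debug-$(hostname -s).tar.gz \
      /var/log/mcollective.log \
      /var/log/messages /var/log/audit/audit.log /var/log/secure \
      /var/log/openshift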


Thanks!

Comment 6 Rob Millner 2013-05-28 18:27:17 UTC
Sorry, that should read "most recent gear that exhibits the problem".

Comment 7 Thomas Wiest 2013-05-29 14:20:48 UTC
Last week we were seeing around 4 of these a day. However, there are now far fewer (although it is still happening).

Since the data Rob requested is both large and very secret, I've sent him an e-mail telling him how he can download it.

Comment 8 Thomas Wiest 2013-05-29 14:30:55 UTC
Removing needinfo as I gave the info in comment 7.

Comment 9 Rob Millner 2013-05-29 17:25:00 UTC
Taking this off the build blocker list but keeping it as the high-priority item on my plate.

Comment 10 Rob Millner 2013-05-29 20:40:06 UTC
Complete system logs were provided for a specific gear, and it was determined that the gear account was removed by hand (e.g., by calling userdel XXXXXXXXXXXX).  This would leave a stale front-end Apache configuration in place.  I have a query out to find out more about why the gear was deleted.

Comment 11 Rob Millner 2013-05-30 00:08:50 UTC
It looks like the broker purged the application in question from mongodb when the application connector hooks failed to run.

Release ticket updated to request the use of oo-app-destroy instead of userdel to purge stale gears.
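
To illustrate the difference: the userdel line mirrors comment 10, while the oo-app-destroy arguments are not given in this report, so that call is left as a placeholder (check the tool's usage output on a node for its real interface).

  # WRONG: removes the Unix account but leaves the frontend httpd config behind
  userdel 519155be5004466472000232
  # Preferred: oo-app-destroy tears down the gear, including its frontend config
  oo-app-destroy ...   # placeholder; see the tool's usage/help on the node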

Waiting on more information to see why the application hook calls failed.

Comment 13 Rob Millner 2013-05-30 19:06:49 UTC
Need another gear that has a stale front-end configuration but was not destroyed by ops with "userdel".

Comment 14 Rob Millner 2013-06-07 22:44:46 UTC
The second round of fixes for this class of issue addressed problems in oo-accept-node resulting from the v1 -> v2 migration.

https://github.com/openshift/origin-server/pull/2780

Comment 15 Rob Millner 2013-06-10 17:40:55 UTC
This is impossible to Q/E, so I'm moving it directly to closed.  Please re-open if a large number of them show up; otherwise, we'll deal with small numbers of them as they happen.