Bug 965285 - [oo-accept-node] httpd config references UUID without associated gear
Status: CLOSED UPSTREAM
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Severity: medium
Assigned To: Rob Millner
QA Contact: libra bugs
Reported: 2013-05-20 16:07 EDT by Russell Harrison
Modified: 2017-11-08 17:14 EST
CC: 5 users

Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-06-10 13:40:55 EDT

Description Russell Harrison 2013-05-20 16:07:20 EDT
Description of problem:

oo-accept-node detects the following issue and reports this error:
FAIL: httpd config references UUID without associated gear: '519155be5004466472000232'


The gear has been deleted in mongo, and the user account and gear directory are no longer present, but some frontend configuration is left behind:

  sudo grep -r -l 519155be5004466472000232 /var/lib/openshift/.httpd.d
  /var/lib/openshift/.httpd.d/geardb.json
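The stale entry can be spotted mechanically. Below is a minimal shell sketch of the kind of check oo-accept-node performs (not its actual implementation): scan geardb.json for gear UUIDs and flag any that no longer have a home directory. The node layout is simulated in a temp directory here; on a real node the paths live under /var/lib/openshift.

```shell
# Minimal sketch of the stale-UUID check (not the real oo-accept-node code).
# The node layout is simulated in a temp dir: one gear (...231) still has its
# home directory, the other (...232) does not.
set -eu
root=$(mktemp -d)
mkdir -p "$root/.httpd.d" "$root/519155be5004466472000231"
printf '%s\n' '{"519155be5004466472000231": {}, "519155be5004466472000232": {}}' \
    > "$root/.httpd.d/geardb.json"

# Pull 24-hex-char UUIDs out of geardb.json and flag any without a gear dir.
stale=$(for uuid in $(grep -oE '[0-9a-f]{24}' "$root/.httpd.d/geardb.json" | sort -u); do
    [ -d "$root/$uuid" ] || echo "stale frontend entry: $uuid"
done)
echo "$stale"
rm -rf "$root"
```

On a real node this prints one line per UUID that has a frontend record but no gear, which is exactly the condition the FAIL message above reports.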
Comment 1 Rob Millner 2013-05-20 19:59:44 EDT
Please send me any mcollective and broker logs pertaining to that gear for analysis.  Thanks!
Comment 2 Rob Millner 2013-05-24 14:37:53 EDT
Got the logs - thanks!
Comment 3 Rob Millner 2013-05-24 15:14:47 EDT
Ok, I'm not seeing a cause for this gear to have a lingering front-end configuration.

The new rhc-fix-stale-frontend script can accept a UUID as an argument. It will go to production at the end of this sprint, but it should work if it's copied to the ex node early to get rid of this gear.

I believe there's nothing else that can be done for this ticket, so I'm closing it as errata (use the new rhc-fix-stale-frontend). Please re-open if you would like it explored further.
Comment 4 Thomas Wiest 2013-05-26 11:29:10 EDT
We're seeing around 4 new instances of this daily in PROD.

Please take a look again at what could possibly be causing this.

It seems to happen sporadically on gear destroy, and this frontend configuration is the only thing left of the gear.

Re-opening.
Comment 5 Rob Millner 2013-05-28 13:11:56 EDT
Let's pick the most gear that exhibits this problem and send all of the following:

1. Any broker logs pertaining to that application.
2. The mcollective logs from that ex node.
3. /var/log/messages, /var/log/audit/audit.log, /var/log/secure from the ex node.
4. the complete contents of /var/log/openshift from the ex node.


Thanks!
Comment 6 Rob Millner 2013-05-28 14:27:17 EDT
Sorry, that should read "most recent gear that exhibits the problem".
Comment 7 Thomas Wiest 2013-05-29 10:20:48 EDT
Last week we were seeing around 4 of these a day. Now, however, there are far fewer (although it is still happening).

Since the data he requested is both large and very secret, I've sent Rob an e-mail telling him how he can download it.
Comment 8 Thomas Wiest 2013-05-29 10:30:55 EDT
Removing needinfo, as I provided the info in comment 7.
Comment 9 Rob Millner 2013-05-29 13:25:00 EDT
Taking off the build blocker list but keeping it as the high priority item on my plate.
Comment 10 Rob Millner 2013-05-29 16:40:06 EDT
Complete system logs were provided for a specific gear, and it was determined that the gear account was removed by hand (e.g., by calling userdel XXXXXXXXXXXX).  This would leave a stale front-end Apache configuration in place.  I have a query out to find out more about why the gear was deleted.
Comment 11 Rob Millner 2013-05-29 20:08:50 EDT
It looks like the broker purged the application in question from mongodb when the application connector hooks failed to run.

Release ticket updated to request the use of oo-app-destroy instead of userdel to purge stale gears.
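The userdel failure mode is easy to reproduce in miniature. The sketch below is purely illustrative (the layout and cleanup steps are stand-ins, not the real oo-app-destroy logic): removing only the gear's home directory, as a bare userdel-style cleanup effectively does, leaves the geardb.json entry behind, while a destroy that also scrubs the frontend record does not.

```shell
# Illustrative only: simulated gear layout in a temp dir; not real oo-app-destroy.
set -eu
root=$(mktemp -d)
uuid=519155be5004466472000232
mkdir -p "$root/.httpd.d" "$root/$uuid"
printf '{"%s": {}}\n' "$uuid" > "$root/.httpd.d/geardb.json"

# userdel-style cleanup: remove the gear home only.
rm -rf "$root/$uuid"
grep -q "$uuid" "$root/.httpd.d/geardb.json" && after_userdel=stale || after_userdel=clean

# A proper destroy also scrubs the frontend record (here: strip the entry).
sed -i "s/\"$uuid\": {}//" "$root/.httpd.d/geardb.json"
grep -q "$uuid" "$root/.httpd.d/geardb.json" && after_destroy=stale || after_destroy=clean
rm -rf "$root"
```

After the userdel-style step the UUID is still present in geardb.json (the stale state oo-accept-node flags); after the full cleanup it is gone.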

Waiting on more information to see why the application hook calls failed.
Comment 13 Rob Millner 2013-05-30 15:06:49 EDT
Need another gear which has a stale front-end configuration but was not destroyed by ops with "userdel".
Comment 14 Rob Millner 2013-06-07 18:44:46 EDT
The second round of fixes for this class of issue addressed problems in oo-accept-node resulting from the v1 -> v2 migration.

https://github.com/openshift/origin-server/pull/2780
Comment 15 Rob Millner 2013-06-10 13:40:55 EDT
This is impossible to Q/E, so I'm moving it directly to closed.  Please re-open if a large number of them show up; otherwise, we'll deal with small numbers of them as they happen.
