Bug 988115 - oo-admin-chk complains about null node when the application is partially deleted
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Assigned To: Abhishek Gupta
QA Contact: libra bugs
Depends On:
Blocks:
Reported: 2013-07-24 14:39 EDT by Sten Turpin
Modified: 2015-05-14 20:19 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-08-07 18:56:35 EDT
Type: Bug


Attachments: None
Description Sten Turpin 2013-07-24 14:39:35 EDT
Description of problem: oo-admin-chk --level 0 returns an error like:

"The node  expected to contain 15 gears wasn't returned from mcollective for the gear list" 

(note the extra space after "node", where the node name should appear)


Version-Release number of selected component (if applicable): openshift-origin-broker-util-1.11.3-1.el6oso.noarch


How reproducible: sometimes


Steps to Reproduce:
1. Run oo-admin-chk --level 0

Actual results:
A failure message saying that a node with an empty name ("  ") wasn't returned

Expected results:
Either a clean run, or an error that names the node with the failing items

Additional info:
Comment 1 Abhishek Gupta 2013-07-26 14:31:54 EDT
This happens when oo-admin-chk catches an application that is about to be deleted, or one where a gear is being destroyed. Because the mongo record for the gear is removed only after the gear has been deleted on the node, there is a small window in which the gear is gone from the node and the server but still exists in mongo.

We need to handle these cases in the script by either checking again or looking for pending op groups within the application.
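
The second option could look roughly like the sketch below. It is written in Python for illustration only (oo-admin-chk itself is a Ruby script), and the record layout, the field names `pending_op_groups`, `gears`, `uuid` and `server_identity`, and the function name are assumptions, not the broker's actual code or schema.

```python
def missing_node_errors(apps, nodes_responded):
    """Report gears whose node did not respond, skipping apps in flux.

    apps: list of dicts shaped like simplified application records
          (hypothetical layout, for illustration only).
    nodes_responded: set of node names that answered the gear-list query.
    """
    errors = []
    for app in apps:
        # An application with pending op groups is mid-change (for example,
        # being deleted), so a mismatch here is likely a false positive.
        if app.get("pending_op_groups"):
            continue
        for gear in app.get("gears", []):
            node = gear.get("server_identity")
            if node not in nodes_responded:
                errors.append(
                    "Gear %s expects node %s, which did not respond"
                    % (gear["uuid"], node)
                )
    return errors


apps = [
    # Partially deleted app: node already cleared, delete still pending.
    {"pending_op_groups": [{"op": "delete_app"}],
     "gears": [{"uuid": "g1", "server_identity": None}]},
    # Stable app with one gear on a node that did not respond.
    {"pending_op_groups": [],
     "gears": [{"uuid": "g2", "server_identity": "node1"},
               {"uuid": "g3", "server_identity": "node2"}]},
]
errors = missing_node_errors(apps, {"node1"})
```

With this filter, the partially deleted application is ignored instead of producing an error about an empty node name, while a genuinely unresponsive node is still reported.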
Comment 2 Abhishek Gupta 2013-07-29 19:07:06 EDT
Added a check to weed out false positives --> https://github.com/openshift/origin-server/pull/3213
Comment 4 Jianwei Hou 2013-07-30 23:12:07 EDT
Verified on devenv_3588

Create 3 applications, then delete all of them via rhc while running oo-admin-chk --level 0 at the same time to observe its behavior.

Actual result:
While the applications are being deleted:
[root@ip-10-191-41-74 broker]# oo-admin-chk --level 0 --verbose
Started at: 2013-07-30 22:58:30 -0400
Time to fetch mongo data: 0.031s
Total gears found in mongo: 3
Time to get all gears from nodes: 21.168s
Total gears found on the nodes: 1
Total nodes that responded : 1
Checking application gears and ssh keys on corresponding nodes:
475186853543936620756992 : String...	OK
51f87c4d53d8ec309d000003 : String...	OK
09e6889ef98d11e2b82f123139401dc0 : String...	OK
Checking node gears in application database:
09e6889ef98d11e2b82f123139401dc0...	OK
Success
Total time: 21.2s
Finished at: 2013-07-30 22:58:51 -0400

It reports success, and the command exits successfully.
When a gear is deleted from the app's group_instance (so that it remains on the node but not in mongo), the result is:
[root@ip-10-191-41-74 ~]# oo-admin-chk -l 0 -v
Started at: 2013-07-30 23:08:57 -0400
Time to fetch mongo data: 0.012s
Total gears found in mongo: 1
Time to get all gears from nodes: 20.531s
Total gears found on the nodes: 2
Total nodes that responded : 1
Checking application gears and ssh keys on corresponding nodes:
09e6889ef98d11e2b82f123139401dc0 : String...	OK
Checking node gears in application database:
431415193033926816825344...	FAIL
09e6889ef98d11e2b82f123139401dc0...	OK
Check failed.
Gear 431415193033926816825344 exists on node ip-10-191-41-74 (uid: 501) but does not exist in mongo database
Please refer to the oo-admin-repair tool to resolve some of these inconsistencies.
Total time: 20.545s
Finished at: 2013-07-30 23:09:17 -0400
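
The two passes in the output above ("Checking application gears and ssh keys on corresponding nodes" and "Checking node gears in application database") amount to comparing gear UUIDs in both directions. A rough Python sketch of that comparison follows, using the gear IDs from the second run; the function and variable names are hypothetical and this is not oo-admin-chk's actual implementation.

```python
def compare_gears(mongo_gears, node_gears):
    """Compare gear UUIDs known to mongo with those reported by nodes.

    mongo_gears: set of gear UUIDs recorded in the database.
    node_gears: dict mapping gear UUID -> name of the node reporting it.
    Returns (in mongo but on no node, on a node but not in mongo).
    """
    on_nodes = set(node_gears)
    missing_on_nodes = sorted(mongo_gears - on_nodes)
    missing_in_mongo = sorted(on_nodes - mongo_gears)
    return missing_on_nodes, missing_in_mongo


# Data taken from the second run: one gear in mongo, two on the node.
mongo_gears = {"09e6889ef98d11e2b82f123139401dc0"}
node_gears = {
    "431415193033926816825344": "ip-10-191-41-74",
    "09e6889ef98d11e2b82f123139401dc0": "ip-10-191-41-74",
}
missing_on_nodes, missing_in_mongo = compare_gears(mongo_gears, node_gears)
for uuid in missing_in_mongo:
    print("Gear %s exists on node %s but does not exist in mongo database"
          % (uuid, node_gears[uuid]))
```

Only the second direction finds a mismatch here, which corresponds to the single FAIL line in the run.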
