Bug 1111598 - oo-admin-chk gives bad advice to users when gears do not exist on the node.
Summary: oo-admin-chk gives bad advice to users when gears do not exist on the node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Rory Thrasher
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-06-20 13:41 UTC by Eric Rich
Modified: 2018-12-09 18:01 UTC

Fixed In Version: openshift-origin-broker-util-1.37.4.1-1.el6op rubygem-openshift-origin-controller-1.38.4.2-1.el6op
Doc Type: Bug Fix
Doc Text:
When the `oo-admin-chk` command found an error, its output told the user to run the `oo-admin-repair` tool to fix it. However, a number of possible errors could not be resolved with the `oo-admin-repair` tool, so this advice misdirected users about how to address them. This bug fix updates individual error messages with relevant solutions or links to a Red Hat solutions page if available. The generic error message now directs the user to the `oo-admin-repair` man page to see whether their problem is one that the tool may be able to resolve.
Clone Of:
Environment:
Last Closed: 2015-12-17 17:09:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 712593 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2015:2666 | 0 | normal | SHIPPED_LIVE | Important: Red Hat OpenShift Enterprise 2.2.8 security, bug fix, and enhancement update | 2015-12-17 22:07:54 UTC

Description Eric Rich 2014-06-20 13:41:01 UTC
Description of problem:

# oo-admin-chk 
Started at: 2014-06-19 13:52:05 -0700
Time to fetch mongo data: 0.011s
Total gears found in mongo: 23
Time to get all gears from nodes: 20.33s
Total gears found on the nodes: 22
Total nodes that responded : 2
Check failed.
Gear 53888f3482cdf932d00000b5 does not exist on any node
Please refer to the oo-admin-repair tool to resolve some of these inconsistencies.
Total time: 20.344s
Finished at: 2014-06-19 13:52:26 -0700

From the output above, you get the impression that oo-admin-repair should be run.

How reproducible: 100% 

Steps to Reproduce:
1. rm -rf /var/lib/openshift/<GEAR_UUID>
2. oo-admin-chk

Actual results:

# oo-admin-chk 
Started at: 2014-06-19 13:52:05 -0700
Time to fetch mongo data: 0.011s
Total gears found in mongo: 23
Time to get all gears from nodes: 20.33s
Total gears found on the nodes: 22
Total nodes that responded : 2
Check failed.
Gear 53888f3482cdf932d00000b5 does not exist on any node
Please refer to the oo-admin-repair tool to resolve some of these inconsistencies.
Total time: 20.344s
Finished at: 2014-06-19 13:52:26 -0700

Expected results:

# oo-admin-chk 
Started at: 2014-06-19 13:52:05 -0700
Time to fetch mongo data: 0.011s
Total gears found in mongo: 23
Time to get all gears from nodes: 20.33s
Total gears found on the nodes: 22
Total nodes that responded : 2
Check failed.
Gear 53888f3482cdf932d00000b5 does not exist on any node
  - Please follow the steps listed at https://access.redhat.com/site/solutions/712593 to resolve.
Total time: 20.344s
Finished at: 2014-06-19 13:52:26 -0700

Additional info:

Comment 2 Luke Meyer 2014-06-20 13:47:02 UTC
I'm pretty sure "refer to the oo-admin-repair tool" is generic advice when any problem is found. It is too bad oo-admin-repair doesn't have anything for this problem. Maybe it should.

Comment 3 openshift-github-bot 2015-10-30 21:43:08 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6faa77471f040c9516b4a70b30b0a021eee59db9
oo-admin-chk: Adds helpful information to oo-admin-chk errors

Bug 1111598
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1111598
Updates some of the error messages to contain more information (or links to support articles) that will help the user diagnose/fix the errors.

The generic "use oo-admin-repair" line has been fleshed out to recommend using the man page to see what inconsistencies it can solve.

https://github.com/openshift/origin-server/commit/bb5d7f5af62e39a0e55341ef989eed49582c8b7e
Merge pull request #6294 from thrasher-redhat/bug1111598

Merged by openshift-bot

Comment 8 Ma xiaoqiang 2015-11-12 02:16:52 UTC
Check on puddle [2.2.8/2015-11-11.1]


1. Create an app
2. Scale up the app

3. delete a gear on a node
3.1 check on node1
[root@node1 openshift]# ls
aquota.user  lost+found
3.2 check on node2
[root@node2 openshift]# ls
aquota.user  lost+found  xiaom-stest-1

4. run 'oo-admin-chk'
# oo-admin-chk 
Started at: 2015-11-12 02:08:39 UTC

User data populated in 0 seconds

Domain data populated in 0 seconds

District data populated in 0 seconds

Total gears found in mongo: 3
Application data populated in 0 seconds

Usage data populated in 0 seconds

Fetched all gears in 20 seconds
Total gears found on the nodes: 3
Total nodes that responded: 2
Checked application gears on nodes in 0 seconds

The tool does not detect that the gears have been deleted from the nodes.

Comment 9 Rory Thrasher 2015-11-12 16:41:12 UTC
It appears that oo-admin-chk only reports the inconsistency we're looking for once a gear is more than 10 minutes old.  This 10-minute wait avoids false errors in cases where the app is taking a long time to create, but is already in the mongo database.


Try the same test again, however wait 10+ minutes after creating the apps.
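
The grace period described above can be sketched roughly as follows. This is an illustration in Python, not the tool's own code: the 10-minute threshold comes from this comment, while the function name, the creation-time field, and the data layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: gears younger than this are not reported, per the 10-minute wait above.
GRACE_PERIOD = timedelta(minutes=10)


def missing_gears(mongo_gears, node_gear_uuids, now=None):
    """Return gear UUIDs that are in mongo, absent from every node,
    and older than the grace period.

    mongo_gears: dict mapping gear UUID -> creation time (hypothetical layout).
    node_gear_uuids: set of gear UUIDs reported back by the nodes.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        uuid
        for uuid, created_at in mongo_gears.items()
        if uuid not in node_gear_uuids and now - created_at >= GRACE_PERIOD
    )
```

Under these assumptions, a gear created two minutes ago and not yet present on a node is skipped, which would explain why a test run shortly after app creation reports nothing.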

Comment 10 Ma xiaoqiang 2015-11-13 01:00:17 UTC
Tested again on the environment I installed yesterday.

# rhc app show stest --gears
ID            State   Cartridges          Size  SSH URL
------------- ------- ------------------- ----- ---------------------------------------------------
xiaom-stest-1 started haproxy-1.4 php-5.4 small xiaom-stest-1.com.cn
xiaom-stest-3 unknown haproxy-1.4 php-5.4 small xiaom-stest-3.com.cn
xiaom-stest-4 unknown haproxy-1.4 php-5.4 small xiaom-stest-4.com.cn

The gears were deleted yesterday.

# oo-admin-chk 
Started at: 2015-11-13 00:54:38 UTC

User data populated in 0 seconds

Domain data populated in 0 seconds

District data populated in 0 seconds

Total gears found in mongo: 6
Application data populated in 0 seconds

Usage data populated in 0 seconds

Fetched all gears in 20 seconds
Total gears found on the nodes: 6
Total nodes that responded: 2
Checked application gears on nodes in 0 seconds

Checked application gears on nodes (reverse match) in 0 seconds


No error is found.

Comment 11 Rory Thrasher 2015-11-13 16:43:30 UTC
I did some additional testing in a new devenv today and found something interesting.

I started off by creating two apps (called rmdirapp and develnodeapp).  I waited well over 10 minutes and then deleted rmdirapp by removing the gear's folder from the filesystem, and used oo-devel-node app-destroy -c GEARUUID to destroy develnodeapp (without using the broker).

Running oo-admin-chk did NOT notice that rmdirapp was gone.  However the develnodeapp gear was recognized as not existing.

Looking further at oo-admin-chk, each node loops through its gear users and reports back, so deleting the folder will not trigger the desired error.  This is a bug in itself and I'll be opening a separate bug report for it.

For the purposes of testing this bug, using oo-devel-node app-destroy and waiting 10 minutes should reproduce the desired error state to verify the changes.
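
To illustrate why removing only the gear's folder goes unnoticed, here is a rough Python sketch of a node listing its gears purely from passwd-style user entries. It assumes gear users are recognized by the "OpenShift guest" GECOS value seen in the passwd entry quoted in comment 13; the function is hypothetical and is not the node's real code.

```python
def gears_from_passwd(passwd_text, gecos="OpenShift guest"):
    """Return login names of passwd entries whose GECOS field marks them as gears.

    Assumption: a gear is identified only by its user entry. The home
    directory (field 6) is never inspected, so deleting it changes nothing.
    """
    gears = []
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        # A passwd entry has 7 colon-separated fields; index 4 is GECOS.
        if len(fields) == 7 and fields[4] == gecos:
            gears.append(fields[0])
    return gears
```

In this model, removing the gear's user entry makes the gear disappear from the node's report, while `rm -rf` on the gear directory alone does not.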

Comment 12 Rory Thrasher 2015-11-13 19:07:52 UTC
For reference, the issue of oo-admin-chk not recognizing that gears have been manually deleted now has its own bug.

https://bugzilla.redhat.com/show_bug.cgi?id=1281912

Comment 13 Johnny Liu 2015-11-18 10:49:42 UTC
Retest this bug with openshift-origin-broker-util-1.37.4.1-1.el6op.noarch, FAIL.

> 1) When a gear does not exist on a node, `oo-admin-chk` now gives a link to
> a helpful public article explaining how to resolve this issue

For this scenario, "rm -rf /var/lib/openshift/<UUID>" does not trigger the error message, as already mentioned in comment 8. Deleting the gear_uuid entry from /etc/passwd does trigger the error.

# oo-admin-chk
Started at: 2015-11-18 10:24:51 UTC

User data populated in 0 seconds

Domain data populated in 0 seconds

District data populated in 0 seconds

Total gears found in mongo: 3
Application data populated in 0 seconds

Usage data populated in 0 seconds

Fetched all gears in 20 seconds
Total gears found on the nodes: 2
Total nodes that responded: 2
Checked application gears on nodes in 0 seconds

Checked application gears on nodes (reverse match) in 0 seconds


Finished at: 2015-11-18 10:25:11 UTC
Total time: 20.581s
Gear jialiu-python27app-1 does not exist on any node
Please see https://access.redhat.com/site/solutions/712593 for more information.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

Following the public article, the above error could be resolved.


> 2) When a node with gears on it is not found through mcollective (such as
> when the ruby193-mcollective service has been stopped), the following
> helpful error message is reported from `oo-admin-chk`:
>   Make sure the node <node_hostname> exists and that the ruby193-mcollective
> service is running.

# oo-admin-chk
Started at: 2015-11-18 10:33:19 UTC

User data populated in 0 seconds

Domain data populated in 0 seconds

District data populated in 0 seconds

Total gears found in mongo: 3
Application data populated in 0 seconds

Usage data populated in 0 seconds

Fetched all gears in 20 seconds
Total gears found on the nodes: 1
Total nodes that responded: 1
Checked application gears on nodes in 0 seconds

Checked application gears on nodes (reverse match) in 0 seconds


Finished at: 2015-11-18 10:33:40 UTC
Total time: 20.478s
The node node1.ose22-auto.com.cn expected to contain 1 gears wasn't returned from mcollective for the gear list
Make sure the node node1.ose22-auto.com.cn exists and that the ruby193-mcollective service is running.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).


 
> 3) When a gear exists on a node, but does not exist in mongo, `oo-admin-chk`
> reports a helpful message with a link to a public article.

Edit /etc/passwd to add one extra entry just like:
jialiu-python33app-1:x:6790:6790:OpenShift guest:/var/lib/openshift/jialiu-python33app-1:/usr/bin/oo-trap-user

# oo-admin-chk -v
Started at: 2015-11-18 09:47:28 UTC

User data populated in 0 seconds

Domain data populated in 0 seconds

District data populated in 0 seconds

Total gears found in mongo: 4
Application data populated in 0 seconds

Usage data populated in 0 seconds

Fetched all gears in 20 seconds
Total gears found on the nodes: 5
Total nodes that responded: 2
Checking application gears on corresponding nodes
Checked application gears on nodes in 0 seconds

Checking node gears in application database
jialiu-python33app-1...FAIL
Checked application gears on nodes (reverse match) in 0 seconds


Finished at: 2015-11-18 09:47:49 UTC
Total time: 20.481s
Gear jialiu-python33app-1 exists on node node1.ose22-auto.com.cn (uid: 6790) but does not exist in mongo database
Please see https://access.redhat.com/solutions/1171163 for more information.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).


The error message points to https://access.redhat.com/solutions/1171163, but that article explains how to resolve "A gear exists in mongo, but not on the node." The actual situation is "A gear exists on the node, but not in mongo", so the link is wrong.

 
> 4) `oo-admin-chk` should now recommend checking the `oo-admin-repair` man
> page to see if `oo-admin-repair` can resolve any reported inconsistencies
> that do not suggest a solution.
As the oo-admin-chk output above shows, the following message is always printed once a check failure is seen, which is expected.

Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).


Based on the test result of scenario 3, I am assigning this bug back.
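
Scenarios 1 and 3 are the two directions of the same comparison: gears known to mongo but missing from the nodes, and gears on the nodes but unknown to mongo (the "reverse match" in the output). A minimal Python sketch of that idea, with hypothetical names and plain sets standing in for the real data:

```python
def compare_gears(mongo_gears, node_gears):
    """Compare gear identifiers in both directions.

    Returns (missing_on_nodes, unknown_to_mongo) as sorted lists.
    """
    mongo_gears, node_gears = set(mongo_gears), set(node_gears)
    # Forward match: in mongo, but not reported by any node (scenario 1).
    missing_on_nodes = sorted(mongo_gears - node_gears)
    # Reverse match: reported by a node, but not in mongo (scenario 3).
    unknown_to_mongo = sorted(node_gears - mongo_gears)
    return missing_on_nodes, unknown_to_mongo
```

Because the two result lists describe opposite problems, each needs its own advice, which is why a single article cannot cover both.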

Comment 14 openshift-github-bot 2015-11-19 15:09:52 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/180131cb88505b118581298e99ff6dca2a42dd1f
oo-admin-chk: Adds solutions to a couple of error messages

Bug 1111598
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1111598

Changes the incorrect error message for when a gear exists on the node, but not in mongo.  The error message now tells the user the correct solution for the error.

Also adds a solution to the error message for when there is a mismatch between consumed gears and actual gears.

Comment 15 Rory Thrasher 2015-11-19 15:48:06 UTC
Johnny,

Good catch on scenario 3.  I've gotten a pull request merged that updates the error message.  Instead of an article, oo-admin-chk will give the correct solution of deleting the gear using the oo-devel-node command.

Also added was an explanation for when the number of consumed gears and number of actual gears are mismatched.  oo-admin-chk will now output the oo-admin-ctl-user command used to fix the mismatch.

Here are the updated/new test scenarios.

3) When a gear exists on a node, but does not exist in mongo, `oo-admin-chk` reports a resolution:

To fix this issue, remove the gear by running the oo-devel-node command from the node '<server_identity>':
oo-devel-node app-destroy --with-container-uuid <gear_uuid>


5) When the number of consumed gears and number of actual gears do not match, then 'oo-admin-chk' should output the following resolution:

Set the correct number of consumed gears with the oo-admin-ctl-user command:
oo-admin-ctl-user --login username --setconsumedgears <app_actual_gears>
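
The resulting output layout (each error followed by its own suggestion when one exists, then FAILED, then the generic pointer to the man page) could be modeled as in the Python sketch below. The message strings are taken from the output in this report; the function and its data layout are hypothetical and do not reflect how oo-admin-chk is actually implemented.

```python
GENERIC_ADVICE = (
    "Please refer to the oo-admin-repair tool man page to resolve some of these "
    "inconsistencies if no suggestion was provided with any error message(s)."
)


def render_report(errors):
    """Turn (message, advice) pairs into the lines printed after a failed check.

    advice may be None when no specific suggestion exists for an error.
    Returns an empty list when there are no errors.
    """
    if not errors:
        return []
    lines = []
    for message, advice in errors:
        lines.append(message)
        if advice:
            lines.append(advice)
    lines.append("FAILED")
    lines.append(GENERIC_ADVICE)
    return lines
```

Keeping the advice next to each error means the generic line only matters for errors that have no suggestion of their own.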

Comment 17 Ma xiaoqiang 2015-11-23 01:25:24 UTC
Check on puddle [2015-11-19.1]

When the number of consumed gears and number of actual gears do not match, the output is:

User xiaom has a mismatch in consumed gears (2) and actual gears (3)
FAILED

Did not get the expected result.

Comment 19 Ma xiaoqiang 2015-11-25 06:03:27 UTC
Check on puddle [2015-11-24.1]
# oo-admin-chk  -l 1
User xiaom has a mismatch in consumed gears (2) and actual gears (1)
Set the correct number of consumed gears with the oo-admin-ctl-user command:
oo-admin-ctl-user --login username --setconsumedgears 1
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

Got the expected result; moving this issue to VERIFIED.
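
Note that the suggested command in the output above contains the literal placeholder "username" rather than the affected login. As a convenience, an administrator could derive the full command from the mismatch line with a small script such as this Python sketch. The regular expression is based only on the message format shown in this comment, and the helper name is hypothetical.

```python
import re
import shlex

# Pattern derived from the mismatch line shown in this report (an assumption
# about the format, not taken from the tool's source).
MISMATCH = re.compile(
    r"User (\S+) has a mismatch in consumed gears \((\d+)\) and actual gears \((\d+)\)"
)


def fix_command(line):
    """Build the oo-admin-ctl-user command for a mismatch line, or return None."""
    match = MISMATCH.search(line)
    if match is None:
        return None
    login, _consumed, actual = match.groups()
    # Quote the login so unusual characters are safe to paste into a shell.
    return f"oo-admin-ctl-user --login {shlex.quote(login)} --setconsumedgears {actual}"
```

The sketch only builds the command string; review it before running it against a real user.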

Comment 21 errata-xmlrpc 2015-12-17 17:09:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2666.html

