Bug 1111598
Summary: | oo-admin-chk gives bad advice to users when gears do not exist on the node. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Eric Rich <erich> |
Component: | Node | Assignee: | Rory Thrasher <rthrashe> |
Status: | CLOSED ERRATA | QA Contact: | libra bugs <libra-bugs> |
Severity: | low | Docs Contact: | |
Priority: | medium | ||
Version: | 2.1.0 | CC: | adellape, jialiu, jokerman, libra-onpremise-devel, mmccomas, rthrashe, tiwillia, xiama |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openshift-origin-broker-util-1.37.4.1-1.el6op rubygem-openshift-origin-controller-1.38.4.2-1.el6op | Doc Type: | Bug Fix |
Doc Text: |
When an error was discovered by running the `oo-admin-chk` command, the error output told the user to run the `oo-admin-repair` tool to fix it. However, a number of possible errors could not be resolved with the `oo-admin-repair` tool, which misdirected users on how to address them. This bug fix updates individual error messages with relevant solutions or links to a Red Hat solutions page if available. The generic error message now directs the user to the `oo-admin-repair` man page to see if their problem is something that it may be able to resolve.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2015-12-17 17:09:24 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Eric Rich
2014-06-20 13:41:01 UTC
I'm pretty sure "refer to the oo-admin-repair tool" is generic advice given when any problem is found. It is too bad oo-admin-repair doesn't have anything for this problem. Maybe it should.

Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/6faa77471f040c9516b4a70b30b0a021eee59db9
oo-admin-chk: Adds helpful information to oo-admin-chk errors

Bug 1111598
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1111598

Updates some of the error messages to contain more information (or links to support articles) that will help the user diagnose/fix the errors. The generic "use oo-admin-repair" line has been fleshed out to recommend using the man page to see what inconsistencies it can solve.

https://github.com/openshift/origin-server/commit/bb5d7f5af62e39a0e55341ef989eed49582c8b7e
Merge pull request #6294 from thrasher-redhat/bug1111598

Merged by openshift-bot

Check on puddle [2.2.8/2015-11-11.1]

1. Create an app.
2. Scale the app.
3. Delete a gear on a node.
3.1 Check on node1:
    [root@node1 openshift]# ls
    aquota.user lost+found
3.2 Check on node2:
    [root@node2 openshift]# ls
    aquota.user lost+found xiaom-stest-1
4. Run 'oo-admin-chk':
    # oo-admin-chk
    Started at: 2015-11-12 02:08:39 UTC
    User data populated in 0 seconds
    Domain data populated in 0 seconds
    District data populated in 0 seconds
    Total gears found in mongo: 3
    Application data populated in 0 seconds
    Usage data populated in 0 seconds
    Fetched all gears in 20 seconds
    Total gears found on the nodes: 3
    Total nodes that responded: 2
    Checked application gears on nodes in 0 seconds

The tool does not detect that the gears have been deleted from the nodes.

It appears that oo-admin-chk waits 10 minutes before checking for the inconsistency we're looking for. This 10-minute wait avoids flagging apps that are taking a long time to create but are already in the mongo database. Try the same test again, but wait 10+ minutes after creating the apps.
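The 10-minute wait described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the real oo-admin-chk is a Ruby tool and may implement the check differently, the function name `missing_gears` and the `created_at` timestamps are hypothetical, and the gear name `xiaom-stest-2` is made-up sample data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 10-minute window is taken from the comment above; the
# actual tool may store or compute this differently.
GRACE_PERIOD = timedelta(minutes=10)


def missing_gears(mongo_gears, node_gear_ids, now):
    """Return IDs of gears recorded in mongo that no node reported,
    skipping gears created within the grace period."""
    missing = []
    for gear_id, created_at in mongo_gears.items():
        if gear_id in node_gear_ids:
            continue  # the gear was reported by a node, nothing to flag
        if now - created_at < GRACE_PERIOD:
            continue  # possibly still being created, do not flag yet
        missing.append(gear_id)
    return sorted(missing)


# Hypothetical sample data: gear ID -> creation time recorded in mongo.
now = datetime(2015, 11, 12, 2, 8, 39, tzinfo=timezone.utc)
mongo_gears = {
    "xiaom-stest-1": now - timedelta(hours=1),
    "xiaom-stest-2": now - timedelta(minutes=3),
    "xiaom-stest-3": now - timedelta(minutes=30),
}
node_gear_ids = {"xiaom-stest-1"}

print(missing_gears(mongo_gears, node_gear_ids, now))
```

In this sketch a gear that is three minutes old is not reported even though no node lists it, which matches the behaviour seen in the test above when the check ran shortly after the app was created.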
Tested again on the env I installed yesterday.

]# rhc app show stest --gears
ID             State    Cartridges           Size   SSH URL
-------------  -------  -------------------  -----  ---------------------------------------------------
xiaom-stest-1  started  haproxy-1.4 php-5.4  small  xiaom-stest-1.com.cn
xiaom-stest-3  unknown  haproxy-1.4 php-5.4  small  xiaom-stest-3.com.cn
xiaom-stest-4  unknown  haproxy-1.4 php-5.4  small  xiaom-stest-4.com.cn

The gears were deleted yesterday.

# oo-admin-chk
Started at: 2015-11-13 00:54:38 UTC
User data populated in 0 seconds
Domain data populated in 0 seconds
District data populated in 0 seconds
Total gears found in mongo: 6
Application data populated in 0 seconds
Usage data populated in 0 seconds
Fetched all gears in 20 seconds
Total gears found on the nodes: 6
Total nodes that responded: 2
Checked application gears on nodes in 0 seconds
Checked application gears on nodes (reverse match) in 0 seconds

No error is found.

I did some additional testing in a new devenv today and found something interesting. I started off by creating two apps (called rmdirapp and develnodeapp). I waited well over 10 minutes and then deleted rmdirapp by removing the gear's folder from the filesystem, and used oo-devel-node app-destroy -c GEARUUID to destroy develnodeapp (without using the broker).

Running oo-admin-chk did NOT notice that rmdirapp was gone. However, the develnodeapp gear was recognized as not existing.

Looking further at oo-admin-chk, each node loops through its gear users and reports back, so deleting the folder will not trigger the desired error. This is a bug in itself and I'll be opening a separate bug report for it.

For the purposes of testing this bug, using oo-devel-node app-destroy and waiting 10 minutes should reproduce the desired error state to verify the changes.

For reference, the issue of oo-admin-chk not recognizing that gears have been manually deleted now has its own bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1281912

Retested this bug with openshift-origin-broker-util-1.37.4.1-1.el6op.noarch: FAIL.

> 1) When a gear does not exist on a node, `oo-admin-chk` now gives a link to
> a helpful public article explaining how to resolve this issue

For this scenario, "rm -rf /var/lib/openshift/<UUID>" does not trigger the error message, as already mentioned in comment 8. Deleting the gear_uuid entry from /etc/passwd does trigger the error.

# oo-admin-chk
Started at: 2015-11-18 10:24:51 UTC
User data populated in 0 seconds
Domain data populated in 0 seconds
District data populated in 0 seconds
Total gears found in mongo: 3
Application data populated in 0 seconds
Usage data populated in 0 seconds
Fetched all gears in 20 seconds
Total gears found on the nodes: 2
Total nodes that responded: 2
Checked application gears on nodes in 0 seconds
Checked application gears on nodes (reverse match) in 0 seconds
Finished at: 2015-11-18 10:25:11 UTC
Total time: 20.581s
Gear jialiu-python27app-1 does not exist on any node
Please see https://access.redhat.com/site/solutions/712593 for more information.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

Following the public article, the above error could be resolved.

> 2) When a node with gears on it is not found through mcollective (such as
> when the ruby193-mcollective service has been stopped), the following
> helpful error message is reported from `oo-admin-chk`:
> Make sure the node <node_hostname> exists and that the ruby193-mcollective
> service is running.
# oo-admin-chk
Started at: 2015-11-18 10:33:19 UTC
User data populated in 0 seconds
Domain data populated in 0 seconds
District data populated in 0 seconds
Total gears found in mongo: 3
Application data populated in 0 seconds
Usage data populated in 0 seconds
Fetched all gears in 20 seconds
Total gears found on the nodes: 1
Total nodes that responded: 1
Checked application gears on nodes in 0 seconds
Checked application gears on nodes (reverse match) in 0 seconds
Finished at: 2015-11-18 10:33:40 UTC
Total time: 20.478s
The node node1.ose22-auto.com.cn expected to contain 1 gears wasn't returned from mcollective for the gear list
Make sure the node node1.ose22-auto.com.cn exists and that the ruby193-mcollective service is running.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

> 3) When a gear exists on a node, but does not exist in mongo, `oo-admin-chk`
> reports a helpful message with a link to a public article.
Edit /etc/passwd to add one extra entry like this:

jialiu-python33app-1:x:6790:6790:OpenShift guest:/var/lib/openshift/jialiu-python33app-1:/usr/bin/oo-trap-user

# oo-admin-chk -v
Started at: 2015-11-18 09:47:28 UTC
User data populated in 0 seconds
Domain data populated in 0 seconds
District data populated in 0 seconds
Total gears found in mongo: 4
Application data populated in 0 seconds
Usage data populated in 0 seconds
Fetched all gears in 20 seconds
Total gears found on the nodes: 5
Total nodes that responded: 2
Checking application gears on corresponding nodes
Checked application gears on nodes in 0 seconds
Checking node gears in application database
jialiu-python33app-1...FAIL
Checked application gears on nodes (reverse match) in 0 seconds
Finished at: 2015-11-18 09:47:49 UTC
Total time: 20.481s
Gear jialiu-python33app-1 exists on node node1.ose22-auto.com.cn (uid: 6790) but does not exist in mongo database
Please see https://access.redhat.com/solutions/1171163 for more information.
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

The error message points to the public article https://access.redhat.com/solutions/1171163. That article explains how to resolve "A gear exists in mongo, but not on the node", but the actual situation is "A gear exists on the node, but not in mongo". The link is wrong.

> 4) `oo-admin-chk` should now recommend checking the `oo-admin-repair` man
> page to see if `oo-admin-repair` can resolve any reported inconsistencies
> that do not suggest a solution.

As seen in the oo-admin-chk output above, the following message is always printed once a check failure is found, which is expected:

Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

So, based on the test result of scenario 3, assigning this bug back.
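The two checks exercised in scenarios 1 and 3 ("Checked application gears on nodes" and its "reverse match") amount to two set differences between the gears in mongo and the gears reported by the nodes. The Python sketch below illustrates that idea only; it assumes that a node lists gears from passwd-style records marked with the "OpenShift guest" GECOS field, as in the entry quoted above. The function names are hypothetical and the real tool is written in Ruby.

```python
# Assumption: gear accounts are recognised by the "OpenShift guest" GECOS
# field, as in the /etc/passwd entry quoted in the comment above.
GEAR_GECOS = "OpenShift guest"


def gears_from_passwd(passwd_text):
    """Map gear name -> uid for every passwd line whose GECOS marks a gear."""
    gears = {}
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 7 and fields[4] == GEAR_GECOS:
            gears[fields[0]] = int(fields[2])
    return gears


def compare(mongo_gears, node_gears):
    """Return (forward, reverse) mismatches.

    forward: gears in mongo that no node reported.
    reverse: gears reported by a node that are not in mongo.
    """
    forward = sorted(set(mongo_gears) - set(node_gears))
    reverse = sorted(set(node_gears) - set(mongo_gears))
    return forward, reverse


passwd_text = """\
root:x:0:0:root:/root:/bin/bash
jialiu-python33app-1:x:6790:6790:OpenShift guest:/var/lib/openshift/jialiu-python33app-1:/usr/bin/oo-trap-user
"""

node_gears = gears_from_passwd(passwd_text)
forward, reverse = compare({"jialiu-python27app-1"}, node_gears)
print("in mongo but on no node:", forward)
print("on a node but not in mongo:", reverse)
```

Because the node-side list in this sketch is built from account records rather than from gear directories, removing a directory under /var/lib/openshift would not change the result, while adding or removing a passwd entry would. That is consistent with what was observed in the tests above.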
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/180131cb88505b118581298e99ff6dca2a42dd1f
oo-admin-chk: Adds solutions to a couple of error messages

Bug 1111598
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1111598

Changes the incorrect error message for when a gear exists on the node, but not in mongo. The error message now tells the user the correct solution for the error.

Also adds a solution to the error message for when there is a mismatch between consumed gears and actual gears.

Johnny,

Good catch on scenario 3. I've gotten a pull request merged that updates the error message. Instead of an article, oo-admin-chk will give the correct solution of deleting the gear using the oo-devel-node command.

Also added was an explanation for when the number of consumed gears and the number of actual gears are mismatched. oo-admin-chk will now output the oo-admin-ctl-user command that fixes the mismatch.

Here are the updated/new test scenarios.

3) When a gear exists on a node, but does not exist in mongo, `oo-admin-chk` reports a resolution:
To fix this issue, remove the gear by running the oo-devel-node command from the node '<server_identity>':
oo-devel-node app-destroy --with-container-uuid <gear_uuid>

5) When the number of consumed gears and the number of actual gears do not match, `oo-admin-chk` should output the following resolution:
Set the correct number of consumed gears with the oo-admin-ctl-user command:
oo-admin-ctl-user --login username --setconsumedgears <app_actual_gears>

Check on puddle [2015-11-19.1]

When the number of consumed gears and the number of actual gears do not match, it outputs the following:

User xiaom has a mismatch in consumed gears (2) and actual gears (3)
FAILED

Cannot get the expected result.
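Scenario 5 above can be pictured as a simple comparison that emits the suggested command when the two counts differ. The Python sketch below is an assumption-laden illustration, not the tool's code: the function name `consumed_gears_advice` is hypothetical, and it substitutes the user's login into the command, whereas the message quoted in this report prints the literal placeholder `username`.

```python
def consumed_gears_advice(login, consumed, actual):
    """Return the message lines for a consumed/actual gear mismatch,
    or an empty list when the two counts agree."""
    if consumed == actual:
        return []
    return [
        f"User {login} has a mismatch in consumed gears ({consumed}) "
        f"and actual gears ({actual})",
        "Set the correct number of consumed gears with the "
        "oo-admin-ctl-user command:",
        # Deviation from the quoted message: the login is filled in here.
        f"oo-admin-ctl-user --login {login} --setconsumedgears {actual}",
    ]


for line in consumed_gears_advice("xiaom", 2, 3):
    print(line)
```

The suggested value for --setconsumedgears is the actual gear count, matching the `<app_actual_gears>` placeholder in the scenario text.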
Check on puddle [2015-11-24.1]

# oo-admin-chk -l 1
User xiaom has a mismatch in consumed gears (2) and actual gears (1)
Set the correct number of consumed gears with the oo-admin-ctl-user command:
oo-admin-ctl-user --login username --setconsumedgears 1
FAILED
Please refer to the oo-admin-repair tool man page to resolve some of these inconsistencies if no suggestion was provided with any error message(s).

Got the expected result; moving this issue to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2666.html