Bug 972958 - oo-admin-repair dies with "can't convert String into Integer" in STG
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified Unspecified
Priority: unspecified, Severity: urgent
Assigned To: Rajat Chopra
QA Contact: libra bugs
Keywords: UpcomingRelease
Duplicates: 979380
Depends On:
Blocks:
Reported: 2013-06-10 18:29 EDT by Thomas Wiest
Modified: 2015-05-14 20:17 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-22 11:16:03 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
51a83930dbd93ce0990000df.json.bz2 (6.76 KB, application/octet-stream)
2013-06-12 10:19 EDT, Thomas Wiest
Formatted JSON (42.44 KB, text/plain)
2013-06-12 10:53 EDT, Dan McPherson

Description Thomas Wiest 2013-06-10 18:29:44 EDT
Description of problem:
Found this issue when trying to clean up oo-admin-chk in STG.

I've attached the application's mongo document json.


oo-admin-chk is reporting this:

Gear '51a83930dbd93ce0990000df' has  key with comment 'OPENSHIFT-51a83930dbd93ce0990000df-512768132587c88299000084-default' on the node but not in mongo.
Gear '51a83930dbd93ce0990000df' has  key with comment 'OPENSHIFT-51a83930dbd93ce0990000df-512768132587c88299000084-bmeng2dhcp11171' on the node but not in mongo.
Gear '51a83930dbd93ce0990000df' has key with name 'default' in mongo but not on the node.
Please refer to the oo-admin-repair tool to resolve some of these inconsistencies.

So I ran oo-admin-repair, but it errors with this:

# oo-admin-repair --ssh-keys
[...snip...]
Gear '51a83930dbd93ce0990000df' has  key with comment 'OPENSHIFT-51a83930dbd93ce0990000df-512768132587c88299000084-default' on the node but not in mongo.
Gear '51a83930dbd93ce0990000df' has  key with comment 'OPENSHIFT-51a83930dbd93ce0990000df-512768132587c88299000084-bmeng2dhcp11171' on the node but not in mongo.
Gear '51a83930dbd93ce0990000df' has key with name 'default' in mongo but not on the node.

Total 1 applications have ssh key mismatches.

Fixing ssh key inconsistencies for all affected applications:
Failed to fix ssh key mismatches for application '51a83930dbd93ce0990000df': can't convert String into Integer

Failed to fix ssh key mismatches for 1 applications.


Version-Release number of selected component (if applicable):
rhc-broker-1.9.5-1.el6oso.noarch


How reproducible:
unknown, found in STG.

Steps to Reproduce:
1. unknown


Actual results:
oo-admin-repair fails with "can't convert String into Integer"

Expected results:
it should fix the problem
Comment 1 Andy Grimm 2013-06-12 09:14:13 EDT
I had a report yesterday of a user in a similar situation.  The relevant part of the mongo document seems to be this:

	"group_instances" : [
		{
			"_id" : ObjectId("51b552d4e0b8cd8dde000051"),
			"gears" : [
				{
					"_id" : ObjectId("51b552d4e0b8cd8dde00003d"),
					"app_dns" : true,
					"host_singletons" : true,
					"name" : "tv",
					"quarantined" : false,
					"server_identity" : "ex-std-node134.prod.rhcloud.com",
					"uid" : 1899,
					"uuid" : "51b552d4e0b8cd8dde00003d"
				}
			]
		},
		{
			"gears" : {
				"0" : {
					"server_identity" : "ex-std-node93.prod.rhcloud.com",
					"uid" : 3282
				}
			}
		}
	],

The data structure for the second "gears" is a dictionary instead of an array, perhaps because of the "0" not being cast to an int somewhere in the code.  If it helps, I _think_ this is happening only with scalable apps.
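
A check along these lines could flag affected documents. This is only a minimal sketch, assuming the application document has been exported to JSON (as in the attachment) and parsed into a Ruby Hash. The helper name malformed_group_instances is hypothetical and is not part of any OpenShift tool.

```ruby
require 'json'

# Hypothetical helper: return the indexes of group_instances whose "gears"
# field is missing or is not an Array (for example, a Hash keyed by "0").
def malformed_group_instances(app_doc)
  instances = app_doc.fetch("group_instances", [])
  instances.each_index.select do |i|
    !instances[i]["gears"].is_a?(Array)
  end
end

# Trimmed-down version of the document fragment quoted above.
SAMPLE_DOC = JSON.parse(<<-JSON)
{
  "group_instances": [
    { "gears": [ { "name": "tv", "uid": 1899 } ] },
    { "gears": { "0": { "uid": 3282 } } }
  ]
}
JSON

# Prints the positions of the group instances that need hand-fixing.
puts malformed_group_instances(SAMPLE_DOC).inspect
```

For the fragment above, only the second group instance (index 1) is reported.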
Comment 2 Thomas Wiest 2013-06-12 10:19:17 EDT
Created attachment 760186
51a83930dbd93ce0990000df.json.bz2

Apparently this didn't attach properly the first time.

So, re-attaching now.
Comment 3 Dan McPherson 2013-06-12 10:53:49 EDT
Created attachment 760201
Formatted JSON
Comment 4 Dan McPherson 2013-06-12 10:59:20 EDT
This is what the error looks like (although from a different instance):

/opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `[]': can't convert String into Integer (TypeError)
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `from_db'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:25:in `block in build'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `each'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `build'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:43:in `create_relation'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:26:in `__build__'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:120:in `block (2 levels) in get_relation'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:125:in `_loading'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:115:in `block in get_relation'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:84:in `_building'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:114:in `get_relation'
    from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:203:in `block in getter'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:1138:in `run_jobs'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:531:in `block in remove_features'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:1280:in `run_in_application_lock'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:529:in `remove_features'
    from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:559:in `destroy_app'
    from /usr/sbin/oo-admin-ctl-app:126:in `<main>'
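
The frames in many.rb (each) and factory.rb ([]) are consistent with a Hash being iterated where an Array of attribute Hashes was expected. Below is a minimal plain-Ruby sketch of that failure mode, with no Mongoid involved. The method name type_of_each, the constants, and the "_type" key lookup are assumptions for illustration, not the library's actual code.

```ruby
# Expected shape: an Array of attribute Hashes.
GOOD_GEARS = [{ "uid" => 1899 }]
# Corrupted shape: a Hash keyed by the string "0".
BAD_GEARS = { "0" => { "uid" => 3282 } }

# Hypothetical stand-in for a builder that walks embedded documents with
# each and looks up a string key on every element.
def type_of_each(gears)
  types = []
  gears.each { |attrs| types << attrs["_type"] }
  types
end

# Each element is a Hash, so the string-key lookup works.
puts type_of_each(GOOD_GEARS).inspect

begin
  type_of_each(BAD_GEARS)
rescue TypeError => e
  # Hash#each yields [key, value] Arrays, and Array#[] rejects a String index.
  puts "TypeError: #{e.message}"
end
```

The exact wording of the TypeError message depends on the Ruby version, so the sketch only relies on the exception class.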
Comment 5 Dan McPherson 2013-06-12 11:10:54 EDT
This has the same error as:  https://bugzilla.redhat.com/show_bug.cgi?id=966750
Comment 6 Abhishek Gupta 2013-06-12 12:10:10 EDT
I got the logs from stage and was looking at them yesterday. Will continue debugging today.
Comment 7 Xiaoli Tian 2013-06-14 06:01:26 EDT
Tested this bug on devenv_3660 (which also has the related stage hot-fix). The only fix I can find that may be related to this bug is 1e051bf25e47fb828fa982dfb3adb17872b53628.


When op_type is nil, oo-admin-clear-pending-ops reports the following error, but the app is actually removed from mongo:
ERROR in cleaning up application's op because the type is nil. App uuid - 296627828527218494013440. Op - #<PendingAppOpGroup _id: 51bae342621d33416a000006, _type: nil, created_at: 2013-06-14 09:32:50 UTC, updated_at: 2013-06-14 09:32:50 UTC, op_type: nil, args: {"features"=>["php-5.3"], "group_overrides"=>[], "init_git_url"=>nil}, parent_op_id: nil, num_gears_added: 0, num_gears_removed: 0, num_gears_created: 0, num_gears_destroyed: 0, num_gears_rolled_back: 0, user_agent: "rhc/1.10.1 (ruby 1.8.7; x86_64-linux) (2.3.2, ruby 1.8.7 (2011-06-30) [x86_64-linux])">
1 applications were cleaned up. 0 users were cleaned up. 0 domains were cleaned up.


When created_at is nil (libra_rs:PRIMARY> db.applications.update({name: "phpapp2"}, {$unset:{"pending_op_groups.0.created_at": "" }})),

oo-admin-clear-pending-ops ignores the app and does nothing, leaving it in pending_ops.

Is the above expected?
Comment 8 Thomas Wiest 2013-06-14 09:47:06 EDT
Just wanted to add some more info that we're seeing in PROD with this.

We have app create loops that run on all of the brokers (that point to themselves) and also one that hits the public broker interface (through the proxies).

Since the upgrade, one of our brokers and the external interface check are both failing, reporting that they can't remove the app after creating it.

I can manually remove the apps using oo-admin-ctl-app -c force-destroy, but I get the same error as above. The app is removed, however.

Then, after a few hours, the issue happens again and I have to manually remove the app again.
Comment 9 Rajat Chopra 2013-06-19 17:43:14 EDT
Root cause: mcollective crashes with SIGABRT and sometimes takes more than 10 minutes to return control to the broker. By then our locks have timed out, exposing the application to the next client request in the queue.

Quick fix: Increased the lock timeout to 30 minutes. Pull request - https://github.com/openshift/origin-server/pull/2908
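
The race can be illustrated with a small simulation. This is only a sketch under stated assumptions: ExpiringLock is a hypothetical in-memory stand-in for the broker's application lock, the 10-minute value is assumed for the old timeout, and time is passed in explicitly so the example is deterministic.

```ruby
# Hypothetical in-memory lock with an expiry, standing in for the broker's
# application lock. Timestamps are plain integers (seconds).
class ExpiringLock
  def initialize(timeout_secs)
    @timeout_secs = timeout_secs
    @owner = nil
    @acquired_at = nil
  end

  # Grants the lock if it is free or if the current holder's lease has expired.
  def acquire(owner, now)
    if @owner.nil? || now - @acquired_at >= @timeout_secs
      @owner = owner
      @acquired_at = now
      true
    else
      false
    end
  end
end

# A node call that takes longer than 10 minutes to return (assumed value).
STALL_SECS = 11 * 60

# Returns true if a second client can take the lock while the first request
# is still stalled waiting on the node.
def second_client_gets_in?(timeout_secs)
  lock = ExpiringLock.new(timeout_secs)
  lock.acquire(:first_request, 0)
  lock.acquire(:second_request, STALL_SECS)
end

puts second_client_gets_in?(10 * 60)  # assumed old timeout
puts second_client_gets_in?(30 * 60)  # timeout after the quick fix
```

With the shorter timeout the second request acquires the lock while the first is still in flight; with 30 minutes it does not. A longer timeout narrows the window but does not remove it if a stall ever exceeds 30 minutes.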

For QE: This bug will be really hard to reproduce. Only four such occurrences have been reported in 15 days. Since the fix is just a change to the lock timeout value, I am fine with this bug being a no-op for QE.
Comment 10 Xiaoli Tian 2013-06-19 22:24:25 EDT
According to comment 9, if this bug does not happen during the next STG/PROD upgrade or deployment by OPS, we are good to close this bug.
Comment 12 Xiaoli Tian 2013-06-28 06:20:25 EDT
Hi, AdamM

According to comment 9, can you help check whether you have still hit this issue in STG or PROD during recent deployments?

If it is no longer reproducible, can you help move it to verified or closed?

Thanks
Comment 13 Marek Mahut 2013-06-28 10:46:01 EDT
*** Bug 979380 has been marked as a duplicate of this bug. ***
Comment 14 Jianwei Hou 2013-07-09 06:54:07 EDT
Are we safe to verify or close this bug now? We haven't been able to reproduce this during recent tests on devenv.
Comment 15 Thomas Wiest 2013-07-09 10:41:19 EDT
This now works in STG:

oo-admin-repair --ssh-keys


I believe the bug is now fixed.
Comment 16 Jianwei Hou 2013-07-09 21:48:17 EDT
Thanks, marking as verified according to comment 15.
Comment 17 Marek Mahut 2013-07-12 12:16:15 EDT
Re-opening this bug as we found out this is still an issue in PROD.

$ sudo /usr/sbin/oo-admin-ctl-app -a mediawiki3 -c destroy  -l user_login
[sudo] password for sturpin:
    !!!! WARNING !!!! WARNING !!!! WARNING !!!!
    You are about to destroy the mediawiki3 application.

    This is NOT reversible, all remote data for this application will be removed.
Do you want to destroy this application (y/n): y
/opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `[]': can't convert String into Integer (TypeError)
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `from_db'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:25:in `block in build'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `each'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `build'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:43:in `create_relation'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:26:in `__build__'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:120:in `block (2 levels) in get_relation'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:125:in `_loading'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:115:in `block in get_relation'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:84:in `_building'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:114:in `get_relation'
        from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:203:in `block in getter'
        from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:1156:in `run_jobs'
        from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:546:in `block in remove_features'
        from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:1298:in `run_in_application_lock'
        from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:544:in `remove_features'
        from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:575:in `destroy_app'
        from /usr/sbin/oo-admin-ctl-app:126:in `<main>'
Comment 18 Rajat Chopra 2013-07-15 19:10:26 EDT
Could not trace any new issue with this app through the logs. The app was created on 7th June, much before the fix was found and implemented. 

This is apparently a leftover of the original bug, but was never hand-fixed. All apps affected by the original bug need hand-fixing.

For now, if the purpose is to delete the app, kindly use '-c force-destroy' option. To hand-clean the app, we need to run a mongo update script that will pop the offending pending_op_group. Let the broker dev team know if that is needed.
Comment 19 Xiaoli Tian 2013-07-16 05:50:15 EDT
Moving this bug to closed again according to comment 18. If it is reproducible on any new apps, feel free to re-open or file a new bug.
Comment 20 Marek Mahut 2013-07-16 06:23:43 EDT
The force-destroy option crashes with the same error. Please kindly provide a script to clean this up.
