Bug 972958
Summary: | oo-admin-repair dies with "can't convert String into Integer" in STG
---|---
Product: | OpenShift Online
Component: | Pod
Version: | 2.x
Status: | CLOSED CURRENTRELEASE
Severity: | urgent
Priority: | unspecified
Reporter: | Thomas Wiest <twiest>
Assignee: | Rajat Chopra <rchopra>
QA Contact: | libra bugs <libra-bugs>
CC: | admiller, agrimm, bhatiam, dmcphers, jhou, mmahut, rchopra, twiest, xtian
Keywords: | UpcomingRelease
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2013-07-22 15:16:03 UTC

Description
Thomas Wiest
2013-06-10 22:29:44 UTC
I had a report yesterday of a user in a similar situation. The relevant part of the mongo document seems to be this:

    "group_instances" : [
      {
        "_id" : ObjectId("51b552d4e0b8cd8dde000051"),
        "gears" : [
          {
            "_id" : ObjectId("51b552d4e0b8cd8dde00003d"),
            "app_dns" : true,
            "host_singletons" : true,
            "name" : "tv",
            "quarantined" : false,
            "server_identity" : "ex-std-node134.prod.rhcloud.com",
            "uid" : 1899,
            "uuid" : "51b552d4e0b8cd8dde00003d"
          }
        ]
      },
      {
        "gears" : {
          "0" : {
            "server_identity" : "ex-std-node93.prod.rhcloud.com",
            "uid" : 3282
          }
        }
      }
    ],

The data structure for the second "gears" is a dictionary instead of an array, perhaps because of the "0" not being cast to an int somewhere in the code.

If it helps, I _think_ this is happening only with scalable apps.

Created attachment 760186: 51a83930dbd93ce0990000df.json.bz2

Apparently this didn't attach properly the first time. So, re-attaching now.

Created attachment 760201: Formatted JSON
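To illustrate why a hash-shaped "gears" could produce the TypeError named in this bug's summary, here is a minimal sketch in plain Ruby. It assumes, without confirmation from this report, that the loader iterates over the embedded gears and indexes each element with a string key such as "_type". The `read_types` helper and the trimmed sample data are hypothetical and do not use Mongoid.

```ruby
# Well-formed shape: "gears" is an array of embedded documents.
good_gears = [
  { "_id" => "51b552d4e0b8cd8dde00003d", "uid" => 1899 }
]

# Malformed shape from the report: "gears" is a hash keyed by the string "0".
bad_gears = {
  "0" => { "server_identity" => "ex-std-node93.prod.rhcloud.com", "uid" => 3282 }
}

# Hypothetical stand-in for a loader that walks embedded documents and
# reads a string key from each one (assumption: similar to what the
# embedded-many builder does before instantiating each document).
def read_types(gears)
  types = []
  gears.each do |attributes|
    # For an Array, attributes is a Hash and the lookup returns nil.
    # For a Hash, each yields a [key, value] Array, and indexing an
    # Array with a String raises TypeError.
    types << attributes["_type"]
  end
  types
end

good_result = read_types(good_gears)

bad_error = begin
  read_types(bad_gears)
  nil
rescue TypeError => e
  e
end

puts "well-formed gears: #{good_result.inspect}"
puts "hash-shaped gears: #{bad_error.class}: #{bad_error.message}"
```

On Ruby 1.9.3 the message reads "can't convert String into Integer"; newer Ruby versions word the same TypeError as "no implicit conversion of String into Integer".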
This is what the error looks like (although from a different instance):

    /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `[]': can't convert String into Integer (TypeError)
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `from_db'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:25:in `block in build'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `each'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `build'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:43:in `create_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:26:in `__build__'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:120:in `block (2 levels) in get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:125:in `_loading'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:115:in `block in get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:84:in `_building'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:114:in `get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:203:in `block in getter'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:1138:in `run_jobs'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:531:in `block in remove_features'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:1280:in `run_in_application_lock'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:529:in `remove_features'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.9.14/app/models/application.rb:559:in `destroy_app'
      from /usr/sbin/oo-admin-ctl-app:126:in `<main>'

This has the same error as: https://bugzilla.redhat.com/show_bug.cgi?id=966750

I got the logs from stage and was looking at them yesterday. Will continue debugging today.

Tested this bug on devenv_3660 (which also has the related stage hot-fix). The only fix I can find that may be related to this bug is 1e051bf25e47fb828fa982dfb3adb17872b53628.

When op_type is nil, oo-admin-clear-pending-ops reports the following error, but the app is actually removed from mongo:

    ERROR in cleaning up application's op because the type is nil. App uuid - 296627828527218494013440. Op - #<PendingAppOpGroup _id: 51bae342621d33416a000006, _type: nil, created_at: 2013-06-14 09:32:50 UTC, updated_at: 2013-06-14 09:32:50 UTC, op_type: nil, args: {"features"=>["php-5.3"], "group_overrides"=>[], "init_git_url"=>nil}, parent_op_id: nil, num_gears_added: 0, num_gears_removed: 0, num_gears_created: 0, num_gears_destroyed: 0, num_gears_rolled_back: 0, user_agent: "rhc/1.10.1 (ruby 1.8.7; x86_64-linux) (2.3.2, ruby 1.8.7 (2011-06-30) [x86_64-linux])">
    1 applications were cleaned up.
    0 users were cleaned up.
    0 domains were cleaned up.

When created_at is nil:

    libra_rs:PRIMARY> db.applications.update({name: "phpapp2"}, {$unset:{"pending_op_groups.0.created_at": "" }})

oo-admin-clear-pending-ops just ignores it and does nothing for this app, leaving the app in pending_ops.

Is the above expected?

Just wanted to add some more info that we're seeing in PROD with this.
We have app create loops that run on all of the brokers (that point to themselves) and also one that hits the public broker interface (through the proxies). Since the upgrade, one of our brokers and the external interface check are both dying, saying they can't remove the app after they create it.

I can manually remove the apps using oo-admin-ctl-app -c force-destroy, but I get the same error as above. The app is removed, however. Then, after a few hours, the issue happens again and I have to manually remove the app again.

Root cause: mcollective crashes with SIGABRT and sometimes takes more than 10 minutes to give control back to the broker, by which time our locks have timed out, exposing the application to the next client request in the queue.

Quick fix: Increased the lock timeout to 30 minutes.
Pull request: https://github.com/openshift/origin-server/pull/2908

For QE: This bug will be really hard to reproduce. Only four such occurrences have been reported in 15 days. Since the fix is just a change to the lock timeout value, I am fine if it's a no-op for QE on this bug.

According to comment 9, if this bug does not happen on the next STG/PROD upgrade or deployment by OPS, we are good to close this bug.

Hi, AdamM. According to comment 9, can you help check whether you still hit this issue in STG or PROD during recent deployments? If it is no longer reproduced, can you help move it to verified or closed? Thanks.

*** Bug 979380 has been marked as a duplicate of this bug. ***

Are we safe to verify or close this bug now?

Haven't been able to reproduce this during recent tests on devenv.

This now works in STG: oo-admin-repair --ssh-keys

I believe the bug is now fixed.

Thanks, marking as verified according to comment 15.

Re-opening this bug as we found out this is still an issue in PROD.
    $ sudo /usr/sbin/oo-admin-ctl-app -a mediawiki3 -c destroy -l user_login
    [sudo] password for sturpin:
    !!!! WARNING !!!! WARNING !!!! WARNING !!!!
    You are about to destroy the mediawiki3 application.
    This is NOT reversible, all remote data for this application will be removed.
    Do you want to destroy this application (y/n): y
    /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `[]': can't convert String into Integer (TypeError)
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/factory.rb:38:in `from_db'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:25:in `block in build'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `each'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/builders/embedded/many.rb:23:in `build'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:43:in `create_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:26:in `__build__'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:120:in `block (2 levels) in get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:125:in `_loading'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:115:in `block in get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/threaded/lifecycle.rb:84:in `_building'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:114:in `get_relation'
      from /opt/rh/ruby193/root/usr/share/gems/gems/mongoid-3.0.21/lib/mongoid/relations/accessors.rb:203:in `block in getter'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:1156:in `run_jobs'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:546:in `block in remove_features'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:1298:in `run_in_application_lock'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:544:in `remove_features'
      from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.10.7/app/models/application.rb:575:in `destroy_app'
      from /usr/sbin/oo-admin-ctl-app:126:in `<main>'

Could not trace any new issue with this app through the logs. The app was created on 7th June, well before the fix was found and implemented. This is apparently a leftover of the original bug that was never hand-fixed. All apps affected by the original bug need hand-fixing.

For now, if the purpose is to delete the app, kindly use the '-c force-destroy' option. To hand-clean the app, we need to run a mongo update script that will pop the offending pending_op_group. Let the broker dev team know if that is needed.

Moving this bug to closed again according to comment 18. If it can be reproduced on any new apps, feel free to re-open or file a new bug.

The force-destroy option crashes with the same error. Please kindly provide a script to clean this up.
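Before any hand-cleaning, a read-only scan along these lines could list which parts of an application document look malformed. This is only a sketch on plain Ruby hashes, not the broker team's cleanup script: `find_anomalies` and the sample document are hypothetical, and the three checks simply mirror the anomalies mentioned in this report ("gears" stored as a hash, nil op_type, nil created_at). It modifies nothing.

```ruby
# Hypothetical helper: returns human-readable findings for one application
# document given as a plain Hash (for example, parsed from a JSON export).
# Assumes "group_instances" and "pending_op_groups", when present, are arrays
# of hashes.
def find_anomalies(app)
  findings = []

  (app["group_instances"] || []).each_with_index do |group, i|
    gears = group["gears"]
    # A healthy document stores gears as an Array; the report shows a Hash.
    unless gears.nil? || gears.is_a?(Array)
      findings << "group_instances.#{i}.gears is a #{gears.class}, expected Array"
    end
  end

  (app["pending_op_groups"] || []).each_with_index do |op_group, i|
    findings << "pending_op_groups.#{i}.op_type is nil" if op_group["op_type"].nil?
    findings << "pending_op_groups.#{i}.created_at is nil" if op_group["created_at"].nil?
  end

  findings
end

# Hypothetical sample document, trimmed to the fields the checks look at.
app = {
  "name" => "exampleapp",
  "group_instances" => [
    { "gears" => [{ "uid" => 1899 }] },
    { "gears" => { "0" => { "uid" => 3282 } } }
  ],
  "pending_op_groups" => [
    { "op_type" => nil, "created_at" => "2013-06-14 09:32:50 UTC" }
  ]
}

findings = find_anomalies(app)
findings.each { |line| puts line }
```

Any actual repair (such as popping a pending_op_group) would still need to come from the broker dev team, since the report does not say which entry is the offending one.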