Description of problem:
miq workers in aborted status except the database appliance

Version-Release number of selected component (if applicable):
4.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This comment is a braindump of various findings... we need to see the miq_queue table and possibly remove some bad rows.

From https://bugzilla.redhat.com/show_bug.cgi?id=1365968#c6

"This commit in rails [1] removed ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer. Once we upgraded to rails 5.0.0, this class no longer exists. It is possible that an existing miq_queue row created with rails 4.2 could have one of these private classes serialized in the args column.

Note, it's unclear why we're able to find the row but can't call update_attributes [2] on this row with the error:

ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer

[1] https://github.com/rails/rails/commit/aafee233fb3b4211ee0bfb1fca776c159bd1067e
[2] https://github.com/ManageIQ/manageiq/blob/darga-5/app/models/miq_queue_worker_base/runner.rb#L60
"

Note, it's unclear how we have private rails classes serialized in columns and why we're able to find the row (without error) but then fail when trying to update it.

For example, Gregg and I inserted a dummy miq_queue row with this class in the args column:

    ---
    - !ruby/object:ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
      precision:
      scale:
      limit:
      range: !ruby/range
        begin: -2147483648
        end: 2147483648
        excl: true

I can't even find the row with this type in the args column:

    irb(main):026:0> MiqQueue.find_by(:id => 1000000064887)
      MiqQueue Load (0.7ms)  SELECT "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000064887], ["LIMIT", 1]]
      MiqQueue Inst Including Associations (0.1ms - 1rows)
    ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
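The failure mode described above can be reproduced outside of Rails. The following is only a minimal sketch: LegacyIntegerType and load_any_class are hypothetical names invented for illustration, and removing the constant merely simulates a class disappearing in an upgrade. It does not explain why the row could be found but not updated in the original report.

```ruby
require 'yaml'

# Hypothetical class standing in for a framework-internal type that a later
# release removes. It is not a real Rails class.
class LegacyIntegerType
  def initialize
    @limit = 4
  end
end

# Newer Psych versions make YAML.load safe by default, so use unsafe_load
# where it exists and fall back to load on older versions.
def load_any_class(yaml)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)
end

# Serialize an args-style array while the class is still defined.
serialized = YAML.dump([LegacyIntegerType.new])

# While the class exists, the payload round-trips.
restored = load_any_class(serialized)

# Simulate the upgrade: the constant goes away, the stored YAML does not.
Object.send(:remove_const, :LegacyIntegerType)

error_message =
  begin
    load_any_class(serialized)
    nil
  rescue ArgumentError => e
    e.message
  end

puts error_message
```

The stored string is unchanged between the two loads; only the set of classes known to the process differs.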
Created attachment 1263419
Retrieve host storage devices per-host

If we need it, this is the patch to go back to retrieving host storage devices one host at a time rather than in one call.
https://github.com/ManageIQ/manageiq/pull/14365
We were provided with the miq_queue export and found that there were 46,000+ rows with the Rails 4.2 constant (ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer), which doesn't exist in Rails 5+.

    # select count(*), class_name, method_name from miq_queue where args like '%OID::Integer%' group by class_name, method_name;
     count | class_name  |      method_name
    -------+-------------+------------------------
         1 | MiqReport   | _async_generate_tables
     46381 | MiqAeEngine | deliver

The breakdown of the types of work backed up in the miq_queue was roughly:

    # select count(*), class_name, method_name from miq_queue group by class_name, method_name order by count desc limit 10;
      count  |                     class_name                      |           method_name
    ---------+-----------------------------------------------------+----------------------------------
     1882819 | MiqEvent                                            | raise_evm_event
      844343 | MiqServer                                           | ntp_reload
       49147 | MiqAeEngine                                         | deliver
       15933 | ManageIQ::Providers::Vmware::InfraManager::Vm       | classify_with_parent_folder_path
        4642 | EmsEvent                                            | add_vc
        2865 | Storage                                             | smartstate_analysis
        2599 | MiqTask                                             | destroy
        1175 | MiqSchedule                                         | invoke_actions
         777 | MiqAlert                                            | evaluate_alerts
         321 | ManageIQ::Providers::Vmware::InfraManager::Template | classify_with_parent_folder_path

We worked with the customer to clear the miq_queue since they didn't need the backed up work. We then verified that work was being picked up and that other end user tasks, such as provider refresh and smart state analysis, were being started.

We need to prevent the server from providing the same bogus queue message repeatedly to different workers (see comment 13) and possibly provide a migration/tool to clear these out in the future.
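The query above matches one known class name. A more general check could scan a raw args string for serialized Ruby class tags and report the ones the running process cannot resolve. This is only a sketch: RUBY_OBJECT_TAG and missing_serialized_classes are hypothetical names, not part of ManageIQ, and the result is only meaningful when run inside the same application process (for example a rails console), where constants that are not yet autoloaded could show up as false positives.

```ruby
# Matches tags such as "!ruby/object:Some::Nested::Class" and captures the
# constant path. Hypothetical helper, not ManageIQ code.
RUBY_OBJECT_TAG = %r{!ruby/object:((?:[A-Z]\w*)(?:::[A-Z]\w*)*)}

# Returns the class names referenced by the raw YAML that are not defined
# in the current Ruby process. The YAML is never deserialized.
def missing_serialized_classes(raw_yaml)
  raw_yaml.scan(RUBY_OBJECT_TAG).flatten.uniq.reject do |name|
    Object.const_defined?(name)
  end
end

sample = "---\n- !ruby/object:NoSuchThing::Inner {}\n- !ruby/object:String {}\n"
puts missing_serialized_classes(sample).inspect
```

Because the string is only pattern-matched, the check cannot itself raise the undefined class/module error.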
https://github.com/ManageIQ/manageiq/pull/14418
Created a BZ to track the creation of a migration to delete all miq_queue rows with this missing constant: https://bugzilla.redhat.com/show_bug.cgi?id=1434454

PR: https://github.com/ManageIQ/manageiq/pull/14418

Note, that BZ is euwe/no since we don't want to introduce a migration in existing versions. This BZ is euwe/yes: any worker that encounters one of these bad rows will mark it as errored so that other workers will not retry that queue item.
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e21d1b9ca747cdbdf9c90484e47f7a25ed161d6b

commit e21d1b9ca747cdbdf9c90484e47f7a25ed161d6b
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Mar 16 15:43:19 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 17 11:27:28 2017 -0400

    If we can't update_attributes on a queue row, set state to error

    https://bugzilla.redhat.com/show_bug.cgi?id=1429747

    In the reported bug, we had a Rails 4.2 era class [1] serialized in the
    args column of a miq_queue row. This class was removed in rails 5.0.0 [2],
    so we'd be unable to deserialize the column with:

    ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer

    If we're unable to update_attributes because a column can't be
    deserialized, the message can't be handled by a worker, the worker dies,
    and the message remains in the miq_queue for another worker to try and
    also fail on.

    Instead, if update_attributes fails, we can try to set just the state
    column to 'error'. In this way, the server will not try to dispatch the
    same queue multiple times. We clear errored messages at server boot, so
    we can clean them up then.

    [1] ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
    [2] https://github.com/rails/rails/commit/aafee233fb3b4211ee0bfb1fca776c159bd1067e

 app/models/miq_queue_worker_base/runner.rb       |  3 ++-
 spec/factories/miq_worker.rb                     |  1 +
 spec/models/miq_queue_worker_base/runner_spec.rb | 24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 spec/models/miq_queue_worker_base/runner_spec.rb
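The fallback described in the commit message could be pictured roughly as follows. This is a plain-Ruby sketch of the pattern only, not the actual change to runner.rb: FakeQueueRow, update_all_attributes, update_state_only and claim are hypothetical stand-ins, and the state names other than 'error' are placeholders.

```ruby
require 'yaml'

# Hypothetical stand-in for a queue row. NOT the ManageIQ MiqQueue model.
class FakeQueueRow
  attr_reader :state, :raw_args

  def initialize(raw_args)
    @raw_args = raw_args
    @state = 'ready'
  end

  # Mimics a full update that has to deserialize the args column first.
  def update_all_attributes(attrs)
    loader = YAML.respond_to?(:unsafe_load) ? :unsafe_load : :load
    YAML.public_send(loader, raw_args) # raises ArgumentError for unknown classes
    @state = attrs.fetch(:state)
    true
  end

  # Mimics a single-column write that never touches args.
  def update_state_only(new_state)
    @state = new_state
  end
end

# Try to claim the row for processing. If the args cannot be deserialized,
# flag the row as errored so it is not handed to another worker, and return
# nil instead of letting the exception kill the caller.
def claim(row)
  row.update_all_attributes(:state => 'dequeue')
  row
rescue ArgumentError => e
  warn "cannot claim row: #{e.message}"
  row.update_state_only('error')
  nil
end

good = FakeQueueRow.new("---\n- 1\n")
bad  = FakeQueueRow.new("---\n- !ruby/object:MissingClass {}\n")
claim(good)
claim(bad)
puts good.state
puts bad.state
```

The point of writing only the state column in the rescue path is that it avoids the deserialization step that failed in the first place.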
Hi Joe, what's the best way to try and reproduce this one?

Thanks,
Luke
Luke, you'll need to run an appliance and shut it down with some messages in the miq_queue.

In navicat, pgadmin, psql, etc., you'll need to update a miq_queue row that will be picked up when you start the server again. Modify the 'args' column of this row (or of a few rows) and manually put in a YAML string referencing a non-existent ruby class, like this:

    ---
    - !ruby/object:JoeObject {}

Make sure you include the new line at the end. Save this row.

If you want to use a different string, open IRB and produce your own class that will not be known by the evm server when you start it:

    $ irb
    irb(main):001:0> require 'yaml'; class JoeObject; end; puts YAML.dump(JoeObject.new)
    --- !ruby/object:JoeObject {}

After you update the miq_queue with one or more rows that will blow up, you can verify this by running:

    bin/rails c
    MiqQueue.find(the row id).args

Like this:

    irb(main):001:0> MiqQueue.find(1000000000036).reload.args
    PostgreSQLAdapter#log_after_checkout, connection_pool: size: 200, connections: 1, in use: 1
      MiqQueue Load (1.0ms)  SELECT "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000000036], ["LIMIT", 1]]
      MiqQueue Inst Including Associations (13.7ms - 1rows)
      MiqQueue Load (0.5ms)  SELECT "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000000036], ["LIMIT", 1]]
      MiqQueue Inst Including Associations (0.1ms - 1rows)
    ArgumentError: undefined class/module JoeObject
        from /Users/joerafaniello/.gem/ruby/2.3.4/gems/psych-2.0.17/lib/psych/class_loader.rb:53:in `path2class'
        from /Users/joerafaniello/.gem/ruby/2.3.4/gems/psych-2.0.17/lib/psych/class_loader.rb:53:in `resolve'
        ...

Once you have at least one row that will blow up, start the server process and make sure we log the error and move on. Previously, the worker would blow up and the message would be tried again by a later worker.
Verified in 5.8.0.11