1429747 – miq worker in aborted status

Bug 1429747 - miq worker in aborted status

Summary: miq worker in aborted status

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.8.0
Assignee:	Joe Rafaniello
QA Contact:	luke couzens
Docs Contact:
URL:
Whiteboard:	black
Depends On:
Blocks:	1436854
TreeView+	depends on / blocked

Reported:	2017-03-07 01:34 UTC by Saif Ali
Modified:	2020-04-15 15:27 UTC (History)
CC List:	12 users (show)
Fixed In Version:	5.8.0.8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1436854 (view as bug list)
Environment:
Last Closed:	2017-06-12 16:56:17 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Retrieve host storage devices per-host (17.96 KB, application/x-gzip) 2017-03-15 18:13 UTC, Adam Grare	no flags	Details
View All

Description Saif Ali 2017-03-07 01:34:21 UTC

Description of problem:
miq workers in aborted status except the database appliance 

Version-Release number of selected component (if applicable):
4.2

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 9 Joe Rafaniello 2017-03-15 14:57:51 UTC

This comment is a braindump of various findings... we need to see the miq_queue table and possibly remove some bad rows.


From https://bugzilla.redhat.com/show_bug.cgi?id=1365968#c6

"This commit in rails [1] removed ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer.

Once we upgraded to rails 5.0.0, this class no longer exists.

It is possible that an existing miq_queue row created with rails 4.2 could have one of these private classes serialized in the args column.  Note, it's unclear why we're able to find the row but can't call update_attributes [2] on this row with the error:

ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer

[1] https://github.com/rails/rails/commit/aafee233fb3b4211ee0bfb1fca776c159bd1067e 
[2] https://github.com/ManageIQ/manageiq/blob/darga-5/app/models/miq_queue_worker_base/runner.rb#L60
"

Note, it's unclear how we have private rails classes serialized in columns and why we're able to find the row (without error) but then fail when trying to update it.

For example Gregg and I inserted a dummy miq_queue row with this class in the args column:

---
- !ruby/object:ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
  precision:
  scale:
  limit:
  range: !ruby/range
    begin: -2147483648
    end: 2147483648
    excl: true

I can't even find the row with this type in the args column:

irb(main):026:0> MiqQueue.find_by(:id => 1000000064887)
  MiqQueue Load (0.7ms)  SELECT  "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000064887], ["LIMIT", 1]]
  MiqQueue Inst Including Associations (0.1ms - 1rows)
ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer

Comment 12 Adam Grare 2017-03-15 18:13:12 UTC

Created attachment 1263419 [details]
Retrieve host storage devices per-host

If we need it this is the patch to go back to retrieving host storage devices one host at a time not in one call.

Comment 13 CFME Bot 2017-03-16 21:01:28 UTC

https://github.com/ManageIQ/manageiq/pull/14365

Comment 14 Joe Rafaniello 2017-03-17 15:38:24 UTC

We were provided with the miq_queue export, found that there were 46,000+ rows with the Rails 4.2 constant (ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer), which doesn't exist in Rails 5+.


# select count(*), class_name, method_name from miq_queue where args like '%OID::Integer%' group by class_name, method_name;
 count | class_name  |      method_name
-------+-------------+------------------------
     1 | MiqReport   | _async_generate_tables
 46381 | MiqAeEngine | deliver


The breakdown of what types of work were backed up in the miq_queue were roughly: 


# select count(*), class_name, method_name from miq_queue group by class_name, method_name order by count desc limit 10;
  count  |                     class_name                      |           method_name
---------+-----------------------------------------------------+----------------------------------
 1882819 | MiqEvent                                            | raise_evm_event
  844343 | MiqServer                                           | ntp_reload
   49147 | MiqAeEngine                                         | deliver
   15933 | ManageIQ::Providers::Vmware::InfraManager::Vm       | classify_with_parent_folder_path
    4642 | EmsEvent                                            | add_vc
    2865 | Storage                                             | smartstate_analysis
    2599 | MiqTask                                             | destroy
    1175 | MiqSchedule                                         | invoke_actions
     777 | MiqAlert                                            | evaluate_alerts
     321 | ManageIQ::Providers::Vmware::InfraManager::Template | classify_with_parent_folder_path


We worked with the customer to clear the miq_queue since they didn't need the backed up work.  We then verified the work was being picked up and other end user tasks such as provider refresh and smart state analysis were being started.

We need to prevent the server from providing the same bogus queue message repeatedly to different workers (see comment 13) and possibly a migration/tool to clear these out for the future.

Comment 15 CFME Bot 2017-03-20 21:21:48 UTC

https://github.com/ManageIQ/manageiq/pull/14418

Comment 16 Joe Rafaniello 2017-03-21 16:58:59 UTC

Created a BZ to track the creation of a migration to delete all miq_queue rows with this missing constant:

https://bugzilla.redhat.com/show_bug.cgi?id=1434454
PR: https://github.com/ManageIQ/manageiq/pull/14418

Note, that BZ is euwe/no since we don't want to introduce a migration in existing versions.

This BZ is euwe/yes, so any workers that encounter one of these bad rows will mark it in error so other workers will not try to retry that queue item.

Comment 17 CFME Bot 2017-03-28 02:06:01 UTC

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e21d1b9ca747cdbdf9c90484e47f7a25ed161d6b

commit e21d1b9ca747cdbdf9c90484e47f7a25ed161d6b
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Thu Mar 16 15:43:19 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Mar 17 11:27:28 2017 -0400

    If we can't update_attributes on a queue row, set state to error
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1429747
    
    In the reported bug, we had a Rails 4.2 era class [1] serialized in the args
    column of a miq_queue row.  This class was removed in rails 5.0.0 [2],
    so we'd be unable to deserialize the column with:
    
    ArgumentError: undefined class/module ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
    
    If we're unable to update_attributes because a column can't be
    deserialized, the message can't be handled by a worker, the worker dies,
    and the message remains in the miq_queue for another worker to try and
    also fail on.
    
    Instead, if update_attributes fails, we can try to set just the state
    column to 'error'.  In this way, the server will not try to dispatch the
    same queue multiple times.  We clear errored messages at server boot, so
    we can clean them up then.
    
    [1] ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer
    [2] https://github.com/rails/rails/commit/aafee233fb3b4211ee0bfb1fca776c159bd1067e

 app/models/miq_queue_worker_base/runner.rb       |  3 ++-
 spec/factories/miq_worker.rb                     |  1 +
 spec/models/miq_queue_worker_base/runner_spec.rb | 24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 spec/models/miq_queue_worker_base/runner_spec.rb

Comment 19 luke couzens 2017-04-18 15:24:09 UTC

Hi Joe, whats the best way to try and reproduce this one?

Thanks,
Luke

Comment 20 Joe Rafaniello 2017-04-18 15:58:16 UTC

Luke, you'll need to run an appliance and shut it down with some messages in the miq_queue.  In navicat, pgadmin, psql, etc. you'll need to update a miq_queue row that will be picked up when you start the server again.  You'll need to modify this or a few rows' 'args' column and manually put a YAML string referencing an non-existing ruby class like this:

---
- !ruby/object:JoeObject {}

Make sure you include the new line at the end.  Save this row.  If you want to do a different string, open IRB and you can produce your own class that will not be known by the evm server when you start that:

$ irb
irb(main):001:0> require 'yaml'; class JoeObject; end; puts YAML.dump(JoeObject.new)

--- !ruby/object:JoeObject {}


After you update the miq_queue with one or more rows that will blow up.  You can verify this by running:

bin/rails c

MiqQueue.where(the row id).args


Like this:

irb(main):001:0> MiqQueue.find(1000000000036).reload.args
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 200, connections: 1, in use: 1
  MiqQueue Load (1.0ms)  SELECT  "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000000036], ["LIMIT", 1]]
  MiqQueue Inst Including Associations (13.7ms - 1rows)
  MiqQueue Load (0.5ms)  SELECT  "miq_queue".* FROM "miq_queue" WHERE "miq_queue"."id" = $1 LIMIT $2  [["id", 1000000000036], ["LIMIT", 1]]
  MiqQueue Inst Including Associations (0.1ms - 1rows)
ArgumentError: undefined class/module JoeObject
	from /Users/joerafaniello/.gem/ruby/2.3.4/gems/psych-2.0.17/lib/psych/class_loader.rb:53:in `path2class'
	from /Users/joerafaniello/.gem/ruby/2.3.4/gems/psych-2.0.17/lib/psych/class_loader.rb:53:in `resolve'
...

Once you make sure you have at least one row that will blow up, start the server process and make sure we log the error and move on, whereas previously, the worker would blow up and the message would be tried by a later worker.

Comment 21 luke couzens 2017-04-21 10:35:20 UTC

Verified in 5.8.0.11

Note You need to log in before you can comment on or make changes to this bug.