Bug 1599754

Summary: Multiple user logins logged every second from Global DB UI to RegionDB
Product: Red Hat CloudForms Management Engine Reporter: Ryan Spagnola <rspagnol>
Component: ApplianceAssignee: Brandon Dunne <bdunne>
Status: CLOSED CURRENTRELEASE QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium Docs Contact:
Priority: high    
Version: 5.9.0CC: abellott, bdunne, cpelland, dmetzger, jprause, myoder, obarenbo, simaishi, tpapaioa
Target Milestone: GAKeywords: TestOnly, ZStream
Target Release: 5.11.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: 5.11.0.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1686035 (view as bug list) Environment:
Last Closed: 2019-12-13 15:16:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1595269, 1679295, 1686035    

Description Ryan Spagnola 2018-07-10 13:51:25 UTC
Description of problem:
We keep seeing user logins every few seconds for one user.  When we check the logs, we see they're API logins from the Global DB UI appliance.


Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
Repeatedly

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I think we found the fix.  I went into the global db and did:

SELECT * FROM public.miq_queue where method_name = 'invoke_tasks_remote'

It returned 5 tasks that just seemed to keep recreating themselves.  

I then stopped EVM in the global region and ran:

DELETE * FROM public.miq_queue where method_name = 'invoke_tasks_remote'

I then started EVM back up.  I am no longer seeing the logins on the Danbury region.

Appears some MIQ code that seems to just recreate tasks when they fail.  In this case, it appears that if it fails due to a resource not being found, it should just fail, not keep trying. Or a least fail after a certain number of tries.  This has been retrying since at least 6/17 it appears.

Comment 4 CFME Bot 2018-08-06 21:36:17 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/a6343cfb71f22b050156bedff73eed55cd33190e
commit a6343cfb71f22b050156bedff73eed55cd33190e
Author:     Brandon Dunne <brandondunne>
AuthorDate: Mon Jul 23 14:38:46 2018 -0400
Commit:     Brandon Dunne <brandondunne>
CommitDate: Mon Jul 23 14:38:46 2018 -0400

    Avoid raising and re-queueing when the remote resource is not found

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1599754

 Gemfile | 2 +-
 app/models/mixins/process_tasks_mixin.rb | 11 +-
 spec/models/mixins/process_tasks_mixin_spec.rb | 13 +
 3 files changed, 22 insertions(+), 4 deletions(-)

Comment 5 Tasos Papaioannou 2018-10-05 18:09:52 UTC
There is still a window that allows the original issue to occur, between (1) and (2) in the code below:

./app/models/mixins/process_tasks_mixin.rb:

****
      if resource_ids.present?
        resource_ids.each do |id|
          begin
(1) -->     obj = collection.find(id)
          rescue ManageIQ::API::Client::ResourceNotFound => err
            _log.error(err.message)
          else
            _log.info("Invoking task #{action} on collection #{collection_name}, object #{obj.id}, with args #{post_args}")
            begin
(2) -->       obj.send(action, post_args)
            rescue NoMethodError => err
              _log.error(err.message)
            end
          end
        end
****

I can simulate the issue by adding a short sleep() call between (1) and (2) on the global appliance, then:

1.) Submit a VM start/stop action from the global appliance
2.) While the global appliance's invoke_api_tasks sleeps, remove the VM from the VMDB in the remote appliance's web UI.

An exception gets logged, and the task gets requeued indefinitely. The first exception looks like this:

****
[----] I, [2018-10-05T11:51:07.705759 #2157:e28f74]  INFO -- : MIQ(Vm.invoke_api_tasks) Invoking task start on collection vms, object 117, with args {}
[----] E, [2018-10-05T11:51:07.739843 #2157:e28f74] ERROR -- : An error occurred while invoking remote tasks...Requeueing for 1 minute from now.
[----] E, [2018-10-05T11:51:07.740151 #2157:e28f74] ERROR -- : [RuntimeError]: Couldn't find Vm with 'id'=117 [WHERE "vms"."type" IN ('Vm', 'ManageIQ::Provide
rs::CloudManager::Vm', 'ManageIQ::Providers::InfraManager::Vm', 'VmServer', 'ManageIQ::Providers::Amazon::CloudManager::Vm', 'ManageIQ::Providers::Azure::Clou
dManager::Vm', 'ManageIQ::Providers::Google::CloudManager::Vm', 'ManageIQ::Providers::Openstack::CloudManager::Vm', 'ManageIQ::Providers::Vmware::CloudManager
::Vm', 'VmXen', 'ManageIQ::Providers::Kubevirt::InfraManager::Vm', 'ManageIQ::Providers::Redhat::InfraManager::Vm', 'ManageIQ::Providers::Microsoft::InfraMana
ger::Vm', 'ManageIQ::Providers::Vmware::InfraManager::Vm') AND "vms"."template" = $1]  Method:[rescue in block in invoke_tasks_remote]
[----] E, [2018-10-05T11:51:07.740310 #2157:e28f74] ERROR -- : /opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/connection.rb:108:in
 `check_response'
/opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/connection.rb:102:in `send_request'
/opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/connection.rb:34:in `post'
/opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/client.rb:65:in `post'
/opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/resource.rb:60:in `exec_action'
/opt/rh/cfme-gemset/gems/manageiq-api-client-0.3.1/lib/manageiq/api/client/resource.rb:46:in `method_missing'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:131:in `block in invoke_api_tasks'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:121:in `each'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:121:in `invoke_api_tasks'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:70:in `block in invoke_tasks_remote'

[...]

[----] I, [2018-10-05T11:51:07.750884 #2157:e28f74]  INFO -- : MIQ(MiqQueue.put) Message id: [99000000000350],  id: [], Zone: [default], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Vm.invoke_tasks_remote], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-10-05 15:52:07 UTC], Data: [], Args: [{:ids=>[117], :task=>"start", :userid=>"admin", :miq_task_id=>nil}]
****

All later exceptions look like:

****
[----] E, [2018-10-05T13:30:16.497205 #2165:e28f74] ERROR -- : An error occurred while invoking remote tasks...Requeueing for 1 minute from now.
[----] E, [2018-10-05T13:30:16.498234 #2165:e28f74] ERROR -- : [NoMethodError]: undefined method `id' for nil:NilClass  Method:[rescue in block in invoke_tasks_remote]
[----] E, [2018-10-05T13:30:16.498374 #2165:e28f74] ERROR -- : /var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:129:in `block in invoke_api_tasks'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:121:in `each'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:121:in `invoke_api_tasks'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:70:in `block in invoke_tasks_remote'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:65:in `each'
/var/www/miq/vmdb/app/models/mixins/process_tasks_mixin.rb:65:in `invoke_tasks_remote'
/var/www/miq/vmdb/app/models/miq_queue.rb:455:in `block in dispatch_method'

[...]

/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'
[----] I, [2018-10-05T13:30:16.510014 #2165:e28f74]  INFO -- : MIQ(MiqQueue.put) Message id: [99000000000713],  id: [], Zone: [default], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Vm.invoke_tasks_remote], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-10-05 17:31:16 UTC], Data: [], Args: [{:ids=>[117], :task=>"start", :userid=>"admin", :miq_task_id=>nil}]
****

Comment 6 Brandon Dunne 2018-10-05 19:40:50 UTC
Hi Tasos,

In real life, there is only a fraction of a second between the time that we find the resource using the api client and send the action to it.  In the edge case that the resource is found then disappears before we can send the action, yes it would raise an exception that is unhandled and get requeued once.  However, the second time around the remote resource will not be found, no exception should be raised, and it should not get requeued again.  Can you please verify this?

As for the undefined method `id' for nil:NilClass, I'd need access to the appliance to see what's going on there.

Comment 8 CFME Bot 2018-11-08 17:19:57 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/30c46272c33ddb3432a85cb99fd3898a1c5b5114
commit 30c46272c33ddb3432a85cb99fd3898a1c5b5114
Author:     Brandon Dunne <brandondunne>
AuthorDate: Thu Nov  8 10:44:36 2018 -0500
Commit:     Brandon Dunne <brandondunne>
CommitDate: Thu Nov  8 10:44:36 2018 -0500

    Upgrade to the latest manageiq-api-client

    Catch other 404 errors and raise them as ResourceNotFound
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1599754

 Gemfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 9 CFME Bot 2018-11-08 17:21:11 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/30c46272c33ddb3432a85cb99fd3898a1c5b5114
commit 30c46272c33ddb3432a85cb99fd3898a1c5b5114
Author:     Brandon Dunne <brandondunne>
AuthorDate: Thu Nov  8 10:44:36 2018 -0500
Commit:     Brandon Dunne <brandondunne>
CommitDate: Thu Nov  8 10:44:36 2018 -0500

    Upgrade to the latest manageiq-api-client

    Catch other 404 errors and raise them as ResourceNotFound
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1599754

 Gemfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 12 Brandon Dunne 2019-02-26 16:53:22 UTC
Fixed by https://github.com/ManageIQ/manageiq/pull/18491

Comment 13 CFME Bot 2019-02-26 18:50:58 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/1f074d7d020842b4239f3d2f85134e3bd32280c0
commit 1f074d7d020842b4239f3d2f85134e3bd32280c0
Author:     Brandon Dunne <brandondunne>
AuthorDate: Tue Feb 26 11:48:52 2019 -0500
Commit:     Brandon Dunne <brandondunne>
CommitDate: Tue Feb 26 11:48:52 2019 -0500

    Update to the latest manageiq-api-client

    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1599754

 Gemfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 15 Tasos Papaioannou 2019-05-02 19:17:37 UTC
Verified on 5.11.0.2.