Bug 1575773 - Azure targeted refresh: VM remains in inventory after delete event received
Summary: Azure targeted refresh: VM remains in inventory after delete event received
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: GA
: 5.10.0
Assignee: Ladislav Smola
QA Contact: Brandon Squizzato
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-07 21:10 UTC by Brandon Squizzato
Modified: 2019-08-22 00:54 UTC (History)
9 users (show)

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-07 23:02:11 UTC
Category: Bug
Cloudforms Team: Azure
Target Upstream Version:


Attachments (Terms of Use)
evm.log.gz (1.29 MB, application/x-gzip)
2018-05-07 21:11 UTC, Brandon Squizzato
no flags Details
new upstream log (466.31 KB, application/x-gzip)
2018-05-09 18:17 UTC, Brandon Squizzato
no flags Details
new downstream log (1.52 MB, application/x-gzip)
2018-05-09 18:17 UTC, Brandon Squizzato
no flags Details
5.10.0.4 azure log (107.44 KB, application/x-gzip)
2018-07-18 15:26 UTC, Brandon Squizzato
no flags Details
5.10.0.4 evm debug log (73.70 KB, application/x-gzip)
2018-07-18 15:27 UTC, Brandon Squizzato
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:0212 None None None 2019-02-07 23:02:18 UTC

Description Brandon Squizzato 2018-05-07 21:10:10 UTC
Description of problem:
A VM deleted on Azure is not archived in CFME after the delete event is received from the provider

Version-Release number of selected component (if applicable):
Hit on 5.9.2.4.20180501195858_35dc609

How reproducible:
100%

Steps to Reproduce:
1. Create a VM in azure
2. Wait for virtualMachines_write_EndRequest event to arrive in evm.log
3. Wait for VM to show up in instance inventory (verifies targeted refresh is working for VM creation)
4. Wait a few more minutes
5. Delete the VM (directly in Azure)
6. Wait for virtualMachines_delete_EndRequest to arrive in evm.log
7. Wait for VM to be removed/archived from instances list

Actual results:
VM remains in inventory, with archival state/power state unchanged

Expected results:
VM is archived

Additional info:
See attached evm.log -- look for events related to 'temp-bsquizz2'.
* At 16:32:31 the delete_BeginRequest comes in for the VM and its associated resources, and we see targeted refreshes queued.
* At 16:34:56 we see an 'evaluate_alerts' log for 'vm_destroy' on this VM
* There seems to be a round of targeted refreshes done at 16:35:51 
* Around 16:40 the VM is still showing in the inventory, so I initiate a manual provider refresh, at 16:42:47 the VM status is properly updated.

Comment 2 Brandon Squizzato 2018-05-07 21:11:07 UTC
Created attachment 1432832 [details]
evm.log.gz

Comment 3 Ladislav Smola 2018-05-09 12:26:10 UTC
Hello, this is what I see in my event list:

in rails c:

pp EmsEvent.pluck(:id, :timestamp, :event_type, :vm_ems_ref)

produces:
[1009,
  Wed, 09 May 2018 12:17:23 UTC +00:00,
  "virtualMachines_delete_BeginRequest",
  "2586c64b-38b4-4527-a140-012d49dfc02c\\miq-azure-test2\\microsoft.compute/virtualmachines\\ladastest8"],
 [1010,
  Wed, 09 May 2018 12:17:25 UTC +00:00,
  "virtualMachines_delete_EndRequest",
  "2586c64b-38b4-4527-a140-012d49dfc02c\\miq-azure-test2\\microsoft.compute/virtualmachines\\ladastest8"],
 [1020,
  Wed, 09 May 2018 12:19:51 UTC +00:00,
  "virtualMachines_delete_EndRequest",
  "2586c64b-38b4-4527-a140-012d49dfc02c\\miq-azure-test2\\microsoft.compute/virtualmachines\\ladastest8"]]

---------

So I am getting the virtualMachines_delete_EndRequest twice. First time it arrives, the Vm is not really deleted and it seems like the API response is still returning it's running. The second time it arrives, it is actually deleted, so that causes archival in CFME.

Do you see just 1 virtualMachines_delete_EndRequest ?

Comment 4 Brandon Squizzato 2018-05-09 15:38:47 UTC
From the evm.log attached here -- seems like I received 3 events actually, but spaced further apart:

[----] I, [2018-05-07T16:32:31.748637 #12567:581114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourceGroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/temp-bsquizz2

[----] I, [2018-05-07T16:34:37.361368 #12567:581114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/temp-bsquizz2

[----] I, [2018-05-07T16:44:29.316545 #12567:581114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/temp-bsquizz2

That last one comes in 12 min after the first, and 4 min after I did a full provider refresh which had already updated the VM status to the correct state (archived), so I'm not sure if that third event would have led to the proper change or not.

Comment 5 Ladislav Smola 2018-05-09 16:04:52 UTC
Could you retest this? For me, the second event leads to VM archival. (I tried both master and g-release). For me second event came in about 2-5 minutes.

The first event is too fast, the Vm is not deleted yet, so the API returns the Vm is still running.

Comment 6 Brandon Squizzato 2018-05-09 18:17:17 UTC
Created attachment 1433993 [details]
new upstream log

Comment 7 Brandon Squizzato 2018-05-09 18:17:40 UTC
Created attachment 1433994 [details]
new downstream log

Comment 8 Brandon Squizzato 2018-05-09 18:20:51 UTC
Tested again on latest nightly (May 8) and a 5.9.2.4 appliance (evm log attached)

On upstream 2 events came in. After the second, the VM power state went to 'unknown' but the VM was never archived. Around 15min later I did a full refresh and the VM was archived.

$ zgrep "bsquizz-temp" new-evm.log-upstream | grep "Caught event virtualMachines_delete_EndRequest"
[----] I, [2018-05-09T13:26:00.130894 #16170:885130]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourceGroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-temp
[----] I, [2018-05-09T13:28:30.618341 #16170:885130]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-temp


On the downstream test 3 events came in. Similar behavior to above. After the second event, power state went to 'unknown'. Even after the 3rd event came in, the VM remains unarchived.


$ zgrep "bsquizz-anothervm" new-evm.log-downstream | grep "Caught event virtualMachines_delete_EndRequest"
[----] I, [2018-05-09T13:58:24.986737 #34093:649114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourceGroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-anothervm
[----] I, [2018-05-09T14:00:15.273860 #34093:649114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-anothervm
[----] I, [2018-05-09T14:10:14.849364 #34093:649114]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-anothervm

Comment 10 Ladislav Smola 2018-05-10 08:44:46 UTC
Interesting, in my env I see it being archived. It's all about timing. The VM is archived when we do API query and Azure doesn't return the VM.


I might be able a tweak the targeted refresh a bit, but the question is, if it's correct to archive it on our side, when the Vm still exists in Azure?

Could check if you'll see this behavior also for other entities?

Comment 11 Brandon Squizzato 2018-05-10 12:51:18 UTC
I could see in azure.log for these scenarios that we sent GET requests for the deleted VMs and received HTTP 404 responses, I am assuming this is what caused the power state to switch to 'unknown', but at that point shouldn't we have considered the VM removed from azure?

Comment 12 Ladislav Smola 2018-05-10 13:15:34 UTC
The archival is driven by fetching Vm (code here: https://github.com/Ladas/manageiq-providers-azure/blob/3d0b18c8380d5fdf6863f1a4105c1bf3ff2ee953/app/models/manageiq/providers/azure/inventory/collector/target_collection.rb#L135), fetching of the state is done separately after that.

Theoretically, we could take the missing status as archived Vm, I need to verify with Dan Berger.

Comment 14 CFME Bot 2018-05-15 12:06:53 UTC
New commit detected on ManageIQ/manageiq-providers-azure/master:

https://github.com/ManageIQ/manageiq-providers-azure/commit/949a23dd25ab883a87fe5d5b430442d135e9434d
commit 949a23dd25ab883a87fe5d5b430442d135e9434d
Author:     Ladislav Smola <lsmola@redhat.com>
AuthorDate: Mon May 14 03:43:59 2018 -0400
Commit:     Ladislav Smola <lsmola@redhat.com>
CommitDate: Mon May 14 03:43:59 2018 -0400

    Archive VM with no status

    Archive VM with no status, which should solve that VMs are not
    sometimes archived by targeted refresh. We receive event that VM
    was deleted, but when doing API query, the Vm is still returned.
    When fetching Vm's state, we do get uknown state (azure returns 404)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1575773

 app/models/manageiq/providers/azure/inventory/parser/cloud_manager.rb | 5 +-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comment 15 Brandon Squizzato 2018-07-18 15:26:28 UTC
Created attachment 1459737 [details]
5.10.0.4 azure log

Comment 16 Brandon Squizzato 2018-07-18 15:27:20 UTC
Created attachment 1459738 [details]
5.10.0.4 evm debug log

Testing this on 5.10.0.4, I still do not see the VM get archived after it is deleted in Azure. See the attached logs (the '5.10.0.4' evm and azure logs) for the VM named 'bsquizz-rhel74-new'. The VM is created, and once it appears in CFME, I delete it on Azure. The delete event comes in here:

[----] I, [2018-07-18T11:18:24.045091 #30997:3d8f74]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-rhel74-new

Comment 17 Ladislav Smola 2018-07-18 15:46:32 UTC
Brandon, do you see more of these events? I usually saw 1 event, where the VM state was active or deleting, then another event, when it was really deleted.

Comment 18 Brandon Squizzato 2018-07-18 17:59:55 UTC
Yes there does appear to be 2 delete_EndRequest events in the log for the VM

[----] I, [2018-07-18T11:16:08.867400 #30997:3d8f74]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourceGroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-rhel74-new


[----] I, [2018-07-18T11:18:24.045091 #30997:3d8f74]  INFO -- : MIQ(ManageIQ::Providers::Azure::CloudManager::EventCatcher::Runner#process_event) EMS [] as [086e2864-00b5-49bb-83fe-354d7a32bfe3] Caught event virtualMachines_delete_EndRequest for /subscriptions/c9e72ccc-b20e-48bd-a0c8-879c6dbcbfbb/resourcegroups/bsquizz-resource-grp/providers/Microsoft.Compute/virtualMachines/bsquizz-rhel74-new

Comment 19 Ladislav Smola 2018-07-19 07:26:33 UTC
Brandon, can you give me access to this appliance? Both UI and ssh please. There is something wrong with it, I see that the targeted refresh is not running properly, instead it runs the old full refresh.

Comment 20 Ladislav Smola 2018-07-19 07:32:39 UTC
Looks like the graph refresh is not enabled. Did you disable it manually? It should be enabled by default, so if you haven't changed that, we might have some issues with the advanced settings.

Comment 21 Brandon Squizzato 2018-07-19 13:11:55 UTC
I didn't disable it myself, I'll double check the settings. That appliance was a temp test appliance. I'll spin up another today, reproduce, and leave it up for you to log into.

Comment 22 Ladislav Smola 2018-07-19 13:20:11 UTC
Wait, it's disabled by default. https://github.com/Ladas/manageiq-providers-azure/blob/835c86dfb83eb1a59c896f6d553494810d6f15bc/config/settings.yml#L56

I need to check with Bronagh, what should be the default.

So now you'll need to enable it, to verify this BZ

Comment 23 Bronagh Sorota 2018-07-19 13:28:31 UTC
It was disabled by default as a temporary measure, the plan was to enable it after it has tested in the field and we had a chance to get feedback. It was included in 4.6. I think its safe to enable it by default now.

Comment 24 Ladislav Smola 2018-07-19 13:30:00 UTC
https://github.com/ManageIQ/manageiq-providers-azure/pull/280 enabling it by default here

Comment 25 Brandon Squizzato 2018-07-19 18:36:04 UTC
Thanks. Testing again with that option enabled, but Azure events seem to be coming in very slowly today, so I'll wait and try verifying again later.

By the time the event comes in, the VM is fully deleted, and we see:
[----] D, [2018-07-19T14:28:14.965676 #54374:7a353ac] DEBUG -- : MIQ(ManageIQ::Providers::Azure::Inventory::Collector::TargetCollection#safe_targeted_request) Record not found Error Class=Azure::Armrest::NotFoundException, Message=The Resource 'Microsoft.Compute/virtualMachines/tmp-b-ubuntu' under resource group 'bsquizz-resource-grp' was not found.

So I'm not currently hitting the same scenario as before (where the VM is there but with an unknown state)

Comment 26 Brandon Squizzato 2018-08-03 18:21:18 UTC
Azure is behaving better now -- verified on CFME 5.10.0.8 -- VM is deleted, delete event arrived, targeted refresh occurred but we did NOT hit the "NotFoundException" (since VM was still in the process of being deleted), but VM state switched to 'Archived' in the UI.

Comment 27 errata-xmlrpc 2019-02-07 23:02:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212


Note You need to log in before you can comment on or make changes to this bug.