Bug 1672338 - VM retirement called from Service Bundle retirement does not complete successfully.
Summary: VM retirement called from Service Bundle retirement does not complete success...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Provisioning
Version: 5.9.7
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: GA
: cfme-future
Assignee: drew uhlmann
QA Contact: Sudhir Mallamprabhakara
Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-02-04 15:58 UTC by Prasad Mukhedkar
Modified: 2022-03-13 16:55 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-06 18:34:29 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Comment 2 drew uhlmann 2019-02-04 16:19:29 UTC
We're currently working on this.

Comment 4 Tina Fitzgerald 2019-02-04 19:17:26 UTC
Hi Prasad,

The errors shown in the problem description above are not the cause of the reported retirement issue.

I believe the Service VM retirement issue is caused by custom automate code.

The problematic_service screen shot shows the VM, xcdcvmoas927(id = 34000000002287) as archived.
The customer logs show an error in the VM retirement initiated by service retirement.

The following log lines are from the evm_archive_xcucvidas906.es.onpremise.bancsabadell.com_20190204_091314.tar.xz automation log. 

The GBS domain is used to determine which custom vm retirement state machine will be used:
[----] I, [2019-02-04T09:03:59.424524 #28636:71b114]  INFO -- : Following Relationship [miqaedb:/infrastructure/VM/Lifecycle/Retirement?vm_id=34000000002287#create]
[----] I, [2019-02-04T09:03:59.771902 #28636:71b114]  INFO -- : Updated namespace [miqaedb:/infrastructure/VM/Lifecycle/Retirement?vm_id=34000000002287#create  GBS/infrastructure/VM]

The following line was taken from the GBS/infrastructure/VM/Lifecycle/Retirement instance:  ***** Notice that the vm.operating_system.product_name is the value used:
/Infrastructure/VM/Retirement/StateMachines/VMRetirement_${/#vm.operating_system.product_name}/Default_${/#vm.operating_system.product_name}

At 9:04, substitution was used to determine the state machine. 

[----] I, [2019-02-04T09:03:59.881031 #28636:71b114]  INFO -- : Following Relationship [miqaedb:/Infrastructure/VM/Retirement/StateMachines/VMRetirement_rhel_7x64/Default_rhel_7x64#create]
[----] I, [2019-02-04T09:04:03.208298 #28636:71b114]  INFO -- : Updated namespace [miqaedb:/Infrastructure/VM/Retirement/StateMachines/VMRetirement_rhel_7x64/Default_rhel_7x64#create  GBS/Infrastructure/VM/Retirement/StateMachines]
[----] I, [2019-02-04T09:04:03.354285 #32108:71b114]  INFO -- : User [ll014773.com] with current group ID [34000000000035] name [GrpApp_CloudForms_Admin]


At 9:05, the VM was removed from the provider:

----] I, [2019-02-04T09:05:08.014622 #3592:71b114]  INFO -- : MiqAeEvent.build_evm_event >> event=<"vm_destroy"> inputs=<{:ext_management_systems=>#<ManageIQ::Providers::Redhat::InfraManager id: 34000000000017, name: "RHEV-CDVU", created_on: "2018-06-25 20:36:09", updated_on: "2019-02-04 08:03:20", guid: "7669f8c9-afd2-4d6f-9cda-1452ba3ad138", zone_id: 34000000000004, type: "ManageIQ::Providers::Redhat::InfraManager", api_version: "4.1.10.3-0.1.el7", uid_ems: nil, host_default_vnc_port_start: nil, host_default_vnc_port_end: nil, provider_region: nil, last_refresh_error: nil, last_refresh_date: "2019-02-04 08:03:20", provider_id: nil, realm: nil, tenant_id: 34000000000001, project: nil, parent_ems_id: nil, subscription: nil, last_metrics_error: nil, last_metrics_update_date: nil, last_metrics_success_date: nil, tenant_mapping_enabled: nil, enabled: true, options: {}>, :ems_event=>#<EmsEvent id: 34000001188461, event_type: "USER_REMOVE_VM_FINISHED", message: "VM xcdcvmoas927 was successfully removed.", timestamp: "2019-02-04 08:04:47", host_name: nil, host_id: nil, vm_name: "xcdcvmoas927", vm_location: "/rhev/data-center/5ab25536-0285-0235-000b-00000000...", vm_or_template_id: 34000000002287, dest_host_name: nil, dest_host_id: nil, dest_vm_name: nil, dest_vm_location: nil, dest_vm_or_template_id: nil, source: "RHEVM", chain_id: nil, ems_i


At 9:06, a substitution error while processing the same VM retirement on a retry. (This is the same call that worked at 9:04).
Substitution will fail because the VM doesn't exist, which makes sense because it was removed a minute ago. 

 
[----] I, [2019-02-04T09:06:13.748612 #32108:71b114]  INFO -- : Following Relationship [miqaedb:/infrastructure/VM/Lifecycle/Retirement?vm_id=34000000002287#create]
[----] I, [2019-02-04T09:06:14.094893 #32108:71b114]  INFO -- : Updated namespace [miqaedb:/infrastructure/VM/Lifecycle/Retirement?vm_id=34000000002287#create  GBS/infrastructure/VM]
[----] W, [2019-02-04T09:06:14.177254 #32108:71b114]  WARN -- : Error during substitution: undefined method `product_name' for nil:NilClass
[----] W, [2019-02-04T09:06:14.179697 #32108:71b114]  WARN -- : Error during substitution: undefined method `product_name' for nil:NilClass
[----] I, [2019-02-04T09:06:14.179891 #32108:71b114]  INFO -- : Following Relationship [miqaedb:/Infrastructure/VM/Retirement/StateMachines/VMRetirement_/Default_#create]
[----] E, [2019-02-04T09:06:19.478866 #32108:71b114] ERROR -- : Class [Infrastructure/VM/Retirement/StateMachines/VMRetirement_] not found in MiqAeDatastore
[----] I, [2019-02-04T09:06:19.479030 #32108:71b114]  INFO -- : Followed  Relationship [miqaedb:/Infrastructure/VM/Retirement/StateMachines/VMRetirement_/Default_#create]
[----] I, [2019-02-04T09:06:19.479586 #32108:71b114]  INFO -- : Followed  Relationship [miqaedb:/infrastructure/VM/Lifecycle/Retirement?vm_id=34000000002287#create]
[----] I, [2019-02-04T09:06:19.479992 #32108:71b114]  INFO -- : Followed  Relationship [miqaedb:/System/Event/MiqEvent/POLICY/request_vm_retire#create]
[----] I, [2019-02-04T09:06:22.510580 #32108:71b114]  INFO -- : User [ll014773.com] with current group ID [34000000000035] name [GrpApp_CloudForms_Admin]

Can you ask the customer to disable all of the custom Automate domains, and provision/retire the service again? If the problem persists, can you ask them to send us logs showing the VM lifecycle retirement working properly, and the service vm retirement not working properly?

Let me know if you have any questions. 

Thanks,
Tina

Comment 6 Tina Fitzgerald 2019-02-05 19:20:10 UTC
Hi Prasad,

Retirement uses state machines which are explained in Chapter 13 of the Mastering CloudForms Automation Guide.
https://access.redhat.com/webassets/avalon/d/Mastering_CloudForms_Automation_Red_Hat/Mastering_CloudForms_Automation_Red_Hat.pdf
Chapter 13 also includes a section on state variables which is a good way for the customer to store the vm.operating_system.product_name value to be used after the vm has been deleted.

On a very high level, a state machine is a special type of class that contains "states" which are a special relationship type.
State machines are typically used to process workflows, and a major advantage of a state machines is that it has the ability to do a retry.
Among other things, the VM retirement state machine processing calls for removing the vm from the management system.  Since this can take some time, the state machine retries every minute(default value) to see if the vm is still is connected to the management system, and the state machine will proceed to the next state once the vm is no longer connected to the management system.  

"If we are doing the substitution correctly, why does it fail?" Your substitution doesn't take into consideration that the vm will be removed from the management system in the middle of the retirement state machine. The vm is not considered fully retired until the retirement state machine completes.

      
"I don't understand why the substitution is done again after the VM has been retired." The vm is not fully retired until the end of the retirement state machine.
 

"I thought that the state machine is called only once and execute all its methods." The state machine is called once initially, and is called again for every retry. The process knows where it left off, and where it should continue.

"Why this substitution is being done twice?" This substitution is done every time the state machines retries. 

Let me know if you have any questions.

Thanks,
Tina

Comment 7 drew uhlmann 2019-02-06 18:18:47 UTC
So this shouldn't be closed, or we should open a new bug addressing the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1672338#c0 because that is a valid issue although it doesn't relate to the main problem of the case it's attached to. 

https://bugzilla.redhat.com/show_bug.cgi?id=1672338#c0 has work here: https://github.com/ManageIQ/manageiq/pull/18437

Comment 8 drew uhlmann 2019-02-06 18:22:55 UTC
Anyone looking at this ticket, please ignore all comments except for the first (https://bugzilla.redhat.com/show_bug.cgi?id=1672338#c0) because they're about a different, unrelated issue.

Comment 9 drew uhlmann 2019-02-06 18:34:29 UTC
Closing cause Tina wants to open a new ticket.

Comment 10 Tina Fitzgerald 2019-02-06 18:39:45 UTC
Opened new ticket to track user not found issue; https://bugzilla.redhat.com/show_bug.cgi?id=1673143


Note You need to log in before you can comment on or make changes to this bug.