Bug 1483303 - Service order request for VM provision from template fails on SSL certificate verification
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Juan Hernández
QA Contact: Angelina Vasileva
URL:
Whiteboard: rhev
Depends On:
Blocks: 1486351
 
Reported: 2017-08-20 09:29 UTC by Ilanit Stein
Modified: 2019-10-22 14:27 UTC (History)
9 users (show)

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1486351 (view as bug list)
Environment:
Last Closed: 2018-03-06 14:41:45 UTC
Category: Bug
Cloudforms Team: RHEVM
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
evm.log (383.66 KB, application/x-gzip)
2017-08-20 09:41 UTC, Ilanit Stein
no flags Details
automation.log (64.23 KB, application/x-gzip)
2017-08-20 09:42 UTC, Ilanit Stein
no flags Details
engine.log (724.41 KB, application/x-gzip)
2017-08-20 09:42 UTC, Ilanit Stein
no flags Details
RHV server.log (46.77 KB, application/x-gzip)
2017-08-20 12:10 UTC, Ilanit Stein
no flags Details
Provision failed but without certificate error (95.45 KB, image/png)
2017-11-16 15:56 UTC, Jan Zmeskal
no flags Details
Logs from CFME (6.92 MB, application/x-tar)
2017-11-16 15:56 UTC, Jan Zmeskal
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1483973 0 medium CLOSED Services order request failure is not reflected in the Self Service UI 2021-02-22 00:41:40 UTC

Internal Links: 1483973

Description Ilanit Stein 2017-08-20 09:29:01 UTC
Description of problem:
For an RHV provider, a service request for VM provision from template (Native) fails with this last message:

"Server [EVM] Service [rhv_item2-20170820-035050] Step [checkprovisioned] Status [Error Creating Service] Message [[RestClient::SSLCertificateNotVerified]: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed]"

The VM is created OK on the RHV provider, but it is not started.

The same service request passes if the provider was added with TLS_Verify: No
(no ca_certificate is provided).

Version-Release number of selected component (if applicable):
CFME-5.8.1.5/RHV-4.1.5

How reproducible:
Reproduced in 2 trials. 

Steps to Reproduce:
1. Add an RHV provider and provide a ca_certificate.
2. Create a service for VM provision from Native, and order it.

Actual results:


Expected results:
The service VM provision should also pass when a ca_certificate is provided, which is the more common case.

Additional info:

Comment 2 Ilanit Stein 2017-08-20 09:41:51 UTC
Created attachment 1315900 [details]
evm.log

Comment 3 Ilanit Stein 2017-08-20 09:42:20 UTC
Created attachment 1315901 [details]
automation.log

Comment 4 Ilanit Stein 2017-08-20 09:42:45 UTC
Created attachment 1315902 [details]
engine.log

Comment 6 Ilanit Stein 2017-08-20 12:10:56 UTC
Created attachment 1315940 [details]
RHV server.log

Comment 7 Aleksei Slaikovskii 2017-08-24 15:17:35 UTC
Hi!

I also want to add some info here.

I use an RHV provider with SSL enabled, and I got this error message when I tried to provision a VM from an ISO data storage, although PXE provisioning works like a charm.

Thanks.

Comment 8 Juan Hernández 2017-08-24 18:54:02 UTC
The objects created by the 'ovirt' gem contain a reference to the connection (the '@service' instance variable) that was used to retrieve them:

  https://github.com/ManageIQ/ovirt/blob/cf6fd72f2fb3ef7ff6d7881e3c55ee1e3c5f74d5/lib/ovirt/base.rb#L242

When a related object is retrieved, for example when the NICs of a VM are retrieved following the "nics" relationship, the same connection is copied to the new objects:

  https://github.com/ManageIQ/ovirt/blob/cf6fd72f2fb3ef7ff6d7881e3c55ee1e3c5f74d5/lib/ovirt/base.rb#L292

This means that the VM and the NICs all share a reference to the same connection.

The 'with_provider_connection' method of the oVirt provider closes the connection after using it:

  https://github.com/ManageIQ/manageiq-providers-ovirt/blob/51a735b1b12fb79c88040d313b86d4cc6a7b35e5/app/models/manageiq/providers/redhat/infra_manager/api_integration.rb#L132

The 'with_provider_connection' method is used by the provider to retrieve the NICs, using the connection of the VM:

  https://github.com/ManageIQ/manageiq-providers-ovirt/blob/51a735b1b12fb79c88040d313b86d4cc6a7b35e5/app/models/manageiq/providers/redhat/infra_manager/ovirt_services/strategies/v3.rb#L73-L77

So the net result is that the connection of the VM is used to retrieve the NICs, and copied to all of them, but then it is closed, rendering the reference inside the VM and all the NICs useless. The code then tries to update the NIC using its connection, and it fails.

This is only detected when using SSL with a custom certificate, because the custom certificate is the only state that the 'ovirt' connection has, and the only thing that is discarded when the connection is closed.

So, conceptually, the problem is that our code is trying to use connections that have been closed, in many places. That is hard to solve. It may make sense to remove the close method of the connection and always preserve the custom certificate.
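The stale-reference failure mode described above can be sketched with plain Ruby. These class names are hypothetical (this is not the actual 'ovirt' gem code); the point is only the shape of the bug: child objects copy the parent's connection reference, so closing that connection invalidates all of them at once.

```ruby
# Minimal sketch of the stale-connection problem (hypothetical classes,
# NOT the real 'ovirt' gem API).

class Connection
  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  def get(path)
    raise IOError, "connection closed" if @closed
    "response for #{path}"
  end
end

class BaseObject
  # Shared connection, like the '@service' instance variable in the gem.
  attr_reader :service

  def initialize(service)
    @service = service
  end

  # Related objects are built with the SAME connection reference,
  # mirroring how the gem copies '@service' to child objects.
  def related
    BaseObject.new(@service)
  end
end

conn = Connection.new
vm   = BaseObject.new(conn)
nic  = vm.related            # shares vm's connection
conn.close                   # 'with_provider_connection' closes it after use
begin
  nic.service.get("/nics/1") # the later NIC update now fails
rescue IOError => e
  puts "NIC update failed: #{e.message}"
end
```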

Comment 9 Juan Hernández 2017-08-25 09:13:46 UTC
The proposed solution to this issue is the following pull request:

  Create trusted CA certificates file when needed
  https://github.com/ManageIQ/ovirt/pull/88

That only involves the 'ovirt' gem; no change is needed in the provider.
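The general idea of the fix, as described by the pull request title, can be sketched as follows. This is illustrative code only, not the actual patch: the custom CA certificate is written to a file once and kept, so that any new connection can be built from it even after a previous connection object has been closed.

```ruby
require "tempfile"

# Illustrative sketch (not the real ovirt gem code): persist the custom
# CA certificate to a file on first use and reuse that file afterwards.
class CaStore
  def initialize(ca_certificate)
    @ca_certificate = ca_certificate
  end

  # Lazily create the trusted CA file and return the same path every time.
  # Keeping the Tempfile in an instance variable prevents it from being
  # garbage collected (and its file deleted) while the store is alive.
  def ca_file
    @ca_tempfile ||= Tempfile.new(["trusted-ca", ".pem"]).tap do |file|
      file.write(@ca_certificate)
      file.flush
    end
    @ca_tempfile.path
  end
end

store = CaStore.new("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n")
store.ca_file # same path on every call, usable for any new connection
```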

Comment 10 Juan Hernández 2017-08-26 10:22:47 UTC
Version 0.18.0 of the 'ovirt' gem has been released, containing this fix. Now the provider needs to be updated to use that version of the gem:

  Require version 0.18 or newer of ovirt gem
  https://github.com/ManageIQ/manageiq-providers-ovirt/pull/82
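For a consuming project, the dependency bump would look something like this Gemfile fragment (illustrative; the actual change is in the pull request above):

```ruby
# Require the release of the 'ovirt' gem that contains the fix
gem "ovirt", ">= 0.18"
```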

Comment 11 Juan Hernández 2017-08-28 13:40:51 UTC
The following pull request will be needed if this is finally approved for the `fine` branch, as the dependency on the 'ovirt' gem is in the 'manageiq-gems-pending' repository in the 'fine' branch:

  [FINE] Require version 0.18 or newer of ovirt gem
  https://github.com/ManageIQ/manageiq-gems-pending/pull/265

Comment 13 Jan Zmeskal 2017-11-16 15:55:17 UTC
I followed the reproduction steps. I made sure that the provider is set up with a working certificate. I did not get the error message reported by Ilanit.
Instead, the request resulted in an error with this message: "[EVM] VM [test_native_clone_provision] Step [Provision] Status [Creating VM] Message [Applied PreProvision Customizations]". This definitely does not look like an informative error message. See the attached screenshot for other details.

The VM was actually provisioned on RHV side and was powered off.

Logs from CFME are attached in archive.

CFME version: 5.9.0.8.20171109215303_ed87902

Comment 14 Jan Zmeskal 2017-11-16 15:56:13 UTC
Created attachment 1353601 [details]
Provision failed but without certificate error

Comment 15 Jan Zmeskal 2017-11-16 15:56:56 UTC
Created attachment 1353602 [details]
Logs from CFME

Comment 16 Juan Hernández 2017-11-16 19:27:11 UTC
I think that this is the relevant part of the `evm.log` file:

[----] I, [2017-11-16T10:02:57.673780 #2922:dbb12c]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [270], MiqWorker id: [1], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Redhat::InfraManager::Provision", :object_id=>1, :attrs=>{"request"=>"vm_provision"}, :instance_name=>"AUTOMATION", :user_id=>1, :miq_group_id=>3, :tenant_id=>2}], Dequeued in: [14.496988151] seconds
[----] I, [2017-11-16T10:02:57.674769 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(MiqQueue#deliver) Message id: [270], Delivering...
[----] I, [2017-11-16T10:02:57.735219 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(MiqAeEngine.deliver) Delivering {"request"=>"vm_provision"} for object [ManageIQ::Providers::Redhat::InfraManager::Provision.1] with state [] to Automate
[----] I, [2017-11-16T10:12:53.600084 #2922:dad2e8]  INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Provision#execute_queue) Queuing VM Provisioning: [Provision from [golden_mixed_virtio_template] to [test_native_clone_provision]]...
[----] I, [2017-11-16T10:12:53.636094 #2922:dad2e8]  INFO -- : MIQ(MiqQueue.put) Message id: [317],  id: [], Zone: [default], Role: [ems_operations], Server: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [ManageIQ::Providers::Redhat::InfraManager::Provision.execute], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: []
[----] I, [2017-11-16T10:12:57.756328 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(ManageIQ::Providers::Redhat::InfraManager::Provision#after_ae_delivery) ae_result="error"
[----] E, [2017-11-16T10:12:57.856825 #2922:dbb12c] ERROR -- : Q-task_id([miq_provision_1]) MIQ(MiqQueue#deliver) Message id: [270], timed out after 600.181915036 seconds.  Timeout threshold [600]
[----] I, [2017-11-16T10:12:57.857486 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(MiqGenericWorker::Runner#deliver_queue_message) MIQ(MiqGenericWorker::Runner) Reconnecting to DB after timeout error during queue deliver
[----] I, [2017-11-16T10:12:57.969708 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(MiqQueue#delivered) Message id: [270], State: [timeout], Delivered in [600.294911613] seconds
[----] I, [2017-11-16T10:12:58.083048 #2922:dbb12c]  INFO -- : Q-task_id([miq_provision_1]) MIQ(MiqGenericWorker#log_status) [Generic Worker] Worker ID [1], PID [2922], GUID [118cacc5-11c1-4c74-928f-c04f30088295], Last Heartbeat [2017-11-16 15:02:57 UTC], Process Info: Memory Usage [378580992], Memory Size [717234176], Proportional Set Size: [319758000], Memory % [4.62], CPU Time [53913.0], CPU % [12.86], Priority [30]
[----] E, [2017-11-16T10:12:58.083642 #2922:dbb12c] ERROR -- : Q-task_id([miq_provision_1]) MIQ(MiqGenericWorker::Runner) ID [1] PID [2922] GUID [118cacc5-11c1-4c74-928f-c04f30088295] Exiting worker due to timeout error Worker exiting.

As you can see, there is a period of almost 10 minutes, from 10:02:57 to 10:12:53, where apparently nothing happened. The request had a timeout of 600 seconds (10 minutes), so shortly after that the process was killed and the request aborted. During that time there was activity between CFME and RHV, but not enough to justify such a long delay.
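The size of the gap can be checked directly from the two timestamps in the log excerpt above:

```ruby
require "time"

# Idle gap between the 'Delivering...' log line and the next
# 'Queuing VM Provisioning' line, taken from the evm.log excerpt.
delivering = Time.parse("2017-11-16T10:02:57.735219")
queuing    = Time.parse("2017-11-16T10:12:53.600084")
gap = queuing - delivering
puts format("idle gap: %.1f s of the %d s timeout", gap, 600)
# → idle gap: 595.9 s of the 600 s timeout
```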

Is this the same kind of environment where we saw high CPU use in CFME?

To move forward I would like to try to repeat this myself, using the same CFME and RHV environments. Is that possible?

Note that this is most likely a problem different from the original one; you may want to open a new bug.

Comment 17 Jan Zmeskal 2017-11-16 20:04:25 UTC
(In reply to Juan Hernández from comment #16)

It is true that this might have been caused by the CPU problem. It is indeed the same environment. However, I had to clean it, since we needed to run the integration test suite there. I will try to replicate this in a fresh environment and, if the problem persists, I will open a new bug.

Comment 18 Jan Zmeskal 2017-11-20 12:53:55 UTC
(In reply to Juan Hernández from comment #16)

I tried the same steps on 5.9.0.9.20171115202245_7429f75 with a clean env and everything went smoothly.

