Bug 1813305 - Engine continuously updates SLA policies of VMs in an environment with no QoS configured
Summary: Engine continuously updates SLA policies of VMs in an environment with no QoS configured
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.8
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Andrej Krejcir
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-13 13:57 UTC by nijin ashok
Modified: 2023-10-06 19:25 UTC
CC List: 6 users

Fixed In Version: rhv-4.4.0-31
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 13:21:58 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:22:36 UTC
oVirt gerrit 108377 0 master MERGED core: Update SLA policy only on needed VMs 2020-09-22 15:22:41 UTC

Description nijin ashok 2020-03-13 13:57:17 UTC
Description of problem:

In an environment with no QoS configured, the engine continuously updates the SLA policy for many VMs. The updates are triggered whenever a user changes any disk configuration. In less than 24 hours, a huge number of these messages accumulated in the engine log.

===
grep "SLA Policy was set" var/log/ovirt-engine/engine.log |wc -l
3085
===

No QoS is configured in the environment.

====
engine=> SELECT * FROM qos_for_vm_view;
(0 rows)

engine=> SELECT * FROM qos_for_disk_profile_view;
(0 rows)
====
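
The same check can also be done through the REST API instead of the engine database. Below is a minimal sketch using the Python SDK (ovirtsdk4); the engine URL and credentials are placeholders.

===
import ovirtsdk4 as sdk

# Placeholder connection details -- adjust for the actual engine.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
try:
    dcs_service = connection.system_service().data_centers_service()
    for dc in dcs_service.list():
        # List the QoS entries defined in each data center.
        qos_list = dcs_service.data_center_service(dc.id).qoss_service().list()
        print(dc.name, [q.name for q in qos_list])  # expected to print empty lists here
finally:
    connection.close()
===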

I am unable to understand why the engine is updating the SLA in an environment without any QoS configured.


Version-Release number of selected component (if applicable):

rhvm-4.3.8.2-0.4.el7.noarch


How reproducible:

Unknown. 

Steps to Reproduce:

1. I was not able to reproduce this in my test environment. However, the customer can reproduce the issue by modifying a disk's configuration.


Actual results:

The engine continuously updates the SLA policies of VMs in an environment with no QoS configured.

Expected results:


Additional info:

Comment 5 Andrej Krejcir 2020-04-01 09:56:48 UTC
The probable cause is that when the disk profile assigned to a disk is changed when updating a disk, the SLA policy is updated for all running VMs that have disks with the new disk profile. This is not efficient and should be optimized to only update VMs that use the updated disk.

In the bug description, it says that this happens when any disk configuration is changed, not only disk profile. So there may be another issue, where the SLA policy is updated even if the assigned disk profile is not changed. In the worst case, the SLA policy is updated for all running VMs that have disks on the storage domain that contains the updated disk.
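
A rough illustration of the difference, using hypothetical names only (this is not the actual ovirt-engine code):

===
# Hypothetical sketch -- 'updated_disk', 'running_vms', 'vm.disks' and
# 'disk_profile_id' are illustrative names, not engine classes.

def vms_to_update_current(updated_disk, running_vms):
    # Current behaviour: every running VM that has any disk using the same
    # disk profile as the updated disk gets a VmSlaPolicyCommand.
    return [vm for vm in running_vms
            if any(d.disk_profile_id == updated_disk.disk_profile_id
                   for d in vm.disks)]

def vms_to_update_optimized(updated_disk, running_vms):
    # Proposed behaviour: only VMs that actually use the updated disk
    # need their SLA policy refreshed.
    return [vm for vm in running_vms
            if any(d.id == updated_disk.id for d in vm.disks)]
===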


A possible workaround could be to write a script that creates multiple disk profiles in a storage domain and assigns them to different disks. This would lower the number of calls to 'VmSlaPolicyCommand' when a disk is later updated.
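
A minimal sketch of such a script with the Python SDK (ovirtsdk4). The engine URL, credentials, storage domain ID, group size, and the exact call used to reassign a disk's profile are assumptions and would need adjusting for the real environment.

===
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
try:
    system = connection.system_service()
    profiles_service = system.disk_profiles_service()
    disks_service = system.disks_service()

    sd_id = 'STORAGE-DOMAIN-UUID'   # placeholder storage domain ID
    disks = [d for d in disks_service.list()
             if d.storage_domains and d.storage_domains[0].id == sd_id]

    # Spread the disks over several profiles so that a later disk update
    # touches fewer VMs.
    group_size = 10                 # arbitrary choice
    for i in range(0, len(disks), group_size):
        profile = profiles_service.add(types.DiskProfile(
            name='workaround_profile_%d' % (i // group_size),
            storage_domain=types.StorageDomain(id=sd_id),
        ))
        for disk in disks[i:i + group_size]:
            # Depending on the version, reassigning the profile may have to
            # go through the VM's disk attachments service instead.
            disks_service.disk_service(disk.id).update(
                types.Disk(disk_profile=types.DiskProfile(id=profile.id)))
finally:
    connection.close()
===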


Also in the logs, there are many failures to access the DB, 'Failed to obtain JDBC Connection'. The failures started after the many calls to VmSlaPolicyCommand so it seems they are not related.

Comment 6 nijin ashok 2020-04-01 13:51:59 UTC
(In reply to Andrej Krejcir from comment #5)
> The probable cause is that when the disk profile assigned to a disk is
> changed when updating a disk, the SLA policy is updated for all running VMs
> that have disks with the new disk profile. This is not efficient and should
> be optimized to only update VMs that use the updated disk.
> 

> In the bug description, it says that this happens when any disk
> configuration is changed, not only disk profile. So there may be another
> issue, where the SLA policy is updated even if the assigned disk profile is
> not changed. In the worst case, the SLA policy is updated for all running
> VMs that have disks on the storage domain that contains the updated disk.
> 
> 
> A possible workaround could be to write a script that creates multiple disk
> profiles in a storage domain and assigns them to different disks. This would
> lower the number of calls to 'VmSlaPolicyCommand' when a disk is later
> updated.

They only have the default disk profile, which is created automatically for the storage domain and has no QoS attached. The environment has 1000+ VMs, and I am not sure that creating a separate disk profile for each disk is a valid workaround. It would also be a huge task, even if automated.

Also, I was never able to reproduce this issue. There may be something specific to the environment as well.

> 
> 
> Also in the logs, there are many failures to access the DB, 'Failed to
> obtain JDBC Connection'. The failures started after the many calls to
> VmSlaPolicyCommand so it seems they are not related.

The huge number of VmSlaPolicyCommand calls exhausted the PostgreSQL connection pool and resulted in the database connection failures, so I think they are related.
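
For reference, one way to watch the engine database connections while these commands pile up is to query pg_stat_activity on the database host. A small sketch, assuming a local 'engine' database (the credentials are typically in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf):

===
import psycopg2

# Placeholder credentials -- take the real ones from the engine's
# 10-setup-database.conf.
conn = psycopg2.connect(dbname='engine', user='engine',
                        password='ENGINE_DB_PASSWORD', host='localhost')
with conn, conn.cursor() as cur:
    # Count connections to the engine database grouped by state.
    cur.execute("SELECT state, count(*) FROM pg_stat_activity "
                "WHERE datname = 'engine' GROUP BY state")
    for state, count in cur.fetchall():
        print(state, count)
conn.close()
===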

Comment 8 Polina 2020-04-27 13:10:43 UTC
Verification on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31.

1. Create a Storage QoS:
   In Compute -> Data Centers -> dc, create a Storage QoS.
   In Storage -> Storage Domains -> test_gluster_0, change the Disk_Profile to use the created QoS.
   (A scripted equivalent of the QoS creation is sketched after these steps.)

As a result, engine.log shows SLA reporting only for the VMs that have disks on this SD.

2020-04-27 12:56:53,111+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM golden_env_mixed_virtio_5 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra]
2020-04-27 12:56:53,127+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM golden_env_mixed_virtio_4 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra]
2020-04-27 12:56:53,206+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM vm_disk_profile_gluster_0 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra_vm_disk_profile_gluster_0]


Then remove this disk_profile and see that only the relevant VMs are mentioned in the log.

2. Edit the Virtual Disk of one of these VMs (check Enable Incremental Backup, Extend size by 1 GiB) and see that this editing doesn't trigger any redundant SLA reporting in engine.log.

3. Add the same created QoS Storage Disk Profile to the same storage.
   Edit the disk alias for one of the VMs with a disk on the SD test_gluster_0.
   Check that this disk update does not trigger SLA reporting.

4. Attach the Storage Disk_Profile to the SD iscsi_2 while we only have VMs created on iscsi_0.
   Check that there is no SLA reporting.

5. Attach the Storage Disk_Profile to nfs_0.
   Check that only the relevant VM is mentioned in the log.

6. Remove the storage policy from the DC. Check that only VMs associated with the relevant SD are reported in engine.log.
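
A scripted equivalent of the QoS creation in step 1, as a minimal sketch with the Python SDK (ovirtsdk4); the data center name and the QoS limits are placeholders.

===
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
try:
    dcs_service = connection.system_service().data_centers_service()
    dc = dcs_service.list(search='name=dc')[0]
    # Create a Storage QoS in the data center.
    qos = dcs_service.data_center_service(dc.id).qoss_service().add(types.Qos(
        name='storage_qos',
        type=types.QosType.STORAGE,
        max_iops=100,               # placeholder limit
    ))
    print('Created QoS', qos.id)
    # The QoS is then attached to the storage domain's disk profile
    # (Storage -> Storage Domains -> test_gluster_0 -> Disk Profiles in the UI).
finally:
    connection.close()
===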

Comment 16 errata-xmlrpc 2020-08-04 13:21:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

