Description of problem:

In an environment that does not have any QoS configured, the engine is continuously updating the SLA policy for many VMs. The updates are triggered whenever the user changes any disk configuration. In less than 24 hours, a huge number of these messages accumulated in the engine log:

===
grep "SLA Policy was set" var/log/ovirt-engine/engine.log | wc -l
3085
===

No QoS is configured in the environment:

====
engine=> SELECT * FROM qos_for_vm_view;
 id | qos_type | name | description | storage_pool_id | max_throughput | max_read_throughput | max_write_throughput | max_iops | max_read_iops | max_write_iops | _create_date | _update_date | cpu_limit | inbound_average | inbound_peak | inbound_burst | outbound_average | outbound_peak | outbound_burst | out_average_linkshare | out_average_upperlimit | out_average_realtime | vm_id
----+----------+------+-------------+-----------------+----------------+---------------------+----------------------+----------+---------------+----------------+--------------+--------------+-----------+-----------------+--------------+---------------+------------------+---------------+----------------+-----------------------+------------------------+----------------------+-------
(0 rows)

engine=> SELECT * FROM qos_for_disk_profile_view;
 id | qos_type | name | description | storage_pool_id | max_throughput | max_read_throughput | max_write_throughput | max_iops | max_read_iops | max_write_iops | _create_date | _update_date | cpu_limit | inbound_average | inbound_peak | inbound_burst | outbound_average | outbound_peak | outbound_burst | out_average_linkshare | out_average_upperlimit | out_average_realtime | disk_profile_id
----+----------+------+-------------+-----------------+----------------+---------------------+----------------------+----------+---------------+----------------+--------------+--------------+-----------+-----------------+--------------+---------------+------------------+---------------+----------------+-----------------------+------------------------+----------------------+-----------------
(0 rows)
====

I am unable to understand why the engine is updating SLA policies in an environment without any QoS configured.

Version-Release number of selected component (if applicable):
rhvm-4.3.8.2-0.4.el7.noarch

How reproducible:
Unknown.

Steps to Reproduce:
1. I was not able to reproduce this in my test environment. However, the customer can reproduce the issue by modifying the disk configuration.

Actual results:
The engine continuously updates the SLA policies of VMs in an environment that does not have any QoS configured.

Expected results:

Additional info:
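As additional context, a rough per-hour breakdown of the same messages can be produced with plain shell. This is only a diagnostic sketch; it assumes the standard log location /var/log/ovirt-engine/engine.log (the relative path in the grep above comes from a collected log bundle):

===
# Count "SLA Policy was set" events per hour; the cut keeps the
# "YYYY-MM-DD HH" prefix of each engine.log timestamp.
grep "SLA Policy was set" /var/log/ovirt-engine/engine.log \
  | cut -d: -f1 | sort | uniq -c
===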
The probable cause is that when the disk profile assigned to a disk is changed while updating the disk, the SLA policy is updated for all running VMs that have disks with the new disk profile. This is not efficient and should be optimized to update only the VMs that use the updated disk.

The bug description says this happens when any disk configuration is changed, not only the disk profile. So there may be another issue, where the SLA policy is updated even if the assigned disk profile has not changed. In the worst case, the SLA policy is updated for all running VMs that have disks on the storage domain that contains the updated disk.

A possible workaround could be to write a script that creates multiple disk profiles in a storage domain and assigns them to different disks (see the sketch below). This would lower the number of calls to 'VmSlaPolicyCommand' when a disk is later updated.

Also, the logs contain many failures to access the DB, 'Failed to obtain JDBC Connection'. The failures started after the many calls to VmSlaPolicyCommand, so they do not seem to be related.
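For illustration only, a rough outline of such a workaround script against the REST API could look like the one below. The engine URL, credentials, storage domain ID, and the disk_to_profile.txt mapping file are all placeholders, and the exact endpoints and payloads should be verified against the installed API version; this is an untested sketch of the idea, not a validated workaround:

===
#!/bin/bash
# Untested sketch: spread the disks of one storage domain across several
# disk profiles so that a later disk update touches fewer running VMs.

API="https://engine.example.com/ovirt-engine/api"   # placeholder engine URL
AUTH="admin@internal:password"                      # placeholder credentials
SD_ID="<storage-domain-uuid>"                       # placeholder storage domain ID

# 1. Create a few extra disk profiles on the storage domain.
for i in 1 2 3 4; do
  curl -sk -u "$AUTH" -H "Content-Type: application/xml" \
       -X POST "$API/storagedomains/$SD_ID/diskprofiles" \
       -d "<disk_profile><name>profile_$i</name></disk_profile>"
done

# 2. Assign each disk to one of the new profiles.
#    disk_to_profile.txt holds one "disk-uuid profile-uuid" pair per line.
while read -r DISK_ID PROFILE_ID; do
  curl -sk -u "$AUTH" -H "Content-Type: application/xml" \
       -X PUT "$API/disks/$DISK_ID" \
       -d "<disk><disk_profile id=\"$PROFILE_ID\"/></disk>"
done < disk_to_profile.txt
===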
(In reply to Andrej Krejcir from comment #5)
> The probable cause is that when the disk profile assigned to a disk is
> changed when updating a disk, the SLA policy is updated for all running VMs
> that have disks with the new disk profile. This is not efficient and should
> be optimized to only update VMs that use the updated disk.
>
> A possible workaround could be to write a script that creates multiple disk
> profiles in a storage domain and assigns them to different disks. This would
> lower the number of calls to 'VmSlaPolicyCommand' when a disk is later
> updated.

They only have the default disk profile that is created with the storage domain, and it has no QoS attached. The environment has 1000+ VMs, and I am not sure creating a separate disk profile for each disk is a valid workaround; it would be a huge task even if we automated it. Also, I was never able to reproduce this issue, so there may be something specific to the environment as well.

> Also in the logs, there are many failures to access the DB, 'Failed to
> obtain JDBC Connection'. The failures started after the many calls to
> VmSlaPolicyCommand so it seems they are not related.

The huge number of VmSlaPolicyCommand calls exhausted the PostgreSQL connection pool and resulted in the database connection failures, so I think they are related.
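As a side note on the connection failures, how saturated the database connections are can be checked with standard PostgreSQL views. The 'engine' database name is the default and may differ in a given setup:

===
# Run on the database host as the postgres user.
# Connections to the engine database, grouped by state:
psql -d engine -c "SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'engine' GROUP BY state;"
# Server-wide limit to compare against:
psql -d engine -c "SHOW max_connections;"
===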
Verification on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31.

1. Create a Storage QoS: in Compute -> Data Centers -> dc, create a Storage QoS. In Storage -> Storage Domains -> test_gluster_0, change the disk profile to use the created QoS. As a result, engine.log shows SLA reporting only for the VMs created on this SD:

2020-04-27 12:56:53,111+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM golden_env_mixed_virtio_5 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra]
2020-04-27 12:56:53,127+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM golden_env_mixed_virtio_4 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra]
2020-04-27 12:56:53,206+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173718) [b3f44dd8-eef0-4571-8605-622b3fac5773] EVENT_ID: VM_SLA_POLICY_STORAGE(10,551), VM vm_disk_profile_gluster_0 SLA Policy was set. Storage policy changed for disks: [latest-rhel-guest-image-8.2-infra_vm_disk_profile_gluster_0]

Then remove this disk profile and verify that only the relevant VMs are mentioned in the log.

2. Edit the virtual disk of one of these VMs (check "Enable Incremental Backup", extend the size by 1 GiB) and verify that this edit does not trigger any redundant SLA reporting in engine.log.

3. Add the same Storage QoS disk profile to the same storage domain again. Edit the disk alias of one of the VMs with a disk on the SD test_gluster_0. Check that this disk update does not trigger SLA reporting.

4. Attach the disk profile to the SD iscsi_2 while the only VMs are created on iscsi_0. Check that there is no SLA reporting.

5. Attach the disk profile to nfs_0. Check that only the relevant VM is mentioned in the log.

6. Remove the storage QoS from the DC. Check that only the VMs associated with the relevant SD are reported in engine.log.
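For reference, a simple way to confirm during these steps that only the expected VMs are reported is a plain grep over engine.log, keyed on the VM_SLA_POLICY_STORAGE event shown above (the log path assumes the standard location):

===
# List the VMs that received an SLA policy update, with a count per VM.
grep "VM_SLA_POLICY_STORAGE" /var/log/ovirt-engine/engine.log \
  | sed -n 's/.*, VM \([^ ]*\) SLA Policy was set.*/\1/p' \
  | sort | uniq -c
===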
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247