Bug 1266158 - RHEV-M marks host as non-operational and says it has a problem with a storage domain, but vdsm reports no problems with that storage domain [NEEDINFO]
RHEV-M marks host as non-operational and says it has a problem with a storage...
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.5.4
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: nobody nobody
virt
:
Depends On: migration_improvements
Blocks:
  Show dependency treegraph
 
Reported: 2015-09-24 12:01 EDT by Allan Voss
Modified: 2016-01-19 12:00 EST (History)
13 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-13 07:43:56 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
michal.skrivanek: needinfo? (avoss)


Attachments (Terms of Use)
vdsm.conf (238 bytes, text/plain)
2015-11-09 11:24 EST, Douglas Duckworth
no flags Details

  None (edit)
Description Allan Voss 2015-09-24 12:01:34 EDT
Description of problem:
RHEV-M marks host as non-operational and says it has a problem with a storage domain, but vdsm reports no problems with that storage domain

Version-Release number of selected component (if applicable):
rhevm-3.5.4.2-1.3.el6ev.noarch  
vdsm-4.16.26-1.el6ev.x86_64  

How reproducible:
Unknown

Steps to Reproduce:
1. configure new host
2. activate new host
3. Migrate VMs to the new host

Actual results:
Host eventually is marked as non-responsive. No issues are logged in vdsm.log on the host

Expected results:
Host should either continue to run normally, or vdsm should log a problem

Additional info:
Comment 5 Douglas Duckworth 2015-10-20 14:13:24 EDT
Do we need to add logs to this Bugzilla?
Comment 6 Omer Frenkel 2015-10-22 08:57:39 EDT
please attach engine.log for the relevant time
Comment 10 Tomas Jelinek 2015-11-06 05:38:04 EST
Looking into the logs it looks like:

- there are lots of VMs (~30) migrating to the tulhv1p06

- this load causes various issues on more hosts:
  2015-09-22 11:04:19,899 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) [32262ed4] Failed vds listing,  vds = c23ea72a-5476-4d60-a4a8-b436a95a079d : tulhv1p14, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Unexpected exception, code = 16

  2015-09-22 11:04:19,936 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) [1b136eca] Failed vds listing,  vds = 0a142853-0347-4cab-85af-6cfa1d4c0726 : tulhv1p12, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Unexpected exception, code = 16

  2015-09-22 11:08:27,875 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-39) [843e5b8] Command ListVDSCommand(HostName = tulhv1p06, HostId = 58a592d0-8810-446e-b01f-f8b448c0bcd9, vds=Host[tulhv1p06,58a592d0-8810-446e-b01f-f8b448c0bcd9]) execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused)
  
- this problems caused that 
  2015-09-22 11:06:11,836 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-45) vds tulhv1p06 reported domain f2bb7851-b2ed-495d-9460-84a0e97c016e:COM_3TB6 as in problem, attempting to move the vds to status NonOperational

  and consequently:

  2015-09-22 11:07:02,757 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-49) [39410bac] Correlation ID: 39410bac, Job ID: e58d67b6-a6ff-476b-b126-e275162b4cf8, Call Stack: null, Custom Event ID: -1, Message: Host tulhv1p06 cannot access the Storage Domain(s) COM_3TB6 attached to the Data Center tulvp1p03. Setting Host state to Non-Operational.

- but later the status of the host was set back to up
  2015-09-22 11:10:04,561 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-12) [67280370] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host tulhv1p06 was set to Up.

Can you please confirm that the cluster moved back to normal after the migration storm finished? 

If yes, than please note that in the timeframe of 4.0 the migrations are going to be changed significantly especially targeting issues like this one:

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1252426
Feature Page: http://www.ovirt.org/Features/Migration_Enhancements
Gerrit topic: https://gerrit.ovirt.org/#/q/topic:migration-enhancements
Comment 11 Douglas Duckworth 2015-11-06 09:54:50 EST
The cluster usually returns to normal state with Host being set as Up though we usually manually restart vdsmd on the Host to get that back in place sooner since we actually never see the loss of storage or paused VMs.

It looks like we set "migration_max_bandwidth" too high in vdsm.conf given we thought the unit was megabits though that's not the case per https://access.redhat.com/solutions/108813# 


[vars]
ssl = true

migration_timeout = 60000

migration_max_time_per_gib_mem = 0

migration_max_bandwidth = 9500

max_outgoing_migrations = 8

We should obviously lower it to a more appropriate level given the transfer network runs over converged 10Gb 802.3ad bond.
Comment 12 Douglas Duckworth 2015-11-06 16:20:34 EST
Thanks,

We will review all of the items that were posted to this BZ. 

After lowering the VDSM migration threshold, and only migrating one VM at a time, we're still seeing messages in Engine Log like:

2015-11-06 15:07:48,361 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-11) domain 290158f5-4182-4072-abe9-7951b9d24c72:COM_3TB3 in problem. vds: tulhv1p06
2015-11-06 15:08:03,934 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 00f4c946-f133-4908-98c5-f64e352616b9:COM_3TB4 in problem. vds: tulhv1p06
2015-11-06 15:08:03,935 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain f2bb7851-b2ed-495d-9460-84a0e97c016e:COM_3TB6 in problem. vds: tulhv1p06
2015-11-06 15:08:03,936 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 8f2a3efd-5abc-42f7-a232-d482bfc74860:COM_3TB7 in problem. vds: tulhv1p06
2015-11-06 15:08:03,937 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain a35aa9f8-1576-4de2-9956-145ae934ae20:COM_3TB9 in problem. vds: tulhv1p06
2015-11-06 15:08:03,938 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain bfe870f4-eea0-4729-8ebf-3f2ee5a8c512:COM_3TB12 in problem. vds: tulhv1p06
2015-11-06 15:08:03,939 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain dbaf4d1d-571f-4873-9dde-93c7bb0296a4:COM_3TB10 in problem. vds: tulhv1p06
2015-11-06 15:08:03,941 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2f69c2c5-0c16-454f-a429-26c8c7e4011c:COM_3TB11 in problem. vds: tulhv1p06
2015-11-06 15:08:03,942 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 975fe25e-73db-46e2-96ef-ab6d6c6550a8:COM_DR1 in problem. vds: tulhv1p06
2015-11-06 15:08:03,942 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2eecf753-a5cf-4af6-a5c1-e2e0295cab6d:COM_3TB8 in problem. vds: tulhv1p06
2015-11-06 15:08:03,943 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2fe689d9-a9fa-4347-ae1f-4022cc52fb5b:COM_SSD in problem. vds: tulhv1p06
2015-11-06 15:08:03,944 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 13f748ea-a005-45ac-abb9-ef64da7dfaae:COM_4TB1 in problem. vds: tulhv1p06
2015-11-06 15:08:03,945 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain a52b71eb-6707-4361-ae38-02dacf3db1f5:COM_3TB2 in problem. vds: tulhv1p06

As well as VM's being sent to Unknown status then later recovering.

Using iftop during migration, we do not see throughput approaching wire speed of the bonded connection.

During some migrations we actually saw storage errors, similar to the above, populating Engine log before VM were sent over transfer network.
Comment 13 Tomas Jelinek 2015-11-09 02:42:49 EST
could you please attach the engine logs for this case as well as the source and dest vdsm logs and the vdsm.conf (source/dest)?

Could you please also check the CPU comsumption of the source/dest host?
Maybe there is a different problem than the bandwidth one...
Comment 14 Douglas Duckworth 2015-11-09 11:24 EST
Created attachment 1091834 [details]
vdsm.conf
Comment 17 Tomas Jelinek 2015-11-10 03:53:49 EST
The migration_max_bandwidth = 1150 seems to be still quite high since the default is 32. I can imagine that this can cause some performance issues in qemu/libvirt. 

Could you please try to lower it to default (32) and try again?
Also can you please run htop on source and dest hosts while migrating?

BTW the logs don't match:
engine logs end at 2015-11-04
VDSM logs start at 2015-11-09
Comment 18 Michal Skrivanek 2015-11-10 04:03:10 EST
(In reply to Douglas Duckworth from comment #14)
> Created attachment 1091834 [details]
> vdsm.conf

with 1150 your getting way too close to maximum theoretical bandwidth.
What else is shared on that network? Storage? VM traffic?

You should also lower the max_outgoing_migrations to be absolutely sure not more than one can occur at any given time (sometimes migrations are spawned automatically as a result of "Move to maintenance" request or SLA balancing policies)
Also, in el6 qemu there was a known migration performance issue around 200-300MBps IIRC.

For testing I would recommend to try max bandwidth 250, 1 outgoing. Once it's stable we can go up
Comment 19 Michal Skrivanek 2015-11-10 04:04:12 EST
(In reply to Tomas Jelinek from comment #17)
> Could you please try to lower it to default (32) and try again?
I'd try a bit more. It's a 10G network...and for large VMs 32 is insufficient/too low
Comment 20 Douglas Duckworth 2015-11-10 12:17:13 EST
Michal

We have two separate 802.3ad bonded networks.

For Compellent SAN Bond0 carries iSCSI traffic only.  Bond0 has slaves em1 and em2. 

Bond1 provides VM, RHEV management, and transfer networks.  Bond1 has slaves p1p1 and p1p2.

Bond1 traffic over our most-active Hosts rarely exceed 50Mbps.  

The default value of "migration_max_bandwidth" actually prevented us from live-migrating extremely active DB servers.  Our most important DB was actually stuck on a Host.  I opened a ticket with Redhat [01434018] and shared these vdsm.conf settings with support.  By increasing "migration_max_bandwidth," we were able to move active Guest VM off Host that needed maintenance.

That being said thank you for this valuable feedback.  For testing I will lower 
"migration_max_bandwidth" a bit and set "max_outgoing_migrations = 1" then post results of migration to ticket that's open with Allan.

Best
Doug
Comment 21 Michal Skrivanek 2015-11-10 12:33:59 EST
(In reply to Douglas Duckworth from comment #20)

Re convergence - Yeah, I figured 32 is not enough:) Other things may help - qemu on RHEL 7 fixes the performance bottleneck on max speed per single migration, not sure if that's feasible. There is also a UI option for maximum allowed downtime(period for VM handoff from src to dst at the end of migration) - it's affected by the changes you did to other timeouts. Try to restore it back in conf and increase the value in UI. I suspect currently it's trying too hard to meet the limit(~50ms due to your config changes and the way the algorithm works)
Put it back and increase in UI to couple seconds....should greatly help


Thanks for the prompt responses, btw!:)
Comment 22 Douglas Duckworth 2015-11-10 12:43:27 EST
Thanks again Michal

We've deployed a few RHEL7 Hosts but not exported VMs to that cluster.  This sounds like another reason to start making that happen.

Ok cool I will also change "DefaultMaximumMigrationDowntime" per this article [https://access.redhat.com/solutions/445943].
Comment 23 Douglas Duckworth 2015-11-11 14:14:09 EST
Testing new settings...

RHEV-M:

Changed "DefaultMaximumMigrationDowntime" from "0" to "2000."

engine-config -g DefaultMaximumMigrationDowntime                                                                                                                                                                          
DefaultMaximumMigrationDowntime: 2000 version: general

On two Hosts that provide DEV/TEST Guest VMs changed vdsm.conf:

[vars]
ssl = true

migration_timeout = 60000

migration_max_time_per_gib_mem = 0

migration_max_bandwidth = 950

max_outgoing_migrations = 1


Will migrate several VMs between these two Hosts then post results.

Thanks
Doug
Comment 30 Michal Skrivanek 2016-01-13 07:43:56 EST
marking as resolved then
Comment 31 Douglas Duckworth 2016-01-19 12:00:42 EST
All

We are now seeing improved stability, (no more engine messages like 'domain 13f748ea-a005-45ac-abb9-ef64da7dfaae:COM_4TB1 in problem') with these vdsm.conf migration settings, as well as good enough throughput [1-3 Gbps] for our larger [40GB RAM+] VMs.

We will move to RHEL7 Hosts when we have time.

Thank you for your patience!

vdsm.conf:

ssl = true

migration_timeout = 60000

migration_max_time_per_gib_mem = 0

migration_max_bandwidth = 600

max_outgoing_migrations = 1

Note You need to log in before you can comment on or make changes to this bug.