Description of problem:
RHEV-M marks the host as non-operational and reports a problem with a storage domain, but vdsm reports no problems with that storage domain.

Version-Release number of selected component (if applicable):
rhevm-3.5.4.2-1.3.el6ev.noarch
vdsm-4.16.26-1.el6ev.x86_64

How reproducible:
Unknown

Steps to Reproduce:
1. Configure a new host
2. Activate the new host
3. Migrate VMs to the new host

Actual results:
The host is eventually marked as non-responsive. No issues are logged in vdsm.log on the host.

Expected results:
The host should either continue to run normally, or vdsm should log a problem.

Additional info:
Do we need to add logs to this Bugzilla?
please attach engine.log for the relevant time
Looking into the logs it looks like:

- There are lots of VMs (~30) migrating to tulhv1p06.
- This load causes various issues on more hosts:

2015-09-22 11:04:19,899 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) [32262ed4] Failed vds listing, vds = c23ea72a-5476-4d60-a4a8-b436a95a079d : tulhv1p14, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Unexpected exception, code = 16
2015-09-22 11:04:19,936 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) [1b136eca] Failed vds listing, vds = 0a142853-0347-4cab-85af-6cfa1d4c0726 : tulhv1p12, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetAllVmStatsVDS, error = Unexpected exception, code = 16
2015-09-22 11:08:27,875 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-39) [843e5b8] Command ListVDSCommand(HostName = tulhv1p06, HostId = 58a592d0-8810-446e-b01f-f8b448c0bcd9, vds=Host[tulhv1p06,58a592d0-8810-446e-b01f-f8b448c0bcd9]) execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused)

- These problems caused:

2015-09-22 11:06:11,836 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-45) vds tulhv1p06 reported domain f2bb7851-b2ed-495d-9460-84a0e97c016e:COM_3TB6 as in problem, attempting to move the vds to status NonOperational

and consequently:

2015-09-22 11:07:02,757 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-49) [39410bac] Correlation ID: 39410bac, Job ID: e58d67b6-a6ff-476b-b126-e275162b4cf8, Call Stack: null, Custom Event ID: -1, Message: Host tulhv1p06 cannot access the Storage Domain(s) COM_3TB6 attached to the Data Center tulvp1p03.
Setting Host state to Non-Operational.

- But later the status of the host was set back to Up:

2015-09-22 11:10:04,561 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-12) [67280370] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host tulhv1p06 was set to Up.

Can you please confirm that the cluster moved back to normal after the migration storm finished? If yes, then please note that in the 4.0 timeframe migrations are going to change significantly, especially targeting issues like this one:

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1252426
Feature Page: http://www.ovirt.org/Features/Migration_Enhancements
Gerrit topic: https://gerrit.ovirt.org/#/q/topic:migration-enhancements
The cluster usually returns to a normal state with the Host being set as Up, though we usually restart vdsmd on the Host manually to get it back in place sooner, since we actually never see loss of storage or paused VMs. It looks like we set "migration_max_bandwidth" too high in vdsm.conf: we thought the unit was megabits, but that's not the case per https://access.redhat.com/solutions/108813#

[vars]
ssl = true
migration_timeout = 60000
migration_max_time_per_gib_mem = 0
migration_max_bandwidth = 9500
max_outgoing_migrations = 8

We should obviously lower it to a more appropriate level given that the transfer network runs over a converged 10Gb 802.3ad bond.
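To see how far off the configured value was, here is a back-of-the-envelope sketch, assuming (per the linked solution) that migration_max_bandwidth is MiB/s per migration rather than megabits:

```python
# Rough unit check for the vdsm.conf values quoted above.
# Assumption: migration_max_bandwidth is in MiB/s per outgoing migration.

MIB = 1024 * 1024  # bytes per MiB


def mibps_to_gbps(mibps):
    """Convert MiB/s to Gbit/s."""
    return mibps * MIB * 8 / 1e9


# Configured: 9500 MiB/s per migration, up to 8 concurrent migrations.
per_migration = mibps_to_gbps(9500)
worst_case = mibps_to_gbps(9500 * 8)

print(f"per migration:   {per_migration:.1f} Gbit/s")
print(f"worst case (x8): {worst_case:.1f} Gbit/s")
print(f"one 10GbE leg:   ~{10e9 / 8 / MIB:.0f} MiB/s")
```

Under that assumption, a single migration at 9500 MiB/s would need roughly 80 Gbit/s, so the cap never throttles anything on a 10Gb bond — qemu effectively runs unlimited, which matches the migration-storm symptoms above.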
Thanks, we will review all of the items that were posted to this BZ. After lowering the VDSM migration threshold, and only migrating one VM at a time, we're still seeing messages in the Engine log like:

2015-11-06 15:07:48,361 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-11) domain 290158f5-4182-4072-abe9-7951b9d24c72:COM_3TB3 in problem. vds: tulhv1p06
2015-11-06 15:08:03,934 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 00f4c946-f133-4908-98c5-f64e352616b9:COM_3TB4 in problem. vds: tulhv1p06
2015-11-06 15:08:03,935 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain f2bb7851-b2ed-495d-9460-84a0e97c016e:COM_3TB6 in problem. vds: tulhv1p06
2015-11-06 15:08:03,936 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 8f2a3efd-5abc-42f7-a232-d482bfc74860:COM_3TB7 in problem. vds: tulhv1p06
2015-11-06 15:08:03,937 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain a35aa9f8-1576-4de2-9956-145ae934ae20:COM_3TB9 in problem. vds: tulhv1p06
2015-11-06 15:08:03,938 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain bfe870f4-eea0-4729-8ebf-3f2ee5a8c512:COM_3TB12 in problem. vds: tulhv1p06
2015-11-06 15:08:03,939 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain dbaf4d1d-571f-4873-9dde-93c7bb0296a4:COM_3TB10 in problem. vds: tulhv1p06
2015-11-06 15:08:03,941 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2f69c2c5-0c16-454f-a429-26c8c7e4011c:COM_3TB11 in problem. vds: tulhv1p06
2015-11-06 15:08:03,942 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 975fe25e-73db-46e2-96ef-ab6d6c6550a8:COM_DR1 in problem. vds: tulhv1p06
2015-11-06 15:08:03,942 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2eecf753-a5cf-4af6-a5c1-e2e0295cab6d:COM_3TB8 in problem. vds: tulhv1p06
2015-11-06 15:08:03,943 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 2fe689d9-a9fa-4347-ae1f-4022cc52fb5b:COM_SSD in problem. vds: tulhv1p06
2015-11-06 15:08:03,944 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain 13f748ea-a005-45ac-abb9-ef64da7dfaae:COM_4TB1 in problem. vds: tulhv1p06
2015-11-06 15:08:03,945 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-23) domain a52b71eb-6707-4361-ae38-02dacf3db1f5:COM_3TB2 in problem. vds: tulhv1p06

As well as VMs being sent to Unknown status and later recovering. Using iftop during migration, we do not see throughput approaching wire speed of the bonded connection. During some migrations we actually saw storage errors, similar to the above, populating the Engine log before VMs were sent over the transfer network.
Could you please attach the engine logs for this case, as well as the source and dest vdsm logs and the vdsm.conf (source/dest)? Could you please also check the CPU consumption of the source/dest host? Maybe there is a different problem than the bandwidth one...
Created attachment 1091834 [details] vdsm.conf
migration_max_bandwidth = 1150 still seems quite high, since the default is 32. I can imagine that this can cause some performance issues in qemu/libvirt. Could you please try lowering it to the default (32) and try again? Also, can you please run htop on the source and dest hosts while migrating?

BTW, the logs don't match:
engine logs end at 2015-11-04
VDSM logs start at 2015-11-09
(In reply to Douglas Duckworth from comment #14)
> Created attachment 1091834 [details]
> vdsm.conf

With 1150 you're getting way too close to the maximum theoretical bandwidth. What else is shared on that network? Storage? VM traffic? You should also lower max_outgoing_migrations to be absolutely sure that no more than one migration can occur at any given time (sometimes migrations are spawned automatically as a result of a "Move to maintenance" request or SLA balancing policies). Also, in el6 qemu there was a known migration performance issue around 200-300 MBps, IIRC. For testing I would recommend trying max bandwidth 250, 1 outgoing. Once it's stable we can go up.
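The headroom argument behind those suggested test values can be sketched in a few lines; this assumes the bandwidth unit is MiB/s per migration, and the 10 Gbit/s figure is one leg of the bond from the earlier comments:

```python
# Worst-case migration traffic if every allowed migration slot runs
# at its bandwidth cap. Values and the 10GbE link figure come from
# this bug's discussion; the unit assumption is MiB/s per migration.

MIB = 1024 * 1024  # bytes per MiB


def aggregate_gbps(max_bandwidth_mibps, max_outgoing):
    """Worst-case aggregate migration traffic in Gbit/s."""
    return max_bandwidth_mibps * max_outgoing * MIB * 8 / 1e9


link_gbps = 10.0  # one leg of the 10Gb 802.3ad bond

# Original settings vs. the suggested test settings.
for bw, out in [(9500, 8), (250, 1)]:
    used = aggregate_gbps(bw, out)
    print(f"bw={bw:>5} MiB/s x {out}: {used:7.1f} Gbit/s "
          f"({used / link_gbps:.0%} of one 10GbE leg)")
```

At 250 MiB/s with a single outgoing migration, the worst case is about 2.1 Gbit/s, roughly a fifth of one 10GbE leg, leaving room for the VM and management traffic that share Bond1.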
(In reply to Tomas Jelinek from comment #17)
> Could you please try to lower it to default (32) and try again?

I'd try a bit more. It's a 10G network... and for large VMs 32 is too low.
Michal

We have two separate 802.3ad bonded networks. Bond0 carries iSCSI traffic only, for the Compellent SAN; its slaves are em1 and em2. Bond1 provides the VM, RHEV management, and transfer networks; its slaves are p1p1 and p1p2. Bond1 traffic on our most-active Hosts rarely exceeds 50Mbps.

The default value of "migration_max_bandwidth" actually prevented us from live-migrating extremely active DB servers. Our most important DB was actually stuck on a Host. I opened a ticket with Red Hat [01434018] and shared these vdsm.conf settings with support. By increasing "migration_max_bandwidth," we were able to move the active guest VM off the Host that needed maintenance.

That being said, thank you for this valuable feedback. For testing I will lower "migration_max_bandwidth" a bit and set "max_outgoing_migrations = 1", then post the results of migration to the ticket that's open with Allan.

Best
Doug
(In reply to Douglas Duckworth from comment #20)

Re convergence - yeah, I figured 32 is not enough :) Other things may help - qemu on RHEL 7 fixes the performance bottleneck on max speed per single migration, not sure if that's feasible for you. There is also a UI option for the maximum allowed downtime (the period for VM handoff from src to dst at the end of migration) - it's affected by the changes you made to the other timeouts. I suspect it's currently trying too hard to meet the limit (~50ms, due to your config changes and the way the algorithm works). Restore those timeouts in the conf and increase the downtime value in the UI to a couple of seconds... that should greatly help.

Thanks for the prompt responses, btw! :)
Thanks again Michal

We've deployed a few RHEL7 Hosts but have not yet exported VMs to that cluster. This sounds like another reason to start making that happen. OK, cool - I will also change "DefaultMaximumMigrationDowntime" per this article [https://access.redhat.com/solutions/445943].
Testing new settings...

RHEV-M: Changed "DefaultMaximumMigrationDowntime" from "0" to "2000."

engine-config -g DefaultMaximumMigrationDowntime
DefaultMaximumMigrationDowntime: 2000 version: general

On two Hosts that provide DEV/TEST Guest VMs, changed vdsm.conf:

[vars]
ssl = true
migration_timeout = 60000
migration_max_time_per_gib_mem = 0
migration_max_bandwidth = 950
max_outgoing_migrations = 1

Will migrate several VMs between these two Hosts then post results.

Thanks
Doug
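When rolling settings like these out to several hosts, a quick sanity check before restarting vdsmd can catch typos; here is a minimal sketch using Python's stdlib configparser (the inline config mirrors the test settings above, and the thresholds are this thread's rules of thumb, not an official tool):

```python
# Sanity-check migration settings in a vdsm.conf-style [vars] section.
# The inline text mirrors the test settings from this bug; for a real
# host you would read /etc/vdsm/vdsm.conf instead.
import configparser

VDSM_CONF = """\
[vars]
ssl = true
migration_timeout = 60000
migration_max_time_per_gib_mem = 0
migration_max_bandwidth = 950
max_outgoing_migrations = 1
"""

cp = configparser.ConfigParser()
cp.read_string(VDSM_CONF)  # real file: cp.read("/etc/vdsm/vdsm.conf")

bw = cp.getint("vars", "migration_max_bandwidth")
out = cp.getint("vars", "max_outgoing_migrations")

# Rules of thumb from this thread: one migration at a time, and the
# per-migration cap (MiB/s) should fit inside one 10GbE leg.
assert out == 1, "expected a single outgoing migration at a time"
assert bw * 1024 * 1024 * 8 / 1e9 < 10, "cap exceeds one 10GbE leg"
print(f"OK: {bw} MiB/s cap, {out} concurrent migration(s)")
```

At 950 MiB/s the cap works out to roughly 8 Gbit/s, just under one 10GbE leg, which is consistent with the 1-3 Gbps real-world throughput reported later in this bug.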
marking as resolved then
All

We are now seeing improved stability (no more engine messages like 'domain 13f748ea-a005-45ac-abb9-ef64da7dfaae:COM_4TB1 in problem') with these vdsm.conf migration settings, as well as good enough throughput [1-3 Gbps] for our larger [40GB RAM+] VMs. We will move to RHEL7 Hosts when we have time. Thank you for your patience!

vdsm.conf:

ssl = true
migration_timeout = 60000
migration_max_time_per_gib_mem = 0
migration_max_bandwidth = 600
max_outgoing_migrations = 1
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days