Bug 1267136

Summary:	[Docs] [Deployment] Incorporate content from Red Hat IT using RHEV
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Julie <juwu>
Component:	Documentation	Assignee:	Julie <juwu>
Status:	CLOSED DUPLICATE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.6.0	CC:	adahms, gklein, lbopf, lsurette, rbalakri, yeylon, ykaul, ylavi
Target Milestone:	ovirt-3.6.5
Target Release:	3.6.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-02-09 06:36:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Docs	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1271437

Description Julie 2015-09-29 06:18:28 UTC

Review the content and incorporate the suitable parts into the deployment and planning guide.

Discussions from the email thread:  

There's no official update at this point, but below is an unofficial update.


While some of the numbers have changed, I don't think many of the
lessons have really changed.

These days we're at ~130 Hypervisors and ~3000 VMs in our RHEV
infrastructure.  The hypervisors are larger than before and RHEV 3 is
massively more efficient than RHEV 2.  (Brian also improved the
Memory/CPU balance).

Most of our non-virtual capacity is for
* Memory R/W intensive applications
  - Things like big databases
  - Many *can* be virtualized, but they make cluster maintenance harder
* Services needed to cold-start the DC
  - Whatever you need to be able to start and log in to RHEV.


Between Zoli and me, these would be our main bullet points:

* We still prefer RHEL + packages over the RHEV-H appliance
  - Easier to hook into Configuration Management
* A ratio of 1:32 CPU:RAM is working out about right for us
  - RAM is cheap and was initially our big bottleneck
* We're currently targeting between 50% and 70% utilization for both CPU
  and memory
  - But we expect to be under 50% for a while after new purchases
  - Go much higher than 80% and you start seeing contention during peaks.
* Try to avoid having too many disks on the same storage domain
  - We currently target under 100 disks per storage domain
  - Various disk/VM operations become slow past this point
* Dedicated NIC for management
  - Generally low traffic but if flooded it can result in fencing
* Dedicated NIC for migration
  - Avoids flooding the other NICs
  - The faster you can migrate VMs, the faster you finish maintenance
* Dedicated NIC for storage
* Tag the VLANs to VM data NIC(s)
  - Most VLANs don't generate enough traffic to warrant a dedicated NIC
* Split off the heavy traffic VLANs, but still tag them
  - RHEV defines tagged/untagged at the DC level so switching is complex
* Jumbo frames are a good thing
* Have 2 separate Production RHEV instances
  - Use separate storage controllers, core switches, etc
  - Occasionally you do trigger bugs in prod that you didn't find before
  - Can avoid the need to cold-boot your DC services after major
    failures
* Gather data about how your cluster is doing
  - Nobody likes RCAs, even less so without data
* VM Migrations are essential for RHEV maintenance
  - If you can't evacuate your Hypervisors you can't perform maintenance
  - Some applications are very memory read/write intensive and make the
    VM difficult to migrate
  - 32+ GB of RAM is our rule-of-thumb threshold for inspecting a VM
    more closely
  - Ask application owners/architects whether the load can be split
    into a larger number of smaller VMs
  - VM migration under load should be added to test plans for any
    application being moved into RHEV
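
The numeric rules of thumb above can be sketched as a few simple checks.
This is only an illustration, not something from the thread: the function
and constant names are hypothetical, and it assumes that "1:32 CPU:RAM"
means one CPU core per 32 GB of RAM, which the thread does not spell out.
Feed it numbers from whatever monitoring you already have.

```python
from typing import Dict, List

# Thresholds copied from the bullet points above. All names are hypothetical.
TARGET_UTIL_LOW = 0.50        # lower end of the 50-70% utilization target
TARGET_UTIL_HIGH = 0.70       # upper end of the 50-70% utilization target
CONTENTION_UTIL = 0.80        # contention during peaks starts above this
MAX_DISKS_PER_DOMAIN = 100    # target is to stay under this many disks
LARGE_VM_RAM_GB = 32          # VMs at or above this get a closer look
# ASSUMPTION: "1:32 CPU:RAM" is read here as 1 CPU core per 32 GB of RAM.
TARGET_RAM_GB_PER_CORE = 32


def utilization_status(used: float, capacity: float) -> str:
    """Classify CPU or memory utilization against the targets above."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio > CONTENTION_UTIL:
        return "contention-risk"
    if ratio > TARGET_UTIL_HIGH:
        return "above-target"
    if ratio < TARGET_UTIL_LOW:
        return "below-target"
    return "on-target"


def storage_domain_over_limit(disk_count: int) -> bool:
    """True when a storage domain is no longer under the 100-disk target."""
    return disk_count >= MAX_DISKS_PER_DOMAIN


def vms_needing_review(vm_ram_gb: Dict[str, int]) -> List[str]:
    """Names of VMs with 32+ GB of RAM, sorted, for migration review."""
    return sorted(
        name for name, ram in vm_ram_gb.items() if ram >= LARGE_VM_RAM_GB
    )


def ram_gb_per_core(cores: int, ram_gb: float) -> float:
    """RAM per CPU core, for comparison with TARGET_RAM_GB_PER_CORE."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return ram_gb / cores
```

For example, utilization_status(60, 100) falls inside the 50-70% band,
and a hypervisor with 16 cores and 512 GB of RAM matches the assumed
1:32 ratio exactly.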


And from the "Parting Tips" I'd reiterate:
* Available memory
  - If you do mess up capacity planning, CPU contention is far less
    painful than the OOM killer (although harder to spot)
* Plan for Growth
  - especially if purchasing is slow
* Setting Quotas
  - makes it possible to explain your costs and demonstrate the value
    brought by those costs
* Network Speed
  - as RHEV grows, RHEV-H starts acting like an access layer switch
* Keep up with RHEV
  - 3.5 was a big improvement in the UI
  - I personally keep seeing batches of my RFEs closed out with each
    new release.
* Use Red Hat Support
  - the SBRs are fantastic
  - File RFEs, they really do get attention.

Comment 4 Julie 2016-02-09 06:36:31 UTC


*** This bug has been marked as a duplicate of bug 1271437 ***