Review the content and incorporate suitable material into the deployment and planning guide.

Discussions from the email thread:

There's no official update at this point, but below is an unofficial update. While some of the numbers have changed, I don't think many of the lessons have. These days we're at ~130 hypervisors and ~3000 VMs in our RHEV infrastructure. The hypervisors are larger than before, and RHEV 3 is massively more efficient than RHEV 2. (Brian also improved the memory/CPU balance.)

Most of our non-virtual capacity is for:
* Memory R/W-intensive applications
  - Things like big databases
  - Many *can* be virtualized, but they make cluster maintenance harder
* Services needed to cold-start the DC
  - Whatever you need to be able to start and log into RHEV

Between Zoli and me, these would be our main bullet points:
* We still prefer RHEL + packages over the RHEV-H appliance
  - Easier to hook into configuration management
* A ratio of 1:32 CPU:RAM is working out about right for us
  - RAM is cheap and was initially our big bottleneck
* We're currently targeting between 50% and 70% utilization, CPU and memory
  - But we expect to be under 50% for a while after new purchases
  - Much higher than 80% and you start seeing contention during peaks
* Try to avoid having too many disks on the same storage domain
  - Under 100 is where we currently target
  - Various disk/VM operations become slow past this point
* Dedicated NIC for management
  - Generally low traffic, but if flooded it can result in fencing
* Dedicated NIC for migration
  - Avoids flooding the other NICs
  - The faster you can migrate VMs, the faster you finish maintenance
* Dedicated NIC for storage
* Tag the VLANs to the VM data NIC(s)
  - Most VLANs don't generate enough traffic to warrant a dedicated NIC
* Split off the heavy-traffic VLANs, but still tag them
  - RHEV defines tagged/untagged at the DC level, so switching is complex
* Jumbo frames are a good thing
* Have 2 separate production RHEV instances
  - Use separate storage controllers, core switches, etc.
  - Occasionally you do trigger bugs in prod that you didn't find before
  - Can avoid the need to cold-boot your DC services after major failures
* Gather data about how your cluster is doing
  - Nobody likes RCAs, even less so without data
* VM migrations are essential for RHEV maintenance
  - If you can't evacuate your hypervisors, you can't perform maintenance
  - Some applications are very memory read/write-intensive, which makes the VM difficult to migrate
  - 32+ GB of RAM is our rule of thumb for flagging a VM for closer inspection
  - Speak with application owners/architects about whether the load can be split across a larger number of smaller VMs
  - VM migration under load should be added to the test plan for any application being moved into RHEV

And from the "Parting Tips", I'd reiterate:
* Available memory
  - If you do mess up capacity planning, CPU contention is far less painful than the OOM killer (although harder to spot)
* Plan for growth
  - Especially if purchasing is slow
* Set quotas
  - Makes it possible to explain your costs and demonstrate the value brought by those costs
* Network speed
  - As RHEV grows, RHEV-H starts acting like an access-layer switch
* Keep up with RHEV
  - 3.5 was a big improvement in the UI
  - I personally keep seeing batches of my RFEs closed out with each new release
* Use Red Hat Support
  - The SBRs are fantastic
  - File RFEs; they really do get attention
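For the planning guide, the utilization band, the per-storage-domain disk limit, the large-VM migration rule, and the CPU:RAM ratio above can be combined into a simple capacity check. The following is a minimal Python sketch. It assumes you already export measured utilization and per-VM RAM, storage domain, and disk counts from your own monitoring or inventory. The `VM` class and all function names are hypothetical illustrations, not part of any RHEV API. The thresholds are the figures quoted in the thread, and reading "1:32 CPU:RAM" as one physical core per 32 GB of RAM is an assumption.

```python
from collections import Counter
from dataclasses import dataclass

# Thresholds taken from the rules of thumb in the email thread.
TARGET_UTIL_LOW = 0.50      # expected floor for a while after new purchases
TARGET_UTIL_HIGH = 0.70     # upper end of the target band
CONTENTION_UTIL = 0.80      # much above this, contention appears during peaks
MAX_DISKS_PER_DOMAIN = 100  # disk/VM operations slow down past this point
LARGE_VM_RAM_GB = 32        # VMs at or above this get a closer migration review
# Assumption: "1:32 CPU:RAM" means one physical core per 32 GB of RAM.
TARGET_RAM_PER_CORE_GB = 32


@dataclass
class VM:
    """Hypothetical inventory record; populate it from your own tooling."""
    name: str
    ram_gb: int
    storage_domain: str
    disk_count: int


def classify_utilization(used: float, capacity: float) -> str:
    """Place a measured CPU or memory utilization into the thread's bands."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio > CONTENTION_UTIL:
        return "contention-risk"
    if ratio > TARGET_UTIL_HIGH:
        return "above-target"
    if ratio >= TARGET_UTIL_LOW:
        return "in-target"
    return "below-target"


def crowded_domains(vms: list[VM]) -> list[str]:
    """Return storage domains whose total disk count reaches the limit."""
    per_domain: Counter[str] = Counter()
    for vm in vms:
        per_domain[vm.storage_domain] += vm.disk_count
    return sorted(d for d, n in per_domain.items() if n >= MAX_DISKS_PER_DOMAIN)


def migration_review(vms: list[VM]) -> list[str]:
    """Return names of VMs large enough to warrant a migration-under-load test."""
    return [vm.name for vm in vms if vm.ram_gb >= LARGE_VM_RAM_GB]


def ram_per_core(cores: int, ram_gb: int) -> float:
    """GB of RAM per physical core, for comparison with TARGET_RAM_PER_CORE_GB."""
    return ram_gb / cores


if __name__ == "__main__":
    print(classify_utilization(used=60, capacity=100))  # in-target
    inventory = [
        VM("db01", 64, "sd-prod-a", 4),
        VM("web01", 8, "sd-prod-a", 1),
    ]
    print(migration_review(inventory))  # ['db01']
    print(crowded_domains(inventory))   # []
    print(ram_per_core(cores=16, ram_gb=512))  # 32.0
```

The checks are deliberately independent so each can run against whatever data your monitoring already collects. For example, a nightly job could report domains approaching 100 disks before operations slow down, and list 32+ GB VMs whose owners should be asked to add migration under load to their test plans.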
*** This bug has been marked as a duplicate of bug 1271437 ***