Created attachment 635600 [details]
network addition python script using the ovirt/rhevm rest api.

Description of problem:
When a lot of networks are set up on the host as temporary networks, i.e., there has not been a call to setsafeconfig, a vdsmd restart can result in a loss of connectivity lasting several minutes.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Create 300 VLANs using the attached script: python create_networks.py 0 300 (the script should be modified to match your datacenter, cluster, host and ethernet ids). Mind you, the addition uses the old api, so it can take up to 33 min to add all the nets.
2. Restart vdsmd with: service vdsmd restart.
3. Wait a few minutes (up to 14 min) for connectivity to come back, log in to the machine and see that the restore, though extremely long, has been successful.

Actual results:
Connectivity is lost for a really long time (provided the VLANs are added to an ethernet interface whose name alphabetically precedes 'rhevm'/'ovirtmgmt').

Expected results:
Connectivity is lost only for as long as it takes to bring the management interface down and up again. The rest of the interfaces are processed afterwards.

Additional info:
The script uses the requests library. To install it, run easy_install requests.
This is somewhat related to bug 877006: we would like to keep connectivity to the storage and management networks as much as we can, even during rollback. This requires the brute-force approach of `network stop` && `network start`.
One thing that might help is making the network settings dynamic (and set by the engine), as some have proposed on the mailing list, and keeping just the management interface persistent and as untouched as possible by restarts/restores.
With the patch http://gerrit.ovirt.org/#/c/10334/ this bug will no longer be reproducible on vdsmd restart. However, it can still happen during a regular rollback, although it is less likely, since http://gerrit.ovirt.org/#/c/9506/ now stops and starts only those networks that were really modified. These will not often include the management interface, as it is probably wise to modify that one by itself and set those changes as safe separately. If need be, we could still do the trick of making the management network the last to be taken down and the first to be brought up.
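To illustrate the "last down, first up" trick mentioned above, here is a minimal sketch. It is not vdsm code: the function names and the MANAGEMENT_NETWORKS tuple are hypothetical, and it assumes networks are otherwise processed in alphabetical order, as the original report suggests.

```python
# Hypothetical sketch, not taken from vdsm: order networks so that the
# management network is taken down last and brought up first.

# Management network names; assumed here to be the only ones that matter.
MANAGEMENT_NETWORKS = ("rhevm", "ovirtmgmt")


def takedown_order(networks, management=MANAGEMENT_NETWORKS):
    """Sort alphabetically, but move management networks to the end."""
    # False sorts before True, so non-management networks come first.
    return sorted(networks, key=lambda name: (name in management, name))


def bringup_order(networks, management=MANAGEMENT_NETWORKS):
    """Sort alphabetically, but move management networks to the front."""
    return sorted(networks, key=lambda name: (name not in management, name))


if __name__ == "__main__":
    nets = ["vlan10", "rhevm", "eth0.100", "vlan11"]
    print("take down:", takedown_order(nets))
    print("bring up: ", bringup_order(nets))
```

With an ordering like this, the time without management connectivity would no longer depend on how many other networks sort alphabetically before 'rhevm'/'ovirtmgmt'.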
I feel comfortable enough to backport http://gerrit.ovirt.org/10334 (split restore-net-conf away of vdsmd.init service) so as to avoid the 80% of cases in which the problem would manifest.
Verified on vdsm-4.10.2-10.0.el6ev.x86_64.
3.2 has been released