Bug 871481

Summary: Host connectivity can be lost when restoring backups
Product: Red Hat Enterprise Virtualization Manager Reporter: Antoni Segura Puimedon <asegurap>
Component: vdsm Assignee: Antoni Segura Puimedon <asegurap>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: high Docs Contact:
Priority: high    
Version: 3.2.0 CC: abaron, bazulay, chetan, danken, hateya, iheim, knesenko, lpeer, sgrinber, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version: vdsm-4.10.2-10.0.el6ev Doc Type: Release Note
Doc Text:
Previously, the vdsmd service could be restarted by the spmprotect script, which triggered an attempt to restore the host network configuration to its last known safe state. If the host lost its Storage Pool Manager role, it would lose its current network connectivity. Now, the host network configuration is restored only at boot time, not when the vdsmd service is restarted. As a result, the `service vdsmd restart` command does not adversely affect host networking.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 917401    
Attachments:
Description Flags
network addition python script using the ovirt/rhevm rest api. none

Description Antoni Segura Puimedon 2012-10-30 13:59:30 UTC
Created attachment 635600 [details]
network addition python script using the ovirt/rhevm rest api.

Description of problem: When a lot of networks are set on the host as temporary networks, i.e., there has not been a call to setsafeconfig, a vdsmd restart can result in a loss of connectivity lasting several minutes.


Version-Release number of selected component (if applicable): 


How reproducible: 100%


Steps to Reproduce:
1. Create 300 VLANs using the attached script: python create_networks.py 0 300
(the script should be modified to match your datacenter, cluster, host and ethernet ids). Mind you, the addition uses the old API, so it can take up to 33 min to add all the networks.
2. Restart vdsmd: service vdsmd restart.
3. Wait several minutes (up to 14 min) for connectivity to return, log in to the machine and see that the restore, though extremely long, has been successful.
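
To put a number on the outage in step 3, something like the following could be run from another machine just before restarting vdsmd. This is only a minimal sketch and not part of the attached script: the helper names (is_reachable, measure_outage), the choice of TCP port 22 as the probe and the polling interval are all assumptions made for illustration.

```python
import socket
import time


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Probing the SSH port is an assumption; any port the host listens on works.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(probe, interval=1.0, max_wait=1800.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it fails and then succeeds again.

    Returns the observed outage in seconds, 0.0 if the probe never failed
    within max_wait seconds, or None if connectivity was lost and did not
    come back within max_wait seconds.
    """
    start = clock()
    went_down_at = None
    while clock() - start < max_wait:
        ok = probe()
        now = clock()
        if not ok and went_down_at is None:
            went_down_at = now  # first failed probe marks the start of the outage
        elif ok and went_down_at is not None:
            return now - went_down_at  # first successful probe after the outage
        sleep(interval)
    return 0.0 if went_down_at is None else None
```

For example, measure_outage(lambda: is_reachable("host.example.com")), where host.example.com stands in for the address of the host under test. The resolution of the result is limited by the polling interval and the probe timeout.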
  
Actual results: Connectivity is lost for a really long time (as long as the VLANs are added to an ethernet interface whose name alphabetically precedes 'rhevm'/'ovirtmgmt').


Expected results: Connectivity loss lasts only as long as it takes to bring the management interface down and up. The rest of the interfaces are processed afterwards.


Additional info: The script uses the requests library. To install it, run easy_install requests.

Comment 2 Dan Kenigsberg 2012-11-19 13:37:48 UTC
this is somewhat related to bug 877006: we would like to keep connectivity to storage and management networks as much as we can, even during rollback. This requires the brute-force approach of `network stop` && `network start`.

Comment 3 Antoni Segura Puimedon 2012-11-19 14:02:18 UTC
One thing that might help is making the network settings dynamic (and set by the engine), as proposed by some on the mailing list, and keeping just the management interface persistent and as untouched as possible by restarts/restores.

Comment 4 Antoni Segura Puimedon 2013-01-15 01:55:34 UTC
With the patch http://gerrit.ovirt.org/#/c/10334/ this bug will not be reproducible on vdsmd restart.

However, it can still happen during a regular rollback, although it is less likely: http://gerrit.ovirt.org/#/c/9506/ now stops and starts only those networks that were actually modified, which will rarely include the management interface, since it is probably wise to modify that one by itself and set those changes as safe separately. If need be, we could make the management interface the last to be taken down and the first to be brought up.
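
The "last down, first up" idea could look roughly like the sketch below. It is not taken from either patch: the function names and the MANAGEMENT_NETWORKS tuple are hypothetical, and it assumes networks are otherwise processed in alphabetical order by name, as the actual results in the description suggest.

```python
# Management network names used by RHEV and oVirt; treated here as an
# illustrative default, not as something read from vdsm configuration.
MANAGEMENT_NETWORKS = ("rhevm", "ovirtmgmt")


def takedown_order(networks, management=MANAGEMENT_NETWORKS):
    """Sort names alphabetically, but move management networks to the end."""
    return sorted(networks, key=lambda name: (name in management, name))


def bringup_order(networks, management=MANAGEMENT_NETWORKS):
    """Sort names alphabetically, but move management networks to the front."""
    return sorted(networks, key=lambda name: (name not in management, name))
```

With such an ordering, the time the management network spends down would no longer grow with the number of other networks handled in the same rollback, because it is taken down only after everything else and restored before anything else.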

Comment 6 Dan Kenigsberg 2013-02-20 19:22:11 UTC
I feel comfortable enough to backport

  http://gerrit.ovirt.org/10334
  split restore-net-conf away of vdsmd.init service

so as to avoid 80% of the cases where the problem would manifest.

Comment 7 Meni Yakove 2013-03-03 10:41:00 UTC
Verified on vdsm-4.10.2-10.0.el6ev.x86_64.

Comment 8 Itamar Heim 2013-06-11 09:51:42 UTC
3.2 has been released
