Bug 877006

Summary: storage network rollback fails for SPM
Product: Red Hat Enterprise Virtualization Manager
Reporter: Martin Pavlik <mpavlik>
Component: vdsm
Assignee: Antoni Segura Puimedon <asegurap>
Status: CLOSED WONTFIX
QA Contact: Martin Pavlik <mpavlik>
Severity: high
Docs Contact:
Priority: medium
Version: 3.2.0
CC: acathrow, bazulay, danken, gcheresh, gklein, iheim, jkt, lpeer, masayag, mavital, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: network
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 877734 (view as bug list)
Environment:
Last Closed: 2014-02-25 10:05:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 877734
Attachments:
Description                      Flags
vdsm.log + engine.log            none
sosreport                        none
ifcfg files before wrong gw      none
ifcfg files after wrong gw       none
log_collector2                   none

Description Martin Pavlik 2012-11-15 14:13:38 UTC
Created attachment 645662 [details]
vdsm.log + engine.log

Description of problem:
If the user sets a wrong gateway on the host's rhevm interface, the settings do not roll back after the connectivity check fails.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Manager Version: '3.1.0-28.el6ev' 

Host:
RHEL - 6Server - 6.3.0.3.el6
kernel 2.6.32 - 279.11.1.el6.x86_64
kvm 0.12.1.2 - 2.295.el6_3.5
vdsm-4.9.6-42.0.el6_3

How reproducible:
100%

Steps to Reproduce:
1. Have a working host in the setup
2. Hosts -> your host -> Network Interfaces -> Setup Host Networks
3. Edit the rhevm interface -> set a wrong default gateway (a valid IP address, but not the address of the actual gateway; see the illustrative ifcfg sketch below) -> click OK
4. Tick the check boxes "Verify connectivity between Host and Engine" and "Save network configuration"
5. Wait until an error message appears in the GUI and "Save network configuration"
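
For illustration only (the real files are in the "ifcfg files before/after wrong gw" attachments; the device name follows the rhevm bridge convention, but every address below is made up), the misconfiguration amounts to an ifcfg file along these lines, where GATEWAY is a valid-looking address that is not the real gateway:

# /etc/sysconfig/network-scripts/ifcfg-rhevm -- illustrative sketch, not from the attachments
DEVICE=rhevm
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
NETMASK=255.255.255.0
# valid IP address, but not the address of the actual gateway
GATEWAY=192.0.2.99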

  
Actual results:
The host remains unreachable with the wrong gateway set.

Expected results:
The host rolls back the change and connectivity is restored with the old settings.

Additional info:
vdsm.log
MainProcess|Thread-379::DEBUG::2012-11-15 14:46:35,588::configNetwork::1358::setupNetworks::(setupNetworks) Checking connectivity...
Thread-33::WARNING::2012-11-15 14:47:31,108::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
    raise Timeout()
Timeout
Thread-33::ERROR::2012-11-15 14:47:31,112::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain a0d5fbad-032d-4534-841e-2bfc8d4c9af8 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 137, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 426, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 286, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler

Comment 1 Martin Pavlik 2012-11-15 14:16:03 UTC
Created attachment 645663 [details]
sosreport

Comment 2 Martin Pavlik 2012-11-15 14:16:37 UTC
Created attachment 645664 [details]
ifcfg files before wrong gw

Comment 3 Martin Pavlik 2012-11-15 14:17:11 UTC
Created attachment 645665 [details]
ifcfg files after wrong gw

Comment 4 Martin Pavlik 2012-11-15 14:21:10 UTC
Correction:

Steps to Reproduce:
5. Wait until an error message appears in the GUI and "Save network configuration"

should be

5. Wait until an error message appears in the GUI

Comment 5 Dan Kenigsberg 2012-11-17 22:14:41 UTC
The main culprit here is

Thread-374::INFO::2012-11-15 14:46:24,305::logUtils::39::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 2, 'spmStatus': 'SPM', 'spmLver': 13}}
MainThread::INFO::2012-11-15 14:48:29,010::vdsm::70::vds::(run) I am the actual vdsm 4.9-42.0

Vdsm was restarted before it managed to roll back the network config. Your host was the SPM when that happened, and could not renew its storage lease due to the bad gateway. It is not prudent to let people edit networking on such an important node; we should consider blocking it in the UI.

Maybe we should also roll back configuration upon vdsm process startup, instead of upon the sysv service restart. This should not be done hastily; I am thus lowering the severity of the bug.
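
For illustration only, here is a minimal, hypothetical sketch of the connectivity-check flow; the names and the timeout value are made up and are not taken from the vdsm source. The point is that the rollback runs only if the vdsm process survives until the timeout expires; in this bug the SPM host lost its storage lease because of the bad gateway, vdsm was restarted, and the rollback never ran:

# Hypothetical sketch; not vdsm code.
import time

CONNECTIVITY_TIMEOUT = 120  # seconds; illustrative value

def apply_config(config):
    # Stand-in for rewriting ifcfg files and restarting the interfaces.
    print("applying", config)

def persist_config(config):
    # Stand-in for "Save network configuration".
    print("persisting", config)

def setup_networks(new_config, old_config, engine_confirmed):
    # Apply new_config; roll back to old_config unless the engine
    # confirms connectivity before the timeout.
    apply_config(new_config)
    deadline = time.time() + CONNECTIVITY_TIMEOUT
    while time.time() < deadline:
        if engine_confirmed():
            persist_config(new_config)
            return True
        time.sleep(1)
    # Reached only if this process is still alive; a vdsm restart
    # before this point leaves the bad gateway in place.
    apply_config(old_config)
    return False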

Comment 6 Antoni Segura Puimedon 2013-01-10 14:21:16 UTC
The UI blocking has not been done, but performing the rollback only on startup, instead of on every vdsmd service restart, is fixed with http://gerrit.ovirt.org/#/c/10334/

Comment 7 Dan Kenigsberg 2013-04-11 09:30:50 UTC
Martin, could you check the behavior on a recent rhev-3.2, when you have a dedicated storage network?

Editing the storage network would never ever work for the SPM, but hopefully, nowadays, you could edit the management network.

Comment 8 Martin Pavlik 2013-04-17 08:09:43 UTC
(In reply to comment #7)
> Martin, could you check the behavior on a recent rhev-3.2, when you have a
> dedicated storage network?
> 
> Editing the storage network would never ever work for the SPM, but hopefully,
> nowadays, you could edit the management network.

With a dedicated storage network on rhevm (sf13.1, vdsm-4.10.2-15.0.el6ev.x86_64), the bridge settings roll back correctly. Logs are attached as log_collector2.

Comment 9 Martin Pavlik 2013-04-17 08:10:34 UTC
Created attachment 736709 [details]
log_collector2

Comment 10 Dan Kenigsberg 2014-02-25 10:05:34 UTC
(In reply to Dan Kenigsberg from comment #5)
> 
> Maybe we should also roll back configuration upon vdsm process startup,
> instead of upon the sysv service restart. This should not be done hastily; I
> am thus lowering the severity of the bug.

Since I wrote that comment, we have gone in the opposite direction: rollback no longer occurs during sysv service restart (and certainly not during process startup) but only during boot.

One of the motivations for this was exactly this flow: an SPM failover used to cause rollback of unrelated network changes. http://gerrit.ovirt.org/10334

The bug has become much less acute since we no longer shut off all networking when rolling back a network configuration: we shut off only the relevant networks. Thus, this bug pops up only when configuring the storage network of the SPM node.

The currently remaining behavior is annoying, and would require a power cycle to fix, but I do not see any way we can avoid it, other than asking users not to configure their storage network on the SPM.
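
For illustration only, a hypothetical sketch of the selective rollback described above; the function names are made up and not taken from the vdsm source:

# Hypothetical sketch; not vdsm code.
def rollback(changed_networks, saved_configs):
    # Only the networks touched by the failed change are restored, instead
    # of shutting off all networking on the host; hence the bug now bites
    # only when the changed network is the SPM's storage network.
    for net in changed_networks:
        restore_network(net, saved_configs[net])

def restore_network(net, config):
    # Stand-in for rewriting one network's ifcfg files and restarting only
    # its interfaces.
    print("restoring", net, "to", config)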