877006 – storage network rollback fails for SPM

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.2.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	3.4.0
Assignee:	Antoni Segura Puimedon
QA Contact:	Martin Pavlik
Docs Contact:
URL:
Whiteboard:	network
Depends On:
Blocks:	877734

Reported:	2012-11-15 14:13 UTC by Martin Pavlik
Modified:	2016-02-10 19:52 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	877734
Environment:
Last Closed:	2014-02-25 10:05:34 UTC
oVirt Team:	Network
Target Upstream Version:
Embargoed:

Attachments
vdsm.log + engine.log (8.84 KB, application/x-gzip) 2012-11-15 14:13 UTC, Martin Pavlik
sosreport (6.19 MB, application/x-xz) 2012-11-15 14:16 UTC, Martin Pavlik
ifcfg files before wrong gw (624 bytes, application/x-compressed-tar) 2012-11-15 14:16 UTC, Martin Pavlik
ifcfg files after wrong gw (621 bytes, application/x-compressed-tar) 2012-11-15 14:17 UTC, Martin Pavlik
log_collector2 (5.69 MB, application/x-xz) 2013-04-17 08:10 UTC, Martin Pavlik

Description Martin Pavlik 2012-11-15 14:13:38 UTC

Created attachment 645662 [details]
vdsm.log + engine.log

Description of problem:
If the user sets a wrong gateway on the host's rhevm interface, the settings are not rolled back after the connectivity check fails.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Manager Version: '3.1.0-28.el6ev' 

Host:
RHEL - 6Server - 6.3.0.3.el6
kernel 2.6.32 - 279.11.1.el6.x86_64
kvm 0.12.1.2 - 2.295.el6_3.5
vdsm-4.9.6-42.0.el6_3

How reproducible:
100%

Steps to Reproduce:
1. Have working host in setup
2. Host -> your host -> network interfaces -> setup host networks
3. edit rhevm interface -> set wrong default gateway (valid IP address but not address of actual GW) -> click OK
3. tick check boxes on "Verify connectivity between Host and Engine" and Save network configuration
4. wait until error message appears in GUI and "Save network configuration"

  
Actual results:
host remains unreachable with wrong GW set

Expected results:
host rolls back, connectivity is restored with the old settings
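
The expected behavior can be sketched as an apply, verify, roll back loop. This is only an illustration of the pattern, not vdsm's implementation: `apply_with_rollback` and its `apply`, `rollback` and `is_connected` callables are hypothetical names introduced here, and the timeout values are arbitrary.

```python
import time


def apply_with_rollback(apply, rollback, is_connected,
                        timeout=5.0, interval=0.5,
                        sleep=time.sleep, clock=time.monotonic):
    """Apply a change and undo it if connectivity is not confirmed in time.

    All three callables are hypothetical placeholders supplied by the caller:
    apply() installs the new settings, rollback() restores the old ones,
    and is_connected() reports whether the peer is reachable again.
    """
    apply()
    deadline = clock() + timeout
    while clock() < deadline:
        if is_connected():
            return True       # new settings work, keep them
        sleep(interval)
    rollback()                # connectivity never came back, restore old settings
    return False


if __name__ == "__main__":
    log = []
    kept = apply_with_rollback(
        apply=lambda: log.append("set wrong gateway"),
        rollback=lambda: log.append("restore old gateway"),
        is_connected=lambda: False,   # simulate an unreachable engine
        timeout=0.1, interval=0.02,
    )
    print(kept, log)
```

Note that a loop like this only helps while the process running it stays alive until the rollback step.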

Additional info:
vdsm.log
MainProcess|Thread-379::DEBUG::2012-11-15 14:46:35,588::configNetwork::1358::setupNetworks::(setupNetworks) Checking connectivity...
Thread-33::WARNING::2012-11-15 14:47:31,108::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
    raise Timeout()
Timeout
Thread-33::ERROR::2012-11-15 14:47:31,112::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain a0d5fbad-032d-4534-841e-2bfc8d4c9af8 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 137, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 426, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 286, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler

Comment 1 Martin Pavlik 2012-11-15 14:16:03 UTC

Created attachment 645663 [details]
sosreport

Comment 2 Martin Pavlik 2012-11-15 14:16:37 UTC

Created attachment 645664 [details]
ifcfg files before wrong gw

Comment 3 Martin Pavlik 2012-11-15 14:17:11 UTC

Created attachment 645665 [details]
ifcfg files after wrong gw

Comment 4 Martin Pavlik 2012-11-15 14:21:10 UTC

correction

Steps to Reproduce:
4. wait until error message appears in GUI and "Save network configuration"

should be
4. wait until error message appears in GUI

Comment 5 Dan Kenigsberg 2012-11-17 22:14:41 UTC

The main culprit here is

Thread-374::INFO::2012-11-15 14:46:24,305::logUtils::39::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 2, 'spmStatus': 'SPM', 'spmLver': 13}}
MainThread::INFO::2012-11-15 14:48:29,010::vdsm::70::vds::(run) I am the actual vdsm 4.9-42.0

Vdsm was restarted before it managed to roll back the network config. Your host was SPM when that happened, and could not pet its lease due to the bad gw. It is not prudent to let people edit networking on such an important node - we should consider blocking it in the UI.

Maybe we should also roll back configuration upon vdsm process startup, instead of upon the sysv service restart. This should not be done hastily; I am thus lowering the severity of the bug.
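
A restart of this kind can be spotted mechanically from the two log markers quoted in this bug: "Checking connectivity..." and "I am the actual vdsm". The following is a minimal sketch, assuming the `::`-separated vdsm.log layout seen in the excerpts above; `parse_timestamp` and `restart_after_check` are hypothetical helper names, not part of vdsm.

```python
from datetime import datetime

CHECK_MARKER = "Checking connectivity..."   # logged by setupNetworks
STARTUP_MARKER = "I am the actual vdsm"     # logged when the vdsm process starts


def parse_timestamp(line):
    """Return the timestamp field of a vdsm.log line, or None if it has none."""
    for field in line.split("::"):
        try:
            return datetime.strptime(field, "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue
    return None


def restart_after_check(lines):
    """Return the time between the last connectivity check seen and the
    first vdsm startup that follows it, or None if no such pair exists."""
    check_time = None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue                      # traceback lines carry no timestamp
        if CHECK_MARKER in line:
            check_time = stamp
        elif STARTUP_MARKER in line and check_time is not None:
            return stamp - check_time
    return None


if __name__ == "__main__":
    sample = [
        "MainProcess|Thread-379::DEBUG::2012-11-15 14:46:35,588::configNetwork"
        "::1358::setupNetworks::(setupNetworks) Checking connectivity...",
        "MainThread::INFO::2012-11-15 14:48:29,010::vdsm::70::vds::(run) "
        "I am the actual vdsm 4.9-42.0",
    ]
    print(restart_after_check(sample))
```

A non-None result only shows that vdsm started again after a connectivity check began; whether a rollback ran in between still has to be confirmed from the log itself.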

Comment 6 Antoni Segura Puimedon 2013-01-10 14:21:16 UTC

The UI block is not implemented, but rolling back only on startup instead of on vdsmd restart is fixed with http://gerrit.ovirt.org/#/c/10334/

Comment 7 Dan Kenigsberg 2013-04-11 09:30:50 UTC

Martin, could you check the behavior on a recent rhev-3.2, when you have a dedicated storage network?

Editing the storage network would never ever work for SPM, but hopefully, nowadays, you could edit the management network.

Comment 8 Martin Pavlik 2013-04-17 08:09:43 UTC

(In reply to comment #7)
> Martin, could you check the behavior on a recent rhev-3.2, when you have a
> dedicated storage network?
> 
> Editing the storage network would never ever work for SPM, but hopefully,
> nowadays, you could edit the management network.

With a dedicated storage network, the rhevm bridge settings (sf13.1, vdsm-4.10.2-15.0.el6ev.x86_64) roll back correctly. Logs are attached as log_collector2.

Comment 9 Martin Pavlik 2013-04-17 08:10:34 UTC

Created attachment 736709 [details]
log_collector2

Comment 10 Dan Kenigsberg 2014-02-25 10:05:34 UTC

(In reply to Dan Kenigsberg from comment #5)
> 
> Maybe we should also roll back configuration upon vdsm process startup,
> instead of upon the sysv service restart. This should not be done hastily; I
> am thus lowering the severity of the bug.

Since I wrote this comment, we have gone in the opposite direction: rollback no longer occurs during sysv service restart (and certainly not during process startup) but only during boot.

One of the motivations for this was exactly this flow: an SPM failover used to cause a rollback of unrelated network changes. http://gerrit.ovirt.org/10334

The bug has become much less acute since we no longer shut off all networking when rolling back a network configuration: we shut off only the relevant networks. Thus, this bug pops up only when configuring the storage network of the SPM node.

The remaining behavior is annoying, and would require a power cycle to fix, but I do not see any way we can avoid it, other than asking users not to configure their storage network on the SPM.
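
The idea of shutting off only the relevant networks during a rollback can be illustrated as a comparison between the running and the persisted configuration. This is a rough sketch under stated assumptions, not vdsm code: `networks_to_roll_back` is a hypothetical name, the dictionaries are a made-up representation of network settings, and the addresses come from the documentation range 192.0.2.0/24.

```python
def networks_to_roll_back(running, persisted):
    """Return the sorted names of networks whose running settings differ
    from the persisted (last known good) settings.

    Both arguments are hypothetical {network_name: settings_dict} mappings.
    A network present on only one side also counts as changed.
    """
    names = set(running) | set(persisted)
    return sorted(n for n in names if running.get(n) != persisted.get(n))


if __name__ == "__main__":
    persisted = {
        "rhevm": {"gateway": "192.0.2.1"},
        "storage": {"gateway": "192.0.2.129"},
    }
    running = {
        "rhevm": {"gateway": "192.0.2.77"},     # wrong gateway set by the user
        "storage": {"gateway": "192.0.2.129"},  # untouched
    }
    print(networks_to_roll_back(running, persisted))
```

With a selection like this, an unchanged storage network stays up during the rollback, which is why the problem is left only for the case where the storage network itself is the one being edited on the SPM.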