Bug 1059773 - Failure to replace a partially-removed network
Summary: Failure to replace a partially-removed network
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Dan Kenigsberg
QA Contact: Meni Yakove
URL:
Whiteboard: network
Depends On:
Blocks: rhev3.4rc
 
Reported: 2014-01-30 15:54 UTC by Allie DeVolder
Modified: 2019-04-28 10:46 UTC
CC List: 11 users

Fixed In Version: av8
Doc Type: Bug Fix
Doc Text:
Networks defined on top of NICs that become unavailable on the host can now be replaced or removed, instead of remaining stuck in that state indefinitely.
Clone Of:
Environment:
Last Closed: 2014-06-09 13:28:42 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
Comment (1.14 MB, text/plain), 2014-02-12 21:24 UTC, Greg Scott
Comment (249.92 KB, text/plain), 2014-02-12 21:28 UTC, Greg Scott


Links
Red Hat Product Errata RHBA-2014:0504 (SHIPPED_LIVE): vdsm 3.4.0 bug fix and enhancement update, last updated 2014-06-09 17:21:35 UTC
oVirt gerrit 25009

Description Allie DeVolder 2014-01-30 15:54:33 UTC
Description of problem:
When adding a new host to RHEV 3.3, RHEV-M configures a previously removed logical network on the hypervisor.

Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.45.el6ev.noarch

How reproducible:
very

Steps to Reproduce:
1. Remove unused logical network from RHEV
2. Add new host to RHEV

Actual results:
The logical network that was deleted from RHEV-M is added to the host

Expected results:
Only logical networks that still exist in RHEV-M are added to the host

Additional info:

Comment 1 Greg Scott 2014-02-04 14:33:39 UTC
See support case number 01023178 for details.  

The problem turns out to be nastier than we first thought.  Add a new host to a RHEV-M environment.  The host is reachable on its logical rhevm network.  Now use the RHEV-M GUI to set up the other logical networks for, say, storage.  Set up a logical network named iscsi and drag and drop it onto a NIC in the newly added host. 

Drop iscsi onto the wrong NIC in the host and save changes. This works.  But it's the wrong NIC.  Realize your mistake and go back into the GUI and try to fix it.  Drag iscsi to the correct NIC or make any other change in the GUI.  This fails with "Error while executing action Setup Networks:  failed to bring interface up."  It is no longer possible to make any network changes.  Any change to any network parameter generates the same error when you click OK.  

It is possible on the host to manually edit the relevant ifcfg-** config files and persist them.  But even after manual editing, removing the host from the RHEV-M GUI and re-adding it, the RHEV-M GUI still shows that host with the iscsi network connected to the wrong NIC.  The only workaround we could find is to give the host a new name and try again.

Looking at tail -f vdsm.log on the affected RHEV-H host shows more clues:

[root@rhev2 vdsm]# tail vdsm.log -f
Thread-64::DEBUG::2014-01-30 23:11:07,695::BindingXMLRPC::991::vds::(wrapper) return ping with {'status': {'message': 'Done', 'code': 0}}
Thread-63::ERROR::2014-01-30 23:11:08,044::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-63::DEBUG::2014-01-30 23:11:08,052::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}

Comment 2 Dan Kenigsberg 2014-02-12 12:59:19 UTC
As far as I understand, the issue reported in comment 0 is not a bug: if a host is returned to a cluster, it comes back with its old network configuration, which may well be unsynchronized.

Comment 1 describes a real bug, which I do not understand. Would you please attach more of your vdsm.log (recent setupNetworks commands and their responses) as well as supervdsm.log?

Comment 3 Greg Scott 2014-02-12 14:01:01 UTC
Unfortunately, I don't have access to the systems where this problem came up so I don't have a way to get current logs.  However the support case I referenced does have a few log collector archives and they should have copies of the log files from when the problem came up.  Do you have a way to get them from the support case?  I also have copies here - what is the best way to get them to you?

- Greg Scott
  gregscott

Comment 4 Dan Kenigsberg 2014-02-12 15:03:16 UTC
Bugzilla attachments (of only the requested logs) work best for me.

Comment 5 Greg Scott 2014-02-12 15:17:49 UTC
Ah - I just found the "Add an attachment" link.  I tried to attach a log collector report (13 MB) and it said it was successful, but I don't see anything now that says it's here.  

Tell ya what - try this.  Go here:

ftp://ftp.infrasupport.com/bugzilla

and just grab everything there.  There are a couple of log collector reports and some jpg files with screen shots and a PDF with everything in the support case.  I'll leave it all there for the day so you have time to grab it.  

- Greg

Comment 6 Greg Scott 2014-02-12 15:28:07 UTC
To make all that raw data make sense - 

There were two instances of the problem.  The first instance was on the host named rhev1.  I set up a network named iSCSI (all caps) and tried unsuccessfully to connect the wrong NIC on rhev1 to it.  I eventually removed the network named iSCSI from RHEV-M, but host rhev1 still kept its reference to it.  I was not able to get rid of the reference to the now bogus iSCSI network from that host.  I worked around that problem by setting up a new network named iscsi (lower case) and manually setting up the appropriate ifcfg-nnn files on host rhev1 and persisting them.  But to this day, host rhev1 still has the bogus reference to iSCSI (all caps) on an unused NIC.  

Later, I tried to connect host rhev2 to the environment.  And the NIC I thought was correct turned out to be the wrong one. But I was unable to save any changes to the interfaces on host rhev2.  Editing the config files by hand and persisting them on rhev2 connected everything appropriately, but I was still never able to make the RHEV GUI reflect what the ifcfg-nnn files said on host rhev2, even when removing and re-adding host rhev2.  

The only workaround was to change the hostname to rhev4 and be extra careful this time to connect the correct NICs to the correct networks.

And every time I tried to save changes from the RHEV-M GUI, VDSM blew up as I posted in my vdsm.log extract.

- Greg

Comment 7 Dan Kenigsberg 2014-02-12 16:49:40 UTC
Greg, would you help me with the log digging, and provide a bigger extract from vdsm.log and the matching part of supervdsm.log?

Comment 8 Greg Scott 2014-02-12 20:20:52 UTC
Yes.  I was out for a while at a customer site.  Let me see what I can do.  I also have to work a couple of other customer issues today.

- Greg

Comment 9 Greg Scott 2014-02-12 20:25:32 UTC
This might be helpful.  I'm pasting in an entry from the support case.  I clicked "OK" in one window while I had tail -f vdsm.log in another window.  This doesn't have any supervdsm stuff but it's at least a starting point and you can see when the programs blow up on the host when I click OK in the GUI.

I posted this on Jan. 30 at 5:19:29 PM.

****************************************************

This one just got more serious.  Trying to bring in the next host, rhev2, we made a config mistake.  We originally put the "people" rhevm network on eth5 and the iscsi network on eth6.  We fixed it - rhevm should be on eth4 and iscsi on eth5.

I set up ifcfg-iscsi and ifcfg-eth5 on the host and persisted them.  They work.  The host can ping all the stuff it needs to ping.

But I'm not able to tell the GUI to move the iscsi network from eth6 to eth5.  Trying to save any changes blows up with the same errors we saw with rhev1.  Here is a tail -f vdsm.log from host rhev2.  Every time I click OK, I see this stack dump on rhev2.

[root@rhev2 vdsm]# tail vdsm.log -f
Thread-64::DEBUG::2014-01-30 23:11:07,695::BindingXMLRPC::991::vds::(wrapper) return ping with {'status': {'message': 'Done', 'code': 0}}
Thread-63::ERROR::2014-01-30 23:11:08,044::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-63::DEBUG::2014-01-30 23:11:08,052::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}



Thread-65::DEBUG::2014-01-30 23:13:47,795::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'remove': 'true'}}, {}, {'connectivityCheck': 'true', 'connectivityTimeout': 120}) {} flowID [1e306236]
Thread-66::DEBUG::2014-01-30 23:13:47,807::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call ping with () {} flowID [1e306236]
Thread-66::DEBUG::2014-01-30 23:13:47,808::BindingXMLRPC::991::vds::(wrapper) return ping with {'status': {'message': 'Done', 'code': 0}}
Thread-65::ERROR::2014-01-30 23:13:48,118::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-65::DEBUG::2014-01-30 23:13:48,119::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
Thread-67::DEBUG::2014-01-30 23:13:57,593::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setSafeNetworkConfig with () {} flowID [34dea9b9]
Thread-67::DEBUG::2014-01-30 23:13:57,712::BindingXMLRPC::991::vds::(wrapper) return setSafeNetworkConfig with {'status': {'message': 'Done', 'code': 0}}
Thread-68::DEBUG::2014-01-30 23:14:21,545::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'remove': 'true'}}, {}, {'connectivityCheck': 'true', 'connectivityTimeout': 120}) {} flowID [11e1ea9]
Thread-69::DEBUG::2014-01-30 23:14:21,547::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call ping with () {} flowID [11e1ea9]
Thread-69::DEBUG::2014-01-30 23:14:21,548::BindingXMLRPC::991::vds::(wrapper) return ping with {'status': {'message': 'Done', 'code': 0}}
Thread-68::ERROR::2014-01-30 23:14:21,832::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-68::DEBUG::2014-01-30 23:14:21,833::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
Thread-70::DEBUG::2014-01-30 23:14:28,734::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'nic': 'eth5', 'netmask': '255.255.255.0', 'ipaddr': '192.168.5.22', 'bridged': 'true', 'STP': 'no'}}, {}, {'connectivityCheck': 'true', 'connectivityTimeout': 120}) {} flowID [4718cfe1]
Thread-71::DEBUG::2014-01-30 23:14:28,736::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call ping with () {} flowID [4718cfe1]
Thread-71::DEBUG::2014-01-30 23:14:28,736::BindingXMLRPC::991::vds::(wrapper) return ping with {'status': {'message': 'Done', 'code': 0}}
Thread-70::ERROR::2014-01-30 23:14:29,003::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-70::DEBUG::2014-01-30 23:14:29,003::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
Thread-72::DEBUG::2014-01-30 23:14:36,574::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'nic': 'eth5', 'netmask': '255.255.255.0', 'ipaddr': '192.168.5.22', 'bridged': 'true', 'STP': 'no'}}, {}, {'connectivityCheck': 'false'}) {} flowID [1d703fd3]
Thread-72::ERROR::2014-01-30 23:14:36,843::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-72::DEBUG::2014-01-30 23:14:36,844::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
Thread-73::DEBUG::2014-01-30 23:14:42,135::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'nic': 'eth5', 'netmask': '255.255.255.0', 'ipaddr': '192.168.5.22', 'bridged': 'true', 'STP': 'no'}}, {}, {'connectivityCheck': 'false'}) {} flowID [4b7d85f1]
Thread-73::ERROR::2014-01-30 23:14:42,402::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-73::DEBUG::2014-01-30 23:14:42,402::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
Thread-74::DEBUG::2014-01-30 23:14:55,580::BindingXMLRPC::984::vds::(wrapper) client [172.16.5.210]::call setupNetworks with ({'iscsi': {'nic': 'eth5', 'netmask': '255.255.255.0', 'ipaddr': '192.168.5.22', 'bridged': 'true', 'STP': 'no'}}, {}, {'connectivityCheck': 'false'}) {} flowID [63ce8b6b]
Thread-74::ERROR::2014-01-30 23:14:55,852::API::1275::vds::(setupNetworks)
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1273, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
ConfigNetworkError: (29, '')
Thread-74::DEBUG::2014-01-30 23:14:55,853::BindingXMLRPC::991::vds::(wrapper) return setupNetworks with {'status': {'message': '', 'code': 29}}
^C
[root@rhev2 vdsm]#

Comment 10 Greg Scott 2014-02-12 21:24:41 UTC
Created attachment 915849 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 11 Greg Scott 2014-02-12 21:28:07 UTC
Created attachment 915850 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 12 Dan Kenigsberg 2014-02-19 12:48:36 UTC
Greg, such huge Bugzilla comments are unmanageable. Please use attachments in the future.

I did notice:

MainProcess|Thread-48::DEBUG::2014-01-24 01:52:23,540::supervdsmServer::95::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({'iSCSI': {'nic': 'eth2', 'netmask': '255.255.255.0', 'ipaddr': '192.168.5.21', 'gateway': '0.0.0.0', 'bridged': 'false'}}, {}, {'connectivityCheck': 'true', 'connectivityTimeout': 120}) {}

MainProcess|Thread-48::DEBUG::2014-01-24 01:52:23,704::ifcfg::273::root::(_persistentNetworkBackup) backing up network iSCSI: <network>
  <name>vdsm-iSCSI</name>
  <uuid>39bcc561-cd1d-08fb-0c0e-c45db83cb695</uuid>
  <forward dev='eth1' mode='passthrough'>
    <interface dev='eth1'/>
  </forward>
</network>

MainProcess|Thread-48::DEBUG::2014-01-24 01:52:23,852::utils::489::root::(execCmd) u'/sbin/ifup eth1' (cwd None)
MainProcess|Thread-48::DEBUG::2014-01-24 01:52:23,879::utils::509::root::(execCmd) FAILED: <err> = '/sbin/ifup: configuration for eth1 not found.\nUsage: ifup <device name>\n'; <rc> = 1

It seems that ifcfg-eth1 was manually deleted while the libvirt network vdsm-iSCSI still referred to it. I do not yet understand why this makes it impossible to remove that network. Could this be the case? Do you need help to remove the remnant?

It would be nicer if Vdsm were made more robust to handle such conditions, but it does not seem very urgent, and we have no capacity to solve this for 3.4.0.
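
For reference, a rough way to confirm such a remnant on the host (read-only virsh access is assumed; the network name vdsm-iSCSI, the device eth1, and the ifcfg path are taken from the log above):

  virsh -r net-list --all                         # lists the vdsm-* networks defined in libvirt
  virsh -r net-dumpxml vdsm-iSCSI                 # shows the forward/interface reference to eth1
  ls /etc/sysconfig/network-scripts/ifcfg-eth1    # missing here, hence "configuration for eth1 not found"

If the second command still names a device that the third command cannot find, the host is carrying exactly this kind of remnant.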

Comment 13 Greg Scott 2014-02-20 16:00:29 UTC
I tried attaching log file archives and that didn't work - that's why I pasted that whole monster in.  Believe me, it wasn't easy but at least now you have the log data you asked for.  Next time I'll try attaching again.  

All the data I collected about the problem is still sitting on the ftp site I mentioned in comment number 5.  And attached to the support case I mentioned.  And I documented suggested steps to reproduce the problem in comment number 1.  

> It would be nicer if Vdsm were made more robust to handle such conditions, but 
> it does not seem very urgent, and we have no capacity to solve this for 3.4.0.

Hopefully I can change your mind on the urgency.  From where I sit, this problem is critical.  It needs to be addressed right now.  This problem will bury the Red Hat support team if not addressed.

Any time anyone makes a configuration mistake on any RHEV 3.3 host, you can no longer modify ***any*** network parameter for that host from the RHEV-M GUI.  Every time you try to save any changes, the GUI returns the error I described earlier and VDSM blows up with the stack trace in the above monster log.  Even editing the ifcfg-ethnn files by hand and persisting them does not guarantee success.

On hosts with multiple NICs and unpredictable ethnn NIC names, it's not easy to predict which ethnn device belongs to which logical network.  So users will make configuration mistakes.  And every time that happens - every time a user assigns a logical network to the wrong NIC - somebody with a tight deadline will call the support line, and this will tie up a senior support person and the customer in a time-consuming troubleshooting process.

I know of no trustworthy workarounds - that host is stuck **forever** in its messed-up state, even if you remove it and add it back into the environment again.

Imagine this scenario - you're a customer setting up RHEV 3.3 for the first time.  You've been working all night and you have to be up and running by 8AM so you can start your V2V or P2V migrations. Hundreds of people depend on this all working. 

The time right now is 3 AM and you can't get your host fully in the data center because you can't save any network changes.  After spending a fruitless hour retracing your steps and looking at documentation and finding nothing useful, you call the support line and eventually connect to somebody in Australia who hasn't heard of the problem.  He does his homework, eventually finds this bug, and by 6AM or so, you come up with the only viable workaround - remove the host, give it a new hostname, and add it back into the environment, this time being more careful with which NICs you assign to what logical networks.  And now it's 6AM and you have 2 hours left to get everything else done that should have happened hours ago.  

After that experience, what will happen to your confidence with this solution?  You gotta fix this ASAP.

- Greg

Comment 14 Dan Kenigsberg 2014-02-20 18:38:06 UTC
Greg, I am not sure that I completely understand the problem. Had the local admin not messed with the ovirt-managed ifcfg files, would there be any bug?

For this specific case, would

  virsh -c qemu+tls://fqdn-of-local-host/system net-destroy vdsm-iSCSI
  virsh -c qemu+tls://fqdn-of-local-host/system net-undefine vdsm-iSCSI

solve the issue? Whoever deletes an ovirt-managed ifcfg file should remember to delete the libvirt network referring to it.

Comment 15 Greg Scott 2014-02-21 05:57:10 UTC
> Greg, I am not sure that I completely understand the problem. Had the local 
> admin not messed with the ovirt-managed ifcfg files, would there be any bug?

Yes. There would still be a bug.  It showed up in two different forms.  

And sometimes you have to mess with the ifcfg files by hand to figure out which ethnnn device is which.  This will still be true even with the new device names in RHEL 7.
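
Roughly the kind of poking around I mean - nothing RHEV-specific, just stock RHEL tools (assuming ethtool is on the host) to help map physical ports to ethN names:

  ip -o link show                        # one line per NIC, with its MAC address
  ethtool -p eth1 15                     # blink the port LED on eth1 for 15 seconds (not every driver supports -p)
  ethtool eth1 | grep 'Link detected'    # see whether a cable is plugged into eth1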

At this customer, with host rhev1 the local admin messed with the ifcfg-nnn files for exactly that reason - to find out what was connected to what.  But the overall sequence of events is more complicated. 

I knew NIC eth0 was connected to the rhevm logical network and I thought NIC eth1 was connected to the storage network.  So I set it up that way in RHEV-M. I am 60 miles away from that site so everything I did was remote.  I needed people onsite to connect NICs to switch ports.

The next day, Jesse locally set up some ifcfg files by hand on that host.  He didn't persist them though.  And then I went in and set up the GUI with RHEV-M.  Or I tried to.  But I could never save changes.  Even after rebooting the host, which would have gotten rid of all the ifcfg files that were not persisted.  

I eventually got host rhev1 up and running by abandoning my iSCSI (all caps) logical network and setting up and persisting ifcfg files myself with a new bridge named iscsi (lower case). And then I defined an iscsi (lower case) network from the RHEV-M side to match up with what I did by hand on host rhev1.

To this very day, RHEV-M shows an iSCSI (all caps) network associated with unused NIC eth1 on host rhev1.  But there is no longer any iSCSI (all caps) network defined in RHEV-M.  Sooner or later, that will become a problem because they will want to team an existing NIC with eth1. 

With host rhev2, I did not touch any ifcfg files on the host at first.  Jesse installed RHEV-H from CD and went through the console admin menus to assign an IP address and add the host to the data center.  I assigned the iscsi network to what I thought was the correct NIC, but it was the wrong one.  So then I put together ifcfg-nnn files by hand and pinged the storage array from each NIC to find the correct one.  After I found the correct NIC, I got rid of the ifcfg files and tried to set it up that way from RHEV-M.  But I could not save any changes from the GUI.

So I renamed host rhev2 to rhev4 and redid the networking.  This time it worked.  

But if I ever try to put in a new host named rhev2, those incorrect network settings are going to show up again because they're apparently still in the RHEV-M database somewhere.  

> For this specific case, would
>
>   virsh -c qemu+tls://fqdn-of-local-host/system net-destroy vdsm-iSCSI
>   virsh -c qemu+tls://fqdn-of-local-host/system net-undefine vdsm-iSCSI
>
> solve the issue? 

I don't know.  The hosts are in production now and I don't have any easy way to get inside that customer environment any more.  

> Whoever deletes an ovirt-managed ifcfg file should remember to delete the 
> libvirt network referring to it.

Using the above command, right?  Because I tried to get rid of that bogus iSCSI (all caps) network from the RHEV-M GUI.  I could make it go away from the RHEV-M environment but not the host.  I wish I had those virsh commands in my back pocket when this problem bit us.  

OK - so having typed all this - here's what happened on host rhev1:

Boot and build rhev1 from a local CD.

Go through the admin menu from the console.  Set network parameters for NIC eth0.  It joins the rhevm bridge.  

Still on the console, associate this host with my new data center.

From the RHEV-M GUI, set up a logical network named iSCSI (all caps).  Still from the GUI, assign the iSCSI (all caps) network to NIC eth1.  This will create and persist an ifcfg-eth1 file on host rhev1.

But nothing is physically connected to NIC eth1.

From the GUI, remove logical network iSCSI (all caps) from eth1 and save changes.  This blows up.  

Make any other networking change from the GUI to host rhev1 - they all blow up.

Comment 16 Dan Kenigsberg 2014-02-21 10:22:04 UTC
(In reply to Greg Scott from comment #15)
> > Greg, I am not sure that I completely understand the problem. Had the local 
> > admin not messed with the ovirt-managed ifcfg files, would there be any bug?
> 
> Yes. There would still be a bug.  It showed up in two different forms.

Greg, it's comment 16, and I still do not understand what the bug has been. I suppose it has something to do with unstable NIC names on the host. Could you explain it from the top?

Please note that comment 0 describes a known behavior and not a real bug: if you add a host to a cluster, and that host has network configurations from eons back, they would show up (but they have to be removable via rhevm!). Since this BZ is now too heavy to reliably load in my browser, I'd appreciate a clean start.

Let us deal with one issue at a time. If the customer truly has a burning issue about this unremovable network, please convey my command-line tips to him.

Comment 17 Greg Scott 2014-02-22 04:56:47 UTC
Forget comment 0. That's not even the main issue and sorry for the confusion.  Although if we had the ability to completely remove a host from the database as if it never existed, that capability would be helpful and might work around the problem described in comment 1.

That's why I put in comment 1.  **That** is the real problem.  Comment 1.  If anything doesn't go as planned when setting up networking on a host, the RHEV-M GUI becomes useless: you can't save changes any more because VDSM on the host blows up.  If we didn't have that problem, we would never care about comment 0.

So think of comment 1 as the egg and comment 0 as the chicken.  Or the other way around.  

And of course I will pass along your workaround to my customer - thanks - but we still have the fundamental problem that VDSM blows up on hosts, which renders the RHEV-M GUI useless for making networking changes.  We worked around it at this customer but I'm afraid to do 3.3 anywhere else because of this problem.  

I was thinking about your comment about the ifcfg files. VDSM is unhappy because an ifcfg file it cares about is missing - right?

Consider:

On host rhev2, I set up NIC eth4 to connect to the iscsi (lower case) logical network.  That creates an ifcfg-eth4 and ifcfg-iscsi on the host, right?

But now I realize the iscsi network should connect to a different ethnnn device.

So I go to the RHEV-M GUI and drag the iscsi network from eth4 to, say, eth5.  This should get rid of ifcfg-eth4 on the host, set up a new ifcfg-eth5, and modify ifcfg-iscsi - yes?

What if something inside VDSM on the host still expects to see that now obsolete ifcfg-eth4 but doesn't find it?  Is that why it blows up - because it can't find an ifcfg file it cares about?
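
If I have the layout right, the pair of files behind a bridged iscsi network would look roughly like this (the NIC name, address, and netmask are the ones from the setupNetworks call in my log extract; this is a sketch of the ifcfg layout, not necessarily byte-for-byte what VDSM writes):

  /etc/sysconfig/network-scripts/ifcfg-eth5      # the NIC, enslaved to the bridge
    DEVICE=eth5
    ONBOOT=yes
    BRIDGE=iscsi
    NM_CONTROLLED=no

  /etc/sysconfig/network-scripts/ifcfg-iscsi     # the bridge itself, carrying the IP
    DEVICE=iscsi
    TYPE=Bridge
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=192.168.5.22
    NETMASK=255.255.255.0
    STP=no
    NM_CONTROLLED=no

So dragging iscsi from eth4 to eth5 would have to delete one NIC file, create the other, and leave the bridge file mostly alone.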

- Greg

Comment 19 Dan Kenigsberg 2014-04-22 07:37:43 UTC
Evgheni, could you explain how 01055166 relates to this bug? (I really do not understand). Do you see there a failure to replace a partially-removed network?

In any case, the code relating to partially-removed networks has been backported to the 3.4 branch within http://gerrit.ovirt.org/26224 (which would be part of av7, I believe).

Comment 20 Evgheni Dereveanchin 2014-04-22 08:31:21 UTC
Dan, that case is probably unrelated indeed; the case there involves an error with an existing logical network, not one removed from RHEV-M. I unlinked it.

Comment 21 Nir Yechiel 2014-04-23 12:31:03 UTC
We were not able to reproduce the issue as reported in comment #1. That said, during the investigation of this bug we found an issue with VDSM where there is a failure to replace a partially-removed network, i.e., a network that exists in libvirt but whose associated interface does not exist. This issue is fixed now and will be validated by QA.

Greg, can you somehow help us reproduce it again?

Nir

Comment 22 Greg Scott 2014-04-23 14:36:04 UTC
Let me see what I can do. I don't have access to the system where the problem first came up so I'll have to use some creativity to find hardware.  

- Greg

Comment 23 Meni Yakove 2014-04-29 16:09:44 UTC
Can you please provide "how to reproduce" in order to verify your fix?

Comment 24 Dan Kenigsberg 2014-04-29 17:08:06 UTC
My patch fixes a situation where a network is defined on top of eth1, and then eth1 becomes unavailable on the host (due to `ip link set dev eth1 name eth111` or hot unplug).
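
A minimal way to set up that condition on a test host (eth1/eth111 are just example names; the link has to be down before it can be renamed):

  ip link set dev eth1 down
  ip link set dev eth1 name eth111
  ip link set dev eth111 up

After this, eth1 no longer exists on the host while the network defined on top of it does, and trying to replace or remove that network exercises the fixed path.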

Comment 25 Greg Scott 2014-04-29 18:31:48 UTC
See comment 15 for the steps I took to uncover the problem.  I'll generalize those below, based on what I put in comment 15 a few months ago.

1.  Set up and advertise an iSCSI volume somehow. Put it in its own subnet.

2.  Build RHEV-M 3.3 on a system and create a new iSCSI data center.

3.  Add the first host to the environment.  Name the host rhev1.  Use a host with 4 or more NIC slots.

4.  From the rhev1 console menu, assign an appropriate IP address to a known NIC, say, eth0.  Still on the console menu, connect that host to your new data center.  NIC eth0 will join the rhevm logical network and this new host should pop up in the RHEV-M GUI.

5.  Now from the RHEV-M GUI, set up a logical network named iSCSI (all caps).

6.  In the RHEV-M GUI, go to the network pane for host rhev1 and connect a rhev1 NIC to the logical network named iSCSI.  Host rhev1 has 4 NICs; NIC eth0 is already in use for the rhevm logical network, leaving 3 available NICs to try for the iSCSI network.  Pick a NIC at random, say eth1, and drag and drop it onto your new iSCSI logical network.  Save your changes.

7.  Physically connect some NIC other than eth1 to the iSCSI storage network.  So the GUI says one NIC is connected, but the GUI does not match what is really connected. 
(You will need to cheat a little bit here.  From the rhev1 console or an ssh session, temporarily set up ifcfg-nnn files for each available NIC.  One at a time, ifup each one, ping the storage, then ifdown it.  Do not persist anything.  Delete your temporary ifcfg-nnn files once you find the NIC you physically connected and now know its name.  A rough sketch of this probing appears after this list.)

8.  Back to the RHEV-M GUI, remove logical network iSCSI (all caps) from eth1 and save changes.  This blows up.  

9.  Make any other networking change from the GUI to host rhev1 - they all blow up.

10.  From an ssh session on host rhev1, do tail -f vdsm.log as you try to save changes.  You'll see stack traces in vdsm.log every time you try to save any network change in the GUI.

11.  Host rhev1 is now messed up forever.  You can't make any changes and you can't get rid of it.  The only viable course of action is to give rhev1 a new hostname and add it back into the data center, this time being careful which NICs you choose for which logical networks.

12.  The problem is not limited to the first host - it also happens on subsequent hosts.  
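
A rough sketch of the probing in step 7 (the storage address 192.168.5.100 is made up - substitute your iSCSI target - and the host IP/netmask are the ones from my logs; adjust the NIC list to your hardware):

  for nic in eth1 eth2 eth3; do
      f=/etc/sysconfig/network-scripts/ifcfg-$nic
      printf 'DEVICE=%s\nBOOTPROTO=none\nIPADDR=192.168.5.22\nNETMASK=255.255.255.0\nONBOOT=no\n' "$nic" > "$f"
      ifup "$nic"
      ping -c 3 -I "$nic" 192.168.5.100 && echo "$nic is the one cabled to the storage network"
      ifdown "$nic"
      rm -f "$f"    # do not persist anything
  done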

- Greg

Comment 26 Greg Scott 2014-04-29 18:48:56 UTC
Looking over what Nir said in comment 21 and trying to retrace my steps - the customer site is around 60 miles away and I did everything remotely from here.  I talked people onsite through booting from DVD, doing the RHEV-H installation, and going through the console menus to connect to RHEV-M.  Once I could see the RHEV-H host either via ssh or RHEV-M GUI, I did the rest.  

It is possible somebody did some things on these hosts I did not know about, especially around connecting cables to networks.  That might explain the partially removed network. 

I still have all those logs sitting in the ftp site I mentioned in comment 5.

- Greg

Comment 27 Meni Yakove 2014-04-30 13:40:48 UTC
Attached a network to eth1 and renamed eth1.

vdsm-4.14.7-0.2.rc.el6ev.x86_64

Comment 28 errata-xmlrpc 2014-06-09 13:28:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

