Bug 1323782 - vdsm-restore-network leaves inconsistent RunningConfig after a failed restoration
Summary: vdsm-restore-network leaves inconsistent RunningConfig after a failed restoration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assignee: Edward Haas
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks: 1332027
 
Reported: 2016-04-04 16:59 UTC by Prasanth
Modified: 2016-08-23 20:15 UTC
CC: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1332027 (view as bug list)
Environment:
Last Closed: 2016-08-23 20:15:32 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
supervdsm log attached (14.93 MB, text/plain) - 2016-04-13 19:16 UTC, Prasanth
Supervdsm.log_new (469.96 KB, text/plain) - 2016-06-14 21:52 UTC, Prasanth


Links
Red Hat Product Errata RHEA-2016:1671 (normal, SHIPPED_LIVE): VDSM 4.0 GA bug fix and enhancement update - 2016-09-02 21:32:03 UTC
oVirt gerrit 56257 (master, MERGED): net: Clean running config on unified restoration failure - 2020-04-30 15:37:03 UTC
oVirt gerrit 56322 (ovirt-3.6, MERGED): net: Clean running config on unified restoration failure - 2020-04-30 15:37:03 UTC

Description Prasanth 2016-04-04 16:59:10 UTC
Description of problem:
Trying to add the hypervisor to RHEV-M, the installation fails with "Command returned failure code 1 during SSH connection".
On further investigation, we believe the vdsmd service is unable to start on the hypervisor and is failing with a dependency error.

Version-Release number of selected component (if applicable):
RHEV-M 3.6.3.4-0.1.el6
RHEV-H 7.2-20160302.el7ev
vdsm-4.17.23-0.el7ev.noarch

How reproducible:
It's very consistent. 

Steps to Reproduce:
1. Run the latest RHEV-M and RHEV-H.
2. Add the RHEV-H host to RHEV-M.
3. Installation of RHEV-H in RHEV-M fails.

Actual results:

[root@ictm1612s01h2 ~]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

Apr 01 19:31:22 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Stopped Virtual Desktop Server Manager.
Apr 01 19:31:49 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Apr 01 19:31:49 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Apr 01 19:52:16 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Apr 01 19:52:16 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Apr 01 20:00:43 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Apr 01 20:00:43 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Apr 04 16:27:22 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Stopped Virtual Desktop Server Manager.
Apr 04 16:27:58 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Apr 04 16:27:58 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.

[root@ictm1612s01h2 ~]# systemctl start vdsmd
A dependency job for vdsmd.service failed. See 'journalctl -xe' for details.

[root@ictm1612s01h2 ~]# journalctl -xe
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 parse_server_challenge()
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 ask_user_info()
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 client step 2
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 ask_user_info()
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 make_client_response()
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 client step 3
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 client mech dispose
Apr 04 16:41:26 ictm1612s01h2.ict.englab.netapp.com python[16031]: DIGEST-MD5 common mech dispose
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: Traceback (most recent call last):
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/share/vdsm/vdsm-restore-net-config", line 429, in <module>
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: restore(args)
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/share/vdsm/vdsm-restore-net-config", line 392, in restore
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: unified_restoration()
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/share/vdsm/vdsm-restore-net-config", line 114, in unified_restoration
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: _remove_networks_in_running_config()
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/share/vdsm/vdsm-restore-net-config", line 199, in _remove_networks_in_running_config
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: _inRollback=True)
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/share/vdsm/network/api.py", line 932, in setupNetworks
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: "system" % network)
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: network.errors.ConfigNetworkError: (27, "Cannot delete network 'ovirtmgmt': It doesn't exist in the system")
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: Traceback (most recent call last):
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/bin/vdsm-tool", line 219, in main
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: return tool_command[cmd]["command"](*args)
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 41, in restore_command
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: exec_restore(cmd)
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 54, in exec_restore
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: raise EnvironmentError('Failed to restore the persisted networks')
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com vdsm-tool[16024]: EnvironmentError: Failed to restore the persisted networks
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Failed to start Virtual Desktop Server Manager network restoration.
-- Subject: Unit vdsm-network.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit vdsm-network.service has failed.
--
-- The result is failed.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
-- Subject: Unit vdsmd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit vdsmd.service has failed.
--
-- The result is dependency.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Dependency failed for MOM instance configured for VDSM purposes.
-- Subject: Unit mom-vdsm.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mom-vdsm.service has failed.
--
-- The result is dependency.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job mom-vdsm.service/start failed with result 'dependency'.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: Unit vdsm-network.service entered failed state.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com systemd[1]: vdsm-network.service failed.
Apr 04 16:41:27 ictm1612s01h2.ict.englab.netapp.com polkitd[18527]: Unregistered Authentication Agent for unix-process:16010:25191529 (system bus name :1.6163, object path /org/freedesktop/PolicyKit1/Authent
lines 4053-4110/4110 (END)


Expected results:
The RHEV-H host, when added to RHEV-M, should install successfully and come up with vdsmd running.

Additional info:
We believe the issues discussed in the links below (from Red Hat/oVirt users) look the same as the one above.

http://permalink.gmane.org/gmane.comp.emulators.ovirt.user/32750
http://lists.ovirt.org/pipermail/users/2015-January.txt
https://www.mail-archive.com/users@ovirt.org/msg31519.html

Comment 1 Dan Kenigsberg 2016-04-07 12:24:18 UTC
Would you please attach /var/log/vdsm/supervdsm.log ?

Comment 2 Prasanth 2016-04-13 19:16:37 UTC
Created attachment 1146940 [details]
supervdsm log attached

Comment 3 Edward Haas 2016-04-17 19:51:59 UTC
Investigation of the supervdsm log showed at least two issues.

- The first is related to IPv6 being used and reported at boot. The persistent and kernel configurations are compared and wrongly seen as different because of the IPv6 address present in the kernel. The difference causes these networks to be removed and re-added.
This issue has been resolved in vdsm-4.17.25 by ignoring IPv6 addresses in the comparison.
As a workaround, IPv6 can be disabled on the host if that is possible.
This issue is not severe, as it is limited to the initial host boot and causes just a reconfiguration of existing networks.

- The second issue is caused by a failure to restore or reconfigure a network after the host boots. In our case, the attempt to reconfigure the ovirtmgmt network fails after ifup times out (no link?).
This failure leaves the running config containing non-existent networks, and subsequent attempts to restore the networks always fail because they cannot remove them.
A fix for this issue has been submitted for review.
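
The merged patches are titled "net: Clean running config on unified restoration failure" (gerrit 56257/56322). The sketch below only illustrates that idea; the class and function names (the RunningConfig stand-in, restore_with_cleanup, do_unified_restoration) are hypothetical and do not reproduce the actual vdsm code:

import logging

class RunningConfig(object):
    """Minimal stand-in for vdsm's running network configuration: two dicts
    plus a save() hook. The real vdsm class also persists to disk."""
    def __init__(self, networks=None, bonds=None):
        self.networks = networks if networks is not None else {}
        self.bonds = bonds if bonds is not None else {}

    def save(self):
        logging.info('saving running config: networks=%s bonds=%s',
                     self.networks, self.bonds)


def restore_with_cleanup(running_config, do_unified_restoration):
    """Run the unified restoration; if it fails, clear the running config so
    that the next restore attempt does not try to delete networks that do not
    exist on the system (the ConfigNetworkError shown in the description)."""
    try:
        do_unified_restoration()
    except Exception:
        logging.exception('unified restoration failed; clearing running config')
        running_config.networks.clear()
        running_config.bonds.clear()
        running_config.save()
        raise

With a cleanup of this kind in place, a boot-time failure like the one above would no longer leave 'ovirtmgmt' behind in the running config, so the next run of vdsm-restore-net-config would not fail on removing a network that does not exist.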

Comment 4 Prasanth 2016-04-21 14:44:47 UTC
Is there a tentative date for when the fix will be released?

Comment 6 Prasanth 2016-05-04 20:01:26 UTC
It's completely blocking our testing and the release of support for our products. If you can provide a tentative date for the fix coming with oVirt 4.0, we can move on to other things and come back to it at the right time.

Comment 7 Dan Kenigsberg 2016-05-05 07:38:56 UTC
Prasanth, the next beta build is going to have a fix for the secondary issue, but I wonder how this blocks your further testing. Does it reproduce often on other hosts? Can you not disable IPv6 on your host?

You may also clear the content of /var/lib/vdsm/persistence and /etc/libvirt/qemu/networks/vdsm-* and reboot to start afresh.
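
For reference, the cleanup described above amounts to running something like the following as root on the host (these exact commands are only a suggested reading of that advice, not taken from this bug):

rm -rf /var/lib/vdsm/persistence/*
rm -f /etc/libvirt/qemu/networks/vdsm-*
reboot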

Comment 8 Yaniv Lavi 2016-05-05 08:56:42 UTC
(In reply to Prasanth from comment #6)
> It's completely blocking our testing and the release of support for our
> products. If you can provide a tentative date for the fix coming with oVirt
> 4.0, we can move on to other things and come back to it at the right time.

Can you try to work with the nightly builds? It should be fixed there.

Comment 9 Prasanth 2016-05-06 17:57:29 UTC
(In reply to Dan Kenigsberg from comment #7)
> Prasanth, the next beta build is going to have a fix for the secondary
> issue, but I wonder how this blocks your further testing. Does it reproduce
> often on other hosts? Can you not disable IPv6 on your host?
> 
> You may also clear the content of /var/lib/vdsm/persistence and
> /etc/libvirt/qemu/networks/vdsm-* and reboot to start afresh.

Dan, 
     It's very persistent on all three hypervisors (7.2, 7.1 and 7.0) I'm running. Also, IPv6 is not getting disabled via the GUI option, and I had problems with sysctl as well, but once it is set to static it has been going fine so far.

Comment 10 Dan Kenigsberg 2016-05-08 09:04:52 UTC
Three independent reproductions sounds very odd (and worrying). What happens when you remove /var/lib/vdsm/persistence/* /etc/libvirt/qemu/networks/vdsm-* and reboot?

You can also add ipv6.disable=1 to your kernel command line.
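
One common way to add that argument persistently on a RHEL 7 host is via grubby (a suggested command, not part of this bug report), followed by a reboot:

grubby --update-kernel=ALL --args="ipv6.disable=1"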

Comment 11 Prasanth 2016-05-09 15:04:36 UTC
(In reply to Dan Kenigsberg from comment #10)
> Three independent reproductions sounds very odd (and worrying). What happens
> when you remove /var/lib/vdsm/persistence/*
> /etc/libvirt/qemu/networks/vdsm-* and reboot?
> 
> You can also add ipv6.disable=1 to your kernel command line.

Dan,
    The hypervisors are up and optimal after removing the files you mentioned and disabling IPv6. As I said, it was very consistent before.
I'm also building a completely new configuration in a different lab. Let me see if I hit this issue again without using the workaround provided.
Will keep you posted.

Comment 12 Michael Burman 2016-06-09 12:41:09 UTC
Verified on - 4.0.0.2-0.1.el7ev with 
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
rhevh7-ng-4.0-0.20160607.0+1

Comment 13 Prasanth 2016-06-14 21:49:19 UTC
I updated to RHEV-M 3.6.6.2-0.1.el6 and installed a new RHEV-H 7.2 (20160516.0.el7ev).

Disabled IPv6 on the kernel command line at boot, and the hypervisor still fails to be added.

Retried after removing /var/lib/vdsm/persistence/* and /etc/libvirt/qemu/networks/vdsm-*, then rebooted with IPv6 disabled; that didn't work either.

Comment 14 Prasanth 2016-06-14 21:52:12 UTC
Created attachment 1167972 [details]
Supervdsm.log_new

Comment 15 Edward Haas 2016-07-25 12:43:51 UTC
This now looks like a different problem; please open a new bug if the below is not helpful.

The setup of the ovirtmgmt network fails because there was no response from the DHCP server for 120 seconds, after which VDSM reverts the changes and restores the original configuration/ifcfg file.

The initial em1 interface had no DHCP configured on it (I would expect BOOTPROTO=dhcp), so I am not sure what was there before.
DEVICE="em1"
HWADDR="b8:ca:3a:68:3b:a0"
IPV6INIT="no"
IPV6_AUTOCONF="no"
NM_CONTROLLED="no"
ONBOOT="yes"
PEERNTP="yes"

Perhaps you should try a static IP and check whether it works for you, and then go back and debug DHCP (if the static one worked).
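
For comparison, an em1 ifcfg file actually set up for DHCP would normally carry a BOOTPROTO line. The snippet below is an assumed example (not taken from the attached logs), keeping the other values from the file above:

DEVICE="em1"
HWADDR="b8:ca:3a:68:3b:a0"
BOOTPROTO="dhcp"
IPV6INIT="no"
IPV6_AUTOCONF="no"
NM_CONTROLLED="no"
ONBOOT="yes"
PEERNTP="yes"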

Comment 16 Prasanth 2016-08-04 20:33:04 UTC
(In reply to Edward Haas from comment #15)
> This now looks like a different problem; please open a new bug if the below
> is not helpful.
> 
> The setup of the ovirtmgmt network fails because there was no response from
> the DHCP server for 120 seconds, after which VDSM reverts the changes and
> restores the original configuration/ifcfg file.
> 
> The initial em1 interface had no DHCP configured on it (I would expect
> BOOTPROTO=dhcp), so I am not sure what was there before.
> DEVICE="em1"
> HWADDR="b8:ca:3a:68:3b:a0"
> IPV6INIT="no"
> IPV6_AUTOCONF="no"
> NM_CONTROLLED="no"
> ONBOOT="yes"
> PEERNTP="yes"
> 
> Perhaps you should try a static IP and check whether it works for you, and
> then go back and debug DHCP (if the static one worked).

It seems to work with a static configuration. We have plenty of other setups working without any issues on DHCP. Not sure what is going on here. Will try debugging and keep you posted.

Comment 18 errata-xmlrpc 2016-08-23 20:15:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html

