Description of problem:
After successfully registering RHEV-H 7.0 to RHEV-M vt5 and approving it, the host comes up. After rebooting RHEV-H, the network on NIC "em1" goes down and all configuration files for em1 in /etc/sysconfig/network-scripts/ are lost.

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.0-20141006.0.el7ev
ovirt-node-3.1.0-0.20.20141006gitc421e04.el7.noarch.rpm
Red Hat Enterprise Virtualization Manager Version: 3.5.0-0.14.beta.el6ev
ovirt-node-plugin-vdsm-0.2.0-9.el7.noarch
vdsm-reg-4.16.6-1.el7.noarch
vdsm-4.16.6-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install rhev-hypervisor7-7.0-20141006.0.el7ev.
2. Configure the network and register to RHEV-M vt5.
3. After registration succeeds, approve the RHEV-H host; it comes up.
4. Reboot RHEV-H.

Actual results:
1. After step 4, the network on NIC "em1" is down.
2. All configuration files for em1 in /etc/sysconfig/network-scripts/ are lost; "ifconfig" only lists lo.

Expected results:
The network configuration for em1 survives the reboot and the host comes back up.

Additional info:
Haiyang, can you please provide all files in /var/log?
Created attachment 950952 [details]
attached logs from /var/log

(In reply to Fabian Deutsch from comment #1)
> Haiyang, can you please provide all files in /var/log?

OK, the logs from /var/log are attached here.
From the logs:

2014-10-27 11:05:08,260 - WARNING - ovirtfunctions - File already persisted: /etc/pki/vdsm/certs/bkp-2014-10-27_110508_cacert.pem
2014-10-27 11:05:08,260 - WARNING - ovirtfunctions - /etc/pki/vdsm/certs/bkp-2014-10-27_110508_cacert.pem Already persisted
2014-10-27 11:05:08,260 - WARNING - ovirtfunctions - File already persisted: /etc/pki/vdsm/certs/bkp-2014-10-27_110508_vdsmcert.pem
2014-10-27 11:05:08,260 - WARNING - ovirtfunctions - /etc/pki/vdsm/certs/bkp-2014-10-27_110508_vdsmcert.pem Already persisted
2014-10-27 11:05:19,743 - WARNING - ovirtfunctions - Directory already persisted: /var/lib/vdsm/persistence
2014-10-27 11:05:19,744 - WARNING - ovirtfunctions - You need to unpersist its child directories and/or files and try again.
2014-10-27 11:05:43,687 - DEBUG - ovirtfunctions - sed --copy -i "\|/etc/libvirt/qemu/networks/autostart/default.xml$|d" /config/files
2014-10-27 11:05:43,687 - DEBUG - ovirtfunctions -
2014-10-27 11:05:43,691 - DEBUG - ovirtfunctions - shred -u /config'/etc/libvirt/qemu/networks/autostart/default.xml'
2014-10-27 11:05:43,692 - DEBUG - ovirtfunctions -
2014-10-27 11:05:43,696 - DEBUG - ovirtfunctions - shred -u '/etc/libvirt/qemu/networks/autostart/default.xml'
2014-10-27 11:05:43,696 - DEBUG - ovirtfunctions -
2014-10-27 11:05:52,347 - DEBUG - ovirtfunctions - sed --copy -i "\|/etc/sysconfig/network-scripts/ifcfg-rhevm$|d" /config/files
2014-10-27 11:05:52,347 - DEBUG - ovirtfunctions -
2014-10-27 11:05:52,357 - DEBUG - ovirtfunctions - shred -u /config'/etc/sysconfig/network-scripts/ifcfg-rhevm'
2014-10-27 11:05:52,357 - DEBUG - ovirtfunctions -
2014-10-27 11:05:52,367 - DEBUG - ovirtfunctions - shred -u '/etc/sysconfig/network-scripts/ifcfg-rhevm'
2014-10-27 11:05:52,367 - DEBUG - ovirtfunctions -
2014-10-27 11:05:53,776 - DEBUG - ovirtfunctions - sed --copy -i "\|/etc/sysconfig/network-scripts/ifcfg-em1$|d" /config/files
2014-10-27 11:05:53,777 - DEBUG - ovirtfunctions -
2014-10-27 11:05:53,998 - DEBUG - ovirtfunctions - shred -u /config'/etc/sysconfig/network-scripts/ifcfg-em1'
2014-10-27 11:05:53,998 - DEBUG - ovirtfunctions -
2014-10-27 11:05:54,008 - DEBUG - ovirtfunctions - shred -u '/etc/sysconfig/network-scripts/ifcfg-em1'
2014-10-27 11:05:54,008 - DEBUG - ovirtfunctions -

It seems as if the vdsm-persistence-related configuration files get removed again. Ideas from the vdsm side?
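For readers unfamiliar with ovirt-node's persistence layer, here is a minimal shell sketch of the removal sequence visible in the log above. It assumes the ovirt-node convention that persisted copies live under /config and are bind-mounted over their counterparts on the read-only rootfs; the function name and the umount step are illustrative, not taken from the actual code:

    unpersist_file() {
        f="$1"
        # Assumption: drop the bind mount first so the original is reachable.
        umount "$f" 2>/dev/null
        # Deregister the path from the persistence manifest (as in the log).
        sed --copy -i "\|${f}\$|d" /config/files
        # Remove the persisted copy and the original (as in the log).
        shred -u "/config${f}"
        shred -u "$f"
    }

    unpersist_file /etc/sysconfig/network-scripts/ifcfg-em1

If something like this runs for ifcfg-em1 and ifcfg-rhevm before the reboot, it would explain why /etc/sysconfig/network-scripts/ is empty afterwards.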
How was em1 networking configured for the initial registration?
(In reply to Antoni Segura Puimedon from comment #4)
> How was em1 networking configured for the initial registration?

The network worked well for the initial registration to RHEV-M:

[root@dhcp-9-12 admin]# ifconfig
em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::7a2b:cbff:fe9a:72a3  prefixlen 64  scopeid 0x20<link>
        ether 78:2b:cb:9a:72:a3  txqueuelen 1000  (Ethernet)
        RX packets 2349  bytes 1610899 (1.5 MiB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 1133  bytes 470936 (459.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 20  memory 0xe1a00000-e1a20000

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 227  bytes 28249 (27.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 227  bytes 28249 (27.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

rhevm: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.66.9.12  netmask 255.255.252.0  broadcast 10.66.11.255
        inet6 fe80::7a2b:cbff:fe9a:72a3  prefixlen 64  scopeid 0x20<link>
        ether 78:2b:cb:9a:72:a3  txqueuelen 0  (Ethernet)
        RX packets 514  bytes 52141 (50.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 326  bytes 372864 (364.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@dhcp-9-12 admin]# cat /etc/sysconfig/network-scripts/ifcfg-em1
# Generated by VDSM version 4.16.6-1.el7
DEVICE=em1
HWADDR=78:2b:cb:9a:72:a3
BRIDGE=rhevm
ONBOOT=yes
NM_CONTROLLED=no
IPV6_AUTOCONF=no
PEERNTP=yes
IPV6INIT=no

[root@dhcp-9-12 admin]# cat /etc/sysconfig/network-scripts/ifcfg-rhevm
# Generated by VDSM version 4.16.6-1.el7
DEVICE=rhevm
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
DEFROUTE=yes
NM_CONTROLLED=no
IPV6_AUTOCONF=no
PEERNTP=yes
IPV6INIT=no
HOTPLUG=no

The network goes down after rebooting RHEV-H because the network configuration files are lost from /etc/sysconfig/network-scripts/.
My question is: before running the registration, what did ifcfg-em1 look like? I presume em1 and rhevm were created by vdsm-reg, is that the case? If so, after running the registration, what is the output of:

tail /var/lib/vdsm/persistence/netconf/*/*
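For context, vdsm's unified network persistence stores one small JSON file per configured network under that directory. Purely as a hypothetical illustration (the exact keys are an assumption, not taken from this host), an entry for the setup described here could look like:

    # /var/lib/vdsm/persistence/netconf/nets/rhevm (hypothetical)
    {"nic": "em1", "bridged": true, "bootproto": "dhcp", "defaultRoute": true}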
I could identify two problems:

1. vdsmd conflicts with ksmtuned (part of qemu-kvm-rhev)

This is not a problem per se; the problem is that both of them are enabled. vdsm should probably disable ksmtuned at installation time.

Some data on this: http://paste.fedoraproject.org/147456/41502993 line 1654:

Nov 03 15:51:21 localhost systemd[1]: Fixing conflicting jobs by deleting job ksmtuned.service/start
Nov 03 15:51:21 localhost systemd[1]: Looking at job vdsmd.service/start conflicted_by=no
Nov 03 15:51:21 localhost systemd[1]: Looking at job vdsmd.service/stop conflicted_by=no
Nov 03 15:51:21 localhost systemd[1]: Fixing conflicting jobs by deleting job vdsmd.service/stop

vdsmd's start job is "deleted" because systemd tries to bring up ksmtuned as well. This problem is probably generic; I am not sure why it doesn't come up on stock RHEL. My fix is currently to disable ksmtuned at build time.

2. vdsmd requires sanlock

For some reason, sanlock goes down and comes back up during the boot process. This also kills vdsmd and prevents vdsmd from coming up.

Some data: http://paste.fedoraproject.org/147500/38217141 line 3733:

Nov 03 18:06:04 locaasdasdasd systemd[1]: Trying to enqueue job sanlock.service/stop/replace
Nov 03 18:06:04 locaasdasdasd systemd[1]: Installed new job sanlock.service/stop as 238
Nov 03 18:06:04 locaasdasdasd systemd[1]: Job vdsmd.service/start finished, result=canceled

I am not sure why sanlock is going down, but at least this cancels vdsmd's startup. My hack is currently to change the Requires= to a Wants= in vdsmd's service file.

Both patches are attached. What are your suggestions to solve this?
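To make the first problem concrete: "Fixing conflicting jobs" is what systemd logs when two units in the same transaction declare Conflicts= on each other, forcing it to drop one of the two start jobs. A sketch of the situation, assuming vdsmd.service carries the conflict (the actual unit contents are not shown in the log):

    # vdsmd.service (illustrative excerpt)
    [Unit]
    Description=Virtual Desktop Server Manager
    Conflicts=ksmtuned.service

    # The build-time workaround described above then amounts to:
    systemctl disable ksmtuned.service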
An update on the actual causes for this bug:

The problem is that vdsmd depends (as in Requires=) on several services; if any of these services is brought down, vdsmd will also be brought down.

Initially the problem was that sanlock took vdsmd down, because sanlock.service got stopped and started during boot by a hook residing in ovirt-node-plugin-vdsm. This was solved by moving the handler to a different hook (post-upgrade).

Now the problem is conceptually similar, just with a different dependency: libvirtd. This is caused by a hook handler (01-vdsm in ovirt-node-plugin-vdsm) which calls "vdsm-tool configure --force" during boot. This call is needed to fix some vdsm-related things at boot time. vdsm-tool is called with --force, and --force implies that services get restarted, according to the source:

    parser.add_argument(
        '--force',
        dest='force',
        default=False,
        action='store_true',
        help='Force configuration, trigger services restart',
    )

Because the reconfiguration also reconfigures libvirt, libvirt gets restarted as well. And because libvirtd is a dependency of vdsmd, vdsmd's startup is canceled; and because the start is canceled, vdsmd will never come up.

An attempt to fix this from the Node side failed: we tried to reorder all services so that every service which restarts a dependency of vdsmd runs before vdsmd is started. But this did not work, because the vdsmd start job gets queued when systemd is coming up, and any messing with a dependency already affects the queued vdsmd start job.

According to the systemd people this behavior is intended (a stopped dependency can cancel vdsmd's startup while it is queued). It is also intended that vdsmd is not queued again, even if the dependency comes back up.
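To summarize the failure chain as a sketch (the exact ordering is an assumption based on the description above):

    # 01-vdsm hook, during boot:
    vdsm-tool configure --force   # --force restarts the configured services,
                                  # including libvirtd ...
    # ... systemd sees libvirtd.service stop while vdsmd.service/start is
    # still queued; since vdsmd Requires= libvirtd, the queued start job is
    # canceled, never re-queued, and vdsmd stays down for the rest of the boot.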
To debug this issue it's helpful to enable systemd/journald debugging by appending the following arguments to the kernel cmdline:

console=ttyS0 systemd.log_target=console debug systemd.journald.forward_to_console=yes
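On a stock RHEL 7 host this can be made persistent with grubby, for example (on RHEV-H the bootloader entries are managed by the node tooling, so this is only a sketch for the general case):

    grubby --update-kernel=ALL \
           --args="console=ttyS0 systemd.log_target=console debug systemd.journald.forward_to_console=yes"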
As discussed earlier, a solution might be to lower the requirements of vdsmd: to use Wants= instead of Requires=.

The reasoning behind this is: if Requires= is used, then vdsmd does not come up, and vdsmd cannot tell the engine about the actual problem (some service is not coming up). In this scenario the user needs to look at the vdsm side manually (regardless of whether it's a RHEV-H or RHEL host) and check why vdsmd was not started.

If the dependencies are relaxed (and Wants= is used), then vdsmd can detect the state of its dependencies on boot (or monitor them the whole time) and - this is the nice thing - let engine know if any of the dependencies goes down. In this scenario vdsmd can report the failed dependencies back to engine, and engine can let the user know about the problem.
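As a sketch, the proposed relaxation would be a one-word change in vdsmd's unit file (the shipped unit lists more dependencies than shown here; note that a drop-in would not work for this, since Requires= entries are additive and cannot be downgraded from a drop-in):

    # vdsmd.service (illustrative excerpt)
    [Unit]
    -Requires=libvirtd.service
    +Wants=libvirtd.service
    After=libvirtd.service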
As it stands, vdsm must die as soon as libvirtd is stopped; without this, there is no means for vdsm to re-connect to the new libvirt process. Similarly, there are no means in vdsm to support a "soft" dependency, such as reporting that sanlock/nfs/multipathd is not running. Changing Requires= to Wants= unilaterally, with no compatible work in Engine and vdsm, is going to cause breakage.

I did not understand - did you try starting vdsm explicitly after `configure --force` has succeeded?
(In reply to Dan Kenigsberg from comment #15)
> As it stands, vdsm must die as soon as libvirtd is stopped; without this,
> there is no means for vdsm to re-connect to the new libvirt process.
> Similarly, there are no means in vdsm to support a "soft" dependency, such
> as reporting that sanlock/nfs/multipathd is not running. Changing
> Requires= to Wants= unilaterally, with no compatible work in Engine and
> vdsm, is going to cause breakage.
>
> I did not understand - did you try starting vdsm explicitly after
> `configure --force` has succeeded?

No, I did not try starting vdsmd again. I'd rather solve the underlying problems than add a workaround, but this simple solution might work - I will try it.
(In reply to Fabian Deutsch from comment #16)
> (In reply to Dan Kenigsberg from comment #15)
> > As it stands, vdsm must die as soon as libvirtd is stopped; without
> > this, there is no means for vdsm to re-connect to the new libvirt
> > process. Similarly, there are no means in vdsm to support a "soft"
> > dependency, such as reporting that sanlock/nfs/multipathd is not
> > running. Changing Requires= to Wants= unilaterally, with no compatible
> > work in Engine and vdsm, is going to cause breakage.

In my understanding vdsm is the connection endpoint for engine on the host side, and thus - IMO - it should stay up even if dependencies go down; it is vdsm's task to report those errors back to engine. But maybe this is more an RFE, if engine cannot currently signal this kind of information.
(In reply to Fabian Deutsch from comment #16)
> (In reply to Dan Kenigsberg from comment #15)
...
> > I did not understand - did you try starting vdsm explicitly after
> > `configure --force` has succeeded?
>
> No, I did not try starting vdsmd again. I'd rather solve the underlying
> problems than add a workaround, but this simple solution might work - I
> will try it.

Let me note: this solution is only a workaround for the boot-time problem. The underlying problem - that vdsmd goes down if any of its dependencies goes down - will not be fixed, and can lead to unexpected disconnects between the host and engine.
That's not true. When vdsmd goes down, it is brought right back up by the systemd mechanism which controls vdsmd as a daemon. Engine might see a disconnection for a few seconds, and hopefully won't start any migrations before vdsmd responds again. When libvirt, sanlock or supervdsm goes down, vdsm also restarts to establish a new connection to those services; that's not a bug but the desired approach.

In your case, AFAIU, we currently call configure on each boot, and sometimes after vdsm has failed to start. After calling configure we don't try to start vdsmd again if it wasn't up before, and I assume that this is the problem. Until we persist all the configurations, we need to call configure on each start and afterwards start vdsmd if it is not yet up (a sketch of this flow follows below). If vdsmd was up, the configure call should take care of restarting vdsmd for you after stopping it before writing the libvirt and other conf files. But probably the first vdsmd start failed due to configuration issues and we never try to start it again. Can that be the case here?
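A minimal sketch of that suggested flow, using standard systemctl calls (where exactly this would be wired into the boot hook is an assumption):

    vdsm-tool configure --force
    # Start vdsmd only if the earlier start attempt failed or never happened;
    # if vdsmd was already running, configure itself restarts it.
    systemctl is-active --quiet vdsmd.service || systemctl start vdsmd.service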
(In reply to Fabian Deutsch from comment #16)
> (In reply to Dan Kenigsberg from comment #15)
> > As it stands, vdsm must die as soon as libvirtd is stopped; without
> > this, there is no means for vdsm to re-connect to the new libvirt
> > process. Similarly, there are no means in vdsm to support a "soft"
> > dependency, such as reporting that sanlock/nfs/multipathd is not
> > running. Changing Requires= to Wants= unilaterally, with no compatible
> > work in Engine and vdsm, is going to cause breakage.
> >
> > I did not understand - did you try starting vdsm explicitly after
> > `configure --force` has succeeded?
>
> No, I did not try starting vdsmd again. I'd rather solve the underlying
> problems than add a workaround, but this simple solution might work - I
> will try it.

Good news: the hack works.
Test version:
rhev-hypervisor7-7.0-20141107.0
ovirt-node-3.1.0-0.25.20141107gitf6dc7b9.el7.noarch
Red Hat Enterprise Virtualization Manager Version: 3.5.0-0.19.beta.el6ev
ovirt-node-plugin-vdsm-0.2.0-10.el7ev.noarch
vdsm-4.16.7.3-1.el7ev.x86_64
vdsm-reg-4.16.7.3-1.el7ev.noarch

Test steps:
1. Install rhev-hypervisor7-7.0-20141107.0.
2. Configure the network and register to RHEV-M vt9.
3. After registration succeeds, approve the RHEV-H host; it comes up.
4. Reboot RHEV-H.

After step 4, the network on NIC "em1" is still up, and the status of the RHEV-H host is set to Up on the RHEV-M webadmin side. This bug has been fixed, so I am changing the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0195.html