Description of problem: I am using RHEV-M 3.4 to deploy a RHEL 7 host. Initially the host was set to Down; afterwards it was moved to Non Responsive, with "no route to host" exceptions in the engine logs. I noticed that the "rhevm" bridge was not configured on the host. I disabled the iptables service, re-ran host deploy through the webadmin, and the host status was set to Up.
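For reference, the check described above can be sketched as a shell snippet. The bridge name "rhevm" comes from the report; the has_bridge helper is hypothetical, added here only so the check can be exercised against canned `ip -o link show` output.

```shell
#!/bin/sh
# has_bridge NAME OUTPUT: succeed if NAME appears as a device in
# `ip -o link show` style output (one interface per line).
has_bridge() {
    printf '%s\n' "$2" | grep -q "^[0-9][0-9]*: $1:"
}

# On the broken host the management bridge was missing, i.e.
#   has_bridge rhevm "$(ip -o link show)"
# failed. The workaround applied in the report was to stop iptables
# and re-run host deploy from the webadmin:
#   systemctl stop iptables
```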
It looks like "vdsm-tool configure --force" is failing on the qemu configuration at first run. Can you take a look?
This looks unrelated to vdsm-tool: bridge creation is not done there. It is also unrelated to bug 1127877, since that is a local-install issue, not a host-deploy install like here.

In vdsm.log this error repeats:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 629, in getSpmStatus
    pool = self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 326, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('59425ebb-03bd-4767-9a28-2a368b2beb60',)

But I do not think it is related. I am most interested in the host-deploy log, located at /tmp/ovirt-host-deploy-*.log. Ohad, can we get it?
I can't find them there. I think they get removed after host deploy finishes, no?
Ohad, the attached vdsm.log does not show even an attempt to create the management bridge. Could you provide more information on the mode of failure? Why do you think it is related to iptables? Did you explicitly stop firewalld?

Please note that the log does show traces of bug 1129232. I do not think it is related to bridge creation (only to reporting its existence), but please try with a fixed build.

Thread-12::ERROR::2014-08-14 14:15:31,425::sampling::453::vds::(run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 435, in run
    sample = self.sample()
  File "/usr/share/vdsm/sampling.py", line 425, in sample
    hs = HostSample(self._pid)
  File "/usr/share/vdsm/sampling.py", line 209, in __init__
    self.timestamp - os.stat(P_VDSM_CLIENT_LOG).st_mtime <
OSError: [Errno 2] No such file or directory: '/var/run/vdsm/client.log'
It could be a coincidence that when I disabled iptables and redeployed the host, it managed to create the bridge and activate it. firewalld was disabled on this machine before running the test.
What was the name of the host with the failed installation? When did the failure occur? Can you point to the relevant point in engine.log? Could you reproduce this with a vdsm build that has bug 1129232 solved?
Engine does not even attempt to create the management network; vdsm is explicitly stopped before that.

storageRefresh::DEBUG::2014-09-01 15:19:57,083::hsm::393::Storage.HSM::(storageRefresh) HSM is ready
MainThread::DEBUG::2014-09-01 15:31:34,560::vdsm::58::vds::(sigtermHandler) Received signal 15

Sep  1 15:31:04 cinteg19 puppet-agent[25933]: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
Sep  1 15:31:04 cinteg19 puppet-agent[25933]: Use 'puppet agent --enable' to re-enable.
Sep  1 15:31:34 cinteg19 systemd-logind: New session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Starting Session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Started Session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Stopping Virtual Desktop Server Manager...

Given the root login in the very second of the vdsm stop, I suspect that this is not a RHEV bug, but a problem with CI orchestration.
Created attachment 933759 [details]
host deploy log
Hi, I see some errors in the host-deploy log. Care to take a look?
I am reopening because I see some host-deploy errors. This is obviously not a CI bug.
This log is too late. I am looking for someone/something logging in to the host as root at 15:31:04 and restarting vdsm.

It does have a peculiar error that begs for more details via journalctl:

2014-09-01 16:02:39 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:866 execute-output: ('/bin/systemctl', 'start', 'network.service') stderr:
Job for network.service failed. See 'systemctl status network.service' and 'journalctl -xn' for details.

Could you provide the correct ovirt-host-deploy.log and 16:02:39's journalctl output?
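A sketch of how the requested journal window could be collected on the host. The timestamp is from the log excerpt above; GNU date is assumed for the relative-time arithmetic, and the one-minute window is an arbitrary choice.

```shell
#!/bin/sh
# Failure timestamp from the host-deploy log
ts="2014-09-01 16:02:39"

# Build a one-minute window around it using GNU date relative items
since=$(date -d "$ts 30 seconds ago" '+%Y-%m-%d %H:%M:%S')
until_=$(date -d "$ts 30 seconds" '+%Y-%m-%d %H:%M:%S')

# Print the command to run on the host; `systemctl status
# network.service` would also be worth capturing there.
echo "journalctl --since \"$since\" --until \"$until_\""
```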
Yes, it does look bad. I will provide the logs.

Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Starting LSB: Bring up/down networking...
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up loopback interface: [ OK ]
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up interface eno1:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: dhclient(2938) is already running - exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: This version of ISC DHCP is based on the release available
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: on ftp.isc.org. Features have been added and other changes
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: have been made to the base software release in order to make
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: it work better with this distribution.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: Please report for this software via the Red Hat Bugzilla site:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: http://bugzilla.redhat.com
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Determining IP information for eno1...dhclient(2938) is already running - exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: This version of ISC DHCP is based on the release available
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: on ftp.isc.org. Features have been added and other changes
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: have been made to the base software release in order to make
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: it work better with this distribution.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Please report for this software via the Red Hat Bugzilla site:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: http://bugzilla.redhat.com
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: failed.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: [FAILED]
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: network.service: control process exited, code=exited status=1
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Failed to start LSB: Bring up/down networking.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Unit network.service entered failed state.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Reloading.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Reloading.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
The command to add the host was sent at 2014-09-01 15:17, as you can see in the log. 15:31:04 is much later, and I think that by this time the host had already been moved to Non Responsive by the engine.
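The gap between those two timestamps is about fourteen minutes, which can be checked with a quick epoch-seconds calculation (GNU date assumed; 15:17:00 is used since the log excerpt gives only minute precision for the add-host command):

```shell
#!/bin/sh
# Convert both timestamps to epoch seconds and take the difference
start=$(date -d "2014-09-01 15:17:00" +%s)  # add-host command sent
stop=$(date -d "2014-09-01 15:31:04" +%s)   # root session / vdsm stop
gap_min=$(( (stop - start) / 60 ))
echo "$gap_min"
```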
(In reply to Ohad Basan from comment #17)
> yes
> it does look bad

Not really. I guess that it is about starting the network "service" while it is still running. This should not cause anything bad apart from this log noise.

> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Starting LSB: Bring up/down networking...
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up loopback interface: [ OK ]
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up interface eno1:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: dhclient(2938) is already running - exiting.
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: This version of ISC DHCP is based on the release available
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: on ftp.isc.org. Features have been added and other changes
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: have been made to the base software release in order to make
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: it work better with this distribution.
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: Please report for this software via the Red Hat Bugzilla site:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: http://bugzilla.redhat.com

In any case, it is not related to Engine's failure to find a running Vdsm.