Description of problem: I am using RHEV-M 3.4 to deploy a RHEL 7 host. Initially the host was set to Down; afterwards it was moved to Non Responsive, with "no route to host" exceptions in the engine logs. I noticed that the "rhevm" bridge was not configured on the host. I disabled the iptables service, re-ran host deploy through the webadmin, and the host status was set to Up.
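For reference, the check described above can be sketched as a shell snippet. The bridge name "rhevm" comes from the report; the has_bridge helper is hypothetical, added here only so the check can be exercised against canned `ip -o link show` output.

```shell
#!/bin/sh
# has_bridge NAME OUTPUT: succeed if NAME appears as a device in
# `ip -o link show` style output (one interface per line).
has_bridge() {
    printf '%s\n' "$2" | grep -q "^[0-9][0-9]*: $1:"
}

# On the broken host the management bridge was missing, i.e.
#   has_bridge rhevm "$(ip -o link show)"
# failed. The workaround applied in the report was to stop iptables
# and re-run host deploy from the webadmin:
#   systemctl stop iptables
```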
It looks like "vdsm-tool configure --force" is failing on the qemu configuration at first run. Can you take a look?
This looks unrelated to vdsm-tool: bridge creation is not done there. It is also unrelated to bug 1127877, since that is a local-install issue, not a host-deploy install like here.

In vdsm.log this error repeats:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 629, in getSpmStatus
    pool = self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 326, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('59425ebb-03bd-4767-9a28-2a368b2beb60',)

But I do not think it is related. I am most interested in the host-deploy log, located at /tmp/ovirt-host-deploy-*.log. Ohad, can we get it?
I can't find them there. I think they get removed after host deploy finishes, no?
Ohad, the attached vdsm.log does not show even an attempt to create the management bridge. Could you provide more information on the mode of failure? Why do you think it is related to iptables? Did you explicitly stop firewalld?

Please note that the log does show traces of bug 1129232. I do not think it is related to bridge creation (only to reporting its existence), but please try with a fixed build.

Thread-12::ERROR::2014-08-14 14:15:31,425::sampling::453::vds::(run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 435, in run
    sample = self.sample()
  File "/usr/share/vdsm/sampling.py", line 425, in sample
    hs = HostSample(self._pid)
  File "/usr/share/vdsm/sampling.py", line 209, in __init__
    self.timestamp - os.stat(P_VDSM_CLIENT_LOG).st_mtime <
OSError: [Errno 2] No such file or directory: '/var/run/vdsm/client.log'
It could be a coincidence that when I disabled iptables and redeployed the host, it managed to create the bridge and activate it. firewalld was disabled on this machine before running the test.
What was the name of the host with the failed installation? When did the failure occur? Can you point to the relevant point in engine.log? Could you reproduce this with a vdsm build that has bug 1129232 solved?
Engine does not even attempt to create the management network; vdsm is explicitly stopped before that.

storageRefresh::DEBUG::2014-09-01 15:19:57,083::hsm::393::Storage.HSM::(storageRefresh) HSM is ready
MainThread::DEBUG::2014-09-01 15:31:34,560::vdsm::58::vds::(sigtermHandler) Received signal 15

Sep  1 15:31:04 cinteg19 puppet-agent[25933]: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
Sep  1 15:31:04 cinteg19 puppet-agent[25933]: Use 'puppet agent --enable' to re-enable.
Sep  1 15:31:34 cinteg19 systemd-logind: New session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Starting Session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Started Session 45 of user root.
Sep  1 15:31:34 cinteg19 systemd: Stopping Virtual Desktop Server Manager...

Given the root login in the very second of the vdsm stop, I suspect that this is not a RHEV bug, but a problem with CI orchestration.
Created attachment 933759 [details]
host deploy log
Hi, I see some errors in the host-deploy log. Care to take a look?
I am reopening because I see some host-deploy errors. This is obviously not a CI bug.
This log is too late. I am looking for someone/something logging in to the host as root at 15:31:04 and restarting vdsm.

It does have a peculiar error that begs for more details via journalctl:

2014-09-01 16:02:39 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:866 execute-output: ('/bin/systemctl', 'start', 'network.service') stderr:
Job for network.service failed. See 'systemctl status network.service' and 'journalctl -xn' for details.

Could you provide the correct ovirt-host-deploy.log and 16:02:39's journalctl output?
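A sketch of how the requested journal window could be collected on the host. The timestamp is from the log excerpt above; GNU date is assumed for the relative-time arithmetic, and the one-minute window is an arbitrary choice.

```shell
#!/bin/sh
# Failure timestamp from the host-deploy log
ts="2014-09-01 16:02:39"

# Build a one-minute window around it using GNU date relative items
since=$(date -d "$ts 30 seconds ago" '+%Y-%m-%d %H:%M:%S')
until_=$(date -d "$ts 30 seconds" '+%Y-%m-%d %H:%M:%S')

# Print the command to run on the host; `systemctl status
# network.service` would also be worth capturing there.
echo "journalctl --since \"$since\" --until \"$until_\""
```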
Yes, it does look bad. I will provide the logs.

Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Starting LSB: Bring up/down networking...
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up loopback interface: [ OK ]
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up interface eno1:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: dhclient(2938) is already running - exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: This version of ISC DHCP is based on the release available
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: on ftp.isc.org. Features have been added and other changes
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: have been made to the base software release in order to make
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: it work better with this distribution.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: Please report for this software via the Red Hat Bugzilla site:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: http://bugzilla.redhat.com
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Determining IP information for eno1...dhclient(2938) is already running - exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: This version of ISC DHCP is based on the release available
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: on ftp.isc.org. Features have been added and other changes
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: have been made to the base software release in order to make
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: it work better with this distribution.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Please report for this software via the Red Hat Bugzilla site:
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: http://bugzilla.redhat.com
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: exiting.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: failed.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: [FAILED]
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: RTNETLINK answers: File exists
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: network.service: control process exited, code=exited status=1
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Failed to start LSB: Bring up/down networking.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Unit network.service entered failed state.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Reloading.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Reloading.
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/sanlock.service:8] Unknown lvalue 'ControlGroup' in section 'Service'
Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: [/usr/lib/systemd/system/wdmd.service:7] Unknown lvalue 'ControlGroup' in section 'Service'
The command to add the host was sent at 2014-09-01 15:17, as you can see in the log. 15:31:04 is much later, and I think that by this time the host had already been moved to Non Responsive by the engine.
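The gap between those two timestamps is about fourteen minutes, which can be checked with a quick epoch-seconds calculation (GNU date assumed; 15:17:00 is used since the log excerpt gives only minute precision for the add-host command):

```shell
#!/bin/sh
# Convert both timestamps to epoch seconds and take the difference
start=$(date -d "2014-09-01 15:17:00" +%s)  # add-host command sent
stop=$(date -d "2014-09-01 15:31:04" +%s)   # root session / vdsm stop
gap_min=$(( (stop - start) / 60 ))
echo "$gap_min"
```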
(In reply to Ohad Basan from comment #17)
> yes
> it does look bad

Not really. I guess that it is about starting the network "service" while it is still running. This should not cause anything bad apart from this log noise.

> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com systemd[1]: Starting LSB: Bring up/down networking...
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up loopback interface: [ OK ]
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com network[28382]: Bringing up interface eno1:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: dhclient(2938) is already running - exiting.
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: This version of ISC DHCP is based on the release available
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: on ftp.isc.org. Features have been added and other changes
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: have been made to the base software release in order to make
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: it work better with this distribution.
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: Please report for this software via the Red Hat Bugzilla site:
> Sep 01 16:02:39 cinteg19.ci.lab.tlv.redhat.com dhclient[28515]: http://bugzilla.redhat.com

In any case, it is not related to Engine's failure to find a running Vdsm.