Bug 1480897

Summary: Overcloud deployment fails when using ovs 2.9.90
Product: Red Hat OpenStack
Reporter: Arie Bregman <abregman>
Component: openvswitch
Assignee: Aaron Conole <aconole>
Status: CLOSED NOTABUG
QA Contact: Ofer Blaut <oblaut>
Severity: high
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: abregman, amuller, apevec, atragler, beagles, chrisw, danken, fleitner, mgrepl, nyechiel, rhallise, rhos-maint, srevivo, supadhya, twilson
Target Milestone: ---
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 12:36:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1482682
Bug Blocks:

Description Arie Bregman 2017-08-12 17:56:07 UTC
Description of problem:

Overcloud deployment fails with "no valid hosts" when using ovs 2.8.90 instead of ovs 2.7.0-8

Version-Release number of selected component (if applicable): RHOSP 12


How reproducible: 100%


Steps to Reproduce:
1. Install undercloud with ovs 2.8.90
2. Deploy overcloud with ovs 2.8.90

Actual results: 

2017-08-12 15:34:25Z [overcloud.Controller.2.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2017-08-12 15:34:25Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS  state changed
2017-08-12 15:34:26Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE  state changed
2017-08-12 15:34:27Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2017-08-12 15:34:28Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"


Expected results: Overcloud deployed successfully

Comment 2 Brent Eagles 2017-08-14 12:27:24 UTC
A couple of critical looking issues in /var/log/messages on the undercloud:

Aug 12 10:06:18 undercloud-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=5 --id=@manager -- create Manager "target=\"ptcp:6640:127.0.0.1\"" -- add Open_vSwitch . manager_options @manager
Aug 12 10:06:18 undercloud-0 ovs-vsctl: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (Permission denied)

Aug 12 11:36:22 undercloud-0 registry: 192.168.24.1 - - [12/Aug/2017:11:36:22 -0400] "OPTIONS / HTTP/1.0" 200 0 "" ""
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: Traceback (most recent call last):
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 115, in wait
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: listener.cb(fileno)
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 57, in on_write
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: current.switch(([], [original], []))
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: result = function(*args, **kwargs)
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 65, in _launch
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: raise e
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: Exception: Could not retrieve schema from tcp:127.0.0.1:6640: Connection refused
Aug 12 11:36:22 undercloud-0 neutron-openvswitch-agent: Removing descriptor: 9
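
The "Connection refused" in the traceback above suggests nothing is listening on tcp:127.0.0.1:6640, which would follow from the failed manager setup. A minimal sketch of a reachability probe, assuming plain TCP is enough to tell the difference; the function name ovsdb_port_open is made up for illustration and is not part of neutron or OVS:

```python
import socket


def ovsdb_port_open(host="127.0.0.1", port=6640, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: it only checks that something accepts the
    connection, not that the peer actually speaks the OVSDB protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("ovsdb manager port reachable:", ovsdb_port_open())
```

If this reports False on the undercloud, the agent error is most likely a consequence of the earlier ovs-vsctl failure rather than a separate problem.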

Comment 3 Terry Wilson 2017-08-14 13:01:33 UTC
Since there is a permission error running the command that sets ovsdb-server to listen on ptcp:6640, neutron won't be able to communicate with OVS. Things to check would be whether there are selinux failures when running that command on the undercloud, in which case it could be that new selinux rules are required for the new OVS, or that the openstack-selinux package isn't installed. I can't tell from the log snippet, but it could also be that rootwrap wasn't configured properly and sudo wasn't used to run the create/add manager commands.
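
One way to check for selinux failures is to pull the AVC denials for OVS commands out of the audit log. A rough sketch, assuming the usual "avc: denied { ... } ... comm=..." audit record layout; the helper name ovs_denials and the sample record are made up for illustration and are not taken from this deployment:

```python
import re

# Matches the permission list and the command name in an AVC denial record.
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{(?P<perms>[^}]*)\}.*?comm="(?P<comm>[^"]+)"'
)


def ovs_denials(lines, comm_prefix="ovs"):
    """Return (command, [permissions]) for each denial from an OVS command.

    Hypothetical helper: comm_prefix="ovs" is an assumption that covers
    ovs-vsctl, ovs-vswitchd and ovsdb-server.
    """
    hits = []
    for line in lines:
        match = AVC_RE.search(line)
        if match and match.group("comm").startswith(comm_prefix):
            hits.append((match.group("comm"), match.group("perms").split()))
    return hits


if __name__ == "__main__":
    # Made-up record, shown only to illustrate the expected input format.
    sample = [
        'type=AVC msg=audit(1502546778.123:456): avc:  denied  '
        '{ dac_override } for  pid=1234 comm="ovs-vsctl" capability=1 '
        'tclass=capability',
    ]
    for comm, perms in ovs_denials(sample):
        print(comm, perms)
```

On a real system the input lines would come from the audit log, for example by iterating over the open log file.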

Comment 4 Arie Bregman 2017-08-14 13:07:45 UTC
openstack-selinux is installed but probably doesn't include updated policy for the new ovs.

The failure posted here is after setting selinux to permissive.
With enforcing, it fails much earlier with: "unable to start openvswitch..."

Any suggestions?

Comment 6 Terry Wilson 2017-08-14 14:45:13 UTC
It looks like we are now running openvswitch services under their own users instead of as root, so that explains all of the dac_override selinux issues: /var/run/openvswitch/conf.db etc. are now owned by 'openvswitch' instead of root. Locally, when using ovs 2.8 and selinux=permissive, I am able to run sudo ovs-vsctl set-manager ptcp:6640, which makes me wonder whether sudo is actually being called (I'm not sure why it wouldn't be).
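
To confirm the ownership change on a given host, something like the following could be used. This is a minimal sketch; the helper name owner_of is made up for illustration, and the paths checked are the ones mentioned elsewhere in this bug:

```python
import grp
import os
import pwd


def owner_of(path):
    """Return (user, group) names owning path, falling back to numeric ids."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group


if __name__ == "__main__":
    # Paths taken from this bug; either may be absent on a given host.
    for path in ("/var/run/openvswitch/db.sock", "/etc/sysconfig/openvswitch"):
        if os.path.exists(path):
            print(path, owner_of(path))
        else:
            print(path, "not found")
```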

Comment 7 Terry Wilson 2017-08-14 15:05:20 UTC
arie, according to aconole, you could try commenting out:

OVS_USER_ID="openvswitch:hugetlbfs"

in /etc/sysconfig/openvswitch to work around the issue for now as well.
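
The workaround amounts to prefixing the OVS_USER_ID line with "#". A small sketch of that edit as a text transformation, assuming the setting sits on a line of its own; the function name comment_out_ovs_user is made up for illustration, and writing the result back to /etc/sysconfig/openvswitch (as root) is left out on purpose:

```python
import re


def comment_out_ovs_user(text):
    """Comment out any active OVS_USER_ID=... line in sysconfig-style text.

    Lines that are already commented are left untouched, so applying the
    function twice gives the same result as applying it once.
    """
    return re.sub(r"(?m)^([ \t]*)(OVS_USER_ID=.*)$", r"\1#\2", text)


if __name__ == "__main__":
    before = 'OPTIONS=""\nOVS_USER_ID="openvswitch:openvswitch"\n'
    print(comment_out_ovs_user(before))
```

This is a stopgap only: it presumably makes the services run as root again, as they did before the change.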

Comment 8 Assaf Muller 2017-08-14 16:09:42 UTC
Do we expect to hit the user ID issue once the OVS team provides a "proper" OVS 2.8 build? If so, what would a solution look like? Would it come from the OVS or OSP (Neutron/TripleO) side?

Comment 9 Terry Wilson 2017-08-14 16:23:56 UTC
amuller: aconole said he has a patch in testing for the selinux issues. We don't know if there are additional issues. It seems like there may be, if things fail even in permissive mode. aconole said it looked like ovs-vsctl was run as root, so maybe there is a chance that somewhere in the deployment process selinux was re-enabled? Otherwise I'm not sure, unless ovs-vsctl itself was specifically dropping permissions (and the only place I see that is in the servers).

Comment 13 Terry Wilson 2017-08-17 22:37:57 UTC
bz 1482682 describes the selinux issue, and links to a patch upstream that should fix this (https://patchwork.ozlabs.org/patch/802232/)

Comment 14 Terry Wilson 2017-08-25 15:47:57 UTC
arie, did you try removing

OVS_USER_ID="openvswitch:openvswitch"

from /etc/sysconfig/openvswitch to temporarily move past the selinux issue? Also, what setting are you using to disable selinux? I might be able to help find out where things are getting changed back as well.

Comment 17 Arie Bregman 2017-09-06 08:48:19 UTC
Note: OVS 2.8.90 targeted for RHOSP 13. OSP 12 is running with ovs 2.7.2

Comment 19 Lon Hohberger 2017-10-11 18:26:12 UTC
Looks like this was fixed in openvswitch by providing its own custom policy.