Bug 1574980

Summary: [OSP13] Cannot bind port due to dead network agents while deploy instance after reboot of OC nodes
Product: Red Hat OpenStack
Reporter: Artem Hrechanychenko <ahrechan>
Component: openvswitch
Assignee: Assaf Muller <amuller>
Status: CLOSED DUPLICATE
QA Contact: Ofer Blaut <oblaut>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 13.0 (Queens)
CC: ahrechan, apevec, bcafarel, bhaley, chrisw, michele, mkrcmari, rhos-maint, srevivo, tredaelli
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-14 12:24:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Artem Hrechanychenko 2018-05-04 13:00:20 UTC
Description of problem:
I cannot deploy an instance after rebooting the OC (overcloud) nodes:

https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DF%20Current%20release/job/DFG-df-13-deployment-7.5-virthost-3cont_3comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv4-vxlan-RHELOSP-31820/6/consoleFull


 fault                                | {"message": "Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance fcf97214-2d4f-4656-8600-f27c5fccde9e. Last exception: Binding failed for port b9bc7060-303b-4994-817e-acf8e836eba7, please check neutron logs for more information.", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 566, in build_instances 


At the same time, I was able to start an instance that was created before the reboot of the OC nodes:

 4d266054-eaac-43b4-a5d9-9de92dfeafb7 | after_deploy | ACTIVE | -          | Running     | tenantvxlan=192.168.32.7, 10.0.0.176 |



Most interesting entries from the logs:

sudo grep "b9bc7060-303b-4994-817e-acf8e836eba7" -R /var/log/containers/neutron/  - http://pastebin.test.redhat.com/586112


Refusing to bind port b9bc7060-303b-4994-817e-acf8e836eba7 to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'availability_zone': None, 'heartbeat_timestamp': datetime.datetime(2018, 5, 3, 22, 59, 54), 'admin_state_up': True, 'alive': False, 'topic': u'N/A', 'host': u'compute-2.localdomain', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'Subnet': u'1.0', u'Log': u'1.0', u'SubPort': u'1.0', u'SecurityGroup': u'1.0', u'SecurityGroupRule': u'1.0', u'Trunk': u'1.1', u'QosPolicy': u'1.7', u'Port': u'1.1', u'Network': u'1.0'}, 'created_at': datetime.datetime(2018, 5, 3, 18, 2, 31), 'started_at': datetime.datetime(2018, 5, 3, 18, 27, 54), 'id': 'daa5769a-3a02-4fef-8fed-a25fe35529ad', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'arp_responder_enabled': False, u'tunneling_ip': u'172.17.2.17', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'devices': 1, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'geneve', u'gre', u'internal', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'extensions': [u'qos'], u'l2_population': False, u'tunnel_types': [u'vxlan'], u'log_agent_heartbeats': False, u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex', u'tenant': u'br-isolated'}}}
/var/log/containers/neutron/server.log:2018-05-04 09:08:47.407 27 ERROR neutron.plugins.ml2.managers [req-b911d91a-f0be-4d2e-8a89-e429f8019c0a eaba5c2057a14a0aa057859fc1eea1d1 c1d9a1aa57f149e6b4fa7eed7416daf7 - default default] Failed to bind port b9bc7060-303b-4994-817e-acf8e836eba7 on host compute-2.localdomain for vnic_type normal using segments [{'network_id': '5e19a278-c1ec-4035-aed9-e019804b65f3', 'segmentation_id': 10, 'physical_network': None, 'id': '057af317-1d4f-4807-914c-b0a22073d9e1', 'network_type': u'vxlan'}]

http://pastebin.test.redhat.com/586121
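
The "Refusing to bind port ... to dead agent" message above shows 'alive': False next to a heartbeat_timestamp that is many hours older than the failed bind attempt. A minimal sketch of that staleness check follows. It assumes Neutron's default agent_down_time of 75 seconds (the value configured in this deployment is not shown in the report) and that both timestamps are in the same time zone. The function names are illustrative, not Neutron's own.

```python
from datetime import datetime, timedelta

# Assumption: default agent_down_time (75 s); the deployed value may differ.
AGENT_DOWN_TIME = timedelta(seconds=75)


def heartbeat_age(heartbeat, now):
    """Time elapsed between the agent's last heartbeat and 'now'."""
    return now - heartbeat


def is_agent_down(heartbeat, now, down_time=AGENT_DOWN_TIME):
    """Treat the agent as dead once its last heartbeat is older than down_time."""
    return heartbeat_age(heartbeat, now) > down_time


# 'heartbeat_timestamp' taken from the log entry above.
heartbeat = datetime(2018, 5, 3, 22, 59, 54)
# Timestamp of the "Failed to bind port" ERROR line in server.log.
# Assumption: same time zone as the heartbeat timestamp.
bind_attempt = datetime(2018, 5, 4, 9, 8, 47)

age = heartbeat_age(heartbeat, bind_attempt)
print("heartbeat age:", age)
print("agent considered down:", is_agent_down(heartbeat, bind_attempt))
```

Under these assumptions the last heartbeat from compute-2.localdomain is far older than the threshold, which is consistent with the server refusing the binding.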

Version-Release number of selected component (if applicable):
OSP 13 puddle - 2018-05-02.5

How reproducible:
always

Steps to Reproduce:
1. Deploy OSP13 with 3 controller + 3 compute + 3 Ceph nodes + OVN (default) using puddle 2018-05-02.5
2. Deploy an instance in the OC
3. Reboot the OC nodes one by one to simulate a rack outage
4. Start the instance from step 2 (nova start)
5. Deploy a new instance
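
Between steps 3 and 5 it may help to check which network agents are reported dead before deploying the new instance. The sketch below filters agent records shaped like the output of `openstack network agent list -f json`. The "Alive", "Host" and "Binary" keys and the two accepted alive values are assumptions about that output format, and the sample records are illustrative (only the compute-2.localdomain host name comes from the log in this report).

```python
import json


def dead_agents(agents):
    """Return (host, binary) pairs for agents that are not reported alive."""
    dead = []
    for agent in agents:
        alive = agent.get("Alive")
        # Assumption: alive agents show up as JSON true or as the ":-)" marker.
        if alive is True or alive == ":-)":
            continue
        dead.append((agent.get("Host"), agent.get("Binary")))
    return dead


# Illustrative sample, not captured from the affected deployment.
sample = json.loads("""
[
  {"Host": "controller-0.localdomain", "Binary": "neutron-dhcp-agent", "Alive": true},
  {"Host": "compute-2.localdomain", "Binary": "neutron-openvswitch-agent", "Alive": false}
]
""")

for host, binary in dead_agents(sample):
    print("dead agent:", binary, "on", host)
```

On a live system the list would come from the CLI output instead of the hard-coded sample.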

Actual results:
Fails at step 5.

Expected results:
The instance is created and reachable via floating IP.

Additional info:

Comment 3 Artem Hrechanychenko 2018-05-04 13:57:12 UTC
Sosreport - http://rhos-release.virt.bos.redhat.com/log/bz1574980/

Comment 11 Assaf Muller 2018-05-14 12:24:39 UTC

*** This bug has been marked as a duplicate of bug 1575095 ***