Description of problem:
When trying to deploy a new overcloud node after an undercloud upgrade (Newton -> Ocata), the deployed server does not get a DHCP response.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a Newton undercloud and overcloud with 2 compute nodes
2. Upgrade the undercloud to Ocata
3. Try to scale out (run the same commands as in step 1 with compute-scale 3)

Actual results:
The new node fails to get an IP from the DHCP server.

Expected results:
Node deployed.

Additional info:
* This bug has some similarity to BZ #1432028
* Running tcpdump on the undercloud shows the DHCP requests but no response:

08:27:38.956868 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
08:27:46.866607 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
08:28:02.684609 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
08:59:16.975626 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
08:59:24.884914 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
08:59:40.703398 52:54:00:0a:25:27 > Broadcast, ethertype IPv4 (0x0800), length 438: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396
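For reference, the request/reply asymmetry above can be checked mechanically from a capture; a small Python sketch, assuming the tcpdump output format shown (`dhcp_summary` is a hypothetical helper, not part of any tool mentioned here):

```python
# Hypothetical helper: count DHCP requests vs. replies in tcpdump output
# lines of the form shown above, to confirm the server never answers.
def dhcp_summary(lines):
    requests = sum("BOOTP/DHCP, Request" in line for line in lines)
    replies = sum("BOOTP/DHCP, Reply" in line for line in lines)
    return requests, replies

# Two sample lines taken from the capture in this report ("..." elides
# the ethertype/length fields for brevity):
capture = [
    "08:27:38.956868 52:54:00:0a:25:27 > Broadcast ... BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396",
    "08:27:46.866607 52:54:00:0a:25:27 > Broadcast ... BOOTP/DHCP, Request from 52:54:00:0a:25:27, length 396",
]
print(dhcp_summary(capture))  # (2, 0): requests seen, zero replies
```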
Created attachment 1267167 [details] neutron/dhcp-agent.log
Created attachment 1267168 [details] neutron/openvswitch-agent.log
As with https://bugzilla.redhat.com/show_bug.cgi?id=1432028, it looks like the neutron DHCP server is not responding because of a neutron port binding failure due to a dead neutron agent.

** dhcp-agent.log **

Here the neutron port corresponding to MAC 52:54:00:0a:25:27 is being added to the DHCP agent but the vif_type is "binding_failed":

2017-03-28 10:02:07.706 19566 INFO neutron.agent.dhcp.agent [req-31a21181-c1b0-42ed-9763-693c4cfe8dbe a41c516915bd49b8b58573b9d81a3a23 f8fcf891cea7471e957d2dda4a009380 - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=undercloud-0.redhat.local, binding:profile=, binding:vif_details=, binding:vif_type=binding_failed, binding:vnic_type=normal, created_at=2017-03-28T14:02:00Z, description=, device_id=0e368cf7-21b2-43b8-8a84-82eea663bd9a, device_owner=compute:None, extra_dhcp_opts=[{u'opt_value': u'192.168.24.1', u'ip_version': 4, u'opt_name': u'server-ip-address'}, {u'opt_value': u'undionly.kpxe', u'ip_version': 4, u'opt_name': u'tag:!ipxe,bootfile-name'}, {u'opt_value': u'http://192.168.24.1:8088/boot.ipxe', u'ip_version': 4, u'opt_name': u'tag:ipxe,bootfile-name'}, {u'opt_value': u'192.168.24.1', u'ip_version': 4, u'opt_name': u'tftp-server'}], fixed_ips=[{u'subnet_id': u'0c7877b2-b7d6-44f7-acc0-37c23386ccb7', u'ip_address': u'192.168.24.23'}], id=048b10ab-fbbe-403b-b61b-6c2768e8fe6c, mac_address=52:54:00:0a:25:27, name=, network_id=d69795b7-5e8d-4e5c-af76-a2b5c8ba4572, project_id=7d637d45ea47471080085a245f85ad83, revision_number=11, security_groups=[u'a5775727-aed9-4ac3-b4af-9dae5c1923de'], status=DOWN, tags=[], tenant_id=7d637d45ea47471080085a245f85ad83, updated_at=2017-03-28T14:02:07Z

In the log above the port id is 048b10ab-fbbe-403b-b61b-6c2768e8fe6c.
** neutron/server.log **

We can see the port bind failure occurred because of the dead neutron agent:

2017-03-28 10:02:06.840 19532 WARNING neutron.plugins.ml2.drivers.mech_agent [req-92a14e4a-8b4b-4813-88c5-e7af3226721b a41c516915bd49b8b58573b9d81a3a23 f8fcf891cea7471e957d2dda4a009380 - - -] Refusing to bind port 048b10ab-fbbe-403b-b61b-6c2768e8fe6c to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2017, 3, 28, 9, 33, 28), 'availability_zone': None, 'alive': False, 'topic': u'N/A', 'host': u'undercloud-0.redhat.local', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'Subnet': u'1.0', u'Network': u'1.0', u'SubPort': u'1.0', u'SecurityGroup': u'1.0', u'SecurityGroupRule': u'1.0', u'Trunk': u'1.1', u'QosPolicy': u'1.4', u'Port': u'1.0'}, 'created_at': datetime.datetime(2017, 3, 27, 14, 14, 45), 'started_at': datetime.datetime(2017, 3, 28, 9, 32, 58), 'id': u'25693163-c59b-4b67-8fd1-ee306ac8ed8e', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'tunneling_ip': None, u'arp_responder_enabled': False, u'devices': 1, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'geneve', u'gre', u'internal', u'ipsec_gre', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'log_agent_heartbeats': False, u'l2_population': False, u'tunnel_types': [], u'extensions': [], u'enable_distributed_routing': False, u'bridge_mappings': {u'ctlplane': u'br-ctlplane'}}}

There are many of these "dead agent" log messages starting at 2017-03-28 05:34:49.937:

2017-03-28 05:34:49.937 19537 WARNING neutron.db.agents_db [req-1c2f1800-dae7-4511-935b-da84fbb33408 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
2017-03-28 05:35:26.968 19537 WARNING neutron.db.agents_db [req-1c2f1800-dae7-4511-935b-da84fbb33408 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
2017-03-28 05:36:03.996 19537 WARNING neutron.db.agents_db [req-1c2f1800-dae7-4511-935b-da84fbb33408 - - -

** neutron/openvswitch-agent.log **

It looks like the agent died here:

2017-03-28 05:15:20.123 23628 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: None _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:233
2017-03-28 05:15:20.124 23628 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: None
2017-03-28 05:15:20.124 23628 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] dies due to the error: None
2017-03-28 05:15:20.134 23628 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 545, in close
    self.uninstantiate(app_name)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 528, in uninstantiate
    app = self.applications.pop(name)
KeyError: 'ofctl_service'
2017-03-28 05:15:20.197 23628 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=23708
2017-03-28 05:20:55.303 6015 INFO neutron.common.config [-] Logging enabled!
2017-03-28 05:20:55.304 6015 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 10.0.0

I'm attaching the relevant neutron logs. I want to see if someone from the Neutron team can take a look at why this agent died after the upgrade.
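The "Refusing to bind port ... to dead agent" warning comes from ml2's agent-liveness check before binding. A toy reconstruction (not neutron's actual code) of that filter, using the heartbeat and request timestamps from the logs above; the 75-second `agent_down_time` default is an assumption taken from neutron.conf defaults:

```python
# Toy model of the liveness check ml2 performs before binding a port:
# an agent whose last heartbeat is older than agent_down_time is 'dead',
# and ports on its host are refused binding. Timestamps mirror the logs.
from datetime import datetime, timedelta

AGENT_DOWN_TIME = timedelta(seconds=75)  # assumed neutron.conf default

def is_alive(heartbeat, now):
    return now - heartbeat <= AGENT_DOWN_TIME

def bindable_hosts(agents, now):
    return {a['host'] for a in agents if is_alive(a['heartbeat_timestamp'], now)}

agents = [
    {'host': 'undercloud-0.redhat.local',
     'binary': 'neutron-openvswitch-agent',
     'heartbeat_timestamp': datetime(2017, 3, 28, 9, 33, 28)},
]
now = datetime(2017, 3, 28, 10, 2, 6)  # time of the refused bind attempt
print(bindable_hosts(agents, now))  # empty set: the heartbeat is ~29 min stale
```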
This looks familiar; we've seen a similar ryu error while investigating bug 1425507. It also seems that it's not always reproducible, which could explain why I wasn't able to reproduce bug 1432028.
Thanks Marius. Yeah this comment - https://bugzilla.redhat.com/show_bug.cgi?id=1425507#c13 refers to the ryu bug that was being hit on upgrade which has the same signature as above: https://bugs.launchpad.net/neutron/+bug/1589746. It might make sense to make this a duplicate of bug 1425507.
Hi Raviv, could you confirm whether this is a duplicate of bugzilla 1425507 and, if it is, mark it accordingly? Many thanks.
This bug was discovered on the undercloud while BZ 1425507 was discovered on an overcloud node (compute). As the upgrade procedure is the same for the undercloud and the overcloud, we can safely assume it is the same bug and the fix will work in both cases. *** This bug has been marked as a duplicate of bug 1425507 ***
I'm reopening this bug: BZ 1425507 is fixed and can no longer be reproduced, but this bug still happens. Attached are new neutron logs and system info.
rpm info:

[root@undercloud-0 neutron]# rpm -qa | grep neutron
openstack-neutron-ml2-10.0.0-18.el7ost.noarch
openstack-neutron-10.0.0-18.el7ost.noarch
python-neutron-10.0.0-18.el7ost.noarch
python-neutron-lib-1.1.0-1.el7ost.noarch
openstack-neutron-openvswitch-10.0.0-18.el7ost.noarch
puppet-neutron-10.3.0-2.el7ost.noarch
python-neutronclient-6.1.0-1.el7ost.noarch
openstack-neutron-common-10.0.0-18.el7ost.noarch
Created attachment 1272998 [details] process list
Created attachment 1273002 [details] server log
Created attachment 1273004 [details] openswitch-agent
Hi,

here's what happened:

- We stop the neutron and openstack services and upgrade some packages as part of the undercloud upgrade procedure:

 22  sudo yum -y update instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient
 23  sudo systemctl stop 'openstack-*'
 24  sudo systemctl stop 'neutron-*'
 25  sudo systemctl stop httpd

NOTE: Raviv, this is not the same order as documented at https://gitlab.cee.redhat.com/mandreou/OSP10-OSP11-Upgrade/#undercloud-upgrade ; I don't know if it's relevant, but it would be worth running those commands in the order documented there.

- Then we hit this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1425507 and after that neutron doesn't recover.

Here is all the detail from the platform. The TZ differs between logs: EDT for some, UTC for others. For reference, EDT = UTC-4.

===> yum.log: EDT => initial installation

Apr 20 04:13:34 Installed: openstack-tripleo-heat-templates-5.2.0-9.el7ost.noarch
Apr 20 04:16:29 Installed: 1:openstack-neutron-common-9.2.0-8.el7ost.noarch
Apr 20 04:16:29 Installed: 1:openstack-neutron-9.2.0-8.el7ost.noarch

===> /var/log/openvswitch/ovs-vswitchd.log-20170421.gz: UTC => everything looks good

2017-04-20T08:25:47.089Z|00020|bridge|INFO|bridge br-int: added interface br-int on port 65534
2017-04-20T08:25:47.320Z|00025|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connecting...
2017-04-20T08:25:47.327Z|00026|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2017-04-20T08:26:57.535Z|00040|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 6 flow_mods 50 s ago (1 adds, 5 deletes)

===> yum.log: EDT => as part of the undercloud upgrade we upgrade some packages:

Apr 20 06:52:55 Updated: instack-undercloud-6.0.0-6.el7ost.noarch

===> history (inferred from yum.log): so we know that those commands took place at ~06:52 EDT.
 22  sudo yum -y update instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient
 23  sudo systemctl stop 'openstack-*'
 24  sudo systemctl stop 'neutron-*'
 25  sudo systemctl stop httpd

===> /var/log/neutron/openvswitch-agent.log-20170421.gz: EDT : bz 1425507

2017-04-20 06:54:39.385 23058 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor Interface name,ofport,external_ids --format=json] dies due to the error: None
2017-04-20 06:54:39.386 23058 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 545, in close
    self.uninstantiate(app_name)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 528, in uninstantiate
    app = self.applications.pop(name)
KeyError: 'ofctl_service'
2017-04-20 06:54:39.457 23058 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=23141

===> /var/log/openvswitch/ovs-vswitchd.log-20170421.gz: UTC : it will never work again :)

2017-04-20T10:54:39.536Z|00042|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connection closed by peer
2017-04-20T10:54:40.192Z|00044|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connecting...
2017-04-20T10:54:40.192Z|00045|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-04-20T10:54:40.192Z|00046|rconn|INFO|br-int<->tcp:127.0.0.1:6633: waiting 2 seconds before reconnect

===> yum.log: EDT : the yum upgrade command takes place as part of the undercloud upgrade; it's already too late (those are the versions where the fix for bz 1425507 is in)

Apr 20 07:56:52 Updated: 1:openstack-neutron-10.0.0-18.el7ost.noarch
Apr 20 08:00:09 Updated: openstack-tripleo-heat-templates-6.0.0-5.el7ost.noarch
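Since yum.log and the neutron logs above are in EDT while ovs-vswitchd.log is in UTC, correlating them takes a fixed-offset shift (EDT = UTC-4); a quick sketch:

```python
# Normalize an EDT timestamp to UTC so the two logs can be correlated.
# EDT = UTC-4, so adding 4 hours converts EDT to UTC.
from datetime import datetime, timedelta

EDT_TO_UTC = timedelta(hours=4)

def edt_to_utc(ts):
    return ts + EDT_TO_UTC

# The agent crash in openvswitch-agent.log (EDT)...
crash_edt = datetime(2017, 4, 20, 6, 54, 39)
# ...lines up with the first refused reconnect in ovs-vswitchd.log (UTC):
print(edt_to_utc(crash_edt))  # 2017-04-20 10:54:39
```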
Asking Jacob to have a look as he was assigned bz 1425507, and adjusting the DFG.
Hi Sofer, DFG shows as DFG:HardProv in the internal whiteboard, shouldn't it be DFG:Upgrades or DFG:Networking?
Yeah, I thought I had modified that, but I had a mid-air collision...
The issue in bug 1425507 was that the rootwrap daemon wasn't killed properly in some cases, leaving the socket bound to the port. The upgrade scriptlet of the openstack-neutron-openvswitch package should kill the orphaned rootwrap daemon. Can you please paste here which process is still bound to port 6633? Is that the orphaned 23439 ?
I'm setting needinfo also on Sofer, as the original bug was reported before bug 1425507 was fixed.

> ===> yum.log: EDT : the yum upgrade command takes place as part of the undercloud upgrade, it's already too late

The orphaned rootwrap is killed by the upgrade of the openstack-neutron-openvswitch package. I don't understand why "it's already too late".
Hi Jakub, it's already too late because the agent was restarted and failed to do so earlier, at 2017-04-20 06:54:39.385 (EDT), while the upgrade of the package with the fix happened at Apr 20 08:00:09 (EDT).
Hi,

so on the environment, after the upgrade of the undercloud we have this:

$ neutron agent-list
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                      | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| 0b5342a6-f9ea-4018-a8b7-b6d7b4c18178 | Open vSwitch agent | undercloud-0.redhat.local |                   | xxx   | True           | neutron-openvswitch-agent |
| 50011aa0-6230-4246-8e25-4c61f8e8f3db | DHCP agent         | undercloud-0.redhat.local | nova              | :-)   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+

If one restarts ovs-vswitchd:

sudo systemctl restart ovs-vswitchd

then the Newton overcloud deployment works:

./overcloud_deploy_compat_non_ha.sh
...
2017-04-22 16:16:53Z [overcloud]: CREATE_COMPLETE Stack CREATE completed successfully
Stack overcloud CREATE_COMPLETE

and the stack is there:

$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 985564d7-3814-4160-a431-9ed394609f8d | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| b18fceba-8241-49bf-9096-f6e1e25ec437 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

Unfortunately compute-0 is accessible, but controller-0 is not (ping 192.168.24.14 doesn't work).
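As a side note, the dead agent can be spotted programmatically from the `neutron agent-list` table; a hedged sketch, assuming the 7-column ASCII layout shown above:

```python
def dead_agents(table_text):
    """Return binaries of agents whose 'alive' column is not ':-)' in the
    ASCII table printed by `neutron agent-list` (7-column layout assumed)."""
    dead = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.split('|')[1:-1]]
        if len(cells) != 7 or cells[0] == 'id':
            continue  # skip border lines and the header row
        alive, binary = cells[4], cells[6]
        if alive != ':-)':
            dead.append(binary)
    return dead

# Minimal reproduction of the table above (ids shortened):
table = """\
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
| 0b5342a6 | Open vSwitch agent | undercloud-0.redhat.local |  | xxx | True | neutron-openvswitch-agent |
| 50011aa0 | DHCP agent | undercloud-0.redhat.local | nova | :-) | True | neutron-dhcp-agent |
"""
print(dead_agents(table))  # ['neutron-openvswitch-agent']
```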
So we first have a problem during the undercloud upgrade, and we should get this out of the way as it is unrelated to mixed versions. It would be nice to rerun this test using the exact sequence highlighted in https://gitlab.cee.redhat.com/mandreou/OSP10-OSP11-Upgrade/#undercloud-upgrade for the undercloud upgrade, and then check the status of ovs-vswitchd with neutron agent-list just after the undercloud upgrade. Once this is sorted out, we will check further why controller-0 is created but not responding.
The same problem happens again when running the procedure from the document line by line.
Hi, just to be precise: the problem we reproduce here is that the upgrade of the undercloud hits the non-reaping of the neutron rootwrap daemon and thus has a non ovs-vswitchd daemon. So, this bz is not yet related to the mixed-version problem but to an *undercloud upgrade* problem and openvswitch.
So, confirmation of the non-reaped rootwrap:

[root@undercloud-0 ~]# netstat -pant | grep 6633 | grep LIST
tcp       51      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      4870/sudo

[root@undercloud-0 ~]# ps fauxww | grep 4870
root     24358  0.0  0.0 112648   964 pts/1    S+   05:43   0:00          \_ grep --color=auto 4870
root      4870  0.0  0.0 193392  2812 ?        S    04:47   0:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

[root@undercloud-0 ~]# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-04-24 04:47:41 EDT; 56min ago
 Main PID: 4750 (ovs-vswitchd)
   CGroup: /system.slice/ovs-vswitchd.service
           └─4750 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Apr 24 04:47:41 undercloud-0.redhat.local systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 24 04:47:41 undercloud-0.redhat.local ovs-ctl[4726]: Starting ovs-vswitchd [  OK  ]
Apr 24 04:47:41 undercloud-0.redhat.local ovs-ctl[4726]: Enabling remote OVSDB managers [  OK  ]
Apr 24 04:47:41 undercloud-0.redhat.local systemd[1]: Started Open vSwitch Forwarding Unit.
Apr 24 05:00:33 undercloud-0.redhat.local ovs-vswitchd[4750]: ovs|00052|rconn|ERR|br-int<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting
Apr 24 05:00:34 undercloud-0.redhat.local ovs-vswitchd[4750]: ovs|00054|rconn|ERR|br-ctlplane<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting

The packages used:

Apr 23 08:51:35 Installed: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
Apr 23 08:58:29 Installed: python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
Apr 23 08:58:29 Installed: 1:openstack-neutron-openvswitch-9.2.0-8.el7ost.noarch
Apr 24 04:43:06 Updated: python-openvswitch-2.6.1-10.git20161206.el7fdp.noarch
Apr 24 04:43:26 Updated: openvswitch-2.6.1-10.git20161206.el7fdp.x86_64
Apr 24 04:43:39 Updated: 1:openstack-neutron-openvswitch-10.0.1-1.el7ost.noarch
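To answer the question above about which process is bound to port 6633, the PID can be read straight from the `netstat -pant` output; a small sketch assuming that column layout (`pid_listening_on` is an illustrative helper, not an existing tool):

```python
def pid_listening_on(netstat_output, port):
    """Parse `netstat -pant` output and return the PID of the process
    LISTENing on the given port, or None. State is column 6 and
    PID/Program name is column 7 in that layout."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if (len(fields) >= 7 and fields[5] == 'LISTEN'
                and fields[3].endswith(':%d' % port)):
            return int(fields[6].split('/')[0])  # e.g. '4870/sudo' -> 4870
    return None

# Sample line taken from the netstat output in this comment:
sample = "tcp       51      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      4870/sudo"
print(pid_listening_on(sample, 6633))  # 4870: the sudo wrapper of the orphaned rootwrap daemon
```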
(In reply to Sofer Athlan-Guyot from comment #24)
> Hi,
> just as a precision, the problem we reproduce here is that the upgrade of
> the undercloud hits the non ripping of neutron rootwrap daemon and thus has
> a non ovs-vswitchd daemon.

What is a 'non ovs-vswitchd daemon'? How does ovs-vswitchd relate to the rootwrap daemon? ovs-vswitchd sets up the datapath based on ovsdb content; it should be running during the upgrade the whole time. The culprit must be in the scriptlet of the openstack-neutron-openvswitch package and the way it kills the orphaned neutron-rootwrap-daemon, if you see the orphaned process after the undercloud upgrade.
This bug is a duplicate of BZ 1425507. A new bug about the dead neutron agent after the undercloud upgrade will be opened, regardless of the mixed-version and overcloud networking issues. *** This bug has been marked as a duplicate of bug 1425507 ***