Bug 1807826

Summary:	[OSP15->16] Neutron is down after Controllers upgrade. Pacemaker allocating different IPs for ovn-dbserver
Product:	Red Hat OpenStack	Reporter:	Jose Luis Franco <jfrancoa>
Component:	openstack-tripleo-heat-templates	Assignee:	Jose Luis Franco <jfrancoa>
Status:	CLOSED ERRATA	QA Contact:	nlevinki <nlevinki>
Severity:	urgent	Docs Contact:
Priority:	high
Version:	16.0 (Train)	CC:	batkisso, ccamacho, dciabrin, jjoyce, jschluet, lmiccini, mburns, shrjoshi, slinaber, tvignaud
Target Milestone:	zstream	Keywords:	Triaged
Target Release:	16.0 (Train on RHEL 8.1)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-11.3.2-0.20200310160324.b3d9c16.el8ost	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-05-14 12:16:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jose Luis Franco 2020-02-27 10:19:14 UTC

Description of problem:

After upgrading the controllers from OSP15 to OSP16, it isn't possible to create any network. The neutron server is completely down:

2020-02-26 13:21:39 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/networks, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:39 | Creating router internal_net_cb7387c49a_router
2020-02-26 13:21:41 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/routers, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:41 | Creating network internal_net_cb7387c49a
2020-02-26 13:21:43 | Error while executing command: HttpException: 503, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:43 | Creating subnet internal_net_cb7387c49a_subnet
2020-02-26 13:21:45 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/networks/internal_net_cb7387c49a, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:45 | Add subnet internal_net_cb7387c49a_subnet to router internal_net_cb7387c49a_router
2020-02-26 13:21:47 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/subnets/internal_net_cb7387c49a_subnet, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:47 | Set external-gateway for internal_net_cb7387c49a_router
2020-02-26 13:21:49 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/routers/internal_net_cb7387c49a_router, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:51 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/security-groups, 503 Service Unavailable: No server is available to handle this request.

Checking the neutron server logs, we can see neutron going down exactly when the deploy steps start the OSP16 ovn container:

/var/log/containers/neutron/server.log.2
========================================

2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.pool_timeout          = None log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.retry_interval        = 10 log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.slave_connection      = **** log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.sqlite_synchronous    = True log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.715 8 DEBUG oslo_service.service [-] database.use_db_reconnect      = False log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.715 8 DEBUG oslo_service.service [-] ******************************************************************************** log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2591
2020-02-26 13:09:11.524 27 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.524 27 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.545 33 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for MaintenanceWorker with retry
2020-02-26 13:09:11.546 33 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.551 28 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.551 28 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.574 30 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.575 30 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.580 29 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.580 29 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.586 31 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for RpcWorker with retry
2020-02-26 13:09:11.587 31 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.595 32 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for RpcReportsWorker with retry
2020-02-26 13:09:11.596 32 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.599 34 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for AllServicesNeutronWorker with retry
2020-02-26 13:09:11.600 34 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:15.529 27 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:15.531 27 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
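
To see at a glance which endpoint neutron is failing to reach, the errors above can be tallied per target address. The following is a minimal Python sketch, not part of the original report; the function name `refused_endpoints` and the regular expression are illustrative assumptions based only on the log format quoted here.

```python
import re
from collections import Counter

# Pattern assumed from the ovsdbapp error lines quoted in this report.
REFUSED_RE = re.compile(
    r"Unable to open stream to tcp:(?P<host>[0-9.]+):(?P<port>\d+) "
    r"to retrieve schema: Connection refused"
)

def refused_endpoints(log_text):
    """Count 'Connection refused' errors per (host, port) target."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REFUSED_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts

# Two error lines copied from the server.log excerpt above.
sample = """\
2020-02-26 13:09:11.524 27 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.546 33 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
"""
print(refused_endpoints(sample))
```

Every refused connection in the excerpt targets the same address and port, which is the value to compare against what ovsdb-server is actually bound to.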


overcloud_upgrade_run_Controller.log
====================================

2020-02-26 13:03:00 |         "Completed $ podman run --name ovn_dbs_restart_bundle --label config_id=tripleo_step3 --label container_name=ovn_dbs_restart_bundle --label managed_by=tripleo-Controller --label config_data={\"command\": \"/pacemaker_restart_bundle.sh ovn-dbs-bundle ovn_dbs\", \"config_volume\": \"ovn_dbs\", \"detach\": false, \"environment\": {\"TRIPLEO_MINOR_UPDATE\": \"\"}, \"image\": \"undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200213.1\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 0, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro\", \"/dev/shm:/dev/shm:rw\", \"/etc/puppet:/etc/puppet:ro\"]} --conmon-pidfile=/var/run/ovn_dbs_restart_bundle.pid --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/ovn_dbs_restart_bundle.log --env=TRIPLEO_MINOR_UPDATE --net=host --ipc=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro --volume=/dev/shm:/dev/shm:rw 
--volume=/etc/puppet:/etc/puppet:ro --cpuset-cpus=0,1,2,3,4,5,6,7 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200213.1 /pacemaker_restart_bundle.sh ovn-dbs-bundle ovn_dbs",
2020-02-26 13:03:00 |         "stdout: Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.",

....

2020-02-26 13:03:00 |         "Running container: ovn_dbs_init_bundle",
2020-02-26 13:03:00 |         "$ podman ps -a --filter label=container_name=ovn_dbs_init_bundle --filter label=config_id=tripleo_step3 --format {{.Names}}",
2020-02-26 13:03:00 |         "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=ovn_dbs_init_bundle', '--filter', 'label=config_id=tripleo_step3', '--format', '{{.Names}}']\" - retrying without config_id",
2020-02-26 13:03:00 |         "$ podman ps -a --filter label=container_name=ovn_dbs_init_bundle --format {{.Names}}",
2020-02-26 13:03:00 |         "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=ovn_dbs_init_bundle', '--format', '{{.Names}}']\"",
2020-02-26 13:03:00 |         "Start container ovn_dbs_init_bundle as ovn_dbs_init_bundle.",


After some deeper analysis, it looks like the pacemaker resource has a different VIP assigned than the one the container is actually using:

[root@controller-0 ~]# pcs resource show ovn-dbs-bundle
Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.
 Bundle: ovn-dbs-bundle
  Podman: image=cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest masters=1 network=host options="--log-driver=k8s-file -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replicas=3 run-command="/bin/bash /usr/local/bin/kolla_start"
  Network: control-port=3125
  Storage Mapping:
   options=ro source-dir=/var/lib/kolla/config_files/ovn_dbs.json target-dir=/var/lib/kolla/config_files/config.json (ovn-dbs-cfg-files)
   options=ro source-dir=/lib/modules target-dir=/lib/modules (ovn-dbs-mod-files)
   options=rw source-dir=/var/lib/openvswitch/ovn target-dir=/run/openvswitch (ovn-dbs-run-files)
   options=rw source-dir=/var/log/containers/openvswitch target-dir=/var/log/openvswitch (ovn-dbs-log-files)
   options=rw source-dir=/var/lib/openvswitch/ovn target-dir=/etc/openvswitch (ovn-dbs-db-path)
  Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers)
   Attributes: inactive_probe_interval=180000 manage_northd=yes master_ip=172.17.1.103 nb_master_port=6641 sb_master_port=6642
   Meta Attrs: container-attribute-target=host notify=true
   Operations: demote interval=0s timeout=50s (ovndb_servers-demote-interval-0s)
               monitor interval=10s role=Master timeout=60s (ovndb_servers-monitor-interval-10s)
               monitor interval=30s role=Slave timeout=60s (ovndb_servers-monitor-interval-30s)
               notify interval=0s timeout=20s (ovndb_servers-notify-interval-0s)
               promote interval=0s timeout=50s (ovndb_servers-promote-interval-0s)
               start interval=0s timeout=200s (ovndb_servers-start-interval-0s)
               stop interval=0s timeout=200s (ovndb_servers-stop-interval-0s)

============================================
ovn-dbs-bundle has as master_ip 172.17.1.103
============================================

But the container has brought the service up on 172.17.1.108:

[root@controller-0 ~]# netstat -ntapu |grep 6641
tcp        0      0 172.17.1.108:6641       0.0.0.0:*               LISTEN      789094/ovsdb-server 
[root@controller-0 ~]# netstat -ntapu |grep 6642
tcp        0      0 172.17.1.108:6642       0.0.0.0:*               LISTEN      789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.49:43482       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.49:43478       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.49:43476       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.49:43480       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.19:38716       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.19:38714       ESTABLISHED 789104/ovsdb-server 
tcp        0      0 172.17.1.108:6642       172.17.1.19:38710       ESTABLISHED 789104/ovsdb-server
tcp        0      0 172.17.1.108:6642       172.17.1.19:38712       ESTABLISHED 789104/ovsdb-serve
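
The mismatch above can be checked mechanically by comparing the `master_ip` attribute from the `pcs` output with the addresses that `ovsdb-server` is actually listening on. A minimal Python sketch follows, assuming the output formats quoted in this report; the helper names (`configured_master_ip`, `listening_addresses`, `vip_mismatch`) are hypothetical and not part of any tool.

```python
import re

def configured_master_ip(pcs_output):
    """Return the master_ip attribute from the pcs resource text, or None."""
    match = re.search(r"\bmaster_ip=(\S+)", pcs_output)
    return match.group(1) if match else None

def listening_addresses(netstat_output, ports=(6641, 6642)):
    """Return {port: address} for ovsdb-server sockets in LISTEN state."""
    found = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        if "ovsdb-server" not in fields[6]:
            continue
        address, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in ports:
            found[int(port)] = address
    return found

def vip_mismatch(pcs_output, netstat_output):
    """Return {port: address} for listeners not bound to the configured master_ip."""
    expected = configured_master_ip(pcs_output)
    return {
        port: address
        for port, address in listening_addresses(netstat_output).items()
        if address != expected
    }

# Samples copied from the command output in this report.
pcs_sample = (
    "   Attributes: inactive_probe_interval=180000 manage_northd=yes "
    "master_ip=172.17.1.103 nb_master_port=6641 sb_master_port=6642"
)
netstat_sample = """\
tcp        0      0 172.17.1.108:6641       0.0.0.0:*               LISTEN      789094/ovsdb-server
tcp        0      0 172.17.1.108:6642       0.0.0.0:*               LISTEN      789104/ovsdb-server
tcp        0      0 172.17.1.108:6642       172.17.1.49:43482       ESTABLISHED 789104/ovsdb-server
"""
print(vip_mismatch(pcs_sample, netstat_sample))
```

An empty result would mean both databases listen on the configured `master_ip`; any entry in the result points at a listener still bound to a different address.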

[root@controller-0 ~]# sudo podman exec -it ovn-dbs-bundle-podman-0 bash
()[root@controller-0 /]# ps -aef|grep ovsdb
root         141       1  0 Feb26 ?        00:00:00 ovsdb-server: monitoring pid 142 (healthy)
root         142     141  0 Feb26 ?        00:00:06 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:172.17.1.108 --sync-from=tcp:192.0.2.254:6641 /etc/openvswitch/ovnnb_db.db
root         151       1  0 Feb26 ?        00:00:00 ovsdb-server: monitoring pid 152 (healthy)
root         152     151  0 Feb26 ?        00:00:11 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers --remote=ptcp:6642:172.17.1.108 --sync-from=tcp:192.0.2.254:6642 /etc/openvswitch/ovnsb_db.db
root      283614  283550  0 09:52 pts/0    00:00:00 grep --color=auto ovsdb
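
The `--remote=ptcp:PORT:ADDRESS` arguments in the process listing show the address each database server was started with. A short Python sketch for extracting them is given below; the function name `ptcp_remotes` is an illustrative assumption, and the sample command line is abbreviated from the listing above.

```python
import re

# ovsdb-server passive TCP remotes have the form ptcp:PORT[:IP].
PTCP_RE = re.compile(r"--remote=ptcp:(\d+)(?::(\S+))?")

def ptcp_remotes(command_line):
    """Return a list of (port, address) tuples; address is None if omitted."""
    return [
        (int(port), address or None)
        for port, address in PTCP_RE.findall(command_line)
    ]

# Abbreviated from the northbound ovsdb-server command line above.
nb_cmd = (
    "ovsdb-server --remote=punix:/var/run/openvswitch/ovnnb_db.sock "
    "--remote=db:OVN_Northbound,NB_Global,connections "
    "--remote=ptcp:6641:172.17.1.108 --sync-from=tcp:192.0.2.254:6641 "
    "/etc/openvswitch/ovnnb_db.db"
)
print(ptcp_remotes(nb_cmd))
```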

An environment with the issue reproduced is available for debugging.



Version-Release number of selected component (if applicable):

()[root@controller-0 /]# sudo rpm -qa | grep ovn
rhosp-openvswitch-ovn-common-2.11-0.5.el8ost.noarch
rhosp-openvswitch-ovn-central-2.11-0.5.el8ost.noarch
puppet-ovn-15.4.1-0.20191014133046.192ac4e.el8ost.noarch
ovn2.11-2.11.1-24.el8fdp.x86_64
ovn2.11-central-2.11.1-24.el8fdp.x86_64

()[root@controller-0 /]# sudo rpm -qa | grep pcs
pcs-0.10.2-4.el8.x86_64

()[root@controller-0 /]# sudo rpm -qa | grep pacemaker
pacemaker-libs-2.0.2-3.el8.x86_64
pacemaker-schemas-2.0.2-3.el8.noarch
pacemaker-cli-2.0.2-3.el8.x86_64
pacemaker-2.0.2-3.el8.x86_64
puppet-pacemaker-0.8.1-0.20200203145608.83d23b3.el8ost.noarch
pacemaker-cluster-libs-2.0.2-3.el8.x86_64
pacemaker-remote-2.0.2-3.el8.x86_64


How reproducible:


Steps to Reproduce:
1. Deploy OSP15 latest and upgrade the Undercloud to OSP16
2. Run overcloud upgrade prepare
3. Run overcloud upgrade run Controllers

Actual results:

Network unavailable after the upgrade of the controllers

Expected results:

Upgrade of the controllers succeeds and the neutron server is available.

Additional info:

Comment 1 Jose Luis Franco 2020-02-27 10:46:10 UTC

There seems to be a change that landed in Train which creates a dedicated VIP for OVN DBS: https://github.com/openstack/tripleo-heat-templates/commit/c2d481684063af5a23fa922f028b383ecf81a3f4

This change will probably require adding some upgrade_tasks to the ovn-dbs pacemaker service template to deal with it: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L353

Comment 2 Damien Ciabrini 2020-02-28 10:28:37 UTC

Luca and I have been looking at this since yesterday, and we think the problem is the following:

  . at the end of the controller upgrade, there is a deploy task that runs puppet code to reassess the state of the ovn-dbs-bundle resource (it's run in container ovn_dbs_init_bundle)

  . the puppet code correctly creates the new VIP and all its associated location and ordering constraints.

  . the ovndb_servers pacemaker resource is reconfigured to listen to the new VIP (attribute "master_ip" is updated in the resource config)

  . All resource replicas that are marked as Slaves are stopped, and then restarted. However, the Master resource is only demoted, and re-promoted.
  
  . in the OVN resource agent, a demotion is not sufficient to stop the ovndb_servers process. So the new VIP is never picked up. 

It's not clear yet whether this is expected pacemaker behaviour, but in any case, forcing a restart of the resource with "pcs resource restart" is enough to restart all ovn processes and make them pick up the new config.
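
The workaround described above could be scripted roughly as follows. This is only a sketch under stated assumptions: it shells out to `pcs resource restart`, the command named in this comment, and the helper names and the `dry_run` parameter are hypothetical. It is not necessarily how the fix was implemented in tripleo-heat-templates.

```python
import shutil
import subprocess

def restart_command(resource="ovn-dbs-bundle"):
    """Build the pcs command that fully restarts the given resource."""
    return ["pcs", "resource", "restart", resource]

def restart_resource(resource="ovn-dbs-bundle", dry_run=True):
    """Run the restart command, or only return it when dry_run is set.

    Returns (command, return_code); return_code is None if nothing was run.
    """
    cmd = restart_command(resource)
    if dry_run or shutil.which("pcs") is None:
        # Nothing is executed; the caller can inspect the command instead.
        return cmd, None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return cmd, result.returncode

cmd, rc = restart_resource(dry_run=True)
print(" ".join(cmd))
```

A full restart, unlike a demote followed by a promote, stops the running ovsdb-server processes, so they start again with the updated `master_ip`.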

Comment 10 Jose Luis Franco 2020-04-06 16:52:33 UTC

Verified on a local environment with the tht package:

(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-0.20200324120625.c3a8eb4.el8ost.noarch


2020-04-06 12:24:46 | TASK [Restart ovn-dbs service (pacemaker)] *************************************
2020-04-06 12:24:46 | Monday 06 April 2020  12:23:35 +0000 (0:00:02.278)       0:00:10.444 **********
2020-04-06 12:24:46 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | changed: [controller-0] => {"changed": true, "out": "ovn-dbs-bundle successfully restarted\n", "rc": 0}
....

2020-04-06 12:24:53 | TASK [include_tasks] ***********************************************************
2020-04-06 12:24:53 | Monday 06 April 2020  12:24:53 +0000 (0:00:00.485)       0:01:27.925 **********
2020-04-06 12:24:53 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | PLAY RECAP *********************************************************************
2020-04-06 12:24:53 | controller-0               : ok=13   changed=4    unreachable=0    failed=0    skipped=35   rescued=0    ignored=0
2020-04-06 12:24:53 | controller-1               : ok=12   changed=3    unreachable=0    failed=0    skipped=36   rescued=0    ignored=0
2020-04-06 12:24:53 | controller-2               : ok=12   changed=3    unreachable=0    failed=0    skipped=36   rescued=0    ignored=0
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | Monday 06 April 2020  12:24:53 +0000 (0:00:00.358)       0:01:28.284 **********
2020-04-06 12:24:53 | ===============================================================================
2020-04-06 12:24:54 |
2020-04-06 12:24:54 | Updated nodes - Controller
2020-04-06 12:24:54 | Success
2020-04-06 12:24:54 | 2020-04-06 12:24:54.545 661020 INFO tripleoclient.v1.overcloud_upgrade.MajorUpgradeRun [-] Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
2020-04-06 12:24:54 | 2020-04-06 12:24:54.546 661020 INFO osc_lib.shell [-] END return value: None

Comment 15 errata-xmlrpc 2020-05-14 12:16:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114