Description of problem:

After upgrading the controllers from OSP15 to OSP16, it isn't possible to create any network. The neutron server is completely down:

2020-02-26 13:21:39 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/networks, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:39 | Creating router internal_net_cb7387c49a_router
2020-02-26 13:21:41 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/routers, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:41 | Creating network internal_net_cb7387c49a
2020-02-26 13:21:43 | Error while executing command: HttpException: 503, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:43 | Creating subnet internal_net_cb7387c49a_subnet
2020-02-26 13:21:45 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/networks/internal_net_cb7387c49a, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:45 | Add subnet internal_net_cb7387c49a_subnet to router internal_net_cb7387c49a_router
2020-02-26 13:21:47 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/subnets/internal_net_cb7387c49a_subnet, No server is available to handle this request.: 503 Service Unavailable
2020-02-26 13:21:47 | Set external-gateway for internal_net_cb7387c49a_router
2020-02-26 13:21:49 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/routers/internal_net_cb7387c49a_router, 503 Service Unavailable: No server is available to handle this request.
2020-02-26 13:21:51 | HttpException: 503: Server Error for url: https://10.0.0.101:13696/v2.0/security-groups, 503 Service Unavailable: No server is available to handle this request.
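For reference, a few quick checks (not part of the original report; container name and paths assume the usual TripleO layout) that can be run on a controller to confirm it is the neutron API itself that is down, rather than only the load balancer:

  # Is neutron-server listening on its internal API port (9696 by default)?
  sudo ss -ntlp | grep 9696

  # Is the neutron_api container still running?
  sudo podman ps --filter name=neutron_api

  # Watch the neutron server log for startup errors.
  sudo tail -f /var/log/containers/neutron/server.log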
Checking the neutron server logs, we can see neutron going down exactly when the deploy steps start the OSP16 OVN container:

/var/log/containers/neutron/server.log.2
========================================
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.pool_timeout = None log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.retry_interval = 10 log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.slave_connection = **** log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.714 8 DEBUG oslo_service.service [-] database.sqlite_synchronous = True log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.715 8 DEBUG oslo_service.service [-] database.use_db_reconnect = False log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2589
2020-02-26 13:09:09.715 8 DEBUG oslo_service.service [-] ******************************************************************************** log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2591
2020-02-26 13:09:11.524 27 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.524 27 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.545 33 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for MaintenanceWorker with retry
2020-02-26 13:09:11.546 33 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.551 28 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.551 28 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.574 30 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.575 30 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.580 29 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:11.580 29 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.586 31 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for RpcWorker with retry
2020-02-26 13:09:11.587 31 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.595 32 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for RpcReportsWorker with retry
2020-02-26 13:09:11.596 32 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:11.599 34 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for AllServicesNeutronWorker with retry
2020-02-26 13:09:11.600 34 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused
2020-02-26 13:09:15.529 27 INFO networking_ovn.ovsdb.impl_idl_ovn [-] Getting OvsdbNbOvnIdl for WorkerService with retry
2020-02-26 13:09:15.531 27 ERROR ovsdbapp.backend.ovs_idl.idlutils [-] Unable to open stream to tcp:172.17.1.103:6641 to retrieve schema: Connection refused

overcloud_upgrade_run_Controller.log
====================================
2020-02-26 13:03:00 | "Completed $ podman run --name ovn_dbs_restart_bundle --label config_id=tripleo_step3 --label container_name=ovn_dbs_restart_bundle --label managed_by=tripleo-Controller --label config_data={\"command\": \"/pacemaker_restart_bundle.sh ovn-dbs-bundle ovn_dbs\", \"config_volume\": \"ovn_dbs\", \"detach\": false, \"environment\": {\"TRIPLEO_MINOR_UPDATE\": \"\"}, \"image\": \"undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200213.1\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 0, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro\", \"/dev/shm:/dev/shm:rw\", \"/etc/puppet:/etc/puppet:ro\"]} --conmon-pidfile=/var/run/ovn_dbs_restart_bundle.pid --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/ovn_dbs_restart_bundle.log --env=TRIPLEO_MINOR_UPDATE --net=host --ipc=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro --volume=/dev/shm:/dev/shm:rw --volume=/etc/puppet:/etc/puppet:ro --cpuset-cpus=0,1,2,3,4,5,6,7 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200213.1 /pacemaker_restart_bundle.sh ovn-dbs-bundle ovn_dbs",
2020-02-26 13:03:00 | "stdout: Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.",
....
2020-02-26 13:03:00 | "Running container: ovn_dbs_init_bundle", 2020-02-26 13:03:00 | "$ podman ps -a --filter label=container_name=ovn_dbs_init_bundle --filter label=config_id=tripleo_step3 --format {{.Names}}", 2020-02-26 13:03:00 | "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=ovn_dbs_init_bundle', '--filter', 'label=config_id=tripleo_step3', '--format', '{{.Names}}']\" - retrying without config_id", 2020-02-26 13:03:00 | "$ podman ps -a --filter label=container_name=ovn_dbs_init_bundle --format {{.Names}}", 2020-02-26 13:03:00 | "Did not find container with \"['podman', 'ps', '-a', '--filter', 'label=container_name=ovn_dbs_init_bundle', '--format', '{{.Names}}']\"", 2020-02-26 13:03:00 | "Start container ovn_dbs_init_bundle as ovn_dbs_init_bundle.", Doing some deeper analisys, it looks like the pacemaker resource has a different VIP assigned than what the container believes is the right VIP: [root@controller-0 ~]# pcs resource show ovn-dbs-bundle Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead. Bundle: ovn-dbs-bundle Podman: image=cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest masters=1 network=host options="--log-driver=k8s-file -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replic as=3 run-command="/bin/bash /usr/local/bin/kolla_start" Network: control-port=3125 Storage Mapping: options=ro source-dir=/var/lib/kolla/config_files/ovn_dbs.json target-dir=/var/lib/kolla/config_files/config.json (ovn-dbs-cfg-files) options=ro source-dir=/lib/modules target-dir=/lib/modules (ovn-dbs-mod-files) options=rw source-dir=/var/lib/openvswitch/ovn target-dir=/run/openvswitch (ovn-dbs-run-files) options=rw source-dir=/var/log/containers/openvswitch target-dir=/var/log/openvswitch (ovn-dbs-log-files) options=rw source-dir=/var/lib/openvswitch/ovn target-dir=/etc/openvswitch (ovn-dbs-db-path) Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers) Attributes: inactive_probe_interval=180000 manage_northd=yes master_ip=172.17.1.103 nb_master_port=6641 sb_master_port=6642 Meta Attrs: container-attribute-target=host notify=true Operations: demote interval=0s timeout=50s (ovndb_servers-demote-interval-0s) monitor interval=10s role=Master timeout=60s (ovndb_servers-monitor-interval-10s) monitor interval=30s role=Slave timeout=60s (ovndb_servers-monitor-interval-30s) notify interval=0s timeout=20s (ovndb_servers-notify-interval-0s) promote interval=0s timeout=50s (ovndb_servers-promote-interval-0s) start interval=0s timeout=200s (ovndb_servers-start-interval-0s) stop interval=0s timeout=200s (ovndb_servers-stop-interval-0s) ============================================ ovn-dbs-bundle has as master_ip 172.17.1.103 ============================================ But the container has bringed up the service into 172.17.1.108: [root@controller-0 ~]# netstat -ntapu |grep 6641 tcp 0 0 172.17.1.108:6641 0.0.0.0:* LISTEN 789094/ovsdb-server [root@controller-0 ~]# netstat -ntapu |grep 6642 tcp 0 0 172.17.1.108:6642 0.0.0.0:* LISTEN 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.49:43482 ESTABLISHED 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.49:43478 ESTABLISHED 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.49:43476 ESTABLISHED 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.49:43480 ESTABLISHED 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.19:38716 ESTABLISHED 789104/ovsdb-server tcp 0 0 172.17.1.108:6642 172.17.1.19:38714 ESTABLISHED 789104/ovsdb-server tcp 0 0 
172.17.1.108:6642 172.17.1.19:38710 ESTABLISHED 789104/ovsdb-server $ tcp 0 0 172.17.1.108:6642 172.17.1.19:38712 ESTABLISHED 789104/ovsdb-serve [root@controller-0 ~]# sudo podman exec -it ovn-dbs-bundle-podman-0 bash ()[root@controller-0 /]# ps -aef|grep ovsdb root 141 1 0 Feb26 ? 00:00:00 ovsdb-server: monitoring pid 142 (healthy) root 142 141 0 Feb26 ? 00:00:06 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvs witch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:O VN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --s sl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:172.17.1.108 --sync-from=tcp:192.0.2.254:6641 /etc/openvswitch/ovnnb_db.db root 151 1 0 Feb26 ? 00:00:00 ovsdb-server: monitoring pid 152 (healthy) root 152 151 0 Feb26 ? 00:00:11 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvs witch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=db:O VN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --s sl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers --remote=ptcp:6642:172.17.1.108 --sync-from=tcp:192.0.2.254:6642 /etc/openvswitch/ovnsb_db.db root 283614 283550 0 09:52 pts/0 00:00:00 grep --color=auto ovsdb There is an environment with the issue reproduced for debugging. Version-Release number of selected component (if applicable): ()[root@controller-0 /]# sudo rpm -qa | grep ovn rhosp-openvswitch-ovn-common-2.11-0.5.el8ost.noarch rhosp-openvswitch-ovn-central-2.11-0.5.el8ost.noarch puppet-ovn-15.4.1-0.20191014133046.192ac4e.el8ost.noarch ovn2.11-2.11.1-24.el8fdp.x86_64 ovn2.11-central-2.11.1-24.el8fdp.x86_64 ()[root@controller-0 /]# sudo rpm -qa | grep pcs pcs-0.10.2-4.el8.x86_64 ()[root@controller-0 /]# sudo rpm -qa | grep pacemaker pacemaker-libs-2.0.2-3.el8.x86_64 pacemaker-schemas-2.0.2-3.el8.noarch pacemaker-cli-2.0.2-3.el8.x86_64 pacemaker-2.0.2-3.el8.x86_64 puppet-pacemaker-0.8.1-0.20200203145608.83d23b3.el8ost.noarch pacemaker-cluster-libs-2.0.2-3.el8.x86_64 pacemaker-remote-2.0.2-3.el8.x86_64 How reproducible: Steps to Reproduce: 1. Deploy OSP15 latest and upgrade the Undercloud to OSP16 2. Run overcloud upgrade prepare 3. Run overcloud upgrade run Controllers Actual results: Network unavailable after the upgrade of the controllers Expected results: Upgrade of the controllers succeeds and the neutron server is available. Additional info:
There seems to be a change that landed in Train which creates a dedicated VIP for OVN DBS: https://github.com/openstack/tripleo-heat-templates/commit/c2d481684063af5a23fa922f028b383ecf81a3f4

This change will probably imply adding some upgrade_tasks to the ovn-dbs pacemaker template service to deal with the change: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/ovn/ovn-dbs-pacemaker-puppet.yaml#L353
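As a reference, the effect of that change can be inspected on an upgraded controller with a few pacemaker queries (a sketch, not part of the original report; resource names follow the default TripleO naming, where VIPs are created as ip-<address> resources):

  # List the pacemaker-managed VIPs.
  sudo pcs status | grep ' ip-'

  # Show which master_ip ovndb_servers is configured with after the upgrade.
  sudo pcs resource config ovn-dbs-bundle | grep master_ip

  # Show the constraints tying the OVN DBS VIP to the bundle.
  sudo pcs constraint --full | grep -i ovn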
So we had a look with Luca since yesterday and we think the problem is the following:

- At the end of the controller upgrade, there is a deploy task that runs puppet code to reassess the state of the ovn-dbs-bundle resource (it is run in the ovn_dbs_init_bundle container).
- The puppet code correctly creates the new VIP and all of its associated location and ordering constraints.
- The ovndb_servers pacemaker resource is reconfigured to listen on the new VIP (the "master_ip" attribute is updated in the resource config).
- All resource replicas that are marked as Slaves are stopped and then restarted. However, the Master replica is only demoted and re-promoted.
- In the OVN resource agent, a demotion is not sufficient to stop the ovndb_servers process, so the new VIP is never picked up.

It is not clear yet whether this is expected pacemaker behaviour, but in any case, forcing a restart of the resource with "pcs resource restart" is enough to restart all OVN processes and make them pick up the new config (see the sketch below).
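A minimal sketch of that workaround on a controller (the restart command is the one mentioned above; the verification commands are illustrative and assume the default ports 6641/6642):

  # Force a full restart of the OVN DBs bundle so ovsdb-server rebinds to the
  # VIP currently configured as master_ip.
  sudo pcs resource restart ovn-dbs-bundle

  # Verify that the NB/SB databases now listen on the master_ip reported by pacemaker.
  sudo pcs resource config ovn-dbs-bundle | grep master_ip
  sudo ss -ntlp | grep -E ':(6641|6642)'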
Verified on a local environment with tht package:

(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-0.20200324120625.c3a8eb4.el8ost.noarch

2020-04-06 12:24:46 | TASK [Restart ovn-dbs service (pacemaker)] *************************************
2020-04-06 12:24:46 | Monday 06 April 2020 12:23:35 +0000 (0:00:02.278) 0:00:10.444 **********
2020-04-06 12:24:46 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:46 | changed: [controller-0] => {"changed": true, "out": "ovn-dbs-bundle successfully restarted\n", "rc": 0}
....
2020-04-06 12:24:53 | TASK [include_tasks] ***********************************************************
2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.485) 0:01:27.925 **********
2020-04-06 12:24:53 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 | skipping: [controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | PLAY RECAP *********************************************************************
2020-04-06 12:24:53 | controller-0 : ok=13 changed=4 unreachable=0 failed=0 skipped=35 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-1 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
2020-04-06 12:24:53 | controller-2 : ok=12 changed=3 unreachable=0 failed=0 skipped=36 rescued=0 ignored=0
2020-04-06 12:24:53 |
2020-04-06 12:24:53 | Monday 06 April 2020 12:24:53 +0000 (0:00:00.358) 0:01:28.284 **********
2020-04-06 12:24:53 | ===============================================================================
2020-04-06 12:24:54 |
2020-04-06 12:24:54 | Updated nodes - Controller
2020-04-06 12:24:54 | Success
2020-04-06 12:24:54 | 2020-04-06 12:24:54.545 661020 INFO tripleoclient.v1.overcloud_upgrade.MajorUpgradeRun [-] Completed Overcloud Upgrade Run for Controller with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
2020-04-06 12:24:54 | 2020-04-06 12:24:54.546 661020 INFO osc_lib.shell [-] END return value: None
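For completeness, a hedged post-verification sketch (not taken from the output above; assumes the usual overcloudrc location on the undercloud) to confirm the neutron API is usable again after the controllers are upgraded:

  # Source the overcloud credentials, then create and delete a throwaway network.
  source ~/overcloudrc
  openstack network create verify-ovn-vip-net
  openstack network delete verify-ovn-vip-net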
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114