Bug 1801136

Summary: After a reboot following a successful update from GA to latest, HA services don't come back online and ovs-vswitchd logs a lot of errors.
Product: Red Hat OpenStack
Component: openvswitch
Version: 13.0 (Queens)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Sofer Athlan-Guyot <sathlang>
Assignee: Sofer Athlan-Guyot <sathlang>
QA Contact: Eran Kuris <ekuris>
CC: apevec, chrisw, rhos-maint
Keywords: TestOnly, Triaged
Last Closed: 2020-02-10 11:55:19 UTC
Type: Bug

Description Sofer Athlan-Guyot 2020-02-10 10:42:50 UTC
Description of problem:
Update of OSP13 GA to 2020-02-06.2.  The update goes without error.  But after a final reboot the cluster doesn't come back online on one of the nodes, with all its services stopped:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.14       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.12 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.13 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rh-osbs/rhosp13-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-0
 Docker container: openstack-cinder-backup [192.168.24.1:8787/rh-osbs/rhosp13-openstack-cinder-backup:pcmklatest]
   openstack-cinder-backup-docker-0     (ocf::heartbeat:docker):        Started controller-1


If we take rabbitmq as an example, we can see a very strange problem where it cannot connect to itself:

docker logs rabbitmq-bundle-docker-2:

 notice: operation_finished:   rabbitmq_start_0:4034:stderr [ Error: unable to connect to node 'rabbit@controller-2': nodedown ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [  ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ DIAGNOSTICS ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ =========== ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [  ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ attempted to contact: ['rabbit@controller-2'] ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [  ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ rabbit@controller-2: ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [   * connected to epmd (port 4369) on controller-2 ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [   * epmd reports: node 'rabbit' not running at all ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [                   no other nodes on controller-2 ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [   * suggestion: start the node ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [  ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ current node details: ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ - node name: 'rabbitmq-cli-16@controller-2' ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ - home dir: /var/lib/rabbitmq ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ - cookie hash: 8tjj2n8/z3cXpZelPHOUTA== ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [  ]
  notice: operation_finished:   rabbitmq_start_0:4034:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]


We can see that ovs-vswitchd has a lot of issues on that node:

At the time of the reboot: grep 08:0[6789]: /var/log/openvswitch/ovs-vswitchd.log | less


2020-02-10T08:07:53.031Z|00001|netdev_tc_offloads(revalidator57)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:53.036Z|00002|netdev_tc_offloads(revalidator57)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported

2020-02-10T08:07:54.049Z|00001|netdev_tc_offloads(revalidator92)|ERR|Dropped 31 log messages in last 1 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:54.049Z|00002|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:55.164Z|00003|netdev_tc_offloads(revalidator92)|ERR|Dropped 6 log messages in last 1 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:55.164Z|00004|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:56.173Z|00005|netdev_tc_offloads(revalidator92)|ERR|Dropped 1 log messages in last 0 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:56.173Z|00006|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported

On the running node, constantly:

Feb 10 10:16:38 controller-2 ovs-vswitchd[2124]: ovs|15450|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for qr-0274bb9a-c5: Operation not supported
Feb 10 10:16:39 controller-2 ovs-vswitchd[2124]: ovs|15451|netdev_tc_offloads(revalidator92)|ERR|Dropped 7 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
Feb 10 10:16:39 controller-2 ovs-vswitchd[2124]: ovs|15452|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for qr-0274bb9a-c5: Operation not supported
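
These netdev_tc_offloads messages come from the revalidator threads probing TC flower hardware offload while dumping flows. As a sanity check (my assumption: hw-offload is not set in this environment, in which case the messages are cosmetic, see the dup linked in comment 3), the offload setting can be read from the Open_vSwitch table:

[root@controller-2 ~]# ovs-vsctl get Open_vSwitch . other_config

If the output doesn't contain hw-offload="true", hardware offload is disabled and these errors shouldn't affect the datapath.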


Version-Release number of selected component (if applicable):

[root@controller-2 ~]# yum list installed | grep openvswitch
openstack-neutron-openvswitch.noarch
openvswitch-selinux-extra-policy.noarch
openvswitch2.11.x86_64              2.11.0-35.el7fdp   @rhelosp-13.0-puddle     
python-openvswitch2.11.x86_64       2.11.0-35.el7fdp   @rhelosp-13.0-puddle     
python-rhosp-openvswitch.noarch     2.11-0.6.el7ost    @rhelosp-13.0-puddle     
rhosp-openvswitch.noarch            2.11-0.6.el7ost    @rhelosp-13.0-puddle    

cf3671052f56        192.168.24.1:8787/rh-osbs/rhosp13-openstack-haproxy:pcmklatest                     "dumb-init --singl..."   5 hours ago         Exited (255) 2 hours ago                       haproxy-bundle-docker-2
6d5c62a52074        192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest                       "dumb-init --singl..."   5 hours ago         Exited (255) 2 hours ago                       redis-bundle-docker-2
d8f6d47ebe52        192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest                     "dumb-init -- /bin..."   5 hours ago         Exited (255) 2 hours ago                       galera-bundle-docker-2
3cc2a31d01ab        192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:pcmklatest                    "dumb-init --singl..."   5 hours ago         Exited (255) 2 hours ago                       rabbitmq-bundle-docker-2


04333a3ebf4a        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-openvswitch-agent:20200205.1   "dumb-init --singl..."   7 hours ago         Up 2 hours (healthy)                           neutron_ovs_agent
f522e4dead85        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-l3-agent:20200205.1            "dumb-init --singl..."   7 hours ago         Up 2 hours (healthy)                           neutron_l3_agent
0951f3147a5e        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-metadata-agent:20200205.1      "dumb-init --singl..."   7 hours ago         Up 2 hours (healthy)                           neutron_metadata_agent
690c9f37a0e3        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-dhcp-agent:20200205.1          "dumb-init --singl..."   7 hours ago         Up 2 hours (healthy)                           neutron_dhcp

I set the component to openvswitch as ovs-vswitchd seems to be the root cause.

Comment 3 Sofer Athlan-Guyot 2020-02-10 11:55:19 UTC
Hi,

So the root cause is that the command [1] used by infrared to
generate fencing parameters isn't supported in OSP13 GA.

It's supported from z2.

We're working around the issue by adjusting the job to not use fencing
for GA and z1.

TL;DR

Running [1] returns an empty list of servers, generating this fencing.yaml file:

(undercloud) [stack@undercloud-0 ~]$ cat fencing.yaml 
parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices: []
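
For contrast, on a z2 or later undercloud the same command [1] should produce one device entry per controller; the file would look roughly like this (illustrative only, field names follow the usual TripleO FencingConfig shape and all values are placeholders):

parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices:
    - agent: fence_ipmilan
      host_mac: "52:54:00:aa:bb:cc"
      params:
        ipaddr: 172.16.0.10
        lanplus: true
        login: admin
        passwd: <redacted>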

The empty devices list causes this warning in the pcs status output:

[root@controller-2 ~]# pcs status
Cluster name: tripleo_cluster

WARNINGS:
No stonith devices and stonith-enabled is not false

which in turn causes pacemaker to refuse to run the cluster node,
exiting with error 255.

A simple workaround is:

pcs property set stonith-enabled=false

Then the cluster finishes its jobs.
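
Once the environment is on z2 or later and [1] produces real device entries, stonith can be checked and turned back on with the usual pcs commands (a sketch, assuming a working stonith device has been configured first):

[root@controller-2 ~]# pcs stonith
[root@controller-2 ~]# pcs property show stonith-enabled
[root@controller-2 ~]# pcs property set stonith-enabled=true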

Another way to look at it is that:

crm_simulate -Ls

was showing an empty transition summary:

Transition Summary:

Meaning there were no unexpected errors.

The ovs-vswitchd errors are a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1737982 and are not critical.

[1] openstack overcloud generate fencing --ipmi-lanplus --output fencing.yaml instackenv.json
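
For completeness, once [1] works (z2+), the generated file is normally fed back to the overcloud as an extra environment file, something like (illustrative, the job's real deploy arguments are omitted):

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy --templates \
    -e /home/stack/fencing.yaml \
    <existing environment files>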