Bug 1801136
| Summary: | After a reboot following a successful update from GA to latest, HA services don't come back online and ovs-vswitchd has a lot of errors. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> |
| Component: | openvswitch | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 13.0 (Queens) | CC: | apevec, chrisw, rhos-maint |
| Target Milestone: | --- | Keywords: | TestOnly, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-10 11:55:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hi,
So the root cause is that the command[1] used by infrared to
generate fencing parameters isn't supported in OSP13 GA.
It's only supported from z2 onward.
We're working around the issue by adjusting the job to not use fencing
for GA and z1.
TL;DR
Running [1] returns an empty list of servers, generating this fencing.yaml file:
(undercloud) [stack@undercloud-0 ~]$ cat fencing.yaml
parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices: []
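For comparison, on z2 or later the same command fills in one stonith device per node taken from instackenv.json. A rough sketch of what the generated file looks like (all addresses and credentials below are made-up placeholders, and the exact keys can vary slightly between releases):

parameter_defaults:
  EnableFencing: true
  FencingConfig:
    devices:
    - agent: fence_ipmilan
      host_mac: aa:bb:cc:dd:ee:01
      params:
        ipaddr: 172.16.0.101
        lanplus: true
        login: admin
        passwd: example-password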
The empty device list causes this warning in the pcs status output:
[root@controller-2 ~]# pcs status
Cluster name: tripleo_cluster
WARNINGS:
No stonith devices and stonith-enabled is not false
which in turn causes pacemaker to refuse to run the cluster node,
the bundles exiting with error 255.
A simple workaround is:
pcs property set stonith-enabled=false
Then the cluster finishes its job.
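If it helps anyone hitting this, a quick way to confirm the property took effect and that the stopped bundles come back on the rebooted node (plain pcs commands, nothing OSP-specific):

[root@controller-2 ~]# pcs property show stonith-enabled
[root@controller-2 ~]# pcs status | grep -ci stopped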
Another way to look at it: running
crm_simulate -Ls
was showing an empty transition summary:
Transition Summary:
Meaning there were no unexpected errors.
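To repeat that check on a controller, just look at the end of the simulation output; any pending Start/Stop/Recover lines under the summary would mean pacemaker still has work it cannot complete (the command below is nothing more than the obvious grep):

[root@controller-2 ~]# crm_simulate -Ls | grep -A10 'Transition Summary:'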
The ovs-vswitchd errors are a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1737982 and are not critical.
[1] openstack overcloud generate fencing --ipmi-lanplus --output fencing.yaml instackenv.json
Description of problem:
Update of OSP13 GA to 2020-02-06.2. The update goes without error, but after a final reboot the cluster doesn't come back online on one of the nodes, with all its services stopped:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:pcmklatest]
  rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
  rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
  rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest]
  galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
  galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
  galera-bundle-2 (ocf::heartbeat:galera): Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest]
  redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
  redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
  redis-bundle-2 (ocf::heartbeat:redis): Stopped
ip-192.168.24.14 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-1
ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.12 (ocf::heartbeat:IPaddr2): Started controller-1
ip-172.17.4.13 (ocf::heartbeat:IPaddr2): Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rh-osbs/rhosp13-openstack-haproxy:pcmklatest]
  haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
  haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
  haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped
Docker container: openstack-cinder-volume [192.168.24.1:8787/rh-osbs/rhosp13-openstack-cinder-volume:pcmklatest]
  openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-0
Docker container: openstack-cinder-backup [192.168.24.1:8787/rh-osbs/rhosp13-openstack-cinder-backup:pcmklatest]
  openstack-cinder-backup-docker-0 (ocf::heartbeat:docker): Started controller-1

If we take rabbitmq as an example, we can see a very strange problem where it cannot connect to itself. docker logs rabbitmq-bundle-docker-2:

notice: operation_finished: rabbitmq_start_0:4034:stderr [ Error: unable to connect to node 'rabbit@controller-2': nodedown ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ =========== ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ attempted to contact: ['rabbit@controller-2'] ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ rabbit@controller-2: ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ * connected to epmd (port 4369) on controller-2 ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ no other nodes on controller-2 ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ current node details: ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ - node name: 'rabbitmq-cli-16@controller-2' ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ - cookie hash: 8tjj2n8/z3cXpZelPHOUTA== ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ ]
notice: operation_finished: rabbitmq_start_0:4034:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]

We can see that ovs-vswitchd has a lot of issues on that node. At the time of the reboot:

grep 08:0[6789]: /var/log/openvswitch/ovs-vswitchd.log | less
2020-02-10T08:07:53.031Z|00001|netdev_tc_offloads(revalidator57)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:53.036Z|00002|netdev_tc_offloads(revalidator57)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:54.049Z|00001|netdev_tc_offloads(revalidator92)|ERR|Dropped 31 log messages in last 1 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:54.049Z|00002|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:55.164Z|00003|netdev_tc_offloads(revalidator92)|ERR|Dropped 6 log messages in last 1 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:55.164Z|00004|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported
2020-02-10T08:07:56.173Z|00005|netdev_tc_offloads(revalidator92)|ERR|Dropped 1 log messages in last 0 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-10T08:07:56.173Z|00006|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for tap33f2d77c-99: Operation not supported

On the running node, constantly:

Feb 10 10:16:38 controller-2 ovs-vswitchd[2124]: ovs|15450|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for qr-0274bb9a-c5: Operation not supported
Feb 10 10:16:39 controller-2 ovs-vswitchd[2124]: ovs|15451|netdev_tc_offloads(revalidator92)|ERR|Dropped 7 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
Feb 10 10:16:39 controller-2 ovs-vswitchd[2124]: ovs|15452|netdev_tc_offloads(revalidator92)|ERR|dump_create: failed to get ifindex for qr-0274bb9a-c5: Operation not supported

Version-Release number of selected component (if applicable):

[root@controller-2 ~]# yum list installed | grep openvswitch
openstack-neutron-openvswitch.noarch
openvswitch-selinux-extra-policy.noarch
openvswitch2.11.x86_64           2.11.0-35.el7fdp   @rhelosp-13.0-puddle
python-openvswitch2.11.x86_64    2.11.0-35.el7fdp   @rhelosp-13.0-puddle
python-rhosp-openvswitch.noarch  2.11-0.6.el7ost    @rhelosp-13.0-puddle
rhosp-openvswitch.noarch         2.11-0.6.el7ost    @rhelosp-13.0-puddle

cf3671052f56 192.168.24.1:8787/rh-osbs/rhosp13-openstack-haproxy:pcmklatest "dumb-init --singl..." 5 hours ago Exited (255) 2 hours ago haproxy-bundle-docker-2
6d5c62a52074 192.168.24.1:8787/rh-osbs/rhosp13-openstack-redis:pcmklatest "dumb-init --singl..." 5 hours ago Exited (255) 2 hours ago redis-bundle-docker-2
d8f6d47ebe52 192.168.24.1:8787/rh-osbs/rhosp13-openstack-mariadb:pcmklatest "dumb-init -- /bin..." 5 hours ago Exited (255) 2 hours ago galera-bundle-docker-2
3cc2a31d01ab 192.168.24.1:8787/rh-osbs/rhosp13-openstack-rabbitmq:pcmklatest "dumb-init --singl..." 5 hours ago Exited (255) 2 hours ago rabbitmq-bundle-docker-2
04333a3ebf4a 192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-openvswitch-agent:20200205.1 "dumb-init --singl..." 7 hours ago Up 2 hours (healthy) neutron_ovs_agent
f522e4dead85 192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-l3-agent:20200205.1 "dumb-init --singl..." 7 hours ago Up 2 hours (healthy) neutron_l3_agent
0951f3147a5e 192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-metadata-agent:20200205.1 "dumb-init --singl..." 7 hours ago Up 2 hours (healthy) neutron_metadata_agent
690c9f37a0e3 192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-dhcp-agent:20200205.1 "dumb-init --singl..." 7 hours ago Up 2 hours (healthy) neutron_dhcp

I put the component on openvswitch as ovs-vswitchd seems to be the root cause.
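As a side note, a quick way to spot the symptom on the affected controller (plain docker/pcs commands; just a convenience, not part of the fix):

[root@controller-2 ~]# docker ps -a --filter name=bundle --format '{{.Names}}\t{{.Status}}'
[root@controller-2 ~]# pcs status | grep -A2 -i warnings

The first command lists the pacemaker bundles that exited with 255; the second shows the stonith warning quoted above.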