Bug 1486037
Summary: | rhosp-director: composable roles deployment gets stuck due to clustercheck container not being in the database role | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
Component: | openstack-tripleo-heat-templates | Assignee: | Michele Baldessari <michele> |
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 12.0 (Pike) | CC: | aherr, aschultz, bperkins, chjones, dbecker, dprince, jjoyce, jschluet, mburns, mcornea, michele, morazi, rhel-osp-director-maint, tvignaud |
Target Milestone: | beta | Keywords: | TestBlocker, Triaged |
Target Release: | 12.0 (Pike) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-12-13 21:58:11 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Bug Depends On: | |||
Bug Blocks: | 1482087 |
Description
Alexander Chuzhoy
2017-08-28 19:57:57 UTC
I was able to debug this environment a bit today w/ Sasha. It appears that database syncs are failing. It looks to me that MySQL is running: I was able to attach to MySQL via localhost (mysql -u root) and verify that all of the databases are getting created. But some of the db syncs for services are timing out. I manually tried to connect and got this on the command line:

mysql -u heat -h 172.17.1.13 -p3UazsaeTC64V9UvEcJ3GZ9rbd
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0

I double-checked the HAProxy config and I see this is the correct VIP for MySQL. This could be related to firewall rules, given that this deployment has the controller and database servers split out.

On the controller I see:

[root@controller-0 containers]# iptables-save | grep 3306
-A INPUT -p tcp -m multiport --dports 3306 -m state --state NEW -m comment --comment "100 mysql_haproxy ipv4" -j ACCEPT

[root@overcloud-database-0 ~]# iptables-save | grep 3306
-A INPUT -p tcp -m multiport --dports 873,3123,3306,4444,4567,4568,9200 -m state --state NEW -m comment --comment "104 mysql galera-bundle ipv4" -j ACCEPT

It would seem that both services are accepting 3306 traffic. It would be good to have someone from the HA team review these configs and see if they line up correctly.

I am travelling so am a bit slow to respond, but do we have sosreports around for this or a live env? From Dan's initial analysis at c#1 and the error messages, I would guess that haproxy is having some sort of issue (maybe the bundle is constantly restarting or whatnot). Sasha, if you can send me some env login or some sosreports I can investigate a bit more.

(NB: I deploy composable HA on a daily basis with galera/rabbit split out to 6 separate nodes, so my best guess without more data would be that it is due to the fact that we do not yet have a new pacemaker build with all the needed bundle fixes, but I'd like to take a deeper look in any case.)

Thanks Sasha!
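ERROR 2013 at 'reading initial communication packet' means the TCP connection was established but the server greeting, which a MySQL server sends first, never arrived. A firewall drop or reject would typically show up earlier, as a timeout or a refused connection. As a rough aid for telling these cases apart, the Python sketch below classifies an endpoint by what happens before any login is attempted. The function name `probe_mysql_endpoint` and its return labels are made up for this illustration and are not part of TripleO, HAProxy or MySQL.

```python
import socket


def probe_mysql_endpoint(host, port=3306, timeout=3.0):
    """Classify how a TCP endpoint behaves before any MySQL login.

    Returns one of (hypothetical labels):
      "refused"     - nothing is listening, or the connection was rejected
      "unreachable" - connect timed out or failed for another reason
      "closed"      - connect succeeded, peer closed before sending a byte
      "silent"      - connect succeeded, nothing arrived within the timeout
      "greeting"    - connect succeeded and the peer sent data first
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Includes connect timeouts and "no route to host".
        return "unreachable"

    with sock:
        try:
            # A MySQL server speaks first, so one byte is enough to tell.
            data = sock.recv(1)
        except socket.timeout:
            return "silent"
        except OSError:
            # For example a connection reset by the peer.
            return "closed"

    return "greeting" if data else "closed"
```

One way to use it with the addresses from this report would be to compare `probe_mysql_endpoint("172.17.1.13")` (the VIP) with `probe_mysql_endpoint("172.17.1.22")` (a Galera node). A "closed" result on the VIP together with "greeting" on the node would point at the proxy rather than at the firewall.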
So the issue is that the clustercheck container is erroring out when talking to mysql and hence haproxy will refuse to accept connections on 3306 because all three backends are down.

A) Cluster check not working

[root@controller-2 log]# docker exec -it clustercheck /bin/bash
()[mysql@controller-2 /]$ mysql -h 127.0.0.1 -u clustercheck -pdrwh87rmM8KzWxyGcJWZ2TbGC
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)

B) Haproxy refusing connections

[root@controller-2 log]# mysql -u heat -h 172.17.1.13 -p3UazsaeTC64V9UvEcJ3GZ9rbd
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0

C) Connections straight to mysql work correctly:

[root@controller-2 log]# mysql -u heat -h 172.17.1.22 -p3UazsaeTC64V9UvEcJ3GZ9rbd
MariaDB [(none)]> Bye
[root@controller-2 log]# mysql -u heat -h 172.17.1.21 -p3UazsaeTC64V9UvEcJ3GZ9rbd
MariaDB [(none)]> Bye
[root@controller-2 log]# mysql -u heat -h 172.17.1.16 -p3UazsaeTC64V9UvEcJ3GZ9rbd
MariaDB [(none)]> Bye

The reason this is not working in this environment is that the clustercheck container needs to be always deployed on the database role. I will make sure that upstream this will be fixed. But for the time being you can just add OS::TripleO::Services::Clustercheck to your database role and remove it from the ControllerOpenstack role

upstream

Hi Michele,
Since the roles_data.yaml was prepared with "openstack overcloud roles generate" - I added this comment https://bugzilla.redhat.com/show_bug.cgi?id=1485108#c9 in the respective bug.
Thanks.

Confirm that I was able to deploy successfully, once I moved the "OS::TripleO::Services::Clustercheck" to database role from controller.
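The workaround above amounts to keeping Clustercheck on whichever role runs MySQL. A minimal sketch of that check in Python follows, assuming the roles file has already been parsed into a list of dicts with the `name` and `ServicesDefault` keys used in roles_data.yaml. The helper name `misplaced_clustercheck` and the trimmed role lists are hypothetical and exist only for this illustration; they are not part of tripleo-heat-templates.

```python
CLUSTERCHECK = "OS::TripleO::Services::Clustercheck"
MYSQL = "OS::TripleO::Services::MySQL"


def misplaced_clustercheck(roles):
    """Return names of roles that carry only one of MySQL / Clustercheck.

    `roles` is a list of dicts shaped like entries in roles_data.yaml.
    A role is flagged when it has Clustercheck without MySQL, or MySQL
    without Clustercheck.
    """
    problems = []
    for role in roles:
        services = set(role.get("ServicesDefault", []))
        if (CLUSTERCHECK in services) != (MYSQL in services):
            problems.append(role["name"])
    return problems


# Trimmed, illustrative role lists (not a complete roles_data.yaml).
broken = [
    {"name": "ControllerOpenstack",
     "ServicesDefault": [CLUSTERCHECK, "OS::TripleO::Services::MySQLClient"]},
    {"name": "Database",
     "ServicesDefault": [MYSQL, "OS::TripleO::Services::Pacemaker"]},
]

fixed = [
    {"name": "ControllerOpenstack",
     "ServicesDefault": ["OS::TripleO::Services::MySQLClient"]},
    {"name": "Database",
     "ServicesDefault": [CLUSTERCHECK, MYSQL,
                         "OS::TripleO::Services::Pacemaker"]},
]

print(misplaced_clustercheck(broken))
print(misplaced_clustercheck(fixed))
```

With the layout that triggered this bug, both the controller role and the database role are flagged; after moving the service, the list comes back empty.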
pike review merged, moving to POST and linking the right review

Verified:

Environment: openstack-tripleo-heat-templates-7.0.1-0.20170919183703.el7ost.noarch

Clustercheck is added by default to the database role:

###############################################################################
# Role: Database                                                              #
###############################################################################
- name: Database
  description: |
    Standalone database role with the database being managed via Pacemaker
  networks:
    - InternalApi
  HostnameFormatDefault: '%stackname%-database-%index%'
  ServicesDefault:
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CertmongerUser
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::Clustercheck
    - OS::TripleO::Services::Docker
    - OS::TripleO::Services::FluentdClient
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::MySQL
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::Ntp
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Pacemaker
    - OS::TripleO::Services::SensuClient
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462