Bug 1860347
| Summary: | OVN DBs not started after controller reboot - pacemaker timeout | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eduardo Olivares <eolivare> |
| Component: | puppet-pacemaker | Assignee: | RHOS Maint <rhos-maint> |
| Status: | CLOSED NOTABUG | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.1 (Train) | CC: | jjoyce, jschluet, lmiccini, michele, oblaut, slinaber, tvignaud |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-15 14:27:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Eduardo Olivares 2020-07-24 10:54:40 UTC
I can't reproduce this behavior by rebooting the OVN master controller following the provided steps on RHOS-16.1-RHEL-8-20200723.n.0. I'll check the logs of your job and see if I spot anything.

```
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/openvswitch ]]
+++ stat -c %a /var/log/kolla/openvswitch
Running command: '/usr/sbin/pacemaker_remoted'
++ [[ 755 != \7\5\5 ]]
+ echo 'Running command: '\''/usr/sbin/pacemaker_remoted'\'''
+ exec /usr/sbin/pacemaker_remoted
(crm_add_logfile) error: Directory '/var/log/pacemaker' does not exist: logging to '/var/log/pacemaker/pacemaker.log' is disabled
(crm_log_init) info: Changed active directory to /var/lib/pacemaker/cores
(main) notice: Starting Pacemaker remote executor
(qb_ipcs_us_publish) info: server name: lrmd
(pcmk__init_tls_dh) info: Generating Diffie-Hellman parameters with 2048-bit prime for TLS
(qb_ipcs_us_publish) info: server name: cib_ro
(qb_ipcs_us_publish) info: server name: cib_rw
(qb_ipcs_us_publish) info: server name: cib_shm
(qb_ipcs_us_publish) info: server name: attrd
(qb_ipcs_us_publish) info: server name: stonith-ng
(qb_ipcs_us_publish) info: server name: crmd
(main) notice: Pacemaker remote executor successfully started and accepting connections
(crm_remote_accept) info: New remote connection from ::ffff:172.17.1.40
(lrmd_remote_listen) info: Remote client pending authentication | 0x55cf2e2f4a00 id: 83392302-2a18-474f-8446-c07479239897
(remoted__read_handshake_data) notice: Remote client connection accepted
(process_lrmd_get_rsc_info) info: Agent information for 'ovndb_servers' not in cache
(process_lrmd_get_rsc_info) info: Agent information for 'ovndb_servers:0' not in cache
(process_lrmd_rsc_register) info: Cached agent information for 'ovndb_servers'
(log_execute) info: executing - rsc:ovndb_servers action:start call_id:8
(child_timeout_callback) warning: ovndb_servers_start_0 process (PID 60) timed out
(operation_finished) warning: ovndb_servers_start_0:60 - timed out after 200000ms
2020-07-23T17:06:20.715908867+00:00 stderr F (operation_finished) notice: ovndb_servers_start_0:60:stderr [ 2020-07-23T17:03:02Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl ]
2020-07-23T17:06:20.715923391+00:00 stderr F (operation_finished) notice: ovndb_servers_start_0:60:stderr [ ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (Connection refused) ]
```

```
Jul 23 17:02:58 controller-0 podman[7585]: 2020-07-23 17:02:58.301150071 +0000 UTC m=+1.089836261 container exec 6106274030d44fa9fd2f14e63c70dd6505b4de4b8fc26890dde32ec8ad29d5f1 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:16.1_20200714.1, name=ovn-dbs-bundle-podman-0)
Jul 23 17:02:58 controller-0 pacemaker-controld[2843]: notice: Result of monitor operation for ovn-dbs-bundle-podman-0 on controller-0: 0 (ok)
Jul 23 17:03:00 controller-0 pacemaker-controld[2843]: notice: Result of monitor operation for ovn-dbs-bundle-0 on controller-0: 0 (ok)
Jul 23 17:03:00 controller-0 pacemaker-controld[2843]: notice: Result of probe operation for ovndb_servers on ovn-dbs-bundle-0: 7 (not running)
Jul 23 17:06:21 controller-0 pacemaker-controld[2843]: error: Result of start operation for ovndb_servers on ovn-dbs-bundle-0: Timed Out
Jul 23 17:06:21 controller-0 pacemaker-controld[2843]: notice: ovn-dbs-bundle-0-ovndb_servers_start_0:8 [ 2020-07-23T17:03:02Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl\novn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (Connection refused)\n2020-07-23T17:03:03Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl\novn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (Connection refused)\n2020-07-23T17:03:03Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl\novn-appctl: cannot connect to "/var
Jul 23 17:06:21 controller-0 pacemaker-attrd[2840]: notice: Setting fail-count-ovndb_servers#start_0[ovn-dbs-bundle-0]: (unset) -> INFINITY
Jul 23 17:06:21 controller-0 pacemaker-attrd[2840]: notice: Setting last-failure-ovndb_servers#start_0[ovn-dbs-bundle-0]: (unset) -> 1595523981
```

Seems like:

1. container starts -> ok
2. pacemaker_remote starts -> ok
3. ovndb_servers tries to start -> times out

I only found:

```
Jul 23 17:03:10 controller-0 pacemaker-controld [2843] (crm_procfs_pid_of) info: Found pacemaker-based active as process 2837
Jul 23 17:03:10 controller-0 pacemaker-controld [2843] (throttle_check_thresholds) notice: High CPU load detected: 15.010000
Jul 23 17:03:10 controller-0 pacemaker-controld [2843] (throttle_send_command) info: New throttle mode: high load (was undetermined)
```

Maybe a genuine timeout because of the load? I am not sure.

I checked these three tobiko jobs that were run after the tobiko patches mentioned at comment 5:

[1] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults-sanity/27/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.1/view/PidOne/job/DFG-pidone-sanity-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults-sanity/26/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/
[3] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/35/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/

The issue reported in this bug was not observed in any of them. In my opinion, we can close this bug because it was not reproduced. We could open a new bug if the scenario is reproduced in the future.
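For reference, the failure signature above (ovn-appctl unable to reach /var/run/ovn/ovnsb_db.ctl until the 200000ms start timeout expires, after which the fail-count is set to INFINITY) can be narrowed down manually on the affected controller. The shell sketch below is not taken from this bug; it only illustrates checks that separate "ovsdb-server never came up inside the container" from "it came up too slowly for the configured start timeout". The container and resource names (ovn-dbs-bundle-podman-0, ovn-dbs-bundle, ovndb_servers) are copied from the logs above; exact pcs syntax depends on the pcs version shipped with the deployment.

```bash
# Sketch only - run on the controller that reported the timeout (controller-0 above).

# 1. Is the SB ovsdb-server process running inside the bundle container, and does
#    its control socket exist? (assumes procps/coreutils are present in the image)
podman exec ovn-dbs-bundle-podman-0 ps -ef | grep ovsdb-server
podman exec ovn-dbs-bundle-podman-0 ls -l /var/run/ovn/ovnsb_db.ctl

# 2. If the socket exists, query the server directly; "Connection refused" here
#    means the server is not listening yet.
podman exec ovn-dbs-bundle-podman-0 \
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/sync-status

# 3. Check the start timeout Pacemaker has configured for the resource and whether
#    the fail-count is still INFINITY ('pcs resource show' on older pcs releases).
pcs resource config ovn-dbs-bundle
pcs resource failcount show ovndb_servers

# 4. Once the node has settled, clear the failure so Pacemaker retries the start.
pcs resource cleanup ovndb_servers
```

If step 2 starts answering shortly after the timeout window, the start was merely slow, which would be consistent with the high-load hypothesis; if the socket never appears, the problem is more likely in the container entrypoint or the OCF agent.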
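The "genuine timeout because of the load" hypothesis can likewise be checked against recorded data rather than the single throttle message quoted above. This is only a sketch of where such data would normally live on a RHEL 8 controller; the presence of sysstat and the exact sa file name are assumptions, not something confirmed in this bug.

```bash
# Sketch only - correlate the reboot window (~17:02-17:06 on Jul 23) with recorded load.
# /var/log/sa/sa23 is the sysstat file for the 23rd and exists only if sysstat is enabled.
sar -q -f /var/log/sa/sa23 -s 17:00:00 -e 17:10:00

# Pacemaker throttle-mode transitions around the incident, from the journal.
journalctl --since "2020-07-23 17:00" --until "2020-07-23 17:10" | grep -i "throttle\|High CPU load"
```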