Description of problem:
After successful deployment of engine metrics, nothing is being sent from fluentd. Tried with a custom metrics store as well as the ViaQ setup. I am not sure which logs you would like to see, as neither these nor syslog shows any hint of an error. I have the environment and can provide anything you need.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.0.4.3-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create config.yml with your metrics store
2. Run /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh
3. Check that fluentd and collectd are running successfully

Actual results:
No incoming packets to the metrics store:

# tcpdump -n dst port 24284
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

Additional info:
[root@ls-engine1 ~]# date && systemctl status collectd fluentd
Fri Jun 23 12:36:50 CEST 2017
● collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/collectd.service.d
           └─postgresql.conf
   Active: active (running) since Fri 2017-06-23 11:15:11 CEST; 1h 21min ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 709 (collectd)
   CGroup: /system.slice/collectd.service
           └─709 /usr/sbin/collectd

Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "swap" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "df" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "aggregation" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "processes" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "postgresql" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "write_http" successfully loaded.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Systemd detected, trying to signal readyness.
Jun 23 11:15:11 ls-engine1.example.com systemd[1]: Started Collectd statistics daemon.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Initialization complete, entering read-loop.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Successfully connected to database engine (user engine) at server localhost:5432 (server version: 9.2.18, protocol version: 3, pid: 739)

● fluentd.service - Fluentd
   Loaded: loaded (/usr/lib/systemd/system/fluentd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2017-06-23 11:14:58 CEST; 1h 21min ago
     Docs: http://www.fluentd.org/
 Main PID: 649 (fluentd)
   CGroup: /system.slice/fluentd.service
           ├─649 /usr/bin/ruby /usr/bin/fluentd -c /etc/fluentd/fluent.conf
           └─710 /usr/bin/ruby /usr/bin/fluentd -c /etc/fluentd/fluent.conf

Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: </match>
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: </ROOT>
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:09 +0200 [debug]: listening http on localhost:9880
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:09 +0200 [info]: following tail of /var/log/ovirt-engine/engine.log
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 [warn]: dead connection found: lsvaty-vm1.example.com, reconnecting...
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 fluent.warn: {"message":"dead connection found: lsvaty-vm1.example.com, reconnecting..."}
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 [info]: connection established to lsvaty-vm1.example.com
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 fluent.info: {"message":"connection established to lsvaty-vm1.example.com"}
Jun 23 11:15:19 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:19 +0200 [warn]: recovered connection to dead node: lsvaty-vm1.example.com
Jun 23 11:15:19 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:19 +0200 fluent.warn: {"message":"recovered connection to dead node: lsvaty-vm1.example.com"}
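Since both collectd and fluentd report as running while nothing arrives at the metrics store, a few connectivity checks from the engine side can narrow down where records stop. This is only a sketch: the hostname lsvaty-vm1.example.com and port 24284 are taken from the logs above, and the unit name assumes the fluentd service shown here; adjust for your environment.

# getent hosts lsvaty-vm1.example.com
# nc -vz lsvaty-vm1.example.com 24284
# tcpdump -n -i eth0 dst port 24284
# journalctl -u fluentd --since "1 hour ago" | grep -iE "warn|error|buffer"

If name resolution or the TCP connect fails, the forwarding output cannot deliver and records simply accumulate in the local fluentd buffer, which would match the "dead connection found" warnings above.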
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
These errors occur when fluentd fails to connect to the remote fluentd. I don't believe this is a blocker for the other bug. Rich should check why the remote fluentd is reporting these errors: "2017-06-26 11:15:48 +0200 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error=\"queue size exceeds limit\" tag=\"project.ovirt-metrics-lsvaty_test-ovirt\""
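For context, Fluent::BufferQueueLimitError is raised when an output plugin's buffer queue on that fluentd reaches its buffer_queue_limit, which typically means its own downstream (Elasticsearch in the ViaQ case) is unreachable or not keeping up, so new emits start being rejected. A rough way to confirm this on the remote fluentd, as a sketch only; the config and buffer paths below are assumptions, use whatever that node's fluentd configuration actually points at:

# journalctl -u fluentd -n 200 | grep -iE "buffer|retry|error"
# grep -riE "buffer_queue_limit|buffer_chunk_limit|buffer_path" /etc/fluentd/
# ls -l /var/lib/fluentd/

A growing pile of buffer chunks on disk plus repeated retry warnings would point at the downstream store rather than at the engine-side sender.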
So in the end there were two errors, at least that I am aware of:
1. Misconfiguration of the ViaQ setup: not all hostnames were resolvable from all the machines.
2. Misconfiguration of the non-ViaQ setup: fluentd was not able to establish a connection due to an outdated certificate.
Due to these, closing this issue.
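For anyone hitting the same symptoms, both root causes can be verified up front. This is a sketch: the certificate path is a placeholder, and the s_client check only applies if the collector terminates TLS on that port.

# getent hosts ls-engine1.example.com lsvaty-vm1.example.com
# openssl x509 -noout -dates -in <path-to-the-fluentd-certificate>
# echo | openssl s_client -connect lsvaty-vm1.example.com:24284 2>/dev/null | openssl x509 -noout -dates

Name resolution has to work on every machine involved, and the certificate used for the fluentd connection must not be expired.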