Bug 1464394

Summary: Metrics not reporting after successful deployment
Product: [oVirt] ovirt-engine-metrics
Component: Generic
Version: 1.0.4.3
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Lukas Svaty <lsvaty>
Assignee: Shirly Radco <sradco>
QA Contact: Lukas Svaty <lsvaty>
Docs Contact:
CC: bugs, rmeggins
Keywords: TestBlocker
Target Milestone: ovirt-4.2.0
Target Release: ---
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-26 13:31:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Metrics
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1419858, 1458735, 1459764

Description Lukas Svaty 2017-06-23 10:40:33 UTC
Description of problem:
After a successful deployment of engine metrics, nothing is being sent from fluentd. This was tried with a custom metrics store as well as a ViaQ setup.

I am not sure which logs you would like to see, as neither these nor syslog show any hint of an error. I still have the environment and can provide anything you need.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.0.4.3-1.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Create config.yml with your metrics store (see the example sketch after these steps)
2. Run /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh
3. Check that fluentd and collectd are running successfully
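
For illustration only (not from the original report), a minimal config.yml of the kind step 1 refers to might look roughly like the following; the file path and key names are assumptions based on the oVirt metrics store documentation of that era, so verify them against the example config shipped with the package:

# cat /etc/ovirt-engine-metrics/config.yml             (assumed path)
ovirt_env_name: test                                    # arbitrary label for this environment
fluentd_elasticsearch_host: metrics-store.example.com   # your remote metrics store / ViaQ host

# then run the wrapper from step 2
# /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh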


Actual results:
# No incoming packets to metrics store
# tcpdump -n dst port 24284
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
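
As a complementary check (not part of the original report), one can confirm from the engine side whether fluentd holds an established outgoing connection on the secure_forward port captured above, and whether it is logging flush errors; both commands are standard, and only the port and service name are taken from this report:

# ss -tnp | grep 24284
# journalctl -u fluentd --since "1 hour ago" | grep -iE 'warn|error'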


Additional info:
[root@ls-engine1 ~]# date && systemctl status collectd fluentd
Fri Jun 23 12:36:50 CEST 2017
● collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/collectd.service.d
           └─postgresql.conf
   Active: active (running) since Fri 2017-06-23 11:15:11 CEST; 1h 21min ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 709 (collectd)
   CGroup: /system.slice/collectd.service
           └─709 /usr/sbin/collectd

Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "swap" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "df" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "aggregation" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "processes" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "postgresql" successfully loaded.
Jun 23 11:15:06 ls-engine1.example.com collectd[709]: plugin_load: plugin "write_http" successfully loaded.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Systemd detected, trying to signal readyness.
Jun 23 11:15:11 ls-engine1.example.com systemd[1]: Started Collectd statistics daemon.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Initialization complete, entering read-loop.
Jun 23 11:15:11 ls-engine1.example.com collectd[709]: Successfully connected to database engine (user engine) at server localhost:5432 (server version: 9.2.18, protocol version: 3, pid: 739)

● fluentd.service - Fluentd
   Loaded: loaded (/usr/lib/systemd/system/fluentd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2017-06-23 11:14:58 CEST; 1h 21min ago
     Docs: http://www.fluentd.org/
 Main PID: 649 (fluentd)
   CGroup: /system.slice/fluentd.service
           ├─649 /usr/bin/ruby /usr/bin/fluentd -c /etc/fluentd/fluent.conf
           └─710 /usr/bin/ruby /usr/bin/fluentd -c /etc/fluentd/fluent.conf

Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: </match>
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: </ROOT>
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:09 +0200 [debug]: listening http on localhost:9880
Jun 23 11:15:09 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:09 +0200 [info]: following tail of /var/log/ovirt-engine/engine.log
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 [warn]: dead connection found: lsvaty-vm1.example.com, reconnecting...
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 fluent.warn: {"message":"dead connection found: lsvaty-vm1.example.com, reconnecting..."}
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 [info]: connection established to lsvaty-vm1.example.com
Jun 23 11:15:14 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:14 +0200 fluent.info: {"message":"connection established to lsvaty-vm1.example.com"}
Jun 23 11:15:19 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:19 +0200 [warn]: recovered connection to dead node: lsvaty-vm1.example.com
Jun 23 11:15:19 ls-engine1.example.com fluentd[649]: 2017-06-23 11:15:19 +0200 fluent.warn: {"message":"recovered connection to dead node: lsvaty-vm1.example.com"}

Comment 1 Red Hat Bugzilla Rules Engine 2017-06-26 09:06:34 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Shirly Radco 2017-06-26 12:36:00 UTC
These errors occur when the local fluentd does not manage to connect to the remote fluentd.

I don't believe this is a blocker to the other bug.

Rich should check why the remote fluentd is having these errors:

 "2017-06-26 11:15:48 +0200 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error=\"queue size exceeds limit\" tag=\"project.ovirt-metrics-lsvaty_test-@kibana-highlighted-field@ovirt@/kibana-highlighted-field@\""

Comment 3 Lukas Svaty 2017-06-26 13:31:26 UTC
So in the end there were two errors, at least that I am aware of.

1. Misconfiguration of the ViaQ setup: not all hostnames were resolvable from all of the machines.
2. Misconfiguration of the non-ViaQ setup: fluentd was not able to establish a connection due to an outdated certificate.

Due to these, I am closing this issue.
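
For completeness, the two root causes above can usually be confirmed with standard checks like the following (the commands are illustrative and not from the original report; the hostname is taken from the logs above and the certificate path is only an example):

# 1. hostname resolution, run from every involved machine
getent hosts lsvaty-vm1.example.com

# 2. certificate validity on the non-ViaQ setup
openssl x509 -noout -enddate -in /path/to/fluentd-client-cert.pem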