Bug 1464737 - If fluentd fails to load, no message will appear and collectd will log many error messages
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-metrics
Classification: oVirt
Component: Generic
Version: 1.0.4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.4
Target Release: ---
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Duplicates: 1463146
Depends On: oVirt-Metrics-and-Logs
Blocks: 1463146 1468892
Reported: 2017-06-25 08:50 UTC by Shirly Radco
Modified: 2017-08-02 13:50 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-28 14:14:31 UTC
oVirt Team: Metrics
Embargoed:
rule-engine: ovirt-4.1+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 78605 | 0 | master | MERGED | fluentd: restart collectd only if fluentd is running | 2020-10-23 10:22:06 UTC

Description Shirly Radco 2017-06-25 08:50:55 UTC
Description of problem:
If there is an issue with the fluentd packaging or configuration and fluentd fails to load, no error message is shown to the user, and collectd logs many errors because it cannot establish a connection to fluentd:

curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused
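
The error above is libcurl status 7 (could not connect), meaning nothing accepted the TCP connection on localhost:9880. A minimal sketch of how that condition could be checked from Python follows; the function name `port_accepts_connections` is hypothetical and not part of ovirt-engine-metrics, and the host and port are simply the ones named in the log line.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # localhost:9880 is the endpoint named in the collectd error message.
    if port_accepts_connections("localhost", 9880):
        print("a listener is accepting connections on localhost:9880")
    else:
        print("nothing is listening on localhost:9880")
```

A check like this only shows that some process is listening on the port; it does not confirm that the listener is fluentd or that its configuration loaded correctly.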

Version-Release number of selected component (if applicable):
1.0.4.3

How reproducible:
100%

Steps to Reproduce:
1. Update the fluentd config so that fluentd fails to load.
2. Set up metrics using the ansible script.
3. Check the collectd status.

Actual results:
collectd is running and error messages are logged.

Expected results:
collectd should be stopped and should not log error messages.

Additional info:

Comment 1 Shirly Radco 2017-07-09 14:02:28 UTC
Note:

After this fix, if fluentd fails to load when running the metrics setup script, collectd will be stopped (or left stopped if it was not running).

If fluentd is stopped manually, collectd will still be running and will still log errors to the log.
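
The behavior described in this note can be sketched as follows. This is an illustration in Python, not the Ansible tasks that implement the fix; the function names and the injectable `run` parameter are assumptions made so the logic can be exercised without touching real services. It relies on `systemctl is-active --quiet UNIT` exiting with 0 only when the unit is active.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes systemctl arguments and returns the exit code.
Runner = Callable[[Sequence[str]], int]


def run_systemctl(args: Sequence[str]) -> int:
    """Run systemctl with the given arguments and return its exit code."""
    return subprocess.run(["systemctl", *args], check=False).returncode


def sync_collectd_with_fluentd(run: Runner = run_systemctl) -> str:
    """Restart collectd if fluentd is active, otherwise stop collectd.

    Returns the action taken: "restarted" or "stopped".
    """
    fluentd_active = run(["is-active", "--quiet", "fluentd"]) == 0
    if fluentd_active:
        run(["restart", "collectd"])
        return "restarted"
    # Stopping an already stopped unit is harmless, so no extra check is needed.
    run(["stop", "collectd"])
    return "stopped"
```

As the note says, this decision is made only while the setup script runs; nothing here reacts to fluentd being stopped later.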

Comment 2 Shirly Radco 2017-07-17 18:08:42 UTC
*** Bug 1463146 has been marked as a duplicate of this bug. ***

Comment 3 Lukas Svaty 2017-07-18 16:19:42 UTC
tested in ovirt-engine-metrics-1.0.5-1.el7ev.noarch

when fluentd configuration fails, collectd is not stopped
Logged message:
Jul 18 18:15:53 ls-engine1 collectd[2552]: write_http plugin: curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused

relevant part of playbook:
TASK [ovirt_collectd/restart_collectd_if_needed : restart collectd if fluentd is up] *********************************************************************************
skipping: [localhost]

TASK [ovirt_collectd/restart_collectd_if_needed : pause for collectd to start] *********************************************************************************
skipping: [localhost]


Collectd should be stopped rather than merely not restarted. If fluentd was updated and the new config is not supported, fluentd will fail, while collectd is still running from the previous metrics deployment and will keep spamming the log.

Comment 4 Red Hat Bugzilla Rules Engine 2017-07-18 17:41:59 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 5 Shirly Radco 2017-07-19 06:35:14 UTC
(In reply to Lukas Svaty from comment #3)
> tested in ovirt-engine-metrics-1.0.5-1.el7ev.noarch
> 
> when fluentd configuration fails, collectd is not stopped
> Logged message:
> Jul 18 18:15:53 ls-engine1 collectd[2552]: write_http plugin:
> curl_easy_perform failed with status 7: Failed connect to localhost:9880;
> Connection refused
> 
> relevant part of playbook:
> TASK [ovirt_collectd/restart_collectd_if_needed : restart collectd if
> fluentd is up]
> *****************************************************************************
> ****skipping: [localhost]
> 
> TASK [ovirt_collectd/restart_collectd_if_needed : pause for collectd to
> start]
> *****************************************************************************
> ****skipping: [localhost]
> 
> 
> Collectd should be stopped instead of not restarting it, in case fluentd was
> updated new config is not supported, fluentd will fail, collectd was running
> from the previous metrics deployments and will keep spamming.

There is a separate task called "Stop collectd service" that stops the service. It runs after the stage you mentioned in the test results and is part of the collectd_setup role.

Please describe the test that you did.
Did you run status check after the playbook finished?

Comment 6 Lukas Svaty 2017-07-19 08:31:11 UTC
1. create /etc/fluentd/config.d/10-test.conf with wrong syntax (unloadable by fluentd)


2. Run /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh
... omitted text ...
TASK [ovirt_collectd/collectd_setup : Stop collectd service] ********************************************
ok: [localhost]
... omitted text ...


3. Check fluentd status
[root@ls-engine1 ~]# service fluentd status
Redirecting to /bin/systemctl status fluentd.service
● fluentd.service - Fluentd
   Loaded: loaded (/usr/lib/systemd/system/fluentd.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2017-07-18 18:06:43 CEST; 16h ago
     Docs: http://www.fluentd.org/
 Main PID: 16443 (code=exited, status=1/FAILURE)

Jul 18 18:06:43 ls-engine1.com systemd[1]: Unit fluentd.service entered faile....
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service failed.
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service holdoff time over,....
Jul 18 18:06:43 ls-engine1.com systemd[1]: start request repeated too quickly...e
Jul 18 18:06:43 ls-engine1.com systemd[1]: Failed to start Fluentd.
Jul 18 18:06:43 ls-engine1.com systemd[1]: Unit fluentd.service entered faile....
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.


4. Check collectd status
[root@ls-engine1 ~]# service collectd status
Redirecting to /bin/systemctl status collectd.service
● collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/collectd.service.d
           └─postgresql.conf
   Active: inactive (dead) since Tue 2017-07-18 18:09:03 CEST; 16h ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 16216 (code=exited, status=0/SUCCESS)

Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:03 ls-engine1.com collectd[16216]: Exiting normally.
Jul 18 18:09:03 ls-engine1.com systemd[1]: Stopping Collectd statistics daemon...
Jul 18 18:09:03 ls-engine1.com collectd[16216]: collectd: Stopping 5 read thr....
Jul 18 18:09:03 ls-engine1.com collectd[16216]: collectd: Stopping 5 write th....
Jul 18 18:09:03 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:03 ls-engine1.com systemd[1]: Stopped Collectd statistics daemon.
Hint: Some lines were ellipsized, use -l to show in full.


5. tail -f /var/log/messages
Jul 19 10:25:53 ls-engine1 collectd[2552]: write_http plugin: curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused

^^ this message appears 9 times every 10 seconds, thus collectd should be stopped properly

Comment 7 Shirly Radco 2017-07-19 08:38:45 UTC
Once collectd can't connect to fluentd, these messages appear; this is expected.
The final result is that the service is stopped, which is the expected behavior.

Comment 8 Lukas Svaty 2017-07-19 08:49:39 UTC
Opened this upstream:

https://github.com/collectd/collectd/issues/2371

Comment 9 Shirly Radco 2017-07-19 10:26:30 UTC
Hi,

Please close/update the upstream issue.
It is not fluentd related.

The collectd write_http plugin tries to connect to the specified IP and port, which in our case happens to be fluentd but could be something else.

If it can't connect, it writes these log messages.
We can ask for a write_http option to retry the connection at increasing intervals so that fewer errors appear.
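
To illustrate what retrying at increasing intervals could look like, here is a minimal sketch of a capped exponential backoff schedule. It is not an existing write_http option; the function name `backoff_delays` and the default values for `base`, `factor`, and `cap` are assumptions chosen for the example.

```python
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 300.0) -> Iterator[float]:
    """Yield retry delays in seconds that grow geometrically up to a cap."""
    delay = base
    while True:
        yield min(delay, cap)
        # Clamp after multiplying so the value cannot grow without bound.
        delay = min(delay * factor, cap)
```

With a schedule like this, a permanently unreachable endpoint would produce one error per attempt at growing intervals instead of a steady stream of messages.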

Comment 10 Lukas Svaty 2017-07-19 10:28:09 UTC
Feel free to add a comment to the upstream issue. However, I believe that on service collectd stop all plug-ins should be stopped properly, so this is indeed collectd related. It does not matter whether it connects to fluentd or another service.

