Description of problem:
If there is an issue with the fluentd packaging or configuration and fluentd fails to load, no error message is shown to the user, and collectd logs many errors because it cannot establish a connection to fluentd:

curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused

Version-Release number of selected component (if applicable):
1.0.4.3

How reproducible:
100%

Steps to Reproduce:
1. Update the fluentd config so that fluentd will fail.
2. Set up metrics using the ansible script.
3. Check the collectd status.

Actual results:
collectd is running and error messages are logged.

Expected results:
collectd should be stopped and no error messages should be logged for collectd.

Additional info:
Note: After this fix, if fluentd fails to load when running the metrics setup script, collectd will be stopped (or not started, if it is not already running). If fluentd is stopped manually, collectd will keep running and will keep logging errors.
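The behavior described in this note can be sketched as a small decision function. This is only an illustration of the logic, not the code of the ansible roles: the names is_active, collectd_action and apply_action are hypothetical, and the only external command assumed is systemctl with its is-active, restart and stop verbs.

```python
import subprocess


def is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    # `systemctl is-active --quiet UNIT` exits with 0 only when the unit is active.
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def collectd_action(fluentd_active: bool, collectd_active: bool) -> str:
    """Decide what the setup run should do with collectd (hypothetical helper)."""
    if fluentd_active:
        # fluentd loaded its configuration, so collectd can send data to it.
        return "restart"
    # fluentd failed to load: collectd must not be left running.
    return "stop" if collectd_active else "leave-stopped"


def apply_action(action: str) -> None:
    """Run the systemctl command matching the decision, if any."""
    if action in ("restart", "stop"):
        subprocess.run(["systemctl", action, "collectd"], check=True)


if __name__ == "__main__":
    action = collectd_action(is_active("fluentd"), is_active("collectd"))
    print(f"collectd action: {action}")
    apply_action(action)
```

Note that this only covers the setup run itself. As stated above, if fluentd is stopped manually afterwards, nothing re-evaluates the decision and collectd keeps running.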
*** Bug 1463146 has been marked as a duplicate of this bug. ***
Tested in ovirt-engine-metrics-1.0.5-1.el7ev.noarch.

When the fluentd configuration fails, collectd is not stopped.

Logged message:
Jul 18 18:15:53 ls-engine1 collectd[2552]: write_http plugin: curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused

Relevant part of playbook:
TASK [ovirt_collectd/restart_collectd_if_needed : restart collectd if fluentd is up] *********************************************************************************
skipping: [localhost]

TASK [ovirt_collectd/restart_collectd_if_needed : pause for collectd to start] *********************************************************************************
skipping: [localhost]

Collectd should be stopped rather than merely not restarted. If fluentd was updated and the new config is not supported, fluentd will fail, while collectd, still running from a previous metrics deployment, will keep spamming the log.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
(In reply to Lukas Svaty from comment #3)
> tested in ovirt-engine-metrics-1.0.5-1.el7ev.noarch
>
> when fluentd configuration fails, collectd is not stopped
> Logged message:
> Jul 18 18:15:53 ls-engine1 collectd[2552]: write_http plugin:
> curl_easy_perform failed with status 7: Failed connect to localhost:9880;
> Connection refused
>
> relevant part of playbook:
> TASK [ovirt_collectd/restart_collectd_if_needed : restart collectd if
> fluentd is up]
> *****************************************************************************
> ****skipping: [localhost]
>
> TASK [ovirt_collectd/restart_collectd_if_needed : pause for collectd to
> start]
> *****************************************************************************
> ****skipping: [localhost]
>
>
> Collectd should be stopped instead of not restarting it, in case fluentd was
> updated new config is not supported, fluentd will fail, collectd was running
> from the previous metrics deployments and will keep spamming.

There is a different task, called "Stop collectd service", that stops the service. It runs after the stage you mentioned in the test results and is part of the collectd_setup role.

Please describe the test that you did. Did you run a status check after the playbook finished?
1. Create /etc/fluentd/config.d/10-test.conf with wrong syntax (unloadable by fluentd).

2. Run /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh

... omitted text ...
TASK [ovirt_collectd/collectd_setup : Stop collectd service] ********************************************
ok: [localhost]
... omitted text ...

3. Check fluentd status

[root@ls-engine1 ~]# service fluentd status
Redirecting to /bin/systemctl status fluentd.service
● fluentd.service - Fluentd
   Loaded: loaded (/usr/lib/systemd/system/fluentd.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2017-07-18 18:06:43 CEST; 16h ago
     Docs: http://www.fluentd.org/
 Main PID: 16443 (code=exited, status=1/FAILURE)

Jul 18 18:06:43 ls-engine1.com systemd[1]: Unit fluentd.service entered faile....
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service failed.
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service holdoff time over,....
Jul 18 18:06:43 ls-engine1.com systemd[1]: start request repeated too quickly...e
Jul 18 18:06:43 ls-engine1.com systemd[1]: Failed to start Fluentd.
Jul 18 18:06:43 ls-engine1.com systemd[1]: Unit fluentd.service entered faile....
Jul 18 18:06:43 ls-engine1.com systemd[1]: fluentd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

4. Check collectd status

[root@ls-engine1 ~]# service collectd status
Redirecting to /bin/systemctl status collectd.service
● collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/collectd.service.d
           └─postgresql.conf
   Active: inactive (dead) since Tue 2017-07-18 18:09:03 CEST; 16h ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 16216 (code=exited, status=0/SUCCESS)

Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:01 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:03 ls-engine1.com collectd[16216]: Exiting normally.
Jul 18 18:09:03 ls-engine1.com systemd[1]: Stopping Collectd statistics daemon...
Jul 18 18:09:03 ls-engine1.com collectd[16216]: collectd: Stopping 5 read thr....
Jul 18 18:09:03 ls-engine1.com collectd[16216]: collectd: Stopping 5 write th....
Jul 18 18:09:03 ls-engine1.com collectd[16216]: write_http plugin: curl_easy_...d
Jul 18 18:09:03 ls-engine1.com systemd[1]: Stopped Collectd statistics daemon.
Hint: Some lines were ellipsized, use -l to show in full.

5. tail -f /var/log/messages

Jul 19 10:25:53 ls-engine1 collectd[2552]: write_http plugin: curl_easy_perform failed with status 7: Failed connect to localhost:9880; Connection refused

^^ This message appears 9 times every 10 seconds, so collectd should be stopped properly.
Once collectd can't connect to fluentd, these messages appear; this is expected. The final result is that the service is stopped, which is the expected behavior.
Opened this upstream: https://github.com/collectd/collectd/issues/2371
Hi, please close/update the upstream issue. It is not fluentd related. The collectd write_http plugin tries to connect to the specified IP and port, which in our case happens to be fluentd but could be something else. If it can't connect, it logs these messages. We can ask for an option in the write_http plugin to retry the connection at increasing intervals, so that fewer errors appear.
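To make the "increasing intervals" suggestion concrete, here is a minimal sketch of a capped exponential backoff. It is not how the write_http plugin works today (the plugin is written in C and, per this report, logs an error on every failed attempt). The function names, the base delay of 1 second, the factor of 2 and the 30-second cap are all assumptions chosen for illustration; only localhost:9880 comes from the log above.

```python
import socket
import time
from typing import Iterator


def backoff_intervals(base: float, factor: float, cap: float, attempts: int) -> Iterator[float]:
    """Yield `attempts` wait times that grow by `factor` and never exceed `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def connect_with_backoff(host: str, port: int, attempts: int = 6) -> bool:
    """Try to open a TCP connection, sleeping longer after each failure."""
    for wait in backoff_intervals(base=1.0, factor=2.0, cap=30.0, attempts=attempts):
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError as exc:
            # One log line per attempt; attempts become rarer as `wait` grows.
            print(f"connect to {host}:{port} failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)
    return False


if __name__ == "__main__":
    connect_with_backoff("localhost", 9880)
```

With these assumed values, six failed attempts are spread over about a minute instead of producing several errors every ten seconds.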
Feel free to add a comment to the upstream issue. However, I believe that on "service collectd stop" all plug-ins should be stopped properly, so this is indeed collectd related. It does not matter whether it connects to fluentd or to another service.