Bug 1785560

Summary:

pcp-grafana cease updating after some time

Product:

Red Hat Enterprise Linux 8

Reporter:

Lukas Zapletal <lzap>

Component:

pcp

Assignee:

pcp-maint <pcp-maint>

Status:

CLOSED ERRATA

QA Contact:

Jan Kurik <jkurik>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

8.0

CC:

agerstmayr, chorn, jkurik, mgoodwin, nathans, patrickm, yuokada

Target Milestone:

Keywords:

Bugfix

Target Release:

8.0

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-04-28 15:40:28 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
1 hour timerange	none
3 hours timerange	none

Description Lukas Zapletal 2019-12-20 08:55:46 UTC

Description of problem:

I have configured pmlogger to gather data from a remote pcp 4.x (RHEL7) and configured pmproxy, redis, grafana-pcp. Metrics show up, but after some time Grafana stops updating and nothing is showing up. I see many errors:

[Thu Dec 19 15:27:58] pmproxy(41376) Error: - DISCONNECTED - no
descriptor for series identifier
419b634e5c5be350a266792e6eb859a6c5cb8d41

[Thu Dec 19 15:27:58] pmproxy(41376) Error: - DISCONNECTED - no
descriptor for series identifier
92ffe1cd9ab522cb25d2c0e01a04b0214a8b6664

The remote pcp is running postgresql and prometheus PMDA, both are configured to store all metrics into archives.

I have restarted all the services and also the monitoring server several times. After restarts, some metric temporarly appear but some graphs do not update anymore (e.g. memory). I was using Grafana PCP plugin dashboard example.

Version-Release number of selected component (if applicable):

PCP from RHEL 8.0

Comment 2 Nathan Scott 2020-01-07 21:33:50 UTC

Could you attach the /etc/redis.conf file from this machine Lukas?

I've not observed this phenomenon myself, nor been able to reproduce it as yet.  The behaviour kinda looks like a critical, PCP metadata key has been removed from Redis, which should never happen.  We only set TTL and MAXLEN on the time series data keys, not the metadata ... but AIUI this can be made to happen with certain Redis maxmem-related settings configured differently to defaults.

Comment 3 Lukas Zapletal 2020-01-09 12:07:34 UTC

No changes there, just the distro default. Let me setup a machine for you guys in a bit.

Comment 4 Andreas Gerstmayr 2020-01-09 12:52:13 UTC

Does the pmlogger log file show anything unusual at the time when it stopped updating?
Can you paste the output of `journalctl -e -u pmlogger`?

Comment 5 Nathan Scott 2020-01-09 23:17:30 UTC

(In reply to Lukas Zapletal from comment #3)
> No changes there, just the distro default.

Hmm, OK, that shoots down my earlier theory.

> Let me setup a machine for you guys in a bit.

Thanks Lukas!

Comment 7 Jan Kurik 2020-01-20 12:02:42 UTC

I was able to reproduce it as well. The chart stopped updating after approx. 45 minutes.
Here are some logs relevant to the issue:

Time when the chart stopped updating: Jan 20 05:30

== /var/log/pcp/pmproxy/pmproxy.log ==
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set

== /var/log/pcp/pmlogger/$(hostname)/pmlogger.log ==
pmlogger: Caught signal 15, exiting
Log finished Mon Jan 20 05:30:11 2020

== $(journalctl -e -u pmlogger) ==
Jan 20 04:43:09 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23520]: /usr/share/pcp/lib/pmlogger: pmlogger not running
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Stopped Performance Metrics Archive Logger.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Starting Performance Metrics Archive Logger...
Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23564]: Starting pmlogger ...
Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Can't open PID file /run/pcp/pmlogger.pid (yet?) after start: No such file or directory
Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing.
Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing.
Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Supervising process 30030 which is not our child. We'll most likely not notice when it exits.
Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger.

== /var/log/pcp/pmlogger/pmlogger_check.log.prev ==
Duplicate archive basename ... rename 20200120.05.30.* files to 20200120.05.30-00.*
Restarting primary pmlogger for host "local:" ... [process 30030]  done
Latest folio created for 20200120.05.30

== /var/log/pcp/pmlogger/pmlogger_daily-K.log.prev ==
=== compressing PCP archives for host local: ===
Archive files being compressed ...
20200120.05.30-00.0 20200120.05.30-00.meta


It seems to me like pmlogger was restarted by systemd, because systemd was not able to recognize pmlogger as alive. The restart of pmlogger confused pmproxy and the pmproxy was not able to provide any data then.
When pmproxy is restarted, the chart starts to update again.

Comment 9 Jan Kurik 2020-01-23 20:11:44 UTC

Using the new pcp-5.0.2-3.el8 build, pmproxy survives restart of pmlogger as well as archive rotation without an issue and serves data to grafana-pcp plugin as expected.

However, I observed a similar issue, having the same symptoms but in my opinion it is a different one.
When I use "PCP Redis Host Overview" dashboard, I see only "older" data when I use 3 hours (or longer) timerange. When timerange is switched to 1 hour (or less), I can see even the latest data. I am attaching two screenshots done at the same time with 3 hours and 1 hour timerange for better understanding

Comment 10 Jan Kurik 2020-01-23 20:13:04 UTC

Created attachment 1654907 [details]
1 hour timerange

Comment 11 Jan Kurik 2020-01-23 20:13:54 UTC

Created attachment 1654908 [details]
3 hours timerange

Comment 12 Andreas Gerstmayr 2020-01-28 14:16:22 UTC

(In reply to Jan Kurik from comment #9)
> However, I observed a similar issue, having the same symptoms but in my
> opinion it is a different one.
> When I use "PCP Redis Host Overview" dashboard, I see only "older" data when
> I use 3 hours (or longer) timerange. When timerange is switched to 1 hour
> (or less), I can see even the latest data. I am attaching two screenshots
> done at the same time with 3 hours and 1 hour timerange for better
> understanding

Thanks for reporting!
I've just reproduced this error and have a fix ready.
Yep, it's another bug, this time in grafana-pcp.
I'll test it a bit more locally and then submit a new build using this BZ (as for a users point of view the symptoms are very similar).

Comment 13 Andreas Gerstmayr 2020-01-28 20:14:23 UTC

Just built grafana-pcp-1.0.5-3.el8 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1070446)

@Jan: Can you verify if this solves the issue with not seeing the latest data with large time frames? Thanks!

Comment 14 Jan Kurik 2020-01-29 11:17:10 UTC

(In reply to Andreas Gerstmayr from comment #13)
> @Jan: Can you verify if this solves the issue with not seeing the latest
> data with large time frames? Thanks!

Thanks Andreas. The new grafana-pcp-1.0.5-3.el8 build fixed the issue I was observing. I guess I need to file a new bug for grafana-pcp component, to get the build in, right ?

Comment 15 Andreas Gerstmayr 2020-01-30 18:00:19 UTC

I've pushed a new build of grafana-pcp with this BZ already, and added the new build in the RHEA-2019:48856-01 errata.

The only problem is that I can't link this BZ to the grafana-pcp errata, as it's linked already to the PCP errata. Will keep that in mind next time and use a BZ ID of the proper component.

Comment 16 Jan Kurik 2020-01-31 17:21:38 UTC

No observation of symptoms described above when using grafana-pcp-1.0.5-3.el8 and pcp-5.0.2-3.el8 builds.

Comment 18 errata-xmlrpc 2020-04-28 15:40:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1628