Bug 1785560 - pcp-grafana ceases updating after some time
Summary: pcp-grafana ceases updating after some time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcp
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: pcp-maint
QA Contact: Jan Kurik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-20 08:55 UTC by Lukas Zapletal
Modified: 2023-02-12 21:37 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-28 15:40:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
1 hour timerange (131.45 KB, image/png), 2020-01-23 20:13 UTC, Jan Kurik
3 hours timerange (122.35 KB, image/png), 2020-01-23 20:13 UTC, Jan Kurik


Links
Red Hat Issue Tracker RHELPLAN-33038 (last updated 2023-02-12 21:37:10 UTC)
Red Hat Product Errata RHBA-2020:1628 (last updated 2020-04-28 15:40:41 UTC)

Description Lukas Zapletal 2019-12-20 08:55:46 UTC
Description of problem:

I have configured pmlogger to gather data from a remote pcp 4.x (RHEL 7) host and configured pmproxy, redis, and grafana-pcp. Metrics show up, but after some time Grafana stops updating and no new data appears. I see many errors:

[Thu Dec 19 15:27:58] pmproxy(41376) Error: - DISCONNECTED - no
descriptor for series identifier
419b634e5c5be350a266792e6eb859a6c5cb8d41

[Thu Dec 19 15:27:58] pmproxy(41376) Error: - DISCONNECTED - no
descriptor for series identifier
92ffe1cd9ab522cb25d2c0e01a04b0214a8b6664

The remote pcp is running the postgresql and prometheus PMDAs, both configured to store all metrics into archives.

I have restarted all the services and also the monitoring server several times. After restarts, some metrics temporarily appear, but some graphs no longer update (e.g. memory). I was using the Grafana PCP plugin example dashboard.

Version-Release number of selected component (if applicable):

PCP from RHEL 8.0

Comment 2 Nathan Scott 2020-01-07 21:33:50 UTC
Could you attach the /etc/redis.conf file from this machine Lukas?

I've not observed this phenomenon myself, nor been able to reproduce it as yet.  The behaviour kinda looks like a critical PCP metadata key has been removed from Redis, which should never happen.  We only set TTL and MAXLEN on the time series data keys, not the metadata ... but AIUI this can be made to happen with certain Redis maxmemory-related settings configured differently from the defaults.
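For reference, the Redis settings alluded to here are the maxmemory eviction options in /etc/redis.conf. A sketch of the relevant directives (the non-default values shown are illustrative, not taken from this bug report):

```
# /etc/redis.conf (excerpt)
# With no memory limit set, Redis never evicts keys:
maxmemory 0
# The default policy refuses writes rather than evicting keys:
maxmemory-policy noeviction
# A policy such as "allkeys-lru" would let Redis evict ANY key under
# memory pressure, including PCP metadata keys that carry no TTL,
# which could produce "no descriptor for series identifier" errors:
# maxmemory 2gb
# maxmemory-policy allkeys-lru
```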

Comment 3 Lukas Zapletal 2020-01-09 12:07:34 UTC
No changes there, just the distro default. Let me set up a machine for you guys in a bit.

Comment 4 Andreas Gerstmayr 2020-01-09 12:52:13 UTC
Does the pmlogger log file show anything unusual at the time when it stopped updating?
Can you paste the output of `journalctl -e -u pmlogger`?

Comment 5 Nathan Scott 2020-01-09 23:17:30 UTC
(In reply to Lukas Zapletal from comment #3)
> No changes there, just the distro default.

Hmm, OK, that shoots down my earlier theory.

> Let me setup a machine for you guys in a bit.

Thanks Lukas!

Comment 7 Jan Kurik 2020-01-20 12:02:42 UTC
I was able to reproduce it as well. The chart stopped updating after approx. 45 minutes.
Here are some logs relevant to the issue:

Time when the chart stopped updating: Jan 20 05:30

== /var/log/pcp/pmproxy/pmproxy.log ==
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set
[Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set

== /var/log/pcp/pmlogger/$(hostname)/pmlogger.log ==
pmlogger: Caught signal 15, exiting
Log finished Mon Jan 20 05:30:11 2020

== $(journalctl -e -u pmlogger) ==
Jan 20 04:43:09 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23520]: /usr/share/pcp/lib/pmlogger: pmlogger not running
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Stopped Performance Metrics Archive Logger.
Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Starting Performance Metrics Archive Logger...
Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23564]: Starting pmlogger ...
Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Can't open PID file /run/pcp/pmlogger.pid (yet?) after start: No such file or directory
Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing.
Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing.
Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Supervising process 30030 which is not our child. We'll most likely not notice when it exits.
Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger.

== /var/log/pcp/pmlogger/pmlogger_check.log.prev ==
Duplicate archive basename ... rename 20200120.05.30.* files to 20200120.05.30-00.*
Restarting primary pmlogger for host "local:" ... [process 30030]  done
Latest folio created for 20200120.05.30

== /var/log/pcp/pmlogger/pmlogger_daily-K.log.prev ==
=== compressing PCP archives for host local: ===
Archive files being compressed ...
20200120.05.30-00.0 20200120.05.30-00.meta


It seems to me that pmlogger was restarted by systemd, because systemd could not recognize pmlogger as alive. The restart of pmlogger confused pmproxy, which was then unable to provide any data.
When pmproxy is restarted, the chart starts updating again.
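To check whether this restart loop is what hit a given machine, the journal can be filtered for the telltale systemd messages. A sketch (the sample log text below is taken from the journal excerpt above; in practice it would come from `journalctl -u pmlogger --since "-1 hour"`):

```shell
# Sample journal lines from this bug report, used to demonstrate the filter:
log='Jan 20 05:30:12 host systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart.
Jan 20 05:30:12 host systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1.
Jan 20 05:30:19 host systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing.'

# An unexpected pmlogger restart shows up as a "Scheduled restart" line:
printf '%s\n' "$log" | grep -c 'Scheduled restart'

# If pmproxy stopped serving data after such a restart, restarting it is
# the recovery step described in this comment (a workaround, not a fix):
# systemctl restart pmproxy
```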

Comment 9 Jan Kurik 2020-01-23 20:11:44 UTC
Using the new pcp-5.0.2-3.el8 build, pmproxy survives a restart of pmlogger as well as archive rotation without issue, and serves data to the grafana-pcp plugin as expected.

However, I observed a similar issue with the same symptoms, but in my opinion it is a different one.
When I use the "PCP Redis Host Overview" dashboard, I see only "older" data with a 3-hour (or longer) timerange. When the timerange is switched to 1 hour (or less), I can see even the latest data. I am attaching two screenshots, taken at the same time with the 3-hour and 1-hour timeranges, for better understanding.

Comment 10 Jan Kurik 2020-01-23 20:13:04 UTC
Created attachment 1654907 [details]
1 hour timerange

Comment 11 Jan Kurik 2020-01-23 20:13:54 UTC
Created attachment 1654908 [details]
3 hours timerange

Comment 12 Andreas Gerstmayr 2020-01-28 14:16:22 UTC
(In reply to Jan Kurik from comment #9)
> However, I observed a similar issue with the same symptoms, but in my
> opinion it is a different one.
> When I use the "PCP Redis Host Overview" dashboard, I see only "older"
> data with a 3-hour (or longer) timerange. When the timerange is switched
> to 1 hour (or less), I can see even the latest data. I am attaching two
> screenshots, taken at the same time with the 3-hour and 1-hour
> timeranges, for better understanding.

Thanks for reporting!
I've just reproduced this error and have a fix ready.
Yep, it's another bug, this time in grafana-pcp.
I'll test it a bit more locally and then submit a new build using this BZ (as from a user's point of view the symptoms are very similar).

Comment 13 Andreas Gerstmayr 2020-01-28 20:14:23 UTC
Just built grafana-pcp-1.0.5-3.el8 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1070446)

@Jan: Can you verify if this solves the issue with not seeing the latest data with large time frames? Thanks!

Comment 14 Jan Kurik 2020-01-29 11:17:10 UTC
(In reply to Andreas Gerstmayr from comment #13)
> @Jan: Can you verify if this solves the issue with not seeing the latest
> data with large time frames? Thanks!

Thanks Andreas. The new grafana-pcp-1.0.5-3.el8 build fixed the issue I was observing. I guess I need to file a new bug for the grafana-pcp component, to get the build in, right?

Comment 15 Andreas Gerstmayr 2020-01-30 18:00:19 UTC
I've already pushed a new build of grafana-pcp with this BZ, and added the new build to the RHEA-2019:48856-01 errata.

The only problem is that I can't link this BZ to the grafana-pcp errata, as it's already linked to the PCP errata. I'll keep that in mind next time and use a BZ ID for the proper component.

Comment 16 Jan Kurik 2020-01-31 17:21:38 UTC
None of the symptoms described above were observed when using the grafana-pcp-1.0.5-3.el8 and pcp-5.0.2-3.el8 builds.

Comment 18 errata-xmlrpc 2020-04-28 15:40:28 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1628

