Bug 1785560
| Summary: | pcp-grafana cease updating after some time | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Lukas Zapletal <lzap> | ||||||
| Component: | pcp | Assignee: | pcp-maint <pcp-maint> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 8.0 | CC: | agerstmayr, chorn, jkurik, mgoodwin, nathans, patrickm, yuokada | ||||||
| Target Milestone: | rc | Keywords: | Bugfix | ||||||
| Target Release: | 8.0 | Flags: | pm-rhel:
mirror+
|
||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | No Doc Update | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2020-04-28 15:40:28 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
|
Description
Lukas Zapletal
2019-12-20 08:55:46 UTC
Could you attach the /etc/redis.conf file from this machine Lukas? I've not observed this phenomenon myself, nor been able to reproduce it as yet. The behaviour kinda looks like a critical, PCP metadata key has been removed from Redis, which should never happen. We only set TTL and MAXLEN on the time series data keys, not the metadata ... but AIUI this can be made to happen with certain Redis maxmem-related settings configured differently to defaults. No changes there, just the distro default. Let me setup a machine for you guys in a bit. Does the pmlogger log file show anything unusual at the time when it stopped updating? Can you paste the output of `journalctl -e -u pmlogger`? (In reply to Lukas Zapletal from comment #3) > No changes there, just the distro default. Hmm, OK, that shoots down my earlier theory. > Let me setup a machine for you guys in a bit. Thanks Lukas! I was able to reproduce it as well. The chart stopped updating after approx. 45 minutes. Here are some logs relevant to the issue: Time when the chart stopped updating: Jan 20 05:30 == /var/log/pcp/pmproxy/pmproxy.log == [Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set [Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set [Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set [Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set [Mon Jan 20 05:30:20] pmproxy(21972) Error: failed to duplicate label set == /var/log/pcp/pmlogger/$(hostname)/pmlogger.log == pmlogger: Caught signal 15, exiting Log finished Mon Jan 20 05:30:11 2020 == $(journalctl -e -u pmlogger) == Jan 20 04:43:09 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger. Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23520]: /usr/share/pcp/lib/pmlogger: pmlogger not running Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Service RestartSec=100ms expired, scheduling restart. Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Scheduled restart job, restart counter is at 1. Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Stopped Performance Metrics Archive Logger. Jan 20 05:30:12 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Starting Performance Metrics Archive Logger... Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com pmlogger[23564]: Starting pmlogger ... Jan 20 05:30:13 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Can't open PID file /run/pcp/pmlogger.pid (yet?) after start: No such file or directory Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing. Jan 20 05:30:19 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: New main PID 23876 does not belong to service, and PID file is not owned by root. Refusing. Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: pmlogger.service: Supervising process 30030 which is not our child. We'll most likely not notice when it exits. Jan 20 05:30:46 ci-vm-10-0-137-223.hosted.upshift.rdu2.redhat.com systemd[1]: Started Performance Metrics Archive Logger. == /var/log/pcp/pmlogger/pmlogger_check.log.prev == Duplicate archive basename ... rename 20200120.05.30.* files to 20200120.05.30-00.* Restarting primary pmlogger for host "local:" ... [process 30030] done Latest folio created for 20200120.05.30 == /var/log/pcp/pmlogger/pmlogger_daily-K.log.prev == === compressing PCP archives for host local: === Archive files being compressed ... 20200120.05.30-00.0 20200120.05.30-00.meta It seems to me like pmlogger was restarted by systemd, because systemd was not able to recognize pmlogger as alive. The restart of pmlogger confused pmproxy and the pmproxy was not able to provide any data then. When pmproxy is restarted, the chart starts to update again. Using the new pcp-5.0.2-3.el8 build, pmproxy survives restart of pmlogger as well as archive rotation without an issue and serves data to grafana-pcp plugin as expected. However, I observed a similar issue, having the same symptoms but in my opinion it is a different one. When I use "PCP Redis Host Overview" dashboard, I see only "older" data when I use 3 hours (or longer) timerange. When timerange is switched to 1 hour (or less), I can see even the latest data. I am attaching two screenshots done at the same time with 3 hours and 1 hour timerange for better understanding Created attachment 1654907 [details]
1 hour timerange
Created attachment 1654908 [details]
3 hours timerange
(In reply to Jan Kurik from comment #9) > However, I observed a similar issue, having the same symptoms but in my > opinion it is a different one. > When I use "PCP Redis Host Overview" dashboard, I see only "older" data when > I use 3 hours (or longer) timerange. When timerange is switched to 1 hour > (or less), I can see even the latest data. I am attaching two screenshots > done at the same time with 3 hours and 1 hour timerange for better > understanding Thanks for reporting! I've just reproduced this error and have a fix ready. Yep, it's another bug, this time in grafana-pcp. I'll test it a bit more locally and then submit a new build using this BZ (as for a users point of view the symptoms are very similar). Just built grafana-pcp-1.0.5-3.el8 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1070446) @Jan: Can you verify if this solves the issue with not seeing the latest data with large time frames? Thanks! (In reply to Andreas Gerstmayr from comment #13) > @Jan: Can you verify if this solves the issue with not seeing the latest > data with large time frames? Thanks! Thanks Andreas. The new grafana-pcp-1.0.5-3.el8 build fixed the issue I was observing. I guess I need to file a new bug for grafana-pcp component, to get the build in, right ? I've pushed a new build of grafana-pcp with this BZ already, and added the new build in the RHEA-2019:48856-01 errata. The only problem is that I can't link this BZ to the grafana-pcp errata, as it's linked already to the PCP errata. Will keep that in mind next time and use a BZ ID of the proper component. No observation of symptoms described above when using grafana-pcp-1.0.5-3.el8 and pcp-5.0.2-3.el8 builds. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1628 |