2172892 – pmlogger logs to an old hostname after the hostname updates (due to dhcp, zezere, manual hostname updates, etc.)

Bug 2172892 - pmlogger logs to an old hostname after the hostname updates (due to dhcp, zezere, manual hostname updates, etc.)

Summary: pmlogger logs to an old hostname after the hostname updates (due to dhcp, zez...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	pcp
Sub Component:
Version:	37
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Nathan Scott
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2023-02-23 12:41 UTC by Garrett LeSage
Modified:	2023-06-29 01:46 UTC (History)
CC List:	6 users (show)
Fixed In Version:	pcp-6.0.5-1.fc38 pcp-6.0.5-1.fc37
Clone Of:
Environment:
Last Closed:	2023-06-29 01:25:06 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Garrett LeSage 2023-02-23 12:41:26 UTC

Description of problem:

pmlogger does not log to the proper directory, causing pcp-related data to not be visible in various tools (such as Cockpit).

Version-Release number of selected component (if applicable):

pcp-6.0.1-2.fc37.aarch64
pcp-6.0.2-1.fc38.x86_64

How reproducible:

Every time.

Steps to Reproduce:
1. Start with one hostname.
2. Make sure pmlogger is running.
3. Switch to a new hostname (even manual switching would work).
4. Check where pmlogger is logging. (It's the original hostname, not the new one.)

Actual results:

pmlogger keeps logging to the new hostname

Expected results:

pmlogger should log to the current (new) hostname

Additional info:

This is an issue as it's likely there are race conditions when booting where a machine may not have its final hostname yet, such as when zezere or dhcp are used when setting the hostname.

For example, my server running Fedora IoT has "localhost.localdoman" on boot. It then contacts Fedora via zezere and its local hostname is set to "fedora". But I have it assigned to "Snow" via DHCP. pmlogger should be logging to "Snow", but it often does not.

Depending on when pmlogger starts logging, it could be any one of the three. I had all three subdirectories in /var/log/pcp/pmlogger/ (localhost.localdoman, fedora, Snow)

Comment 1 Martin Pitt 2023-02-23 12:54:57 UTC

I confirm that. Garrett is using Fedora IoT, which apparently uses transient hostnames by default, but this is easy enough to reproduce. Starting from a Fedora 37 cloud image (with persistent /etc/hostname), I called `hostnamectl set-hostname newname`, and the running pmlogger just continued to write into /var/log/pcp/pmlogger/<old name>/.

So preferably pmlogger should listen to host name changes (systemd-hostnamed signals, or perhaps there's a netlink message for that?), as they can happen at any time. A hack would be to delay startup of pmlogger.service until after DHCP is done (for transient hostnames), but that sounds brittle and would miss recording info for the early boot.

Comment 2 Nathan Scott 2023-02-24 01:08:34 UTC

Thanks Garrett (& Martin), I'm just heading away on PTO for awhile but have kicked off a discussion in upstream PCP slack on how best to tackle this with the author of this code.  Will follow up further when I'm back.

Comment 3 Ken McDonell 2023-02-24 21:34:37 UTC

I'd like to understand expectations here before we leap into "fixing" this ...

The pmlogger of interest here is one that is launched from a control line (/etc/pcp/pmlogger/control or some file in /etc/pcp/pmlogger/control.d) where the first field is LOCALHOSTNAME and the secondary filed is "y" (primary pmlogger) and the 4th field is something like PCP_ARCHIVE_DIR/LOCALHOSTNAME, correct?

Now LOCALHOSTNAME is mapped to whatever hostname(1) reports (or "localhost" if hostname(1) fails) when pmlogger is launched and then used to rewrite the directory name PCP_ARCHIVE_DIR/LOCALHOSTNAME ... this is where the PCP archives will be stored.

Once pmlogger is running, if hostname(1) returns something different, what do you expect to happen?

The name of the "host" is also stashed away in the archive (a) in the archive label, and (b) in the pmcd.pmlogger.host metric in the epilogue and prologue records. Similarly, the path to the archive files is stashed in the pmcd.pmlogger.archive metric in the epilogue and prologue records. If the "host" name changes, then the daily process of archive management and re-launching pmlogger is likely to come unhinged.

If the control file used PCP_ARCHIVE_DIR/<preferred hostname> instead of PCP_ARCHIVE_DIR/LOCALHOSTNAME then that would address the directory naming issue, but this is ill-sited for zeroconf and default installs, as "<preferred hostname>" is not something we can guess at. But it would still leave the hostname encoded in the archive files exposed to changes.

Comment 4 Martin Pitt 2023-02-25 08:51:11 UTC

As I mentioned in comment #1, the one thing I'd expect is that it does not just start or continue to log to the wrong host name directory. I.e. either fix startup dependencies to wait for DHCP hostnames; or after the hostname changes, continue to log to the new dir. Or as a third option, why even bother with the hostname at all for logging locally? /var/log/pcp/pmlogger/localhost/ may save the hassle? Naïvely from  my user's POV the latter would seem like the best option. Hostnames are only important when logging remotely, and then of course they must be somewhat reliable. But network services are usually only available after setting the hostname, so that seems much less prone to early boot changing hostnames.

Anyway, I don't have a strong opinion how exactly to fix that. But dynamic hostnames are a thing, especially as Fedora IoT (and RHEL 4 edge) do that by default. I didn't change any configs in /etc, this is an out-of-the-box PCP package install?

Thanks!

Comment 5 Nathan Scott 2023-03-06 00:13:12 UTC

(In reply to Ken McDonell from comment #3)
> I'd like to understand expectations here before we leap into "fixing" this
> ...
> 
> The pmlogger of interest here is one that is launched from a control line
> (/etc/pcp/pmlogger/control or some file in /etc/pcp/pmlogger/control.d)
> where the first field is LOCALHOSTNAME and the secondary filed is "y"
> (primary pmlogger) and the 4th field is something like
> PCP_ARCHIVE_DIR/LOCALHOSTNAME, correct?

That's correct.

> Now LOCALHOSTNAME is mapped to whatever hostname(1) reports (or "localhost"
> if hostname(1) fails) when pmlogger is launched and then used to rewrite the
> directory name PCP_ARCHIVE_DIR/LOCALHOSTNAME ... this is where the PCP
> archives will be stored.

Yep - I think the core of this issue is the use of the LOCALHOSTNAME
and what can happen as a result of this.  While I was on leave this
other report came in from a RH customer, which I suspect is another
case of this:

" A newly installed host had a hostname change (using hostnamectl) but was not restarted. Host almost ran out of disk space, pmlogger was writing to /var/log/pcp/pmlogger/localhost.localdomain, but it appears that logrotate and the daily timer was looking for metrics in "/var/log/pcp/pmlogger/newhostname.realdomain.tld"

> Once pmlogger is running, if hostname(1) returns something different, what
> do you expect to happen?

That's the question - I think its up to us to define something that
doesn't induce this ENOSPC failure.  Since we stop tracking a logger
with previous name (no longer any /etc/pcp/pmlogger/control entries
for it so pmlogger_check/daily will be oblivious now I think), we're
going to miss compression and log rotation.

It's tough to solve too because we have no fore-warning of the change,
we just "accidentally" become aware of it later, with no knowledge of
the previous name (that we could use to repair things, maybe renaming
the directory, etc).

> The name of the "host" is also stashed away in the archive (a) in the
> archive label, and (b) in the pmcd.pmlogger.host metric in the epilogue and
> prologue records.  Similarly, the path to the archive files is stashed in
> the pmcd.pmlogger.archive metric in the epilogue and prologue records.  If
> the "host" name changes, then the daily process of archive management and
> re-launching pmlogger is likely to come unhinged.

FWIW, let's consider also the other (remote) logging cases here - in
those scenarios (i.e. not using LOCALHOSTNAME) a change to the remote
host name has no detrimental effect, pmlogger continues logging under
the previous name until next rotation.

So we have some precedent there - I think a similar outcome for a local
host name change would be fine here.

> If the control file used PCP_ARCHIVE_DIR/<preferred hostname> instead of
> PCP_ARCHIVE_DIR/LOCALHOSTNAME then that would address the directory naming
> issue, but this is ill-sited for zeroconf and default installs, as
> "<preferred hostname>" is not something we can guess at.  But it would still
> leave the hostname encoded in the archive files exposed to changes.

Yep, that will solve it and is what I'll be telling people to do until
we figure out if we can do something more automatic here.

Comment 6 Nathan Scott 2023-03-16 03:33:18 UTC

Commits tackling this are arriving in upstream PCP now;
we're anticipating arrival in pcp-6.0.4 and later.


commit e866259e85ace2ee75232e4444fc684d76e39a07
Author: Ken McDonell <kenj.au>
Date:   Sun Mar 12 11:53:38 2023 +1100

    Introducing PMCD_HOSTNAME_CHANGE
    
    We've come across some situations where a host may change hostname,
    either during post-boot activity when transitioning from a temporary
    to a permanent hostname, or by subsequent administrative action.
    
    In either case, we ended up with a disconnect for the primary pmlogger
    in particular where it appeared to be creating archives in a directory
    that matched an old hostname, and not the current hostname.  Worse,
    because this pmlogger does not match any entry in the control files
    (after LOCALHOSTNAME is translated to the current hostname), the
    periodic and nightly archive maintenance (compression, rotation,
    merging and culling) were not being done correctly.
    
    The solution is to use the existing mechanism for returning bit-fields
    of status from pmcd via from pmFetch() to indicate a change in
    pmcd's state.  Specifically we've added a new PMCD_HOSTNAME_CHANGE bit.
    
    Clients that care can test for this and take appproriate action.
    
    In the case of pmlogger, it will exit if it was launched (or re-exec'd)
    from pmlogger_check or pmlogger_daily.  This will allow the systemd
    or cron mechanisms to relaunch the pmlogger using the current hostname
    against the control files.
    
    qa/1441 (new) tests this facility.
    
    This commit addresses some of the issues raised in
    https://bugzilla.redhat.com/show_bug.cgi?id=2172892
    ... still to be done is the janitorial work for archives created
    by a pmlogger does not match the current state of the control files
    and hostname(1).

Comment 7 Martin Pitt 2023-03-16 04:17:52 UTC

> The solution is to use the existing mechanism for returning bit-fields
> of status from pmcd via from pmFetch() to indicate a change in
> pmcd's state.  Specifically we've added a new PMCD_HOSTNAME_CHANGE bit.

Hmm, how would that work? What we do right now when someone opens Cockpit's metrics pages is to read "PCP_LOG_DIR/pmlogger/" + gethostname(). The latter is guesswork. There's two cases:

 * The directory exists, i.e. pcp previously logged the archive to the current hostname (that's the current assumption). We wouldn't expect pmFetch() to tells us "PMCD_HOSTNAME_CHANGE"; but if it did, that doesn't help the client at all -- which other hostname should it use?

 * The directory does not exist, i.e. pcp started using a different host name than the current one (i.e. the scenario from this bug report). There is nothing to pmFetch(), so the new bit doesn't help. Plus, there is still no indication what other host name to use.

We also can't just iterate through *all* PCP_LOG_DIR/pmlogger/* directories, as these may be for remote machines.

> why even bother with the hostname at all for logging locally? /var/log/pcp/pmlogger/localhost/ may save the hassle? Naïvely from  my user's POV the latter would seem like the best option. Hostnames are only important when logging remotely

That would still be my preferred option -- either that, or if pcp really wants to have a dynamic directory for localhost, perhaps maintain a "localhost" symlink which points to the current output directory, or some other API replacing gethostname(2)?

I suppose that may be the "janitorial work for archives created by a pmlogger" part?

Thanks!

Comment 8 Nathan Scott 2023-03-16 04:32:22 UTC

(In reply to Martin Pitt from comment #7)
> > The solution is to use the existing mechanism for returning bit-fields
> > of status from pmcd via from pmFetch() to indicate a change in
> > pmcd's state.  Specifically we've added a new PMCD_HOSTNAME_CHANGE bit.
> 
> Hmm, how would that work?

I should've mentioned, there is no code change intended for Cockpit.

> What we do right now when someone opens Cockpit's
> metrics pages is to read "PCP_LOG_DIR/pmlogger/" + gethostname(). The latter
> is guesswork.

Its not really a guess, its correct at the time Cockpit looks there.
PCP may or may not yet have caught up with a hostname change (there's
a small race window here - which is unfixable - between when a hostname
changes and when pmlogger detects it, but in some small amount of time
it closes & the directory structure will be correct).

> There's two cases:
> 
>  * The directory exists, i.e. pcp previously logged the archive to the
> current hostname (that's the current assumption). We wouldn't expect
> pmFetch() to tells us "PMCD_HOSTNAME_CHANGE"; but if it did, that doesn't
> help the client at all -- which other hostname should it use?

You should always use the current hostname.  The pmFetch flag is for
pmlogger to respond to in this scenario.

>  * The directory does not exist, i.e. pcp started using a different host
> name than the current one (i.e. the scenario from this bug report). There is
> nothing to pmFetch(), so the new bit doesn't help. Plus, there is still no
> indication what other host name to use.
> 
> We also can't just iterate through *all* PCP_LOG_DIR/pmlogger/* directories,
> as these may be for remote machines.

Yeah, definitely don't do that.

> > why even bother with the hostname at all for logging locally? /var/log/pcp/pmlogger/localhost/ may save the hassle? Naïvely from  my user's POV the latter would seem like the best option. Hostnames are only important when logging remotely
> 

There was much discussion around this option.  In the end, the back-compat
option was maintained mainly for reasons described below...

> That would still be my preferred option -- either that, or if pcp really
> wants to have a dynamic directory for localhost, perhaps maintain a
> "localhost" symlink which points to the current output directory, or some
> other API replacing gethostname(2)?

Symlinks were also discussed a fair bit, but in the end were discarded.
There are baked in requirements that a directory containing archives is
not going to contain archives from multiple host(names) - the hostname
is stamped into each archive as well, and we really really want to end
up with separate directories-of-archives for separate hostnames.

> I suppose that may be the "janitorial work for archives created by a
> pmlogger" part?

This refers more to cleanup of directories we're no longer logging to AIUI.

cheers.

Comment 9 Fedora Update System 2023-06-26 02:24:43 UTC

FEDORA-2023-8c3b058ebd has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2023-8c3b058ebd

Comment 10 Fedora Update System 2023-06-26 02:24:43 UTC

FEDORA-2023-1f0937ef5b has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-1f0937ef5b

Comment 11 Fedora Update System 2023-06-27 10:44:47 UTC

FEDORA-2023-8c3b058ebd has been pushed to the Fedora 37 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-8c3b058ebd`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-8c3b058ebd

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 12 Fedora Update System 2023-06-27 10:55:00 UTC

FEDORA-2023-1f0937ef5b has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-1f0937ef5b`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-1f0937ef5b

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 13 Fedora Update System 2023-06-29 01:25:06 UTC

FEDORA-2023-1f0937ef5b has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 14 Fedora Update System 2023-06-29 01:46:40 UTC

FEDORA-2023-8c3b058ebd has been pushed to the Fedora 37 stable repository.
If problem still persists, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.