Bug 1554546 - virt-customize always creates an /etc/machine-id
Keywords:
Status: NEW
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libguestfs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1551603 1555474 1557046 2069309
Reported: 2018-03-12 23:17 UTC by Alex Schultz
Modified: 2022-04-04 08:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1557042 (view as bug list)
Environment:
Last Closed:


Links
System ID Private Priority Status Summary Last Updated
Github rwmjones guestfs-tools pull 6 0 None open Ensure the machine-id operation is the last one 2022-03-25 10:45:51 UTC

Description Alex Schultz 2018-03-12 23:17:32 UTC
Description of problem:
With the addition of https://github.com/libguestfs/libguestfs/commit/d5ce659e2c136fbcf0a0b9058711765cfae6c210, /etc/machine-id is always populated when virt-customize is used. The problem is that if you use this image on multiple hosts, the file exists and is not empty, so you end up with multiple hosts sharing the same machine-id. This is problematic for software that requires the ID to be unique within a given set of systems (e.g. storage or clustering).

Rather than persist this file, it would be beneficial if the virt-customize action only generated the /etc/machine-id for the execution of virt-customize and restored it back to a blank file once the customization is over. 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create image with empty /etc/machine-id
2. Use virt-customize
3. Inspect the image's /etc/machine-id

Actual results:
/etc/machine-id is now hard coded in the image

Expected results:
/etc/machine-id should continue to be empty so that when it is used to boot, a new machine-id is generated

Additional info:
See Bug 1551603

Comment 1 Richard W.M. Jones 2018-03-13 08:19:29 UTC
I think this is not a bug (note also the same thing happens with
the random seed).

You should use virt-sysprep on the disk image after cloning but
before deploying, which will fix this and the random seed issue.

Comment 2 Alex Schultz 2018-03-15 20:26:28 UTC
virt-sysprep by default runs customize which reintroduces the /etc/machine-id after running the machine-id operation.


virt-sysprep -a /builddir/build/BUILD/overcloud-full.x86_64.tar-extract/overcloud-full.qcow2
[   0.0] Examining the guest ...
[  61.3] Performing "abrt-data" ...
[  61.3] Performing "backup-files" ...
[  96.2] Performing "bash-history" ...
[  96.2] Performing "blkid-tab" ...
[  96.2] Performing "crash-data" ...
[  96.3] Performing "cron-spool" ...
[  96.4] Performing "dhcp-client-state" ...
[  96.4] Performing "dhcp-server-state" ...
[  96.5] Performing "dovecot-data" ...
[  96.5] Performing "logfiles" ...
[  96.6] Performing "machine-id" ...
[  96.7] Performing "mail-spool" ...
[  96.7] Performing "net-hostname" ...
[  96.7] Performing "net-hwaddr" ...
[  96.7] Performing "pacct-log" ...
[  96.7] Performing "package-manager-cache" ...
[  97.0] Performing "pam-data" ...
[  97.0] Performing "passwd-backups" ...
[  97.0] Performing "puppet-data-log" ...
[  97.0] Performing "rh-subscription-manager" ...
[  97.0] Performing "rhn-systemid" ...
[  97.0] Performing "rpm-db" ...
[  97.0] Performing "samba-db-log" ...
[  97.1] Performing "script" ...
[  97.1] Performing "smolt-uuid" ...
[  97.1] Performing "ssh-hostkeys" ...
[  97.1] Performing "ssh-userdir" ...
[  97.1] Performing "sssd-db-log" ...
[  97.1] Performing "tmp-files" ...
[  97.1] Performing "udev-persistent-net" ...
[  97.2] Performing "utmp" ...
[  97.2] Performing "yum-uuid" ...
[  97.2] Performing "customize" ...
[  97.2] Setting a random seed
[  97.2] Setting the machine ID in /etc/machine-id
[  98.6] Performing "lvm-uuids" ...

Comment 3 Richard W.M. Jones 2018-03-15 20:38:17 UTC
This (in virt-sysprep) is indeed a bug although a different one.

If you want to just remove the file, do:

  guestfish -a disk.img -i rm-f /etc/machine-id

Comment 4 Alex Schultz 2018-03-15 20:58:05 UTC
It should be noted you don't want to rm -rf /etc/machine-id as that causes other problems.  systemd only recreates the contents of /etc/machine-id if it exists and is empty.  You'd want to truncate it instead.

Comment 5 Cédric Jeanneret 2022-03-25 06:34:04 UTC
Hello there,

So this issue is still present - I just hit it with really, really weird consequences on an OSP lab, where I (unfortunately?) have to edit my overcloud image (a qcow2, used to deploy multiple VMs). In the end, getting the same machine-id broke everything OVS related, meaning, basically, broken network between the VMs (interfaces had the same MAC address).

Note that I'm using latest libguestfs as provided by both EL9 and CS9 (something around guestfs-tools-1.46.1-6.el9.x86_64 on CS9).

We really, really should get a correction for that. I tried to find the file pointed to by Alex, but it apparently moved around, and the only match for "machine-id" points to a bash script[1]. By the way, here as well, we may want to clean the machine-id at the end of the run, just before the "sync" call.

If you point me to the current location of that customization, I may be able to propose a patch. In the meantime, I'm adding a step calling guestfish to actually remove the file - yes, we can either empty it, remove it, or set its content to "uninitialized", according to the manpage:
"""
FIRST BOOT SEMANTICS
       /etc/machine-id is used to decide whether a boot is the first one. The rules are as follows:

        1. If /etc/machine-id does not exist, this is a first boot. During early boot, systemd will write "uninitialized\n" to this file and overmount a temporary file which contains the actual machine ID. Later (after
           first-boot-complete.target has been reached), the real machine ID will be written to disk.

        2. If /etc/machine-id contains the string "uninitialized", a boot is also considered the first boot. The same mechanism as above applies.

        3. If /etc/machine-id exists and is empty, a boot is not considered the first boot.  systemd will still bind-mount a file containing the actual machine-id over it and later try to commit it to disk (if /etc/ is writable).

        4. If /etc/machine-id already contains a valid machine-id, this is not a first boot.

       If by any of the above rules, a first boot is detected, units with ConditionFirstBoot=yes will be run.
"""
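As a rough illustration of the four rules quoted above, here is a minimal shell sketch that classifies a machine-id file. The function names are hypothetical, the "invalid" category is an extra catch-all that the manpage excerpt does not define, and the check for a valid ID (32 lowercase hexadecimal characters) is an assumption about the file format rather than systemd's own validation code.

```shell
#!/bin/sh
# Classify a machine-id file following the quoted first-boot rules.
# Prints one of: missing, uninitialized, empty, valid, invalid.
# "invalid" is a hypothetical catch-all for content matching none of the rules.
machine_id_state() (
    file=$1
    if [ ! -e "$file" ]; then
        echo missing                      # rule 1: first boot
        exit 0
    fi
    if [ ! -s "$file" ]; then
        echo empty                        # rule 3: not a first boot
        exit 0
    fi
    content=$(head -n 1 "$file")
    case $content in
        uninitialized)
            echo uninitialized ;;         # rule 2: first boot
        *)
            # Assumption: a valid ID is 32 lowercase hex characters.
            if printf '%s' "$content" | grep -Eq '^[0-9a-f]{32}$'; then
                echo valid                # rule 4: not a first boot
            else
                echo invalid
            fi ;;
    esac
)

# Succeeds only for the two states the rules treat as a first boot.
is_first_boot() {
    case $(machine_id_state "$1") in
        missing|uninitialized) return 0 ;;
        *) return 1 ;;
    esac
}
```

Note that an empty file gets a fresh ID at boot but does not count as a first boot, while a removed file does; which of the two is wanted depends on whether units with ConditionFirstBoot=yes should run.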

Thank you in advance.

Cheers,

C.





[1] https://github.com/libguestfs/libguestfs/blob/f47e0bb6725434778384cf79ba3b08610f8c3796/appliance/init

Comment 6 Richard W.M. Jones 2022-03-25 08:34:03 UTC
The file changed name recently to /etc/machine-info (with slightly
different content).  I don't know if creating an empty /etc/machine-info
will do what you want, but you could try it using:

virt-customize --truncate /etc/machine-info --truncate /etc/machine-id ...

The source for the virt-customize program moved to a new project:
https://github.com/rwmjones/guestfs-tools

Comment 7 Cédric Jeanneret 2022-03-25 09:59:08 UTC
Hello Richard,

hmm, the manpage doesn't seem to indicate that machine-info "replaces" nor "takes over" machine-id. It's more a metadata file, managed by hostnamectl, while machine-id is that unique UUID for the machine itself. Apparently, machine-info doesn't know about the machine-id.

I'll have a look at that other project code and see if I can propose a change that will, actually, remove the machine-id as it should. As said, the current way things are done is breaking some usages - among them, it may affect OSP customers or even Support when they have to edit an image and deploy it on actual machines...

Cheers,

C.

Comment 8 Cédric Jeanneret 2022-03-25 10:31:20 UTC
Hm, small update/correction:
there IS «a» machine-id in the /etc/machine-info, but under another name: KERNEL_INSTALL_MACHINE_ID

Though, it seems this one:
- isn't editable via hostnamectl
- isn't actually used in "hostnamectl status" (it shows the one set in /etc/machine-id)
- may be different from the actual machine-id
- specifies the installation-specific installation directory kernel-install should use - for instance, that's the machine-id generated when the image was created.

So I'm wondering if THIS one would be sufficient for the actual kernel %post ? Though, of course, there are more than probably other packages wanting to get the machine-id.

In any case, truncating /etc/machine-info isn't useful, since its content, even with KERNEL_INSTALL_MACHINE_ID set and shared across multiple machines, doesn't create any issue (at least, none that I could detect so far).


IMHO, the best thing to do in order to solve the issues would be to run the "machine-id" operation at the very end of all the virt-sysprep (and ensure virt-customize also cleans it, since it's creating it in order to ensure --update and --install, among things, won't spit any stacktrace due to missing machine-id).

I'll try to submit a patch, though I don't know OCaml... Shouldn't be that hard.

Cheers,

C.

Comment 9 Laszlo Ersek 2022-03-25 10:46:08 UTC
Let me reach back to comment 2:

(In reply to Alex Schultz from comment #2)
> virt-sysprep by default runs customize which reintroduces the
> /etc/machine-id after running the machine-id operation.
> 
> 
> virt-sysprep -a
> /builddir/build/BUILD/overcloud-full.x86_64.tar-extract/overcloud-full.qcow2
> [   0.0] Examining the guest ...
> ...
> [  96.6] Performing "machine-id" ...
> ...
> [  97.2] Performing "customize" ...
> [  97.2] Setting a random seed
> [  97.2] Setting the machine ID in /etc/machine-id
> ...

I agree this sequence of operations reintroduces "/etc/machine-id", but aren't the *contents* of the reintroduced "/etc/machine-id" different?

The "machine-id" operation [sysprep/sysprep_operation_machine_id.ml] truncates the file:

    let paths = [ "/etc/machine-id";
                  "/var/lib/dbus/machine-id"; ] in
    let paths = List.filter g#is_file paths in
    List.iter g#truncate paths

while "customize" [customize/customize_run.ml] overwrites the zero-sized file with hex-encoded bytes from /dev/urandom:

  let () =
    let etc_machine_id = "/etc/machine-id" in
    let statbuf =
      try Some (g#lstatns etc_machine_id) with G.Error _ -> None in
    (match statbuf with
     | Some { G.st_size = 0L; G.st_mode = mode }
          when (Int64.logand mode 0o170000_L) = 0o100000_L ->
        message (f_"Setting the machine ID in %s") etc_machine_id;
        let id = Urandom.urandom_bytes 16 in
        let id = String.map_chars (fun c -> sprintf "%02x" (Char.code c)) id in
        let id = String.concat "" id in
        let id = id ^ "\n" in
        g#write etc_machine_id id
     | _ -> ()
    ) in
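For readers who do not know OCaml, the two snippets above can be approximated in shell. This is only a sketch: it assumes the guest's root filesystem is reachable as a local directory (the real tools go through the libguestfs API instead), and the function names are hypothetical. The test for "regular file, not a symlink" is meant to approximate g#is_file and the S_IFREG match.

```shell
#!/bin/sh
# Approximation of the sysprep "machine-id" operation: truncate the
# listed files, but only those that exist as regular files.
# $1 is a hypothetical local directory holding the guest's root filesystem.
truncate_machine_ids() (
    root=$1
    for p in /etc/machine-id /var/lib/dbus/machine-id; do
        if [ -f "$root$p" ] && [ ! -L "$root$p" ]; then
            : > "$root$p"
        fi
    done
)

# Approximation of the "customize" logic: fill in /etc/machine-id only
# when it is a regular file of size zero.
set_machine_id_if_empty() (
    file=$1/etc/machine-id
    if [ -f "$file" ] && [ ! -L "$file" ] && [ ! -s "$file" ]; then
        # 16 random bytes, hex-encoded: 32 characters plus a newline.
        id=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
        printf '%s\n' "$id" > "$file"
    fi
)
```

Run in the order machine-id then customize, the file ends up non-empty, which matches the virt-sysprep log in comment 2.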

So what is the actual problem:
- duplicate ID?
- *some* ID that is not a duplicate (or an empty file), which tells systemd that this is not a "first boot"?

If the ask for virt-sysprep is to set "/etc/machine-id" such that systemd thinks this is a "first boot", I don't think that's possible to accommodate. The explanation is given in <https://github.com/libguestfs/libguestfs/commit/d5ce659e2c136fbcf0a0b9058711765cfae6c210> -- in that case, we cannot install the kernel.

So we have two design-level problems here:

- the "customize" and "machine-id" operations are both default operations in virt-sysprep, their desired outputs are in conflict, and the order of their execution is unspecified ("register_operation" [sysprep/sysprep_operation.ml] is called by both modules in unspecified order)

- AIUI, users want virt-sysprep to finish with the machine-id truncated; on the other hand, that way they won't be able to run "virt-customize --update", due to the bug described in the above-noted libguestfs commit.

Please describe the *canonical order* in which the following three commands are supposed to run:

- virt-sysprep
- virt-customize
- guest kernel update

When we know the desired order, we might be able to resolve the conflict between the "machine-id" and "customize" operations. Thanks.

Comment 10 Cédric Jeanneret 2022-03-25 10:54:28 UTC
Hello,

Sooo. the issue is: the image we're modifying is then used to deploy multiple hosts - being virtual, or hardware (usually, that's hardware, really). The fact the "golden image" has a generated /etc/machine-id file leads to a situation where all the nodes deployed with this golden image end with the very same machine-id - leading to really nasty to debug issues.

For instance, it took me a week to find out the root cause of a *networking* issue with openvswitch.... Due to the fact all my nodes were having the same machine-id, leading to openvswitch generating the very same MAC address for its different ports.

I just pushed a patch that should make things a bit better - basically, ensuring the "machine-id" operation is launched last in virt-sysprep. I'll have to figure out how to do the same within virt-customize, but at least, it should correct half of the problem (hopefully).

Thing is: the /etc/machine-id should be cleared as a very last step. That's just it. Do whatever is needed to actually install packages, configure things and so on - but at the very end, it should be cleared; and, as said in my commit message, if the operator wants to keep the generated machine-id, it's possible, at least in virt-sysprep, since they can deactivate the "machine-id" operation.

IMHO, doing so is the best thing - the machine-id is actually cleared by default as expected, and there's a way to not clear it.

Does my explanation help understanding the issue?

Cheers,

C.

Comment 11 Laszlo Ersek 2022-03-25 10:56:09 UTC
I need to correct myself, the order between the customize and machine-id operations of virt-sysprep is not unspecified. In "sysprep/sysprep_operation_customize.ml", we have:

let op = {
  defaults with
    order = 99;                         (* Run it after everything. *)
    name = "customize";

The suggested update at <https://github.com/rwmjones/guestfs-tools/pull/6/commits/c1f7dfe17e42142569813259cb8f6f9325bcee85>, namely to set the order of "machine-id" to 999, conflicts with the above comment (i.e., "run customize after everything"). Plus I don't know why that doesn't make the machine-id generation part of customize entirely superfluous in the first place -- why have a non-empty "machine-id" temporarily? Do we run the kernel update as a part of "customize" (so before "machine-id" moved to the end), but after "customize" itself generates "machine-id" temporarily?

Comment 13 Laszlo Ersek 2022-03-25 11:12:42 UTC
I think the actual issue with <https://github.com/libguestfs/libguestfs/commit/d5ce659e2c136fbcf0a0b9058711765cfae6c210> is that it sets machine-id unconditionally and permanently. It was intended as a work-around for a kernel install bug, but its effect & lifetime extend far beyond that.

In my opinion, what we should do is restrict the machine-id generation logic in "customize" to just the --install and --update options, and at the end of those options, "machine-id" should be rolled back at once to its original state. In other words, assuming machine-id does not exist in the guest being modified right off the bat, it should be created *temporarily*, only as long as --install and/or --update runs.

This way it will not matter
- whether the guest has a non-empty machine-id or not at the beginning,
- whether machine-id runs first or customize runs first (if machine-id runs first, then customize will restore that state, so we'll end up with a truncated machine id; if customize runs first, then machine-id runs second, and will truncate any machine ID, so we'll end up with a truncated one again).

In brief, the logic in <https://github.com/libguestfs/libguestfs/commit/d5ce659e2c136fbcf0a0b9058711765cfae6c210> seems justified, but it should be temporary, not persistent. IMO, moving the "machine-id" operation to the tail of the operation list just papers over the issue.
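The "temporary, not persistent" idea can be sketched in shell as a wrapper that records the original state of the file, generates an ID only if one is missing, runs a command, and then rolls the file back. This is an illustration under assumptions, not the guestfs-tools implementation: the guest root is taken to be a local directory with an existing etc/ subdirectory, and with_temporary_machine_id is a hypothetical name.

```shell
#!/bin/sh
# Run a command with a machine ID guaranteed to be present, then restore
# /etc/machine-id to its original state (missing or empty).
# Usage: with_temporary_machine_id ROOT COMMAND [ARGS...]
with_temporary_machine_id() (
    file=$1/etc/machine-id
    shift

    # Remember what the guest looked like before we touched it.
    if [ ! -e "$file" ]; then
        original=missing
    elif [ -f "$file" ] && [ ! -s "$file" ]; then
        original=empty
    else
        original=keep          # existing content is left alone
    fi

    if [ "$original" != keep ]; then
        # 16 random bytes, hex-encoded, newline-terminated.
        { od -An -N16 -tx1 /dev/urandom | tr -d ' \n'; echo; } > "$file"
    fi

    status=0
    "$@" || status=$?

    # Roll back only what we created ourselves.
    case $original in
        missing) rm -f "$file" ;;
        empty)   : > "$file" ;;
    esac
    exit "$status"
)
```

With this shape, the relative order of the "machine-id" and "customize" operations stops mattering, because the wrapper never leaves behind an ID it generated.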

Comment 14 Cédric Jeanneret 2022-03-25 11:58:32 UTC
Hello,

Yep - machine-id should be temporary and not stay in the image at the end, that's the very core of the issue. Sorry if the previous attempts at explaining it weren't successful.

As for the method: scoping the generation to --install and --update, then removing it, sounds really good. Moving the machine-id operation like I proposed is probably the only thing I can actually contribute, since I don't know the language at all :/.

IMHO, the best way to do this kind of thing is to rely on the same logic as described in machine-id(5):
- if /etc/machine-id doesn't exist, is empty or contains "uninitialized", then the generated machine-id should be temporary and removed after the action(s)
- if the file exists and contains a valid ID, then it should be kept as-is, and stay after the actions

That's the TL;DR of the "FIRST BOOT SEMANTICS" of the machine-id(5) manpage[1].

Of course, I can try to provide a patch with that. Though it will take some time, for obvious reasons (language, project - I know nothing yet ;)).

Regarding the currently open PR - it may be a "functional workaround", and would allow to actually mimic the above logic, since the operator can disable the "machine-id" thingy. WDYT?


Cheers,

C.

[1] https://www.freedesktop.org/software/systemd/man/machine-id.html

Comment 15 Richard W.M. Jones 2022-03-25 12:40:04 UTC
I'm still unclear what the actual issue is. Could you describe the steps needed to
reproduce the issue locally?

Comment 16 Cédric Jeanneret 2022-03-25 13:00:57 UTC
Hello Richard,

so, easy:

- take cs9 qcow2 image from here, for instance: https://cloud.centos.org/centos/9-stream/x86_64/images/ (latest is good enough)
- edit the image, like setting a root password
- create some VMs based on that disk, for instance using:
    for i in 1 2 3; do qemu-img create -o backing_file=CentOS-Stream-GenericCloud-9-20220325.0.x86_64.qcow2,backing_fmt=qcow2 -f qcow2 vm${i}.qcow2 20G; done
- boot those VMs in your preferred virtualization service (libvirt for instance)
- check the actual machine-id on those booted VMs
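For the last step, a small helper can make duplicates obvious. The sketch below reads "name id" pairs on standard input and prints every ID that appears more than once; the function name and the input format are assumptions made for this example.

```shell
#!/bin/sh
# Print each machine ID that occurs more than once in the input.
# Input: one "name id" pair per line, e.g. "vm1 0123...cdef".
duplicate_machine_ids() {
    awk '{ print $2 }' | sort | uniq -d
}
```

One possible way to feed it, assuming the VMs are reachable over ssh under the hypothetical names vm1 to vm3:

    for h in vm1 vm2 vm3; do printf '%s %s\n' "$h" "$(ssh "$h" cat /etc/machine-id)"; done | duplicate_machine_ids

Any output at all means at least two VMs share an ID.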

This is, more or less, what OSP does when we provision hardware in order to get computes, controllers and any other roles in OpenStack[1]: we have a so-called "gold image" in a service, which is then replicated on the hardware in order to get our nodes running with already installed packages and so on. But that's not all: since instances in OSP are also based on disk images, if we customize a CS9 image before uploading it to glance (an OSP service storing the images as-is for later use in order to start actual computes), we'll face the same issue: the machine-id will already be there, and we'll end up with a bunch of VMs having the exact same ID - which is utterly wrong.

To go further, you may want to install openvswitch, and create some bridge (ovs-vsctl add-br foo), and create some vlan associated to this bridge (ovs-vsctl add-port foo vlan103 tag=103 -- set interface vlan103 type=internal). Doing this on all the VMs having the same machine-id will lead to the vlan103 interface having the same MAC address (ip link show vlan103). This was described first as an openvswitch issue here: https://bugzilla.redhat.com/show_bug.cgi?id=2067216 - that was before I found out the machine-id was the same on all the nodes...

So, basically: if a machine-id is needed for some of the customization operations, there's no problem, generate one, use it, abuse it. But, please, clean that generated ID once the operation is over, else it breaks the very idea of a "golden image". Hence it "just" breaks anything cloud related ;).

Is it clearer?

Cheers,

C.



[1] mostly here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/director_installation_and_usage/assembly_provisioning-bare-metal-nodes-before-deploying-the-overcloud - but you probably don't want to read all the doc.

Comment 17 Cédric Jeanneret 2022-03-25 13:08:54 UTC
* for the "modify the image":

╰─$ virt-sysprep -a CentOS-Stream-GenericCloud-9-20220325.0.x86_64.qcow2 --root-password "password:fooBar" --selinux-relabel                       
[   0.0] Examining the guest ...
[...]
[   3.6] Performing "machine-id" ... <- this cleans up the existing one, GOOD!
[...]
[   3.9] Performing "customize" ...
[   4.0] Setting a random seed
[   4.0] Setting the machine ID in /etc/machine-id <- well.. ok, get a new one
[   4.0] Setting passwords
[   5.2] SELinux relabelling
[  15.8] Performing "lvm-uuids" ...

This shows perfectly the dangling "machine-id". Also, we can see there's already an operation happening after the "customize": lvm-uuids. So moving the "machine-id" at the very end shouldn't be that terrible, though we might want to update the comment pointed by Laszlo earlier :)

Comment 21 Cédric Jeanneret 2022-03-31 05:15:54 UTC
Hello there,

Just a quick (public) hint "how to truncate the file" without pain:
when you call virt-sysprep or virt-customize, just add "--truncate /etc/machine-id" as the very last option. While it will still create the machine-id once it hits the "customize" operations, it will be cleared with the --truncate. It works pretty fine and, since the machine-id is generated, the file will exist anyway.
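A minimal sketch of that workaround as a wrapper, so the option cannot be forgotten or misplaced. The function name and the VIRT_CUSTOMIZE override variable are hypothetical; only the --truncate option itself comes from the tool.

```shell
#!/bin/sh
# Call virt-customize with the caller's options and always append
# "--truncate /etc/machine-id" as the very last option.
# VIRT_CUSTOMIZE is a hypothetical override, e.g. to point at
# virt-sysprep or at a stub for dry runs.
customize_image() {
    "${VIRT_CUSTOMIZE:-virt-customize}" "$@" --truncate /etc/machine-id
}
```

For instance:

    customize_image -a CentOS-Stream-GenericCloud-9-20220325.0.x86_64.qcow2 --root-password "password:fooBar"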

That's what I'm doing now, and what we've done in the TripleO and OSP CI tools, in order to work around the main issue.

Apparently, in TripleO/OSP world, we may even get rid of anything virt-* related - though it makes our customization a bit more complicated.

I'll continue to watch this issue, in the hope to get an actual resolution/correction of the issue within libguestfs tools, but, at least on our (OSP) side, the issue isn't a blocker anymore.

Cheers,

C.

