Bug 1760873 - [UPI][BAREMETAL] RHCOS 4.2 installation does not work when trying to use a secondary disk for /var/lib/containers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-11 14:55 UTC by Benjamin Chardi
Modified: 2023-09-07 20:47 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:42:04 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (Last Updated: 2019-10-16 06:42:11 UTC)

Description Benjamin Chardi 2019-10-11 14:55:17 UTC
Description of problem:

When trying to install an OCP42 UPI Baremetal cluster using RHCOS 4.2 as the OS on the baremetal nodes, with a vdb1 partition used for cri-o storage at /var/lib/containers, the cri-o service is unable to start on the installed nodes.
The following is the ignition config used to mount the vdb1 partition at /var/lib/containers for cri-o storage on the bootstrap host:

# cat bootstrap.ign
...
  "storage": {
...
    "disks": [
      {
        "device": "/dev/vdb",
        "wipeTable": true,
        "partitions": [
          {
            "label": "data01",
            "number": 1,
            "size": 0
          }
        ]
      }
    ],
    "filesystems": [
      {
        "mount": {
          "device": "/dev/vdb1",
          "format": "xfs",
          "label": "data01"
        }
      }
    ]
  },
  "systemd": {
    "units": [
...
      {
        "name": "var-lib-containers.mount",
        "enabled": true,
        "contents": "[Mount]\nWhat=/dev/vdb1\nWhere=/var/lib/containers\nType=xfs\nOptions=defaults\n\n[Install]\nWantedBy=local-fs.target"
      },
      {
        "name": "var-lib-containers-relabel.service",
        "enabled": true,
        "contents": "[Unit]\nAfter=var-lib-containers.mount\n[Service]\nType=oneshot\nExecStart=/sbin/restorecon /var/lib/containers\n\n[Install]\nWantedBy=local-fs.target"
      }
...
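
For readability, the two systemd units embedded above (decoded from the escaped JSON strings) are:

# var-lib-containers.mount
[Mount]
What=/dev/vdb1
Where=/var/lib/containers
Type=xfs
Options=defaults

[Install]
WantedBy=local-fs.target

# var-lib-containers-relabel.service
[Unit]
After=var-lib-containers.mount
[Service]
Type=oneshot
ExecStart=/sbin/restorecon /var/lib/containers

[Install]
WantedBy=local-fs.target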


The RHCOS installation finishes successfully on the bootstrap node and vdb1 is mounted on /var/lib/containers as expected:

[core@ocp4-bootstrap ~]$ mount | grep vdb1
/dev/vdb1 on /var/lib/containers type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

[core@ocp4-bootstrap ~]$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           7.9G  6.6M  7.9G   1% /run
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/vda3        49G  2.3G   47G   5% /sysroot
/dev/vda2       976M   69M  841M   8% /boot
/dev/vda1        94M  6.6M   88M   8% /boot/efi
/dev/vdb1        20G  176M   20G   1% /var/lib/containers
tmpfs           1.6G     0  1.6G   0% /run/user/1000



BUT the cri-o service is unable to start on the installed bootstrap node, because its dependency, the crio-wipe service, fails to start:

[core@ocp4-bootstrap ~]$ sudo systemctl status cri-o
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o

Oct 11 14:20:00 ocp4-bootstrap.info.net systemd[1]: Dependency failed for Open Container Initiative Daemon.
Oct 11 14:20:00 ocp4-bootstrap.info.net systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.


[core@ocp4-bootstrap ~]$ sudo systemctl status crio-wipe
● crio-wipe.service - CRI-O Auto Update Script
   Loaded: loaded (/usr/lib/systemd/system/crio-wipe.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2019-10-11 14:26:25 UTC; 16s ago
  Process: 2929 ExecStart=/bin/bash /usr/libexec/crio/crio-wipe/crio-wipe.bash (code=exited, status=1/FAILURE)
 Main PID: 2929 (code=exited, status=1/FAILURE)

Oct 11 14:26:25 ocp4-bootstrap.info.net systemd[1]: Starting CRI-O Auto Update Script...
Oct 11 14:26:25 ocp4-bootstrap.info.net bash[2929]: Old version not found
Oct 11 14:26:25 ocp4-bootstrap.info.net bash[2929]: Wiping storage
Oct 11 14:26:25 ocp4-bootstrap.info.net bash[2929]: rm: cannot remove '/var/lib/containers': Device or resource busy
Oct 11 14:26:25 ocp4-bootstrap.info.net systemd[1]: crio-wipe.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 14:26:25 ocp4-bootstrap.info.net systemd[1]: crio-wipe.service: Failed with result 'exit-code'.
Oct 11 14:26:25 ocp4-bootstrap.info.net systemd[1]: Failed to start CRI-O Auto Update Script.


As can be seen, crio-wipe.service cannot start because the script /usr/libexec/crio/crio-wipe/crio-wipe.bash tries to remove the /var/lib/containers directory, which is the mount point where vdb1 is mounted.
I believe crio-wipe expects /var/lib/containers to be a regular directory rather than a mount point (which cannot be removed while mounted). This inconsistency makes the OCP42 cluster installation fail when /var/lib/containers storage is configured on a secondary disk (vdb) on the masters and workers. I believe the use case described in this BZ will be common when installing OCP42 on baremetal servers.

Possible solution:

Use 'rm -rf /var/lib/containers/*' instead of 'rm -rf /var/lib/containers' in the crio-wipe script (a sketch of the change is shown below).
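
A minimal sketch of the suggested change, assuming the wipe step is a plain rm (illustration only; this is not the actual contents of crio-wipe.bash):

# Illustration of the suggested wipe behaviour, not the real crio-wipe.bash.
CONTAINERS_DIR="/var/lib/containers"

# Current behaviour fails with "Device or resource busy" when the directory
# is a mount point:
#   rm -rf "$CONTAINERS_DIR"

# Suggested behaviour: remove only the directory's contents, which works
# whether or not /var/lib/containers is a separate mount:
rm -rf "${CONTAINERS_DIR:?}"/*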


Version-Release number of selected component (if applicable):

rhcos-4.2.0-0.nightly-2019-08-28-152644-x86_64

[core@ocp4-bootstrap ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.2

[core@ocp4-bootstrap ~]$ rpm -q cri-o
cri-o-1.14.10-0.8.dev.rhaos4.2.gitaf00350.el8.x86_64



How reproducible:


Steps to Reproduce:
1. Follow the standard procedure to install an OCP42 UPI Baremetal cluster.
2. Modify the bootstrap, worker and master ignition files to use custom storage for cri-o (/var/lib/containers) as described in this case.
3. The OCP42 installation fails on the bootstrap node because the cri-o and crio-wipe services cannot start (crio-wipe cannot remove /var/lib/containers, which is a mount point).

Actual results:

OCP42 installation fails on the bootstrap node.


Expected results:

OCP42 installation succeeds.


Additional info:

Related bugzillas

https://bugzilla.redhat.com/show_bug.cgi?id=1699107
https://bugzilla.redhat.com/show_bug.cgi?id=1692513

Comment 1 Colin Walters 2019-10-11 14:59:40 UTC
> Use 'rm -rf /var/lib/containers/*' instead of 'rm -rf /var/lib/containers' on crio-wipe.service

Agreed!

Comment 2 Peter Hunt 2019-10-11 15:38:52 UTC
This should be fixed as of about a month ago, it seems your build is a bit older than that. Can you try a newer OCP version? I can't find the exact date it got merged, but every release listed here https://releases-rhcos-art.cloud.privileged.psi.redhat.com/ has it (it merged somewhere before 1.14.10-0.18.dev.rhaos4.2.git3725006.el8 but after cri-o-1.14.10-0.8.dev.rhaos4.2.gitaf00350.el8.x86_64)

Comment 4 Benjamin Chardi 2019-10-13 17:20:18 UTC
(In reply to Peter Hunt from comment #2)
> This should be fixed as of about a month ago, it seems your build is a bit
> older than that. Can you try a newer OCP version? I can't find the exact
> date it got merged, but every release listed here
> https://releases-rhcos-art.cloud.privileged.psi.redhat.com/ has it (it
> merged somewhere before 1.14.10-0.18.dev.rhaos4.2.git3725006.el8 but after
> cri-o-1.14.10-0.8.dev.rhaos4.2.gitaf00350.el8.x86_64)

Tested with the newest OCP42 and RHCOS 4.2 releases, and it is working as expected:

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/4.2.0-rc.5
https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.2.0-rc.5

Also, the workaround used to relabel the /var/lib/containers mount point is not needed; the ignition-relabel service is doing the trick.
So the required ignition config is:

# cat bootstrap.ign
...
  "storage": {
...
    "disks": [
      {
        "device": "/dev/vdb",
        "wipeTable": true,
        "partitions": [
          {
            "label": "data01",
            "number": 1,
            "size": 0
          }
        ]
      }
    ],
    "filesystems": [
      {
        "mount": {
          "device": "/dev/vdb1",
          "format": "xfs",
          "label": "data01"
        }
      }
    ]
  },
  "systemd": {
    "units": [
...
      {
        "name": "var-lib-containers.mount",
        "enabled": true,
        "contents": "[Mount]\nWhat=/dev/vdb1\nWhere=/var/lib/containers\nType=xfs\nOptions=defaults\n\n[Install]\nWantedBy=local-fs.target"
      },

... 
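
As a quick sanity check on the node, the same commands from the original report can be used (exact output will vary with the build):

[core@ocp4-bootstrap ~]$ mount | grep vdb1                 # /dev/vdb1 mounted on /var/lib/containers
[core@ocp4-bootstrap ~]$ sudo systemctl status crio-wipe   # no longer fails with 'Device or resource busy'
[core@ocp4-bootstrap ~]$ sudo systemctl status crio        # crio starts normally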

From my point of view this case can be closed. Many thanks!

Comment 6 errata-xmlrpc 2019-10-16 06:42:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

