Bug 1748638

Summary: openshift-install 4.2 GCloud does not install /lib/udev/rules.d/65-gce-disk-naming.rules

Product: OpenShift Container Platform
Component: RHCOS
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Craig Rodrigues <rodrigc>
Assignee: Micah Abbott <miabbott>
QA Contact: Michael Nguyen <mnguyen>
CC: bbreard, dustymabe, imcleod, jligon, nstielau
Type: Bug
Last Closed: 2019-10-16 06:40:12 UTC

Description Craig Rodrigues 2019-09-03 23:27:39 UTC
Description of problem:

Version-Release number of the following components:

openshift-install version
openshift-install unreleased-master-1655-g4f3e73a0143ba36229f42e8b65b6e65342bb826b
built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b
release image registry.svc.ci.openshift.org/origin/release:4.2

How reproducible:

Steps to Reproduce:
1.  Get the source code for openshift-install from https://github.com/openshift/installer and compile it

2.  Create an install-config.yaml file that looks like:

apiVersion: v1
baseDomain: gcp.openshift.portworx.com
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: craig-gcp-1
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: portworx-eng
    region: us-west2
pullSecret: [redacted]
sshKey: [redacted]


3.  Install OpenShift in GCP with:

openshift-install create cluster

4.  OpenShift will provision a cluster.  Each node in the cluster runs this OS:

NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190829.1"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190829.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20190829.1

5.  Log into one of the nodes and look at the files under:

/etc/udev/
/lib/udev

These files are missing:

/lib/udev/rules.d/64-gce-disk-removal.rules
/lib/udev/rules.d/99-gce.rules
/lib/udev/rules.d/65-gce-disk-naming.rules

I work for Portworx ( https://www.portworx.com ) and found this problem while installing
OpenShift 4.2 in GCloud and then dynamically provisioning Portworx storage devices using the StorageCluster interface in OpenShift.

The absence of these udev rules on the OpenShift nodes breaks Portworx storage.
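
For illustration, the symptom is easy to spot on an affected node (hypothetical session; the
rules and the google-* by-id symlinks are both absent):

$ ls /lib/udev/rules.d/ | grep gce      # no output: the GCE rules are missing
$ ls /dev/disk/by-id/ | grep google     # no output: no google-* symlinks for attached disks

On a node with the rules in place, both commands produce output.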

Comment 1 Craig Rodrigues 2019-09-03 23:29:38 UTC
This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1747575

Comment 2 Craig Rodrigues 2019-09-03 23:32:28 UTC
If I create a normal GKE cluster in GCloud using the GCloud web UI and log into one of the nodes,
I can see that these files are present:

/lib/udev/rules.d/64-gce-disk-removal.rules
/lib/udev/rules.d/99-gce.rules
/lib/udev/rules.d/65-gce-disk-naming.rules


Our storage product (Portworx) works fine when installed on plain GKE.

However, when we install our product on OpenShift 4 in GCloud, we hit problems dynamically provisioning disks.

This problem will hit other storage providers using OpenShift 4 in GCloud.

Comment 3 Craig Rodrigues 2019-09-03 23:36:32 UTC
So in GKE, these files come from the gce-compute-image-packages package:

for f in $(find /lib/udev/rules.d -name "*gc*" )      
> do
> dpkg -S $f
> done
gce-compute-image-packages: /lib/udev/rules.d/64-gce-disk-removal.rules
gce-compute-image-packages: /lib/udev/rules.d/99-gce.rules
gce-compute-image-packages: /lib/udev/rules.d/65-gce-disk-naming.rules


Looking around, I found these rules on GitHub:

https://github.com/GoogleCloudPlatform/compute-image-packages/tree/master/packages/google-compute-engine/src/lib/udev/rules.d/
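
For reference, the heart of 65-gce-disk-naming.rules is a small set of udev rules that turn a
GCE disk's deviceName into stable /dev/disk/by-id/google-* symlinks.  Paraphrased and abridged
from the upstream file above (not a verbatim quote):

# GCE reports the deviceName as the SCSI serial; import it.
KERNEL=="sd*|vd*", IMPORT{program}="scsi_id --export --whitelisted -d $tempnode"
# Expose it via ID_SERIAL and create by-id symlinks for the disk and its partitions.
KERNEL=="sd*|vd*", ENV{ID_SERIAL_SHORT}=="?*", ENV{ID_SERIAL}="0Google_PersistentDisk_$env{ID_SERIAL_SHORT}"
KERNEL=="sd*|vd*", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL_SHORT}=="?*", SYMLINK+="disk/by-id/google-$env{ID_SERIAL_SHORT}"
KERNEL=="sd*|vd*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL_SHORT}=="?*", SYMLINK+="disk/by-id/google-$env{ID_SERIAL_SHORT}-part%n"

These google-* symlinks are what storage products key on to find disks by their GCE device name.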

Comment 5 Micah Abbott 2019-09-11 15:49:27 UTC
Craig, does Portworx require `64-gce-disk-removal.rules`?  It looks like it just forces a lazy unmount of the device and logs a message to the journal.  I would imagine the unmounting of the device would be handled at a higher layer rather than by the udev rules.
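
(For context, the upstream removal rule is roughly the following; reconstructed from the linked
source, not quoted verbatim:

ACTION=="remove", SUBSYSTEM=="block", KERNEL=="sd*|vd*|nvme*", RUN+="/bin/sh -c '/bin/umount -fl /dev/$name && /usr/bin/logger -p daemon.warn ...'"

i.e. a forced lazy unmount of anything still mounted from the hot-removed device, plus a warning
in the journal.)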

Comment 6 Craig Rodrigues 2019-09-11 23:21:15 UTC
Micah,

At Portworx, we have extensive tests for cloud storage.

As you fix this bug on GCloud, could you run the tests we have for mounting storage?
Our tests are open source.

You can do the following.

1.  Provision an Openshift cluster on GCloud

2.  Get direct access to one of the nodes and log into it.

3.  Read: https://github.com/libopenstorage/cloudops/blob/master/gce/README.md
 

4.  Use the following container to check out and run the tests on GCloud, replacing the environment
    variables with your GCloud setup:

docker run \
       --rm \
       -t \
       -i \
       -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
       -e GCE_INSTANCE_NAME=<gce-instance-name> \
       -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
       -e GCE_INSTANCE_PROJECT=<gce-project-name> \
       -v $PWD:/go/src/github.com/libopenstorage \
       -w /go/src/github.com/libopenstorage \
       hatsunemiku/golang-dev-docker \
       bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'

Comment 7 Craig Rodrigues 2019-09-11 23:22:15 UTC
And for Azure:

1.  Provision an Openshift cluster on Azure

2.  Get direct access to one of the nodes and log into it.

3.  Read: https://github.com/libopenstorage/cloudops/blob/master/azure/README.md

4.  Use the following container to check out and run the tests on Azure, replacing the environment
    variables with your Azure setup:

docker run \
       --rm \
       -t \
       -i \
       -e AZURE_INSTANCE_ID=<instance-id> \
       -e AZURE_INSTANCE_REGION=<instance-region> \
       -e AZURE_SCALE_SET_NAME=<scale-set-name> \
       -e AZURE_SUBSCRIPTION_ID=<subscription-id> \
       -e AZURE_RESOURCE_GROUP_NAME=<resource-group-name-of-instance> \
       -e AZURE_ENVIRONMENT=<azure-cloud-environment> \
       -e AZURE_TENANT_ID=<tenant-id> \
       -e AZURE_CLIENT_ID=<client-id> \
       -e AZURE_CLIENT_SECRET=<client-secret> \
       -v $PWD:/go/src/github.com/libopenstorage \
       -w /go/src/github.com/libopenstorage \
       hatsunemiku/golang-dev-docker \
       bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'

Comment 8 Craig Rodrigues 2019-09-12 00:39:18 UTC
You can run additional tests on GCE by doing:

docker run \
       --rm \
       -t \
       -i \
       -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
       -e GCE_INSTANCE_NAME=<gce-instance-name> \
       -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
       -e GCE_INSTANCE_PROJECT=<gce-project-name> \
       -v $PWD:/go/src/github.com/libopenstorage \
       -w /go/src/github.com/libopenstorage \
       hatsunemiku/golang-dev-docker \
       bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops/gce && go test -v'

Comment 10 Micah Abbott 2019-09-12 18:37:12 UTC
Rules have been added to the RHCOS config and will be present in RHCOS 42.80.20190911.0 and later.
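
For anyone who wants to verify on a live cluster, a check along these lines should work (a sketch;
<node-name> is a placeholder, and the expected listing assumes the rules land in /usr/lib/udev/rules.d):

$ oc debug node/<node-name> -- chroot /host ls /usr/lib/udev/rules.d/ | grep gce
64-gce-disk-removal.rules
65-gce-disk-naming.rules
99-gce.rules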

Comment 12 Craig Rodrigues 2019-09-13 01:41:06 UTC
Regarding your question, I looked at:
https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/packages/google-compute-engine/src/lib/udev/rules.d/64-gce-disk-removal.rules

and I don't understand why Google decided to run umount -f
on device removal.  That's weird.

Portworx doesn't depend on this.

Comment 13 Micah Abbott 2019-09-13 02:33:05 UTC
Thanks for the reply, Craig.  

We included the removal rule for completeness.  We don't think it should have any adverse effects.

I monkeyed around with the test you provided in comment #7, got it working with `podman` (there is no `docker` on RHCOS), and got it to pass after also mounting the host's `/dev` into the container.
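
Roughly the invocation from comment #6, with `podman` substituted for `docker` and the host's
`/dev` mounted into the container (a sketch, not the exact command I ran):

sudo podman run \
       --rm -t -i \
       -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
       -e GCE_INSTANCE_NAME=<gce-instance-name> \
       -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
       -e GCE_INSTANCE_PROJECT=<gce-project-name> \
       -v /dev:/dev \
       -v $PWD:/go/src/github.com/libopenstorage \
       -w /go/src/github.com/libopenstorage \
       hatsunemiku/golang-dev-docker \
       bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'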

I confirmed that the `/dev/disk/by-id` for attached disks showed up while the test was running.

```
$ ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root  9 Sep 13 02:29 google-openstorage-test-6e5443e9-1ef2-42da-9ae1-5ec6928897d8 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Sep 12 19:34 google-persistent-disk-0 -> ../../sda
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 google-persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Sep 12 19:35 google-persistent-disk-0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 google-persistent-disk-0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root  9 Sep 13 02:29 scsi-0Google_PersistentDisk_openstorage-test-6e5443e9-1ef2-42da-9ae1-5ec6928897d8 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Sep 12 19:35 scsi-0Google_PersistentDisk_persistent-disk-0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0-part3 -> ../../sda3
```

Comment 14 Craig Rodrigues 2019-09-13 20:28:07 UTC
Micah,

Thanks for working on this, and for running the libopenstorage/cloudops tests to verify.

Just out of curiosity, to solve this problem did you bake the udev files into the RHCOS image,
or did you add them to the code in afterburn, which does cloud-specific provisioning?

Comment 15 Micah Abbott 2019-09-16 13:21:59 UTC
In this case, we've baked the rules into the RHCOS image.  In the future, we want to break the rules into a separate package (see BZ#1751310) so that we can include them that way.

Comment 17 Craig Rodrigues 2019-09-23 20:08:39 UTC
Micah,

I verified this fix on the latest RHCOS image with openshift-install on GCP.

Specifically, I used the latest openshift-install to provision an OpenShift 4 cluster in GCP,
then provisioned Portworx, created a StorageCluster, and observed that the disks
were created and mounted properly.

Thanks a lot for working on this fix, and running the libopenstorage/cloudops tests.

Comment 18 Craig Rodrigues 2019-10-07 22:32:44 UTC
Micah,

Thanks for fixing this problem on GCP and Azure.

With this fix, I was able to set up two OpenShift 4.2 clusters (one on GCP, one on Azure).

Ryan Wallner, on my team, was able to test Portworx Disaster Recovery (DR): he migrated
applications and data running on one OpenShift cluster to the other cluster running in a different cloud.
Ryan made a video:

"Cross-Cloud Application Migration with Google Cloud and Microsoft Azure on OpenShift 4.2"
https://youtu.be/ZhdpE6sl_jM


Thanks again, Micah, for making this possible.

Comment 19 errata-xmlrpc 2019-10-16 06:40:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922