Description of problem:

Version-Release number of the following components:

openshift-install version
openshift-install unreleased-master-1655-g4f3e73a0143ba36229f42e8b65b6e65342bb826b
built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b
release image registry.svc.ci.openshift.org/origin/release:4.2

How reproducible:

Steps to Reproduce:

1. Get the source code for openshift-install from https://github.com/openshift/installer and compile it.

2. Create an install-config.yaml file that looks like:

```
apiVersion: v1
baseDomain: gcp.openshift.portworx.com
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: craig-gcp-1
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: portworx-eng
    region: us-west2
pullSecret: [redacted]
sshKey: [redacted]
```

3. Install Openshift in GCP with: openshift-install create cluster

4. Openshift will provision a cluster. Each node in the cluster runs this OS:

```
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190829.1"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190829.1 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20190829.1
```

5. If I log into one of the nodes and look at the files in /etc/udev/ and /lib/udev/, I see that these files are missing:

/lib/udev/rules.d/64-gce-disk-removal.rules
/lib/udev/rules.d/99-gce.rules
/lib/udev/rules.d/65-gce-disk-naming.rules

I work for Portworx (https://www.portworx.com) and found this problem by installing Openshift 4.2 in GCloud and then dynamically provisioning Portworx storage devices using the StorageCluster interface in Openshift. The lack of the necessary udev files on the Openshift nodes breaks Portworx storage.
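For anyone who wants to reproduce the check in step 5 without SSHing straight to a node, something like the following should work (a sketch; the node name is a placeholder, and on RHCOS the rules normally live under /usr/lib/udev/rules.d, with /lib/udev as a symlink):

```
# List any GCE-related udev rules on a node through the debug container
oc get nodes
oc debug node/<node-name> -- chroot /host sh -c \
  'ls -l /usr/lib/udev/rules.d/ /etc/udev/rules.d/ | grep -i gce'
# On the affected image nothing matches: 64-gce-disk-removal.rules,
# 65-gce-disk-naming.rules and 99-gce.rules are all absent.
```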
This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1747575
If I create a normal GKE cluster in GCloud using the GCloud web UI, I can log into one of the nodes and see that these files are present:

/lib/udev/rules.d/64-gce-disk-removal.rules
/lib/udev/rules.d/99-gce.rules
/lib/udev/rules.d/65-gce-disk-naming.rules

Our storage product (Portworx) works fine in that case, when installed on plain GKE. However, when we install our product on Openshift 4 in GCloud, we hit problems when dynamically provisioning disks. This problem will hit other storage providers using Openshift 4 in GCloud as well.
So in GKE, these files are coming from the gce-compute-image-packages package:

```
$ for f in $(find /lib/udev/rules.d -name "*gc*")
> do
>   dpkg -S $f
> done
gce-compute-image-packages: /lib/udev/rules.d/64-gce-disk-removal.rules
gce-compute-image-packages: /lib/udev/rules.d/99-gce.rules
gce-compute-image-packages: /lib/udev/rules.d/65-gce-disk-naming.rules
```

Looking around, I found these rules on GitHub: https://github.com/GoogleCloudPlatform/compute-image-packages/tree/master/packages/google-compute-engine/src/lib/udev/rules.d/
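For comparison, on an RHCOS/RHEL node the equivalent ownership query would use rpm rather than dpkg (a sketch only; on the broken image the files are simply missing, and which package should own them on RHCOS is exactly what is in question):

```
# Ask rpm which package owns the GCE udev rules (RHCOS keeps them under /usr/lib)
for f in /usr/lib/udev/rules.d/*gce*.rules; do
    rpm -qf "$f"
done
```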
Craig, does Portworx require the `64-gce-disk-removal.rules` rule? It looks like it just forces a lazy unmount of the device and logs a message to the journal. I would imagine that unmounting the device would be handled at a higher layer rather than by the udev rules.
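For reference, a removal rule of that kind boils down to something like this (an illustrative sketch of the pattern only, not the verbatim contents of Google's 64-gce-disk-removal.rules):

```
# Illustrative only: on block-device removal, force a lazy unmount and log a warning
ACTION=="remove", SUBSYSTEM=="block", KERNEL=="sd*|vd*|nvme*", RUN+="/bin/sh -c '/bin/umount -fl /dev/%k && /usr/bin/logger -p daemon.warn hot-removed /dev/%k while it was still mounted'"
```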
Micah,

At Portworx, we have extensive tests for cloud storage. As you fix this bug on GCloud, could you run the tests we have for mounting storage? Our tests are open source. You can do the following:

1. Provision an Openshift cluster on GCloud.
2. Get direct access to one of the nodes and log into it.
3. Read: https://github.com/libopenstorage/cloudops/blob/master/gce/README.md
4. Use the following container to check out and run the tests on GCloud; replace the environment variables with your GCloud setup:

```
docker run \
  --rm \
  -t \
  -i \
  -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
  -e GCE_INSTANCE_NAME=<gce-instance-name> \
  -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
  -e GCE_INSTANCE_PROJECT=<gce-project-name> \
  -v $PWD:/go/src/github.com/libopenstorage \
  -w /go/src/github.com/libopenstorage \
  hatsunemiku/golang-dev-docker \
  bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'
```
And for Azure:

1. Provision an Openshift cluster on Azure.
2. Get direct access to one of the nodes and log into it.
3. Read: https://github.com/libopenstorage/cloudops/blob/master/azure/README.md
4. Use the following container to check out and run the tests on Azure; replace the environment variables with your Azure setup:

```
docker run \
  --rm \
  -t \
  -i \
  -e AZURE_INSTANCE_ID=<instance-id> \
  -e AZURE_INSTANCE_REGION=<instance-region> \
  -e AZURE_SCALE_SET_NAME=<scale-set-name> \
  -e AZURE_SUBSCRIPTION_ID=<subscription-id> \
  -e AZURE_RESOURCE_GROUP_NAME=<resource-group-name-of-instance> \
  -e AZURE_ENVIRONMENT=<azure-cloud-environment> \
  -e AZURE_TENANT_ID=<tenant-id> \
  -e AZURE_CLIENT_ID=<client-id> \
  -e AZURE_CLIENT_SECRET=<client-secret> \
  -v $PWD:/go/src/github.com/libopenstorage \
  -w /go/src/github.com/libopenstorage \
  hatsunemiku/golang-dev-docker \
  bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'
```
You can run additional tests on GCE by doing:

```
docker run \
  --rm \
  -t \
  -i \
  -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
  -e GCE_INSTANCE_NAME=<gce-instance-name> \
  -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
  -e GCE_INSTANCE_PROJECT=<gce-project-name> \
  -v $PWD:/go/src/github.com/libopenstorage \
  -w /go/src/github.com/libopenstorage \
  hatsunemiku/golang-dev-docker \
  bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops/gce && go test -v'
```
Rules have been added to the RHCOS config and will be present in RHCOS 42.80.20190911.0 and later.
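Once an image with that build rolls out, a quick way to confirm a node actually has the fix would be something like this (a sketch; the node name is a placeholder and the path is the usual RHCOS location under /usr/lib):

```
# Confirm the booted RHCOS version is 42.80.20190911.0 or later
# and that the GCE udev rules are now present on the node.
oc debug node/<node-name> -- chroot /host sh -c \
  'rpm-ostree status -b | grep Version; ls /usr/lib/udev/rules.d/ | grep -i gce'
```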
Regarding your question: I looked at https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/packages/google-compute-engine/src/lib/udev/rules.d/64-gce-disk-removal.rules and I don't understand why Google decided to do a `umount -f` on device removal. That's weird. Portworx doesn't depend on this rule.
Thanks for the reply, Craig. We included the removal rule for completeness; we don't think it should have any adverse effects.

I monkeyed around with the test you provided in comment #7 and got it working with `podman` (no `docker` on RHCOS); it passed after I also mounted the host's `/dev` into the container. I confirmed that the `/dev/disk/by-id` symlinks for attached disks showed up while the test was running:

```
$ ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root  9 Sep 13 02:29 google-openstorage-test-6e5443e9-1ef2-42da-9ae1-5ec6928897d8 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Sep 12 19:34 google-persistent-disk-0 -> ../../sda
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 google-persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Sep 12 19:35 google-persistent-disk-0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 google-persistent-disk-0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root  9 Sep 13 02:29 scsi-0Google_PersistentDisk_openstorage-test-6e5443e9-1ef2-42da-9ae1-5ec6928897d8 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Sep 12 19:35 scsi-0Google_PersistentDisk_persistent-disk-0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Sep 12 19:34 scsi-0Google_PersistentDisk_persistent-disk-0-part3 -> ../../sda3
```
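A rough sketch of the adapted invocation (not the exact command that was run; the --privileged flag and the /dev bind mount are the assumed adjustments on top of the docker command from comment #7):

```
# Same test container as comment #7, adapted for podman on RHCOS with the host's
# /dev bind-mounted so the by-id symlinks created by udev are visible to the test.
sudo podman run \
  --rm -t -i \
  --privileged \
  -v /dev:/dev \
  -e GOOGLE_APPLICATION_CREDENTIALS=<path-to-service-account-json-file> \
  -e GCE_INSTANCE_NAME=<gce-instance-name> \
  -e GCE_INSTANCE_ZONE=<gce-instance-zone> \
  -e GCE_INSTANCE_PROJECT=<gce-project-name> \
  -v $PWD:/go/src/github.com/libopenstorage \
  -w /go/src/github.com/libopenstorage \
  hatsunemiku/golang-dev-docker \
  bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'
```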
Micah,

Thanks for working on this, and for running the libopenstorage/cloudops tests to verify. Just out of curiosity, to solve this problem did you bake the udev files into the RHCOS image, or did you add them to the code in afterburn, which does cloud-specific provisioning?
In this case, we've baked the rules into the RHCOS image. In the future, we want to break the rules into a separate package (see BZ#1751310) so that we can include them that way.
Micah, I verified this fix on the latest RHCOS image with openshift-install on GCP. Specifically, I used the latest openshift-install to provision an Openshift 4 cluster in GCP, then I provisioned Portworx, created a StorageCluster, and I observed that the disks were created properly and mounted. Thanks a lot for working on this fix, and running the libopenstorage/cloudops tests.
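For anyone repeating this verification, the rough sequence is along these lines (a sketch; the kube-system namespace, the pod label, and the node name are assumptions about a typical Portworx install, not the exact commands from the run above):

```
# Check that the Portworx StorageCluster and pods came up (namespace/label assumed)
oc get storagecluster -n kube-system
oc get pods -n kube-system -l name=portworx

# On a worker node, dynamically provisioned disks should appear under /dev/disk/by-id
# with the google-* names created by 65-gce-disk-naming.rules.
oc debug node/<worker-node> -- chroot /host ls -l /dev/disk/by-id/
```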
Micah,

Thanks for fixing this problem on GCP and Azure. With this fix, I was able to set up two Openshift 4.2 clusters (one on GCP, one on Azure). Ryan Wallner, on my team, was able to test Portworx Disaster Recovery (DR): he migrated applications and data running on one Openshift cluster to the other cluster running in a different cloud. Ryan made a video: "Cross-Cloud Application Migration with Google Cloud and Microsoft Azure on OpenShift 4.2" https://youtu.be/ZhdpE6sl_jM

Thanks again, Micah, for making this possible.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922