Description of problem:

Version-Release number of the following components:

openshift-install version
openshift-install unreleased-master-1655-g4f3e73a0143ba36229f42e8b65b6e65342bb826b
built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b
release image registry.svc.ci.openshift.org/origin/release:4.2

How reproducible:

Steps to Reproduce:

1. Get the source code for openshift-install from https://github.com/openshift/installer and compile it.

2. Create an install-config.yaml file that looks like:

    apiVersion: v1
    baseDomain: azure.openshift.portworx.com
    compute:
    - hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 3
    controlPlane:
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 3
    metadata:
      creationTimestamp: null
      name: craig-azure-cool5
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineCIDR: 10.0.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      azure:
        baseDomainResourceGroupName: openshift
        region: westus
    pullSecret: [redacted]
    sshKey: [redacted]

3. Install OpenShift in Azure with:

    openshift-install create cluster

4. OpenShift will provision a cluster.
On each node in the cluster, this OS is running:

    NAME="Red Hat Enterprise Linux CoreOS"
    VERSION="42.80.20190829.1"
    VERSION_ID="4.2"
    PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190829.1 (Ootpa)"
    ID="rhcos"
    ID_LIKE="rhel fedora"
    ANSI_COLOR="0;31"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
    REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
    REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
    REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
    OSTREE_VERSION=42.80.20190829.1

5. If I log into one of the nodes, I see that in /etc/udev we have:

    /etc/udev/
    /etc/udev/udev.conf
    /etc/udev/hwdb.d
    /etc/udev/rules.d
    /etc/udev/rules.d/70-persistent-ipoib.rules
    /etc/udev/hwdb.bin

However, according to the documentation at:

https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshoot-device-names-problems

any VM in Azure should have some Azure-specific udev files for dealing with dynamic provisioning of storage devices. Specifically, each node should have:

    /etc/udev/rules.d/66-azure-storage.rules
    /etc/udev/rules.d/99-azure-product-uuid.rules

which are provided by the walinuxagent package from Microsoft.

I work for Portworx (https://www.portworx.com), and found this problem by trying to install OpenShift 4.2 in Azure and then dynamically provisioning Portworx storage devices using the StorageCluster interface in OpenShift. The lack of the necessary udev files on the OpenShift nodes breaks Portworx storage.
Portworx storage works fine on Azure nodes configured via AKS. Just to compare, I ran this command to create an AKS cluster, using Microsoft's code:

    az aks create \
        --resource-group craig-awesome1-group \
        --name craig-aks-awesome2 \
        --node-count 1 \
        --enable-addons monitoring \
        --ssh-key-value ~/.ssh/id_rsa.pub \
        --debug

When I logged into the node created by this command, I found:

    /etc/udev/rules.d
    /etc/udev/rules.d/66-azure-storage.rules
    /etc/udev/rules.d/99-azure-product-uuid.rules
    /etc/udev/rules.d/70-persistent-net.rules
    /etc/udev/rules.d/10-net-device-added.rules
    /etc/udev/hwdb.d
    /etc/udev/udev.conf

So they install those files, and thus third-party storage providers such as Portworx work fine in Azure.
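The comparison between the two nodes can be automated. The following is a minimal sketch, not part of the original report: it checks whether the two rule files named above exist in a set of udev rule directories. The function name `missing_azure_rules` is hypothetical, and searching /usr/lib/udev/rules.d in addition to /etc/udev/rules.d is an assumption (an image may ship its rules there rather than under /etc).

```python
from pathlib import Path

# File names taken from the listings in this report.
EXPECTED_RULES = ("66-azure-storage.rules", "99-azure-product-uuid.rules")

# Assumption: besides /etc/udev/rules.d, also look in the vendor rules
# directory, since an OS image may install its rules there.
DEFAULT_DIRS = ("/etc/udev/rules.d", "/usr/lib/udev/rules.d")


def missing_azure_rules(rules_dirs=DEFAULT_DIRS, expected=EXPECTED_RULES):
    """Return the expected rule file names found in none of rules_dirs."""
    directories = [Path(d) for d in rules_dirs]
    return [
        name
        for name in expected
        if not any((d / name).is_file() for d in directories)
    ]


if __name__ == "__main__":
    missing = missing_azure_rules()
    if missing:
        print("missing udev rules:", ", ".join(missing))
    else:
        print("all expected Azure udev rules are present")
```

Run on the RHCOS node described above, this would be expected to report both files as missing; on the AKS node it would be expected to report none.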
I also put a reference to this bug here: https://github.com/openshift/installer/issues/2298
66-azure-storage.rules comes from here:

https://github.com/Azure/WALinuxAgent/

so it looks like openshift-install needs to make sure that walinuxagent is installed.

Maybe the Azure RHCOS image used by openshift-install should have this installed in the image by default?
(In reply to Craig Rodrigues from comment #3)
> 66-azure-storage.rules comes from here:
>
> https://github.com/Azure/WALinuxAgent/
>
> so it looks like openshift-install needs to make sure that walinuxagent is
> installed.
>
> Maybe the Azure rhcosimage used by openshift-install should have this
> installed in the image by default?

RHCOS does not, and probably won't, ship the walinuxagent.
Can you change the openshift-install logic to rpm install walinuxagent as part of the provisioning?
I ran an additional experiment where I provisioned a bare CentOS 7.5 VM in Azure (no AKS, nothing fancy).

The version is:

    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

The files in udev are:

    /etc/udev/
    /etc/udev/rules.d
    /etc/udev/rules.d/66-azure-storage.rules
    /etc/udev/rules.d/99-azure-product-uuid.rules
    /etc/udev/rules.d/75-persistent-net-generator.rules
    /etc/udev/rules.d/68-azure-sriov-nm-unmanaged.rules
    /etc/udev/udev.conf
    /etc/udev/hwdb.bin

    rpm -qf /etc/udev/rules.d/66-azure-storage.rules
    WALinuxAgent-2.2.18-1.el7.centos.noarch
Storage is not the right component here: we maintain PVs, PVCs, and storage plugins in kube-apiserver, kube-controller-manager, or kubelet. Kubelet in particular must already have all the udev rules installed when it starts; we should not install them from inside kubelet (and reboot the machine). The rules must be created either by the installer or by RHCOS. Trying RHCOS.
I agree with Jan Safranek. The WALinuxAgent package, which contains the necessary udev rules for hosts running on Azure, should not be dealt with at the Kubernetes layer (kube-apiserver, kube-controller-manager, or kubelet). My recommendation is to either:

1. Make the WALinuxAgent package part of the base RHCOS image which is installed on Azure, OR
2. Change the terraform logic inside openshift-install to install the WALinuxAgent package when provisioning hosts on Azure.
We are actually going to go with a different option altogether. WALinuxAgent invites too many anti-patterns in the OpenShift 4 model, where everything is declarative, and we've decided not to ship the agent at all. We instead have our own, minimal agent: https://github.com/coreos/afterburn. As the GitHub org implies, this utility and model originated in Container Linux (by CoreOS). As for the udev rules, we will just include those in RHCOS.
Ah, that's interesting, I did not know about afterburn. The udev rules look like they come from this GitHub repository maintained by Microsoft:

https://github.com/Azure/WALinuxAgent/tree/master/config/

66-azure-storage.rules looks like it hasn't changed much in the past 3 years, so hopefully just including it without installing WALinuxAgent should do the trick.
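One way to see whether the storage rules are actually taking effect on a node, independent of which package or image supplied them, is to look at the device symlinks they create. The sketch below is illustrative only. It assumes the symlinks live under /dev/disk/azure (for example scsi1/lun0 for a data disk), which is what the Microsoft troubleshooting article cited earlier in this bug describes rather than anything stated in this thread, and the function name `azure_disk_links` is hypothetical.

```python
from pathlib import Path


def azure_disk_links(base="/dev/disk/azure"):
    """Map each symlink under base (by relative name) to its resolved target.

    Returns an empty dict when base does not exist, which is what one would
    expect on a node where the Azure storage udev rules are not applied.
    """
    root = Path(base)
    links = {}
    if not root.is_dir():
        return links
    for path in sorted(root.rglob("*")):
        if path.is_symlink():
            links[path.relative_to(root).as_posix()] = str(path.resolve())
    return links


if __name__ == "__main__":
    found = azure_disk_links()
    if not found:
        print("no symlinks found under /dev/disk/azure")
    for name, target in found.items():
        print(f"{name} -> {target}")
```

An empty result after attaching a data disk would suggest the rules are missing or not being triggered.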
At Portworx, we have extensive tests for cloud storage. As you fix this bug on Azure, could you run the tests we have for mounting storage? Our tests are open source. You can do the following.

1. Provision an OpenShift cluster on Azure.
2. Get direct access to one of the nodes and log into it.
3. Read: https://github.com/libopenstorage/cloudops/blob/master/azure/README.md for
4. Use the following container to check out and run the tests on Azure, replacing the environment variables with your Azure setup:

    docker run \
        --rm \
        -t \
        -i \
        -e AZURE_INSTANCE_ID=<instance-id> \
        -e AZURE_INSTANCE_REGION=<instance-region> \
        -e AZURE_SCALE_SET_NAME=<scale-set-name> \
        -e AZURE_SUBSCRIPTION_ID=<subscription-id> \
        -e AZURE_RESOURCE_GROUP_NAME=<resource-group-name-of-instance> \
        -e AZURE_ENVIRONMENT=<azure-cloud-environment> \
        -e AZURE_TENANT_ID=<tenant-id> \
        -e AZURE_CLIENT_ID=<client-id> \
        -e AZURE_CLIENT_SECRET=<client-secret> \
        -v $PWD:/go/src/github.com/libopenstorage \
        -w /go/src/github.com/libopenstorage \
        hatsunemiku/golang-dev-docker \
        bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops && make && make test'
You can run additional tests on Azure by doing:

    docker run \
        --rm \
        -t \
        -i \
        -e AZURE_INSTANCE_ID=<instance-id> \
        -e AZURE_INSTANCE_REGION=<instance-region> \
        -e AZURE_SCALE_SET_NAME=<scale-set-name> \
        -e AZURE_SUBSCRIPTION_ID=<subscription-id> \
        -e AZURE_RESOURCE_GROUP_NAME=<resource-group-name-of-instance> \
        -e AZURE_ENVIRONMENT=<azure-cloud-environment> \
        -e AZURE_TENANT_ID=<tenant-id> \
        -e AZURE_CLIENT_ID=<client-id> \
        -e AZURE_CLIENT_SECRET=<client-secret> \
        -v $PWD:/go/src/github.com/libopenstorage \
        -w /go/src/github.com/libopenstorage \
        hatsunemiku/golang-dev-docker \
        bash -c 'git clone https://github.com/libopenstorage/cloudops && cd cloudops/azure && go test -v'
Rules have been added to the RHCOS config and will be present in RHCOS 42.80.20190911.0 and later.
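To tell whether a given node is new enough to carry the rules, its OSTREE_VERSION (as shown in the os-release output earlier in this bug) can be compared against 42.80.20190911.0. This is a rough sketch under one stated assumption: that RHCOS versions of this form order correctly when each dot-separated field is compared as an integer. The helper names are hypothetical.

```python
FIXED_IN = "42.80.20190911.0"  # first RHCOS version stated to include the rules


def parse_version(text):
    """Turn a dotted version such as '42.80.20190829.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().strip('"').split("."))


def ostree_version(os_release_text):
    """Extract OSTREE_VERSION from os-release contents, or None if absent."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "OSTREE_VERSION":
            return value.strip().strip('"')
    return None


def has_azure_udev_fix(version, fixed_in=FIXED_IN):
    """True if version is at or after fixed_in.

    Assumption: field-by-field integer comparison matches RHCOS ordering.
    """
    return parse_version(version) >= parse_version(fixed_in)


if __name__ == "__main__":
    with open("/etc/os-release") as handle:
        version = ostree_version(handle.read())
    if version is None:
        print("OSTREE_VERSION not found in /etc/os-release")
    else:
        print(version, "has fix:", has_azure_udev_fix(version))
```

The version reported in the original description, 42.80.20190829.1, compares as older than the fixed version under this scheme.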
Micah, I verified this fix on the latest RHCOS image with openshift-install on Azure. Specifically, I used the latest openshift-install to provision an OpenShift 4 cluster in Azure, then I provisioned Portworx, created a StorageCluster, and observed that the disks were created and mounted properly. Thanks a lot for working on this fix and for running the libopenstorage/cloudops tests.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922