Description of problem:
After installing the following CR, the storageos-daemonset pods stay in CrashLoopBackOff state.

apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: storageos
  namespace: openshift-operators
spec:
  secretRefName: "storageos-api"
  secretRefNamespace: "openshift-operators"
  namespace: openshift-operators
  csi:
    enable: true
    deploymentStrategy: deployment
  resources:
    requests:
      memory: "512Mi"
  k8sDistro: "openshift"

oc get pods
NAME                                    READY   STATUS             RESTARTS   AGE
storageos-csi-helper-688f676c89-knght   3/3     Running            0          9m45s
storageos-daemonset-56htx               3/3     Running            0          9m45s
storageos-daemonset-9b6sh               2/3     CrashLoopBackOff   5          9m45s
storageos-daemonset-bpc7p               3/3     Running            0          9m45s
storageos-daemonset-d4mkq               2/3     CrashLoopBackOff   5          9m45s
storageos-daemonset-rf2lk               2/3     CrashLoopBackOff   5          9m45s
storageos-daemonset-xhnjl               2/3     CrashLoopBackOff   5          9m45s
storageos-operator-7f787d48c4-rs8qp     1/1     Running            0          42m
storageos-scheduler-5d47b6c86d-vbjcp    1/1     Running            0          9m45s

oc logs -f storageos-daemonset-xhnjl -c storageos
level=info msg="not first cluster node, joining first node" action=create address=10.0.140.90 category=etcd host=ip-10-0-140-90.us-east-2.compute.internal module=cp target=10.0.140.90
time="2019-11-07T04:56:17.061159088Z" level=error msg="could not retrieve cluster config from api" endpoint="http://10.0.140.90:5705/v1/members" status_code=404
time="2019-11-07T04:56:17.061230456Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="10.0.136.208,10.0.156.24,10.0.169.3,10.0.158.200,10.0.174.204,10.0.140.90" error="404 Not Found" module=cp
time="2019-11-07T04:56:17.061273574Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
time="2019-11-07T04:56:22.061521189Z" level=info msg="not first cluster node, joining first node" action=create address=10.0.140.90 category=etcd host=ip-10-0-140-90.us-east-2.compute.internal module=cp target=10.0.136.208

Version-Release number of selected component (if applicable):
StorageOS Operator: 1.4.0

How reproducible:
Always

Steps to Reproduce:
1. Create the StorageOS Operator.
2. Create the secret as described at https://docs.storageos.com/docs/platforms/openshift/install/4.1
3. Create the StorageOS CR shown in the bug description.

Actual results:
StorageOS daemonsets are failing.

Expected results:
StorageOS daemonsets should be running and the StorageOS cluster should be healthy.
We believe this was caused by a firewall on the worker nodes blocking port 5705. We've updated the docs in https://github.com/storageos/cluster-operator/pull/197 to make it clearer that this port is required. We'll release 1.5.1 later this week with the changes.
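
A quick way to check whether tcp/5705 is reachable on each node is a plain TCP connect test. The following is a minimal sketch, not part of StorageOS or the operator: the function names `port_is_open` and `blocked_nodes` are made up for illustration, and the only value taken from this report is the port number. A successful connect only shows that the port is reachable from wherever the script runs, so it should be run from a host or pod on the same network as the worker nodes.

```python
import socket

# Port number taken from the discussion above; everything else is illustrative.
STORAGEOS_API_PORT = 5705


def port_is_open(host, port=STORAGEOS_API_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def blocked_nodes(hosts, port=STORAGEOS_API_PORT, timeout=3.0):
    """Return the hosts on which the port could not be reached."""
    return [h for h in hosts if not port_is_open(h, port, timeout)]
```

For example, `blocked_nodes(["10.0.140.90", "10.0.136.208"])` would return the subset of those node addresses where the connection attempt failed.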
Tested on StorageOS 1.5.1 and it looks good when tcp/5705 is open.

oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
storageos-csi-helper-5564768bc5-snnmt   3/3     Running   0          7m58s
storageos-daemonset-2hmf6               3/3     Running   0          7m57s
storageos-daemonset-6rr7c               3/3     Running   0          7m57s
storageos-daemonset-jhdcg               3/3     Running   0          7m57s
storageos-operator-658fb7f587-7kmlb     1/1     Running   0          9m58s
storageos-scheduler-57878b58db-9fnm6    1/1     Running   3          7m58s

Marking as VERIFIED
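
The verification above comes down to every pod reporting all of its containers ready in the READY column. As a rough sketch, the same check can be scripted against saved `oc get pods` output. The helper name `not_ready_pods` is hypothetical, and the parsing assumes the default whitespace-separated column layout shown in this report.

```python
def not_ready_pods(oc_get_pods_output):
    """Return (name, ready, status) for pods whose READY count is below the total."""
    problems = []
    for line in oc_get_pods_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line that is too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if ready_count != total:
            problems.append((name, ready, status))
    return problems
```

An empty result means every listed pod has all containers ready. A pod showing 2/3 with CrashLoopBackOff, as in the original report, would be returned.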
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062