Description of problem:
On a deployment of OCP 3.11.43, while deploying OCS 3.10 in converged mode, the deployment fails at the task "Wait for deploy-heketi pod".

Ansible playbook run:
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -vvv

On describing the deploy-heketi pod, the error seen is:

kubelet, dhcp46-26.lab.eng.blr.redhat.com  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "64b45294eea6bc10e657bcd9dfe4e21b84f62ea4ac2c7c0ab495f05daf936620" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to set up pod "deploy-heketi-storage-1-deploy_glusterfs" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "64b45294eea6bc10e657bcd9dfe4e21b84f62ea4ac2c7c0ab495f05daf936620" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to teardown pod "deploy-heketi-storage-1-deploy_glusterfs" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]

Version-Release number of selected component (if applicable):
OCP 3.11.43 and OCS 3.10

How reproducible:
Twice

Steps to Reproduce:
1. Deploy OCP 3.11.43 and run the OCS deployment.
2. The deployment fails at the task "Wait for deploy-heketi pod".

Actual results:
Fails at the "Wait for deploy-heketi pod" task.

Expected results:
The OCS deployment should pass.

Additional info:
############ Snippet from ansible deployment of OCS 3.10 in converged mode ############

<dhcp46-113.lab.eng.blr.redhat.com> ESTABLISH SSH CONNECTION FOR USER: root
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b dhcp46-113.lab.eng.blr.redhat.com '/bin/sh -c '"'"'echo ~root && sleep 0'"'"''
<dhcp46-113.lab.eng.blr.redhat.com> (0, '/root\n', '')
<dhcp46-113.lab.eng.blr.redhat.com> ESTABLISH SSH CONNECTION FOR USER: root
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b dhcp46-113.lab.eng.blr.redhat.com '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182 `" && echo ansible-tmp-1544602634.29-231169423874182="` echo /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182 `" ) && sleep 0'"'"''
<dhcp46-113.lab.eng.blr.redhat.com> (0, 'ansible-tmp-1544602634.29-231169423874182=/root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182\n', '')
Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_obj.py
<dhcp46-113.lab.eng.blr.redhat.com> PUT /root/.ansible/tmp/ansible-local-50261RC_izL/tmpZbVhKn TO /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/oc_obj.py
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b '[dhcp46-113.lab.eng.blr.redhat.com]'
<dhcp46-113.lab.eng.blr.redhat.com> (0, 'sftp> put /root/.ansible/tmp/ansible-local-50261RC_izL/tmpZbVhKn /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/oc_obj.py\n', '')
<dhcp46-113.lab.eng.blr.redhat.com> ESTABLISH SSH CONNECTION FOR USER: root
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b dhcp46-113.lab.eng.blr.redhat.com '/bin/sh -c '"'"'chmod u+x /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/ /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/oc_obj.py && sleep 0'"'"''
<dhcp46-113.lab.eng.blr.redhat.com> (0, '', '')
<dhcp46-113.lab.eng.blr.redhat.com> ESTABLISH SSH CONNECTION FOR USER: root
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b -tt dhcp46-113.lab.eng.blr.redhat.com '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/oc_obj.py && sleep 0'"'"''
<dhcp46-113.lab.eng.blr.redhat.com> (0, '\r\n{"invocation": {"module_args": {"files": null, "kind": "pod", "force": false, "name": null, "field_selector": null, "all_namespaces": null, "namespace": "glusterfs", "delete_after": false, "kubeconfig": "/etc/origin/master/admin.kubeconfig", "content": null, "state": "list", "debug": false, "selector": "glusterfs=deploy-heketi-storage-pod"}}, "state": "list", "changed": false, "results": {"returncode": 0, "cmd": "/usr/bin/oc get pod --selector=glusterfs=deploy-heketi-storage-pod -o json -n glusterfs", "results": [{"items": [], "kind": "List", "apiVersion": "v1", "metadata": {"selfLink": "", "resourceVersion": ""}}]}}\r\n', 'Shared connection to dhcp46-113.lab.eng.blr.redhat.com closed.\r\n')
<dhcp46-113.lab.eng.blr.redhat.com> ESTABLISH SSH CONNECTION FOR USER: root
<dhcp46-113.lab.eng.blr.redhat.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/f7e374e19b dhcp46-113.lab.eng.blr.redhat.com '/bin/sh -c '"'"'rm -f -r /root/.ansible/tmp/ansible-tmp-1544602634.29-231169423874182/ > /dev/null 2>&1 && sleep 0'"'"''
<dhcp46-113.lab.eng.blr.redhat.com> (0, '', '')

FAILED - RETRYING: Wait for deploy-heketi pod (42 retries left). Result was:
{
    "attempts": 139,
    "changed": false,
    "invocation": {
        "module_args": {
            "all_namespaces": null,
            "content": null,
            "debug": false,
            "delete_after": false,
            "field_selector": null,
            "files": null,
            "force": false,
            "kind": "pod",
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "name": null,
            "namespace": "glusterfs",
            "selector": "glusterfs=deploy-heketi-storage-pod",
            "state": "list"
        }
    },
    "results": {
        "cmd": "/usr/bin/oc get pod --selector=glusterfs=deploy-heketi-storage-pod -o json -n glusterfs",
        "results": [
            {
                "apiVersion": "v1",
                "items": [],
                "kind": "List",
                "metadata": {
                    "resourceVersion": "",
                    "selfLink": ""
                }
            }
        ],
        "returncode": 0
    },
    "retries": 181,
    "state": "list"
}

######### End of ansible run snippet #########

oc describe of the deploy-heketi pod:

# oc describe pod deploy-heketi-storage-1-deploy
Name:               deploy-heketi-storage-1-deploy
Namespace:          glusterfs
Priority:           0
PriorityClassName:  <none>
Node:               dhcp46-26.lab.eng.blr.redhat.com/10.70.46.26
Start Time:         Wed, 12 Dec 2018 13:22:53 +0530
Labels:             openshift.io/deployer-pod-for.name=deploy-heketi-storage-1
Annotations:
                    openshift.io/deployment-config.name=deploy-heketi-storage
                    openshift.io/deployment.name=deploy-heketi-storage-1
                    openshift.io/scc=restricted
Status:             Pending
IP:
Containers:
  deployment:
    Container ID:
    Image:          registry.access.redhat.com/openshift3/ose-deployer:v3.11
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      OPENSHIFT_DEPLOYMENT_NAME:       deploy-heketi-storage-1
      OPENSHIFT_DEPLOYMENT_NAMESPACE:  glusterfs
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from deployer-token-xq8v6 (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  deployer-token-xq8v6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  deployer-token-xq8v6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                  Age                From                                       Message
  ----     ------                  ---                ----                                       -------
  Normal   Scheduled               20m                default-scheduler                          Successfully assigned glusterfs/deploy-heketi-storage-1-deploy to dhcp46-26.lab.eng.blr.redhat.com
  Warning  FailedCreatePodSandBox  20m                kubelet, dhcp46-26.lab.eng.blr.redhat.com  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "64b45294eea6bc10e657bcd9dfe4e21b84f62ea4ac2c7c0ab495f05daf936620" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to set up pod "deploy-heketi-storage-1-deploy_glusterfs" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "64b45294eea6bc10e657bcd9dfe4e21b84f62ea4ac2c7c0ab495f05daf936620" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to teardown pod "deploy-heketi-storage-1-deploy_glusterfs" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]
  Normal   SandboxChanged          0s (x58 over 20m)  kubelet, dhcp46-26.lab.eng.blr.redhat.com  Pod sandbox changed, it will be killed and re-created.

############# End #############
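The "connection refused" on the CNI server socket above means the CNI plugin on the node could not reach the openshift-sdn pod's local server at all. A minimal node-side check, as a sketch (the socket path is the OCP 3.11 default; the SDN_SOCK override and the plain -e existence test instead of a strict socket-type test are simplifications for illustration):

```shell
# Check whether the SDN CNI server socket exists on this node. Every pod
# sandbox setup on the node posts its CNI request to this unix socket, so
# if it is absent the kubelet fails exactly as in the events above.
check_sdn_sock() {
  sock="${SDN_SOCK:-/var/run/openshift-sdn/cni-server.sock}"
  if [ -e "$sock" ]; then
    echo "socket present: $sock"
  else
    echo "socket missing: $sock (check the SDN pod / node service on this host)"
  fi
}

check_sdn_sock
```

If the socket is missing, the next step would be to look at the SDN pod scheduled on that node and the node service logs on the host.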
I'm seeing the exact same issue with OCP 3.11.59 and OCS 3.11.0. In my case it is reproducible 100% of the time. (Tested using the ovs-networkpolicy plugin, the ovs-multitenant plugin, and the Calico SDN.)

#### From "oc describe pod deploy-heketi-storage-1-deploy":

...<snip>...
  PodScheduled     True
Volumes:
  deployer-token-cxrkf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  deployer-token-cxrkf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ---                ----                     -------
  Normal   Scheduled               8m                 default-scheduler        Successfully assigned openshift-storage/deploy-heketi-storage-1-deploy to ocp.shift.zone
  Warning  FailedCreatePodSandBox  8m                 kubelet, ocp.shift.zone  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d78170e16282e24dce3167e5bef44355ed5abdb559c72276179485f6c54f494f" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to set up pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded, failed to clean up sandbox container "d78170e16282e24dce3167e5bef44355ed5abdb559c72276179485f6c54f494f" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to teardown pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded]
  Normal   SandboxChanged          11s (x23 over 8m)  kubelet, ocp.shift.zone  Pod sandbox changed, it will be killed and re-created.
#### Sample RUN 1 #### From "oc get events" at project level (installer deploying Heketi to an infrastructure node):

0m 10m 1 deploy-heketi-storage.157a61202a3aebde DeploymentConfig Normal DeploymentCreated deploymentconfig-controller Created new replication controller "deploy-heketi-storage-1" for version 1
10m 10m 1 deploy-heketi-storage-1-deploy.157a61202d393214 Pod Normal Scheduled default-scheduler Successfully assigned openshift-storage/deploy-heketi-storage-1-deploy to ocp-inf2.shift.zone
10m 10m 1 deploy-heketi-storage-1-deploy.157a61251b464d7f Pod Warning FailedCreatePodSandBox kubelet, ocp-inf2.shift.zone Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "c381b14226dfc3408b898154a840a95eb1aed4eda5cd28523b03d3038955ab40" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to set up pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded, failed to clean up sandbox container "c381b14226dfc3408b898154a840a95eb1aed4eda5cd28523b03d3038955ab40" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to teardown pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded]
1s 10m 28 deploy-heketi-storage-1-deploy.157a612550469fcd Pod Normal SandboxChanged kubelet, ocp-inf2.shift.zone Pod sandbox changed, it will be killed and re-created.
#### SAMPLE RUN 2 #### From "oc get events" at project level (installer deploying Heketi to a master node):

14m 14m 1 deploy-heketi-storage.157a695943f0da1f DeploymentConfig Normal DeploymentCreated deploymentconfig-controller Created new replication controller "deploy-heketi-storage-1" for version 1
14m 14m 1 deploy-heketi-storage-1-deploy.157a69594a295e88 Pod Normal Scheduled default-scheduler Successfully assigned openshift-storage/deploy-heketi-storage-1-deploy to ocp.shift.zone
14m 14m 1 deploy-heketi-storage-1-deploy.157a695e39b5c646 Pod Warning FailedCreatePodSandBox kubelet, ocp.shift.zone Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d78170e16282e24dce3167e5bef44355ed5abdb559c72276179485f6c54f494f" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to set up pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded, failed to clean up sandbox container "d78170e16282e24dce3167e5bef44355ed5abdb559c72276179485f6c54f494f" network for pod "deploy-heketi-storage-1-deploy": NetworkPlugin cni failed to teardown pod "deploy-heketi-storage-1-deploy_openshift-storage" network: context deadline exceeded]
3m 14m 28 deploy-heketi-storage-1-deploy.157a695e611867e2 Pod Normal SandboxChanged kubelet, ocp.shift.zone Pod sandbox changed, it will be killed and re-created.

#### Additional Information
- The installation successfully deploys the GlusterFS DaemonSet.
- I also tried this, but it made no difference: https://access.redhat.com/solutions/3785221
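Worth noting that the failure signature here differs from the original report: there the CNI request failed with "connection refused" on /var/run/openshift-sdn/cni-server.sock (the SDN's CNI server was not serving at all), while here it is "context deadline exceeded" (the network plugin is reachable but setup never completes). A throwaway triage sketch of that distinction (the function name and labels are made up for illustration):

```shell
# Classify a FailedCreatePodSandBox message into the two failure modes
# seen in this bug. Pure string matching, for triage only.
classify_sandbox_error() {
  case "$1" in
    *"cni-server.sock: connect: connection refused"*)
      echo "sdn-down" ;;      # node's SDN CNI server not running/serving
    *"context deadline exceeded"*)
      echo "sdn-timeout" ;;   # plugin reachable but setup never completes
    *)
      echo "unknown" ;;
  esac
}

classify_sandbox_error "dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused"  # prints "sdn-down"
```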
Note: Update to fix this issue when using OCP 3.11.59 and OCS 3.11.0 (answering my own report).

I took two actions that fixed the problem, so I'm documenting them here in case it helps someone else:

1) I found that uninstalling OCP leaves behind some iptables rules, and with each re-installation more of these leftover rules accumulate (up to several pages). So now, after each uninstall, I manually reset the iptables configuration to the default basic rules (allow 22, related, established, etc.). I have not had this OCS installation problem since doing this.

2) When using Calico, I also found some configurations are left behind under /etc/cni. I'm removing those after the uninstall as well.
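For reference, a minimal baseline in iptables-restore format matching the rules described above. This is an assumption of what "default basic rules" means here, not the exact rules used; adapt it to your own policy before loading it with `iptables-restore < file` (and clean stale CNI state under /etc/cni/net.d separately, as in step 2):

```
*filter
:INPUT DROP [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# loopback and already-established traffic
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
# ssh
-A INPUT -p tcp --dport 22 -j ACCEPT
COMMIT
```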
Hi Ashmitha, did the hints from William in comment #4 help you?
Hi Niels, I still hit this issue during 3.11 testing, and what William mentioned has helped. This is definitely not what is expected to happen. Any reason why this would happen during fresh installs?
Was OCP installed on freshly provisioned VMs, or were the machines being reused? In any case, is this still an issue? If not, please close this BZ.
(In reply to Jose A. Rivera from comment #7)
> Was OCP installed on freshly provisioned VMs, or were the machines being
> reused?
>
> In any case, is this still an issue? If not, please close this BZ.

I've seen this issue on both freshly provisioned VMs and VMs which were being reused after cleanup. And this is still an issue.
Hmm... odd. Is this happening all the time, or only some of the time? What are the exact workaround steps you apply when this occurs in the case of a freshly provisioned VM? Is the following sequence of events correct?

1. Provision VM
2. Attempt OCP install, install fails
3. Apply workarounds
4. OCP install succeeds

What happens if you apply the workarounds prior to installing OCP?
If using the latest OCP 3.11.98 and OCS 3.11.2, I've found that with certain setups (resource constraints or slow disks) setting the following variable also helps, as it gives the GlusterFS cluster enough time to sync among its members:

openshift_storage_glusterfs_timeout=900

Another thing to keep in mind: if doing a re-install, make sure there is nothing left behind under "/etc/glusterfs", as failed configurations may interfere with new ones.
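In inventory terms this is a one-line addition under [OSEv3:vars]; a sketch (900 seconds is the value from this comment, tune it to your hardware):

```
[OSEv3:vars]
# Allow the "Wait for ... pod" tasks up to 15 minutes for the
# deploy-heketi/heketi pods to come up on slow or constrained setups.
openshift_storage_glusterfs_timeout=900
```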
I can confirm that this issue is 100% reproducible in my case, even with freshly provisioned, clean VMs.
I've been trying to overcome this issue in my deployment for around a week. I ran the deployment scripts (prerequisites.yml, deploy_cluster.yml, uninstall.yml) again and again; still the same issue.

"name": "heketi", "ready": false, "restartCount": 0, "state": {"waiting": {"reason": "ContainerCreating"}}}], "hostIP": "192.168.1.212", "phase": "Pending", "qosClass": "BestEffort", "startTime": "2020-01-20T17:37:40Z"}}], "kind": "List", "metadata": {"resourceVersion": "", "selfLink": ""}}], "returncode": 0}, "state": "list"}

PLAY RECAP ****************************************************************************************************************************************************************************************
localhost                 : ok=12   changed=0   unreachable=0  failed=0  skipped=4    rescued=0  ignored=0
os-infra.mydomain.com     : ok=144  changed=36  unreachable=0  failed=0  skipped=163  rescued=0  ignored=0
os-master.mydomain.com    : ok=471  changed=193 unreachable=0  failed=1  skipped=589  rescued=0  ignored=0
os-node.mydomain.com      : ok=129  changed=36  unreachable=0  failed=0  skipped=159  rescued=0  ignored=0
os-storage.mydomain.com   : ok=129  changed=36  unreachable=0  failed=0  skipped=159  rescued=0  ignored=0

INSTALLER STATUS **********************************************************************************************************************************************************************************
Initialization             : Complete (0:00:26)
Health Check               : Complete (0:00:07)
Node Bootstrap Preparation : Complete (0:03:14)
etcd Install               : Complete (0:00:41)
Master Install             : Complete (0:04:26)
Master Additional Install  : Complete (0:00:40)
Node Join                  : Complete (0:00:43)
GlusterFS Install          : In Progress (0:13:46)
        This phase can be restarted by running: playbooks/openshift-glusterfs/new_install.yml

Failure summary:
  Hosts:   os-master.mydomain.com
  Play:    Configure GlusterFS
  Task:    Wait for heketi pod
  Message: Failed without returning a message.

Can someone please advise what I should do to be able to deploy successfully?
I'm using a standalone VMware ESXi hypervisor and an RPM install of Origin.

ansible 2.9.2
Origin 3.11
CentOS 7 as the OS for the nodes

[root@os-master ~]# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-103.git7f2769b.el7.centos.x86_64
 Go version:      go1.10.3
 Git commit:      7f2769b/1.13.1
 Built:           Sun Sep 15 14:06:47 2019
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-103.git7f2769b.el7.centos.x86_64
 Go version:      go1.10.3
 Git commit:      7f2769b/1.13.1
 Built:           Sun Sep 15 14:06:47 2019
 OS/Arch:         linux/amd64
 Experimental:    false

Here is my inventory:

[OSEv3:children]
masters
etcd
nodes
glusterfs
glusterfs_registry

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_release="3.11"
openshift_image_tag="v3.11"
openshift_master_default_subdomain=apps.mydomain.com
openshift_docker_selinux_enabled=True
openshift_check_min_host_memory_gb=16
openshift_check_min_host_disk_gb=50
openshift_disable_check=docker_image_availability
openshift_master_dynamic_provisioning_enabled=true
openshift_registry_selector="role=infra"
openshift_hosted_registry_storage_kind=glusterfs
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_storage_type=pv
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_storage_volume_size=20Gi
openshift_metrics_cassandra_pvc_storage_class_name="glusterfs-registry-block"
openshift_logging_install_logging=true
openshift_logging_es_pvc_dynamic=true
openshift_logging_storage_kind=dynamic
openshift_logging_kibana_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_curator_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_pvc_size=20Gi
openshift_logging_es_pvc_storage_class_name="glusterfs-registry-block"
openshift_storage_glusterfs_registry_namespace=infra-storage
openshift_storage_glusterfs_registry_storageclass=false
openshift_storage_glusterfs_registry_storageclass_default=false
openshift_storage_glusterfs_registry_block_deploy=true
openshift_storage_glusterfs_registry_block_host_vol_create=true
openshift_storage_glusterfs_registry_block_host_vol_size=100
openshift_storage_glusterfs_registry_block_storageclass=true
openshift_storage_glusterfs_registry_block_storageclass_default=false

[masters]
os-master.mydomain.com

[etcd]
os-master.mydomain.com

[nodes]
os-master.mydomain.com openshift_node_group_name="node-config-master"
os-infra.mydomain.com openshift_node_group_name="node-config-infra"
os-storage.mydomain.com openshift_node_group_name="node-config-compute"
os-node.mydomain.com openshift_node_group_name="node-config-compute"

[glusterfs_registry]
os-infra.mydomain.com glusterfs_ip='192.168.1.213' glusterfs_devices='["/dev/sdb"]'
os-node.mydomain.com glusterfs_ip='192.168.1.214' glusterfs_devices='["/dev/sdb"]'
os-storage.mydomain.com glusterfs_ip='192.168.1.215' glusterfs_devices='["/dev/sdb"]'

[glusterfs]
os-infra.mydomain.com glusterfs_ip='192.168.1.213' glusterfs_devices='["/dev/sdb"]'
os-node.mydomain.com glusterfs_ip='192.168.1.214' glusterfs_devices='["/dev/sdb"]'
os-storage.mydomain.com glusterfs_ip='192.168.1.215' glusterfs_devices='["/dev/sdb"]'

Many thanks in advance.
Unfortunately the workaround (openshift_storage_glusterfs_timeout=900) doesn’t work in my case