Description of problem:

During pod density testing, pods are finishing in Error state (12 of 2000 pods).

Version-Release number of selected component (if applicable):

% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-25-065959   True        False         29h     Cluster version is 4.6.0-0.nightly-2020-07-25-065959

How reproducible: 100%

Steps to Reproduce:
1. Scale up cluster to 20 worker nodes.
2. Create 2000 projects (200 per node):
   - git clone https://github.com/openshift/svt.git
   - cd svt/openshift_scalability
   - touch test.yaml
   - vim test.yaml

```
projects:
  - num: 2000
    basename: svt-
    templates:
      - num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```

   - cp $KUBECONFIG ~/.kube/config
   - python cluster-loader.py -f test.yaml -p 5

Actual results:

% oc get pods --all-namespaces | egrep -v "Running|Complete"
NAMESPACE   NAME                         READY   STATUS   RESTARTS   AGE
svt-1645    deploymentconfig0-1-deploy   0/1     Error    0          124m
svt-1708    deploymentconfig0-1-deploy   0/1     Error    0          123m
svt-1714    deploymentconfig0-1-deploy   0/1     Error    0          123m
svt-1750    deploymentconfig0-1-deploy   0/1     Error    0          123m
svt-1767    deploymentconfig0-1-deploy   0/1     Error    0          122m
svt-1770    deploymentconfig0-1-deploy   0/1     Error    0          122m
svt-1797    deploymentconfig0-1-deploy   0/1     Error    0          122m
svt-1806    deploymentconfig0-1-deploy   0/1     Error    0          122m
svt-1840    deploymentconfig0-1-deploy   0/1     Error    0          121m
svt-1916    deploymentconfig0-1-deploy   0/1     Error    0          120m
svt-1920    deploymentconfig0-1-deploy   0/1     Error    0          120m

% oc logs deploymentconfig0-1-deploy -n svt-1645
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host

% oc get replicationcontrollers -n svt-1645
NAME                  DESIRED   CURRENT   READY   AGE
deploymentconfig0-1   0         0         0       127m

% oc describe replicationcontrollers deploymentconfig0-1 -n svt-1645
Name:         deploymentconfig0-1
Namespace:    svt-1645
Selector:     deployment=deploymentconfig0-1,deploymentconfig=deploymentconfig0,name=replicationcontroller0
Labels:       openshift.io/deployment-config.name=deploymentconfig0
              template=deploymentConfigTemplate
Annotations:  kubectl.kubernetes.io/desired-replicas: 1
              openshift.io/deployer-pod.completed-at: 2020-07-30 18:31:44 +0000 UTC
              openshift.io/deployer-pod.created-at: 2020-07-30 18:31:35 +0000 UTC
              openshift.io/deployer-pod.name: deploymentconfig0-1-deploy
              openshift.io/deployment-config.latest-version: 1
              openshift.io/deployment-config.name: deploymentconfig0
              openshift.io/deployment.phase: Failed
              openshift.io/deployment.replicas: 0
              openshift.io/deployment.status-reason: config change
              openshift.io/encoded-deployment-config: {"kind":"DeploymentConfig","apiVersion":"apps.openshift.io/v1","metadata":{"name":"deploymentconfig0","namespace":"svt-1645","selfLink":"/...
Replicas:     0 current / 0 desired
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       deployment=deploymentconfig0-1
                deploymentconfig=deploymentconfig0
                name=replicationcontroller0
  Annotations:  openshift.io/deployment-config.latest-version: 1
                openshift.io/deployment-config.name: deploymentconfig0
                openshift.io/deployment.name: deploymentconfig0-1
  Containers:
   pause0:
    Image:      gcr.io/google-containers/pause-amd64:3.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Environment:
      ENVVAR1_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR2_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR3_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR4_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
    Mounts:       <none>
  Volumes:        <none>
Events:           <none>

Expected results:

All pods created with no errors.

Additional info:
Can you attach the output of `oc get events`? It's not clear why this should be attributed to routing - if you can't create a deployment, that speaks to other resource issues.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
(In reply to Andrew McDermott from comment #2)
> Can you attach the output of `oc get events`?
>
> It's not clear why this should be attributed to routing - if you can't
> create a deployment that speaks to other resource issues.

% oc logs deploymentconfig0-1-deploy -n svt-1645
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host

^ I missed this on first reading. If your cluster is still up, can you attach:

$ oc get events

and what does:

$ oc get pods -n openshift-ingress
$ oc logs -n openshift-ingress <router-pod-XXX>

show?
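The diagnostics requested above can be collected in one pass. A minimal sketch, assuming a logged-in `oc` session; the output file names and the per-pod loop are illustrative, not from this report:

```shell
# Collect cluster events and router pod logs for attachment to the bug.
# File names are illustrative; requires a logged-in `oc` session.
NS=openshift-ingress
if command -v oc >/dev/null 2>&1; then
  # Cluster-wide events, oldest first, to correlate with the deploy pod failures
  oc get events --all-namespaces --sort-by=.lastTimestamp > events.txt
  # Router pod status, plus logs from each router pod
  oc get pods -n "$NS" > ingress-pods.txt
  for p in $(oc get pods -n "$NS" -o name); do
    oc logs -n "$NS" "$p" > "$(basename "$p").log"
  done
else
  echo "oc not found; run from a host with cluster access" >&2
fi
```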
I have not been able to get back to this exact state; I hit other issues before I could gather more detailed information.
Was able to get to this same error state. Attaching logs of ingress and events as private comments. Here is the information from my new cluster:

% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-05-082458   True        False         108m    Cluster version is 4.6.0-0.nightly-2020-08-05-082458

% oc get pods -A | grep svt | grep Error
svt-1366   deploymentconfig0-1-deploy   0/1   Error   0   8m36s
svt-1383   deploymentconfig0-1-deploy   0/1   Error   0   8m23s
svt-1397   deploymentconfig0-1-deploy   0/1   Error   0   8m9s
svt-1568   deploymentconfig0-1-deploy   0/1   Error   0   5m34s
svt-1585   deploymentconfig0-1-deploy   0/1   Error   0   5m15s
svt-1586   deploymentconfig0-1-deploy   0/1   Error   0   5m15s
svt-1591   deploymentconfig0-1-deploy   0/1   Error   0   5m11s
svt-1592   deploymentconfig0-1-deploy   0/1   Error   0   5m11s
svt-1595   deploymentconfig0-1-deploy   0/1   Error   0   5m4s
svt-1596   deploymentconfig0-1-deploy   0/1   Error   0   5m6s
svt-1597   deploymentconfig0-1-deploy   0/1   Error   0   5m5s
svt-1598   deploymentconfig0-1-deploy   0/1   Error   0   5m6s
svt-1600   deploymentconfig0-1-deploy   0/1   Error   0   4m58s
svt-1602   deploymentconfig0-1-deploy   0/1   Error   0   4m58s
svt-1603   deploymentconfig0-1-deploy   0/1   Error   0   4m58s
svt-1605   deploymentconfig0-1-deploy   0/1   Error   0   4m54s
svt-1607   deploymentconfig0-1-deploy   0/1   Error   0   4m54s
svt-1608   deploymentconfig0-1-deploy   0/1   Error   0   4m53s
svt-1611   deploymentconfig0-1-deploy   0/1   Error   0   4m48s
svt-1615   deploymentconfig0-1-deploy   0/1   Error   0   4m41s
svt-887    deploymentconfig0-1-deploy   0/1   Error   0   16m
svt-897    deploymentconfig0-1-deploy   0/1   Error   0   16m
Looking through the router logs from comment #8 and the events from comment #9, I don't see anything that hints at an ingress problem. It's not clear that ingress is failing, so moving this to SDN, as we see (from comment #4):

% oc logs deploymentconfig0-1-deploy -n svt-1645
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host

which is an internal endpoint.
Changing sub-component to OVN, because these clusters have been using the OVN-specific configuration.
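Since the failure is "no route to host" against the service-network VIP (172.30.0.1), a first check on the OVN side is pod health in the ovn-kubernetes namespace. A hedged sketch; the restart-count filter is illustrative, and the namespace name assumes a standard OVN-Kubernetes deployment:

```shell
# Check OVN-Kubernetes pod health; repeated restarts around the failure
# window would be worth correlating with the failed deploy pods' timestamps.
NS=openshift-ovn-kubernetes
if command -v oc >/dev/null 2>&1; then
  oc get pods -n "$NS" -o wide
  # Show only pods with a nonzero RESTARTS column (column 4 of default output)
  oc get pods -n "$NS" --no-headers | awk '$4 != "0"'
else
  echo "oc not found; run from a host with cluster access" >&2
fi
```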
Please reproduce in a newer build and capture must-gather output.
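For reference, a typical must-gather collection looks like the following; the destination path and archive name are assumptions, not a project convention:

```shell
# Collect a must-gather archive to attach to this bug.
# DEST is an illustrative path; requires a logged-in `oc` session.
DEST="/tmp/must-gather-$(date +%Y%m%d-%H%M%S)"
if command -v oc >/dev/null 2>&1; then
  oc adm must-gather --dest-dir="$DEST"
  # Compress for attachment
  tar czf "${DEST}.tar.gz" -C /tmp "$(basename "$DEST")"
else
  echo "oc not found; run from a host with cluster access" >&2
fi
```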
"No route to host" is an SDN problem, isn't it?