Bug 1862271

Summary: Can not get deployment due to no route to host
Product: OpenShift Container Platform
Reporter: Paige Rubendall <prubenda>
Component: Networking
Assignee: Aniket Bhat <anbhat>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
CC: aconstan, anbhat, aos-bugs, mfojtik, mifiedle, prubenda, skordas, sttts, xxia
Version: 4.6
Keywords: Reopened
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-09-08 16:38:03 UTC
Type: Bug

Description Paige Rubendall 2020-07-30 20:45:40 UTC
Description of problem:
During pod-density testing, some deployer pods finish in Error state (12 of 2000).

Version-Release number of selected component (if applicable):
% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-25-065959   True        False         29h     Cluster version is 4.6.0-0.nightly-2020-07-25-065959


How reproducible:
100%

Steps to Reproduce:
1. Scale up cluster to 20 worker nodes.
2. Create 2000 projects (200 per node):
  - git clone https://github.com/openshift/svt.git
  - cd svt/openshift_scalability
  - touch test.yaml
  - vim test.yaml

```yaml
projects:
  - num: 2000
    basename: svt-
    templates:
      - num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```

  - cp $KUBECONFIG ~/.kube/config
  - python cluster-loader.py -f test.yaml -p 5
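A minimal Python sketch of step 2. The helper and its file name `make_test_yaml.py` are hypothetical and not part of the svt repo. It prints YAML equivalent to test.yaml, so you don't have to edit the file in vim, and it checks the density arithmetic. It assumes each project contributes two pods: the deployer pod and one pause pod. Under that assumption, 2000 projects over 20 workers gives the "200 per node" figure in step 2.

```python
"""Hypothetical helper (not part of the svt repo): emit the cluster-loader
config used in step 2 and estimate peak pods per worker node."""
import sys

# Assumption: deployment-config-1rep-pause-template.json yields one
# DeploymentConfig with one replica, so each project gets a deployer pod
# plus one pause pod.
PODS_PER_PROJECT = 2


def render_config(num_projects: int, basename: str, template: str) -> str:
    """Return YAML equivalent to the test.yaml shown in step 2."""
    return (
        "projects:\n"
        f"  - num: {num_projects}\n"
        f"    basename: {basename}\n"
        "    templates:\n"
        "      - num: 1\n"
        f"        file: {template}\n"
    )


def pods_per_node(num_projects: int, workers: int) -> float:
    """Pods per worker, assuming projects spread evenly across workers."""
    return num_projects * PODS_PER_PROJECT / workers


if __name__ == "__main__":
    # Usage: python make_test_yaml.py > test.yaml
    sys.stdout.write(render_config(
        2000, "svt-", "./content/deployment-config-1rep-pause-template.json"))
    # Written to stderr so it does not end up inside test.yaml.
    print(f"pods per node: {pods_per_node(2000, 20):.0f}", file=sys.stderr)
```

Run it as `python make_test_yaml.py > test.yaml`, then continue with the `cp $KUBECONFIG` and cluster-loader steps unchanged.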

Actual results:

% oc get pods --all-namespaces | egrep -v "Running|Complete" 
NAMESPACE                                          NAME                                                                 READY   STATUS      RESTARTS   AGE
svt-1645                                           deploymentconfig0-1-deploy                                           0/1     Error       0          124m
svt-1708                                           deploymentconfig0-1-deploy                                           0/1     Error       0          123m
svt-1714                                           deploymentconfig0-1-deploy                                           0/1     Error       0          123m
svt-1750                                           deploymentconfig0-1-deploy                                           0/1     Error       0          123m
svt-1767                                           deploymentconfig0-1-deploy                                           0/1     Error       0          122m
svt-1770                                           deploymentconfig0-1-deploy                                           0/1     Error       0          122m
svt-1797                                           deploymentconfig0-1-deploy                                           0/1     Error       0          122m
svt-1806                                           deploymentconfig0-1-deploy                                           0/1     Error       0          122m
svt-1840                                           deploymentconfig0-1-deploy                                           0/1     Error       0          121m
svt-1916                                           deploymentconfig0-1-deploy                                           0/1     Error       0          120m
svt-1920                                           deploymentconfig0-1-deploy                                           0/1     Error       0          120m
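When there are more than a handful of failures, a small parser can replace the egrep filter and emit one `oc logs` command per failed pod. This is a sketch that assumes the default column layout of `oc get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...). It matches STATUS exactly, which is slightly stricter than egrep's substring match. SAMPLE holds rows copied from the output above.

```python
"""Sketch: find pods that are neither Running nor Completed in
`oc get pods --all-namespaces` output (columns NAMESPACE NAME READY STATUS ...)."""

HEALTHY = {"Running", "Completed"}


def unhealthy_pods(listing: str) -> list[tuple[str, str, str]]:
    """Return (namespace, name, status) for every row whose STATUS is not healthy."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row (if present).
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if status not in HEALTHY:
            rows.append((namespace, name, status))
    return rows


# Rows copied from the listing above (column padding trimmed).
SAMPLE = """\
NAMESPACE  NAME                        READY  STATUS  RESTARTS  AGE
svt-1645   deploymentconfig0-1-deploy  0/1    Error   0         124m
svt-1708   deploymentconfig0-1-deploy  0/1    Error   0         123m
"""

if __name__ == "__main__":
    for namespace, name, status in unhealthy_pods(SAMPLE):
        print(f"oc logs {name} -n {namespace}  # {status}")
```

Because header rows are optional, the parser also handles header-less output such as `oc get pods -A | grep svt`.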

% oc logs deploymentconfig0-1-deploy -n svt-1645   
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host
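`no route to host` is the text Go prints for the EHOSTUNREACH errno returned by connect(2). The deployer therefore never completed a TCP connection to 172.30.0.1:443, so TLS and the API request itself were never reached. The minimal Python sketch below extracts that target from many deployer logs. It assumes the Go `dial <proto> <addr>: <op>: <reason>` format seen above and handles IPv4 addresses only.

```python
"""Sketch: extract the failing connection from a deployer pod log line."""
import re
from typing import NamedTuple, Optional


class DialError(NamedTuple):
    proto: str
    host: str
    port: int
    op: str      # the failing syscall, e.g. "connect"
    reason: str  # errno text, e.g. "no route to host"


# Go formats these errors as 'dial <proto> <addr>: <op>: <reason>'.
# IPv4 only; bracketed IPv6 addresses are not handled here.
_DIAL = re.compile(
    r"dial (?P<proto>tcp|udp) (?P<host>[0-9.]+):(?P<port>\d+): "
    r"(?P<op>\w+): (?P<reason>[^\n]+)"
)


def parse_dial_error(line: str) -> Optional[DialError]:
    """Return the parsed dial error, or None if the line contains none."""
    m = _DIAL.search(line)
    if m is None:
        return None
    return DialError(m["proto"], m["host"], int(m["port"]), m["op"], m["reason"])


# The log line from svt-1645 above.
LOG = ('error: couldn\'t get deployment deploymentconfig0-1: Get '
       '"https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/'
       'deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host')

if __name__ == "__main__":
    print(parse_dial_error(LOG))
```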

% oc get replicationcontrollers -n svt-1645
NAME                  DESIRED   CURRENT   READY   AGE
deploymentconfig0-1   0         0         0       127m


% oc describe replicationcontrollers deploymentconfig0-1 -n svt-1645
Name:         deploymentconfig0-1
Namespace:    svt-1645
Selector:     deployment=deploymentconfig0-1,deploymentconfig=deploymentconfig0,name=replicationcontroller0
Labels:       openshift.io/deployment-config.name=deploymentconfig0
              template=deploymentConfigTemplate
Annotations:  kubectl.kubernetes.io/desired-replicas: 1
              openshift.io/deployer-pod.completed-at: 2020-07-30 18:31:44 +0000 UTC
              openshift.io/deployer-pod.created-at: 2020-07-30 18:31:35 +0000 UTC
              openshift.io/deployer-pod.name: deploymentconfig0-1-deploy
              openshift.io/deployment-config.latest-version: 1
              openshift.io/deployment-config.name: deploymentconfig0
              openshift.io/deployment.phase: Failed
              openshift.io/deployment.replicas: 0
              openshift.io/deployment.status-reason: config change
              openshift.io/encoded-deployment-config:
                {"kind":"DeploymentConfig","apiVersion":"apps.openshift.io/v1","metadata":{"name":"deploymentconfig0","namespace":"svt-1645","selfLink":"/...
Replicas:     0 current / 0 desired
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       deployment=deploymentconfig0-1
                deploymentconfig=deploymentconfig0
                name=replicationcontroller0
  Annotations:  openshift.io/deployment-config.latest-version: 1
                openshift.io/deployment-config.name: deploymentconfig0
                openshift.io/deployment.name: deploymentconfig0-1
  Containers:
   pause0:
    Image:      gcr.io/google-containers/pause-amd64:3.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Environment:
      ENVVAR1_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR2_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR3_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
      ENVVAR4_0:  lF6ipPHq34NJ3TyLTQvuk2QH1qViWmMwjfjwvEeBBMw4sR5y1tTdGtPLTCUow2oB4P1yNcLtTJwXNXhffHlj3Ecni7MRInh3AF50APMRMrUmVohibI5C6OYY0dsHa8PdxUAd6vM7Iq0EA5PyTQHkguTvmMVNsvXtL42htL5soN8xe2aFPYd0tHwV6aG2oMTQI7CkgllhCD0nPhESKxvS7uqj2TNSEYp8aqLBDlvHjjWOT14a7uKb5c2LH1EAii2
    Mounts:       <none>
  Volumes:        <none>
Events:           <none>


Expected results:
All pods created with no errors 

Additional info:

Comment 2 Andrew McDermott 2020-07-31 11:42:32 UTC
Can you attach the output of `oc get events`? 

It's not clear why this should be attributed to routing - if you can't create a deployment that speaks to other resource issues.

Comment 3 Andrew McDermott 2020-07-31 11:48:56 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 4 Andrew McDermott 2020-07-31 11:56:00 UTC
(In reply to Andrew McDermott from comment #2)
> Can you attach the output of `oc get events`? 
> 
> It's not clear why this should be attributed to routing - if you can't
> create a deployment that speaks to other resource issues.

% oc logs deploymentconfig0-1-deploy -n svt-1645   
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host

^ I missed this on first reading.

If your cluster is still up can you attach:

$ oc get events

and what does:

$ oc get pods -n openshift-ingress
$ oc logs -n openshift-ingress <router-pod-XXX>

show?
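The sketch below gathers the requested data in one pass. It assumes a logged-in `oc`, and that the pods in openshift-ingress are single-container router pods, which is the default layout. The function names are illustrative only.

```python
"""Sketch: save events for one project plus every router pod log."""
import subprocess
from pathlib import Path

INGRESS_NS = "openshift-ingress"


def events_cmd(namespace: str) -> list[str]:
    """`oc get events` scoped to one project, e.g. a failing svt-* project."""
    return ["oc", "get", "events", "-n", namespace]


def router_pods_cmd() -> list[str]:
    """List openshift-ingress pods as 'pod/<name>' references."""
    return ["oc", "get", "pods", "-n", INGRESS_NS, "-o", "name"]


def router_log_cmd(pod_ref: str) -> list[str]:
    # Assumes single-container router pods, so no -c flag is needed.
    return ["oc", "logs", "-n", INGRESS_NS, pod_ref]


def log_file_name(pod_ref: str) -> str:
    """'pod/router-default-abc' -> 'router-default-abc.log'."""
    return pod_ref.rsplit("/", 1)[-1] + ".log"


def run(cmd: list[str]) -> str:
    # check=True surfaces oc failures instead of silently writing empty files.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def collect(outdir: Path, namespace: str) -> None:
    """Write events.txt and one log file per router pod under outdir."""
    outdir.mkdir(parents=True, exist_ok=True)
    (outdir / "events.txt").write_text(run(events_cmd(namespace)))
    for pod_ref in run(router_pods_cmd()).split():
        (outdir / log_file_name(pod_ref)).write_text(run(router_log_cmd(pod_ref)))
```

For example, `collect(Path("diag-svt-1645"), "svt-1645")` writes the events of that project and one `<router-pod>.log` per router pod.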

Comment 5 Paige Rubendall 2020-08-04 02:34:30 UTC
I have not been able to get back to this exact state; I hit other issues before I could collect more detailed information.

Comment 8 Paige Rubendall 2020-08-06 15:07:16 UTC
I was able to reproduce the same error state. I am attaching the ingress logs and events as private comments.

Here is the information from my new cluster:

% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-05-082458   True        False         108m    Cluster version is 4.6.0-0.nightly-2020-08-05-082458

% oc get pods -A | grep svt | grep Error                                          
svt-1366                                           deploymentconfig0-1-deploy                                            0/1     Error               0          8m36s
svt-1383                                           deploymentconfig0-1-deploy                                            0/1     Error               0          8m23s
svt-1397                                           deploymentconfig0-1-deploy                                            0/1     Error               0          8m9s
svt-1568                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m34s
svt-1585                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m15s
svt-1586                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m15s
svt-1591                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m11s
svt-1592                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m11s
svt-1595                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m4s
svt-1596                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m6s
svt-1597                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m5s
svt-1598                                           deploymentconfig0-1-deploy                                            0/1     Error               0          5m6s
svt-1600                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m58s
svt-1602                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m58s
svt-1603                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m58s
svt-1605                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m54s
svt-1607                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m54s
svt-1608                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m53s
svt-1611                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m48s
svt-1615                                           deploymentconfig0-1-deploy                                            0/1     Error               0          4m41s
svt-887                                            deploymentconfig0-1-deploy                                            0/1     Error               0          16m
svt-897                                            deploymentconfig0-1-deploy                                            0/1     Error               0          16m

Comment 10 Andrew McDermott 2020-08-06 16:33:29 UTC
Looking through the router logs from comment #8 and the events from comment #9 I don't see anything that hints at an ingress problem.

It's not clear that ingress is failing, so I'm moving this to SDN, as we see (from comment #4):

% oc logs deploymentconfig0-1-deploy -n svt-1645   
error: couldn't get deployment deploymentconfig0-1: Get "https://172.30.0.1:443/api/v1/namespaces/svt-1645/replicationcontrollers/deploymentconfig0-1": dial tcp 172.30.0.1:443: connect: no route to host

which is an internal endpoint.
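Comment 10's point can be checked directly: 172.30.0.1 is a ClusterIP, not a node or pod address. This assumes the cluster kept OpenShift's default 172.30.0.0/16 service network, which you can confirm with `oc get svc kubernetes -n default`. Under that assumption, the `kubernetes` Service that fronts the API server takes the first host address of the range. The deployer pods were therefore failing to reach the API server through a service VIP, which in these clusters is implemented by OVN-Kubernetes. A small Python sketch of the check:

```python
"""Sketch: is an address the API server's `kubernetes` service VIP?"""
import ipaddress

# Assumption: the cluster kept the default OpenShift serviceNetwork.
SERVICE_NETWORK = ipaddress.ip_network("172.30.0.0/16")


def api_service_ip(network: ipaddress.IPv4Network = SERVICE_NETWORK) -> ipaddress.IPv4Address:
    """The `kubernetes` Service takes the first host address of the range."""
    return next(network.hosts())


def is_service_vip(addr: str, network: ipaddress.IPv4Network = SERVICE_NETWORK) -> bool:
    """True if addr lies in the service network (a virtual ClusterIP)."""
    return ipaddress.ip_address(addr) in network


if __name__ == "__main__":
    print(api_service_ip(), is_service_vip("172.30.0.1"))
```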

Comment 11 Paige Rubendall 2020-08-06 16:47:07 UTC
Changing the sub component to ovn-kubernetes because these clusters have been using the OVN-specific configuration.

Comment 14 Luis Sanchez 2020-08-21 13:30:23 UTC
Please reproduce in a newer build and capture must-gather output.
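This minimal sketch captures the requested must-gather. It assumes a logged-in `oc` pointed at the reproducing cluster. The timestamped directory name is a hypothetical convention. `oc adm must-gather --dest-dir` is the standard way to choose where the collected data is written locally.

```python
"""Sketch: run must-gather into a timestamped local directory."""
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def dest_dir(now: Optional[datetime] = None) -> Path:
    """Hypothetical naming scheme: must-gather-YYYYmmddTHHMMSSZ (UTC)."""
    now = now or datetime.now(timezone.utc)
    return Path(f"must-gather-{now:%Y%m%dT%H%M%SZ}")


def must_gather_cmd(dest: Path) -> list[str]:
    # --dest-dir sets the local directory that receives the gathered data.
    return ["oc", "adm", "must-gather", f"--dest-dir={dest}"]


def gather() -> Path:
    """Run must-gather against the current kubeconfig context."""
    dest = dest_dir()
    subprocess.run(must_gather_cmd(dest), check=True)
    return dest
```

`gather()` returns the directory path so the result can be compressed and attached to the bug.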

Comment 16 Stefan Schimanski 2020-08-25 10:26:26 UTC
No route to host is an SDN problem, isn't it?