Bug 1878657

Summary: multus-admission-controller pods are in CrashLoopBackOff state due to a "multus-admission-controller-secret not found" error.
Product: OpenShift Container Platform
Component: Multi-Arch
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: ppc64le
OS: Linux
Reporter: Archana Prabhakar <aprabhak>
Assignee: Dennis Gilmore <dgilmore>
QA Contact: Barry Donahue <bdonahue>
CC: danili, lmcfadde, mjtarsel, mkumatag, mtarsel, pdsilva, pradeep, psundara, sanjum, wilder
Last Closed: 2020-09-18 18:41:39 UTC
Type: Bug
Attachments:
bootstrap gather logs
Rendered assets logs
Resources json
master-1 logs
master-2 logs
master 3 logs
ovs-vswitchd logs on master-2

Description Archana Prabhakar 2020-09-14 09:59:30 UTC
Description of problem:

The latest OCP 4.6 build install is failing during the bootstrap process due to an issue with the openshift-multus/network-metrics-daemon startup.

Version-Release number of selected component (if applicable):
4.6


How reproducible:

The same error was seen with the two builds below.

4.6.0-0.nightly-ppc64le-2020-09-14-031547
4.6.0-0.nightly-ppc64le-2020-09-13-000045

Steps to Reproduce:
1.
2.
3.

Actual results:
 "stderr": "level=info msg=\"Waiting up to 20m0s for the Kubernetes API at https://api.test-arc46.redhat.com:6443...\"\nlevel=info msg=\"API v1.19.0+35ab7c5 up\"\nlevel=info msg=\"Waiting up to 30m0s for bootstrapping to complete...\"\nW0914 02:15:58.095544   14636 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received\nE0914 02:15:59.097186   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:00.098748   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:01.105320   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:02.106922   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:03.108358   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:04.109853   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:05.111416   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:06.113125   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:07.114638   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nE0914 02:16:08.116268   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF\nlevel=info msg=\"Cluster operator network Progressing is True with Deploying: DaemonSet 
\\\"openshift-multus/network-metrics-daemon\\\" is not available (awaiting 3 nodes)\\nDaemonSet \\\"openshift-multus/multus-admission-controller\\\" is not available (awaiting 3 nodes)\"\nlevel=info msg=\"Use the following commands to gather logs from the cluster\"\nlevel=info msg=\"openshift-install gather bootstrap --help\"\nlevel=fatal msg=\"failed to wait for bootstrapping to complete: timed out waiting for the condition\"",
module.install.null_resource.install (remote-exec):     "stderr_lines": [
module.install.null_resource.install (remote-exec):         "level=info msg=\"Waiting up to 20m0s for the Kubernetes API at https://api.test-arc46.redhat.com:6443...\"",
module.install.null_resource.install (remote-exec):         "level=info msg=\"API v1.19.0+35ab7c5 up\"",
module.install.null_resource.install (remote-exec):         "level=info msg=\"Waiting up to 30m0s for bootstrapping to complete...\"",
module.install.null_resource.install (remote-exec):         "W0914 02:15:58.095544   14636 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ConfigMap ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received",
module.install.null_resource.install (remote-exec):         "E0914 02:15:59.097186   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:00.098748   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:01.105320   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:02.106922   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:03.108358   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:04.109853   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:05.111416   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:06.113125   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:07.114638   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "E0914 02:16:08.116268   14636 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get \"https://api.test-arc46.redhat.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0\": EOF",
module.install.null_resource.install (remote-exec):         "level=info msg=\"Cluster operator network Progressing is True with Deploying: DaemonSet \\\"openshift-multus/network-metrics-daemon\\\" is not available (awaiting 3 nodes)\\nDaemonSet \\\"openshift-multus/multus-admission-controller\\\" is not available (awaiting 3 nodes)\"",



Expected results:
The openshift-multus/network-metrics-daemon should start up successfully.


Additional info:

oc get co
NAME                                       VERSION                                     AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                                                              
cloud-credential                                                                       True        False         False      4h1m
cluster-autoscaler                                                                                                          
config-operator                                                                                                             
console                                                                                                                     
csi-snapshot-controller                                                                                                     
dns                                                                                                                         
etcd                                                                                                                        
image-registry                                                                                                              
ingress                                                                                                                     
insights                                                                                                                    
kube-apiserver                                                                                                              
kube-controller-manager                                                                                                     
kube-scheduler                                                                                                              
kube-storage-version-migrator                                                                                               
machine-api                                                                                                                 
machine-approver                                                                                                            
machine-config                                                                                                              
marketplace                                                                                                                 
monitoring                                                                                                                  
network                                    4.6.0-0.nightly-ppc64le-2020-09-14-031547   True        True          False      3h46m
node-tuning                                                                                                                 
openshift-apiserver                                                                                                         
openshift-controller-manager                                                                                                
openshift-samples                                                                                                           
operator-lifecycle-manager                                                                                                  
operator-lifecycle-manager-catalog                                                                                          
operator-lifecycle-manager-packageserver                                                                                    
service-ca                                                                                                                  
storage                                                                                                                     

oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   3h48m   v1.19.0+35ab7c5
master-1   Ready    master   3h47m   v1.19.0+35ab7c5
master-2   Ready    master   3h48m   v1.19.0+35ab7c5

oc get pods --all-namespaces | grep -vi completed | grep -vi running
NAMESPACE                                          NAME                                                      READY   STATUS              RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-844999d7f6-jm4qb             0/1     CrashLoopBackOff    40         3h42m
openshift-authentication-operator                  authentication-operator-777dcb644d-tdkk7                  0/1     CrashLoopBackOff    40         3h42m
openshift-cluster-machine-approver                 machine-approver-548d59dc7-jvsj8                          0/2     ContainerCreating   0          3h42m
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-6f9659cc44-6smkh             0/1     CrashLoopBackOff    44         3h42m
openshift-cluster-storage-operator                 cluster-storage-operator-98795794c-46xxd                  0/1     Error               40         3h42m
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-6c94d485cd-sx79b         0/1     CrashLoopBackOff    40         3h42m
openshift-cluster-version                          cluster-version-operator-56588d8f8d-zs5x4                 0/1     ContainerCreating   0          3h42m
openshift-controller-manager-operator              openshift-controller-manager-operator-74fd7c5476-scxz8    0/1     CrashLoopBackOff    40         3h42m
openshift-dns-operator                             dns-operator-76f898d794-22q8h                             0/2     ContainerCreating   0          3h42m
openshift-etcd-operator                            etcd-operator-655c8f58c7-4d69m                            0/1     CrashLoopBackOff    40         3h42m
openshift-kube-apiserver-operator                  kube-apiserver-operator-5c776b8cf6-kflvc                  0/1     CrashLoopBackOff    40         3h42m
openshift-kube-controller-manager-operator         kube-controller-manager-operator-84b8676ff7-2hwdz         0/1     CrashLoopBackOff    40         3h42m
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-859ccd48db-8z9mq        0/1     CrashLoopBackOff    40         3h42m
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-5995764d6b-xrvdj   0/1     CrashLoopBackOff    40         3h42m
openshift-multus                                   multus-admission-controller-dj7mk                         0/2     ContainerCreating   0          3h25m
openshift-multus                                   multus-admission-controller-pw9rj                         0/2     ContainerCreating   0          3h25m
openshift-multus                                   multus-admission-controller-ztwrp                         0/2     ContainerCreating   0          3h26m
openshift-multus                                   network-metrics-daemon-4b5g4                              0/2     ContainerCreating   0          3h27m
openshift-multus                                   network-metrics-daemon-cgrhz                              0/2     ContainerCreating   0          3h26m
openshift-multus                                   network-metrics-daemon-xgbjd                              0/2     ContainerCreating   0          3h27m
openshift-operator-lifecycle-manager               catalog-operator-7876fb8df9-2dv6x                         0/1     ContainerCreating   0          3h42m
openshift-operator-lifecycle-manager               olm-operator-66cbb8c4c9-p7zxm                             0/1     ContainerCreating   0          3h42m
openshift-service-ca-operator                      service-ca-operator-745cb9db56-ndf69                      0/1     Error               40         3h42m


oc describe pod multus-admission-controller-dj7mk  -n openshift-multus
Name:                 multus-admission-controller-dj7mk
Namespace:            openshift-multus
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 master-1/9.114.98.125
Start Time:           Mon, 14 Sep 2020 02:12:12 -0400
Labels:               app=multus-admission-controller
                      component=network
                      controller-revision-hash=764b48c4f5
                      namespace=openshift-multus
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        DaemonSet/multus-admission-controller
Containers:
  multus-admission-controller:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c1e97d5fccb61d21128deb7fea015485530e778fbf32f5547078f1b7f640f51
    Image ID:      
    Port:          9091/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/webhook
    Args:
      -bind-address=0.0.0.0
      -port=6443
      -tls-private-key-file=/etc/webhook/tls.key
      -tls-cert-file=/etc/webhook/tls.crt
      -alsologtostderr=true
      -metrics-listen-address=127.0.0.1:9091
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
    Environment:  <none>
    Mounts:
      /etc/webhook from webhook-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from multus-token-sl9w7 (ro)
  kube-rbac-proxy:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f947b498107b67350f1725e1044b4997ba47c1bf7788aca64d46077e99df6dc
    Image ID:      
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9091/
      --tls-private-key-file=/etc/webhook/tls.key
      --tls-cert-file=/etc/webhook/tls.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/webhook from webhook-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from multus-token-sl9w7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  webhook-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  multus-admission-controller-secret
    Optional:    false
  multus-token-sl9w7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  multus-token-sl9w7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                      From     Message
  ----     ------       ----                     ----     -------
  Warning  FailedMount  39m (x61 over 3h27m)     kubelet  Unable to attach or mount volumes: unmounted volumes=[webhook-certs], unattached volumes=[webhook-certs multus-token-sl9w7]: timed out waiting for the condition
  Warning  FailedMount  18m (x16 over 3h22m)     kubelet  Unable to attach or mount volumes: unmounted volumes=[webhook-certs], unattached volumes=[multus-token-sl9w7 webhook-certs]: timed out waiting for the condition
  Warning  FailedMount  3m43s (x109 over 3h29m)  kubelet  MountVolume.SetUp failed for volume "webhook-certs" : secret "multus-admission-controller-secret" not found

oc get secrets -n openshift-multus
NAME                            TYPE                                  DATA   AGE
default-token-7bzbf             kubernetes.io/service-account-token   3      3h31m
metrics-daemon-sa-token-sfnmd   kubernetes.io/service-account-token   3      3h31m
multus-token-sl9w7              kubernetes.io/service-account-token   3      3h31m
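
(For reference, a minimal sketch of commands that could confirm the missing secret and point at whatever should have created it. This is not from the original report; it assumes cluster-admin access via the install kubeconfig, and any resource name not shown above, such as the service name, is an assumption.)

# Confirm the secret the webhook-certs volume expects is really absent:
oc -n openshift-multus get secret multus-admission-controller-secret

# Check whether the admission-controller service carries a serving-cert
# annotation that would normally cause the secret to be generated
# (assumption: the secret is a generated serving certificate):
oc -n openshift-multus get service multus-admission-controller -o yaml | grep -i serving-cert

# Look at the cluster network operator, which renders the multus manifests:
oc -n openshift-network-operator logs deployment/network-operator --tail=50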

Comment 2 Archana Prabhakar 2020-09-14 16:08:40 UTC
This issue is seen on the PowerVM environment.

Comment 3 Archana Prabhakar 2020-09-14 16:54:24 UTC
Created attachment 1714830 [details]
bootstrap gather logs

Comment 4 Archana Prabhakar 2020-09-14 16:57:02 UTC
Created attachment 1714831 [details]
Rendered assets logs

Comment 5 Archana Prabhakar 2020-09-14 16:57:46 UTC
Created attachment 1714832 [details]
Resources json

Comment 6 Archana Prabhakar 2020-09-14 16:59:47 UTC
Created attachment 1714834 [details]
master-1 logs

Comment 7 Archana Prabhakar 2020-09-14 17:01:44 UTC
Created attachment 1714835 [details]
master-2 logs

Comment 8 Archana Prabhakar 2020-09-14 17:02:26 UTC
Created attachment 1714836 [details]
master 3 logs

Comment 9 Prashanth Sundararaman 2020-09-15 13:32:29 UTC
Looks like the cluster-policy-controller on the bootstrap is crashing with this exception:

goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00083bc01, 0xc0001da1c0, 0x7b, 0xd1)
	k8s.io/klog/v2.0/klog.go:996 +0xac
k8s.io/klog/v2.(*loggingT).output(0x12fe49e0, 0xc000000003, 0x0, 0x0, 0xc000a137a0, 0x12f03a66, 0x6, 0x37, 0x0)
	k8s.io/klog/v2.0/klog.go:945 +0x1b4
k8s.io/klog/v2.(*loggingT).printDepth(0x12fe49e0, 0xc000000003, 0x0, 0x0, 0x1, 0xc0008dbb30, 0x1, 0x1)
	k8s.io/klog/v2.0/klog.go:718 +0x128
k8s.io/klog/v2.(*loggingT).print(...)
	k8s.io/klog/v2.0/klog.go:703
k8s.io/klog/v2.Fatal(...)
	k8s.io/klog/v2.0/klog.go:1443
github.com/openshift/cluster-policy-controller/pkg/cmd/cluster-policy-controller.NewClusterPolicyControllerCommand.func1(0xc0001c5080, 0xc00087b4a0, 0x0, 0x5)
	github.com/openshift/cluster-policy-controller/pkg/cmd/cluster-policy-controller/cmd.go:55 +0x31c
github.com/spf13/cobra.(*Command).execute(0xc0001c5080, 0xc00087b450, 0x5, 0x5, 0xc0001c5080, 0xc00087b450)
	github.com/spf13/cobra.0/command.go:846 +0x208
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001c4840, 0xc0001c4840, 0x0, 0x0)
	github.com/spf13/cobra.0/command.go:950 +0x294
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra.0/command.go:887
main.main()
	github.com/openshift/cluster-policy-controller/cmd/cluster-policy-controller/main.go:67 +0x298

goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x12fe49e0)
	k8s.io/klog/v2.0/klog.go:1131 +0x78
created by k8s.io/klog/v2.init.0
	k8s.io/klog/v2.0/klog.go:416 +0xe0

goroutine 92 [syscall]:
os/signal.signal_recv(0x0)
	runtime/sigqueue.go:147 +0xf8
os/signal.loop()
	os/signal/signal_unix.go:23 +0x24
created by os/signal.Notify.func1
	os/signal/signal.go:127 +0x4c

goroutine 94 [chan receive]:
k8s.io/apiserver/pkg/server.SetupSignalContext.func1(0xc00096bd40)
	k8s.io/apiserver.0-rc.2/pkg/server/signal.go:48 +0x38
created by k8s.io/apiserver/pkg/server.SetupSignalContext
	k8s.io/apiserver.0-rc.2/pkg/server/signal.go:47 +0xf0

goroutine 95 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x11c185a0, 0x11e8e920, 0xc000802030, 0x1, 0xc00053e060)
	k8s.io/apimachinery.0-rc.2/pkg/util/wait/wait.go:167 +0x120
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x11c185a0, 0x12a05f200, 0x0, 0x1, 0xc00053e060)
	k8s.io/apimachinery.0-rc.2/pkg/util/wait/wait.go:133 +0x90
k8s.io/apimachinery/pkg/util/wait.Until(...)
	k8s.io/apimachinery.0-rc.2/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x11c185a0, 0x12a05f200)
	k8s.io/apimachinery.0-rc.2/pkg/util/wait/wait.go:81 +0x50
created by k8s.io/component-base/logs.InitLogs
	k8s.io/component-base.0-rc.2/logs/logs.go:58 +0x88

Comment 10 Prashanth Sundararaman 2020-09-15 13:48:56 UTC
Ah, it seems that was a side effect; the fatal exception was caused by:

16:14:22.617615       1 cmd.go:55] open /etc/kubernetes/config/cluster-policy-config.yaml: no such file or directory

Comment 11 Archana Prabhakar 2020-09-15 14:05:32 UTC
On the bootstrap node,

[root@bootstrap core]# ls -l /etc/kubernetes/bootstrap-configs
total 16
-rw-r--r--. 1 root root 6635 Sep 14 05:54 kube-apiserver-config.yaml
-rw-r--r--. 1 root root 1341 Sep 14 05:54 kube-controller-manager-config.yaml
-rw-r--r--. 1 root root   81 Sep 14 05:54 kube-scheduler-config.yaml
[root@bootstrap core]# pwd
/var/home/core

Comment 12 Archana Prabhakar 2020-09-15 14:14:58 UTC
[root@bootstrap core]# ls -l /etc/kubernetes/
total 16
drwxr-xr-x. 2 root root  117 Sep 14 05:54 bootstrap-configs
drwxr-xr-x. 2 root root    6 Sep 14 05:53 bootstrap-manifests
drwx------. 2 root root 4096 Sep 15 13:59 bootstrap-secrets
drwxr-xr-x. 2 root root    6 Sep 14 05:54 cloud
drwxr-xr-x. 3 root root   19 Sep  1 03:51 cni
-rw-------. 1 root root 5795 Sep 14 05:54 kubeconfig
-rw-r--r--. 1 root root  146 Sep 14 05:53 kubelet-pause-image-override
drwxr-xr-x. 3 root root   20 Sep 14 05:53 kubelet-plugins
drwxr-xr-x. 2 root root  252 Sep 15 13:59 manifests
drwxr-xr-x. 3 root root   25 Sep 14 05:54 static-pod-resources

Comment 13 Prashanth Sundararaman 2020-09-15 14:39:14 UTC
Ignore the previous analysis; that error occurs but the installation still proceeds past it. What I do see in the master node's sdn pods, though, is concerning:

I0914 06:12:20.799579    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:20.925805    1996 proxier.go:370] userspace proxy: processing 0 service events
I0914 06:12:20.926133    1996 proxier.go:349] userspace syncProxyRules took 49.44885ms
I0914 06:12:21.081209    1996 proxier.go:370] userspace proxy: processing 0 service events
I0914 06:12:21.082842    1996 proxier.go:349] userspace syncProxyRules took 50.442527ms
I0914 06:12:21.306427    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:21.937474    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:22.724906    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:23.707811    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:24.935827    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:26.469006    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:28.382423    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:30.772801    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
I0914 06:12:33.759059    1996 ovs.go:180] Error executing ovs-vsctl: ovs-vsctl: no row "vxlan0" in table Interface
F0914 06:12:33.759135    1996 healthcheck.go:99] SDN healthcheck detected unhealthy OVS server, restarting: plugin is not setup

Comment 14 Prashanth Sundararaman 2020-09-15 18:04:11 UTC
The ovsdb-server fails to start:

Failed to connect to bus: No data available
openvswitch is running in container
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db.
ovsdb-server: /var/run/openvswitch/ovsdb-server.pid: pidfile check failed (No such process), aborting
Starting ovsdb-server ... failed!
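
(Not from the original report: a minimal sketch of node-local checks that separate an ovsdb-server problem from an ovs-vswitchd problem, using the log and database paths already mentioned in this bug.)

# Does ovsdb-server answer at all? ovs-vsctl only needs ovsdb-server:
ovs-vsctl --timeout=5 show

# Does ovs-vswitchd answer control requests?
ovs-appctl -t ovs-vswitchd version

# Daemon state on disk: database file, pidfiles and logs:
ls -l /etc/openvswitch/conf.db /var/run/openvswitch/
tail -n 50 /var/log/openvswitch/ovsdb-server.log /var/log/openvswitch/ovs-vswitchd.log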

Comment 15 Archana Prabhakar 2020-09-16 09:02:33 UTC
Added Mick, a network developer from the POWER team, and squad lead Manju to CC.

Comment 16 Archana Prabhakar 2020-09-16 09:19:02 UTC
Noticing "No space left on device" errors on the interfaces. Below is the data from one of the master nodes. The RHCOS 4.6 images had a 20 GB disk size.

[root@master-2 core]# ovs-vsctl list interface
_uuid               : eea736c0-7b57-4e1e-aa18-dfc6f20900db
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth657700ea to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.9", sandbox="0e899417fbcbd8b49c511fcbe273ccd2ec07520f94298a71024b3ee4b6b79661"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth657700ea
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 01c25001-64ca-44c4-bdaa-fcb63a7d9de1
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device vethf20e8e29 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.15", sandbox="130cab3a6214363bdb4994202336fd267ddd9a4a55d9a4a7fac706806f46f026"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : vethf20e8e29
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 51dae8f1-ccee-4752-af78-1861dd366e30
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth6cec1255 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.6", sandbox=b6bc118838dce0ca5252d8efe90172a354ebeedb4c59158b5da4b03c38f5827a}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth6cec1255
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 15a1fb82-79a2-4b10-9192-b4f09cfad6be
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : []
external_ids        : {}
ifindex             : 25
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : up
lldp                : {}
mac                 : []
mac_in_use          : "1a:81:c6:da:2a:09"
mtu                 : []
mtu_request         : []
name                : vxlan0
ofport              : 1
ofport_request      : 1
options             : {dst_port="4789", key=flow, remote_ip=flow}
other_config        : {}
statistics          : {rx_bytes=0, rx_packets=0, tx_bytes=0, tx_packets=0}
status              : {}
type                : vxlan

_uuid               : 3752b713-ecab-4798-8167-810bdbf8bdfb
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : []
external_ids        : {}
ifindex             : 27
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 1
link_speed          : []
link_state          : up
lldp                : {}
mac                 : []
mac_in_use          : "02:eb:92:6e:4e:87"
mtu                 : 1450
mtu_request         : 1450
name                : tun0
ofport              : 2
ofport_request      : 2
options             : {}
other_config        : {}
statistics          : {collisions=0, rx_bytes=0, rx_crc_err=0, rx_dropped=0, rx_errors=0, rx_frame_err=0, rx_over_err=0, rx_packets=0, tx_bytes=125108, tx_dropped=0, tx_errors=0, tx_packets=2968}
status              : {driver_name=openvswitch}
type                : internal

_uuid               : cf1d8504-e77a-41c4-8bdd-b265c3a2ba67
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth05b2efc4 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.18", sandbox=dce0bbe07172b35471abf89641c28d0ec9dcd4e26fd9f671eaaf03cd6ca62ee0}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth05b2efc4
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 1d1ecee5-7e76-4b50-aa67-a637ecdf3438
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth5f6cb8b8 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.11", sandbox="7bfaf063fb454194c9d02fd74d3bcbb90c6a68967ae28e6a2457b5929bd5e990"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth5f6cb8b8
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : f6bb3176-06fb-492b-9229-ffa0b901c029
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth9dd59021 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.10", sandbox="04af944f504e0dbfa1079b438537c655e4c63b7389266b1ea3083a55b31684cf"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth9dd59021
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : e6c5d907-3957-4d00-961c-baf4e63e2257
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth73694404 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.8", sandbox="197847dbee9666f86dde221c719358228109d39b8444aa3172195eb8ce10dfb1"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth73694404
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 3f5e0158-b9f6-4f2f-8cac-c30c84e96906
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth64602b83 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.12", sandbox="47548db0199e0a59ac7fc86397e9db8877b4318a64ef2b142e268e8153c0f2bc"}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth64602b83
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : cf23fbd8-64ec-4027-8670-4a8bf5dc42bf
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device veth94fb7f81 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.14", sandbox=ab7f309f6c0e65f32a1515685da90189a3d09521036600e1be0b811f5518e999}
ifindex             : []
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 0
link_speed          : []
link_state          : []
lldp                : {}
mac                 : []
mac_in_use          : []
mtu                 : []
mtu_request         : []
name                : veth94fb7f81
ofport              : -1
ofport_request      : []
options             : {}
other_config        : {}
statistics          : {}
status              : {}
type                : ""

_uuid               : 528ed08d-970e-4620-9654-b9dc01752df6
admin_state         : []
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : []
error               : "could not add network device vethc71f3834 to ofproto (No space left on device)"
external_ids        : {ip="10.128.0.13", sandbox="4b5145dcd172516b36a35b377df2f34765841f378b6ce6d58c82fdecb14b9a0a"}
ifindex             : []



Since it complained about space, I retried with a bigger RHCOS 4.6.82 image (120 GB disk) and reran the install with the latest build. I see the same failures discussed in this bug again, and the network interface attach still fails with "No space left on device" as shown above.
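
(An aside that is not in the original report: the ofproto "No space left on device" appears to be an ENOSPC returned while adding the port to the kernel datapath, so it may not indicate a full filesystem at all. A quick sketch to separate the two possibilities, assuming the same br0 bridge shown elsewhere in this bug:)

# Rule out real disk exhaustion first:
df -h / /var /etc/openvswitch

# Then compare what the kernel datapath actually holds with what the
# OVS database thinks br0 should have:
ovs-appctl dpctl/show
ovs-vsctl list-ports br0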

Comment 17 lmcfadde 2020-09-16 18:00:14 UTC
@psundra is there any reason you re-targeted this to 4.7? We need this resolved for 4.6; it is impacting our ability to complete OCP 4.6 testing in sprint 1.

Comment 18 Prashanth Sundararaman 2020-09-16 18:30:18 UTC
@lmcfadde that was a mistake; it should be 4.6. After some debugging and talking to Mick Tarsel, it looks like an OVS issue. He will update with more details.

Comment 19 Mick Tarsel 2020-09-16 19:24:37 UTC
Created attachment 1715123 [details]
ovs-vswitchd logs on master-2

So with Prashanth's help we narrowed down the error to the openshift-sdn pod on master-2. Something appears to be wrong with openvswitch and openflow.

We then looked at the sdn container logs via crictl which showed a lot of...

I0916 17:17:46.160062  193200 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)

(I can attach these logs if needed). Looks like there is a problem executing OpenFlow commands from here:
 https://github.com/openshift/sdn/blob/release-4.6/pkg/network/node/ovs/ovs.go#L155
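
(For anyone reproducing this, a minimal sketch of how the same sdn container logs can be pulled; the container ID and pod name below are placeholders, not values from this report.)

# On the affected master, via the node's container runtime:
crictl ps --name sdn
crictl logs --tail 200 <sdn-container-id>

# Or from a workstation with the cluster kubeconfig:
oc -n openshift-sdn get pods -o wide
oc -n openshift-sdn logs <sdn-pod-on-master-2> -c sdn --tail=200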

So I looked at the ovs-vswitchd logs in /var/log/openvswitch on master-2 which is attached to this comment.

Things start getting funky right after br0 is created (at top of log)

2020-09-16T14:09:44.182Z|00029|bridge|INFO|bridge br0: added interface br0 on port 65534
2020-09-16T14:09:44.183Z|00030|bridge|INFO|bridge br0: using datapath ID 0000e64086f8f744
2020-09-16T14:09:44.183Z|00031|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2020-09-16T14:09:44.291Z|00032|bridge|INFO|bridge br0: added interface vxlan0 on port 1
2020-09-16T14:09:44.325Z|00033|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:09:44.331Z|00034|bridge|INFO|bridge br0: added interface tun0 on port 2

And later on…

2020-09-16T14:10:44.543Z|00042|bridge|INFO|bridge br0: deleted interface tun0 on port 2
2020-09-16T14:10:44.548Z|00043|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:10:44.650Z|00044|bridge|INFO|bridge br0: added interface tun0 on port 2
2020-09-16T14:10:44.655Z|00045|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device
2020-09-16T14:10:44.661Z|00046|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device
2020-09-16T14:10:44.764Z|00047|bridge|INFO|bridge br0: deleted interface br0 on port 65534

Right after br0 is deleted, it’s created again and then it goes into this loop of adding and deleting the tun0 interface from br0 all while vxlan_sys_4789 is still not found.

2020-09-16T14:10:44.854Z|00048|bridge|INFO|bridge br0: added interface br0 on port 65534
2020-09-16T14:10:44.855Z|00001|dpif(revalidator6)|WARN|system@ovs-system: failed to flow_get (No such file or directory) recirc_id(0),dp_hash(0),skb_priority(0),in_port(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00), packets:0, bytes:0, used:never
2020-09-16T14:10:44.855Z|00002|ofproto_dpif_upcall(revalidator6)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (No such file or directory): ufid:4cd769d5-92b6-4acd-80b3-5aa16991b6d8
2020-09-16T14:10:44.858Z|00049|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device
2020-09-16T14:10:44.865Z|00050|bridge|INFO|bridge br0: deleted interface tun0 on port 2
2020-09-16T14:10:44.868Z|00051|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:10:44.959Z|00052|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on tun0 device failed: No such device
2020-09-16T14:10:44.960Z|00053|bridge|WARN|could not add network device tun0 to ofproto (No such device)
2020-09-16T14:10:44.962Z|00054|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device
2020-09-16T14:10:45.078Z|00055|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:10:45.079Z|00056|bridge|INFO|bridge br0: added interface tun0 on port 2
2020-09-16T14:10:45.086Z|00057|bridge|INFO|bridge br0: deleted interface tun0 on port 2
2020-09-16T14:10:45.089Z|00058|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:10:45.205Z|00059|bridge|INFO|bridge br0: added interface tun0 on port 2
2020-09-16T14:10:45.208Z|00060|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device
2020-09-16T14:10:45.294Z|00061|bridge|INFO|bridge br0: deleted interface tun0 on port 2
2020-09-16T14:10:45.300Z|00062|netdev|WARN|failed to set MTU for network device tun0: No such device
2020-09-16T14:10:45.300Z|00063|bridge|INFO|bridge br0: added interface tun0 on port 2
2020-09-16T14:10:45.397Z|00064|bridge|INFO|bridge br0: deleted interface tun0 on port 2
2020-09-16T14:10:45.399Z|00065|bridge|INFO|bridge br0: added interface tun0 on port 2
2020-09-16T14:10:47.905Z|00066|timeval|WARN|Unreasonably long 1048ms poll interval (5ms user, 7ms system)
2020-09-16T14:10:47.905Z|00067|timeval|WARN|context switches: 14 voluntary, 7 involuntary


Looking at the master-2 interfaces, things get weirder. There are no interfaces on br0 at all; the veth interfaces that are present on the machine should be attached to it:

[root@master-2 core]# ovs-vsctl show
def875a0-54d7-4075-9468-84ab195df9ad
    Bridge br0
        fail_mode: secure
        Port br0
            Interface br0
                type: internal
    ovs_version: "2.11.4"

[root@master-2 core]# ip link show | grep veth
29: veth6c5907ed@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
30: vethfd738fd5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
32: vethd2ab6e37@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
33: veth0cac84d7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
34: veth9474b203@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
35: vethad383ec8@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
36: vethf2b7c335@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
37: vetheb944dc6@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
38: veth95cbfe1f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
39: veth026eaffb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
40: veth6affbeee@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
41: vetha6b49861@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
42: vethc168c2be@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
43: veth68364631@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default


And here are the problem interfaces for which ovs said "No such device":

26: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether 1e:d3:73:f0:0a:47 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1cd3:73ff:fef0:a47/64 scope link
       valid_lft forever preferred_lft forever
28: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b2:d6:15:c7:f1:bd brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.1/23 brd 10.128.1.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::b0d6:15ff:fec7:f1bd/64 scope link
       valid_lft forever preferred_lft forever

Also note that tun0 is in UNKNOWN state. 


Dave Wilder on the LTC Networking team at IBM logged in and was not able to add a veth interface to br0. After he restarted the openvswitch service via systemctl, the interfaces were present on br0. 

[root@master-2 openvswitch]# ovs-vsctl show
f64925b1-a493-4dee-976a-80ba5cf705a4
    Bridge br0
        fail_mode: secure
        Port vxlan0
            Interface vxlan0
                type: vxlan
                options: {dst_port="4789", key=flow, remote_ip=flow}
        Port br0
            Interface br0
                type: internal
        Port tun0
            Interface tun0
                type: internal
    ovs_version: "2.13.2"


Looks like a problem with openvswitch.

Comment 20 David J. Wilder 2020-09-16 19:40:27 UTC
> Dave Wilder on the LTC Networking team at IBM logged in and was not able to
> add a veth interface to br0. After he restarted the openvswitch service via
> systemctl, the interfaces were present on br0. 
> 
> [root@master-2 openvswitch]# ovs-vsctl show
> f64925b1-a493-4dee-976a-80ba5cf705a4
>     Bridge br0
>         fail_mode: secure
>         Port vxlan0
>             Interface vxlan0
>                 type: vxlan
>                 options: {dst_port="4789", key=flow, remote_ip=flow}
>         Port br0
>             Interface br0
>                 type: internal
>         Port tun0
>             Interface tun0
>                 type: internal
>     ovs_version: "2.13.2"
> 
> 
> Looks like a problem with openvswitch.

Some clarification on the test I did:

1) Created a veth pair:
# ip link add vethA type veth peer name vethB
2) Attempted to add vethA to br0:
# ovs-vsctl add-port br0 vethA
This hung ... (ctrl-c out)

ovs-vsctl show did not list vethA.

3) Restarted OVS:
# systemctl restart openvswitch

[root@master-2 openvswitch]# ovs-vsctl show
f64925b1-a493-4dee-976a-80ba5cf705a4
     Bridge br0
         fail_mode: secure
         Port vxlan0
             Interface vxlan0
                 type: vxlan
                 options: {dst_port="4789", key=flow, remote_ip=flow}
         Port br0
             Interface br0
                 type: internal
         Port tun0
             Interface tun0
                 type: internal
     ovs_version: "2.13.2"

It appears OVS was not responding until after the reboot, but it still added vethA to the db.

I rebooted the node and observed new interfaces were added to br0.
[root@master-2 core]# ovs-vsctl show
def875a0-54d7-4075-9468-84ab195df9ad
    Bridge br0
        fail_mode: secure
        Port veth5f017582
            Interface veth5f017582
        Port vxlan0
            Interface vxlan0
                type: vxlan
                options: {dst_port="4789", key=flow, remote_ip=flow}
        Port br0
            Interface br0
                type: internal
        Port veth08774018
            Interface veth08774018
        Port vetheb89cc8b
            Interface vetheb89cc8b
        Port vethfded5e2e
            Interface vethfded5e2e
        Port veth888a39e8
            Interface veth888a39e8
        Port tun0
            Interface tun0
                type: internal
        Port veth8878b0bc
            Interface veth8878b0bc
        Port veth93cb26e2
            Interface veth93cb26e2
        Port veth31496ad7
            Interface veth31496ad7
        Port vethfc335352
            Interface vethfc335352
        Port veth8d94567f
            Interface veth8d94567f
        Port vethde565abf
            Interface vethde565abf
        Port vethf3069cbd
            Interface vethf3069cbd
        Port vethf36bb791
            Interface vethf36bb791
        Port veth1d8a33aa
            Interface veth1d8a33aa
        Port vethb1a5133f
            Interface vethb1a5133f
    ovs_version: "2.11.4"
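
(Not part of the original test, but relevant to the hang above: ovs-vsctl blocks by default until ovs-vswitchd confirms the change, so a sketch like the following, reusing the br0/vethA names from the test, can show whether ovs-vswitchd or ovsdb-server is the unresponsive piece.)

# --no-wait makes ovs-vsctl return once ovsdb-server accepts the change,
# without waiting for ovs-vswitchd to apply it:
ovs-vsctl --timeout=5 --no-wait add-port br0 vethA
ovs-vsctl --timeout=5 show

# Ask ovs-vswitchd directly whether it is still servicing requests:
ovs-appctl -t ovs-vswitchd ofproto/list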

Comment 21 Mick Tarsel 2020-09-16 20:24:55 UTC
This looks very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1874696 


Adding ovs-devel to CC as well.

Comment 22 pdsilva 2020-09-17 12:00:20 UTC
I see the following results when installing OCP with the docker.io/prashanths684/openshift-release:4.6-ppc64le-ovs image:
 
#  oc get nodes
NAME       STATUS     ROLES    AGE    VERSION
master-0   NotReady   master   4h3m   v1.19.0+35ab7c5
master-1   NotReady   master   4h4m   v1.19.0+35ab7c5
master-2   NotReady   master   4h3m   v1.19.0+35ab7c5


# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication
cloud-credential                                     True        False         False      4h17m
cluster-autoscaler
config-operator
console
csi-snapshot-controller
dns
etcd
image-registry
ingress
insights
kube-apiserver
kube-controller-manager
kube-scheduler
kube-storage-version-migrator
machine-api
machine-approver
machine-config
marketplace
monitoring
network                                              False       True          True       4h4m
node-tuning
openshift-apiserver
openshift-controller-manager
openshift-samples
operator-lifecycle-manager
operator-lifecycle-manager-catalog
operator-lifecycle-manager-packageserver
service-ca
storage


These are the only pods in Running/ContainerCreating state; the rest of the pods are in Pending state.

# oc get pods --all-namespaces | grep  "Running\|ContainerCrea"
openshift-multus                                   multus-c9f7c                                              1/1     Running             14         3h56m
openshift-multus                                   multus-mwv68                                              1/1     Running             12         3h57m
openshift-multus                                   multus-zksjc                                              1/1     Running             4          3h56m
openshift-multus                                   network-metrics-daemon-bb4m2                              0/2     ContainerCreating   0          3h56m
openshift-multus                                   network-metrics-daemon-cdt5b                              0/2     ContainerCreating   0          3h57m
openshift-multus                                   network-metrics-daemon-rc77h                              0/2     ContainerCreating   0          3h56m
openshift-network-operator                         network-operator-5bfcfc7cb6-fhbm9                         1/1     Running             0          154m
openshift-sdn                                      sdn-controller-c5hmd                                      1/1     Running             0          3h56m
openshift-sdn                                      sdn-controller-d6n9w                                      1/1     Running             0          3h56m
openshift-sdn                                      sdn-controller-gbqxf                                      1/1     Running             0          3h56m


# oc describe pod network-metrics-daemon-bb4m2 -n openshift-multus
Name:           network-metrics-daemon-bb4m2
Namespace:      openshift-multus
Priority:       0
Node:           master-0/9.114.98.137
Start Time:     Thu, 17 Sep 2020 03:38:58 -0400
Labels:         app=network-metrics-daemon
                component=network
                controller-revision-hash=77dd98c48b
                openshift.io/component=network
                pod-template-generation=1
                type=infra
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  DaemonSet/network-metrics-daemon
Containers:
  network-metrics-daemon:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83e81b9a1307b2c4137acb9b0fc940e42c245d5c823cbf4860f1ce41deed050d
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/network-metrics
    Args:
      --node-name
      $(NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  100Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-daemon-sa-token-t95pz (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fdb7e386b43c9f56189e90912b33b1cba9c24a09bde6a3067ff92ef529f71dd
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9091/
      --tls-private-key-file=/etc/metrics/tls.key
      --tls-cert-file=/etc/metrics/tls.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/metrics from metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-daemon-sa-token-t95pz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-daemon-secret
    Optional:    false
  metrics-daemon-sa-token-t95pz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-daemon-sa-token-t95pz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     op=Exists
Events:
  Type     Reason           Age                     From     Message
  ----     ------           ----                    ----     -------
  Warning  NetworkNotReady  159m (x2552 over 4h4m)  kubelet  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
  Warning  FailedMount      96m (x35 over 151m)     kubelet  MountVolume.SetUp failed for volume "metrics-certs" : secret "metrics-daemon-secret" not found
  Warning  NetworkNotReady  106s (x4502 over 151m)  kubelet  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
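
(A minimal sketch, not from the report, to check whether any CNI configuration was ever written on the node and whether the sdn node pods exist at all; the node, path, and namespace names are the ones referenced above.)

# On the node the pod is scheduled to:
ssh core@master-0 'ls -l /etc/kubernetes/cni/net.d/'

# From a workstation with cluster access:
oc -n openshift-sdn get daemonset
oc -n openshift-sdn get pods -o wide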


# ssh core@master-1 'ip a'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: env32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:f3:6e:61 brd ff:ff:ff:ff:ff:ff
    inet 9.114.98.247/22 brd 9.114.99.255 scope global dynamic noprefixroute env32
       valid_lft 12803sec preferred_lft 12803sec
    inet6 fe80::f816:3eff:fef3:6e61/64 scope link noprefixroute
       valid_lft forever preferred_lft forever


Restarted the nodes to see whether the results changed, but that did not help.

Comment 23 Prashanth Sundararaman 2020-09-18 18:41:39 UTC
After doing some investigation with Mick and talking to the SDN team, they suggested it might be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1874696. Today I built a patched ppc64le image with these two PRs:

https://github.com/openshift/machine-config-operator/pull/2102
https://github.com/openshift/cluster-network-operator/pull/785

Archana was able to test the image and the install succeeded on PowerVM. Marking this as a duplicate of bug 1874696.

*** This bug has been marked as a duplicate of bug 1874696 ***