Bug 1882991 - [OVN][OSP] Scale up rhel worker failed
Summary: [OVN][OSP] Scale up rhel worker failed
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-27 09:13 UTC by huirwang
Modified: 2020-10-27 11:33 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 11:33:56 UTC
Target Upstream Version:



Description huirwang 2020-09-27 09:13:12 UTC
Description of problem:
Scaling up a RHEL 7.8 worker in an OSP cluster failed; the RHEL node cannot connect to the OVN database.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-09-26-194704

How reproducible:


Steps to Reproduce:
1. Set up a UPI OSP cluster with the OVN network type.
2. Scale up RHEL workers (see the sketch after these steps).
3. The ovnkube-node pods on the new workers crash with the errors below.
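
For reference, a rough sketch of the scale-up step, assuming the standard openshift-ansible scaleup playbook and an inventory with a new_workers group (playbook path and group name follow the usual RHEL compute-node procedure; hostnames are the ones that appear in the logs below):

# Add the new RHEL hosts to the Ansible inventory (group name per the
# standard openshift-ansible scale-up flow).
cat >> inventory/hosts <<'EOF'
[new_workers]
huir-upg3-xsm9v-rhel-0
huir-upg3-xsm9v-rhel-1
EOF

# Run the scale-up playbook from the openshift-ansible checkout.
ansible-playbook -i inventory/hosts playbooks/scaleup.yml

# Approve any pending CSRs and check that the nodes join.
oc get csr -o name | xargs -r oc adm certificate approve
oc get nodes -o wide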

Actual results:

oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-2pjqx   4/4     Running            0          171m
ovnkube-master-sgfk6   4/4     Running            0          171m
ovnkube-master-v7twv   4/4     Running            1          171m
ovnkube-node-7gj22     2/2     Running            0          171m
ovnkube-node-85n9j     2/2     Running            0          171m
ovnkube-node-88g2b     2/2     Running            0          171m
ovnkube-node-blkdg     2/2     Running            0          149m
ovnkube-node-jprqx     1/2     CrashLoopBackOff   13         60m
ovnkube-node-psllh     2/2     Running            0          150m
ovnkube-node-rnxbh     2/2     Running            0          150m
ovnkube-node-rrktq     1/2     CrashLoopBackOff   13         59m
ovs-node-5s96j         1/1     Running            0          59m
ovs-node-66p5h         1/1     Running            0          60m
ovs-node-6nnj5         1/1     Running            0          171m
ovs-node-gnqdg         1/1     Running            0          171m
ovs-node-mdxcf         1/1     Running            0          150m
ovs-node-p2nhz         1/1     Running            0          149m
ovs-node-zllzn         1/1     Running            0          171m
ovs-node-zrlcv         1/1     Running            0          150m

oc logs ovnkube-node-rrktq  -n openshift-ovn-kubernetes --all-containers
2020-09-27T07:49:11Z|00001|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-27T07:49:11Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2020-09-27T07:49:12Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-27T07:49:12Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-27T07:49:12Z|00005|main|INFO|OVS IDL reconnected, force recompute.
2020-09-27T07:49:12Z|00006|main|INFO|OVNSB IDL reconnected, force recompute.
2020-09-27T07:49:13Z|00007|reconnect|INFO|ssl:192.168.0.74:9642: connecting...
2020-09-27T07:49:14Z|00008|reconnect|INFO|ssl:192.168.0.74:9642: connection attempt timed out
2020-09-27T07:49:14Z|00009|reconnect|INFO|ssl:192.168.0.123:9642: connecting...
2020-09-27T07:49:15Z|00010|reconnect|INFO|ssl:192.168.0.123:9642: connection attempt timed out
2020-09-27T07:49:15Z|00011|reconnect|INFO|ssl:192.168.1.183:9642: connecting...
2020-09-27T07:49:16Z|00012|reconnect|INFO|ssl:192.168.1.183:9642: connection attempt timed out
2020-09-27T07:49:17Z|00013|reconnect|INFO|ssl:192.168.0.74:9642: connecting...
2020-09-27T07:49:18Z|00014|reconnect|INFO|ssl:192.168.0.74:9642: connection attempt timed out
2020-09-27T07:49:18Z|00015|reconnect|INFO|ssl:192.168.0.74:9642: waiting 2 seconds before reconnect
2020-09-27T07:49:20Z|00016|reconnect|INFO|ssl:192.168.0.123:9642: connecting...
2020-09-27T07:49:22Z|00017|reconnect|INFO|ssl:192.168.0.123:9642: connection attempt timed out
2020-09-27T07:49:22Z|00018|reconnect|INFO|ssl:192.168.0.123:9642: waiting 4 seconds before reconnect
2020-09-27T07:49:26Z|00019|reconnect|INFO|ssl:192.168.1.183:9642: connecting...
2020-09-27T07:49:30Z|00020|reconnect|INFO|ssl:192.168.1.183:9642: connection attempt timed out
2020-09-27T07:49:30Z|00021|reconnect|INFO|ssl:192.168.1.183:9642: continuing to reconnect in the background but suppressing further logging

.........
I0927 07:58:04.699421   28160 ovs.go:246] exec(124): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.12131.ctl connection-status
I0927 07:58:04.721485   28160 ovs.go:249] exec(124): stdout: "not connected\n"
I0927 07:58:04.721505   28160 ovs.go:250] exec(124): stderr: ""
I0927 07:58:04.721512   28160 node.go:118] node huir-upg3-xsm9v-rhel-0 connection status = not connected
F0927 07:58:04.721562   28160 ovnkube.go:129] timed out waiting sbdb for node huir-upg3-xsm9v-rhel-0: timed out waiting for the condition
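
To separate a plain reachability problem from a TLS problem, the same checks can be run by hand on the failing node; a hedged sketch (the ovn-controller control socket name and the sbdb IPs are taken from the log above and will differ per node):

# Inside the ovn-controller container of the failing ovnkube-node pod:
ovs-appctl -t /var/run/ovn/ovn-controller.12131.ctl connection-status

# From the node itself, probe the southbound DB endpoints on port 9642;
# a timeout points at network/security-group reachability, while a
# successful TCP connect would put the blame on the TLS layer.
for ip in 192.168.0.74 192.168.0.123 192.168.1.183; do
  timeout 5 bash -c "echo > /dev/tcp/$ip/9642" \
    && echo "$ip:9642 reachable" \
    || echo "$ip:9642 unreachable"
done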


On the RHEL nodes:
sh-4.2# ls -l /etc/kubernetes/cni/net.d/
ls: cannot access /etc/kubernetes/cni/net.d/: No such file or directory
sh-4.2#  ls -l /var/run/multus/cni/net.d/
total 0


Logs on the master nodes:
W0927 08:14:36.778606       1 node_annotations.go:227] macAddress annotation not found for node "huir-upg3-xsm9v-rhel-1" 
W0927 08:14:36.778639       1 node_annotations.go:227] macAddress annotation not found for node "huir-upg3-xsm9v-rhel-1" 
E0927 08:14:36.784997       1 ovn.go:625] k8s.ovn.org/l3-gateway-config annotation not found for node "huir-upg3-xsm9v-rhel-1"
I0927 08:15:14.524368       1 reflector.go:418] k8s.io/client-go/informers/factory.go:135: Watch close - *v1.Endpoints total 23 items received
W0927 08:15:18.272357       1 node_annotations.go:227] macAddress annotation not found for node "huir-upg3-xsm9v-rhel-0" 
W0927 08:15:18.272380       1 node_annotations.go:227] macAddress annotation not found for node "huir-upg3-xsm9v-rhel-0" 
E0927 08:15:18.275426       1 ovn.go:625] k8s.ovn.org/l3-gateway-config annotation not found for node "huir-upg3-xsm9v-rhel-0"
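
These annotations (k8s.ovn.org/l3-gateway-config, the MAC address, and so on) are normally written by ovnkube-node once it starts up, so their absence is consistent with the CrashLoopBackOff above. They can be checked directly; a small sketch (node names taken from the log):

# Dump whatever OVN annotations the new RHEL nodes currently carry.
for node in huir-upg3-xsm9v-rhel-0 huir-upg3-xsm9v-rhel-1; do
  echo "== $node =="
  oc get node "$node" -o yaml | grep 'k8s.ovn.org' || echo "no k8s.ovn.org annotations set"
done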


Expected results:
The RHEL workers scale up successfully.

Additional info:

Comment 2 Ben Bennett 2020-09-28 14:34:51 UTC
Possible dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1882667

Comment 3 Ben Bennett 2020-09-28 14:37:38 UTC
This is an unsupported platform for ovn-kube on 4.5; moving to 4.7 so that we can verify that it doesn't affect 4.6.

Comment 4 Alexander Constantinescu 2020-10-08 13:58:02 UTC
This does not seem to be related to the bug mentioned in comment #2.

The reason it can't connect to the OVN database seems to be that OVS is not running on that node; from

ovs-node-5s96j/ovs-daemons/ovs-daemons/logs/current.log

2020-09-27T03:49:13.750481011-04:00 2020-09-27T07:49:13.128Z|00003|stream_ssl|ERR|SSL_use_certificate_file: error:02001002:system library:fopen:No such file or directory
2020-09-27T03:49:13.750481011-04:00 2020-09-27T07:49:13.128Z|00004|stream_ssl|ERR|SSL_use_PrivateKey_file: error:20074002:BIO routines:FILE_CTRL:system lib
2020-09-27T03:49:13.750877875-04:00 2020-09-27T07:49:13.129Z|00008|stream_ssl|ERR|SSL_use_certificate_file: error:02001002:system library:fopen:No such file or directory
2020-09-27T03:49:13.750877875-04:00 2020-09-27T07:49:13.130Z|00009|stream_ssl|ERR|SSL_use_PrivateKey_file: error:20074002:BIO routines:FILE_CTRL:system lib
2020-09-27T03:49:13.750877875-04:00 2020-09-27T07:49:13.130Z|00010|stream_ssl|ERR|failed to load client certificates from /ovn-ca/ca-bundle.crt: error:140AD002:SSL routines:SSL_CTX_use_certificate_file:system lib

So it seems the /ovn-ca/ca-bundle.crt file is not present in the OVS containers? Hm...
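
A quick way to check that theory is to look for the file inside a running OVS container and in the daemonset spec; a hedged sketch (pod and container names are the ones in the log path above, and the daemonset is assumed to be named ovs-node, as the pod names suggest):

# Is the CA bundle actually present in the running ovs-daemons container?
oc exec -n openshift-ovn-kubernetes ovs-node-5s96j -c ovs-daemons -- ls -l /ovn-ca/ca-bundle.crt

# Is the corresponding volume mounted in the daemonset spec?
oc get daemonset ovs-node -n openshift-ovn-kubernetes -o yaml | grep -n -B2 -A4 'ovn-ca'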

@Huiran: Could you retry with the latest 4.5 nightly and give me a kubeconfig?

/Alex

Comment 5 Alexander Constantinescu 2020-10-08 13:59:08 UTC
Sorry, it is running... but there seems to be an issue with the SSL files, so it can't connect.

