Description of problem: The problem manifested itself by not allowing our automation to deploy our tooling: Warning FailedCreatePodSandBox 4m39s (x572 over 4h35m) kubelet, ip-10-0-144-180.us-west-2.compute.internal (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_scale-ci-mastervertical-mpzjn_scale-ci-tooling_7fd0a212-a03a-49c6-99df-cda39288db59_0(8e03feff97ea1b99584a86f47b8bcda2def64de8917961c85e0b043bed861427): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition Which led us to look at the ovn-master pod, which had screens of `not able to log to /dev/stdout` In the AWS console, the node had a warning that AWS Checks couldn't find the state of the node. I rebooted the node, and it came back up -- previously, it was unreachable even via SSH or debug node. Version-Release number of selected component (if applicable): OCP4.4 How reproducible: N/A
Hi Joe Wondering if you've encountered the issue again during your scale tests? I have not been able to reproduce the /dev/stdout issue, but I assume I am simply not generating enough logs to trigger the problem. /Alex
Hey Alex - I have not seen this recently, we can close for now.
Hi Alexander, I notice the same error with OCP 4.3.5.
As Aakarsh mentioned, we have seen this again : root@ip-172-31-12-199: ~/ovs-policy # oc rsh -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-sb8rp sh-4.2# df -h Filesystem Size Used Avail Use% Mounted on overlay 220G 18G 203G 8% / tmpfs 64M 64M 0 100% /dev shm 64M 0 64M 0% /dev/shm tmpfs 63G 12M 63G 1% /run/ovn /dev/mapper/coreos-luks-root-nocrypt 220G 18G 203G 8% /env tmpfs 63G 8.0K 63G 1% /ovn-cert tmpfs 63G 24K 63G 1% /run/secrets/kubernetes.io/serviceaccount tmpfs 63G 0 63G 0% /proc/acpi tmpfs 63G 0 63G 0% /proc/scsi tmpfs 63G 0 63G 0% /sys/firmware Where /dev has sh-4.2# cd /dev sh-4.2# ls core fd full mqueue null ptmx pts random shm stderr stdin stdout stdout-2020-03-19T18-03-52.057 stdout-2020-03-19T18-03-52.057.gz termination-log tty urandom zero sh-4.2# ls -talrh total 64M drwxrwxrwt. 2 root root 40 Mar 18 16:00 mqueue drwxrwxrwt. 2 root root 40 Mar 18 16:00 shm -rw-rw-rw-. 1 root root 0 Mar 18 16:00 termination-log drwxr-xr-x. 2 root root 0 Mar 18 16:00 pts drwxr-xr-x. 1 root root 92 Mar 18 16:00 .. crw-rw-rw-. 1 root root 1, 5 Mar 18 16:00 zero crw-rw-rw-. 1 root root 1, 9 Mar 18 16:00 urandom crw-rw-rw-. 1 root root 5, 0 Mar 18 16:00 tty lrwxrwxrwx. 1 root root 15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1 lrwxrwxrwx. 1 root root 15 Mar 18 16:00 stdin -> /proc/self/fd/0 lrwxrwxrwx. 1 root root 15 Mar 18 16:00 stderr -> /proc/self/fd/2 crw-rw-rw-. 1 root root 1, 8 Mar 18 16:00 random lrwxrwxrwx. 1 root root 8 Mar 18 16:00 ptmx -> pts/ptmx crw-rw-rw-. 1 root root 1, 3 Mar 18 16:00 null crw-rw-rw-. 1 root root 1, 7 Mar 18 16:00 full lrwxrwxrwx. 1 root root 13 Mar 18 16:00 fd -> /proc/self/fd lrwxrwxrwx. 1 root root 11 Mar 18 16:00 core -> /proc/kcore -rw-------. 1 root root 0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz drwxr-xr-x. 5 root root 400 Mar 19 18:03 . -rw-------. 1 root root 64M Mar 26 18:44 stdout Which the stdout is full of : time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses" time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stdout: \"[]\\n[dynamic]\\n\"" time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stderr: \"\"" time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses" time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stdout: \"\\\"76:1e:ab:82:06:89 10.130.6.136\\\"\\n[dynamic]\\n\"" time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stderr: \"\"" time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): /usr/bin/ovn-nbctl --timeout=15 lsp-set-port-security nodevertical0_nodevert-pod-2670 76:1e:ab:82:06:89 10.130.6.136/23" time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stdout: \"\"" time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stderr: \"\"" time="2020-03-19T18:08:11Z" level=debug msg="setAddressSet for a9236571437960368105 with [10.129.12.84 10.130.10.104 10.128.6.125 10.131.0.51 10.131.12.38 10.128.8.62 10.131.12.70 10.130.10.71 10.130.6.111 10.131.8.126 10.129.6.35 10.130.4.92 10.131.8.93 10.131.6.116 10.130.4.128 10.129.8.55 10.128.12.62 10.128.12.114 10.130.8.128 10.130.10.57 10.129.6.29 10.131.14.35 10.129.8.99 10.128.8.128 10.131.4.135 10.130.8.9 10.128.12.24 10.130.10.78 10.131.10.96 10.128.8.99 10.130.10.24 10.128.6.51 10.129.2.58 10.129.12.74 10.130.4.86 10.128.6.93 10.128.6.138 10.129.2.49 10.131.6.93 10.130.8.66 10.130.10.45 10.128.6.56 10.131.10.67 10.128.6.92 10.128.10.121 10.129.12.124 10.128.6.16 10.131.14.36 10.131.6.60 10.131.12.74 10.128.6.78 10.130.6.80 10.128.2.109 10. 130.10.129 10.130.10.20 10.129.8.133 10.129.12.140 10.130.6.19 10.129.8.63 10.128.12.69 10.131.8.91 10.128.8.94 10.130.10.105 10.130.4.102 10.128.2.13 10.130.8.22 10.131.10.29 10.131.14.34 10.131.14.43 10.128.10.75 10.130.8.120 10.131.4.131 10.128.12.19 10.128.8.59 10.131.6.71 10.128.10.91 10.131.4.90 10.129.8.123 10.128.6.54 10.131.4.91 10.131.0.101 10.129.6.124 10.128.12.31 10.131.12.45 10.131.0.120 10.131.4.121 10.131.0.18 10.131.14.90 10.128.12.111 10.131.14.129 10.128.6.140 10.129.6.33 10.131.10.59 10.130.4.75 10.130.10.97 10.131.6.118 10.131.14.133 10.129.8.21 10.131.4.19 10.128.8.46 10.130.8.84 10.128.8.17 10.130.8.114 10.129.12.128 10.131.14.97 10.131.6.45 10.130.10.74 10.128.12.80 10.129.8.95 10.131.6.120 10.131.10.120 10.130.10.12 10.131.4.96 10.128.10.107 10.128.6.143 10.129.12.66 10.130.4.97 10.131.14.134 10.128.2.128 10.131.14.54 10.129.2.44 10.129.2.55 10.130.10.73 10.131.8.83 10.131.4.104 10.128.10.37 10.129.12.42 10.128.10.53 10.128.2.50 10.128.10.59 10.128.2.63 10.131.14.111 10.129.12 .37 10.131.12.41 10.129.8.58 10.131.10.88 10.131.0.124 10.130.4.134 10.131.10.16 10.131.12.80 10.128.10.77 10.128.6.86 10.130.8.111 10.131.0.125 10.131.14.128 10.130.4.149 10.131.12.67 10.128.6.142 10.130.4.47 10.131.12.68 10.129.2.101 10.128.2.121 10.12 8.6.128 10.130.4.27 10.130.6.55 10.130.4.98 10.130.6.30 10.128.10.84 10.129.2.91 10.128.10.132 10.131.12.134 10.130.8.139 10.131.0.23 10.130.10.50 10.129.2.24 10.128.8.38 10.131.8.70 10.131.12.77 10.131.10.122 10.129.8.127 10.131.6.138 10.130.8.26 10.128 .2.22 10.129.6.58 10.130.4.95 10.129.6.101 10.129.8.12 10.129.8.40 10.129.2.73 10.130.10.95 10.128.8.103 10.131.14.137 10.130.6.25 10.130.8.50 10.128.6.64 10.130.6.105 10.130.6.110 10.130.10.124 10.131.4.124 10.128.8.32 10.128.2.52 10.130.6.70 10.131.0.2 1 10.128.6.48 10.128.6.73 10.130.6.76 10.130.6.87 10.128.2.82 10.128.8.106 10.128.8.10 10.130.8.59 10.130.8.95 10.129.12.102 10.130.8.121 10.129.8.136 10.128.2.20 10.130.10.47 10.129.6.80 10.129.2.94 10.128.2.117 10.130.10.7 10.130.8.25 10.130.4.37 10.12 9.8.88 10.129.2.88 10.128.6.127 10.130.10.18 10.129.8.62 10.131.6.73 10.128.2.91 10.128.6.116 10.131.12.117 10.129.12.112 10.130.4.115 10.129.8.47 10.128.10.20 10.130.4.30 10.129.8.52 10.129.12.79 10.130.10.86 10.131.12.99 10.129.12.105 10.131.14.13 10.1 28.10.131 10.131.14.60 10.131.4.61 10.130.6.85 10.128.6.112 10.129.8.23 10.129.6.104 10.130.8.130 10.131.14.95 10.131.4.43 10.130.8.48 10.131.10.49 10.129.6.47 10.129.8.75 10.128.12.32 10.129.12.26 10.130.10.62 10.130.4.68 10.129.12.103 10.128.12.118 10. 130.10.121 10.128.6.23 10.128.10.69 10.129.8.92 10.131.10.139 10.130.4.38 10.130.6.42 10.131.12.106 10.130.8.116 10.129.8.15 10.129.12.47 10.130.4.53 10.128.10.67 10.131.6.77 10.130.4.120 10.130.6.135 10.131.8.45 10.129.12.70 10.129.2.68 10.129.12.73 10. 131.6.28 10.128.6.106 10.130.6.33 10.131.0.29 10.131.10.17 10.129.8.19 10.128.2.17 10.129.6.55 10.131.6.109 10.128.2.118 10.128.8.12 10.130.8.32 10.131.4.33 10.128.12.42 10.128.10.51 10.130.10.25 10.129.8.28 10.129.6.40 10.131.0.45 10.131.8.100 10.131.4. 119 10.131.4.120 10.130.4.9 10.131.4.34 10.128.12.51 10.128.10.63 10.131.10.68 10.130.10.110 10.130.10.115 10.129.2.15 10.130.10.34 10.131.0.77 10.128.10.27 10.131.14.115 10.130.10.142 10.129.12.9 10.129.12.15 10.129.8.61 10.128.12.11 10.130.10.49 10.131 .14.47 10.131.6.72 10.129.6.75 10.131.6.78 10.131.0.128 10.128.8.132 10.131.10.12 10.128.2.10 10.131.0.52 10.128.6.87 10.131.8.111 10.131.6.124 10.131.14.8 10.128.12.34 10.129.12.56 10.129.2.89 10.129.12.97 10.128.2.98 10.130.10.8 10.130.10.43 10.131.10. .... continues with many, many more IPs.
Joe, can you also comment on number of workers and masters you used? and which build you were trying on wrt comment 5?
@anurag this was with 3 masters, 4 workers and 2 infra nodes. The build was `4.3.5` version
@agopi @Joe, i was not able to reproduce it with 3M-4W on Azure and AWS on 4.3.5. Is this repro always in your case? Also did it reproduce immediately post cluster install or post some testing which might have resulted into this? Any info related to that would be useful.
Hey @anurag, its not immediately after the cluster install, we usually try to create a few pods and then this error seems to occur after a few days at least in my case. Also @Joe did a little digging and found out the following: `/usr/bin/ovnkube --init-master ip-10-0-139-244.us-west-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --metrics-bind-address 0.0.0.0:9102 --sb-address ssl://10.0.139.244:9642,ssl://10.0.149.21:9642,ssl://10.0.171.170:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt` is how the ovnkube master is invoked, and if we take a look at the /run/ovnkube-config/ovnkube.conf file, its looking as follows: ``` [default] mtu="8901" cluster-subnets="10.128.0.0/14/23" encap-port="6081" [kubernetes] service-cidr="172.30.0.0/16" ovn-config-namespace="openshift-ovn-kubernetes" apiserver="https://api-int.awsovn435aakarsh.perf-testing.devcluster.openshift.com:6443" [logging] logfile="/dev/stdout" [gateway] mode=local nodeport=true ``` The main line is the logging.logfile value which we see is written to /dev/stdout, however /dev partition max size is 64MB... and Joe looked at the log rotating and mentioned its supposed to rotate at 100M ( Joe can add more details about it).
Hi guys Thanks for the investigation! Really helpful and I know what's causing this now. I will need to double check how to apply the fix, but at least it's clear what's happening now. FYI: the issue is related to the log-rotation mechanism that ovn-kubernetes is using in 4.3. The fact that it's writing to /dev/stdout which is limited to 64MB triggers log-rotation to happen incorrectly, which are these files we see here in your output: sh-4.2# ls -talrh total 64M lrwxrwxrwx. 1 root root 15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1 -rw-------. 1 root root 0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz -rw-------. 1 root root 64M Mar 26 18:44 stdout The stdout-DATE files indicate that log rotation happened using the internal mechanism of ovn-kubernetes, but as /dev is full it seems it's been done incorrectly. Part of the solution is indeed to apply the same log file as done upstreams, but that's only half the solution needed. The other half is actually already existing in 4.4 and 4.5, but I need to check if we can back-port that to 4.3 -Alex
Did setup a cluster to observe the fix on 4.4 for couple of days
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581