Description of problem:

On some occasions we see OVN failing to start because a secret volume that has optional: true is not mounted, even though both the secret AND its keys existed long before the container was created. For instance, in this test:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/832/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1314209767828230144

Checking the status of the previous execution of the "sbdb" container of the "ovnkube-master-tn7pn" pod in the "openshift-ovn-kubernetes" namespace, I see:

```
{
  "containerID": "cri-o://3be754f81faa64fac7c59af6b60fc66803b0d0ee956ddec6420114c06f77d4a1",
  "image": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
  "imageID": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://91d7441a52750ff525fe077d41618c3b15e2baa27ea6e418dfe676113698c0fc",
      "exitCode": 137,
      "finishedAt": "2020-10-08T15:00:59Z",
      "message": <snip>",
      "reason": "Error",
      "startedAt": "2020-10-08T14:55:43Z"
    }
  },
  "name": "sbdb",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-10-08T15:01:30Z"
    }
  }
}
```

This shows that the start timestamp of the failed execution was 2020-10-08T14:55:43Z.

Checking other pod logs, I see another container consuming the same secret over a minute earlier than that start time:

```
+ [[ ! -f /ovn-cert/tls.key ]]
+ [[ ! -f /ovn-cert/tls.crt ]]
++ date -Iseconds
+ echo '2020-10-08T14:54:14+00:00 - starting ovn-northd'
2020-10-08T14:54:14+00:00 - starting ovn-northd
+ exec ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.8:9641 --ovnsb-db ssl:10.0.0.5:9642,ssl:10.0.0.6:9642,ssl:10.0.0.8:9642 --pidfile /var/run/ovn/ovn-northd.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt
```

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/832/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1314209767828230144/artifacts/e2e-azure-ovn/gather-extra/pods/openshift-ovn-kubernetes_ovnkube-master-fmxt5_northd.log

Checking the journal, I see some MountVolume.SetUp operations that started and succeeded for that pod, but these happen long before the pod's container started, so I believe there is a chance these lines belong to an older execution of the container or to another container in that pod:

$ zcat journal | grep ovn-cert | grep tn7pn | grep 14:5.: | grep MountVol
Oct 08 14:53:35.560532 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560461 1813 reconciler.go:254] Starting operationExecutor.MountVolume for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")
Oct 08 14:53:35.560532 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560500 1813 reconciler.go:269] operationExecutor.MountVolume started for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")
Oct 08 14:53:35.584931 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.584918 1813 operation_generator.go:663] MountVolume.SetUp succeeded for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")

Version-Release number of selected component (if applicable):
master branch; may affect 4.6 as well, I'm not sure.

How reproducible:
Seems to happen randomly and only in a few containers. Many times, if the container is restarted, the restarted container works. I've spotted this a lot in OVN jobs and have not seen it in SDN jobs, but that may be entirely unrelated. It doesn't happen in most containers, but it happens often enough to break the test fairly often.

Steps to Reproduce:
Unknown

Actual results:
The files of a secret that was created a long time ago may not be present.

Expected results:
If a pod is started, all of its volumes that don't have optional: true must be mounted.

Additional info:
I have only seen this problem with secrets; it may affect other kinds of volumes as well. I'm setting priority to high because I don't know whether this affects 4.6 as well or just 4.7.
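For context, here is a minimal sketch of how a secret volume of this kind is typically declared in a pod spec. This is illustrative only: the volume name, secret name, and mount path are taken from the logs above, but the actual cluster-network-operator manifest may differ.

```
# Illustrative fragment of a pod spec; not the real ovnkube-master manifest.
volumes:
  - name: ovn-cert
    secret:
      secretName: ovn-cert
      optional: true            # pod is allowed to start even if the secret is absent
containers:
  - name: sbdb
    volumeMounts:
      - name: ovn-cert
        mountPath: /ovn-cert    # where the startup scripts expect tls.key / tls.crt
```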
This is impacting 4.6 jobs (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-ovn-4.6). Sippy reports a 52% failure rate over the past three days.
Lowering priority and severity because we have a workaround in OVN: making the secret mandatory.
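For clarity, the workaround amounts to dropping optional: true (secret volumes default to optional: false), so the kubelet will not start the container until the secret's contents are actually mounted. A minimal sketch, assuming the same illustrative volume layout as above:

```
volumes:
  - name: ovn-cert
    secret:
      secretName: ovn-cert
      optional: false   # mandatory: the pod stays in ContainerCreating until the secret is mounted
```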
I had a quick look at the journal again: the secretVolumeManager received the data at 14:53:59, but the container was started at 14:55:43 and died at 15:00:59. During those 5 minutes the files were never present in the filesystem.

$ zcat journal | grep secret.go:207 | grep ovn-cert | grep master-1
Oct 08 14:53:35.560817 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560703 1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (0) pieces of data, 0 total bytes
Oct 08 14:53:35.662890 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.662458 1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (0) pieces of data, 0 total bytes
Oct 08 14:53:59.039225 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:59.039111 1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (2) pieces of data, 4099 total bytes
Oct 08 14:54:00.042725 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:54:00.042529 1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (2) pieces of data, 4099 total bytes

{
  "containerID": "cri-o://3be754f81faa64fac7c59af6b60fc66803b0d0ee956ddec6420114c06f77d4a1",
  "image": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
  "imageID": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://91d7441a52750ff525fe077d41618c3b15e2baa27ea6e418dfe676113698c0fc",
      "exitCode": 137,
      "finishedAt": "2020-10-08T15:00:59Z",
      "message": <snip>",
      "reason": "Error",
      "startedAt": "2020-10-08T14:55:43Z"
    }
  },
  "name": "sbdb",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-10-08T15:01:30Z"
    }
  }
}
Since OVN did not exist prior to the 4.6 release, this is not a regression. Moving to 4.7. We will continue to look at it.
Ryan, I don't have anything against moving this to 4.7, but I don't know if this happens only on OVN. We only detected it in OVN because a lot of critical components were accidentally relying on this feature.
*** Bug 1872874 has been marked as a duplicate of this bug. ***