Bug 1886726 - On some rare occasions pods will start without optional secrets created before pod creation
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Harshal Patil
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1872874
Depends On:
Blocks:
 
Reported: 2020-10-09 08:48 UTC by Juan Luis de Sousa-Valadas
Modified: 2020-12-17 20:22 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-11 04:27:05 UTC
Target Upstream Version:
Embargoed:



Description Juan Luis de Sousa-Valadas 2020-10-09 08:48:13 UTC
Description of problem:
On some occasions we're seeing OVN fail to start because a secret volume that has optional: true is not mounted.
Both the secret AND its keys existed long before the container was created.
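
For context, this is roughly how an optional secret volume is declared in a pod spec. This is only a minimal illustrative sketch, not the actual ovnkube-master manifest (which is not reproduced in this bug); the names simply mirror the /ovn-cert path seen in the logs below:
```
# Illustrative sketch only - not the real ovnkube-master manifest.
apiVersion: v1
kind: Pod
metadata:
  name: example-ovn-pod           # hypothetical name
spec:
  containers:
  - name: sbdb
    image: example-image          # hypothetical image
    volumeMounts:
    - name: ovn-cert
      mountPath: /ovn-cert
  volumes:
  - name: ovn-cert
    secret:
      secretName: ovn-cert
      # optional: true allows the pod to start even if the secret does not
      # exist yet; the kubelet is still expected to populate the volume when
      # the secret is (or becomes) available.
      optional: true
```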

For instance, in this test: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/832/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1314209767828230144

Checking the status of the previous execution of the "sbdb" container in the "ovnkube-master-tn7pn" pod in the "openshift-ovn-kubernetes" namespace, I see:
```
                    {
                        "containerID": "cri-o://3be754f81faa64fac7c59af6b60fc66803b0d0ee956ddec6420114c06f77d4a1",
                        "image": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
                        "imageID": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
                        "lastState": {
                            "terminated": {
                                "containerID": "cri-o://91d7441a52750ff525fe077d41618c3b15e2baa27ea6e418dfe676113698c0fc",
                                "exitCode": 137,
                                "finishedAt": "2020-10-08T15:00:59Z",
                                "message": "<snip>",
                                "reason": "Error",
                                "startedAt": "2020-10-08T14:55:43Z"
                            }
                        },
                        "name": "sbdb",
                        "ready": true,
                        "restartCount": 1,
                        "started": true,
                        "state": {
                            "running": {
                                "startedAt": "2020-10-08T15:01:30Z"
                            }
                        }
                    }

```

This shows that the previous execution of the container started at 2020-10-08T14:55:43Z.

Checking the logs of other pods, I see another container consuming the same secret over a minute earlier than the start time above:
```
+ [[ ! -f /ovn-cert/tls.key ]]
+ [[ ! -f /ovn-cert/tls.crt ]]
++ date -Iseconds
+ echo '2020-10-08T14:54:14+00:00 - starting ovn-northd'
2020-10-08T14:54:14+00:00 - starting ovn-northd
+ exec ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.8:9641 --ovnsb-db ssl:10.0.0.5:9642,ssl:10.0.0.6:9642,ssl:10.0.0.8:9642 --pidfile /var/run/ovn/ovn-northd.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt
```

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/832/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1314209767828230144/artifacts/e2e-azure-ovn/gather-extra/pods/openshift-ovn-kubernetes_ovnkube-master-fmxt5_northd.log

Checking the journal, I see MountVolume.SetUp started and succeeded for that volume in that pod, but this happens long before the container started, so I believe there is a chance these lines belong to an older execution of the container or to another container in that pod:
$ zcat journal |  grep ovn-cert | grep tn7pn | grep 14:5.: |grep MountVol
Oct 08 14:53:35.560532 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560461    1813 reconciler.go:254] Starting operationExecutor.MountVolume for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")
Oct 08 14:53:35.560532 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560500    1813 reconciler.go:269] operationExecutor.MountVolume started for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")
Oct 08 14:53:35.584931 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.584918    1813 operation_generator.go:663] MountVolume.SetUp succeeded for volume "ovn-cert" (UniqueName: "kubernetes.io/secret/9b7082ef-1d75-4614-9002-26ca8deb4ae4-ovn-cert") pod "ovnkube-master-tn7pn" (UID: "9b7082ef-1d75-4614-9002-26ca8deb4ae4")


Version-Release number of selected component (if applicable):
master branch; it may affect 4.6 as well, I'm not sure.

How reproducible:
It seems to happen randomly and only in a few containers. Often, if the container is restarted, the restarted container works.

I've spotted this a lot in OVN jobs; I have not seen it in SDN jobs, but that may be entirely unrelated. It doesn't happen in most containers, but it happens often enough to break the test fairly often.

Steps to Reproduce:
Unknown

Actual results:
The files of a secret that was created long before the pod may not be present in the container's mounted volume.

Expected results:
If a pod is started, all of its volumes that don't have optional: true must be mounted, and an optional secret that already existed before the pod was created should have its data present as well.

Additional info:
I have only seen this problem with secrets. It may affect other kinds of volumes as well.

I'm setting priority to high because I don't know if this affects 4.6 as well or just 4.7.
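
Regarding "other kinds of volumes" above: configMap volumes accept the same optional field with the same semantics, so if the root cause is in the kubelet's volume handling rather than in secrets specifically, a stanza like the following (illustrative names) would presumably be affected in the same way:
```
  volumes:
  - name: example-config          # hypothetical configMap volume
    configMap:
      name: example-config        # hypothetical configMap name
      optional: true              # same semantics as an optional secret volume
```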

Comment 1 David Eads 2020-10-09 11:15:14 UTC
This is impacting 4.6 jobs: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-ovn-4.6 .  Sippy reports 52% failure over the past three days.

Comment 2 Juan Luis de Sousa-Valadas 2020-10-09 13:49:20 UTC
Lowering priority and severity because we have a workaround in OVN, which is to make the secret mandatory.
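
For reference, the workaround amounts to making the secret volume required in the pod template. A minimal sketch of the relevant volume stanza (the actual cluster-network-operator change is not linked here):
```
  volumes:
  - name: ovn-cert
    secret:
      secretName: ovn-cert
      # optional defaults to false; making the secret required means the
      # containers are not started until the secret exists and the volume
      # mount succeeds.
      optional: false
```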

Comment 4 Juan Luis de Sousa-Valadas 2020-10-09 15:29:31 UTC
I had a quick look at the journal again: the secret volume manager received the data at 14:53:59, but the container started at 14:55:43 and died at 15:00:59.
During those five minutes the files were never present on the filesystem.

$ zcat journal |  grep secret.go:207 | grep ovn-cert |grep master-1
Oct 08 14:53:35.560817 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.560703    1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (0) pieces of data, 0 total bytes
Oct 08 14:53:35.662890 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:35.662458    1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (0) pieces of data, 0 total bytes
Oct 08 14:53:59.039225 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:53:59.039111    1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (2) pieces of data, 4099 total bytes
Oct 08 14:54:00.042725 ci-op-07tdp2hv-23229-hzznd-master-1 hyperkube[1813]: I1008 14:54:00.042529    1813 secret.go:207] Received secret openshift-ovn-kubernetes/ovn-cert containing (2) pieces of data, 4099 total bytes


                    {
                        "containerID": "cri-o://3be754f81faa64fac7c59af6b60fc66803b0d0ee956ddec6420114c06f77d4a1",
                        "image": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
                        "imageID": "registry.build01.ci.openshift.org/ci-op-07tdp2hv/stable@sha256:a7a3af407006529edd36c1dc94171533ed9b985a017e6da85130b0ed6275a7c8",
                        "lastState": {
                            "terminated": {
                                "containerID": "cri-o://91d7441a52750ff525fe077d41618c3b15e2baa27ea6e418dfe676113698c0fc",
                                "exitCode": 137,
                                "finishedAt": "2020-10-08T15:00:59Z",
                                "message": "<snip>",
                                "reason": "Error",
                                "startedAt": "2020-10-08T14:55:43Z"
                            }
                        },
                        "name": "sbdb",
                        "ready": true,
                        "restartCount": 1,
                        "started": true,
                        "state": {
                            "running": {
                                "startedAt": "2020-10-08T15:01:30Z"
                            }
                        }
                    }

Comment 5 Ryan Phillips 2020-10-09 16:01:53 UTC
Since OVN did not exist prior to the 4.6 release, this is not a regression. Moving to 4.7.

We will continue to look at it.

Comment 6 Juan Luis de Sousa-Valadas 2020-10-09 16:18:37 UTC
Ryan, I don't have anything against moving this to 4.7, but I don't know if this happens only on OVN. We only detected it in OVN because a lot of critical components were accidentally relying on this feature.

Comment 7 Surya Seetharaman 2020-10-12 14:51:31 UTC
*** Bug 1872874 has been marked as a duplicate of this bug. ***

