Bug 1538216 - 1/3 of all e2e test runs in Origin are failing with flake on projected annotation e2e
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.0
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-24 16:06 UTC by Clayton Coleman
Modified: 2018-03-28 14:22 UTC
7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:22:32 UTC
Target Upstream Version:
Embargoed:


Attachments
reproducer patch (3.49 KB, patch)
2018-02-13 17:02 UTC, Jan Safranek
filtered-kubelet.log (107.64 KB, text/plain)
2018-02-13 21:45 UTC, Seth Jennings
label-update-flake.log (109.11 KB, text/plain)
2018-02-15 15:16 UTC, Seth Jennings


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes issues 59813 0 None closed High flake rates on Downward API volume update e2e tests 2020-10-09 17:44:07 UTC
Github openshift origin issues 17556 0 None closed [k8s.io] Projected should update annotations/labels tests flaking 2020-10-09 17:44:08 UTC
Github openshift origin issues 17605 0 None closed Flake: [k8s.io] Downward API volume should update annotations on modification [Conformance] [sig-storage] [Suite:openshi... 2020-10-09 17:44:07 UTC
Red Hat Product Errata RHBA-2018:0489 0 None None None 2018-03-28 14:22:52 UTC

Description Clayton Coleman 2018-01-24 16:06:10 UTC
Moving from issue https://github.com/openshift/origin/issues/17556

Projected labels/annotations are not being updated within the 2m timeout

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17357/test_pull_request_origin_extended_conformance_install/3306/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17543/test_pull_request_origin_extended_conformance_install/3315/

This has been occurring for most of the last 3 months, 1/3 of jobs fail due to it, and it may be a real customer issue. Needs immediate triage.

Comment 1 Seth Jennings 2018-01-24 16:36:44 UTC
Looking into it

Comment 2 Seth Jennings 2018-01-24 17:29:10 UTC
It does appear as if the volume reconciler is not seeing the need to update the volume in the flake case:

atomic_writer.go:142] pod e2e-tests-projected-cm8pz/annotationupdatec7cecf13-d666-11e7-b45c-0e6d720a2590 volume podinfo: no update required for target directory ...

In the passing case, I see something like this after the kubelet reports the pod is Running:

atomic_writer.go:145] pod e2e-tests-projected-9gwqx/annotationupdatebf9ad5de-d6bc-11e7-8b0a-0e6b5e8582b8 volume podinfo: write required for target directory ...
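
For context on what drives those two messages: AtomicWriter renders the projected payload from whatever pod object it is handed and skips the write when that payload matches what is already on disk. Below is a minimal, self-contained Go sketch of that decision (illustrative only; this is not the real atomic_writer code, and the helper names are made up):

====
package main

import (
	"fmt"
	"reflect"
)

// renderPayload mimics projecting pod annotations into the volume's file contents.
func renderPayload(annotations map[string]string) map[string]string {
	payload := make(map[string]string)
	for k, v := range annotations {
		payload["annotations/"+k] = v
	}
	return payload
}

func main() {
	// What the volume currently contains (written when the pod was created).
	onDisk := renderPayload(map[string]string{"builder": "bar"})

	stale := map[string]string{"builder": "bar"} // pod object from a not-yet-refreshed dsw
	fresh := map[string]string{"builder": "foo"} // pod object as the podManager already sees it

	for _, tc := range []struct {
		name        string
		annotations map[string]string
	}{{"stale", stale}, {"fresh", fresh}} {
		if reflect.DeepEqual(renderPayload(tc.annotations), onDisk) {
			fmt.Printf("%s pod: no update required\n", tc.name)
		} else {
			fmt.Printf("%s pod: write required\n", tc.name)
		}
	}
}
====

So "no update required" in the flake case means the pod object handed to the writer still carried the old annotations.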

Comment 3 Seth Jennings 2018-01-24 19:13:47 UTC
OK, too many github issues to add one by one. For the record, these are all duplicates of the same underlying issue.

https://github.com/openshift/origin/issues/18136
https://github.com/openshift/origin/issues/17882
https://github.com/openshift/origin/issues/17605 <-- primary downwardapi
https://github.com/openshift/origin/issues/17556 <-- primary projected
https://github.com/openshift/origin/issues/17519

Comment 4 Seth Jennings 2018-01-24 21:45:09 UTC
In case you were wondering "what is the update mechanism for downwardAPI volumes??" see below.

SyncLoop (SYNC)

(kl *Kubelet) syncLoopIteration
  (kl *Kubelet) HandlePodSyncs
    (kl *Kubelet) dispatchWork
      (p *podWorkers) UpdatePod

writes update to pod worker channel

(p *podWorkers) managePodLoop
  (kl *Kubelet) syncPod(o syncPodOptions)
    (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod)
      (dswp *desiredStateOfWorldPopulator) ReprocessPod (removes pod from the set of processed pods)
      (asw *actualStateOfWorld) MarkRemountRequired (sets remountRequired=true on the mounted pod)
      poll on verifyVolumesMountedFunc() until number of unmounted volumes for the pod goes to 0

volumes are mounted by the reconciliation loop

(rc *reconciler) Run
  (rc *reconciler) reconciliationLoopFunc
    (rc *reconciler) reconcile
      for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount()
        (dsw *desiredStateOfWorld) GetVolumesToMount() []VolumeToMount
          copies pod pointer from dsw.volumesToMount[].podsToMount[].pod <--
        (f FilesystemVolumeHandler) MountVolumeHandler
          (oe *operationExecutor) MountVolume
            (og *operationGenerator) GenerateMountVolumeFunc  <-- downwardAPIVolumeMounter is created at this time
              (grm *nestedPendingOperations) Run
                go operationFunc() (which is func returned by GenerateMountVolumeFunc)

mount volume function goroutine

GenerateMountVolumeFunc
====
volumeMounter, newMounterErr := volumePlugin.NewMounter(
		volumeToMount.VolumeSpec,
		volumeToMount.Pod,
		volume.VolumeOptions{})
===
  (b *downwardAPIVolumeMounter) SetUp
    (b *downwardAPIVolumeMounter) SetUpAt
      (w *AtomicWriter) Write

How does the Pod get into dsw.volumesToMount[].podsToMount[]?

(dswp *desiredStateOfWorldPopulator) Run
  (dswp *desiredStateOfWorldPopulator) populatorLoopFunc()
    (dswp *desiredStateOfWorldPopulator) findAndAddNewPods()
      for _, pod := range dswp.podManager.GetPods() <-- pods come from the podManager
        (dswp *desiredStateOfWorldPopulator) processPodVolumes(pod *v1.Pod)
          (dsw *desiredStateOfWorld) AddPodToVolume
	          // Create new podToMount object. If it already exists, it is refreshed with
	          // updated values (this is required for volumes that require remounting on
	          // pod update, like Downward API volumes).
	          dsw.volumesToMount[volumeName].podsToMount[podName] = podToMount{
	          podName:             podName,
	          pod:                 pod,
	          spec:                volumeSpec,
	          outerVolumeSpecName: outerVolumeSpecName,
	          }

populatorLoopFunc() runs every 100ms, so any update in the podManager should be quickly reflected in the dsw.

The theory is that the podManager is not being updated for some reason.

Comment 5 Seth Jennings 2018-01-24 22:26:23 UTC
How are pods added to the pod manager?

(kl *Kubelet) syncLoopIteration
  HandlePodAdditions for ADD and RESTORE
  HandlePodUpdates for UPDATE and DELETE
  HandlePodReconcile for RECONCILE
  converge on UpdatePod
    (pm *basicManager) UpdatePod
      (pm *basicManager) updatePodsInternal (adds pods to internal maps)

We do receive the UPDATE from the annotation change and that proceeds to resync the pod. That happens very quickly (<100ms). It is possible that the dswp will not have updated from the podManager in that time.

However, it should be updated by the next periodic pod SYNC. But atomic_writer still detects no change to the volume even by that point, which is 89s later. The test times out after 120s.

Comment 6 Seth Jennings 2018-01-24 22:33:32 UTC
By comparison, in a log from an e2e run that passed, the volume manager is updated from the podManager _before_ the syncPod() following the UPDATE, and the volume is updated within 2s of the UPDATE being received.

Comment 7 Seth Jennings 2018-01-24 22:44:16 UTC
In my previous comment s/volume manager/dswp/.

In summary, the flake is caused by a combination of two things:

1) The syncPod() following an UPDATE completes "too quickly" before the dswp populator loop can update state from the podManager
2) Subsequent periodic SYNCs on the pod also do not pick up the change (cause unknown)

Comment 8 Seth Jennings 2018-01-25 03:43:21 UTC
I just looked at an upstream result for this e2e test (upstream does not have this flaking issue currently) and they too can hit case 1 (from the comment above), where the syncPod() triggered by the UPDATE happens before the dswp loop can get the update from the pod manager. However, case 2 does not happen, and the volume is updated on the next periodic SYNC.

Comment 9 Seth Jennings 2018-01-25 04:16:11 UTC
I also confirmed that every time this test passes in Origin CI, we are getting lucky and avoiding case 1: the syncPod() happens after the dswp has already updated, and we immediately update the volume.

Comment 10 Seth Jennings 2018-01-26 04:57:36 UTC
I've done a tree-wide diff between kube 1.9.1 and the kube vendored in origin master, and there is no difference that would cause this.

I am unable to reproduce locally.  When I change an annotation/label, the UPDATE sync typically happens too fast to update the volume immediately, but it _is_ updated on the next SYNC.

I'm at a loss here. The only thing I can think of is that one of the volume manager loops is bogged down with something or bailing out early because of interference from a parallel running test. I'm grasping at straws at this point.

I've pretty much decided that it is due to the dsw not being updated by the dsw populator, but I'm not sure how that occurs.

Comment 11 Seth Jennings 2018-01-31 17:08:22 UTC
Tested in parallel in a loop for 3hrs (until my master ran out of memory) and it never flaked.  There is some environment factor here.  Going back to code inspection to find the next question.

Comment 12 Jan Safranek 2018-02-07 10:04:17 UTC
cc: Hemant, he knows something about the volume manager. It's the #1 flake in Origin now.

Comment 13 Seth Jennings 2018-02-13 16:41:31 UTC
Opened an upstream issue, as upstream is also impacted:
https://github.com/kubernetes/kubernetes/issues/59813

Comment 14 Jan Safranek 2018-02-13 16:55:06 UTC
Trying to wrap my brain around this...

The volumeManager's dswp is updated every 100ms. I noticed that dswp.podPreviouslyProcessed() prevents a pod/volume from being added to the dsw (and thus refreshed).

podPreviouslyProcessed is waiting for syncPod() to call dswp.ReprocessPod() to remove a pod from processed pods.

Who is supposed to call syncPod, and how often? The upstream --sync-frequency default is 1 minute; do we override it? I can see syncPod() called at least twice a minute, but it still takes up to 90 seconds to propagate a new annotation to the DownwardAPI volume. Where does this come from? I'll continue tomorrow.

Comment 15 Jan Safranek 2018-02-13 17:02:41 UTC
Created attachment 1395554 [details]
reproducer patch

This patch (for current Kubernetes master) helps me to reproduce the issue a bit. It modifies an annotation and waits until it's propagated into the DownwardAPI volume. It does so 10x in a loop on the same pod, so we don't need to wait for a new pod/namespace to be created. An env. variable can be used to run the loop longer.

Usage (against origin master): 
$ oc cluster up

$ export KUBECONFIG=master/admin.kubeconfig KUBERNETES_PROVIDER=local
$ go run hack/e2e.go  -- -v  --test  --test_args="--host=https://127.0.0.1:8443 --ginkgo.failFast=true --ginkgo.trace=true --ginkgo.focus=Downward.API.volume.should.update.annotations.on.modification"

It shows that it takes 2-92 seconds to propagate an annotation change to the volume. The test has never failed though, it was always propagated in the end. I did not see any 120s timeout there. On average, 10 annotation changes take ~500 seconds (!!!), it should be faster.

Comment 16 Seth Jennings 2018-02-13 17:27:03 UTC
> The test has never failed though, it was always propagated in the end. I did not see any 120s timeout there.

Same here.

> On average, 10 annotation changes take ~500 seconds (!!!), it should be faster.

There are two cases mentioned in comment 7.

Case 1 is the fast case when the dswp loop updates the volume manager before the remount is attempted resulting in a nearly synchronous update of the pod volume.  

Case 2 is where the dswp loop does not update the volume manager in time and the pod is marked as processed again with no update to the volume.  However the subsequent periodic sync (every 1m) will update the volume.

The only ways to make it faster would be to 1) force synchronicity in some way between the pod workers calling syncPod and the volume manager updating the pod volumes, or 2) increase the sync frequency (which has downsides in terms of API server load).

Comment 17 Seth Jennings 2018-02-13 21:25:35 UTC
Thanks for the reproducer patch.

I did this with upstream kube:

export LOG_LEVEL=5
export KUBELET_FLAGS='--sync-frequency=5s'
hack/local-up-cluster.sh

Then I modified the test to time out after 15s (3x the sync frequency).

I have not reproduced it yet, but I did find something potentially interesting in the completion times. I ran the update in a tight loop for 100 iterations. Here is the breakdown of completion times:

38 finished in 1.5s
42 finished in 2.5s
1 finished in 3.5s
3 finished in 7.5s
2 finished in 8.5s
12 finished in 9.5s
2 finished in 10.5s

The test polls at 1s intervals (I modified this) and mounttest outputs the file contents at 2s intervals, so in the worst case we are looking at ~3s of latency between when the volume is updated and when the test observes success.

The 1.5s and 2.5s completions are likely the "fast update" situation.

So one would expect all tests to pass in < 8s (5s sync interval + 3s worst observation latency).  But we see tests that have > 8s completion times.

That means that at least 16 took more than one sync interval to reflect the change.

Comment 18 Seth Jennings 2018-02-13 21:45:50 UTC
Created attachment 1395673 [details]
filtered-kubelet.log

Looking through a filtered kubelet log, however, I don't see any instances where the fast path after the UPDATE is missed but the very next SYNC doesn't update the volume.

That would take the form of UPDATE -> no update required -> SYNC -> no update required -> SYNC -> write required, which is the smoking gun.

Comment 19 Jan Safranek 2018-02-14 14:06:42 UTC
> But we see tests that have > 8s completion times.

I found the root cause for this: the periodic pod sync has variable frequency. podWorkers.wrapUp() schedules the next sync randomly in the range resyncInterval to resyncInterval * 1.5. That would explain up to ~95 seconds in the "normal" case and 11.5 seconds in your hastened environment (7.5s as the worst resync + 2s for the pod to pick up new values + 2s for the e2e test to poll new logs).

I got a stable 60-62 second resync with workerResyncIntervalJitterFactor = 0.00001, which resulted in either a 2s or a ~64s delay in the test.

Still, even with the random jitter, 120 seconds is guaranteed to contain at least one resync...
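
For reference, the jitter itself is simple. A minimal standalone Go sketch (this mirrors the behaviour of wait.Jitter for a positive factor; it is not the kubelet code, and it assumes the default workerResyncIntervalJitterFactor is 0.5, which is what the 1x-1.5x range above implies):

====
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitter mirrors wait.Jitter for a positive maxFactor:
// it returns duration plus a random amount in [0, maxFactor*duration).
func jitter(duration time.Duration, maxFactor float64) time.Duration {
	return duration + time.Duration(rand.Float64()*maxFactor*float64(duration))
}

func main() {
	resync := time.Minute

	// With a jitter factor of 0.5, the next SYNC for a pod lands anywhere
	// in [60s, 90s) after the previous one.
	for i := 0; i < 5; i++ {
		fmt.Println("factor 0.5:    ", jitter(resync, 0.5).Round(time.Second))
	}

	// With the factor squashed to 0.00001 the resync is effectively constant,
	// matching the stable 60-62s observed above.
	for i := 0; i < 5; i++ {
		fmt.Println("factor 0.00001:", jitter(resync, 0.00001).Round(time.Second))
	}
}
====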

Comment 20 Jan Safranek 2018-02-14 14:42:59 UTC
> We do receive the UPDATE from the annotation change and that proceeds to 
> resync the pod. That happens very quickly (<100ms). It is possible that the 
> dswp will not have updated from the podManager in that time.

IMO, that's not possible. Kubelet.HandlePodUpdates() calls kl.podManager.UpdatePod(pod) and then it calls kl.dispatchWork() that asynchronously leads to syncPod(). So when syncPod runs and removes the pod from dswp, PodManager already must have new pod and dswp finds it there.

Comment 21 Jan Safranek 2018-02-14 15:17:46 UTC
Finally, I found it: volumeManager.WaitForAttachAndMount() (called by syncPod()) marks the pod both in the dswp (to refresh the dsw) and in the actualStateOfWorld (to remount it):

        // Some pods expect to have Setup called over and over again to update.
        // Remount plugins for which this is true. (Atomically updating volumes,
        // like Downward API, depend on this to update the contents of the volume).
        vm.desiredStateOfWorldPopulator.ReprocessPod(uniquePodName)
        vm.actualStateOfWorld.MarkRemountRequired(uniquePodName)


The dswp and the reconciler (which re-mounts volumes) run in separate goroutines, and they basically race: either the dswp gets the pod first and updates the pod and its annotations in the dsw (-> 2s delay), or the reconciler gets there first and remounts the volume with the old pod from the not-yet-updated dsw.

Having the root cause does not mean I have a fix. The dswp does not have a reference to the actualStateOfWorld to mark the volume for remount, and putting one there looks ugly... I'll comment in the upstream issue.
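
To make the race concrete, here is a stripped-down, self-contained Go sketch (illustrative only; the types, names and timings are invented, only the structure mirrors the populator/reconciler split described above). The pod manager sees the new annotations immediately, but the remount reads the pod pointer from the dsw, which only the populator's periodic pass refreshes, so a remount fired right after the UPDATE can still see the old pod:

====
package main

import (
	"fmt"
	"sync"
	"time"
)

type pod struct {
	name        string
	annotations map[string]string
}

// podManager stands in for the kubelet pod manager: it always has the latest pod.
type podManager struct {
	mu   sync.Mutex
	pods map[string]*pod
}

func (pm *podManager) UpdatePod(p *pod) {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	pm.pods[p.name] = p
}

func (pm *podManager) GetPods() []*pod {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	out := make([]*pod, 0, len(pm.pods))
	for _, p := range pm.pods {
		out = append(out, p)
	}
	return out
}

// dsw stands in for the desired state of world: it holds the pod pointer the
// mount path (and therefore the atomic writer) will see.
type dsw struct {
	mu          sync.Mutex
	podsToMount map[string]*pod
}

func (d *dsw) addPodToVolume(p *pod) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.podsToMount[p.name] = p // refreshed on every populator pass
}

func (d *dsw) podToMount(name string) *pod {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.podsToMount[name]
}

func main() {
	pm := &podManager{pods: map[string]*pod{}}
	world := &dsw{podsToMount: map[string]*pod{}}

	pm.UpdatePod(&pod{name: "annotationupdate", annotations: map[string]string{"builder": "bar"}})

	// Populator loop: copies pods from the pod manager into the dsw every 100ms.
	go func() {
		for range time.Tick(100 * time.Millisecond) {
			for _, p := range pm.GetPods() {
				world.addPodToVolume(p)
			}
		}
	}()

	time.Sleep(150 * time.Millisecond) // let the populator seed the dsw

	// UPDATE arrives: the pod manager sees the new annotation immediately...
	pm.UpdatePod(&pod{name: "annotationupdate", annotations: map[string]string{"builder": "foo"}})

	// ...but a remount fired right away (the reconciler winning the race)
	// still reads the stale pod from the dsw -> "no update required".
	fmt.Println("remount right after UPDATE sees:", world.podToMount("annotationupdate").annotations["builder"])

	// After the populator's next pass the dsw is refreshed -> "write required".
	time.Sleep(150 * time.Millisecond)
	fmt.Println("remount after next populator pass sees:", world.podToMount("annotationupdate").annotations["builder"])
}
====

Which goroutine wins is pure timing, which is why the same code sometimes updates the volume within ~2s and sometimes waits for the next SYNC.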

Comment 22 Seth Jennings 2018-02-14 15:35:34 UTC
Very interesting about the amount of wiggle in the resyncInterval

> IMO, that's not possible. Kubelet.HandlePodUpdates() calls
> kl.podManager.UpdatePod(pod) and then it calls kl.dispatchWork() that
> asynchronously leads to syncPod(). So when syncPod runs and removes the pod
> from dswp, PodManager already must have new pod and dswp finds it there.

The question is does the dswp loop execute findAndAddNewPods() after the podManager update but before the volume manager reconciliation loop remounts the volumes in response to syncPod running ReprocessPod() and MarkRemountRequired().

Looking at the filtered-kubelet.log I attached, there are situations like this:

I0213 14:40:22.859439   28798 kubelet.go:1822] SyncLoop (UPDATE, "api"): "annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa_e2e-tests-downward-api-z9f7t(e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa)"

I0213 14:40:22.870313   28798 atomic_writer.go:156] pod e2e-tests-downward-api-z9f7t/annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa volume podinfo: no update required for target directory /var/lib/kubelet/pods/e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa/volumes/kubernetes.io~downward-api/podinfo

I0213 14:40:29.832020   28798 kubelet.go:1879] SyncLoop (SYNC): 1 pods; annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa_e2e-tests-downward-api-z9f7t(e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa)

I0213 14:40:29.875804   28798 atomic_writer.go:159] pod e2e-tests-downward-api-z9f7t/annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa volume podinfo: write required for target directory /var/lib/kubelet/pods/e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa/volumes/kubernetes.io~downward-api/podinfo

In that case, the delta-t between the UPDATE and the volume manager remount is about 11ms (870-859), which is less than the dswp loop interval. In this case, the dswp loop didn't update the dsw from the podManager before the volume remount took place, so we had to wait until the next SYNC for the change to be picked up.

Comment 23 Jan Safranek 2018-02-14 16:31:46 UTC
sent a PR upstream, https://github.com/kubernetes/kubernetes/pull/59873

It works well; the test finishes in 2-4 seconds. However, that fixes only the 1st problem from comment #7; we still don't know why the subsequent resync does not pick up the updated pod. There is some chance it will still flake...

Comment 24 Jan Safranek 2018-02-14 16:37:01 UTC
> The question is does the dswp loop execute findAndAddNewPods() after the 
> podManager update

Yes, that's IMO 100% guaranteed.

> but before the volume manager reconciliation loop remounts the volumes in 
> response to syncPod running ReprocessPod() and MarkRemountRequired().

No, that's not guaranteed. The reconciliation loop and the dswp loop run in separate goroutines, and they basically race over who finds the pod first.

Comment 25 Seth Jennings 2018-02-15 15:16:31 UTC
Created attachment 1396550 [details]
label-update-flake.log

> There is some chance it will still flake...

I agree.  In the cases I see in our CI, the subsequent SYNC, which should pick up the change regardless of the race, does not. That leads me to believe that regardless of the race or the test timeout, the test would flake because the volume manager dsw is never being updated for some reason.

Attached is a node log filtered to the pod involved in a label update test. You can see that the SYNC following the UPDATE also results in the atomic writer determining that no update is needed, which means the pod provided to it by the volume manager is not the updated version of the pod.

Comment 26 Jan Safranek 2018-02-15 16:35:25 UTC
Added PR with additional logs: https://github.com/kubernetes/kubernetes/pull/59923

Maybe we should backport it together with the fix from comment #23.

Comment 27 Seth Jennings 2018-02-16 05:10:01 UTC
Origin PR fixing the race and increasing volume manager log levels
https://github.com/openshift/origin/pull/18636

Comment 29 DeShuai Ma 2018-02-26 08:40:23 UTC
Verified on v3.9.0-0.51.0.

Steps:
1. Create a pod with downwardAPI projected volume.
apiVersion: v1
kind: Pod
metadata:
  name: allinone-normal
  labels:
    region: one
  annotations:
    builder: bar
spec:
  containers:
  - name: allinone-normal
    image: docker.io/ocpqe/hello-pod:latest
    resources:
      limits:
        cpu: "500m"
        memory: "256Mi"
    volumeMounts:
    - name: all-in-one
      mountPath: "/all-in-one"
      readOnly: true
  volumes:
  - name: all-in-one
    projected:
      sources:
      - downwardAPI:
          items:
            - path: annotations
              fieldRef:
                fieldPath: metadata.annotations
            - path: mydapi/labels
              fieldRef:
                fieldPath: metadata.labels
            - path: mydapi/name
              fieldRef:
                fieldPath: metadata.name
            - path: mydapi/cpu_limit
              resourceFieldRef:
                containerName: allinone-normal
                resource: limits.cpu
                divisor: "1m"


# oc get po -n dma
NAME              READY     STATUS    RESTARTS   AGE
allinone-normal   1/1       Running   0          20m

2. Check the value in pod
[root@ip-172-18-6-178 dma]# oc exec -it allinone-normal -n dma -- cat /all-in-one/annotations
builder="bar"
kubernetes.io/config.seen="2018-02-26T03:13:42.306863034-05:00"
kubernetes.io/config.source="api"
openshift.io/scc="anyuid"

3. Update the annotations of the pod, then check the volume value in the pod again; the value in the pod updates immediately after the pod annotations are updated.
# oc patch pod allinone-normal -n dma -p '{"metadata":{"annotations":{"builder":"foo"}}}'
pod "allinone-normal" patched
# oc exec -it allinone-normal -n dma -- cat /all-in-one/annotations
builder="foo"
kubernetes.io/config.seen="2018-02-26T03:13:42.306863034-05:00"
kubernetes.io/config.source="api"

Comment 32 errata-xmlrpc 2018-03-28 14:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

