Moving from issue https://github.com/openshift/origin/issues/17556:

Projected labels/annotations are not being updated within the 2m timeout.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17357/test_pull_request_origin_extended_conformance_install/3306/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17543/test_pull_request_origin_extended_conformance_install/3315/

This has been occurring for most of the last 3 months, roughly 1 in 3 jobs fail because of it, and it may be a real customer issue. Needs immediate triage.
Looking into it
It does appear as if the volume reconciler is not seeing the need to update the volume in the flake case:

atomic_writer.go:142] pod e2e-tests-projected-cm8pz/annotationupdatec7cecf13-d666-11e7-b45c-0e6d720a2590 volume podinfo: no update required for target directory ...

In the passing case, I see something like this after the kubelet reports the pod is Running:

atomic_writer.go:145] pod e2e-tests-projected-9gwqx/annotationupdatebf9ad5de-d6bc-11e7-8b0a-0e6b5e8582b8 volume podinfo: write required for target directory ...
OK, too many GitHub issues to add one by one. For the record, these are all dups of the same underlying issue:

https://github.com/openshift/origin/issues/18136
https://github.com/openshift/origin/issues/17882
https://github.com/openshift/origin/issues/17605 <-- primary downwardapi
https://github.com/openshift/origin/issues/17556 <-- primary projected
https://github.com/openshift/origin/issues/17519
In case you were wondering "what is the update mechanism for downwardAPI volumes??", see below.

SyncLoop (SYNC)
(kl *Kubelet) syncLoopIteration
(kl *Kubelet) HandlePodSyncs
(kl *Kubelet) dispatchWork
(p *podWorkers) UpdatePod
    writes update to pod worker channel
(p *podWorkers) managePodLoop
(kl *Kubelet) syncPod(o syncPodOptions)
(vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod)
    (dswp *desiredStateOfWorldPopulator) ReprocessPod (removes pod from the set of processed pods)
    (asw *actualStateOfWorld) MarkRemountRequired (sets remountRequired=true on the mounted pod)
    poll on verifyVolumesMountedFunc() until the number of unmounted volumes for the pod goes to 0

Volumes are mounted by the reconciliation loop:
(rc *reconciler) Run
(rc *reconciler) reconciliationLoopFunc
(rc *reconciler) reconcile
    for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount()
        (dsw *desiredStateOfWorld) GetVolumesToMount() []VolumeToMount
            copies pod pointer from dsw.volumesToMount[].podsToMount[].pod <--
    (f FilesystemVolumeHandler) MountVolumeHandler
    (oe *operationExecutor) MountVolume
    (og *operationGenerator) GenerateMountVolumeFunc <-- downwardAPIVolumeMounter is created at this time
    (grm *nestedPendingOperations) Run
        go operationFunc() (which is the func returned by GenerateMountVolumeFunc)

Mount volume function goroutine:
GenerateMountVolumeFunc
====
volumeMounter, newMounterErr := volumePlugin.NewMounter(
    volumeToMount.VolumeSpec,
    volumeToMount.Pod,
    volume.VolumeOptions{})
====
(b *downwardAPIVolumeMounter) SetUp
(b *downwardAPIVolumeMounter) SetUpAt
(w *AtomicWriter) Write

How does the Pod get into dsw.volumesToMount[].podsToMount[]?

(dswp *desiredStateOfWorldPopulator) Run
(dswp *desiredStateOfWorldPopulator) populatorLoopFunc()
(dswp *desiredStateOfWorldPopulator) findAndAddNewPods()
    for _, pod := range dswp.podManager.GetPods() <-- pods come from the podManager
    (dswp *desiredStateOfWorldPopulator) processPodVolumes(pod *v1.Pod)
    (dsw *desiredStateOfWorld) AddPodToVolume

// Create new podToMount object. If it already exists, it is refreshed with
// updated values (this is required for volumes that require remounting on
// pod update, like Downward API volumes).
dsw.volumesToMount[volumeName].podsToMount[podName] = podToMount{
    podName:             podName,
    pod:                 pod,
    spec:                volumeSpec,
    outerVolumeSpecName: outerVolumeSpecName,
}

populatorLoopFunc() runs every 100ms, so any update in the podManager should be quickly reflected in the dsw. The theory is that the podManager is not being updated for some reason.
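To make the populator side concrete, here is a standalone toy sketch of the pattern described above: a loop that fires every 100ms and refreshes the pod object stored in the dsw from whatever the pod source currently holds. The types and names (pod, populator, podSource, desiredState) are invented for illustration and are not the real kubelet types; only the wait.Until loop shape mirrors the actual populator.

package main

import (
    "fmt"
    "sync"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// pod is a stand-in for *v1.Pod; only the annotations matter here.
type pod struct {
    name        string
    annotations map[string]string
}

// populator is a toy desiredStateOfWorldPopulator.
type populator struct {
    mu           sync.Mutex
    podSource    func() []pod   // stands in for podManager.GetPods()
    desiredState map[string]pod // stands in for dsw.volumesToMount[].podsToMount[]
}

// findAndAddNewPods refreshes the stored pod objects so that a later remount
// sees up-to-date annotations/labels.
func (p *populator) findAndAddNewPods() {
    p.mu.Lock()
    defer p.mu.Unlock()
    for _, pd := range p.podSource() {
        p.desiredState[pd.name] = pd // existing entries are simply overwritten ("refreshed")
    }
}

func main() {
    current := pod{name: "demo", annotations: map[string]string{"builder": "bar"}}
    p := &populator{
        podSource:    func() []pod { return []pod{current} },
        desiredState: map[string]pod{},
    }

    stopCh := make(chan struct{})
    go wait.Until(p.findAndAddNewPods, 100*time.Millisecond, stopCh) // same loop shape as the real populator

    time.Sleep(250 * time.Millisecond)
    p.mu.Lock()
    fmt.Println("dsw copy of pod annotations:", p.desiredState["demo"].annotations)
    p.mu.Unlock()
    close(stopCh)
}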
How are pods added to the pod manager?

(kl *Kubelet) syncLoopIteration
    HandlePodAdditions for ADD and RESTORE
    HandlePodUpdates for UPDATE and for DELETE
    HandlePodReconcile for RECONCILE
These converge on UpdatePod:
(pm *basicManager) UpdatePod
(pm *basicManager) updatePodsInternal (adds pods to the internal maps)

We do receive the UPDATE from the annotation change, and that proceeds to resync the pod. That happens very quickly (<100ms). It is possible that the dswp will not have updated from the podManager in that time. However, it should be updated by the next periodic pod SYNC. But atomic_writer is still detecting no change to the volume even by that point, which is 89s later. The test times out after 120s.
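For reference, a toy model of what that last step amounts to; the types and map names below are invented (the real basicManager stores *v1.Pod pointers), but the point is the same: an UPDATE simply overwrites the stored pod object, so any later reader of the pod manager sees the new annotations.

package main

import "fmt"

type podRecord struct {
    uid         string
    fullName    string // "name_namespace"
    annotations map[string]string
}

type basicManager struct {
    podByUID      map[string]*podRecord
    podByFullName map[string]*podRecord
}

// updatePodsInternal overwrites the entries in the internal maps.
func (pm *basicManager) updatePodsInternal(pods ...*podRecord) {
    for _, p := range pods {
        pm.podByUID[p.uid] = p
        pm.podByFullName[p.fullName] = p
    }
}

func main() {
    pm := &basicManager{podByUID: map[string]*podRecord{}, podByFullName: map[string]*podRecord{}}

    old := &podRecord{uid: "u1", fullName: "annotationupdate_e2e-tests", annotations: map[string]string{"builder": "bar"}}
    pm.updatePodsInternal(old)

    updated := &podRecord{uid: "u1", fullName: "annotationupdate_e2e-tests", annotations: map[string]string{"builder": "foo"}}
    pm.updatePodsInternal(updated) // the UPDATE from the API server lands here

    fmt.Println(pm.podByUID["u1"].annotations["builder"]) // "foo"
}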
By comparison, in a log from an e2e run that passed, the volume manager is updated from the podManager _before_ the syncPod() following the UPDATE, and the volume is updated within 2s of the UPDATE being received.
In my previous comment, s/volume manager/dswp/.

In summary, the flake is caused by a combination of two things:
1) The syncPod() following an UPDATE completes "too quickly", before the dswp populator loop can update its state from the podManager.
2) Subsequent periodic SYNCs on the pod also do not pick up the change (cause unknown).
I just looked at an upstream result for this e2e test (upstream does not have this flaking issue currently) and they too can hit case 1 (from the comment above), where the syncPod() triggered by the UPDATE happens before the dswp loop can get an update from the pod manager. However, case 2 does not happen for them and the volume is updated on the next periodic SYNC.
I also confirmed that every time this passes in Origin CI, we are getting lucky and avoiding case 1. The syncPod() happens after the dswp has already updated, and we immediately update the volume.
I've done a tree-wide diff between kube 1.9.1 and the kube vendored in origin master, and there is no difference that would cause this. I am unable to reproduce locally. When I change an annotation/label, the UPDATE sync typically happens too fast to update the volume immediately, but it _is_ updated on the next SYNC. I'm at a loss here. The only thing I can think of is that one of the volume manager loops is bogged down with something or bailing out early because of interference from a parallel running test. I'm grasping at straws at this point. I've pretty much decided that this is due to the dsw not being updated by the dsw populator, but I'm not sure how that occurs.
Tested in parallel in a loop for 3hrs (until my master ran out of memory) and it never flaked. There is some environment factor here. Going back to code inspection to find the next question.
cc: Hemant, he knows something about the volume manager. It's the #1 flake in Origin now.
Opened an upstream issue, as upstream is also impacted: https://github.com/kubernetes/kubernetes/issues/59813
Trying to wrap my brain around this... The volumeManager's dswp is updated every 100ms. I noticed that dswp.podPreviouslyProcessed() prevents a pod/volume from being added to the dsw (and thus refreshed). podPreviouslyProcessed is waiting for syncPod() to call dswp.ReprocessPod() to remove the pod from the set of processed pods. Who is supposed to call syncPod and how often? Upstream --sync-frequency is 1 minute; do we override it? I can see syncPod() called at least twice in a minute, but it still takes up to 90 seconds to propagate a new annotation to the DownwardAPI volume. Where does that come from? I'll continue tomorrow.
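A minimal standalone sketch of that gate, with invented names and a plain map standing in for the processed-pods set: processing skips a pod that is already marked processed, so its dsw entry is not refreshed until something calls ReprocessPod() to clear the mark.

package main

import "fmt"

type populator struct {
    processedPods map[string]bool
    dswPods       map[string]string // podName -> annotation value copied into the dsw
}

func (p *populator) processPodVolumes(podName, annotation string) {
    if p.processedPods[podName] {
        return // gate: pod already processed, the stale copy stays in the dsw
    }
    p.dswPods[podName] = annotation
    p.processedPods[podName] = true
}

// ReprocessPod is the toy counterpart of what syncPod (via WaitForAttachAndMount)
// triggers to force the next populator pass to refresh the pod.
func (p *populator) ReprocessPod(podName string) {
    delete(p.processedPods, podName)
}

func main() {
    p := &populator{processedPods: map[string]bool{}, dswPods: map[string]string{}}

    p.processPodVolumes("demo", "builder=bar")
    p.processPodVolumes("demo", "builder=foo") // ignored: gate is closed
    fmt.Println(p.dswPods["demo"])             // builder=bar

    p.ReprocessPod("demo")
    p.processPodVolumes("demo", "builder=foo") // now refreshed
    fmt.Println(p.dswPods["demo"])             // builder=foo
}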
Created attachment 1395554 [details]
reproducer patch

This patch (for current Kubernetes master) helps me reproduce the issue a bit. It modifies an annotation and waits until it's propagated into the DownwardAPI volume. It does so 10x in a loop on the same pod, so we don't need to wait for a new pod/namespace to be created. An env. variable can be used to run the loop longer.

Usage (against origin master):

$ oc cluster up
$ export KUBECONFIG=master/admin.kubeconfig KUBERNETES_PROVIDER=local
$ go run hack/e2e.go -- -v --test --test_args="--host=https://127.0.0.1:8443 --ginkgo.failFast=true --ginkgo.trace=true --ginkgo.focus=Downward.API.volume.should.update.annotations.on.modification"

It shows that it takes 2 - 92 seconds to propagate an annotation change to the volume. The test has never failed though; it was always propagated in the end. I did not see any 120s timeout there. On average, 10 annotation changes take ~500 seconds (!!!); it should be faster.
> The test has never failed though, it was always propagated in the end. I did not see any 120s timeout there.

Same here.

> On average, 10 annotation changes take ~500 seconds (!!!), it should be faster.

There are two cases mentioned in comment 7. Case 1 is the fast case, when the dswp loop updates the volume manager before the remount is attempted, resulting in a nearly synchronous update of the pod volume. Case 2 is where the dswp loop does not update the volume manager in time and the pod is marked as processed again with no update to the volume. However, the subsequent periodic sync (every 1m) will update the volume.

The only way to make it faster would be to 1) force synchronicity in some way between the pod workers calling syncPod and the volume manager updating the pod volumes, or 2) increase the sync frequency (this has downsides in terms of API server load).
Thanks for the recreate patch. I did this with upstream kube:

export LOG_LEVEL=5
export KUBELET_FLAGS='--sync-frequency=5s'
hack/local-up-cluster.sh

Then I modified the test to time out after 15s (3x the sync frequency). I have not recreated it yet, but I did find something potentially interesting in the completion times. I ran the update in a tight loop for 100 iterations. Here is the breakdown of completion times:

38 finished in 1.5s
42 finished in 2.5s
1 finished in 3.5s
3 finished in 7.5s
2 finished in 8.5s
12 finished in 9.5s
2 finished in 10.5s

The test polls at 1s intervals (I modified this) and mounttest outputs the file contents at 2s intervals, so worst case we are looking at ~3s of latency between when the volume is updated and when the test observes success. The 1.5s and 2.5s completions are likely the "fast update" situation. So one would expect all tests to pass in < 8s (5s sync interval + 3s worst observation latency). But we see tests that have > 8s completion times. That means that at least 16 took more than one sync interval to reflect the change.
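A quick back-of-the-envelope check of that bound; this is just the arithmetic from the paragraph above written out, with the intervals as stated assumptions, not measured data.

package main

import (
    "fmt"
    "time"
)

func main() {
    syncFrequency := 5 * time.Second   // kubelet --sync-frequency used in this run
    testPoll := 1 * time.Second        // how often the modified e2e test checks for the update
    mounttestOutput := 2 * time.Second // how often mounttest re-prints the file contents

    worstObservation := testPoll + mounttestOutput
    fmt.Println("expected worst-case completion:", syncFrequency+worstObservation) // 8s
}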
Created attachment 1395673 [details]
filtered-kubelet.log

Looking through a filtered kubelet log, however, I don't see any instances where the fast path after the UPDATE is missed but the very next SYNC doesn't update the volume. That would take the form of UPDATE -> no update required -> SYNC -> no update required -> SYNC -> write required, which is the smoking gun.
> But we see tests that have > 8s completion times.

I found the root cause for this: the periodic pod sync has a variable frequency. podWorkers.wrapUp() schedules the next sync randomly in the range resyncInterval to resyncInterval * 1.5. That would explain up to ~95 seconds in the "normal case" and 11.5 seconds in your hastened environment (7.5s as the worst resync + 2s for the pod to pick up new values + 2s for the e2e test to poll new logs). I got a stable 60 - 62 second resync with workerResyncIntervalJitterFactor = 0.00001, which resulted in either a 2s or ~64s delay in the test.

Still, even with random jitter, 120 seconds is guaranteed to have at least one resync in between...
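A quick standalone sketch of that jitter using the wait.Jitter helper; the 0.5 factor is my reading of the behaviour described above (next sync lands between resyncInterval and 1.5 * resyncInterval), so treat the exact numbers as illustrative rather than the kubelet's code.

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    resyncInterval := 60 * time.Second
    const jitterFactor = 0.5 // assumed value matching "resyncInterval to resyncInterval * 1.5"

    // wait.Jitter(d, f) returns a random duration in [d, d + d*f), so the next
    // periodic SYNC for a pod lands anywhere between 60s and 90s after the last one.
    for i := 0; i < 5; i++ {
        fmt.Println(wait.Jitter(resyncInterval, jitterFactor))
    }
}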
> We do receive the UPDATE from the annotation change and that proceeds to
> resync the pod. That happens very quickly (<100ms). It is possible that the
> dswp will not have updated from the podManager in that time.

IMO, that's not possible. Kubelet.HandlePodUpdates() calls kl.podManager.UpdatePod(pod) and then calls kl.dispatchWork(), which asynchronously leads to syncPod(). So when syncPod runs and removes the pod from the dswp, the PodManager must already have the new pod and the dswp will find it there.
Finally I found it. volumeManager.WaitForAttachAndMount() (called by syncPod()) marks the pod both in the dswp, to refresh the dsw, and in the actualStateOfWorld, to remount it:

// Some pods expect to have Setup called over and over again to update.
// Remount plugins for which this is true. (Atomically updating volumes,
// like Downward API, depend on this to update the contents of the volume).
vm.desiredStateOfWorldPopulator.ReprocessPod(uniquePodName)
vm.actualStateOfWorld.MarkRemountRequired(uniquePodName)

The dswp and the reconciler (which re-mounts volumes) run in separate goroutines, and they basically race: either the dswp gets to the pod first and updates the pod and its annotations in the dsw (-> 2s delay), or the reconciler gets there first and remounts the volume with the old pod from the not-yet-updated dsw.

Having the root cause does not mean I have a fix. The dswp does not have a reference to the actualStateOfWorld to mark the volume for remount, and putting it there looks ugly... I'll comment in the upstream issue.
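To illustrate the race, here is a self-contained toy model (invented types; not the kubelet code). One goroutine plays the dswp populator refreshing the dsw copy of the pod from the "pod manager"; the other plays the reconciler, remounting from whatever pod object is currently in the dsw. If the remount wins the race, the stale annotation gets written and nothing corrects it until the next periodic SYNC.

package main

import (
    "fmt"
    "sync"
    "time"
)

type state struct {
    mu              sync.Mutex
    podManagerValue string // latest annotation known to the pod manager
    dswValue        string // copy of the annotation held in the desired state of world
    mountedValue    string // what the "volume" currently contains
    remountRequired bool
}

// populatorLoop plays the dswp: refresh the dsw from the pod manager every 100ms.
func (s *state) populatorLoop() {
    for {
        s.mu.Lock()
        s.dswValue = s.podManagerValue
        s.mu.Unlock()
        time.Sleep(100 * time.Millisecond)
    }
}

// reconcilerLoop plays the reconciler: remount from the dsw when asked.
func (s *state) reconcilerLoop() {
    for {
        s.mu.Lock()
        if s.remountRequired {
            s.mountedValue = s.dswValue // may still be the stale copy
            s.remountRequired = false
        }
        s.mu.Unlock()
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    s := &state{podManagerValue: "builder=bar", dswValue: "builder=bar", mountedValue: "builder=bar"}
    go s.populatorLoop()
    go s.reconcilerLoop()

    // Simulate the UPDATE + syncPod: the pod manager has the new annotation and the
    // remount flag is set, but the dsw may not have been refreshed yet.
    s.mu.Lock()
    s.podManagerValue = "builder=foo"
    s.remountRequired = true
    s.mu.Unlock()

    time.Sleep(300 * time.Millisecond)
    s.mu.Lock()
    fmt.Println("mounted:", s.mountedValue) // "foo" if the populator won the race, stale "bar" if the reconciler did
    s.mu.Unlock()
}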
Very interesting about the amount of wiggle in the resyncInterval.

> IMO, that's not possible. Kubelet.HandlePodUpdates() calls
> kl.podManager.UpdatePod(pod) and then it calls kl.dispatchWork() that
> asynchronously leads to syncPod(). So when syncPod runs and removes the pod
> from dswp, PodManager already must have new pod and dswp finds it there.

The question is: does the dswp loop execute findAndAddNewPods() after the podManager update but before the volume manager reconciliation loop remounts the volumes in response to syncPod running ReprocessPod() and MarkRemountRequired()?

Looking at the filtered-kubelet.log I attached, there are situations like this:

I0213 14:40:22.859439 28798 kubelet.go:1822] SyncLoop (UPDATE, "api"): "annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa_e2e-tests-downward-api-z9f7t(e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa)"
I0213 14:40:22.870313 28798 atomic_writer.go:156] pod e2e-tests-downward-api-z9f7t/annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa volume podinfo: no update required for target directory /var/lib/kubelet/pods/e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa/volumes/kubernetes.io~downward-api/podinfo
I0213 14:40:29.832020 28798 kubelet.go:1879] SyncLoop (SYNC): 1 pods; annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa_e2e-tests-downward-api-z9f7t(e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa)
I0213 14:40:29.875804 28798 atomic_writer.go:159] pod e2e-tests-downward-api-z9f7t/annotationupdatee60f5720-10fd-11e8-bff0-1c1b0deeddfa volume podinfo: write required for target directory /var/lib/kubelet/pods/e60f86d7-10fd-11e8-bc3d-1c1b0deeddfa/volumes/kubernetes.io~downward-api/podinfo

In that case, the delta-t between the UPDATE and the volume manager remount is about 11ms (870 - 859), which is less than the dswp loop interval. In this case, the dswp loop didn't update the dsw from the podManager before the volume remount took place. Thus we had to wait until the next SYNC for it to be picked up.
Sent a PR upstream: https://github.com/kubernetes/kubernetes/pull/59873

It works well; the test finishes in 2-4 seconds. However, that fixes only the 1st problem from comment #7; we still don't know why the subsequent resync does not pick up the updated pod. There is some chance it will still flake...
> The question is does the dswp loop execute findAndAddNewPods() after the
> podManager update

Yes, that's IMO 100% guaranteed.

> but before the volume manager reconciliation loop remounts the volumes in
> response to syncPod running ReprocessPod() and MarkRemountRequired().

No, that's not guaranteed. The reconciliation loop and the dswp loop run in separate goroutines and they basically race over who finds the pod first.
Created attachment 1396550 [details]
label-update-flake.log

> There is some chance it will still flake...

I agree. In the cases I see in our CI, the subsequent SYNC, which should pick up the change regardless of the race, does not. That leads me to believe that regardless of the race or the test timeout, the test would flake because the volume manager dsw is never being updated for some reason.

Attached is a node log filtered to the pod involved in a label update test. You can see that the SYNC following the UPDATE also results in the atomic writer determining that no update is needed, which means the pod provided to it by the volume manager is not the updated version of the pod.
Added a PR with additional logs: https://github.com/kubernetes/kubernetes/pull/59923

Maybe we should backport it together with the fix in comment #23.
Origin PR fixing the race and increasing volume manager log levels https://github.com/openshift/origin/pull/18636
Verified on v3.9.0-0.51.0.

Steps:

1. Create a pod with a downwardAPI projected volume.

apiVersion: v1
kind: Pod
metadata:
  name: allinone-normal
  labels:
    region: one
  annotations:
    builder: bar
spec:
  containers:
  - name: allinone-normal
    image: docker.io/ocpqe/hello-pod:latest
    resources:
      limits:
        cpu: "500m"
        memory: "256Mi"
    volumeMounts:
    - name: all-in-one
      mountPath: "/all-in-one"
      readOnly: true
  volumes:
  - name: all-in-one
    projected:
      sources:
      - downwardAPI:
          items:
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
          - path: mydapi/labels
            fieldRef:
              fieldPath: metadata.labels
          - path: mydapi/name
            fieldRef:
              fieldPath: metadata.name
          - path: mydapi/cpu_limit
            resourceFieldRef:
              containerName: allinone-normal
              resource: limits.cpu
              divisor: "1m"

# oc get po -n dma
NAME              READY     STATUS    RESTARTS   AGE
allinone-normal   1/1       Running   0          20m

2. Check the value in the pod.

[root@ip-172-18-6-178 dma]# oc exec -it allinone-normal -n dma -- cat /all-in-one/annotations
builder="bar"
kubernetes.io/config.seen="2018-02-26T03:13:42.306863034-05:00"
kubernetes.io/config.source="api"
openshift.io/scc="anyuid"

3. Update the annotations of the pod, then check the volume value again in the pod. The value in the pod updates immediately after updating the pod annotations.

# oc patch pod allinone-normal -n dma -p '{"metadata":{"annotations":{"builder":"foo"}}}'
pod "allinone-normal" patched
# oc exec -it allinone-normal -n dma -- cat /all-in-one/annotations
builder="foo"
kubernetes.io/config.seen="2018-02-26T03:13:42.306863034-05:00"
kubernetes.io/config.source="api"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489