Description of problem:

The diskmaker-manager Pod is continuously redeployed and no PV is created when local volumes are configured on multiple nodes with local-storage-operator 4.8.

Version-Release number of selected component (if applicable):

OCP 4.8.13 (UPI), local-storage-operator 4.8

In the customer's environment, the diskmaker-manager Pod ran normally after local-storage-operator was downgraded to 4.7.

How reproducible:

The following procedure reproduces the issue in our internal environment.

Steps to Reproduce:
1. Build an OCP 4.8 cluster with 3 masters and 2 workers.
2. Add a disk to each worker (/dev/vdb) and split it into multiple partitions (here, 10 partitions on each worker).
3. Create LocalVolumes with the attached scripts (23-worker-0.sh, 23-worker-1.sh). The issue appears at the fourth LocalVolume.

Actual results:

The diskmaker-manager Pods keep cycling between ContainerCreating and Terminating.

```
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-t7mdq                   1/1     Running             0          33s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m30s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153637
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-q9ww9                   0/1     ContainerCreating   0          1s      <none>        worker-1   <none>           <none>
diskmaker-manager-t7mdq                   1/1     Running             0          34s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m31s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153638
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-q9ww9                   0/1     ContainerCreating   0          2s      <none>        worker-1   <none>           <none>
diskmaker-manager-t7mdq                   1/1     Running             0          35s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m32s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153639
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-q9ww9                   0/1     ContainerCreating   0          3s      <none>        worker-1   <none>           <none>
diskmaker-manager-t7mdq                   1/1     Running             0          36s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m33s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153640
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-q9ww9                   0/1     ContainerCreating   0          4s      <none>        worker-1   <none>           <none>
diskmaker-manager-t7mdq                   1/1     Running             0          37s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m34s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153641
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-q9ww9                   1/1     Running             0          5s      10.131.0.59   worker-1   <none>           <none>
diskmaker-manager-t7mdq                   0/1     Terminating         0          38s     10.128.2.54   worker-0   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m35s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153642
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-8ls9h                   0/1     ContainerCreating   0          0s      <none>        worker-0   <none>           <none>
diskmaker-manager-q9ww9                   1/1     Running             0          6s      10.131.0.59   worker-1   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m36s   10.131.0.57   worker-1   <none>           <none>
=> 20210928-153643
NAME                                      READY   STATUS              RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
diskmaker-manager-8ls9h                   0/1     Terminating         0          1s      <none>        worker-0   <none>           <none>
diskmaker-manager-q9ww9                   1/1     Terminating         0          7s      10.131.0.59   worker-1   <none>           <none>
local-storage-operator-65878b66fc-k59sl   1/1     Running             0          6m37s   10.131.0.57   worker-1   <none>           <none>
```
After about an hour, both Pods may be Running, or one may be Running while the other is Terminating.

Expected results:

The diskmaker-manager Pods should be created once and continue to run. The same goes for ds/diskmaker-manager.

Additional info:

The resourceVersion of the ConfigMap keeps increasing even though its contents have not changed. The same goes for ds/diskmaker-manager. It looks like multiple controllers are competing with each other and repeatedly updating the same objects.

```
$ oc -n openshift-local-storage get cm/local-provisioner -o yaml | grep resourceVersion
  resourceVersion: "5888047"
$ oc -n openshift-local-storage get cm/local-provisioner -o yaml | grep resourceVersion
  resourceVersion: "5888135"
$ oc -n openshift-local-storage get cm/local-provisioner -o yaml | grep resourceVersion
  resourceVersion: "5888224"
$ oc -n openshift-local-storage get cm/local-provisioner -o yaml | grep resourceVersion
  resourceVersion: "5888292"
```
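For anyone who wants to confirm the churn programmatically rather than re-running the grep above, here is a minimal client-go sketch (my own illustration, not part of LSO; it assumes a local kubeconfig with access to the cluster) that polls the ConfigMap and prints its resourceVersion once per second:

```
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the same kubeconfig the oc commands above use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Poll the ConfigMap; a resourceVersion that keeps climbing while the
	// data is unchanged means something keeps rewriting the object.
	for {
		cm, err := cs.CoreV1().ConfigMaps("openshift-local-storage").
			Get(context.Background(), "local-provisioner", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		fmt.Println(time.Now().Format(time.RFC3339), cm.ResourceVersion)
		time.Sleep(time.Second)
	}
}
```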
Thank you Manabu and Ryoji, this info was very helpful. I'm able to reproduce the issue; still working on RCA.

0) Initial setup

Installed LSO and created a series of loopback devices for testing on each worker node as follows:

```
[jon@dagobah 2011171]$ oc debug node/ip-10-0-138-245.us-east-2.compute.internal
...
sh-4.4# for i in `seq 1 4`; do dd if=/dev/zero of=/tmp/loopback${i}.img bs=1 count=0 seek=1G; losetup -fP /tmp/loopback${i}.img; done
...
sh-4.4# losetup -a
/dev/loop1: [0042]:4466232 (/tmp/loopback2.img)
/dev/loop2: [0042]:4466840 (/tmp/loopback3.img)
/dev/loop0: [0042]:4466837 (/tmp/loopback1.img)
/dev/loop3: [0042]:4466842 (/tmp/loopback4.img)
```

1) Create single LV on single worker node

```
[jon@dagobah 2011171]$ cat lv-loop0.yaml
apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "lv-loop0"
  namespace: "openshift-local-storage"
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - ip-10-0-138-245
  storageClassDevices:
  - storageClassName: "lv-loop0"
    volumeMode: Filesystem
    fsType: xfs
    devicePaths:
    - /dev/loop0

[jon@dagobah 2011171]$ oc apply -f lv-loop0.yaml
localvolume.local.storage.openshift.io/lv-loop0 created
```

This triggered multiple updates to the daemonset (see the "daemonset changed" entries):

```
[jon@dagobah 2011171]$ oc logs -n openshift-local-storage pod/local-storage-operator-77d6b45fc9-t8879
...
I1014 22:28:05.087390       1 reconcile.go:34] Reconciling LocalVolume
I1014 22:28:05.087454       1 api_updater.go:74] Updating localvolume openshift-local-storage/lv-loop0
I1014 22:28:05.097273       1 reconcile.go:34] Reconciling LocalVolume
{"level":"info","ts":1634250485.0975707,"logger":"localvolumesetdaemon-controller","msg":"provisioner configmap changed","Request.Namespace":"openshift-local-storage","Request.Name":""}
E1014 22:28:05.108356       1 reconcile.go:96] failed to fetch diskmaker daemonset DaemonSet.apps "diskmaker-manager" not found
E1014 22:28:05.114731       1 reconcile.go:134] error syncing condition: LocalVolume.local.storage.openshift.io "lv-loop0" is invalid: status.generations: Required value
{"level":"info","ts":1634250485.1165073,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"created"}
I1014 22:28:05.117012       1 reconcile.go:34] Reconciling LocalVolume
{"level":"info","ts":1634250485.126506,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"}
{"level":"info","ts":1634250485.1347363,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"}
I1014 22:28:05.135488       1 reconcile.go:34] Reconciling LocalVolume
{"level":"info","ts":1634250485.1427386,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"}
I1014 22:28:05.165650       1 reconcile.go:34] Reconciling LocalVolume
I1014 22:28:05.176266       1 reconcile.go:34] Reconciling LocalVolume
\"diskmaker-manager\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/local-storage-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/local-storage-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/local-storage-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/local-storage-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"} {"level":"info","ts":1634250486.185989,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"} I1014 22:28:16.900963 1 reconcile.go:34] Reconciling LocalVolume {"level":"info","ts":1634250496.9091973,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"openshift-local-storage","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"} I1014 22:28:19.483278 1 reconcile.go:34] Reconciling LocalVolume I1014 22:28:19.493026 1 reconcile.go:34] Reconciling LocalVolume I1014 22:28:19.510662 1 reconcile.go:34] Reconciling LocalVolume I1014 22:28:19.518589 1 reconcile.go:34] Reconciling LocalVolume However, the generation number was still 1 after those updates: [jon@dagobah 2011171]$ oc get daemonset.apps/diskmaker-manager -n openshift-local-storage -o yaml | head apiVersion: apps/v1 kind: DaemonSet metadata: annotations: deprecated.daemonset.template.generation: "1" local.storage.openshift.io/configMapDataHash: 17032e345ceabe091d608b58ed093b2c5d1c2f0107 dcfe34e7888d8ffca1c53c creationTimestamp: "2021-10-14T22:28:05Z" generation: 1 labels: app: diskmaker-manager This seems like a potential issue... if the generation number is not changing, then why isn't CreateOrUpdate() returning OperationResultNone? 
2) Create 4 LV's on each of 2 worker nodes (8 LV's total)

```
[jon@dagobah 2011171]$ cat reproduce-loopback.sh
#!/bin/bash

devprefix=/dev/loop
volmode=fs
declare -a workers=( "ip-10-0-138-245" "ip-10-0-163-198" )

case ${volmode} in
  fs)    volumeMode=Filesystem ;;
  block) volumeMode=Block ;;
esac

echo "* volmode: ${volmode}"

for (( w=0; w<${#workers[@]}; w++ )); do
  worker=${workers[$w]}
  echo "$w" "$worker"
  for (( d=0; d<4; d++ )); do
    devpath="${devprefix}$d"
    echo "$d" "${devpath}"
    cat << END | oc create -f -
apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-${volmode}-w${w}-${d}"
  namespace: "openshift-local-storage"
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - ${worker}
  storageClassDevices:
  - storageClassName: "local-${volmode}-w${w}-${d}"
    volumeMode: ${volumeMode}
    devicePaths:
    - ${devpath}
END
    sleep 1
  done
done

[jon@dagobah 2011171]$ ./reproduce-loopback.sh
* volmode: fs
0 ip-10-0-138-245
0 /dev/loop0
localvolume.local.storage.openshift.io/local-fs-w0-0 created
1 /dev/loop1
localvolume.local.storage.openshift.io/local-fs-w0-1 created
2 /dev/loop2
localvolume.local.storage.openshift.io/local-fs-w0-2 created
3 /dev/loop3
localvolume.local.storage.openshift.io/local-fs-w0-3 created
1 ip-10-0-163-198
0 /dev/loop0
localvolume.local.storage.openshift.io/local-fs-w1-0 created
1 /dev/loop1
localvolume.local.storage.openshift.io/local-fs-w1-1 created
2 /dev/loop2
localvolume.local.storage.openshift.io/local-fs-w1-2 created
3 /dev/loop3
localvolume.local.storage.openshift.io/local-fs-w1-3 created
```

Right away we can see the behavior described in the bug report.

```
[jon@dagobah 2011171]$ oc get pods -n openshift-local-storage
NAME                                      READY   STATUS              RESTARTS   AGE
diskmaker-manager-qqhmz                   0/1     Terminating         0          10s
diskmaker-manager-xhs4h                   0/1     Terminating         0          6s
local-storage-operator-77d6b45fc9-t8879   1/1     Running             0          87m
[jon@dagobah 2011171]$ oc get pods -n openshift-local-storage
NAME                                      READY   STATUS              RESTARTS   AGE
diskmaker-manager-qqhmz                   0/1     Terminating         0          15s
diskmaker-manager-xhs4h                   0/1     Terminating         0          11s
local-storage-operator-77d6b45fc9-t8879   1/1     Running             0          87m
[jon@dagobah 2011171]$ oc get pods -n openshift-local-storage
NAME                                      READY   STATUS              RESTARTS   AGE
diskmaker-manager-5l52x                   0/1     ContainerCreating   0          1s
diskmaker-manager-plzvk                   0/1     Pending             0          0s
local-storage-operator-77d6b45fc9-t8879   1/1     Running             0          88m
[jon@dagobah 2011171]$ oc get pods -n openshift-local-storage
NAME                                      READY   STATUS              RESTARTS   AGE
diskmaker-manager-5l52x                   0/1     Terminating         0          3s
diskmaker-manager-xw8ql                   0/1     ContainerCreating   0          0s
local-storage-operator-77d6b45fc9-t8879   1/1     Running             0          88m
```

While that was happening, I captured the yaml for diskmaker-manager several times and ran diffs.
```
[jon@dagobah 2011171]$ oc get daemonset.apps/diskmaker-manager -n openshift-local-storage -o yaml > diskmaker1.yaml
[jon@dagobah 2011171]$ oc get daemonset.apps/diskmaker-manager -n openshift-local-storage -o yaml > diskmaker2.yaml
[jon@dagobah 2011171]$ oc get daemonset.apps/diskmaker-manager -n openshift-local-storage -o yaml > diskmaker3.yaml
[jon@dagobah 2011171]$ oc get daemonset.apps/diskmaker-manager -n openshift-local-storage -o yaml > diskmaker4.yaml
[jon@dagobah 2011171]$ diff diskmaker1.yaml diskmaker4.yaml
5c5
< deprecated.daemonset.template.generation: "310"
---
> deprecated.daemonset.template.generation: "339"
8c8
< generation: 310
---
> generation: 339
62c62
< resourceVersion: "240948"
---
> resourceVersion: "241199"
198c198
< observedGeneration: 310
---
> observedGeneration: 339
```

At first it looks like just the generation number and resourceVersion are being updated, but multiple updates happened between those two captures. What's really happening is that the ownerReferences and nodeAffinity lists are being reordered:

```
[jon@dagobah 2011171]$ diff diskmaker2.yaml diskmaker3.yaml
5c5
< deprecated.daemonset.template.generation: "324"
---
> deprecated.daemonset.template.generation: "329"
8c8
< generation: 324
---
> generation: 329
18,23d17
< name: local-fs-w0-3
< uid: 4178c3bf-33d9-4122-aac4-053be382362c
< - apiVersion: local.storage.openshift.io/v1
< blockOwnerDeletion: false
< controller: false
< kind: LocalVolume
62c56,62
< resourceVersion: "241069"
---
> - apiVersion: local.storage.openshift.io/v1
> blockOwnerDeletion: false
> controller: false
> kind: LocalVolume
> name: local-fs-w0-3
> uid: 4178c3bf-33d9-4122-aac4-053be382362c
> resourceVersion: "241122"
85c85
< - ip-10-0-138-245
---
> - ip-10-0-163-198
105c105
< - ip-10-0-163-198
---
> - ip-10-0-138-245
198,199c198
< observedGeneration: 321
< updatedNumberScheduled: 2
---
> observedGeneration: 329
[jon@dagobah 2011171]$ diff diskmaker3.yaml diskmaker4.yaml
5c5
< deprecated.daemonset.template.generation: "329"
---
> deprecated.daemonset.template.generation: "339"
8c8
< generation: 329
---
> generation: 339
17a18,23
> name: local-fs-w0-3
> uid: 4178c3bf-33d9-4122-aac4-053be382362c
> - apiVersion: local.storage.openshift.io/v1
> blockOwnerDeletion: false
> controller: false
> kind: LocalVolume
56,62c62
< - apiVersion: local.storage.openshift.io/v1
< blockOwnerDeletion: false
< controller: false
< kind: LocalVolume
< name: local-fs-w0-3
< uid: 4178c3bf-33d9-4122-aac4-053be382362c
< resourceVersion: "241122"
---
> resourceVersion: "241199"
85c85
< - ip-10-0-163-198
---
> - ip-10-0-138-245
105c105
< - ip-10-0-138-245
---
> - ip-10-0-163-198
198c198
< observedGeneration: 329
---
> observedGeneration: 339
```

configMapDataHash is not getting updated like I had theorized earlier, but I captured the configmap during this experiment too:

```
[jon@dagobah 2011171]$ oc get configmap -n openshift-local-storage local-provisioner -o yaml > configmap1.yaml
[jon@dagobah 2011171]$ oc get configmap -n openshift-local-storage local-provisioner -o yaml > configmap2.yaml
[jon@dagobah 2011171]$ oc get configmap -n openshift-local-storage local-provisioner -o yaml > configmap3.yaml
[jon@dagobah 2011171]$ oc get configmap -n openshift-local-storage local-provisioner -o yaml > configmap4.yaml
```

And here, again, it's the ownerReferences being reordered and updated.
```
[jon@dagobah 2011171]$ diff configmap2.yaml configmap3.yaml
68,69c68,69
< name: local-fs-w0-2
< uid: 7e645f08-b7c1-4e1c-a3ce-8d9e089dbdc2
---
> name: local-fs-w1-1
> uid: 5dce825e-3eeb-4402-835a-03e37a01ecba
74,75c74,75
< name: local-fs-w0-3
< uid: 4178c3bf-33d9-4122-aac4-053be382362c
---
> name: local-fs-w1-2
> uid: 9bb2cfae-3f78-46ee-bfc3-87c6a01d3173
80,81c80,81
< name: local-fs-w1-0
< uid: 2c2ee787-1938-43b2-8845-a02244819e4e
---
> name: local-fs-w1-3
> uid: 03ab4601-dda4-4b32-ac81-fdc79dc8885b
86,87c86,87
< name: local-fs-w1-1
< uid: 5dce825e-3eeb-4402-835a-03e37a01ecba
---
> name: local-fs-w0-0
> uid: d22d47e8-c03a-4b0e-8790-decbf979966b
92,93c92,93
< name: local-fs-w1-2
< uid: 9bb2cfae-3f78-46ee-bfc3-87c6a01d3173
---
> name: local-fs-w0-1
> uid: 8dd04e56-caa4-4374-9101-b7b890b368a8
98,99c98,99
< name: local-fs-w1-3
< uid: 03ab4601-dda4-4b32-ac81-fdc79dc8885b
---
> name: local-fs-w0-2
> uid: 7e645f08-b7c1-4e1c-a3ce-8d9e089dbdc2
104,105c104,105
< name: local-fs-w0-0
< uid: d22d47e8-c03a-4b0e-8790-decbf979966b
---
> name: local-fs-w0-3
> uid: 4178c3bf-33d9-4122-aac4-053be382362c
110,112c110,112
< name: local-fs-w0-1
< uid: 8dd04e56-caa4-4374-9101-b7b890b368a8
< resourceVersion: "249667"
---
> name: local-fs-w1-0
> uid: 2c2ee787-1938-43b2-8845-a02244819e4e
> resourceVersion: "249740"
[jon@dagobah 2011171]$ diff configmap3.yaml configmap4.yaml
68,69c68,69
< name: local-fs-w1-1
< uid: 5dce825e-3eeb-4402-835a-03e37a01ecba
---
> name: local-fs-w0-2
> uid: 7e645f08-b7c1-4e1c-a3ce-8d9e089dbdc2
74,75c74,75
< name: local-fs-w1-2
< uid: 9bb2cfae-3f78-46ee-bfc3-87c6a01d3173
---
> name: local-fs-w0-3
> uid: 4178c3bf-33d9-4122-aac4-053be382362c
80,81c80,81
< name: local-fs-w1-3
< uid: 03ab4601-dda4-4b32-ac81-fdc79dc8885b
---
> name: local-fs-w1-0
> uid: 2c2ee787-1938-43b2-8845-a02244819e4e
86,87c86,87
< name: local-fs-w0-0
< uid: d22d47e8-c03a-4b0e-8790-decbf979966b
---
> name: local-fs-w1-1
> uid: 5dce825e-3eeb-4402-835a-03e37a01ecba
92,93c92,93
< name: local-fs-w0-1
< uid: 8dd04e56-caa4-4374-9101-b7b890b368a8
---
> name: local-fs-w1-2
> uid: 9bb2cfae-3f78-46ee-bfc3-87c6a01d3173
98,99c98,99
< name: local-fs-w0-2
< uid: 7e645f08-b7c1-4e1c-a3ce-8d9e089dbdc2
---
> name: local-fs-w1-3
> uid: 03ab4601-dda4-4b32-ac81-fdc79dc8885b
104,105c104,105
< name: local-fs-w0-3
< uid: 4178c3bf-33d9-4122-aac4-053be382362c
---
> name: local-fs-w0-0
> uid: d22d47e8-c03a-4b0e-8790-decbf979966b
110,112c110,112
< name: local-fs-w1-0
< uid: 2c2ee787-1938-43b2-8845-a02244819e4e
< resourceVersion: "249740"
---
> name: local-fs-w0-1
> uid: 8dd04e56-caa4-4374-9101-b7b890b368a8
> resourceVersion: "249836"
```

So right now the things I want to understand are:

A) When creating a single LV, why do we get so many messages about the daemonset being updated, even though the generation number stays at 1? I would expect CreateOrUpdate() to return OperationResultNone in that case and not print a log message.

B) Why is the list ordering for ownerReferences and nodeAffinity constantly changing, causing updates to both the daemonset and the configmap? This seems to happen only with 2+ worker nodes, and it is what's causing the diskmaker-manager pods to be recreated over and over.
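Worth noting for question B: the deep comparison used by CreateOrUpdate() is order-sensitive for slices, so two lists with the same elements in a different order count as a change. A trivial standalone illustration (my own, not LSO code):

```
package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Same two hostnames, different order, like a nodeAffinity values list
	// that gets rebuilt in a different order on each reconcile.
	a := []string{"ip-10-0-138-245", "ip-10-0-163-198"}
	b := []string{"ip-10-0-163-198", "ip-10-0-138-245"}

	// DeepEqual is order-sensitive, so the rebuilt list always looks
	// "changed" and every reconcile ends in an Update.
	fmt.Println(reflect.DeepEqual(a, b)) // false
}
```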
Ok, RCA is understood now. I spent quite a bit of time tracing this with delve, and there were a few issues to fix.

1) We create the list of ownerRefs and node selector terms here in extractLVInfo:

https://github.com/openshift/local-storage-operator/blob/81d182a4cabe04d2dc7f42d7b6167bd0d86900e2/pkg/controller/nodedaemon/aggregate.go#L88

But the order of the lists generated by that function is different every time. This explains my previous observations where the list ordering for the owner refs and node selectors was constantly changing. The fix is to sort the slice of LVs before looping through them (see the sketch after this list), similar to what we already do in extractLVSetInfo, so that the lists are created in a predictable order:

https://github.com/openshift/local-storage-operator/blob/81d182a4cabe04d2dc7f42d7b6167bd0d86900e2/pkg/controller/nodedaemon/aggregate.go#L56-L61

2) Every time LSO updated the daemonset, it was attempting to clear TerminationMessagePath and TerminationMessagePolicy:

```
(dlv) print existing.Spec.Template.Spec.Containers[0]
k8s.io/api/core/v1.Container {
...
	TerminationMessagePath: "/dev/termination-log",
	TerminationMessagePolicy: "File",
...
(dlv) print obj.Spec.Template.Spec.Containers[0]
k8s.io/api/core/v1.Container {
...
	TerminationMessagePath: "",
	TerminationMessagePolicy: "",
...
```

Looks like those fields were set by prometheus-operator. Maybe LSO is even "fighting" prometheus-operator to set those fields? Every time LSO updates the daemonset, I see the same thing: those 2 fields are set on the existing object, but since they are not in MutateAggregatedSpec(), LSO tries to clear them on every update. Adding them to the definition in MutateAggregatedSpec() solves this particular issue.

https://github.com/openshift/local-storage-operator/blob/81d182a4cabe04d2dc7f42d7b6167bd0d86900e2/pkg/controller/nodedaemon/daemonsets.go#L159

3) Every time LSO updated the daemonset, it was also attempting to clear the APIVersion field from the field selector of each environment variable, i.e.:

```
  Name: "MY_NODE_NAME",
  Value: "",
  ValueFrom: &v1.EnvVarSource{
    FieldRef: &v1.ObjectFieldSelector{
-     APIVersion: "v1",
+     APIVersion: "",
      FieldPath: "spec.nodeName",
    },
```

This again is because that field is not defined in MutateAggregatedSpec(), and it can be fixed by setting it to v1 for each entry here:

https://github.com/openshift/local-storage-operator/blob/81d182a4cabe04d2dc7f42d7b6167bd0d86900e2/pkg/controller/nodedaemon/daemonsets.go#L233-L256

4) There is a Watch on DaemonSets that I don't think is necessary:

https://github.com/openshift/local-storage-operator/blob/81d182a4cabe04d2dc7f42d7b6167bd0d86900e2/pkg/controller/nodedaemon/add.go#L70-L73

We already watch for changes on the daemonset at line 54 and on LV's at line 65. I think this one can be removed; it's triggering an extra reconcile when one should not be needed. It was already removed from the master branch at some point:

https://github.com/openshift/local-storage-operator/blob/a38697051f4567c7f630c897d11b202206113832/controllers/nodedaemon/reconcile.go#L207-L212
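For issue 1, a minimal sketch of the sort fix (illustrative only; buildOwnerRefs is my stand-in, not the actual extractLVInfo code, and the localv1 import path is assumed from the repo layout): sort the LocalVolumes by name before looping, so the aggregated lists come out in the same order on every reconcile:

```
import (
	"sort"

	localv1 "github.com/openshift/local-storage-operator/pkg/apis/local/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildOwnerRefs is a hypothetical stand-in for the aggregation done in
// extractLVInfo: sorting first makes the output order deterministic, so
// CreateOrUpdate's deep comparison stops seeing a spurious diff.
func buildOwnerRefs(lvs []localv1.LocalVolume) []metav1.OwnerReference {
	sort.Slice(lvs, func(i, j int) bool { return lvs[i].Name < lvs[j].Name })

	refs := make([]metav1.OwnerReference, 0, len(lvs))
	for _, lv := range lvs {
		falseVal := false
		refs = append(refs, metav1.OwnerReference{
			// Values taken from the ownerReferences seen in the diffs above.
			APIVersion:         "local.storage.openshift.io/v1",
			Kind:               "LocalVolume",
			Name:               lv.Name,
			UID:                lv.UID,
			BlockOwnerDeletion: &falseVal,
			Controller:         &falseVal,
		})
	}
	return refs
}
```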
After fixing those 4 issues in my local test VM, I see the configmap and daemonset only getting updated once when creating a new LV:

```
[vagrant@fedora33 deploy]$ k get localvolumes
NAME       AGE
lv-loop0   19h
lv-loop1   19h
lv-loop2   19h
[vagrant@fedora33 deploy]$ k apply -f lv-loop3.yaml
localvolume.local.storage.openshift.io/lv-loop3 created
[vagrant@fedora33 deploy]$ k logs local-storage-operator-5d6fc74cdd-5xml8 -f
...
I1021 19:16:59.423579       1 reconcile.go:34] Reconciling LocalVolume
I1021 19:16:59.423628       1 api_updater.go:74] Updating localvolume default/lv-loop3
I1021 19:16:59.450197       1 reconcile.go:34] Reconciling LocalVolume
{"level":"info","ts":1634843819.4503026,"logger":"localvolumesetdaemon-controller","msg":"provisioner configmap changed","Request.Namespace":"default","Request.Name":""}
{"level":"info","ts":1634843819.4561226,"logger":"localvolumesetdaemon-controller","msg":"daemonset changed","Request.Namespace":"default","Request.Name":"","daemonset.Name":"diskmaker-manager","op.Result":"updated"}
I1021 19:16:59.468948       1 reconcile.go:34] Reconciling LocalVolume
I1021 19:16:59.476097       1 reconcile.go:34] Reconciling LocalVolume
I1021 19:16:59.481306       1 reconcile.go:34] Reconciling LocalVolume
I1021 19:16:59.484802       1 reconcile.go:34] Reconciling LocalVolume
```

My testing so far was based on the release-4.8 branch, so next steps:
- port the fix over to the 4.10 and 4.9 branches
- test the fix on 4.10, 4.9, and 4.8 OCP clusters
- file PRs for 4.10, 4.9, and 4.8
RE: https://bugzilla.redhat.com/show_bug.cgi?id=2011171#c11

Issues #1 and #2 apply to all 3 versions (4.8, 4.9, 4.10). Issues #3 and #4 only apply to 4.8 (they are already fixed on newer versions).
4.9 and 4.10 also have a kube-rbac-proxy sidecar that needs to be updated to avoid extra daemonset updates. The following fields were left undefined and were getting cleared on every update attempt:

```
containers:
- name: kube-rbac-proxy
  imagePullPolicy: IfNotPresent
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
volumes:
- name: metrics-serving-cert
  secret:
    defaultMode: 420
```
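A sketch of what the corresponding fix looks like (assumed shape, mirroring the MutateAggregatedSpec() changes described earlier; pinDefaultedFields and its arguments are illustrative, not the actual patch): explicitly set the server-defaulted fields so the mutated spec matches what the API server hands back:

```
import corev1 "k8s.io/api/core/v1"

// pinDefaultedFields sets the API-server-defaulted fields on the
// kube-rbac-proxy sidecar so the deep comparison against the live object
// stops seeing a diff on every reconcile.
func pinDefaultedFields(container *corev1.Container, volume *corev1.Volume) {
	container.ImagePullPolicy = corev1.PullIfNotPresent
	container.TerminationMessagePath = corev1.TerminationMessagePathDefault // "/dev/termination-log"
	container.TerminationMessagePolicy = corev1.TerminationMessageReadFile  // "File"

	if volume.Secret != nil {
		mode := corev1.SecretVolumeSourceDefaultMode // 0644 == 420 decimal
		volume.Secret.DefaultMode = &mode
	}
}
```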
```
./reproduce-loopback.sh
* volmode: fs
0 ip-10-0-187-149.us-east-2.compute.internal
0 /dev/loop0
localvolume.local.storage.openshift.io/local-fs-w0-0 created
1 /dev/loop1
localvolume.local.storage.openshift.io/local-fs-w0-1 created
2 /dev/loop2
localvolume.local.storage.openshift.io/local-fs-w0-2 created
3 /dev/loop3
localvolume.local.storage.openshift.io/local-fs-w0-3 created
1 ip-10-0-142-92.us-east-2.compute.internal
0 /dev/loop0
localvolume.local.storage.openshift.io/local-fs-w1-0 created
1 /dev/loop1
localvolume.local.storage.openshift.io/local-fs-w1-1 created
2 /dev/loop2
localvolume.local.storage.openshift.io/local-fs-w1-2 created
3 /dev/loop3
localvolume.local.storage.openshift.io/local-fs-w1-3 created

oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
diskmaker-discovery-jwh9m               2/2     Running   0          76m
diskmaker-discovery-mzh29               2/2     Running   0          76m
diskmaker-discovery-vhgs2               2/2     Running   0          76m
diskmaker-manager-4c5zm                 2/2     Running   0          15s
diskmaker-manager-jphkp                 2/2     Running   0          16s
diskmaker-manager-zkpcv                 2/2     Running   0          15s
local-storage-operator-9855f445-pts2w   1/1     Running   0          130m
```

Passed with local-storage-operator.4.10.0-202110270816
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056