Filing a bug for tracking purposes. The assisted-service pod is showing high memory utilization; prior to the workaround, memory usage was at or around 1.9 GB. The workaround is to scale down the AI operator, edit the assisted-service deployment to set ENABLE_KUBE_API=false, and let the assisted-service pod restart.
Created attachment 1779141 [details]
ai pod memory usage

assisted-service memory usage before workaround.
Created attachment 1779143 [details]
assisted-service pod memory after workaround

ai pod memory usage post workaround.
Created attachment 1779214 [details]
Prometheus metrics for AI controller runtime

These were captured on a fresh environment that was deployed with dev-scripts.
Digging into this now.

These are the metrics related to the baremetalhost reconcile. At the bottom we can see the buckets go up to 60s. I have attached the full metrics dump for `controller-runtime` from the assisted-service POD.

```
# HELP controller_runtime_reconcile_time_seconds Length of time per reconciliation per controller
# TYPE controller_runtime_reconcile_time_seconds histogram
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.005"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.01"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.025"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.05"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.1"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.15"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.2"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.25"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.3"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.35"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.4"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.45"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.6"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.7"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.8"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="0.9"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="1"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="1.25"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="1.5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="1.75"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="2"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="2.5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="3"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="3.5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="4"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="4.5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="5"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="6"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="7"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="8"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="9"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="10"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="15"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="20"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="25"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="30"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="40"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="50"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="60"} 3
controller_runtime_reconcile_time_seconds_bucket{controller="baremetalhost",le="+Inf"} 3
controller_runtime_reconcile_time_seconds_sum{controller="baremetalhost"} 0.00027825800000000003
controller_runtime_reconcile_time_seconds_count{controller="baremetalhost"} 3
# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="baremetalhost",result="success"} 3
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
```
There are some optimizations we can do. For instance, we currently proceed with the `Reconcile` steps even when the BMH and InfraEnv are not ready. I'm working on a PR to address this and allow us to interrupt the reconcile process when there is no point in proceeding.
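As a rough illustration of that change, here is a minimal controller-runtime-style sketch. The type and helper names (`exampleReconciler`, `dependenciesReady`) are placeholders, not the actual assisted-service code:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// exampleReconciler is a placeholder; the real reconciler has more fields.
type exampleReconciler struct {
	client.Client
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ready, err := r.dependenciesReady(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Interrupt the reconcile early when the BMH / InfraEnv are not ready,
	// instead of executing the remaining steps that cannot make progress.
	if !ready {
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}

	// ... the rest of the reconcile logic runs only when it can do useful work.
	return ctrl.Result{}, nil
}

// dependenciesReady is a hypothetical helper standing in for the actual
// BMH / InfraEnv readiness checks.
func (r *exampleReconciler) dependenciesReady(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}
```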
Created attachment 1780326 [details] grafana memory graph
Created attachment 1780327 [details] ai.heap
Created attachment 1780328 [details] ai.trace
Created attachment 1780329 [details] ai.goroutine
Summary:

The reported memory is the highest peak of memory used by Assisted Service and not the actual allocated memory. There are different metrics for memory reported by the metrics-server and, more specifically, cAdvisor. The specific fields can be found here: https://github.com/google/cadvisor/blob/75d3a4f100ba0e526ff463a93eba7304a6b91958/info/v1/container.go#L365-L400

`kubectl top pod` (and some of the metrics reported in Prometheus) represents the currently allocated memory plus the cached memory. Here are the stats for a freshly deployed Assisted Service where the reported behavior was replicated:

Container          Memory Usage   Memory Requests   Memory Requests %   Memory Usage (RSS)   Memory Usage (Cache)   Memory Usage (Swap)
postgres           16.86 MiB      400.00 MiB        4.21%               3.61 MiB             12.76 MiB              0 B
assisted-service   1.49 GiB       512.00 MiB        298.97%             249.00 MiB           3.16 GiB               0 B
POD                140.00 KiB     -                 -                   -                    -                      -

Longer version:

Here is a longer version of what the debug process looked like, for posterity. While I hope we won't need to do more investigations like this, I also hope this experience can serve as a learning opportunity for more folks.

These are the `top25` stats taken from the `ai.heap` attachment (which contains the heap dump taken with `pprof`) in this BZ. We can also generate callgraphs from this data to analyze when and how these functions are called:

```
(pprof) inuse_space
(pprof) top25
Showing nodes accounting for 28.93MB, 100% of 28.93MB total
Showing top 25 nodes out of 137
flat flat% sum% cum cum%
13.08MB 45.23% 45.23% 14.58MB 50.41% k8s.io/api/core/v1.(*Secret).Unmarshal
2MB 6.92% 52.14% 2MB 6.92% reflect.mapassign
1.51MB 5.22% 57.37% 1.51MB 5.22% k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
1.50MB 5.19% 62.55% 1.50MB 5.19% github.com/aws/aws-sdk-go/aws/endpoints.init
1MB 3.46% 66.01% 1MB 3.46% github.com/go-openapi/analysis.(*Spec).analyzeSchema
1MB 3.46% 69.47% 1MB 3.46% reflect.packEface
1MB 3.46% 72.93% 1MB 3.46% k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).Unmarshal
1MB 3.46% 76.39% 1MB 3.46% net/url.parse
0.72MB 2.49% 78.88% 15.30MB 52.90% k8s.io/api/core/v1.(*SecretList).Unmarshal
0.54MB 1.87% 80.74% 0.54MB 1.87% k8s.io/client-go/tools/cache.(*threadSafeMap).Add
0.54MB 1.85% 82.60% 0.54MB 1.85% k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
0.52MB 1.81% 84.41% 0.52MB 1.81% regexp/syntax.(*compiler).inst (inline)
0.51MB 1.76% 86.16% 0.51MB 1.76% github.com/go-openapi/analysis.(*Spec).reset
0.50MB 1.74% 87.90% 0.50MB 1.74% golang.org/x/net/http2.glob..func3
0.50MB 1.73% 89.63% 0.50MB 1.73% reflect.New
0.50MB 1.73% 91.36% 2.02MB 7.00% regexp.compile
0.50MB 1.73% 93.09% 0.50MB 1.73% regexp/syntax.(*parser).newRegexp (inline)
0.50MB 1.73% 94.81% 0.50MB 1.73% github.com/prometheus/client_golang/prometheus.newHistogram
0.50MB 1.73% 96.54% 1.50MB 5.19% k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
0.50MB 1.73% 98.27% 0.50MB 1.73% regexp/syntax.(*compiler).init (inline)
0.50MB 1.73% 100% 1.04MB 3.60% k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
0 0% 100% 0.50MB 1.73% encoding/gob.(*Decoder).Decode
0 0% 100% 0.50MB 1.73% encoding/gob.(*Decoder).DecodeValue
0 0% 100% 0.50MB 1.73% encoding/gob.(*Decoder).decOpFor.func2
0 0% 100% 0.50MB 1.73% encoding/gob.(*Decoder).decOpFor.func4
```

```
(pprof) inuse_objects
(pprof) top25
Showing nodes accounting for 84430mb, 99.38% of 84961mb total
Dropped 34 nodes (cum <= 424mb)
Showing top 25 nodes out of 103
flat flat% sum% cum cum%
16384mb 19.28% 19.28% 16390mb 19.29% k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
10923mb 12.86% 32.14% 10923mb 12.86% regexp/syntax.(*compiler).init (inline)
7867mb 9.26% 41.40% 7867mb 9.26% github.com/aws/aws-sdk-go/aws/endpoints.init
7282mb 8.57% 49.97% 7282mb 8.57% net/url.parse
6554mb 7.71% 57.69% 6554mb 7.71% github.com/prometheus/client_golang/prometheus.newHistogram
6554mb 7.71% 65.40% 10703mb 12.60% k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
5499mb 6.47% 71.87% 16202mb 19.07% k8s.io/api/core/v1.(*Secret).Unmarshal
5251mb 6.18% 78.05% 5251mb 6.18% reflect.mapassign
4681mb 5.51% 83.56% 4681mb 5.51% regexp/syntax.(*parser).newRegexp (inline)
4149mb 4.88% 88.45% 4149mb 4.88% k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).Unmarshal
3277mb 3.86% 92.30% 3277mb 3.86% reflect.packEface
3277mb 3.86% 96.16% 18892mb 22.24% regexp.compile
1821mb 2.14% 98.30% 1821mb 2.14% github.com/go-openapi/analysis.(*Spec).analyzeSchema
910mb 1.07% 99.37% 910mb 1.07% reflect.New
1mb 0.0012% 99.38% 16203mb 19.07% k8s.io/api/core/v1.(*SecretList).Unmarshal
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).Decode
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).DecodeValue
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decOpFor.func2
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decOpFor.func4
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decodeGobDecoder
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decodeMap
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decodeSingle
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decodeStruct
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).decodeValue
0 0% 99.38% 910mb 1.07% encoding/gob.(*Decoder).gobDecodeOpFor.func1
```

```
(pprof) alloc_space
(pprof) top25
Showing nodes accounting for 102.91MB, 84.18% of 122.25MB total
Dropped 136 nodes (cum <= 0.61MB)
Showing top 25 nodes out of 235
flat flat% sum% cum cum%
37.42MB 30.61% 30.61% 37.42MB 30.61% bytes.makeSlice
15.77MB 12.90% 43.51% 15.77MB 12.90% k8s.io/apimachinery/pkg/runtime.(*Unknown).Unmarshal
13.08MB 10.70% 54.22% 14.58MB 11.93% k8s.io/api/core/v1.(*Secret).Unmarshal
5MB 4.09% 58.31% 5MB 4.09% reflect.mapassign
3.17MB 2.60% 60.90% 5.69MB 4.65% github.com/go-openapi/analysis.(*Spec).analyzeSchema
3MB 2.45% 63.36% 3MB 2.45% encoding/json.(*decodeState).objectInterface
2.32MB 1.90% 65.26% 2.32MB 1.90% encoding/gob.(*decBuffer).Size
2.03MB 1.66% 66.91% 2.03MB 1.66% regexp/syntax.(*compiler).inst
2.01MB 1.65% 68.56% 2.01MB 1.65% k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
1.55MB 1.26% 69.82% 1.55MB 1.26% encoding/gob.(*encBuffer).Write
1.50MB 1.23% 71.05% 10MB 8.18% github.com/go-openapi/spec.(*Schema).UnmarshalJSON
1.50MB 1.23% 72.28% 1.50MB 1.23% github.com/aws/aws-sdk-go/aws/endpoints.init
1.50MB 1.23% 73.51% 1.50MB 1.23% strings.makeCutsetFunc
1.50MB 1.23% 74.73% 1.50MB 1.23% github.com/go-openapi/swag.(*splitter).gatherInitialismMatches
1.50MB 1.23% 75.96% 1.50MB 1.23% reflect.(*structType).Field
1.04MB 0.85% 76.81% 1.54MB 1.26% k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
1MB 0.82% 77.63% 1MB 0.82% bufio.NewReaderSize
1MB 0.82% 78.45% 1MB 0.82% reflect.unsafe_NewArray
1MB 0.82% 79.27% 3.50MB 2.86% encoding/gob.(*Decoder).getDecEnginePtr
1MB 0.82% 80.09% 1MB 0.82% github.com/go-openapi/swag.(*NameProvider).GetJSONNames
1MB 0.82% 80.91% 1MB 0.82% reflect.packEface
1MB 0.82% 81.73% 1MB 0.82% k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).Unmarshal
1MB 0.82% 82.54% 1MB 0.82% reflect.New
1MB 0.82% 83.36% 1MB 0.82% reflect.Zero
1MB 0.82% 84.18% 2.50MB 2.05% encoding/gob.(*Decoder).recvType
```

```
(pprof) alloc_objects
(pprof) top25
Showing nodes accounting for 681601mb, 91.33% of 746274mb total
Dropped 144 nodes (cum <= 3731mb)
Showing top 25 nodes out of 227
flat flat% sum% cum cum%
98305mb 13.17% 13.17% 98305mb 13.17% github.com/go-openapi/swag.(*splitter).gatherInitialismMatches
98305mb 13.17% 26.35% 98305mb 13.17% reflect.(*structType).Field
81920mb 10.98% 37.32% 119322mb 15.99% encoding/gob.(*Decoder).decOpFor
65536mb 8.78% 46.10% 65536mb 8.78% reflect.(*rtype).Method
49153mb 6.59% 52.69% 49153mb 6.59% strings.makeCutsetFunc
33678mb 4.51% 57.20% 33678mb 4.51% reflect.Zero
32768mb 4.39% 61.59% 32768mb 4.39% encoding/json.(*decodeState).literalStore
32768mb 4.39% 65.99% 32768mb 4.39% errors.New (inline)
18658mb 2.50% 68.49% 18658mb 2.50% encoding/json.(*decodeState).objectInterface
17294mb 2.32% 70.80% 17294mb 2.32% reflect.New
16384mb 2.20% 73.00% 20204mb 2.71% bytes.(*Buffer).grow
16384mb 2.20% 75.19% 16384mb 2.20% encoding/gob.decString
16384mb 2.20% 77.39% 16390mb 2.20% k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
14936mb 2.00% 79.39% 14936mb 2.00% reflect.mapassign
10923mb 1.46% 80.85% 10923mb 1.46% regexp/syntax.(*compiler).init
10923mb 1.46% 82.32% 10923mb 1.46% strings.genSplit
9362mb 1.25% 83.57% 74898mb 10.04% encoding/gob.(*Decoder).recvType
8363mb 1.12% 84.69% 8363mb 1.12% reflect.unsafe_NewArray
8192mb 1.10% 85.79% 8192mb 1.10% k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).RESTMappings
7867mb 1.05% 86.85% 7867mb 1.05% github.com/aws/aws-sdk-go/aws/endpoints.init
7282mb 0.98% 87.82% 7282mb 0.98% net/url.parse
6554mb 0.88% 88.70% 6554mb 0.88% github.com/go-openapi/analysis.(*referenceAnalysis).addRef
6554mb 0.88% 89.58% 6554mb 0.88% github.com/prometheus/client_golang/prometheus.newHistogram
6554mb 0.88% 90.46% 10703mb 1.43% k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
6554mb 0.88% 91.33% 6554mb 0.88% net/url.escape
```

The main `Memory Usage` dashboard in grafana (screenshot attached) reports the memory usage for the POD, which corresponds to the `container_memory_working_set_bytes` metric. It is described in the cAdvisor codebase as:

"The amount of working set memory, this includes recently accessed memory, dirty memory, and kernel memory. Working set is <= "usage"."

Here's how this is calculated in cAdvisor: https://github.com/google/cadvisor/blob/75d3a4f100ba0e526ff463a93eba7304a6b91958/container/libcontainer/handler.go#L822-L835

```
inactiveFileKeyName := "total_inactive_file"
if cgroups.IsCgroup2UnifiedMode() {
	inactiveFileKeyName = "inactive_file"
}
workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
	if workingSet < v {
		workingSet = 0
	} else {
		workingSet -= v
	}
}
ret.Memory.WorkingSet = workingSet
```

The above is what `kubectl top pod` reports. This is not, however, the memory that is currently allocated. The table above shows that the RSS memory usage is significantly lower. RSS usage in the context of cAdvisor corresponds to:

"The amount of anonymous and swap cache memory (includes transparent hugepages)."

The RSS memory in this case is not to be confused with `Resident set size`.

We can compare the data above to what is in the `/sys/fs/cgroup/memory/memory.stat` file inside the POD.
This file will show values like:

```
bash-4.4$ cat /sys/fs/cgroup/memory/memory.stat
cache 3188072448
rss 100110336
rss_huge 0
shmem 0
mapped_file 3919872
dirty 0
writeback 135168
swap 0
pgpgin 1774245
pgpgout 993465
pgfault 16863
pgmajfault 0
inactive_anon 78651392
active_anon 0
inactive_file 1579802624
active_file 1629388800
unevictable 0
hierarchical_memory_limit 19439853568
hierarchical_memsw_limit 9223372036854771712
total_cache 3188072448
total_rss 100110336
total_rss_huge 0
total_shmem 0
total_mapped_file 3919872
total_dirty 0
total_writeback 135168
total_swap 0
total_pgpgin 1774245
total_pgpgout 993465
total_pgfault 16863
total_pgmajfault 0
total_inactive_anon 78651392
total_active_anon 0
total_inactive_file 1579802624
total_active_file 1629388800
total_unevictable 0
```

From the above, `total_rss` is what's reported as `Memory Usage (RSS)`, whereas the working set is the memory usage minus the inactive files.

All that being said, both numbers are important and they should both be monitored, as they will eventually impact Kubernetes' decision to kill the POD in case of lack of resources. To this end, there are a few tasks we could spend some time on to improve this situation:

1. The `UploadFile` function reads the full[0] file into memory before the actual upload happens. We can optimize this by opening the file and leveraging `UploadStreamToPublicBucket` (a rough sketch follows at the end of this comment).
2. We should track down why Assisted Service gets to the point of allocating that much memory (even if it's then freed) at the same time.
3. Should we add `pprof` to AI and only enable the endpoint when running with debug enabled?
4. Are there any other tools we could ship as part of AI that would help with investigations like this in the future?

[0] https://github.com/openshift/assisted-service/blob/169af55fcfe240ce08e4f7714f1a6106a6c89d58/pkg/s3wrapper/filesystem.go#L82

Some useful links:
- https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66
- https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- https://medium.com/@eng.mohamed.m.saeed/memory-working-set-vs-memory-rss-in-kubernetes-which-one-you-should-monitor-8ef77bf0acee
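Regarding item 1, a minimal sketch of what the streaming approach could look like. This is illustrative only: `UploadStreamToPublicBucket` is the function mentioned above, but the interface and signature shown here are assumptions rather than the actual `pkg/s3wrapper` API.

```go
package s3util

import (
	"context"
	"io"
	"os"
)

// uploader is a hypothetical stand-in for the s3wrapper client; the real
// UploadStreamToPublicBucket signature may differ.
type uploader interface {
	UploadStreamToPublicBucket(ctx context.Context, reader io.Reader, objectName string) error
}

// uploadFileStreaming opens the file and hands the reader straight to the
// streaming uploader, so the whole file is never buffered in memory at once
// (unlike reading the entire file into a byte slice before uploading).
func uploadFileStreaming(ctx context.Context, u uploader, filePath, objectName string) error {
	f, err := os.Open(filePath)
	if err != nil {
		return err
	}
	defer f.Close()
	return u.UploadStreamToPublicBucket(ctx, f, objectName)
}
```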
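Regarding item 3, exposing `pprof` on demand is mostly a matter of registering the standard `net/http/pprof` handlers behind a flag. A minimal sketch, assuming a hypothetical `ENABLE_PPROF` environment variable (not an existing assisted-service option):

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"os"
)

func main() {
	// Expose the profiling endpoints only when explicitly requested,
	// e.g. in debug deployments.
	if os.Getenv("ENABLE_PPROF") == "true" {
		go func() {
			// Heap, goroutine, and trace profiles then become available
			// under http://localhost:6060/debug/pprof/.
			_ = http.ListenAndServe("localhost:6060", nil)
		}()
	}
	// ... rest of the service start-up would go here.
}
```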
I'm still running some investigations in parallel but, given the results so far, I am changing the priority and severity of this bug. I would like to improve the situation and avoid those memory peaks, if possible.
Moving to `4.9`, as we have verified that the memory is not actually allocated/held. We will work on reducing the memory peaks for 4.9.
I'm going to close this issue as `CURRENTRELEASE`, as multiple fixes have gone in that have improved the situation. We should keep an eye on the system and open a new BZ if new work is needed on this front.