Created attachment 1859428 [details]
OOM_with_dmesg_debug

Description of problem:

(1) After recent upgrades (mid-January) the x86_64 Copr builders are failing with OOM.
    Some specific packages (built fine for the past year) are now impossible to build.

(2) Statuses on canceled builds are displayed in reverse:
    - "building" becomes "failed"
    - "failed" becomes "canceled"

---

See builds for which the OOM is always reproducible:

* https://copr.fedorainfracloud.org/coprs/rezso/ML/build/3330542
  fedora-rawhide-x86_64 (tensorflow) -> OOM
* https://copr.fedorainfracloud.org/coprs/rezso/ML/build/3330465
  fedora-rawhide-x86_64 (cupy) -> OOM
* https://copr.fedorainfracloud.org/coprs/rezso/ML/build/3294988
  fedora-rawhide-x86_64 (mxnet) -> OOM (also timed out after 172800 s)

I could continue the list; for now I kept only these examples.
The non-x86_64 builders are fine, no OOM observed there.

---

Of course I tried to adjust the number of processors / memory / LTO / annobin, but no luck.

---

Captured a dmesg within a build, attached here as [OOM_with_dmesg_debug].
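For reference, a minimal sketch of how such a dmesg snapshot can be grabbed from inside a build (the exact hook I used for the attached log may differ; treat the commands as illustrative only):

  # e.g. at the end of %build, dump OOM traces and memory state into the build log:
  dmesg | grep -i -e "out of memory" -e oom_reaper || :
  grep -e MemTotal -e SwapTotal -e CommitLimit -e Committed_AS /proc/meminfo || :
  sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness || :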
Thank you for the report, Balint.

> After recent upgrades (mid january) x86_64 copr builders are failing with OOM.
> Some specific packages (built fine since 1yr) are now impossible to build.

From the memory management perspective, no change happened. We just updated
the builders from Fedora 34 to Fedora 35:
https://lists.fedoraproject.org/archives/list/copr-devel@lists.fedorahosted.org/thread/3JORYRKDWFMJSR35Z4LIKDEXH2T5263H/

> Statuses on canceled builds are displayed in reverse:

The cancel vs. failed status is not deterministic. It very much depends on
what build stage you hit the "cancel" button in (the failed builds would fail
anyway, even if you did not cancel them). This part is not a bug.

Timeout should result in failure, not cancel.

> See builds for which OOM is always reproductible:

Can you please try to reproduce this locally? We don't have the capacity to
debug build failures. Then perhaps move this bug to the compiler component?

Also, do you think that turning off systemd-oomd would help?
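For a quick manual trial on a single builder, turning it off would be roughly this (a sketch, not necessarily the exact change we would deploy via Ansible):

  $ systemctl status systemd-oomd        # check whether the userspace killer is active
  $ sudo systemctl disable --now systemd-oomd
  $ sudo systemctl mask systemd-oomd     # keep it from being re-activated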
(In reply to Pavel Raiskup from comment #1)
> Thank you for the report, Balint.

Thank you for your time, Pavel!

> > After recent upgrades (mid january) x86_64 copr builders are failing with OOM.
> > Some specific packages (built fine since 1yr) are now impossible to build.
>
> From the memory management perspective, no change happened. We just updated
> the builders from Fedora 34 to Fedora 35:
> https://lists.fedoraproject.org/archives/list/copr-devel@lists.fedorahosted.org/thread/3JORYRKDWFMJSR35Z4LIKDEXH2T5263H/

(x) I see now.

> > Statuses on canceled builds are displayed in reverse:
>
> The cancel vs. failed status is not deterministic. It very much depends
> in what build stage you hit the "cancel" button (the failed builds would
> fail anyway, even if you did not cancel it). This part is not a bug.
>
> Timeout should result in failure, not cancel.

(x) Understood, never mind; it's not trivial, it just looks counter-intuitive sometimes.

> > See builds for which OOM is always reproductible:
>
> Can you please try to reproduce this locally? We don't have the capacity to
> debug build failures. Then perhaps move this bug to the compiler component?

* I tried lowering everything (LTO/annobin/hardening) and even constraining the Java (bazel) resources.
* Locally I cannot reproduce it with ease; I believe it is related to the OOM behaviour described below.

(x) It can't be a compiler bug; let me detail:
  - In the case of tensorflow, it OOMs in the early prep/configure stage
    (the bazel tool runs on top of Java; I tried limiting its heap to 1G/2G).
  - In the case of mxnet & cupy, which involve CUDA, I created a dedicated 'cuda-gcc'
    compiler for strict compatibility (i.e. gcc-12 won't work with CUDA at all).

> Also, do you think that turning off systemd-oomd would help?

(x) Can you do that for a trial?
  a. Maybe by invoking some magic OOM switch flag from the copr-cli side?
  b. Or try to adjust the OOM parameters instead (i.e. set vm.overcommit_memory = 1, see below).
  c. Or just disable OOM completely (not sure what impact that has on the overall Copr ecosystem).

* My understanding of OOM on the builders (correct me if this is not the case):
  - Builders have 4G of RAM (main) and 128G of swap.
  - The OOM killer usually works on some kind of 'oom score' metric.
  - The metric is described here (bottom):
    https://www.kernel.org/doc/gorman/html/understand/understand016.html
  - The allowed threshold is CommitLimit = TotalRAM * (overcommit_ratio / 100) + TotalSwap,
    where overcommit_ratio is set by the system administrator.

* The CommitLimit (newly attached [builder-live.log]) looks fine:

  $ zcat builder-live.log.gz | grep CommitLimit
  CommitLimit:    153295124 kB

* vm.overcommit_memory is on the heuristic setting; maybe we can set it to 1 (always overcommit)?

  $ cat builder-live.log | grep -e vm.overcommit -e vm.swappiness
  vm.overcommit_kbytes = 0
  vm.overcommit_memory = 0
  vm.overcommit_ratio = 50
  vm.swappiness = 60

  vm.overcommit_memory:
  0: heuristic overcommit (this is the default)
  1: always overcommit, never check
  2: always check, never overcommit

No more ideas for now, except the proposed {a,b,c}, or perhaps having someone
with better OOM kernel knowledge help figure it out.
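For completeness, here is how I would inspect and trial those knobs on a builder (a sketch; persisting the change belongs in /etc/sysctl.d or Ansible rather than a one-off sysctl -w):

  $ grep -e MemTotal -e SwapTotal -e CommitLimit -e Committed_AS /proc/meminfo
  $ sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
  $ sudo sysctl -w vm.overcommit_memory=1    # always overcommit, never check
  $ sudo sysctl -w vm.overcommit_ratio=100   # only matters for vm.overcommit_memory=2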
Created attachment 1859556 [details]
VM_OOM_debug_infos #2
See also: https://pagure.io/copr/copr/issue/2077
> (x) Can you do that for a trial ?

Testing:
https://pagure.io/fedora-infra/ansible/c/4fe6812fc5318e153ad553139948688d1b192c98?branch=main

Can you please give it a try?
Thank you for the kernel knob overview; we will try to experiment with those as well if turning off systemd-oomd doesn't help.
(In reply to Pavel Raiskup from comment #5)
> > (x) Can you do that for a trial ?
>
> Testing:
> https://pagure.io/fedora-infra/ansible/c/4fe6812fc5318e153ad553139948688d1b192c98?branch=main

Thank you Pavel!

> Can you please give it a try?

(x) It keeps failing with OOM, but the usual "terminated abruptly" messages (due to
OOM, as previously debugged) are no longer visible in the dmesg logs.

Tested on tensorflow, which relies on bazel (a Java-based build system) and dies in
the early prep stage of the repo:

1. "terminated abruptly", killed by OOM (no dmesg print was enabled):
   https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-rawhide-x86_64/03382595-tensorflow/builder-live.log.gz
2. just stale, probably OOM:
   https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-rawhide-x86_64/03382604-tensorflow/builder-live.log
3. just stale, probably OOM:
   https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-rawhide-x86_64/03382737-tensorflow/builder-live.log.gz
4. "terminated abruptly", but nothing OOM-related can be seen in the subsequent dmesg print command:
   https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-35-x86_64/03384900-tensorflow/builder-live.log

The OOM-related messages probably can't be seen because systemd-oomd is stopped now.
Conditions in the .spec were the same as for the previous ~10 successful builds.

(x) Stopping the systemd-oomd service would not be enough; here is my understanding of this:
  * Stopping the service itself doesn't disable the kernel's OOM-killing facility.
  * systemd-oomd is only a service that manages coredumps (store, compress, manage, log).
  * systemd-oomd is only a kind of "middleware" between the kernel and the on-disk dumps/logs.

----

I believe the next step would be to tweak things via /etc/sysctl.conf (see the sketch below):
  * vm.overcommit_memory={1,2} and vm.overcommit_ratio=100
  * or even disable the kernel oom-killer: vm.oom-kill=0
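A minimal sketch of such a sysctl drop-in (the file name is hypothetical, and vm.oom-kill may no longer exist on current kernels):

  # /etc/sysctl.d/99-copr-oom-trial.conf   (hypothetical file name)
  vm.overcommit_memory = 1      # or: 2, combined with vm.overcommit_ratio = 100
  # vm.oom-kill = 0             # only if the kernel still supports this knob

  $ sudo sysctl --system        # apply without a reboot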
(x) In addition to the previous comment #7, I would add this more evident experiment to the list:

5. Here is one OOM-killed by the kernel, not by systemd-oomd:
   https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-35-x86_64/03389229-tensorflow/builder-live.log

   Here a kernel-side OOM is caught despite a throttled bazel invocation:
   "--jobs 2 --host_jvm_args=-Xmx1g --host_jvm_args=-Xms512m"

(x) Also, to complete my statement about systemd-oomd being "just (store, compress, manage, logs)":
  * systemd-oomd also has its own ability to measure and **kill** processes _before_ the kernel OOM killer would kick in.
  * in fact it seems that systemd-oomd was introduced as a finer-grained userspace daemon compensating for what the kernel-space OOM killer does (or does _not_ do properly!).
  * I investigated some details about it:
    https://fedoraproject.org/wiki/Changes/EnableSystemdOomd

---

So our problem seems related to the kernel OOM side rather than the systemd-oomd one.
The kernel-side OOM killer could be disabled (leaving systemd-oomd running in userspace),
or we can change its vm.* parameters as proposed.
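For reference, the userspace side can be inspected directly on a builder (a sketch; assuming oomctl is shipped in the image):

  $ systemctl status systemd-oomd    # is the userspace killer running at all?
  $ oomctl dump                      # cgroups and swap/pressure limits systemd-oomd watches
  $ cat /etc/systemd/oomd.conf       # thresholds, if overridden from the defaults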
> (x) In addition to previous comment #7, would add this more evident
> experiment to list:
>
> 5. Here is one with OOM killed (by kernel not systemd-oomd):
>    https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-35-x86_64/03389229-tensorflow/builder-live.log

More analysis for this point:

- Being killed for 3.6 GB of memory is hilarious; I think the kernel OOM killer
  mis-penalizes that process badly (or it is fluctuating behaviour?!):

  [ 1107.737092] Out of memory: Killed process 6583 (java) total-vm:3663980kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1848kB oom_score_adj:0

- In my local PC tests (where I can't reproduce the OOM) I can confirm that bazel/java
  doesn't use more than 4G; memory in the prep stage fluctuates a lot, as it does
  plan-ahead "scheduling" while indexing a large amount of code.

- It also looks like the kernel OOM killer somehow didn't take the swap space into
  account; it saw fluctuating 3.6G amounts on java and killed it off based on its
  metric (oom_score).

- There is also a well-reasoned effort towards userland OOM handling done by Facebook:
  https://github.com/facebookincubator/oomd ; I believe systemd-oomd exists these days
  for exactly the same reasons.

I would even support the idea that the copr-builders should turn off the kernel OOM
killer completely and instead keep the more reliable userland-based systemd-oomd running.

Let me know if we can further experiment with what was proposed in #7, #8 and finally here.
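For anyone reproducing this on a builder, the kernel's per-process view can be sampled while bazel runs (a sketch; the way I pick the JVM PID here is just illustrative):

  $ pid=$(pgrep -f 'bazel.*server' | head -n1)           # hypothetical match for the bazel server JVM
  $ cat /proc/$pid/oom_score /proc/$pid/oom_score_adj
  $ grep -e VmSize -e VmRSS -e VmSwap /proc/$pid/status
  $ free -h; swapon --show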
So, the only thing that can happen is that the user breaks the builder so badly that
even the ssh connection gets broken -- we are able to automatically detect this
situation, terminate the worker, and restart the build. So I think it is just OK to
do vm.oom-kill=0 in this case.

https://pagure.io/fedora-infra/ansible/c/fb7a2198f7981c558f0dbbeac62f5bb90c90284c?branch=main
Should be deployed to production. Can you please re-try once more?
(In reply to Pavel Raiskup from comment #11)

Thank you Pavel!

> Should be deployed to production. Can you please re-try once more?

Yes, I'll be back here with results.
Ok, we can not disable the oom-killer entirely on Fedora. So I went with:
https://pagure.io/fedora-infra/ansible/blob/4c6d121f4c1e2171a0df5628b46a3e2ab98fc814/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_129-133

The system now has this:

  vm.overcommit_memory = 2
  vm.overcommit_ratio = 50

Which still allows a huge pile of memory... overcommit_ratio=100 shouldn't be needed, I guess.

The machines with vm.oom-kill=0 did not even boot, so don't worry if you already
submitted the testing build (it will be processed on the `overcommit_memory = 2`
system, once booted).
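With overcommit_memory=2 the kernel enforces a hard commit limit; it can be read back on a running builder to confirm it still covers the large swap volume (a sketch):

  $ sysctl vm.overcommit_memory vm.overcommit_ratio
  $ grep -e CommitLimit -e Committed_AS /proc/meminfo
    # CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100;
    # allocations start failing with ENOMEM once Committed_AS would exceed CommitLimit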
(In reply to Pavel Raiskup from comment #13)
> Ok, we can not disable oom-killer entirely on Fedora. So I went with:
> https://pagure.io/fedora-infra/ansible/blob/4c6d121f4c1e2171a0df5628b46a3e2ab98fc814/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_129-133
>
> The system has this:
> vm.overcommit_memory = 2
> vm.overcommit_ratio = 50

* I see now; no problem, I will test it this way.

> Which is huge pile of memory... overcommit_ratio=100 shouldn't be needed I
> guess.

* Yes, it should definitely be a lot; 8G (for the heap) is in fact fine enough.

> The machines with vm.oom-kill=0 did not even boot, so don't worry if you
> already submitted the testing build (it will be processed on the
> `overcommit_memory = 2` system (once booted).

* I noticed the long waiting time and guessed that the machines were maybe
  re-loading/re-caching the new images.
* I can early-confirm that OOM looks to behave normally now:
  - For the tensorflow/bazel case (the most acute one), java/bazel now passes the
    aggressive plan-ahead prepping step.
  - I deliberately didn't constrain any of bazel's parameters; I left the ones from
    the latest successful builds.

* I will also submit two "troubled" ones:
  - *mxnet* and *cupy*, which suffered OOM (after >10h) at the final linking stage
    (despite turning off lto/harden/annobin).

Please allow +24h/48h for extended test results; I will be back with a summary here.

Thank you Pavel!
> Please allow +24h/48h for extended test results, so will be back with a summary here.

(x) Unfortunately the kernel OOM killer still interferes in the very same way:

https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-35-x86_64/03415599-tensorflow/builder-live.log

  [ 1035.210397] Out of memory: Killed process 6707 (java) total-vm:3621212kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1932kB oom_score_adj:0
  [ 1035.360300] oom_reaper: reaped process 6707 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-rawhide-x86_64/03415707-tensorflow/builder-live.log.gz

  [ 446.863938] Out of memory: Killed process 6599 (java) total-vm:3617260kB, anon-rss:2464kB, file-rss:4kB, shmem-rss:400kB, UID:1000 pgtables:1360kB oom_score_adj:0
  [ 446.946092] oom_reaper: reaped process 6599 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

  $ zcat builder-live.log.gz | grep -e vm.over
  vm.overcommit_kbytes = 0
  vm.overcommit_memory = 2
  vm.overcommit_ratio = 50

----///----

More options:

The vm.oom-kill switch does not exist anymore (that's why the image didn't even start),
so we can't *just* disable the kernel's OOM killer :-(

Our last (?) option would be:
  - vm.overcommit_memory=1 (from the kernel docs: "1 -> the kernel pretends there is
    always enough memory until it actually runs out")
Isn't this a kernel issue in the end? Do you think that from any POV we give the kernel reason to kill the process? We'll have to test the OOM killer on a synthetic process, a predictably allocating process.
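A crude sketch of such a synthetic test, without involving Java at all (the 8 GiB target is a placeholder to tune against the 4G RAM + large swap builders):

  $ python3 -c '
  import time
  chunks = []
  for i in range(32):                                   # 32 x 256 MiB = 8 GiB, written page by page
      chunks.append(b"\x01" * (256 * 1024 * 1024))
      print("allocated", (i + 1) * 256, "MiB", flush=True)
  time.sleep(60)                                        # hold the memory so the killer has time to act
  '
  $ dmesg | grep -i -e "out of memory" -e oom_reaper    # did the kernel kill it?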
> Isn't this a kernel issue in the end?

* Yes, highly probable.

> Do you think that from any POV we give
> the kernel reason to kill the process? We'll have to test the OOM killer on
> a synthetic process, a predictably allocating process.

* I think systemd-oomd exists there as a fine-grained OOM killer to compensate for the kernel's one.
* I do not have the resources to *debug* the kernel OOM misbehaviour for now.
* Emulating what bazel/java does in the heap can't be done with just a simple synthetic malloc().

My documented point of view would be: no kernel OOM killer (or keep it at bay with
safe-enough values), but have the more reliable systemd-oomd one instead.
> * Emulating what bazel/java does there in the heap can't be done just
>   with simple synthetic malloc().

I'm afraid this needs some consultation between the Java and kernel folks.

What I can do here is give you ssh access to one builder machine for your own
experiments. I was able to reproduce this easily there, even after an update to
the latest F35 kernel (still the same).

Switching this bug over to the kernel component; hopefully someone can give us some hint.
> give you ssh access to one builder machine

If anyone wants to give this a try, ping me on #buildsys-build.
(In reply to Pavel Raiskup from comment #19)

Thanks for your patience, Pavel!

> > * Emulating what bazel/java does there in the heap can't be done just
> > with simple synthetic malloc().
> I'm afraid this needs some consultation between java and kernel folks.

I see now. Well, more experienced folks are welcome.

If one wishes to reproduce the OOM issue, the best (fastest) way is with bazel/java;
the build recipe .specs can be invoked:

(x) The "normal" unconstrained one, which just built fine anytime in the past ~1yr:
    https://copr-dist-git.fedorainfracloud.org/cgit/rezso/ML/tensorflow.git/tree/tensorflow.spec

(x) The "testing" constrained (bazel & java) variant of the same spec file:
    https://copr-dist-git.fedorainfracloud.org/cgit/rezso/ML/tensorflow.git/tree/tensorflow.spec?id=f66d75017e563d0cccb2a3a700a89e7f7ff0a32a

(x) The diff between the constrained and unconstrained .spec files:
    https://copr-dist-git.fedorainfracloud.org/cgit/rezso/ML/tensorflow.git/commit/tensorflow.spec?id=f66d75017e563d0cccb2a3a700a89e7f7ff0a32a

> What I can do here is to give you ssh access to one builder machine, for
> your own experiments. I was able to reproduce this easily there, even
> after an update to the latest F35 kernel (still the same)).

It would not really be helpful to me personally; I don't have more ideas towards
further debugging, and printing with proc & dmesg already revealed some of the details.

The last available option, to my limited knowledge, would be to try the yet-untested
vm.overcommit_memory=1 mode, in the hope of keeping the kernel OOM killer at bay.

> Switching against kernel, hopefully someone can give us some hint.

Let me know if you need more details or help here; I will follow up anytime requested.
* Another unexpected OOM killed the last gcc/ld linking step on one of the targets:
  https://download.copr.fedorainfracloud.org/results/rezso/ML/fedora-36-x86_64/03473409-opencv/builder-live.log.gz

  [g++: fatal error: Killed signal terminated program cc1plus]

* Historically `opencv` (a complete, +cuda enabled one) never encountered any kind
  of build issue in the past year:
  https://copr.fedorainfracloud.org/coprs/rezso/ML/package/opencv/
So I think I have a clue now, and I applied a hotfix:
https://pagure.io/fedora-infra/ansible/blob/c78ad03ea7539035ce94c2ceff0777daa420cb50/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_294-298

After the upgrade to F35, there were two swap volumes:
  - zram0 with priority 100 (new on F35)
  - /dev/<volume> with the default priority -2 (>= 140G, used since ever on our builders)

I'm not sure how this multi-swap scenario works, but I suppose that the
memory-intensive tasks start swapping to zram0, and when the space is not
enough there — instead of relocating the pages to the other (lower priority)
swap volume — they fail.

Also, even though we both _disable_ and _stop_ systemd-oomd via Ansible, the
service is still running and effectively oom-kills the most memory-intensive
process when the _first_ swap is eaten (while the other volume is (still?)
not used at all).

Also, note that the default priority -2 is given by the default command
'swapon /dev/vdb'.
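The layout can be verified and tweaked on a live builder like this (a sketch; /dev/vdb is the device name from our default provisioning, adjust as needed):

  $ swapon --show                  # lists both swap devices with their priorities
  $ sudo swapoff /dev/zram0        # drop the small high-priority zram swap, or:
  $ sudo swapoff /dev/vdb && sudo swapon -p 200 /dev/vdb   # re-add the big volume above zram0's priority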
(In reply to Pavel Raiskup from comment #23)
> So I think I have a clue now so I applied a hotfix:
> https://pagure.io/fedora-infra/ansible/blob/c78ad03ea7539035ce94c2ceff0777daa420cb50/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_294-298

Thanks a lot Pavel! Allow 24/48h to test it; I will be back.

> After the upgrade to F35, there were two SWAP volumes
> - zram0 with priority 100 (new on F35)
> - /dev/<volume> with the default priority -2 (>= 140G, used since ever on
>   our builders)
>
> I'm not sure how this multi-swap scenario works, but I suppose that the
> memory-intensive tasks start swapping to zram0, and when the space is not
> enough there — instead of relocating the pages to the other (lower
> priority) swap volume — they fail.
>
> Also, even though we both _disable_ and _stop_ systemd-oomd by Ansible,
> the service is still running and effectivelly oom-kills the most memory
> intensive process when the _first_ swap is eaten (while the other volume
> is (still?) not used at all).
>
> Also, note the default priority -2 is given by the default command
> 'swapon /dev/vdb'.

The description makes sense; it might well be this multi-swap thing. Perhaps the
kernel OOM killer did not account for one of the volumes.
(In reply to Pavel Raiskup from comment #23)
> So I think I have a clue now so I applied a hotfix:
> https://pagure.io/fedora-infra/ansible/blob/c78ad03ea7539035ce94c2ceff0777daa420cb50/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_294-298
>
> After the upgrade to F35, there were two SWAP volumes
> - zram0 with priority 100 (new on F35)
> - /dev/<volume> with the default priority -2 (>= 140G, used since ever on
>   our builders)

Preliminary results on the tensorflow package build(s):

- Builds of the tensorflow instance now pass the bazel prep stage in 30 seconds instead of 9 minutes!
- There are definitely no OOMs of any kind on any of the x86_64 builders; java moves smoothly through the particular bazel/tensorflow builds.
- The unevenness of build time/speed that was also observed (up to ~5x between x86_64 builders) looks to be gone as well.

Thank you very much Pavel! I'll continue with the other "known failures" before fully
acknowledging the real improvement, and will try to put together some perf statistics
(related to the previous unevenness).
(In reply to Pavel Raiskup from comment #23)
> So I think I have a clue now so I applied a hotfix:
> https://pagure.io/fedora-infra/ansible/blob/c78ad03ea7539035ce94c2ceff0777daa420cb50/f/roles/copr/backend/files/provision/provision_builder_tasks.yml#_294-298
>
> After the upgrade to F35, there were two SWAP volumes
> - zram0 with priority 100 (new on F35)
> - /dev/<volume> with the default priority -2 (>= 140G, used since ever on
>   our builders)
>
> I'm not sure how this multi-swap scenario works, but I suppose that the
> memory-intensive tasks start swapping to zram0, and when the space is not
> enough there — instead of relocating the pages to the other (lower
> priority) swap volume — they fail.
>
> Also, even though we both _disable_ and _stop_ systemd-oomd by Ansible,
> the service is still running and effectivelly oom-kills the most memory
> intensive process when the _first_ swap is eaten (while the other volume
> is (still?) not used at all).
>
> Also, note the default priority -2 is given by the default command
> 'swapon /dev/vdb'.

This works.

(x) Builds now go through without any OOM problems, right from the first submission:
  - 3592527 (tensorflow)
  - 3593305 (tensorboard)
  - 3592638 (mxnet)
  - 3590401 (opencv)

In the case of the Java-based bazel prep stage there is a massive improvement:
30 seconds instead of ~9 minutes, and it doesn't OOM at all.

(x) There is now also an evenness of build times within an arch (especially x86_64):
  - 3590401 (opencv) vs 3501854 (opencv)

Thanks again Pavel & the Copr team!
This is IMO still worth tracking against the kernel. Multiple swap devices with
priorities (+ zram) are probably not that unusual, and the swap behavior in such a
case should be sane.

In Copr we worked around it by disabling the zram swap device:
https://pagure.io/copr/copr/issue/2077
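For anyone hitting the same behaviour elsewhere, disabling the zram swap on Fedora typically amounts to overriding zram-generator (a sketch; exact paths may differ per release):

  $ sudo swapoff /dev/zram0                      # for the running system
  $ sudo touch /etc/systemd/zram-generator.conf  # an empty override disables the generated zram swap on next boot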
This message is a reminder that Fedora Linux 35 is nearing its end of life. Fedora
will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13. It is
Fedora's policy to close all bug reports from releases that are no longer maintained.
At that time this bug will be closed as EOL if it remains open with a 'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix
it in a currently maintained version, change the 'version' to a later Fedora Linux
version.

Thank you for reporting this issue and we are sorry that we were not able to fix it
before Fedora Linux 35 is end of life. If you would still like to see this bug fixed
and are able to reproduce it against a later version of Fedora Linux, you are
encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it will not receive any
further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version field
may be hidden. Click the "Show advanced fields" button if you do not see the version
field.

If you are unable to reopen this bug, please file a new report against an active
release.

Thank you for reporting this bug and we are sorry it could not be fixed.
Reopening. The same problem is observed on F37.
This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora
will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is
Fedora's policy to close all bug reports from releases that are no longer maintained.
At that time this bug will be closed as EOL if it remains open with a 'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix
it in a currently maintained version, change the 'version' to a later Fedora Linux
version. Note that the version field may be hidden. Click the "Show advanced fields"
button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it
before Fedora Linux 37 is end of life. If you would still like to see this bug fixed
and are able to reproduce it against a later version of Fedora Linux, you are
encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it will not receive any
further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version field
may be hidden. Click the "Show advanced fields" button if you do not see the version
field.

If you are unable to reopen this bug, please file a new report against an active
release.

Thank you for reporting this bug and we are sorry it could not be fixed.