Created attachment 1750648 [details]
dmesg from f33 + 5.10.10 + 40GB mem + direct-boot

Fedora 32-bit arm builders were: Fedora 32, lpae kernel, 5.6.x, 24GB RAM, direct kernel boot, and were operating fine.

On upgrading to Fedora 33, lpae kernel, 5.10.x (any amount of RAM from 8GB to 40GB, uefi or direct kernel boot), kojid gets OOM killed during builds that use a lot of resources. A python3.8 build with tests enabled or a gcc build both show this issue; there may be others. The build goes along and then, while running tests, kojid is OOM killed. ;(

We have also tried a Fedora 32 userspace with 5.10.x kernels. Sadly, there was a configuration change with the Fedora 5.7 kernels: all Fedora kernels from 5.7 to 5.10.9 have HIGHPTE set, which causes 32-bit arm lpae guests to 'pause', so we can't use any of them. ;(

Basically it seems like between 5.6.x and 5.10.x either the OOM handling got more aggressive, or the 32-bit lpae case started using the memory it has less effectively. I've also tried setting highmem_is_dirtyable => 1 and lowmem_reserve_ratio => 1, to no real effect.

Will attach dmesg from the Fedora 33 + 5.10.10 + 40GB memory + direct boot case.
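For the record, the two tunables mentioned can be set like this. This is only a sketch of what was tried, not a recommendation; note that vm.lowmem_reserve_ratio normally takes one value per memory zone, so writing a single '1' only overrides the leading entry:

```shell
# Sketch of the VM tuning attempts described above (run as root in the guest).
# vm.lowmem_reserve_ratio is normally an array with one entry per zone;
# the single value here mirrors what was tried rather than a tuned setting.
sysctl -w vm.highmem_is_dirtyable=1
sysctl -w vm.lowmem_reserve_ratio=1
```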
Paul: were you able to reproduce this? Or would you like me to gather more info? If so, what?
(In reply to Kevin Fenzi from comment #1)
> Paul: were you able to reproduce this? Or would you like me to gather more
> info? if so what?

I have not been able to reproduce this running the builds outside of Koji. Could we set up a builder in staging to try there?
ok. I have set up buildvm-a32-01.stg.iad2.fedoraproject.org with an f33 uefi install and 5.10.11-200.fc33.armv7hl+lpae. You will need to:

* go to https://admin.stg.fedoraproject.org/accounts and use the 'forgot password' link to reset your password.
* log in, enroll a 2fa token, and update your ssh keys (if out of date).
* set up ssh to use our bastion hosts ( https://docs.pagure.org/infra-docs/sysadmin-guide/sops/sshaccess.html )
* ssh to buildvm-a32-01.stg.iad2.fedoraproject.org; you should be able to sudo there with your password/2fa token.

I started a python3.8 build on it:
https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90112693

Let me know if there's anything further I can help with.
I did a clean, fresh f34 install on buildvm-a32-01.stg and the problem persists. I upgraded that vm to 5.12.0-0.rc7.189.fc35.armv7hl+lpae and the problem still persists. Let me know if there is anything I can do to move this along or more info I can gather.
Right now buildvm-a32-01.stg.iad2.fedoraproject.org is f34 and enabled in stg-koji as the only 32-bit arm builder. So, I have been using:

fedpkg clone -a python3.8
cd python3.8
fedpkg srpm
stg-koji build --scratch f33 python3.8-3.8.9-1.fc35.src.rpm --arch-override armv7hl

The build should go to that builder, get most of the way done, OOM-kill kojid, and restart.
Created attachment 1777845 [details]
libvirt xml for f34 32bit arm guest vm

Here's the libvirt xml for the f34 test vm I last duplicated the problem on.
Also, I'm wondering now if annobin is causing the problem.
Current status: in my testing last week, it seemed the problem was gone in the 5.12.x kernels (or at least much, much rarer). Based on that, I moved all the builders to f34 and the newest kernel. However, the issue may not be solved after all. I'd like to leave this open for feedback from users, to see when/what/if this still happens to armv7 builds. Thanks.
This libreoffice build:
https://koji.fedoraproject.org/koji/taskinfo?taskID=69447907

Buildroots:
/var/lib/mock/f35-python-27616570-3666659
/var/lib/mock/f35-python-27620380-3666659
/var/lib/mock/f35-python-27622929-3666659
/var/lib/mock/f35-python-27623229-3666659
/var/lib/mock/f35-python-27623531-3666659

Total time 4:16:36
Task time 0:16:19

%prep:

Initialized empty Git repository in /builddir/build/BUILD/libreoffice-7.1.3.2/.git/
+ /usr/bin/git config user.name rpm-build
+ /usr/bin/git config user.email '<rpm-build>'
+ /usr/bin/git config gc.auto 0
+ /usr/bin/git add --force .
+ /usr/bin/git commit --allow-empty -a --author 'rpm-build <rpm-build>' -m 'libreoffice-7.1.3.2 base'
/var/tmp/rpm-tmp.K9GR5j: line 68: 25966 Killed /usr/bin/git commit --allow-empty -a --author "rpm-build <rpm-build>" -m "libreoffice-7.1.3.2 base"
error: Bad exit status from /var/tmp/rpm-tmp.K9GR5j (%prep)

We try again in https://koji.fedoraproject.org/koji/taskinfo?taskID=69473607

Until that builds, we cannot merge the Python 3.10 side tag :/
I am afraid that I see this again: https://koji.fedoraproject.org/koji/taskinfo?taskID=73270694
llvm also seems to be affected, in %check.
FWIW, wrt the libreoffice build failing on armv7hl: trying to convert

-%__scm_setup_git_am

to

+%{__git} init
+%{__git} config user.name rpm-build
+%{__git} config user.email '<rpm-build>'
+%{__git} config gc.auto 0
+%{__git} ls-files -z --others | xargs -0 -n 1000 %{__git} add --force
+%{__git} ls-files -z | xargs -0 -n 1000 %{__git} commit --allow-empty --author "rpm-build <rpm-build>" -m "%{NAME}-%{VERSION} base"

to try incrementally git-adding and committing still failed to get past %prep.
ok. I just moved all the builders to 5.12.19-300.fc34.armv7hl+lpae, which we think was more stable. If everyone could keep an eye out for kojid restarts and let me know if you still see them, and on what build(s), that would be great.
As a general comment, I think that using a 32-bit kernel with this much physical memory is asking for trouble, as you are not just constrained on virtual addressing in the guest but also on lowmem. https://lwn.net/Articles/813201/ has some information about problems you may run into with highmem-heavy workloads.

The Debian build machines migrated to 64-bit kernels a while ago, and I think this is the only sensible way forward in order to reliably build packages. Running 32-bit kernels will only get worse, as we plan to remove support for highmem in the future, and newer Arm cores (Cortex-A75 and later) no longer support running 32-bit kernels. Note that we never officially supported running 32-bit kernels on Armv8 hardware, even though it works most of the time, and the known bugs are mostly limited to running 32-bit kernels on physical Armv8 hardware.

My best suggestion is to add a 64-bit kernel package to the Fedora armhfp distro, ideally using the exact same kernel config that you ship on Fedora arm64, to limit the amount of extra validation work. This would help both the build service and actual users who may want to run Fedora arm32 on arm64 hardware.

Regardless of problems with your setup, I think we need to investigate further, as it sounds like there is a problem that may hit normal users (with no highmem) as well. If you have a way to reproduce this, I would suggest first ruling out highmem as the root cause by using a kernel with CONFIG_HIGHMEM=n and CONFIG_VMSPLIT_2G=y, giving your guest 2GB of RAM and plenty of swap space (backed e.g. by a host tmpfs file). Does this show the same regression between linux-5.6.x and linux-5.10.x, or do both kernels behave the same way without highmem?
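A hedged host-side sketch of that suggested test setup — the paths, sizes, and device names here are assumptions for illustration, not from this report:

```shell
# Back the test guest's swap disk with a file on a host tmpfs,
# so guest swap I/O effectively lands in host RAM.
mkdir -p /mnt/tmpswap
mount -t tmpfs -o size=9G tmpfs /mnt/tmpswap
qemu-img create -f raw /mnt/tmpswap/guest-swap.img 8G
# Attach guest-swap.img to the 2GB CONFIG_HIGHMEM=n guest as an extra
# virtio disk; then inside the guest: mkswap /dev/vdb && swapon /dev/vdb
```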
Can we please move to aarch64 kernels for the builder VMs? We don't have a multilib compose for armv7hl, but that will not prevent us from running mock chroots for armv7hl on an aarch64 kernel with compat system calls enabled.

I expect there is much more interest in 32-bit compat mode in the aarch64 kernels than in the 32-bit LPAE kernel. And as Arnd mentioned in comment 14 (and as others have pointed out in a previous discussion on the Fedora devel list), hardware support for 32-bit kernel mode is already gone from some current CPUs that still support armv7hl in userspace.
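Whether a given aarch64 kernel was built with 32-bit compat syscall support can be checked from a running system. A sketch — reading /proc/config.gz assumes CONFIG_IKCONFIG_PROC is enabled, with the /boot config file as fallback:

```shell
# Check for 32-bit (compat) syscall support in the running aarch64 kernel.
# Assumes either CONFIG_IKCONFIG_PROC or a config file in /boot.
if [ -r /proc/config.gz ]; then
    zgrep -E 'CONFIG_COMPAT=|CONFIG_ARM64=' /proc/config.gz
else
    grep -E 'CONFIG_COMPAT=|CONFIG_ARM64=' "/boot/config-$(uname -r)"
fi
```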
This message is a reminder that Fedora 33 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '33'.

Package Maintainer: if you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 33 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Moving to 34, as this definitely still happens there. I am hoping to test f35 soon...
ok, the staging builders have been moved to f35 now. I am trying to see if the problem exists on them using the normal '5.15.6-200.fc35.armv7hl+lpae' kernel.

Additionally, and I have no idea if this matters at all, we have moved our RHEL8 virtual hosts from the normal 'rhel' virt stream to the "advanced virt" 8.3 stream (newer qemu and libvirt, etc).

I tried a python3.10 build and it finished fine:
https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90134957

Trying a gcc one now:
https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90134957

Other ideas on how to test it are welcome.
I've just hit this problem on the real Koji with pypy3.8 in many builds. If you need something to try on staging, pypy3.8 is a good candidate (or at least it seems to be).

https://koji.fedoraproject.org/koji/taskinfo?taskID=80916190
https://koji.fedoraproject.org/koji/taskinfo?taskID=80916277
https://koji.fedoraproject.org/koji/taskinfo?taskID=80877195
https://koji.fedoraproject.org/koji/taskinfo?taskID=80877165

https://src.fedoraproject.org/rpms/pypy3.8/ (does not exist on staging; will probably need to import it or upload the SRPM).

My attempt: https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90426931
It failed with: <class 'OSError'>: [Errno 28] No space left on device
Sorry about that. Cleaned up space on the hub... please refire?
Seems OK indeed:

https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90435429
https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90440889
So, I moved all the prod ones to 5.16.0 and things seem better, but not perfect.

I'm wondering if there might be an xfs issue now... I installed some with btrfs and so far they have seen zero OOMs. Will keep watching.
> I'm wondering if there might be a xfs issue now... I installed some with
> btrfs and so far they have seen zero ooms.
> will keep watching.

Probably more straightforward to use ext4 if xfs is a concern, as that's likely what upstream is using.
(In reply to Kevin Fenzi from comment #25)
> So, I moved all the prod ones to 5.16.0 and things seem better, but not
> perfect.
>
> I'm wondering if there might be a xfs issue now... I installed some with
> btrfs and so far they have seen zero ooms.
> will keep watching.

Here's an example of a build that seems to hit this:
https://koji.fedoraproject.org/koji/taskinfo?taskID=82341484
Yep. It has. I reassigned it to a btrfs one... let's see if it finishes now without further OOMing. I picked btrfs over ext4 because btrfs has cgroup support, so it seemed like it might handle memory of the mock process tree better (but that might be wishful thinking).
For almost half a year I hadn't seen this much, but since mid-August we are hitting it again with Python builds on armv7hl. Recent examples:

https://koji.fedoraproject.org/koji/taskinfo?taskID=91776229
https://koji.fedoraproject.org/koji/taskinfo?taskID=91908159
https://koji.fedoraproject.org/koji/taskinfo?taskID=91781603
Unfortunately, none of the logs in the above tasks reveal anything about the OOM happening. Ideally we'd get a full journalctl, since that would help us understand whether it's the kernel oom-killer or the userspace systemd-oomd doing the kill, and why. I'm not sure how to make it easier or automatic to get specifically that log out of these build environments, though.
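One possible way to pull just the relevant pieces out of a builder's journal without shipping the whole thing — a sketch, where the grep pattern is an assumption about how the oom-killer messages are worded:

```shell
# Kernel oom-killer events from the current boot:
journalctl -k -b | grep -iE 'oom|out of memory'
# Kills performed by the userspace systemd-oomd, if it is active:
journalctl -b -u systemd-oomd.service
```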
Created attachment 1912244 [details]
dmesg from buildvm-a32-05

Here's a dmesg from one of the builders. A full journal would be... super large. I guess I could reboot them and capture the journal after that.
I had another look at the numbers, and it's fairly clear why you run out of lowmem: the kernel runs with VMSPLIT_3G, so there is only 768MB of lowmem to start with. Of that, a whole 468MB is used for the mem_map[] array that is needed to manage the entire 40GB of RAM. On top of that, there are 64MB reserved for SWIOTLB and around 15MB for the kernel itself, which leaves only 221MB for anything else that the kernel does, and this is clearly not enough.

You could build a 32-bit kernel with CONFIG_VMSPLIT_2G in order to add another 1GB of lowmem, but that would likely result in some builds running out of virtual user address space.

In future kernels, highmem will likely go away, limiting the system to a few GB of total RAM (depending on configuration, definitely <=4GB). That avoids the problem of running out of lowmem, but it means you have less memory for concurrent processes.

Just use a 64-bit kernel.
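The lowmem budget above can be reproduced with some quick shell arithmetic. This is a sketch: the 48-byte sizeof(struct page) is an assumption (the actual size depends on the kernel config; the 468MB quoted above works out to roughly 47 bytes per page), so the mem_map figure comes out slightly high:

```shell
# Rough lowmem budget for a VMSPLIT_3G 32-bit kernel managing 40GB of RAM.
pages=$((40 * 1024 * 1024 / 4))        # number of 4KB page frames in 40GB
struct_page=48                         # assumed sizeof(struct page), in bytes
memmap_mb=$((pages * struct_page / 1024 / 1024))
lowmem_mb=768                          # lowmem available under a 3G/1G split
left_mb=$((lowmem_mb - memmap_mb - 64 - 15))   # minus SWIOTLB + kernel image
echo "mem_map: ${memmap_mb} MB, lowmem left: ${left_mb} MB"
# prints: mem_map: 480 MB, lowmem left: 209 MB
```

Either way the conclusion holds: the mem_map[] for 40GB of RAM alone consumes well over half of the 768MB of lowmem before the kernel allocates anything else.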
https://koji.fedoraproject.org/koji/taskinfo?taskID=91969604 was OOMing for 63:03:53, in case more examples are helpful.
32-bit arm support has ended; this should probably be closed.