Bug 1504264
Summary: hangs in mkfs.ext4/loop with 4.13.6-200.fc26.armv7hl+lpae
Product: Fedora
Component: kernel
Version: 27
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Reporter: Kevin Fenzi <kevin>
Assignee: Peter Robinson <pbrobinson>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Flags: jforbes: needinfo?
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-08-29 15:18:02 UTC
Bug Blocks: 245418
CC: airlied, bskeggs, dan, dustymabe, eparis, esandeen, hdegoede, herrold, ichavero, itamar, jarodwilson, jcm, jforbes, jglisse, jonathan, josef, jwboyer, kernel-maint, labbott, linville, mchehab, mjg59, ngompa13, nhorman, pbrobinson, quintela, steved
Attachments: dmesg-w, dmesg-t, test-output
Description (Kevin Fenzi, 2017-10-19 20:03:20 UTC)
This made me ponder whether it's time to just cut the 32-bit builders over to being containers on 64-bit VMs. That might make for a much more easily supportable setup.

(In reply to Jon Masters from comment #1)
> This made me ponder whether it's time to just cut the 32-bit builders over
> to being containers on 64-bit VMs. That might make for a much more easily
> supportable setup.

Define supportable; it would be nothing like any of our other infrastructure at the current point in time. We aren't even using the systemd-nspawn mode of mock yet, so I fear we aren't ready to try to containerize builders. At the very least that mode breaks some image creation tasks.

The host machines have been updated to the rhel7-alt kernel/userspace, and I've moved these armv7 VMs to Fedora 27. I thought for a day or two that it had fixed this issue, but it seems it just made it happen less often; we have still hit it since moving to F27 on the VMs. :( It seems to take a few days/composes before it hits. I suppose I could try Fedora 27 as the host OS? Or is there anything else I could gather to help track this down?

Just a note that we are still hitting this. Daily Rawhide composes hit these builders and sometimes hang them; rebooting them then allows the compose to continue.

We are still seeing this. The VMs now have 4.14.8-300.fc27.armv7hl+lpae; the hosts are on 4.11.0-44.2.1.el7a.aarch64. I am going to try installing the VMs on different storage (iSCSI if I can), and perhaps try running F27 on the host. Other ideas welcome.

If kdump/crashdump won't work on armv7, then the sysrq interface could provide some help. The "w" and "t" commands look useful.

(In reply to Dan Horák from comment #6)
> If kdump/crashdump won't work on armv7, then the sysrq interface could
> provide some help. "w" and "t" commands look useful.

Yeah.
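Using sysrq as suggested requires the interface to be enabled on the guest first. A minimal sketch for checking the current state (the sysctl path is the standard one; enabling it needs root, so that step is shown only as a comment):

```shell
#!/bin/sh
# Check whether the magic sysrq interface is enabled.
# The value is a bitmask; 1 enables all functions, 0 disables them.
f=/proc/sys/kernel/sysrq
if [ -r "$f" ]; then
    echo "sysrq mask: $(cat "$f")"
else
    echo "sysrq mask: unavailable (not readable on this system)"
fi
# To enable all sysrq functions (as root):
#   sysctl -w kernel.sysrq=1
```

Once enabled, "w" (blocked tasks) and "t" (all tasks) can be triggered from inside the guest by writing those letters to /proc/sysrq-trigger.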
And in case you haven't done this in a while, I wrote down how to send sysrq to a KVM guest here: https://dustymabe.com/2012/04/21/send-magic-sysrq-to-a-kvm-guest-using-virsh/

(In reply to Jon Masters from comment #1)
> This made me ponder whether it's time to just cut the 32-bit builders over
> to being containers on 64-bit VMs. That might make for a much more easily
> supportable setup.

Does the cross-section of hardware used for AArch64 servers running AArch64 VMs support armv7hl containers (that is, does it have support for 32-bit Arm instructions)? My experience thus far with several ARM servers (like the SoftIron ones) is that they lack that.

Created attachment 1397637 [details]
dmesg-w

(In reply to Dan Horák from comment #6)
> If kdump/crashdump won't work on armv7, then the sysrq interface could
> provide some help. "w" and "t" commands look useful.

Here's dmesg-w and dmesg-t output.

Created attachment 1397638 [details]
dmesg-t
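For reference, the sysrq-over-virsh approach from the blog post linked above boils down to emulating the Alt+SysRq+<key> chord with `virsh send-key` on the host. A hedged sketch follows: the domain name `buildvm-armv7-01` is a placeholder, and the commands are echoed rather than executed so they can be inspected first (remove the leading `echo` to actually send the keystrokes):

```shell
#!/bin/sh
# Sketch: send magic sysrq "w" and "t" to a libvirt/KVM guest from the host.
# The domain name is a placeholder; drop "echo" to really send the keys.
dom="buildvm-armv7-01"
for key in w t; do
    K=$(printf %s "$key" | tr '[:lower:]' '[:upper:]')
    echo virsh send-key "$dom" KEY_LEFTALT KEY_SYSRQ "KEY_$K"
done
```

The resulting backtraces land in the guest's kernel log, where they can be collected with dmesg as was done for the attachments above.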
Based on the backtraces, this smells like balancing getting stuck forever on dirty pages. Given this is 32-bit with highmem, there might be something off in the page calculations (especially since such things have happened in the past). I'll send an e-mail to linux-mm asking about this. In parallel it might be worth testing 4.15.

Looking again, it seems like the writeback is just getting throttled a lot. We _might_ be hitting something fixed by https://patchwork.kernel.org/patch/10201593/, but given we know the underlying storage is slow, it might be worth testing on a different medium.

I've updated the guests to 4.15.3-300.fc27.armv7hl+lpae. Note that the underlying storage is an LV on the host, backed by SSDs... so it shouldn't be all that slow. ;(

I asked upstream and I got this response:

"How much dirtyable memory does the system have? We do allow only lowmem to be dirtyable by default on 32b highmem systems. Maybe you have the lowmem mostly consumed by the kernel memory. Have you tried to enable highmem_is_dirtyable?"

Can we check/try adjusting the highmem_is_dirtyable setting?

OK, I have set that to 1 on the compose builders. Will see if the problem happens again.

So, since setting that to 1:

buildvm-armv7-01.arm.fedoraproject.org | SUCCESS | rc=0 | (stdout) 19:52:37 up 6 days, 19:41, 0 users, load average: 0.01, 0.06, 0.62
buildvm-armv7-02.arm.fedoraproject.org | SUCCESS | rc=0 | (stdout) 19:52:37 up 6 days, 21:16, 0 users, load average: 0.01, 0.07, 0.64
buildvm-armv7-03.arm.fedoraproject.org | SUCCESS | rc=0 | (stdout) 19:52:37 up 6 days, 58 min, 0 users, load average: 0.00, 0.03, 0.51

No reboots needed, no hangs. ;) I guess this is something we just need to keep manually setting? Or is it something upstream would be willing to change the default on?

If it's a tunable setting, I think the preference is for us to set it, but I'll follow up with upstream, because multiple processes stuck in D state is a bad failure mode for a tuning setting.
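The highmem_is_dirtyable tunable discussed above lives under /proc/sys/vm. A minimal sketch for checking and setting it (it only exists on 32-bit kernels built with highmem support, so the script handles its absence; changing it requires root, so that part is shown as comments):

```shell
#!/bin/sh
# Check the vm.highmem_is_dirtyable tunable. It exists only on 32-bit
# kernels with CONFIG_HIGHMEM, so handle its absence gracefully.
f=/proc/sys/vm/highmem_is_dirtyable
if [ -r "$f" ]; then
    echo "highmem_is_dirtyable: $(cat "$f")"
else
    echo "highmem_is_dirtyable: not present on this kernel"
fi
# To enable it at runtime (as root):
#   sysctl -w vm.highmem_is_dirtyable=1
# To persist it across reboots, put "vm.highmem_is_dirtyable = 1" in a
# file under /etc/sysctl.d/ (any filename of your choosing).
```

With it set to 1, highmem counts toward the dirtyable memory used by the writeback throttling calculations, rather than lowmem alone.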
Upstream wanted a little more information. Can you run the scratch build https://koji.fedoraproject.org/koji/taskinfo?taskID=25509848, which has one additional debugging patch applied, and then _WITHOUT_ the highmem_is_dirtyable setting let it hang and then run:

---------- command line start ----------
# echo m > /proc/sysrq-trigger
# echo t > /proc/sysrq-trigger
# sleep 10
# echo m > /proc/sysrq-trigger
# echo t > /proc/sysrq-trigger
# sleep 10
# echo m > /proc/sysrq-trigger
# echo t > /proc/sysrq-trigger
---------- command line end ----------

Basically we want to collect the memory info (sysrq-m) and the task state (sysrq-t) a couple of times to see if there is any change.

Just as an update here: I booted one of our arm buildvms on this kernel last week and have been waiting for it to hang. So far it hasn't. Hopefully it will soon.

Created attachment 1412138 [details]
test-output

Here's the output of the various sysrq commands, with dmesg at the end.
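The capture sequence above can be wrapped in a small script run on the hung guest. This is a sketch under assumptions: it must run as root with sysrq enabled, and the output path /tmp/sysrq-capture.log is an arbitrary choice, not anything requested in the bug:

```shell
#!/bin/sh
# Sketch: dump memory info (sysrq-m) and task states (sysrq-t) into the
# kernel ring buffer three times, 10 seconds apart, then save the log.
# Requires root and a writable /proc/sysrq-trigger.
capture() {
    echo m > /proc/sysrq-trigger
    echo t > /proc/sysrq-trigger
}
if [ -w /proc/sysrq-trigger ]; then
    capture; sleep 10
    capture; sleep 10
    capture
    dmesg > /tmp/sysrq-capture.log   # arbitrary output path
    echo "capture saved to /tmp/sysrq-capture.log"
else
    echo "/proc/sysrq-trigger not writable; run as root on the guest"
fi
```

Note that sysrq-t output can be large; make sure the kernel log buffer is big enough (or capture via a serial console) so the earlier dumps aren't rotated out before dmesg is saved.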
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through, and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28 and are still experiencing this issue, please change the version to Fedora 28. If you experience different issues, please open a new bug report for those.

*********** MASS BUG UPDATE **************

This bug is being closed with INSUFFICIENT_DATA, as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running, along with any data that may have been requested previously.