Bug 1955935
Summary: | Fedora 34 - system slow when copying large file from HDD to SSD (root partition) [BTRFS with compression] | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Shyam Jos <shyamjosepp> | ||||||||||||||||
Component: | kernel | Assignee: | fedora-kernel-btrfs | ||||||||||||||||
Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||||||||
Severity: | high | Docs Contact: | |||||||||||||||||
Priority: | unspecified | ||||||||||||||||||
Version: | 34 | CC: | acaringi, adscvr, airlied, alciregi, bberg, bskeggs, bugzilla, davide, esandeen, hdegoede, igor.raits, jarodwilson, jeremy, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ngompa13, ptalbert, shyamjosepp, steved | ||||||||||||||||
Target Milestone: | --- | ||||||||||||||||||
Target Release: | --- | ||||||||||||||||||
Hardware: | x86_64 | ||||||||||||||||||
OS: | Linux | ||||||||||||||||||
Whiteboard: | |||||||||||||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||
Last Closed: | 2021-05-05 17:16:03 UTC | Type: | Bug | ||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
Embargoed: | |||||||||||||||||||
Attachments: |
|
Description
Shyam Jos
2021-05-01 14:38:51 UTC
This is actually an issue in the kernel, re-assigning. Can you capture sysrq+t while the hang is happening? It might help to have sysrq+t typed out before reproducing the problem, so that the typing isn't subject to the hang. Then 'journalctl -b -o short-monotonic --no-hostname > bug1955935-journal.log' and attach that log file to the bug. Thanks. Created attachment 1778823 [details]
journalctl -b -o short-monotonic --no-hostname
Hmm the sysrq+t doesn't show anything relevant. Can you sysrq+w when the box is misbehaving? You may have to ssh into the box from another machine and sysrq+w if it's too locked up. Also try running top and see if there's anything suspicious eating a lot of cpu. If you can get the PID of the copy process you can also cat /proc/<pid>/stack a few times to get a snapshot of what it's doing, as well as any pid that seems to be eating a lot of CPU. Created attachment 1779066 [details] bug1955935-journal1.log Created attachment 1779067 [details] bug1955935shya-journal2.log Created attachment 1779068 [details]
nautilus-pid-stack.log
Output of nautilus /proc/<pid>/stack
Cpu utilization was normal nothing suspicious(nautilus utilization was < 10 %), Any application that requires disk/IO becomes painfully slow at that time. Also, after installing kernel 5.11.17-300.fc34.x86_64 experienced the same issue. The sysrq+t journal has a bunch of missing lines: [ 937.960415] systemd-journald[723]: Missed 22 kernel messages I though journald only rate limited services, not kernel messages. That's kinda annoying. Ok so new idea, boot with parameter log_buf_len=8M and now the sysrq+t can be retrieved with just dmesg > dmesg.log or whatever. It might be a red herring, because what I see with sysrq+w is actually i915 related, so I wonder if this is a GPU hang and not IO related. When the hang happens, in addition to sysrq+t, could you also do grep -R . /proc/pressure And at any time: fpaste --sysinfo --printonly > sysinfo.txt fpaste --btrfsinfo --printonly > btrfsinfo.txt Attach those files. Thanks! Created attachment 1779211 [details]
dmesg log
grep -R . /proc/pressure && dmesg > dmesg.log1255 /proc/pressure/io:some avg10=61.36 avg60=45.06 avg300=37.58 total=505042612 /proc/pressure/io:full avg10=53.99 avg60=40.04 avg300=32.98 total=390619115 /proc/pressure/cpu:some avg10=2.45 avg60=2.84 avg300=3.42 total=176677631 /proc/pressure/memory:some avg10=4.68 avg60=2.33 avg300=1.74 total=34107161 /proc/pressure/memory:full avg10=3.48 avg60=1.52 avg300=0.99 total=20667551 Created attachment 1779212 [details]
fpaste --sysinfo --printonly > sysinfo.txt
Created attachment 1779213 [details]
fpaste --btrfsinfo --printonly > btrfsinfo.txt
Man there's just no indication that anything is wrong, nothing is doing IO or stuck anywhere. Can you try two things for me? First do echo "mq-deadline" > /sys/block/sda/queue/scheduler and see if that helps. If it doesn't install bcc-tools and run /usr/share/bcc/tools/biolatency while the copy is running. Then ctrl+c once you are done, it'll print a nice graph of the BIO latencies. So, the bug description says "freezing", but comment #8 says "Any application that requires disk/IO becomes painfully slow at that time." This makes a big difference to us. If the mouse cursor is still updating, then we would not consider that as a freeze of the system. We "just" happen to not be scheduling IO in an effective manner and other applications are impacted (which might be caused by buffer bloat in the disk caches even). Shyam, could you clarify what behaviour you are observing? @josef System is much responsive when mq-deadline scheduler is used @Benjamin Sorry for the misleading title, after changing the scheduler to mq-deadline, system seems much responsive when doing file copy, So I am closing this ticket for now Thank you all for the support I drop in a file: /etc/udev/rules.d/60-block-scheduler.rules containing: ACTION=="add", SUBSYSTEM=="block", \ KERNEL=="mmcblk*[0-9]|sd*[!0-9]", \ ENV{DEVTYPE}=="disk", \ ATTR{queue/scheduler}="mq-deadline" |