Bug 1173806 - Fedora21 freezes when use smt-enabled=off as kernel argument
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 21
Hardware: ppc64
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-12-13 05:20 UTC by IBM Bug Proxy
Modified: 2015-01-12 14:49 UTC (History)

Fixed In Version: kernel-3.17.7-300.fc21
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-21 06:36:07 UTC
Type: ---
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 119707 0 None None None Never

Description IBM Bug Proxy 2014-12-13 05:20:15 UTC

Comment 1 IBM Bug Proxy 2014-12-13 05:20:17 UTC
== Comment: #0 - Greg Kurz <KURZGREG.com> - 2014-12-12 10:46:08 ==
+++ This bug was initially created as a clone of LTC 119051 +++

With smt-enabled=off added as a kernel argument, the system boots as far as the "Freeing initrd memory" line and then freezes:
...
[    1.371729] vgaarb: loaded
[    1.372989] SCSI subsystem initialized
[    1.373977] libata version 3.00 loaded.
[    1.374158] usbcore: registered new interface driver usbfs
[    1.374246] usbcore: registered new interface driver hub
[    1.374382] usbcore: registered new device driver usb
[    1.374505] pps_core: LinuxPPS API ver. 1 registered
[    1.374563] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti>
[    1.374671] PTP clock support registered
[    1.377135] NetLabel: Initializing
[    1.377218] NetLabel:  domain hash size = 128
[    1.377328] NetLabel:  protocols = UNLABELED CIPSOv4
[    1.377472] NetLabel:  unlabeled traffic allowed by default
[    1.377983] Switched to clocksource timebase
[    1.395029] AppArmor: AppArmor Filesystem Enabled
[    1.402044] NET: Registered protocol family 2
[    1.403795] TCP established hash table entries: 524288 (order: 6, 4194304 bytes)
[    1.408343] TCP bind hash table entries: 65536 (order: 4, 1048576 bytes)
[    1.409301] TCP: Hash tables configured (established 524288 bind 65536)
[    1.409490] TCP: reno registered
[    1.409645] UDP hash table entries: 65536 (order: 5, 2097152 bytes)
[    1.411943] UDP-Lite hash table entries: 65536 (order: 5, 2097152 bytes)
[    1.415409] NET: Registered protocol family 1
[    1.415753] PCI: CLS 128 bytes, default 128
[    1.415962] Trying to unpack rootfs image as initramfs...
[    2.250464] Freeing initrd memory: 21952K (c000000003820000 - c000000004d90000)


Machine Type = Power 8 (S822L)

++++

Happens with fedora21 (and other distros) on both ppc64 and ppc64le.

I've sent a patch:

powerpc/powernv: force all CPUs to be bootable

http://patchwork.ozlabs.org/patch/420440/

Comment 2 IBM Bug Proxy 2014-12-13 05:30:15 UTC
------- Comment From hellerda.com 2014-12-13 05:26 EDT-------
== Comment: #8 - Greg Kurz <KURZGREG.com> - 2014-12-01 04:56:07 ==
The hang occurs because all running threads are looping in the split core code:

static void wait_for_sync_step(int step)
{
	int i, cpu = smp_processor_id();

	for (i = cpu + 1; i < cpu + threads_per_core; i++)
		while(per_cpu(split_state, i).step < step)
			barrier();

The problem is that the split core code needs all possible threads to participate... if the kernel is booted with smt-enabled set to something different than the maximum value, some threads are missing and this ruins the sync.
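To make the failure mode concrete, here is a minimal user-space model of the wait loop quoted above. It is not kernel code: `THREADS_PER_CORE`, `split_state_step`, `publish_step`, `core_sync_completes` and the `MAX_SPINS` bound are all assumptions introduced for illustration, and the bound exists only so the model terminates where the real loop would spin forever.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 8   /* assumption: SMT8, as on POWER8 */
#define MAX_SPINS 1000       /* assumption: bound so the model terminates */

/* Model of the per-thread "step" counters for a single core. */
static int split_state_step[THREADS_PER_CORE];

/* Threads set in present_mask publish `step`; absent threads never do. */
static void publish_step(unsigned present_mask, int step)
{
    for (int i = 0; i < THREADS_PER_CORE; i++) {
        if (present_mask & (1u << i))
            split_state_step[i] = step;
    }
}

/*
 * Bounded version of the wait loop, as run by thread 0 of the core.
 * Returns true if every sibling reached `step`, false if we gave up.
 */
static bool wait_for_sync_step_bounded(int step)
{
    for (int i = 1; i < THREADS_PER_CORE; i++) {
        int spins = 0;
        while (split_state_step[i] < step) {
            if (++spins >= MAX_SPINS)
                return false;   /* the real loop would hang here */
        }
    }
    return true;
}

/* Simulate one sync step for a core whose started threads are in present_mask. */
bool core_sync_completes(unsigned present_mask, int step)
{
    assert(step > 0);
    for (int i = 0; i < THREADS_PER_CORE; i++)
        split_state_step[i] = 0;
    publish_step(present_mask, step);
    return wait_for_sync_step_bounded(step);
}
```

In this model, `core_sync_completes(0xFFu, 1)` succeeds because all eight threads take part, while `core_sync_completes(0x01u, 1)` (only thread 0 started, roughly what smt-enabled=off leaves behind) gives up waiting on thread 1.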

== Comment: #9 - Greg Kurz <KURZGREG.com> - 2014-12-01 05:24:28 ==
The current implementation of smt-enabled= is a hack: it simply leaves hw threads looping where they happen to be (firmware probably)... This isn't acceptable in a production environment.

An "acceptable" fix would be to start all threads anyway and then, AFTER subcores init, offline the ones that must be offline to honour the requested SMT mode. This requires a non-trivial patch.

Since changing SMT mode from userspace when the system is booted is really straightforward, Michael Ellerman suggests we simply drop that smt-enabled= feature.

Smorigo,

Why were you using smt-enabled=? Is there a reason not to do it after the system is booted, with
ppc64_cpu --smt or by writing directly to /sys/devices/system/cpu/cpu*/online?
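As a rough sketch of the userspace route mentioned above, the following C helpers decide which logical CPUs stay online for a given SMT mode and write the result to the per-CPU `online` files. The function names, the `cpu_root` parameter and the assumption that the threads of a core are numbered contiguously are illustrative only; this is not how ppc64_cpu is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * True if logical CPU `cpu` should stay online when each core runs `smt`
 * threads. Assumes the threads of a core are numbered contiguously.
 */
bool cpu_stays_online(int cpu, int threads_per_core, int smt)
{
    assert(threads_per_core > 0 && smt >= 1 && smt <= threads_per_core);
    return (cpu % threads_per_core) < smt;
}

/* Write "1" or "0" to <cpu_root>/cpu<N>/online. Returns 0 on success, -1 on failure. */
int set_cpu_online(const char *cpu_root, int cpu, bool online)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "%s/cpu%d/online", cpu_root, cpu);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int rc = (fputs(online ? "1" : "0", f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)
        rc = -1;
    return rc;
}

/* Apply an SMT mode to CPUs [0, ncpus). Returns the number of failed writes. */
int apply_smt_mode(const char *cpu_root, int ncpus, int threads_per_core, int smt)
{
    int failures = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        bool online = cpu_stays_online(cpu, threads_per_core, smt);
        if (set_cpu_online(cpu_root, cpu, online) != 0)
            failures++;
    }
    return failures;
}
```

On a real system `cpu_root` would be /sys/devices/system/cpu and the writes need root privileges; a call such as `apply_smt_mode("/sys/devices/system/cpu", 16, 8, 1)` would then aim to leave one thread per core online on a hypothetical two-core SMT8 machine.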

== Comment: #10 - Paulo Flabiano Smorigo <pfsmorigo.com> - 2014-12-01 06:23:34 ==
I used smt-enabled= because for me it was the easiest way to disable it: just add the parameter to GRUB_CMDLINE_LINUX and done. :)

I'll check whether dropping it would cause a problem.

== Comment: #11 - Paulo Flabiano Smorigo <pfsmorigo.com> - 2014-12-01 08:30:55 ==
Greg, are you saying we should drop it for good? Maybe we can add that as a feature request for next year. Btw, I'm ok with dropping it for now.

== Comment: #12 - Greg Kurz <KURZGREG.com> - 2014-12-01 09:30:00 ==
(In reply to comment #11)
> Greg, are you saying we should drop it for good? Maybe we can add that as a
> feature request for next year. Btw, I'm ok with dropping it for now.

Yes, drop it for good as suggested by Michael Ellerman...

<mpe> groug: that smt-enabled stuff is just a hack. It leaves the cpu executing wherever it happens to be, possibly in firmware, possibly busy looping somewhere, it's really no good for use in production
<mpe> the only way we could make it usable I think is to have the cpu come up, and then we offline it
<mpe> but I'm really inclined to say that should just be done in userspace
<groug> mpe, yeah... I had thought of something similar (starting and then offlining) but I agree it should be handled from userspace
<mpe> I'll talk to benh and anton about it tomorrow, but I think we just rip it out

The point is that it is already extremely easy to change SMT mode from an init script and you get the same result... compared to the hassle of doing it in the kernel without breaking things. Not even worth a feature request I would say.

== Comment: #13 - Greg Kurz <KURZGREG.com> - 2014-12-12 08:50:25 ==

------- Comment From hellerda.com 2014-12-13 05:29 EDT-------
Hi RH,

The request is to pick up the above patch.  Thx.

Comment 3 Josh Boyer 2014-12-15 19:41:58 UTC
Patch added in f20 - rawhide.  It will be in the next build of each.

Comment 4 Fedora Update System 2014-12-17 19:01:53 UTC
kernel-3.17.7-300.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-3.17.7-300.fc21

Comment 5 Fedora Update System 2014-12-17 19:03:48 UTC
kernel-3.17.7-200.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.17.7-200.fc20

Comment 6 Fedora Update System 2014-12-19 18:31:15 UTC
Package kernel-3.17.7-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.17.7-200.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-17283/kernel-3.17.7-200.fc20
then log in and leave karma (feedback).

Comment 7 Fedora Update System 2014-12-21 06:36:07 UTC
kernel-3.17.7-200.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 8 Fedora Update System 2014-12-22 02:32:24 UTC
kernel-3.17.7-300.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 IBM Bug Proxy 2015-01-12 09:50:14 UTC
------- Comment From KURZGREG.com 2015-01-12 09:41 EDT-------
(In reply to comment #11)
> kernel-3.17.7-200.fc20 has been pushed to the Fedora 20 stable repository.
> If problems still persist, please make note of it in this bug report.

(In reply to comment #12)
> kernel-3.17.7-300.fc21 has been pushed to the Fedora 21 stable repository.
> If problems still persist, please make note of it in this bug report.

I don't see these packages in the ppc64 repositories. Is there something else that needs to be done?

Comment 10 Josh Boyer 2015-01-12 14:49:29 UTC
The secondary arch team needs to get them built and updated in the ppc64 repos.

