1263039 – SLOF doesn't allow enough room for CAS response with large maxmem

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	SLOF
Sub Component:
Version:	7.2
Hardware:	ppc64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	7.2
Assignee:	David Gibson
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	RHEV3.6PPC 1261812 1262143 1263563 1277183 1277184

Reported:	2015-09-15 02:09 UTC by David Gibson
Modified:	2016-02-21 11:15 UTC
CC List:	14 users
Fixed In Version:	SLOF-20150313-4.gitc89b0df.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-11-19 09:21:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:2286	0	normal	SHIPPED_LIVE	SLOF bug fix and enhancement update	2015-11-19 09:32:07 UTC

Description David Gibson 2015-09-15 02:09:19 UTC

Description of problem:

When the guest advertises what features it supports with the client-architecture-support OF call, SLOF passes that information on to qemu with a hypercall, and receives a buffer with updated device tree information.

Currently downstream SLOF only allows 8k for this buffer which is not enough to accommodate all the necessary information if the system has enough devices.  In particular it will be insufficient if a large maximum amount of hotpluggable memory is advertised.
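
A rough size estimate shows why 8k is too small. This is a minimal sketch, not taken from the SLOF or qemu sources: it assumes qemu describes hotpluggable memory as 256 MiB logical memory blocks (LMBs), that each LMB takes one 6-cell (24-byte) entry in the dynamic memory property, and that only the hotpluggable range (maxmem minus initial RAM) is listed. The function name drconf_bytes is hypothetical, and the real CAS response carries other device tree updates on top of this.

```python
# Assumed layout (not confirmed by this bug report):
LMB_SIZE = 256 * 1024**2   # one logical memory block, 256 MiB
ENTRY_BYTES = 6 * 4        # six 32-bit cells per LMB entry
COUNT_BYTES = 4            # leading 32-bit entry count

OLD_CAS_BUFFER = 8 * 1024  # the 8k limit described above


def drconf_bytes(ram_gib, maxmem_gib):
    """Estimate the size in bytes of the LMB list for the hotpluggable range."""
    hotplug_bytes = (maxmem_gib - ram_gib) * 1024**3
    nr_lmbs = hotplug_bytes // LMB_SIZE
    return COUNT_BYTES + nr_lmbs * ENTRY_BYTES


if __name__ == "__main__":
    for ram, maxmem in [(4, 512), (2, 1024), (2, 2048)]:
        size = drconf_bytes(ram, maxmem)
        verdict = "fits in" if size <= OLD_CAS_BUFFER else "exceeds"
        print(f"-m {ram}G,maxmem={maxmem}G: ~{size} bytes, {verdict} 8k")
```

Under these assumptions the configuration in the steps to reproduce (-m 4G, maxmem=512G) needs roughly 48k for this one property alone, several times the 8k buffer.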

Version-Release number of selected component (if applicable):

SLOF-20150313-3.gitc89b0df.el7.noarch

How reproducible:

100%

Steps to Reproduce:
1. Start a guest with:

/usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 4G,slots=4,maxmem=512G -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off -nodefaults -rtc base=utc -device spapr-vscsi,id=scsi0,reg=0x1000 -drive file=/var/lib/libvirt/images/dwg-rhel72-20150902-le.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,bootindex=1,id=scsi0-0-0-0  -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1-0 -vnc :10 -msg timestamp=on -usb -device usb-tablet,id=tablet1  -vga std -qmp tcp:0:4666,server,nowait -netdev user,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:54:5a:5f:5b:5c -chardev stdio,id=conmon,mux=on -device spapr-vty,chardev=conmon -mon conmon

[The disk image must be a sufficiently recent guest which supports hotplug memory via the PAPR DR mechanism]

Actual results:

SLOF executes, but when it starts to run the guest, qemu quits with:

Trying to load:  from: /vdevice/v-scsi@1000/disk@8000000000000000 ...   Successfully loaded
qemu: error creating device tree: (spapr_populate_drconf_memory(spapr, fdt)): FDT_ERR_NOSPACE

Expected results:

Guest boots successfully.

Additional info:

When implementing hotplug memory, Bharata Rao patched SLOF to increase the size of the CAS buffer, but this change didn't make it downstream.

Comment 1 David Gibson 2015-09-15 02:16:28 UTC

Karen, Miya, acks please.

Comment 2 David Gibson 2015-09-15 02:19:06 UTC

Draft build with fix at https://brewweb.devel.redhat.com/taskinfo?taskID=9835585

Comment 3 Qunfang Zhang 2015-09-15 06:52:59 UTC

Hi, David

I just tested your build from comment 2. Results:

1) Boot the guest with "-m 2G,slots=2,maxmem=512G -smp 2,sockets=1,cores=2,threads=1":

The guest boots in about 17s. (With the buggy official build SLOF-20150313-3.gitc89b0df.el7.noarch, the guest fails to boot and prints the error from comment 0.)

2) Boot the guest with "-m 2G,slots=2,maxmem=1024G -smp 2,sockets=1,cores=2,threads=1":

The guest boots, but takes about 30 minutes.

3) Boot the guest with "-m 2G,slots=2,maxmem=2048G -smp 2,sockets=1,cores=2,threads=1":

The issue still reproduces. (It takes about 27 minutes to reproduce. The qemu process consumes 100% CPU at first, and after about 27 minutes the guest fails to boot and prints the error below.)

(qemu) 
(qemu) qemu: error creating device tree: (spapr_populate_drconf_memory(spapr, fdt)): FDT_ERR_NOSPACE

Could you help confirm?

Thanks,
Qunfang

Comment 4 Qunfang Zhang 2015-09-15 07:02:30 UTC

Please ignore the slow guest boot issue. After applying the qemu scratch build from bug 1262143 comment 5, the guest starts booting quickly.

However, with 2048G of maxmem the guest still fails to boot and prints:

(qemu) 
(qemu) qemu: error creating device tree: (spapr_populate_drconf_memory(spapr, fdt)): FDT_ERR_NOSPACE

Comment 5 Karen Noel 2015-09-15 13:47:57 UTC

I agree this is a blocker for ppc64le. This issue blocks support for large ppc64le guests, either lots of memory or many devices.

The fix is low risk and fixes the issue.

Comment 6 David Gibson 2015-09-16 03:18:24 UTC

Qunfang,

Thanks for the test.  It's still crashing with maxmem=2T because the new buffer was sized for only 1T of maxmem.

RHEV have already decided for other reasons that we should limit maxmem to 1T for the RHEL 7.2 release so I think the remaining problem can be deferred.

Once the patch is merged, please just verify up to 1T maxmem.
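
To illustrate why a buffer sized for 1T of maxmem still fails at 2T, here is a rough sketch. It assumes 256 MiB memory blocks with one 24-byte entry each, covering only the hotpluggable range; the 128 KiB buffer size is a hypothetical value chosen for illustration, not the size used in the actual SLOF patch, and the names drconf_bytes and fits are made up for this example.

```python
# Assumed layout (not confirmed by this bug report):
LMB_SIZE = 256 * 1024**2   # one logical memory block, 256 MiB
ENTRY_BYTES = 6 * 4        # six 32-bit cells per LMB entry
COUNT_BYTES = 4            # leading 32-bit entry count

# Hypothetical buffer size for illustration only; NOT the real patched value.
HYPOTHETICAL_BUFFER = 128 * 1024


def drconf_bytes(ram_gib, maxmem_gib):
    """Estimate the size in bytes of the LMB list for the hotpluggable range."""
    nr_lmbs = (maxmem_gib - ram_gib) * 1024**3 // LMB_SIZE
    return COUNT_BYTES + nr_lmbs * ENTRY_BYTES


def fits(buffer_bytes, ram_gib, maxmem_gib):
    """True if the estimated LMB list alone fits in the given buffer."""
    return drconf_bytes(ram_gib, maxmem_gib) <= buffer_bytes


if __name__ == "__main__":
    for maxmem in (512, 1024, 2048):
        ok = fits(HYPOTHETICAL_BUFFER, 2, maxmem)
        print(f"maxmem={maxmem}G: {'fits' if ok else 'does not fit'}")
```

The estimate grows linearly with maxmem, so doubling maxmem from 1T to 2T roughly doubles the space needed, and any buffer sized with modest headroom for 1T overflows at 2T.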

Comment 7 Qunfang Zhang 2015-09-16 05:02:41 UTC

David,

Got it, thanks!

Comment 8 Miroslav Rezanina 2015-09-16 14:49:29 UTC

Fix included in SLOF-20150313-4.gitc89b0df.el7

Comment 10 Qunfang Zhang 2015-09-18 07:06:08 UTC

Reproduced the bug on SLOF-20150313-3.gitc89b0df.el7.noarch.

# /usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 4G,slots=4,maxmem=512G -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off -nodefaults -monitor stdio -rtc base=utc -device spapr-vscsi,id=scsi0,reg=0x1000 -drive file=RHEL-7.2-LE-new.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,bootindex=1,id=scsi0-0-0-0  -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1-0 -vnc :10 -msg timestamp=on -usb -device usb-tablet,id=tablet1  -vga std -qmp tcp:0:4666,server,nowait -netdev tap,id=hostnet1,script=/etc/qemu-ifup,vhost=on -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:54:5a:5f:5b:5c
QEMU 2.3.0 monitor - type 'help' for more information
(qemu) 
(qemu) 
(qemu) 
(qemu) qemu: error creating device tree: (spapr_populate_drconf_memory(spapr, fdt)): FDT_ERR_NOSPACE
/etc/qemu-ifdown: could not launch network script


Verified as fixed with SLOF-20150313-4.gitc89b0df.el7.noarch.rpm.

Booted the guest with "maxmem=512G" and with "maxmem=1024G". In both cases the guest boots successfully, and rebooting and shutting down the guest also work.

So this bug is fixed.

Comment 11 David Gibson 2015-09-18 07:28:10 UTC

Changing hardware back to ppc64.  SLOF is technically big-endian (ppc64) even if guest and host are little endian.

Comment 12 Qunfang Zhang 2015-09-18 07:37:07 UTC

(In reply to David Gibson from comment #11)
> Changing hardware back to ppc64.  SLOF is technically big-endian (ppc64)
> even if guest and host are little endian.

Oops, good to know, thanks for the correction.

Comment 16 Qunfang Zhang 2015-09-21 06:26:04 UTC

Setting to VERIFIED according to comment 10.

Comment 18 errata-xmlrpc 2015-11-19 09:21:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2286.html
