Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2127525

Summary: crash fails to load with --zero_excluded on certain ppc64le vmcore after commit cdd57e8 to read emergency stack
Product: Red Hat Enterprise Linux 9 Reporter: Dave Wysochanski <dwysocha>
Component: crashAssignee: lijiang
Status: CLOSED ERRATA QA Contact: xiaoying yan <yiyan>
Severity: low Docs Contact:
Priority: unspecified    
Version: 9.2CC: hbathini, ruyang, xiawu, yiyan
Target Milestone: rcKeywords: Tracking, Triaged
Target Release: ---Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: crash-8.0.2-1.el9 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-05-09 07:40:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dave Wysochanski 2022-09-16 17:18:44 UTC
Description of problem:
We had a ppc64le vmcore come in that looked to be partially damaged and it failed to load with --zero_excluded option and threw the following error:
crash: invalid kernel virtual address: 38  type: "paca->emergency_sp"

The same build loads the vmcore in minimal mode with --minimal option, but obviously this is not as good as regular mode.

An earlier build of crash 8.0.1 succeeded in loading the vmcore with --zero_excluded option.  Some upstream change (seems to be very recent, after crash 8.0.1) caused crash to fail to load on this vmcore.


Version-Release number of selected component (if applicable):
Upstream as of 6722ea102264

How reproducible:
Everytime with a certain customer vmcore.


Steps to Reproduce:
1. try to load crash with the specific vmcore (el7)
2.
3.

Actual results:
crash does not load with --zero_excluded

Expected results:
crash loads with --zero_excluded and does not need --minimal


Additional info:
I file this against RHEL9 as a placeholder, but it really is upstream.  We can close CURRENTRELEASE or whatever as needed.

I believe this happened somewhere in these changes, based on builds that succeed and fail in our production system):
6722ea102264 arm64: Fix for st->_stext_vmlinux not initialized when set VA_BITS_ACTUAL
93b880217de2 ppc64: use a variable for machdep->machspec
4dc2f1c32d1c ppc64: print emergency stacks info with 'mach' command
cdd57e8b16ab ppc64: handle backtrace when CPU is in an emergency stack
4d1b968abb28 ppc64: rename ppc64_paca_init to ppc64_paca_percpu_offset_init
3ee5956721d9 ppc64: dynamically allocate h/w interrupt stack
c67ce5bbb8e3 ppc64: fix bt for '-S' case
d8869b085483 Extend field length of task attributes
85f39061390f Fix for "dev" command on Linux 5.11 and later
b8f2ae6b494d sbitmapq: Limit kernels without sbitmap again
6bc3b74c6e2b sbitmapq: Fix for kernels without struct wait_queue_head
c07068266b41 Make "dev -d|-D" options parse sbitmap on Linux 4.18 and later
12fe6c7cdd76 sbitmapq: Fix for sbitmap_queue without min_shallow_depth member
0d3e86fee5ee sbitmapq: Fix for sbitmap_word without cleared member
9ce31a14d108 sbitmapq: Fix for sbitmap_queue without ws_active member
c672d7a4c290 Doc: update man page for the "bpf" and "sbitmapq" commands
68ce0b9a35d7 Fix for "dev -d|-D" options to support blk-mq change on Linux v5.18-rc1
7095c8fd029e Enhance "dev -d|-D" options to support blk-mq sbitmap
dda5b2d02b8d gdb: print details of unnamed struct and union
0f162febebc4 bt: arm64: add support for 'bt -n idle'
6833262bf871 bt: x86_64: filter out idle task stack
9705669a49c3 Makefile: add missing crash_target.o to be cleaned
3750803f6ae5 sbitmapq: fix invalid offset for "sbitmap_word_depth" on Linux v5.18-rc1
530fe6ad7e4d sbitmapq: fix invalid offset for "sbitmap_queue_round_robin" on Linux v5.13-rc1
a295cb40cd5d sbitmapq: fix invalid offset for "sbitmap_queue_alloc_hint" on Linux v5.13-rc1
364b2e413c69 sbitmapq: remove struct and member validation in sbitmapq_init()
ae52398a13fa ppc64: update the NR_CPUS to 8192
0ca55e460757 Mark start of 8.0.2 development phase with version 8.0.1++


I will attempt a bisect - assuming I'm successful, I'll report the problem to the upstream list if I cannot determine an easy fix.

I can give a private comment pointing to the location of the vmcore on our internal server.

Comment 2 Dave Wysochanski 2022-09-18 01:09:33 UTC
I had a lot of difficulty getting 'git bisect run' to work but finally found the problem.  Here is the bisect results.

cdd57e8b16aba2f5714673368d6dbc7565d59841 is the first bad commit
commit cdd57e8b16aba2f5714673368d6dbc7565d59841
Author: Hari Bathini <hbathini.com>
Date:   Mon Jul 4 10:55:44 2022 +0530

    ppc64: handle backtrace when CPU is in an emergency stack
    
    A CPU could be in an emergency stack when it is running in real mode
    or any special scenario like TM bad thing. Also, there are dedicated
    emergency stacks for machine check and system reset interrupt. Right
    now, no backtrace is provided if a CPU is in any of these stacks.
    This change ensures backtrace is processed appropriately even when
    a CPU is in any one of these emergency stacks. Also, if stack info
    cannot be found, print that message always instead of only when
    verbose logs are enabled.
    
    Related kernel commits:
    729b0f715371 ("powerpc/book3s: Introduce exclusive emergency stack for machine check exception.")
    b1ee8a3de579 ("powerpc/64s: Dedicated system reset interrupt stack")
    
    Signed-off-by: Hari Bathini <hbathini.com>

 defs.h  |  12 ++++
 ppc64.c | 203 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 203 insertions(+), 12 deletions(-)
bisect run success
$ git bisect log
# bad: [bdbf5887d6259ea3108d4fa674f3794adad54d52] Fix gcc-11 compiler warnings on gdb-10.2/gdb/symtab.c
# good: [2d193468e5fe7ee1c6be4c73083cc5ef8d922b74] crash-8.0.0 -> crash-8.0.1
git bisect start 'bdbf588' '2d19346'
# good: [d8869b08548362345fc34e4cf17a1eac9bddec6b] Extend field length of task attributes
git bisect good d8869b08548362345fc34e4cf17a1eac9bddec6b
# bad: [7591e3c07cef4900f6b0ca797270cb7527fb4e29] Fix gcc-11 compiler warning on makedumpfile.c
git bisect bad 7591e3c07cef4900f6b0ca797270cb7527fb4e29
# bad: [4dc2f1c32d1c99586e67032c9cd62c5c4334049c] ppc64: print emergency stacks info with 'mach' command
git bisect bad 4dc2f1c32d1c99586e67032c9cd62c5c4334049c
# good: [3ee5956721d9a67fe8d4c6d5022aa022c5f9a11c] ppc64: dynamically allocate h/w interrupt stack
git bisect good 3ee5956721d9a67fe8d4c6d5022aa022c5f9a11c
# bad: [cdd57e8b16aba2f5714673368d6dbc7565d59841] ppc64: handle backtrace when CPU is in an emergency stack
git bisect bad cdd57e8b16aba2f5714673368d6dbc7565d59841
# good: [4d1b968abb286ea39ea080ae073b0e2b5bfe6c4e] ppc64: rename ppc64_paca_init to ppc64_paca_percpu_offset_init
git bisect good 4d1b968abb286ea39ea080ae073b0e2b5bfe6c4e
# first bad commit: [cdd57e8b16aba2f5714673368d6dbc7565d59841] ppc64: handle backtrace when CPU is in an emergency stack
$

Comment 4 Dave Wysochanski 2022-09-19 09:16:03 UTC
Thread 1 "crash" hit Breakpoint 3, readmem (addr=56, memtype=1, buffer=0x13763178, size=8, type=0x109a1ed0 "paca->emergency_sp", error_handle=1) at memory.c:2300
2300	                                error(INFO, INVALID_KVADDR, addr, type);
(gdb) bt
#0  readmem (addr=56, memtype=1, buffer=0x13763178, size=8, type=0x109a1ed0 "paca->emergency_sp", error_handle=1) at memory.c:2300
#1  0x000000001020b768 in ppc64_init_paca_info () at ppc64.c:1248
#2  0x000000001020835c in ppc64_init (when=3) at ppc64.c:700
#3  0x00000000100f70d0 in main_loop () at main.c:782
#4  0x00000000105325c4 in captured_main (data=data@entry=0x7fffffffea80) at main.c:1284
#5  gdb_main (args=args@entry=0x7fffffffeac0) at main.c:1313
#6  0x00000000105326e8 in gdb_main_entry (argc=<optimized out>, argv=<optimized out>) at main.c:1338
#7  0x00000000101e710c in gdb_main_loop (argc=2, argv=0x7ffffffff028) at gdb_interface.c:81
#8  0x00000000100f6ed4 in main (argc=3, argv=0x7ffffffff028) at main.c:720
(gdb) frame 1
#1  0x000000001020b768 in ppc64_init_paca_info () at ppc64.c:1248
1248				readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
(gdb) 
(gdb) frame 1
#1  0x000000001020b768 in ppc64_init_paca_info () at ppc64.c:1248
1248				readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
(gdb) list
1243			ulong offset = MEMBER_OFFSET("paca_struct", "emergency_sp");
1244	
1245			if (!(ms->emergency_sp = (ulong *)calloc(kt->cpus, sizeof(ulong))))
1246				error(FATAL, "cannot malloc emergency stack space.\n");
1247			for (i = 0; i < kt->cpus; i++)
1248				readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
1249					sizeof(void *), "paca->emergency_sp",
1250					FAULT_ON_ERROR);
1251		}
1252	
(gdb) p i
$1 = 1
(gdb) p ms->emergency_sp[0]
$2 = 0
(gdb) p ms->emergency_sp[1]
$3 = 0
(gdb) p offset
$4 = 56
(gdb) p/x 56
$6 = 0x38
(gdb) p kt->cpus
$7 = 188
(gdb) 
(gdb) p paca_ptr[i]
$8 = 0
(gdb) 
[1]+  Stopped                 gdb ./crash
$ git log --oneline | head
cdd57e8 ppc64: handle backtrace when CPU is in an emergency stack
4d1b968 ppc64: rename ppc64_paca_init to ppc64_paca_percpu_offset_init
3ee5956 ppc64: dynamically allocate h/w interrupt stack
c67ce5b ppc64: fix bt for '-S' case
d8869b0 Extend field length of task attributes
85f3906 Fix for "dev" command on Linux 5.11 and later
b8f2ae6 sbitmapq: Limit kernels without sbitmap again
6bc3b74 sbitmapq: Fix for kernels without struct wait_queue_head
c070682 Make "dev -d|-D" options parse sbitmap on Linux 4.18 and later
12fe6c7 sbitmapq: Fix for sbitmap_queue without min_shallow_depth member

Comment 6 lijiang 2022-09-28 08:10:22 UTC
Thank you for pointing out this issue, Dave Wysochanski.

(In reply to Dave Wysochanski from comment #4)
> Thread 1 "crash" hit Breakpoint 3, readmem (addr=56, memtype=1,
> buffer=0x13763178, size=8, type=0x109a1ed0 "paca->emergency_sp",
> error_handle=1) at memory.c:2300
> 2300	                                error(INFO, INVALID_KVADDR, addr, type);
> (gdb) bt
> #0  readmem (addr=56, memtype=1, buffer=0x13763178, size=8, type=0x109a1ed0
> "paca->emergency_sp", error_handle=1) at memory.c:2300
> #1  0x000000001020b768 in ppc64_init_paca_info () at ppc64.c:1248

That's true, it failed at readmem():

1246                         readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
1247                                 sizeof(void *), "paca->emergency_sp",
1248                                 FAULT_ON_ERROR);

But, this is expected. I did the debugging and got the address of paca_ptr[i] as below:

paca_ptr[0]=c0000000014c0180, offset=38
paca_ptr[1]=0, offset=38
crash: invalid kernel virtual address: 38  type: "paca->emergency_sp"

The address of paca_ptr[1] is zero, it is an invalid address, so the readmem() failed with the flag 'FAULT_ON_ERROR'. If the readmem() uses the flag 'RETURN_ON_ERROR|QUIET', it will work as before, but it may be inaccurate, does it make sense?

Comment 7 IBM Bug Proxy 2022-09-28 10:21:14 UTC
------- Comment From hbathini.com 2022-09-28 06:13 EDT-------
I am not sure what good is a vmcore that does not have paca info
of running CPUs but 'RETURN_ON_ERROR|QUIET' instead of 'FAULT_ON_ERROR'
should let it proceed. So, a warning at the first instance of missing
paca (NULL) and moving on is not such a base idea...

Comment 8 lijiang 2022-09-29 06:39:22 UTC
Thank you for the comment, Hari. Let's see if Dave Wysochanski has any questions or concerns.

Comment 9 Dave Wysochanski 2022-09-29 11:39:55 UTC
Hi all, thanks for looking at this bug.  I know such vmcores are rare but they do happen in real customer environments.  When that happens having to only use minimal mode will greatly curtail most engineers investigations (we don't even have 'bt' for example).  Having a larger set of crash commands available with --zero_excluded even if maybe crash fails later at some command is a large step forward, IMO.

I wonder about --zero_excluded and what patch would be appropriate.  While we can argue this is one file/function that maybe needs fixed up, I do have a couple questions:
1. Should we look at a limited change in only this area of the code that was added (the emergency stacks code in ppc64.c) to relax the last readmem() parameter, or can it be (should it be) a larger change inside readmem if "--zero_excluded" is set?
* Maybe some investigation of a larger change is warranted, though it may be of low value unless we have other vmcores that we know would be helped by a larger change to readmem if --zero_excluded is set
* Just a quick look on one of our production x86_64 systems finds around 25 vmcores marked with a crash command containing "--minimal".  I tried one of them with --zero_excluded and it loaded fine.  I may investigate patching our retrace-server system to add --zero_excluded to "crash_cmd" so we'll get better information in the future.

2. I'm not clear whether we should change the last parameter on READMEM only if --zero_excluded is set, or change it unconditionally.
* I lean towards unconditional since AFAIK this code existed a long time without the "ppc64: handle backtrace when CPU is in an emergency stack", so that must be a rare condition.

3. I'm not clear on which one we should really use of the two options: 'RETURN_ON_ERROR|QUIET'.
* Lean towards RETURN_ON_ERROR since user will then know it's a problem.  This has the downside it may spam the console but I don't think the # of messages is too large (I counted a few hundred lines).

4. Are we going to leak memory if we RETURN_ON_ERROR 
* Missing "free(paca_ptr)"?  But maybe this is not a big deal and we live with it.

Comment 10 lijiang 2022-09-30 08:44:47 UTC
Thank you for sharing your idea, Dave Wysochanski.

Maybe we need to make a trade-off. How about the following change? It will check the '--zero_excluded' flag. Because the original patch was submitted by Hari, I would like to know if Hari has any comments on this changes. Thanks

diff --git a/ppc64.c b/ppc64.c
index 4ea1f7c..71f1707 100644
--- a/ppc64.c
+++ b/ppc64.c
@@ -1213,9 +1213,12 @@ static void
 ppc64_init_paca_info(void)
 {
 	struct machine_specific *ms = machdep->machspec;
-	ulong *paca_ptr;
+	ulong *paca_ptr, ret = FAULT_ON_ERROR;
 	int i;
 
+	if (*diskdump_flags & ZERO_EXCLUDED)
+		ret = RETURN_ON_ERROR;
+
 	if (!(paca_ptr = (ulong *)calloc(kt->cpus, sizeof(ulong))))
 		error(FATAL, "cannot malloc paca pointers space.\n");
 
@@ -1224,13 +1227,13 @@ ppc64_init_paca_info(void)
 		ulong paca_loc;
 
 		readmem(symbol_value("paca_ptrs"), KVADDR, &paca_loc, sizeof(void *),
-			"paca double pointer", FAULT_ON_ERROR);
+			"paca double pointer", ret);
 		readmem(paca_loc, KVADDR, paca_ptr, sizeof(void *) * kt->cpus,
-			"paca pointers", FAULT_ON_ERROR);
+			"paca pointers", ret);
 	} else if (symbol_exists("paca") &&
 		   (get_symbol_type("paca", NULL, NULL) == TYPE_CODE_PTR)) {
 		readmem(symbol_value("paca"), KVADDR, paca_ptr, sizeof(void *) * kt->cpus,
-			"paca pointers", FAULT_ON_ERROR);
+			"paca pointers", ret);
 	} else {
 		free(paca_ptr);
 		return;
@@ -1245,7 +1248,7 @@ ppc64_init_paca_info(void)
 		for (i = 0; i < kt->cpus; i++)
 			readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
 				sizeof(void *), "paca->emergency_sp",
-				FAULT_ON_ERROR);
+				ret);
 	}
 
 	if (MEMBER_EXISTS("paca_struct", "nmi_emergency_sp")) {
@@ -1256,7 +1259,7 @@ ppc64_init_paca_info(void)
 		for (i = 0; i < kt->cpus; i++)
 			readmem(paca_ptr[i] + offset, KVADDR, &ms->nmi_emergency_sp[i],
 				sizeof(void *), "paca->nmi_emergency_sp",
-				FAULT_ON_ERROR);
+				ret);
 	}
 
 	if (MEMBER_EXISTS("paca_struct", "mc_emergency_sp")) {
@@ -1267,7 +1270,7 @@ ppc64_init_paca_info(void)
 		for (i = 0; i < kt->cpus; i++)
 			readmem(paca_ptr[i] + offset, KVADDR, &ms->mc_emergency_sp[i],
 				sizeof(void *), "paca->mc_emergency_sp",
-				FAULT_ON_ERROR);
+				ret);
 	}
 
 	free(paca_ptr);

Comment 11 Dave Wysochanski 2022-09-30 09:09:19 UTC
Lianbo's patch in comment #10 seems reasonable to me.  I'm not sure if we need to make it conditional on the "--zero_excluded" flag (my comment #2), but maybe Hari can provide guidance.

FWIW, I have a PR for retrace-server to try "--zero_excluded" in the future in the case we have only used "--minimal", so we should get more data in the future about usage of this option: https://github.com/abrt/retrace-server/pull/474

Comment 12 IBM Bug Proxy 2022-09-30 11:01:37 UTC
------- Comment From hbathini.com 2022-09-30 06:59 EDT-------
Doesn't sound like a bad idea to warn about missing page and trying to move on,
with or without '--zero-excluded'. So, I would prefer it with
'RETURN_ON_ERROR' always..

Comment 13 IBM Bug Proxy 2022-09-30 11:11:34 UTC
------- Comment From hbathini.com 2022-09-30 07:04 EDT-------
(In reply to comment #8)
> Thank you for sharing your idea, Dave Wysochanski.
>
> Maybe we need to make a trade-off. How about the following change? It will
> check the '--zero_excluded' flag. Because the original patch was submitted
> by Hari, I would like to know if Hari has any comments on this changes.
> Thanks
>
> diff --git a/ppc64.c b/ppc64.c
> index 4ea1f7c..71f1707 100644
> --- a/ppc64.c
> +++ b/ppc64.c
> @@ -1213,9 +1213,12 @@ static void
> ppc64_init_paca_info(void)
> {
> struct machine_specific *ms = machdep->machspec;
> -	ulong *paca_ptr;
> +	ulong *paca_ptr, ret = FAULT_ON_ERROR;
> int i;
>
> +	if (*diskdump_flags & ZERO_EXCLUDED)
> +		ret = RETURN_ON_ERROR;
> +
> if (!(paca_ptr = (ulong *)calloc(kt->cpus, sizeof(ulong))))
> error(FATAL, "cannot malloc paca pointers space.\n");
>
> @@ -1224,13 +1227,13 @@ ppc64_init_paca_info(void)
> ulong paca_loc;
>
> readmem(symbol_value("paca_ptrs"), KVADDR, &paca_loc, sizeof(void *),
> -			"paca double pointer", FAULT_ON_ERROR);
> +			"paca double pointer", ret);
> readmem(paca_loc, KVADDR, paca_ptr, sizeof(void *) * kt->cpus,
> -			"paca pointers", FAULT_ON_ERROR);
> +			"paca pointers", ret);
> } else if (symbol_exists("paca") &&
> (get_symbol_type("paca", NULL, NULL) == TYPE_CODE_PTR)) {
> readmem(symbol_value("paca"), KVADDR, paca_ptr, sizeof(void *) * kt->cpus,
> -			"paca pointers", FAULT_ON_ERROR);
> +			"paca pointers", ret);
> } else {
> free(paca_ptr);
> return;
> @@ -1245,7 +1248,7 @@ ppc64_init_paca_info(void)

Also, the below reads are pointless if paca_ptr[i] is '0'. Maybe a check
for paca_ptr[i] before reading here?

> for (i = 0; i < kt->cpus; i++)
> readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
> sizeof(void *), "paca->emergency_sp",
> -				FAULT_ON_ERROR);
> +				ret);
> }
>
> if (MEMBER_EXISTS("paca_struct", "nmi_emergency_sp")) {
> @@ -1256,7 +1259,7 @@ ppc64_init_paca_info(void)
> for (i = 0; i < kt->cpus; i++)
> readmem(paca_ptr[i] + offset, KVADDR, &ms->nmi_emergency_sp[i],
> sizeof(void *), "paca->nmi_emergency_sp",
> -				FAULT_ON_ERROR);
> +				ret);
> }
>
> if (MEMBER_EXISTS("paca_struct", "mc_emergency_sp")) {
> @@ -1267,7 +1270,7 @@ ppc64_init_paca_info(void)
> for (i = 0; i < kt->cpus; i++)
> readmem(paca_ptr[i] + offset, KVADDR, &ms->mc_emergency_sp[i],
> sizeof(void *), "paca->mc_emergency_sp",
> -				FAULT_ON_ERROR);
> +				ret);

Comment 14 lijiang 2022-10-04 07:18:08 UTC
> Also, the below reads are pointless if paca_ptr[i] is '0'. Maybe a check
> for paca_ptr[i] before reading here?
> 

If paca_ptr[i] is an invalid kernel virtual address, the readmem(..., RETURN_ON_ERROR) will fail with some error information, which can tell users that it may be unreliable.

> > for (i = 0; i < kt->cpus; i++)
> > readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
> > sizeof(void *), "paca->emergency_sp",
> > -				FAULT_ON_ERROR);
> > +				ret);
> > }
> >

The above code looks concise and readable. After adding a check for paca_ptr[i], seems that it doesn't look as good as before. Any comments?

@@ -1243,9 +1243,13 @@ ppc64_init_paca_info(void)
                if (!(ms->emergency_sp = (ulong *)calloc(kt->cpus, sizeof(ulong))))
                        error(FATAL, "cannot malloc emergency stack space.\n");
                for (i = 0; i < kt->cpus; i++)
+                       if (!IS_KVADDR(paca_ptr[i])) {
+                               error(WARNING, "invalid kernel virtual address: %lx , paca->emergency_sp\n", paca_ptr[i]);
+                               continue;
+                       }
                        readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
                                sizeof(void *), "paca->emergency_sp",
-                               FAULT_ON_ERROR);
+                               RETURN_ON_ERROR);
        }

Thanks.

Comment 15 IBM Bug Proxy 2022-10-04 09:40:45 UTC
------- Comment From hbathini.com 2022-10-04 05:36 EDT-------
(In reply to comment #12)
> > Also, the below reads are pointless if paca_ptr[i] is '0'. Maybe a check
> > for paca_ptr[i] before reading here?
> If paca_ptr[i] is an invalid kernel virtual address, the readmem(...,
> RETURN_ON_ERROR) will fail with some error information, which can tell users
> that it may be unreliable.

Yeah. Actually, warnings are better than proceeding quietly on read failures..

> > > for (i = 0; i < kt->cpus; i++)
> > > readmem(paca_ptr[i] + offset, KVADDR, &ms->emergency_sp[i],
> > > sizeof(void *), "paca->emergency_sp",
> > > -				FAULT_ON_ERROR);
> > > +				ret);
> > > }
> > >
>
> The above code looks concise and readable. After adding a check for
> paca_ptr[i], seems that it doesn't look as good as before. Any comments?

Agree.

Comment 16 lijiang 2022-10-05 02:25:39 UTC
Thank you, Hari and Dave Wysochanski. I posted a patch to upstream, we can continue to discuss it there.

https://www.mail-archive.com/crash-utility@redhat.com/msg09560.html

Thanks.

Comment 17 lijiang 2022-10-10 02:20:16 UTC
Patch is ready on upstream:
487551488b15 ("ppc64: still allow to move on if the emergency stacks info fails to initialize")

Comment 18 IBM Bug Proxy 2022-10-14 07:00:49 UTC
------- Comment From hbathini.com 2022-10-14 02:55 EDT-------
With the fix available upstream, changing the bug status to submitted..

Comment 26 errata-xmlrpc 2023-05-09 07:40:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (crash bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2262