Bug 2178836

Summary:	rasdaemon spewing diskerror_eventstore messages
Product:	Red Hat Enterprise Linux 9	Reporter:	Andrew Schorr <ajschorr>
Component:	rasdaemon	Assignee:	Aristeu Rozanski <arozansk>
Status:	CLOSED MIGRATED	QA Contact:	Jiri Dluhos <jdluhos>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	CentOS Stream	CC:	bstinson, jwboyer
Target Milestone:	rc	Keywords:	MigratedToJIRA, Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-09-25 17:25:57 UTC	Type:	Bug

Description Andrew Schorr 2023-03-15 20:32:31 UTC

Description of problem:
rasdaemon spews out fairly incomprehensible diskerror_eventstore messages.

Version-Release number of selected component (if applicable):
rasdaemon-0.6.4-6.el9.x86_64

How reproducible:
I'm not sure.

Steps to Reproduce:
1. Boot system and run 'journalctl -b -u rasdaemon'

Actual results:
Mar 15 16:13:23 ti26 rasdaemon[2399]:           <idle>-0     [043]     0.000002: block_rq_complete:    2023-03-15 16:13:23 -0400
Mar 15 16:13:24 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:24 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:24 ti26 rasdaemon[2399]:           <idle>-0     [037]     0.000002: block_rq_complete:    2023-03-15 16:13:24 -0400
Mar 15 16:13:26 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:26 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:26 ti26 rasdaemon[2399]:           <idle>-0     [037]     0.000003: block_rq_complete:    2023-03-15 16:13:26 -0400
Mar 15 16:13:28 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:28 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:28 ti26 rasdaemon[2399]:           <idle>-0     [037]     0.000003: block_rq_complete:    2023-03-15 16:13:28 -0400
Mar 15 16:13:30 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:30 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:30 ti26 rasdaemon[2399]:           <idle>-0     [037]     0.000003: block_rq_complete:    2023-03-15 16:13:30 -0400
Mar 15 16:13:32 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:32 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
...
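
The excerpt above looks like it repeats on a fixed cadence. A minimal sketch for checking that, assuming the journal layout seen in this report (month, day, time, host, then the message). The helper name eventstore_gaps and the year default are made up for illustration, since the short journal timestamps carry no year:

```python
from datetime import datetime

# Sample lines copied from the journal excerpt in this report.
JOURNAL = """\
Mar 15 16:13:24 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:24 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:26 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:26 ti26 rasdaemon[2399]: rasdaemon: register inserted at db
Mar 15 16:13:28 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
Mar 15 16:13:30 ti26 rasdaemon[2399]: rasdaemon: diskerror_eventstore: 0x5628c0e93318
"""


def eventstore_gaps(text, year=2023):
    """Return the seconds between consecutive diskerror_eventstore lines."""
    stamps = []
    for line in text.splitlines():
        if "diskerror_eventstore" not in line:
            continue
        # The first three whitespace-separated tokens are month, day, time.
        month, day, clock = line.split()[:3]
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


gaps = eventstore_gaps(JOURNAL)
print(gaps)
```

A steady gap would point at something periodic rather than at random failures, but nothing in this report identifies the source.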

Expected results:
I have no clue what this signifies. It's a very unhelpful message. What does
it mean? Do I have a problem with one or more disks?


Additional info:
bash-5.1$ ras-mc-ctl --errors | tail
9255 2023-03-15 16:18:39 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9256 2023-03-15 16:18:41 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9257 2023-03-15 16:18:43 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9258 2023-03-15 16:18:45 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9259 2023-03-15 16:18:47 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9260 2023-03-15 16:18:49 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
9261 2023-03-15 16:18:51 -0400 error: dev=0:2816, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 

No MCE errors.
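
To see which devices and error strings dominate, the output above can be tallied. This is a rough sketch only: the field layout is inferred from the sample lines in this report, not from any documented format, and parse_disk_errors is a hypothetical helper name.

```python
import re
from collections import Counter

# Sample lines copied from the "ras-mc-ctl --errors" output in this report.
SAMPLE = """\
9259 2023-03-15 16:18:47 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='',
9260 2023-03-15 16:18:49 -0400 error: dev=0:2048, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='',
9261 2023-03-15 16:18:51 -0400 error: dev=0:2816, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='',
"""

# Layout inferred from the sample only; treat it as an assumption.
LINE_RE = re.compile(
    r"^(?P<id>\d+) (?P<when>\S+ \S+ \S+) error: "
    r"dev=(?P<dev>\d+:\d+), sector=(?P<sector>-?\d+), "
    r"nr_sector=(?P<nr_sector>\d+), error='(?P<error>[^']*)', "
    r"rwbs='(?P<rwbs>[^']*)', cmd='(?P<cmd>[^']*)'"
)


def parse_disk_errors(text):
    """Return one dict per line that matches the inferred layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


records = parse_disk_errors(SAMPLE)
# Count how often each (device, error, rwbs) combination occurs.
summary = Counter((r["dev"], r["error"], r["rwbs"]) for r in records)
for key, count in summary.items():
    print(key, count)
```

Lines that do not match, such as "No MCE errors.", are skipped, so the full command output can be fed in unchanged.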


What's this trying to tell me? I thought rasdaemon's job was to detect memory problems.

Also, the "ras-mc-ctl --errors" option is not documented in the man page. Some others are missing as well.

Comment 1 Aristeu Rozanski 2023-06-01 17:05:55 UTC

rasdaemon is the userspace for HERM, which includes but is not limited to memory errors.

Comment 2 Aristeu Rozanski 2023-06-01 17:06:59 UTC

I'll take a look at why it's generating stdout/stderr messages, and also see how to improve the recorded events.

Comment 3 Andrew Schorr 2023-06-02 01:16:53 UTC

Thanks. I'm happy if rasdaemon tells me about additional system errors beyond memory errors.
But are these actually errors? The messages are inscrutable, and I'm not aware of any actual
hardware problems with these drives. But if rasdaemon is trying to tell me about an actual
problem, then I'd certainly like to understand what the issue is. Some more examples
from another system:

bash-5.1$ ras-mc-ctl --errors | tail
9123 2023-06-01 16:44:15 -0400 error: dev=0:2080, sector=578224128, nr_sector=32, error='unknown block error', rwbs='R', cmd='', 
9124 2023-06-01 16:44:15 -0400 error: dev=0:2080, sector=578224160, nr_sector=64, error='unknown block error', rwbs='R', cmd='', 
9125 2023-06-01 16:44:15 -0400 error: dev=0:2080, sector=578225016, nr_sector=160, error='unknown block error', rwbs='R', cmd='', 
9126 2023-06-01 16:44:15 -0400 error: dev=0:2080, sector=562617792, nr_sector=32, error='unknown block error', rwbs='R', cmd='', 
9127 2023-06-01 16:51:08 -0400 error: dev=0:0, sector=-1, nr_sector=8, error='operation not supported error', rwbs='N', cmd='', 
9128 2023-06-01 16:51:12 -0400 error: dev=0:0, sector=-1, nr_sector=8, error='critical target error', rwbs='N', cmd='', 
9129 2023-06-01 16:51:20 -0400 error: dev=0:0, sector=-1, nr_sector=8, error='operation not supported error', rwbs='N', cmd='', 

No MCE errors.

bash-5.1$ 

How can I make sense of these error messages? Are these real or spurious problems? I don't know
how to interpret them.

Thanks,
Andy

Comment 4 Andrew Schorr 2023-08-25 20:16:40 UTC

I just upgraded another system, and rasdaemon is spewing incessant error messages. I simply 
have no idea what they mean. Is there any way to find out what it's trying to tell me?

In the journal, I see errors like this:

Aug 25 16:13:13 ti11 rasdaemon[22141]: rasdaemon: diskerror_eventstore: 0x55bb8bec8eb8
Aug 25 16:13:13 ti11 rasdaemon[22141]: rasdaemon: register inserted at db
Aug 25 16:13:13 ti11 rasdaemon[22141]:            <...>-660   [002]     0.009848: block_rq_complete:    2023-08-25 16:13:13 -0400
Aug 25 16:13:13 ti11 rasdaemon[22141]: rasdaemon: diskerror_eventstore: 0x55bb8bec8eb8
Aug 25 16:13:13 ti11 rasdaemon[22141]: rasdaemon: register inserted at db
Aug 25 16:13:13 ti11 rasdaemon[22141]:            <...>-677   [003]     0.009848: block_rq_complete:    2023-08-25 16:13:13 -0400
Aug 25 16:13:15 ti11 rasdaemon[22141]: rasdaemon: diskerror_eventstore: 0x55bb8bec8eb8
Aug 25 16:13:15 ti11 rasdaemon[22141]: rasdaemon: register inserted at db
Aug 25 16:13:15 ti11 rasdaemon[22141]:            <...>-660   [002]     0.009848: block_rq_complete:    2023-08-25 16:13:15 -0400
Aug 25 16:13:15 ti11 rasdaemon[22141]: rasdaemon: diskerror_eventstore: 0x55bb8bec8eb8
Aug 25 16:13:15 ti11 rasdaemon[22141]: rasdaemon: register inserted at db
Aug 25 16:13:15 ti11 rasdaemon[22141]:            <...>-677   [003]     0.009848: block_rq_complete:    2023-08-25 16:13:15 -0400


And "ras-mc-ctl --errors" says this:

10427 2023-08-25 16:13:13 -0400 error: dev=0:2096, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
10428 2023-08-25 16:13:13 -0400 error: dev=0:2112, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
10429 2023-08-25 16:13:15 -0400 error: dev=0:2096, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
10430 2023-08-25 16:13:15 -0400 error: dev=0:2112, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='', 
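
One possible reading of the dev=0:NNNN values, offered strictly as an assumption that nothing in this report confirms: the kernel-internal device number (major << 20 | minor) may have been split using the userspace major()/minor() bit layout, which differs. The sketch below reverses the split under that assumption. The function names are hypothetical.

```python
def glibc_makedev(major, minor):
    """Rebuild the value that the glibc-style major()/minor() layout splits."""
    return (
        ((major & 0xFFF) << 8)
        | ((major & ~0xFFF) << 32)
        | (minor & 0xFF)
        | ((minor & ~0xFF) << 12)
    )


def kernel_split(raw):
    """Split a kernel-internal dev_t: major above a 20-bit minor."""
    return raw >> 20, raw & 0xFFFFF


def redecode(printed):
    """Map a printed 'major:minor' pair to the assumed kernel pair."""
    major, minor = (int(part) for part in printed.split(":"))
    return kernel_split(glibc_makedev(major, minor))


# dev= values copied from the ras-mc-ctl output in this report.
for dev in ["0:2048", "0:2816", "0:2080", "0:2096", "0:2112", "0:0"]:
    print(dev, "->", "%d:%d" % redecode(dev))
```

If the assumption holds, 0:2048 would correspond to 8:0 and 0:2816 to 11:0. In the kernel's device list, major 8 is SCSI disks and major 11 is SCSI CD-ROM. Compare the results against lsblk or /proc/partitions on the affected machine before drawing any conclusion.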

What does it mean? I'm forced to stop the rasdaemon service to avoid being buried in messages.
Is there an actual problem here? How do I decode these messages?

Thanks,
Andy

Comment 5 RHEL Program Management 2023-09-25 17:22:22 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 6 RHEL Program Management 2023-09-25 17:25:57 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.