This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 1980335 - [aarch64] [libvirt] Support RAS for aarch64
Summary: [aarch64] [libvirt] Support RAS for aarch64
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.0
Hardware: aarch64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 9.1
Assignee: khanicov
QA Contact: Hu Shuai (Fujitsu)
URL:
Whiteboard:
Depends On: 1838608
Blocks:
Reported: 2021-07-08 12:06 UTC by Eric Auger
Modified: 2023-09-22 17:32 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 17:32:24 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Links
System                 ID         Private  Priority  Status    Summary  Last Updated
Red Hat Issue Tracker  RHEL-7489  0        None      Migrated  None     2023-09-22 17:32:20 UTC

Description Eric Auger 2021-07-08 12:06:04 UTC
In RHEL 9 we want to enable the 'ras' feature. Work is ongoing at the QEMU level to allow the Arm virt machine's 'ras' option to be set. Logically, this setting should also be exposed in the libvirt XML.

Comment 1 Jaroslav Suchanek 2021-07-12 11:38:00 UTC
Andrea, please triage this one. Thanks.

Comment 9 Hu Shuai (Fujitsu) 2022-11-30 01:48:47 UTC
Hi, Andrea, is there any progress or plan for this new feature?

Comment 10 Andrea Bolognani 2022-12-02 18:06:58 UTC
(In reply to Hu Shuai (Fujitsu) from comment #9)
> Hi, Andrea, is there any progress or plan for this new feature?

I'll try to get it done in time for RHEL 9.2.

I'm not familiar with RAS outside of a very high-level understanding
though. Eric, do you have any useful pointers?

Comment 11 Eric Auger 2022-12-06 09:43:37 UTC
(In reply to Andrea Bolognani from comment #10)
> (In reply to Hu Shuai (Fujitsu) from comment #9)
> > Hi, Andrea, is there any progress or plan for this new feature?
> 
> I'll try to get it done in time for RHEL 9.2.
> 
> I'm not familiar with RAS outside of a very high-level understanding
> though. Eric, do you have any useful pointers?

Hi Andrea, here is the procedure I used to test RAS at qemu level.
https://bugzilla.redhat.com/show_bug.cgi?id=1838608#c9
To be honest, I had to work hard to refresh my memory (Aug 21), but the overall principle was to poison one physical page backing the guest's RAM and then trigger an access to that page. When RAS is on, the guest recovers. If the option is not set, QEMU aborts with 'Hardware memory error!'

The setup was quite involved.
1) you need to translate one RAM GPA into host virtual address
2) you need to poison the host page frame (PFN) backing that host VA
3) from the guest you need to trigger an access to the GPA

I discovered later that there is an HMP command that lets you translate from GPA to HVA:
virsh # qemu-monitor-command --hmp aarch64-vm2-rhel9.0 gpa2hva 0x7dbea0000
Host virtual address for 0x7dbea0000 (ram-node0) is 0xffff0fe90000
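
As a rough illustration, the reply printed by gpa2hva could be parsed with a few lines of Python so that the HVA can be handed to a later step. This is only a sketch: the function names are made up for this example, and the regular expression is derived from the single sample line quoted above, so other QEMU versions may need a different pattern.

```python
import re
import subprocess

# Pattern derived from the one sample reply quoted in this comment;
# other QEMU versions may word the line differently (assumption).
GPA2HVA_RE = re.compile(
    r"Host virtual address for (0x[0-9a-fA-F]+) \((.+)\) is (0x[0-9a-fA-F]+)"
)


def parse_gpa2hva(reply):
    """Return (gpa, memory_region_name, hva) parsed from a gpa2hva reply."""
    match = GPA2HVA_RE.search(reply)
    if match is None:
        raise ValueError(f"unexpected gpa2hva reply: {reply!r}")
    gpa, region, hva = match.groups()
    return int(gpa, 16), region, int(hva, 16)


def gpa2hva(domain, gpa):
    """Ask QEMU, through virsh, for the host virtual address backing a GPA."""
    result = subprocess.run(
        ["virsh", "qemu-monitor-command", "--hmp", domain, f"gpa2hva {gpa:#x}"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpa2hva(result.stdout)[2]


if __name__ == "__main__":
    # Parse the sample reply from this comment; no VM is needed for this part.
    sample = "Host virtual address for 0x7dbea0000 (ram-node0) is 0xffff0fe90000"
    gpa, region, hva = parse_gpa2hva(sample)
    print(f"GPA {gpa:#x} in {region} -> HVA {hva:#x}")
```

Calling gpa2hva("aarch64-vm2-rhel9.0", 0x7dbea0000) would run the same virsh command as above against a live domain.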

However, I was not able to find an easy way to corrupt the resulting HVA.
I think there are ways like:

CONFIG_HWPOISON_INJECT
depends on CONFIG_MEMORY_FAILURE && CONFIG_DEBUG_KERNEL
sudo modprobe hwpoison_inject
echo "<addr>" > /sys/kernel/debug/hwpoison/corrupt-pfn
[110759.319941] Memory failure: 0x90b3b: corrupted page was clean: dropped without side effects
[110759.328435] Memory failure: 0x90b3b: recovery action for clean LRU page: Recovered

but this did not work for me at the time, and I do not know why.
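
The corrupt-pfn file takes a host page frame number rather than a virtual address, so the HVA would first have to be translated. One possible way, which is not mentioned in this thread and is untested here, is to look the HVA up in /proc/<pid>/pagemap of the QEMU process. The Python sketch below follows the bit layout given in the kernel's pagemap documentation; the helper names are hypothetical, and reading real PFNs needs root (the kernel reports the PFN as 0 to unprivileged readers).

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8      # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1    # bits 0-54 hold the PFN when the page is present
PRESENT_BIT = 1 << 63       # bit 63: page is present in RAM


def pagemap_offset(vaddr, page_size):
    """Byte offset of the pagemap entry for the page containing vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def decode_pfn(entry):
    """Return the PFN stored in a pagemap entry, or None if not present."""
    if not entry & PRESENT_BIT:
        return None
    return entry & PFN_MASK


def hva_to_pfn(pid, hva):
    """Look up the host PFN backing hva in process pid (run as root)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        pagemap.seek(pagemap_offset(hva, page_size))
        data = pagemap.read(PAGEMAP_ENTRY_SIZE)
    if len(data) != PAGEMAP_ENTRY_SIZE:
        raise ValueError(f"no pagemap entry for address {hva:#x}")
    (entry,) = struct.unpack("=Q", data)   # native-endian unsigned 64-bit
    return decode_pfn(entry)


if __name__ == "__main__":
    # Pure helpers only: decode a synthetic entry, touching no real process.
    print(hex(pagemap_offset(0xffff0fe90000, 0x1000)))
    print(decode_pfn(PRESENT_BIT | 0x90b3b))
```

With the PFN in hand, hva_to_pfn(qemu_pid, hva) would give the value to echo into corrupt-pfn, where qemu_pid is the PID of the QEMU process (for example from pidof).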

So what I eventually did was develop a specific QMP poison command which did both the GPA-to-HVA conversion and the poisoning. This worked for me but required a specific patch on top of QEMU. I guess it would be easier and nicer to execute the HMP command to retrieve the HVA and then have a separate C executable do the poisoning, i.e. madvise(vaddr, 0x1000, MADV_HWPOISON). At the time I wrote the QMP poison command I was not aware of the HMP gpa2hva facility.
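
For reference, the madvise() call mentioned above can be tried in isolation. The sketch below uses Python rather than C purely for brevity (mmap.madvise() wraps the same system call), and it poisons a page that the script maps itself: madvise() acts on the caller's own address space, so poisoning a page of the guest would presumably still require the call to run inside the QEMU process. Assumptions: Linux, Python 3.8 or later, root (CAP_SYS_ADMIN), and a kernel built with CONFIG_MEMORY_FAILURE. The --really-poison flag is invented for this example.

```python
import mmap
import os
import sys


def poison_own_page():
    """Map one anonymous page, touch it, and mark it hardware-poisoned.

    Raises OSError if the kernel refuses, e.g. EPERM without CAP_SYS_ADMIN
    or EINVAL when the kernel lacks CONFIG_MEMORY_FAILURE.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    mapping = mmap.mmap(-1, page_size)   # anonymous mapping, one page long
    mapping[0] = 1                       # touch it so a physical page backs it
    # Python equivalent of madvise(vaddr, page_size, MADV_HWPOISON) in C.
    mapping.madvise(mmap.MADV_HWPOISON, 0, page_size)
    return mapping


if __name__ == "__main__":
    # Warning: a real run takes one physical page of the host out of service.
    if "--really-poison" in sys.argv:
        poison_own_page()
        print("page poisoned; touching it again is expected to raise SIGBUS")
    else:
        print("dry run: pass --really-poison (as root) to poison one page")
```

Without the flag the script only prints a hint, so it is safe to run on a machine you care about.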

Then, on the guest side, I used the mce-tools victim binary to trigger the actual access to that GPA.

Good luck! And yes we can rework this together :-)

Eric

Comment 14 Andrea Bolognani 2022-12-13 13:33:36 UTC
(In reply to Eric Auger from comment #11)

Thanks for the detailed write-up!

I'll need to wrap my head around some of these concepts. Will
probably ask for clarifications as I poke around.

Using a patched QEMU shouldn't be a problem, and it's probably less
work to reuse what you've already built rather than coming up with a
separate tool. So I'll investigate that approach first.

As an important reminder to myself, based on our discussion from
earlier today: testing this requires using hardware that implements
the v8.2 specification.

Comment 15 Eric Auger 2022-12-13 14:01:21 UTC
To be more precise: "The RAS Extension is a mandatory extension to the Armv8.2 architecture, and it is an optional extension to the Armv8.0 and Armv8.1 architectures." But better test it directly on 8.2 HW!

Comment 17 Eric Auger 2023-03-03 14:26:15 UTC
Hi Kristina,

Regarding your question "shall we enable RAS by default at libvirt level", I dug a little into the history. The original contributor initially wanted to enable it by default at the QEMU level (without any QEMU ras option):

https://lore.kernel.org/all/20190620140409.3c713760@redhat.com/

but Igor challenged it, hence the introduction of the ras option. At the QEMU level it is disabled by default. Igor mentioned that it consumes some resources (I guess he was referring to the extra APEI/GHES ACPI tables, including the guest GHES buffer). I would be tempted to leave it off by default.

However, I would be keen to know how this is handled on x86. Sorry to bounce back with another question, but do you know what the default value is on x86?

Adding Laszlo in CC, as we spent quite a lot of time on that topic, if I remember correctly.

Thanks

Eric

Comment 18 Laszlo Ersek 2023-03-04 05:27:16 UTC
*shudder*

My memories about this are vague. I only remember discussing the ACPI content (the linkage of the tables etc); those ACPI objects are complex.

I don't recall the feature ever being relevant for x86 guests. Reading through the patch title list at <https://lore.kernel.org/all/1557832703-42620-1-git-send-email-gengdongjiu@huawei.com/> seems to confirm that the feature is limited to arm:

Dongjiu Geng (10):
  hw/arm/virt: Add RAS platform version for migration
  ACPI: add some GHES structures and macros definition
  acpi: add build_append_ghes_notify() helper for Hardware Error
    Notification
  acpi: add build_append_ghes_generic_data() helper for Generic Error
    Data Entry
  acpi: add build_append_ghes_generic_status() helper for Generic Error
    Status Block
  docs: APEI GHES generation and CPER record description
  ACPI: Add APEI GHES table generation support
  KVM: Move related hwpoison page functions to accel/kvm/ folder
  target-arm: kvm64: inject synchronous External Abort
  target-arm: kvm64: handle SIGBUS signal from kernel or KVM

Comment 19 Eric Auger 2023-03-06 07:32:31 UTC
(In reply to Laszlo Ersek from comment #18)
> *shudder*
> 
> My memories about this are vague. I only remember discussing the ACPI
> content (the linkage of the tables etc); those ACPI objects are complex.
> 
> I don't recall the feature ever being relevant for x86 guests. Reading
> through the patch title list at
> <https://lore.kernel.org/all/1557832703-42620-1-git-send-email-gengdongjiu@huawei.com/>
> seems to confirm that the feature is limited to arm:
> 
Yes, indeed it is ARM-specific. I just wanted to know whether you remembered how many resources this consumes. But no worries: the plan is to compare with the x86 'default' enablement, and I will double-check the resources involved.

Thanks

Eric

Comment 21 RHEL Program Management 2023-09-22 17:31:29 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 22 RHEL Program Management 2023-09-22 17:32:24 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

