Bug 428903 - RHEL 5 XEN kernel with 4GB memory, kmap address contents invalid after pci_unmap_sg()
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i386   OS: Linux
Priority: low   Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Xen Maintenance List
QA Contact: Martin Jenner
Keywords: Reopened
Depends On:
Blocks: 217105
Reported: 2008-01-15 17:42 EST by Atul Mukker
Modified: 2008-02-07 12:21 EST (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-02-07 12:21:28 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Atul Mukker 2008-01-15 17:42:15 EST
Description of problem:
kmap-generated virtual address has junk data after copy and pci_unmap_sg() on
the XEN kernel

Version-Release number of selected component (if applicable):
Red Hat Enterprise 5 Gold - XEN Kernel

How reproducible:
Every time

Steps to Reproduce:
1. Have a RHEL 5 32-bit XEN installation with 4GB memory and a SCSI LLD.
2. In response to SCSI LLD queuecommand, generate a kernel virtual address for
the DMA buffer generated by pci_map_sg() using kmap/kmap_atomic.
3. Provide driver generated data at buffer address obtained above.
4. Do pci_unmap_sg() to tear down the mapping. The SCSI mid-layer gets garbled data.
5. If pci_unmap_sg() is omitted, things seem to work OK.
6. The issue does not happen with 2GB memory; it can be duplicated with 4GB
memory. 3GB also does not work.
  
Actual results:
1. The SCSI mid-layer gets junk data when the SCSI LLD provides a driver-generated response.

Expected results:
1. SCSI mid-layer should get correct data from LLD on all memory configurations.

Additional info:
Comment 2 Andrius Benokraitis 2008-01-16 13:49:42 EST
So does this mean you all call pci_unmap_sg before the IO has finished (or
before all IOs in an entire scatter-gather set have finished)?
Comment 3 Atul Mukker 2008-01-16 14:19:21 EST
The data (for some commands, partly) is generated by the driver itself.
pci_unmap_sg() is called just before completing the command to mid-layer, after
all processing is done for it. It is not done prematurely.
Comment 4 Chip Coldwell 2008-01-17 13:07:34 EST
(In reply to comment #3)
> The data (for some commands, partly) is generated by the driver itself.
> pci_unmap_sg() is called just before completing the command to mid-layer, after
> all processing is done for it. It is not done prematurely.

There was recently a similar problem caused by having the wrong dma mask in the
megaraid device.  Can you please verify that this is being set correctly?

Chip
Comment 5 Atul Mukker 2008-01-18 14:35:04 EST
This is observed on a 32-bit platform, where we always set the PCI dma mask to
0xffffffff. Can you please provide a reference to the issue on megaraid stack so
that we can look it up ourselves?

Thanks
-Atul Mukker
Comment 7 Andrius Benokraitis 2008-01-29 15:40:24 EST
Atul - have you been able to reproduce this on a non-Xen kernel?

Unfortunately, I think this is as far as we can go with trying to determine a
root cause, since it is in reference to a driver we do not ship in-box.
Furthermore, we haven't heard from others (even within other groups at LSI) who
are seeing this.

I'm including Sumant and Eric at LSI, who may be able to help, since they are
authorized to see your driver code.
Comment 8 Atul Mukker 2008-01-30 13:45:04 EST
The HW-based stacks would not see the issue since they do not need to use
kmap/kmap_atomic. All their operations are carried out by FW using DMA.

Since the issue seemed independent of the stack, we modified the megaraid SAS
driver to generate INQUIRY from within the driver using kmap_atomic, rather than
FW filling it up using DMA. And sure enough, the issue can now be duplicated on
that driver as well. If you want, I can mail you the modified driver. The
changes are small and can be easily ported to any RAID/SCSI stack for validation.
Comment 9 Andrius Benokraitis 2008-02-01 10:56:52 EST
Atul,

Could you try the latest 5.2 candidate kernel (-76) and see if this is still
occurring?

http://people.redhat.com/dzickus/el5/
Comment 10 Andrius Benokraitis 2008-02-01 13:08:51 EST
LSI - Oh, also can you test if you experience the issue with both Xen and
non-Xen 2.6.18-76.el5 kernels? Thanks!
Comment 11 Atul Mukker 2008-02-04 14:22:50 EST
We tried the latest kernel version xen-2.6.18-77.el5.i686 available on the site
provided by you. With this kernel, only SCSI INQUIRY seems to be transmitted
properly, but not other commands like READ CAPACITY, REPORT LUNS etc.
Comment 12 Stephen Tweedie 2008-02-05 09:22:17 EST
One possibility --- can you check whether your driver is correctly sending the
remapped DMA address from the scatter-gather list to the device?

On i386 with no Xen, bounce buffering will be happening in the block device
layer, and the scatter-gather list that the driver gets will already be updated
to honour the device's dma_mask. 

Thus, the pci_map_sg() call on bare metal kernels has essentially no effect on
i386.  Any bounce buffering needed is already handled elsewhere, and pci_map_sg
simply translates the virtual address of the scatterlist pages to bus addresses.

But with Xen, we have a swiotlb down in the dma-mapping layer.  This can result
in out-of-range pages, or IOs which cross page boundaries, being physically
copied elsewhere in pci_map_sg().  In this case, the scatterlist->dma_address
addresses set up by pci_map_sg() will NOT simply be equivalent to the bus
address of the original pages.  And of course, when pci_unmap_sg() is called to
complete the IO, the new swiotlb pages get copied back over the original ones,
if the dma direction so indicates.

As a result, the behaviour described above --- seeing correct data before a
pci_unmap_sg(), and incorrect data afterwards, but only on the Xen kernel --- is
exactly what we would expect if the driver was mistakenly using the sg->page
address instead of calling sg_dma_address(sg) to read sg->dma_address when
walking a scatterlist to initialise DMA to the HBA.
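
For illustration, a minimal sketch of the pattern described above (not code from
the driver in question): after pci_map_sg(), the device must be programmed with
the addresses returned by sg_dma_address()/sg_dma_len(), never with the page and
offset stored in the scatterlist. hba_add_segment() is a hypothetical helper
standing in for whatever builds the HBA's segment descriptors.

#include <linux/pci.h>
#include <linux/scatterlist.h>

static int setup_dma(struct pci_dev *pdev, struct scatterlist *sgl, int nents)
{
    int i, count;

    /* may merge entries or redirect them to swiotlb bounce pages */
    count = pci_map_sg(pdev, sgl, nents, PCI_DMA_FROMDEVICE);
    if (count == 0)
        return -ENOMEM;

    for (i = 0; i < count; i++) {
        /* correct: the bus address the device must use (possibly a bounce buffer) */
        hba_add_segment(sg_dma_address(&sgl[i]), sg_dma_len(&sgl[i]));

        /* WRONG under swiotlb/IOMMU: page_to_phys(sgl[i].page) + sgl[i].offset */
    }
    return 0;
}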

Comment 13 Atul Mukker 2008-02-05 15:54:50 EST
I would like to re-iterate that the commands which are failing are getting
"cooked up" by the driver itself. These are not submitted to the HBA. On the
other hand, we do use page directly for kmap. Do you think it might be causing
the issue? Below is the pseudo code used in the driver; please review and
provide feedback.

int driver_queuecommand(struct scsi_cmnd *scp, void (*done)(struct scsi_cmnd *))
{
    struct scatterlist *mlsgl;
    unsigned char *va_map;
    unsigned char *va;
    int i = 0;    /* only one segment, since use_sg == 1 */

    if (scp->use_sg == 1 && scp->cmnd[0] == SCSI_INQUIRY) {
        mlsgl = (struct scatterlist *)scp->request_buffer;
        pci_map_sg(pdev, mlsgl, scp->use_sg, PCI_DMA_FROMDEVICE);

        /* fill the INQUIRY data from software through a kernel virtual address */
        va_map = kmap_atomic(mlsgl[i].page, KM_USER0);
        va = va_map + mlsgl[i].offset;
        generate_software_inquiry(va);
        kunmap_atomic(va_map, KM_USER0);

        pci_unmap_sg(pdev, mlsgl, scp->use_sg, PCI_DMA_FROMDEVICE);
        done(scp);
    }
    return 0;
}
Comment 15 Stephen Tweedie 2008-02-05 17:51:55 EST
Yes, that's exactly the bug I was talking about.

What your code is doing is:

  pci_map_sg(pdev, mlsgl, scp->use_sg, PCI_DMA_FROMDEVICE);

    (the swiotlb sets up a separate bounce buffer for the IO)

  va_map = kmap_atomic(mlsgl[i].page, KM_USER0);
  va = va_map + mlsgl[i].offset;
  generate_software_inquiry(va);
  kunmap_atomic(va_map, KM_USER0);

    (your driver uses sg[i].page, not sg_dma_address(sg[i]), as the address of 
    the sg segment, so it modifies the original buffer, not the bounce buffer)

  pci_unmap_sg(pdev, mlsgl, scp->use_sg, PCI_DMA_FROMDEVICE);

    (the swiotlb copies the results from the bounce buffer back into the
    original buffer... but the results were already written to the original 
    buffer, and they are now overwritten.)

So the corruption you describe is expected.

You must not use the original buffer like this in between a map_sg and unmap_sg
pair.  The kernel is entirely permitted to use a software bounce buffer to
perform the sg mapping, and for the copying of data between the original buffers
and bounce buffers to work, you _must_ use sg_dma_address(), not sg_page(), to
address the sg contents while map_sg() is in effect.

So you should simply drop the use of pci_map_sg()/pci_unmap_sg() entirely for
this case.  If you're filling in the buffer from software, there is no point in
setting up a hardware IOMMU mapping for the sg buffer if you know in advance
that you are going to be satisfying the IO entirely in software!  And even if
you did try to pci_map_sg() and use sg_dma_address() to find the bounce buffer,
you'd have to worry about potentially converting IOMMU bus-addresses back into
virtual addresses (again, not an issue on i386, but many x86-64 boxes will use
an IOMMU for IOs.)

The only reason that this works on baremetal i386 is that in that case, the
kernel happens never to use a swiotlb.
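
As a rough illustration of dropping the mapping for the software-only case, here
is a sketch only, derived from the pseudo code in comment 13 rather than from the
actual driver; hw_queuecommand() is a hypothetical hand-off to the normal
hardware path, and generate_software_inquiry() is the reporter's routine.

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <linux/highmem.h>

static int driver_queuecommand(struct scsi_cmnd *scp,
                               void (*done)(struct scsi_cmnd *))
{
    struct scatterlist *mlsgl;
    unsigned char *va_map;

    if (scp->use_sg != 1 || scp->cmnd[0] != INQUIRY)
        return hw_queuecommand(scp, done);   /* hypothetical hardware path */

    mlsgl = (struct scatterlist *)scp->request_buffer;

    /* No pci_map_sg()/pci_unmap_sg(): the buffer is filled purely by the
     * CPU, so there is nothing for the swiotlb to bounce or copy back. */
    va_map = kmap_atomic(mlsgl[0].page, KM_USER0);
    generate_software_inquiry(va_map + mlsgl[0].offset);
    kunmap_atomic(va_map, KM_USER0);

    scp->result = DID_OK << 16;
    done(scp);
    return 0;
}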
Comment 16 Stephen Tweedie 2008-02-06 09:17:34 EST
Closing NOTABUG, as the kernel is working as expected; but please feel free to
followup with any further questions if you still have any.
Comment 17 Atul Mukker 2008-02-06 12:07:52 EST
Updating with information exchanged over email:

Mukker, Atul wrote:
> Hello Stephen, Andrius:
> 
> Since you've closed bugzilla defect
> https://bugzilla.redhat.com/show_bug.cgi?id=428903, I am following 
> through this mail. Please let me know if the discussion should still 
> continue on the bugzilla portal.
> 
> First answer to your question: "So you should simply drop the use of
> pci_map_sg()/pci_unmap_sg() entirely for this case.  If you're filling 
> in the buffer from software, there is no point in setting up a 
> hardware IOMMU mapping for the sg buffer if you know in advance that 
> you are going to be satisfying the IO entirely in software!"
> 
> Unfortunately, this is not possible since we can achieve this for 
> simple non-IO commands like INQUIRY and READ CAPACITY but not, e.g., 
> for RAID 5 IO commands. Since this is a software based RAID stack, it 
> is possible that part of the data for a given IO will be generated by 
> the driver and part provided by the disk(s). Also, for RAID 5 parity 
> operations we anyway need kernel virtual addresses for the buffer to be
> transferred.
> 
> So this brings us to your other suggestion: "And even if you did try 
> to
> pci_map_sg() and use sg_dma_address() to find the bounce buffer, you'd 
> have to worry about potentially converting IOMMU bus-addresses back 
> into virtual addresses (again, not an issue on i386, but many x86-64 
> boxes will use an IOMMU for IOs.)"
> 
> Is it possible? How? We need to support both i686 and x86-64 based 
> architectures.
>  
> And, thanks for a very informative writeup on the root cause.
> 
> Best regards
> -Atul
Comment 18 Stephen Tweedie 2008-02-06 13:31:54 EST
Mixing software and hardware access is not a problem in principle, as long as
the software access is not occurring *during* the pci_map_sg() section.  You can
access the pages in software before that call, and after the pci_unmap_sg(),
quite safely, so it will be possible to use a combination of software and DMA to
populate the entire buffer.

The fact that the swiotlb is kicking in at all, however, quite possibly
indicates a deeper problem.  In comment #5, you write

> we always set the PCI dma mask to 0xffffffff

dma_set_mask() takes a 64-bit mask.  If you're setting it to 32-bits, then you
are forcing all IOs above 4GB to use bounce buffering (either at the bio level
or at the swiotlb level), which is a significant performance overhead.  If your
driver is being layered over underlying devices which can perform 64-bit PCI
addressing, then that copying may be unnecessary.

Of course, for software raid5 writes, you probably _do_ want to copy, because
the data being written from the bio layer may actually be modified by the CPU
while the IO is in flight, leading to inconsistent raid5 checksums if you're not
careful.  But in that case, you don't want to rely on bounce buffering to do the
job, as bounce buffering will not copy the data if the buffer is already below
the DMA mask.

In short, mixing software and hardware IOs to satisfy a bio request is quite
possible, but you'll want to be very careful about how you do it so that you
always copy when you need to but never when you don't.
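
For reference, a generic sketch of the probe-time pattern described above, not
taken from any particular driver; the DMA_64BIT_MASK/DMA_32BIT_MASK constants
are the 2.6.18-era spellings (later kernels spell them DMA_BIT_MASK(64) and
DMA_BIT_MASK(32)).

#include <linux/pci.h>
#include <linux/dma-mapping.h>

static int set_dma_masks(struct pci_dev *pdev)
{
    /* Prefer 64-bit addressing so IOs above 4GB need no bounce buffering. */
    if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK) &&
        !pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK))
        return 0;

    /* Fall back to 32-bit; IOs above 4GB will then be bounced
     * (at the bio level or in the swiotlb). */
    if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK) &&
        !pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK))
        return 0;

    return -EIO;
}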
Comment 19 Atul Mukker 2008-02-06 15:30:41 EST
> as long as the software access is not occurring *during* the pci_map_sg() section.
:-( This is not possible in our stack. The data for a read operation on a
degraded RAID 5 volume is generated entirely by our raid core layer, outside of
Linux kernel code. The RAID core would use a mix of direct reads from the disk
(DMA) and parity (software). By no means can we separate these two operations.
Therefore in our current design, it has to be:
pci_map_sg()
kmap()
    RAID 5 core()
kunmap()
pci_unmap_sg()

> dma_set_mask() takes a 64-bit mask.  If you're setting it to 32-bits, then you
> are forcing all IOs above 4GB to use bounce buffering

This is a software limitation we have right now and at this moment cannot be
changed.

> Of course, for software raid5 writes, you probably _do_ want to copy
This is already taken care of in our raid core, which shadows all write buffers.

So our quest for how to obtain kernel virtual address for DMA address returned
by pci_map_sg() (portable to i686 and x86-64) should be addressed.
Comment 20 Stephen Tweedie 2008-02-07 11:03:57 EST
> So our quest for how to obtain kernel virtual address for DMA address returned
> by pci_map_sg() (portable to i686 and x86-64) should be addressed.

I don't know of any standard way to do this.

You cannot simply use the virtual address from sg_page(), since as we've seen
above, that memory will get overwritten on unmap_sg if we're using swiotlb.

And you cannot use sg_dma_address() in general, since in the case that there is
a true IOMMU, the dma address is remapped by hardware-specific mechanisms which
don't expose a generic way to get back to the virtual address.  (virt_to_bus()
and friends do NOT take IOMMU mappings into account.)

But your driver ought to be able to decide for itself when the map_sg and
unmap_sg operations are occurring, and synchronise the software accesses with
that.  I assume it's a pseudo-scsi driver, using drivers/scsi/scsi_lib.c?  If
so, I don't think there's anything in the scsi core mid-layer which calls
pci_map_sg() for you --- it's entirely up to your own driver when that gets
called, so you should be able to synchronise the software vs. hardware accesses
appropriately.

In particular, if you can decide in advance which segments are going to be
filled by software and which by hardware, you should be able to omit the
software-populated segments from the scatterlist before you do the pci_map_sg(),
 and you'll not have any problems doing the remaining software and hardware
accesses in parallel.
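
A rough sketch of that idea, under the assumption that the driver can classify
whole segments up front; segment_is_software() is a hypothetical predicate
supplied by the RAID core, and no attempt is made here to split a single segment
into hardware and software pieces.

#include <linux/pci.h>
#include <linux/scatterlist.h>

static int map_hw_segments(struct pci_dev *pdev,
                           struct scatterlist *orig, int nents,
                           struct scatterlist *hw_sgl)
{
    int i, hw_nents = 0;

    for (i = 0; i < nents; i++) {
        if (segment_is_software(&orig[i]))
            continue;              /* the CPU will fill this one via kmap later */
        hw_sgl[hw_nents++] = orig[i];
    }

    if (hw_nents == 0)
        return 0;                  /* nothing for the hardware to do */

    /* Only the hardware-bound segments are subject to bounce buffering. */
    return pci_map_sg(pdev, hw_sgl, hw_nents, PCI_DMA_FROMDEVICE);
}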
Comment 21 Atul Mukker 2008-02-07 11:27:32 EST
> In particular, if you can decide in advance which segments are going to be
> filled by software and which by hardware,

This is not possible at all for us since, realistically speaking, the way RAID 5
operations happen, a single segment itself can be generated partly by DMA and
partly by software (e.g., when a segment falls across 2 disks in RAID 5 and
one of the disks is missing).

Changing the train of thought a little bit, why does the XEN kernel show this
behavior and not any other kernel (at least so far)?

What effect does a system memory footprint have on this behavior? The issue does
not happen with 1GB memory, but happens with 4GB.

How likely is it for a stack like ours to encounter this behavior on non-xen kernels?

This information would be very important for us to know, so that we can chart
our course for RAID 5 support on various kernels.

Thanks a lot for your continued support.
-Atul
Comment 22 Stephen Tweedie 2008-02-07 12:21:28 EST
You can still create a new scatterlist which represents only those regions which
are going to be filled by hardware, removing all portions to be filled by
software even if that requires chopping an existing sg segment into multiple pieces.

> Changing the train of thought a little bit, why does the XEN kernel show this
> behavior and not any other kernel (at least so far)?

Because the Xen kernel is the first case you've discovered in testing that uses
a swiotlb, I imagine.

> What effect does a system memory footprint have on this behavior? The issue
> does not happen with 1GB memory, but happens with 4GB.

It will only happen if swiotlb is triggered.  Because your DMA mask is set to
4G, any IO to memory above that will trigger the swiotlb remapping.  And on a
system with 4GB physical memory, some of the bottom 4G physical address space is
reserved for the PCI IO aperture, and hence some of the physical memory will
usually be remapped by the hardware to appear above the 4G boundary and will
trigger bounce buffering.

> How likely is it for a stack like ours to encounter this behavior on non-xen kernels?

It won't happen on i386, but will happen on x86_64 on machines with >=4GB if
swiotlb is enabled there.  You can force swiotlb on with "swiotlb=force", but it
will also be enabled automatically if you have any memory above the 4GB physical
boundary and the hardware has no IOMMU (unless the user has manually disabled it
with "iommu=off".)

So it is quite possible to encounter this situation on a non-xen x86_64 kernel.
