Bug 666908
| Summary: | rhel5 32-bit DomU dies when bringing up vif on 64-bit host with 256G mem | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Andrew Jones <drjones> |
| Component: | kernel-xen | Assignee: | Laszlo Ersek <lersek> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.6 | CC: | lersek, xen-maint, yuzhang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-24 15:07:27 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 514490 | | |
| Attachments: | | | |
Description
Andrew Jones
2011-01-03 17:37:50 UTC
$ crash 2011-0104-0814.49-rhel55.1.core \
/usr/lib/debug/lib/modules/2.6.18-194.el5xen/vmlinux
KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.el5xen/vmlinux
DUMPFILE: 2011-0104-0814.49-rhel55.1.core
CPUS: 4
DATE: Tue Jan 4 07:35:17 2011
UPTIME: 03:15:21
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 109
NODENAME: dhcp47-60.lab.bos.redhat.com
RELEASE: 2.6.18-194.el5xen
VERSION: #1 SMP Tue Mar 16 22:08:06 EDT 2010
MACHINE: i686 (2264 Mhz)
MEMORY: 4 GB
PANIC: "kernel BUG at arch/i386/mm/hypervisor.c:197!"
PID: 2809
COMMAND: "dhclient-script"
TASK: c122e550 [THREAD_INFO: ea062000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
Manually folding the long lines:
crash> log
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:197!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /class/misc/autofs/dev
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 xfrm_nalgo crypto_api loop dm_multipath scsi_dh scsi_mod
parport_pc lp parport xennet pcspkr dm_raid45 dm_message dm_region_hash
dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd
uhci_hcd ohci_hcd ehci_hcd
CPU: 2
EIP: 0061:[<c0418c2e>] Not tainted VLI
EFLAGS: 00010282 (2.6.18-194.el5xen #1)
EIP is at xen_pgd_pin+0x58/0x66
eax: ffffffea ebx: ea062f00 ecx: 00000001 edx: 00000000
esi: 00007ff0 edi: 00000000 ebp: c375e550 esp: ea062f00
ds: 007b es: 007b ss: 0069
Process dhclient-script (pid: 2809, ti=ea062000 task=c122e550 task.ti=ea062000)
Stack: 00000002 028e7a50 00000000 005650a0 00000000 c04152f3 eac4a740 c041531c
ea2f959c c041f273 ea062fbc bf985f7c 01200011 00000000 c375e550 c122e550
eac4a740 eac4a200 ea2f95a8 ea2f95bc ea2f95b8 eac4a740 eb315c60 bf985f7c
Call Trace:
[<c04152f3>] __pgd_pin+0x2a/0x37
[<c041531c>] mm_pin+0x1c/0x23
[<c041f273>] copy_process+0xbb2/0x11d1
[<c041fae0>] do_fork+0x41/0x168
[<c04031c1>] sys_clone+0x28/0x2d
[<c0405413>] syscall_call+0x7/0xb
=======================
Code: 63 c0 a1 28 82 77 c0 8b 14 90 81 e2 ff ff ff 7f 89 54 24 04 89 e3 b9 01
00 00 00 31 d2 be f0 7f 00 00 e8 16 87 fe ff 85 c0 79 08 <0f> 0b c5 00 02 c4 63
c0 83 c4 0c 5b 5e c3 56 89 c2 53 83 ec 0c
EIP: [<c0418c2e>] xen_pgd_pin+0x58/0x66 SS:ESP 0069:ea062f00
<0>Kernel panic - not syncing: Fatal exception
crash> bt 2809
PID: 2809 TASK: c122e550 CPU: 2 COMMAND: "dhclient-script"
#0 [ea062da0] xen_panic_event at c0409d19
#1 [ea062db4] notifier_call_chain at c0620443
#2 [ea062dc8] panic at c042059b
#3 [ea062de4] die at c0406077
#4 [ea062e18] do_invalid_op at c0406666
#5 [ea062ec8] error_code (via invalid_op) at c0405595
EAX: ffffffea EBX: ea062f00 ECX: 00000001 EDX: 00000000 EBP: c375e550
DS: 007b ESI: 00007ff0 ES: 007b EDI: 00000000
CS: 0061 EIP: c0418c2e ERR: ffffffff EFLAGS: 00010282
#6 [ea062efc] xen_pgd_pin at c0418c2e
#7 [ea062f14] __pgd_pin at c04152ee
#8 [ea062f1c] mm_pin at c0415317
#9 [ea062f24] copy_process at c041f26e
#10 [ea062f68] do_fork at c041fadb
#11 [ea062fa4] sys_clone at c04031bc
#12 [ea062fb8] system_call at c040540c
EAX: ffffffda EBX: 01200011 ECX: 00000000 EDX: 00000000
DS: 007b ESI: 00000000 ES: 007b EDI: b7f78bc8
SS: 007b ESP: bf985f7c EBP: bf985fb8
CS: 0073 EIP: 00b2c402 ERR: 00000078 EFLAGS: 00000282
Some steps are not displayed between copy_process() and mm_pin():
kernel/arch/i386/mm/
#6 [ea062efc] xen_pgd_pin at c0418c2e -> hypervisor.c:197
#7 [ea062f14] __pgd_pin at c04152ee -> pgtable-xen.c:545
#8 [ea062f1c] mm_pin at c0415317 -> pgtable-xen.c:567
_arch_dup_mmap -> pgtable-xen.c:599
kernel/include/asm-i386/mach-xen/asm/
arch_dup_mmap -> mmu.h:31
kernel/kernel/
dup_mmap -> fork.c:398
dup_mm -> fork.c:617
copy_mm -> fork.c:673
#9 [ea062f24] copy_process at c041f26e -> fork.c:1252
kernel/arch/i386/mm/pgtable-xen.c:
542 static void __pgd_pin(pgd_t *pgd)
543 {
544 pgd_walk(pgd, PAGE_KERNEL_RO);
545 xen_pgd_pin(__pa(pgd));
546 set_bit(PG_pinned, &virt_to_page(pgd)->flags);
547 }
kernel/arch/i386/mm/hypervisor.c:
186 void xen_pgd_pin(unsigned long ptr)
187 {
188 struct mmuext_op op;
189 #ifdef CONFIG_X86_64
190 op.cmd = MMUEXT_PIN_L4_TABLE;
191 #elif defined(CONFIG_X86_PAE)
192 op.cmd = MMUEXT_PIN_L3_TABLE;
193 #else
194 op.cmd = MMUEXT_PIN_L2_TABLE;
195 #endif
196 op.arg1.mfn = pfn_to_mfn(ptr >> PAGE_SHIFT);
197 BUG_ON(HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF) < 0);
198 }
I have the following impression: the guest kernel creates/copies the page tables
for the new process (in dup_mmap()), then sets them recursively to read-only
(pgtable-xen.c:544 calls pgd_walk()), "synchronizing" the mappings with the
hypervisor. Then the guest kernel asks the hypervisor to pin the highest-level
directory. The hypervisor denies the request.
I looked at the corresponding hv location (do_mmuext_op()/MMUEXT_PIN_L2_TABLE,
at xen/arch/x86/mm.c:2347), and that region does log some messages if the
pinning fails (eg. "Error while pinning mfn %lx" and "Mfn %lx already pinned"),
but unfortunately, the hv log on hp-dl580g7-01.lab.bos.redhat.com is already
overflowed with:
(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
I don't recall seeing the messages "Error while pinning mfn %lx" and "Mfn %lx
already pinned" when this problem was reproducing more regularly. However, it's
certainly worth looking for them if we can get it to reproduce again. The other
note is that there were reports it wasn't happening with rhel4 guests. Probably
a good next step would be to run both rhel4 and rhel5 guests in boot loops
(that bring up the vif) and see if it reproduces for either or both. If it
does, we might get some new data; if it never does for rhel4, then maybe we can
compare the code.

I couldn't reproduce the bug at installation time or at first boot.
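If it does reproduce, a simple filter over the "xm dmesg" output can watch for those two messages; a sketch against simulated log lines (assumption: the message texts are exactly as cited from xen/arch/x86/mm.c above, and the printf stands in for live hypervisor log output):

```shell
# Filter hv log lines for the two pinning failure messages.
# The printf simulates "xm dmesg" output for demonstration.
printf '%s\n' \
  '(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version' \
  '(XEN) mm.c: Error while pinning mfn 273e1d1' \
  '(XEN) mm.c: Mfn 273e1d1 already pinned' \
| grep -E 'Error while pinning mfn|already pinned'
```

On a live host one would pipe `xm dmesg` into the same grep instead of the printf.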
Host:
# uname -psrn
Linux hp-dl580g7-01.lab.bos.redhat.com 2.6.18-238.el5xen x86_64
Guests were created (in parallel) with:
----v----
v-i()
{
local NAME="lacos-bz666908-$1-32bit-pv"
local VARIANT="$2"
local LOCATION="http://download.devel.redhat.com/$3"
virt-install --connect=xen --name="$NAME" --ram=4096 --arch=i686 --vcpus=4 \
--os-type=linux --os-variant="$VARIANT" -p --location="$LOCATION" \
--disk=path=/var/lib/xen/images/"$NAME".img,size=10,sparse=true \
--network=network:default --debug --prompt --extra-args=text
}
v-i rhel49 rhel4 nightly/latest-link-RHEL-4-U9-i386-AS/
v-i rhel55 rhel5.4 released/RHEL-5-Server/U5/i386/os/
v-i rhel56 rhel5.4 nightly/latest-RHEL5.6-Server/5/i386/os/
----^----
In all three guests, I created a simple partitioning scheme (only / and swap),
disabled IPv6 in rhel5[56] (rhel4 doesn't offer it), and stripped down the list
of packages to install. I selected DHCP, and accepted the option to start
DHCP at boot.
Testing:
for G in 49 55 56; do
printf '==== %s ====\n' "$G"
ssh lacos-$G \
'uname -psrn;' \
'lsmod | grep xennet;' \
'ifconfig | fgrep -B1 "inet addr:192.168.122."'
printf '\n'
done
==== 49 ====
Linux localhost.localdomain 2.6.9-94.ELxenU i686
xennet 31177 0 [permanent]
eth0 Link encap:Ethernet HWaddr 00:16:36:6C:A6:15
inet addr:192.168.122.18 Bcast:192.168.122.255 Mask:255.255.255.0
==== 55 ====
Linux localhost.localdomain 2.6.18-194.el5xen i686
xennet 29769 0 [permanent]
eth0 Link encap:Ethernet HWaddr 00:16:36:76:FF:05
inet addr:192.168.122.85 Bcast:192.168.122.255 Mask:255.255.255.0
==== 56 ====
Linux localhost.localdomain 2.6.18-238.el5xen i686
xennet 30153 0 [permanent]
eth0 Link encap:Ethernet HWaddr 00:16:36:1C:E6:63
inet addr:192.168.122.65 Bcast:192.168.122.255 Mask:255.255.255.0
I'll try to bring the interfaces up/down from the consoles, and/or start reboot
loops.
# xm vcpu-list | sed -n '1p;/^lacos/p'
Name                            ID  VCPUs  CPU  State  Time(s)  CPU Affinity
lacos-bz666908-rhel49-32bit-pv  11      0   41  -b-       13.1  8-15,40-47
lacos-bz666908-rhel49-32bit-pv  11      1   46  -b-        5.0  8-15,40-47
lacos-bz666908-rhel49-32bit-pv  11      2   47  -b-        6.6  8-15,40-47
lacos-bz666908-rhel49-32bit-pv  11      3   45  -b-        4.9  8-15,40-47
lacos-bz666908-rhel55-32bit-pv  12      0   55  -b-       11.1  16-23,48-55
lacos-bz666908-rhel55-32bit-pv  12      1   52  -b-        5.6  16-23,48-55
lacos-bz666908-rhel55-32bit-pv  12      2   53  -b-        3.6  16-23,48-55
lacos-bz666908-rhel55-32bit-pv  12      3   49  -b-        6.9  16-23,48-55
lacos-bz666908-rhel56-32bit-pv  13      0   62  -b-       11.8  24-31,56-63
lacos-bz666908-rhel56-32bit-pv  13      1   60  -b-        5.5  24-31,56-63
lacos-bz666908-rhel56-32bit-pv  13      2   63  -b-        4.8  24-31,56-63
lacos-bz666908-rhel56-32bit-pv  13      3   61  -b-        5.2  24-31,56-63

I intended comment 4 as an anchor to the "2010-07-12 04:53:28 EDT" entry in bug
616827 comment 0 -- I think the collected vcpu's of these three guests are
distributed well enough on the physical CPU's, and so hopefully on machine RAM.

I've been running the following test for almost three hours now, in each of
the three guests (on the console), in parallel:

K=0
while :; do
  uname -psrn
  printf '%u\n' $((K++))
  ifdown eth0
  ifup eth0
  sleep 1
done

"K" got in the range of 4800-5000 on each. I saw nothing special.

----v----
Determining IP information for eth0... done.
Linux localhost.localdomain 2.6.9-94.ELxenU i686
4837
ip_tables: (C) 2000-2002 Netfilter core team
----*----
Linux localhost.localdomain 2.6.18-194.el5xen i686
4883
Determining IP information for eth0... done.
----*----
Linux localhost.localdomain 2.6.18-238.el5xen i686
4938
Determining IP information for eth0... done.
----^----

I also created a rolling diff of "xm dmesg" in the host:

----v----
touch last
while :; do
  xm dmesg >next
  date --utc +'==== %Y-%m-%d %H:%M:%S %Z ===='
  diff last next | grep '^>'
  'mv' next last
  sleep 10
done 2>&1 \
| tee -i -a cumulative
----^----

The file "cumulative" only contains timestamps -- from 2011-01-07 13:46:21 UTC
to 2011-01-07 16:36:17 UTC -- and the original "Allowing physinfo call with
newer ABI version" messages.

Reboot loops are next.

Left the three guests in a reboot loop Friday evening. Now only the RHEL55 is
there and running; the other two have disappeared (not present in "xm list").
on_crash="preserve" doesn't seem to have had any effect; there is no new dump
under /var/lib/xen/dump. Unfortunately, not even the least recently rotated
xend.log has any trace of them. The hv dmesg is as useless as before:

# xm dmesg | uniq -c
      1 sctl.c:51: Allowing physinfo call with newer ABI version
    170 (XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
      1 (XEN) event_channel.c:181:d0 EVTCHNOP failure: domain 0, error -22, line 181
     80 (XEN) sysctl.c:51: Allowing physinfo call with newer ABI version
      1

It's anybody's guess when the EVTCHNOP failure was logged. I'll try to do
something with the consoles and restart the loops.

To capture the guest, host and hypervisor consoles and add some timestamping, I
did the following. I installed "screen" on my workstation, with the following
~/.screenrc:
----v----
logfile /path/to/common/screenlog
logfile flush 0
----^----
Then I started screen. I created five regions (C-a S), focussed each region in order (C-a tab), in each region I started a new shell (C-a c). In the first region / shell I started this script to generate timestamps:
----v----
while sleep 60; do
date --utc +'==== %s %Y-%m-%d %H:%M:%S %Z'
done
----^----
In the second -- host and hv console:
----v----
console -s -l lersek -M console.lab.bos.redhat.com \
hp-dl580g7-01.lab.bos.redhat.com \
| sed --unbuffered 's,^,host/xen -- ,'
----^----
In the third to fifth -- guest consoles:
----v----
while :; do
ssh -t root.bos.redhat.com xm console \
lacos-bz666908-rhel49-32bit-pv \
| sed --unbuffered 's,^,rhel49 -- ,'
sleep 3
done
----^----
(and similarly for rhel55 and rhel56).
... Added a sixth window -- xend.log:
----v----
while :; do
ssh root.bos.redhat.com tail --follow=name \
--retry --max-unchanged-stats=1 /var/log/xen/xend.log \
| sed --unbuffered 's,^,xend -- ,'
sleep 3
done
----^----
This appears to interleave all console outputs and the "timestamp stream" in the commonly appended-to "screenlog" file.
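Because every stream carries its own prefix, the merged screenlog can later be split back apart per source; a simulated sketch (the printf stands in for the real screenlog contents, and the prefix follows the sed rewriting set up above):

```shell
# Pull one guest's console lines back out of the merged, prefixed
# screenlog stream; the printf simulates two interleaved log lines.
printf '%s\n' \
  'rhel49 -- Restarting system.' \
  'rhel55 -- BUG: unable to handle kernel paging request' \
| sed -n 's/^rhel55 -- //p'
# prints: BUG: unable to handle kernel paging request
```

On the real file one would use `tr -d '\r' <screenlog | sed -n 's/^rhel55 -- //p'`, as in the egrep tally later in this report.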
(In reply to comment #7)
> ~/.screenrc:
>
> ----v----
> logfile /path/to/common/screenlog
> logfile flush 0
> ----^----
>
> Then I started screen. I created five regions (C-a S), focussed each region
> in order (C-a tab), in each region I started a new shell (C-a c).

Additionally, logging must be turned on per-window (C-a H).

This time the RHEL49 remained running and the other two disappeared.
$ tr -d '\r' <screenlog | egrep '^rhel(49|55|56) -- Restart' | sort | uniq -c
565 rhel49 -- Restarting system.
48 rhel55 -- Restarting system.
16 rhel56 -- Restarting system.
Paolo helped me interpret the xend logs and it seems we stumbled on a xend problem that stops the reboot loop at some point. I'll add some debug statements to xend and restart it -- at least let's gather some info on that problem.
Meanwhile, I still can't reproduce the crash in this bug.
Andrew removed "numa=on" from the hv command line, and the RHEL55 guest started
to crash reproducibly. The RHEL49 and RHEL56 guests had no such problem, hence
the problem might be related to the flipping netfront/netback.

This is the RHEL55 error message:

rhel55 -- BUG: unable to handle kernel paging request at virtual address c098b012
rhel55 -- printing eip:
rhel55 -- 00332000 -> *pde = 00000011:ff243027
rhel55 -- BUG: unable to handle kernel paging request at virtual address 15555020
rhel55 -- printing eip:
rhel55 -- 00332000 -> *pde = 00000011:ff70e027
rhel55 -- BUG: unable to handle kernel paging request at virtual address 15555550
rhel55 -- printing eip:
rhel55 -- 00332000 -> *pde = 00000011:ff70e027

and the last three lines keep repeating. The guest seems to be spinning, and
cannot be shut down/restarted by normal means; it must be forced off. It also
leaves behind a zombie domain then. I dumped its core when it was spinning, for
later analysis.

After the forced off & reboot, it crashed again, with the same symptoms; the
messages looked like:

rhel55 -- BUG: unable to handle kernel paging request at virtual address c03b5012
rhel55 -- printing eip:
rhel55 -- 00b7a000 -> *pde = 00000011:d307a027
rhel55 -- BUG: unable to handle kernel paging request at virtual address 15555008
rhel55 -- printing eip:
rhel55 -- 00b7a000 -> *pde = 00000011:ffc20027

(There are many such xen console entries:

host/xen -- (XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d0 Error pfn 1073abe: rd=ffff8300bf410080, od=0000000000000000, caf=00000000, taf=0000000000000000
host/xen -- (XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d3 Error pfn 2733d40: rd=ffff8300bebfa080, od=0000000000000000, caf=00000000, taf=0000000000000000
host/xen -- (XEN) mm.c:654:d3 Error getting mfn 2912c81 (pfn 7d9c1f) from L1 entry 8000002912c81063 for dom3

but they were produced at unrelated times and should pertain to different
domains.)
Is it possible to "flip in" a page that satisfies the bitsize check for dom0
(64-bit), but doesn't for domU (32-bit)?

I tried installing guests on intel-e7450-512-1.englab.nay.redhat.com, which has
32 cores and 1T of RAM, in the same way as described in comment 3. (Except I
used "released/RHEL-5-Server/U6/i386/os/" as the RHEL56 installation path.)

The RHEL55 guest got domid 3. The installation froze in anaconda on the
"Sending request for IP information for eth0..." page. According to
virt-manager, this guest is spinning now, taking up 2.09% CPU. "xm dmesg" says
(again cluttered with useless ABI version warnings):

(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225: d3 Error pfn XXXXXXX: rd=ffff8300bf208080, od=0000000000000000, caf=00000000, taf=0000000000000000
(XEN) mm.c:654:d3 Error getting mfn XXXXXXX (pfn 7c8502) from L1 entry 800000XXXXXXX063 for dom3

with XXXXXXX iterating over { 273e1d1 27af098 276d9a5 275f0c2 2731d07 275348f
2787298 27a2230 27a1bf8 }. (The other values don't change.) Then there's such a
message at the end:

(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225: d3 Error pfn 2707902: rd=ffff8300bf208080, od=0000000000000000, caf=00000000, taf=0000000000000000
(XEN) mm.c:516:d3 Could not get page ref for pfn 2707902
(XEN) mm.c:2760:d3 Could not get page for mach->phys update

The RHEL49 guest progressed with the installation nicely, and so did the RHEL56
guest.

(In reply to comment #12)
> The RHEL55 guest got domid 3. The installation froze in anaconda on the
> "Sending request for IP information for eth0..." page. According to
> virt-manager, this guest is spinning now, taking up 2.09% CPU.
The last messages on the guest console are:

Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
XENBUS: Device with no driver: device/vbd/51712
XENBUS: Device with no driver: device/vif/0
Initalizing network drop monitor service
Freeing unused kernel memory: 176k freed
Write protecting the kernel read-only data: 394k

(In reply to comment #12)
> The RHEL55 guest got domid 3. The installation froze in anaconda on the
> "Sending request for IP information for eth0..." page. According to
> virt-manager, this guest is spinning now, taking up 2.09% CPU.

I dumped the core:

KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.el5xen/vmlinux
DUMPFILE: 2011-0119-0017.52-lacos-bz666908-rhel55-32bit-pv.3.core
CPUS: 4
DATE: Tue Jan 18 11:17:46 2011
UPTIME: 01:52:48
LOAD AVERAGE: 1.99, 1.97, 1.91
TASKS: 54
NODENAME: localhost.localdomain
RELEASE: 2.6.18-194.el5xen
VERSION: #1 SMP Tue Mar 16 22:08:06 EDT 2010
MACHINE: i686 (2398 Mhz)
MEMORY: 4 GB
PANIC: ""
PID: 0
COMMAND: "swapper"
TASK: c06882c0 (1 of 4) [THREAD_INFO: c0703000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
WARNING: panic task not found

Backtrace:

bt: cannot resolve stack trace:
#0 [c0703ea0] _spin_lock at c061e94b
#1 [c0703ea4] cache_alloc_refill at c046e298
#2 [c0703ee8] do_IRQ at c0406f3d
bt: text symbols on stack:
[c0703ea0] vprintk at c0420da1
[c0703ea4] cache_alloc_refill at c046e29d
[c0703ec4] vprintk at c0420da1
[c0703ec8] monotonic_clock at c0408068
[c0703ed8] do_softirq at c0406dfb
[c0703ee4] __do_IRQ at c0447d54
[c0703ee8] do_IRQ at c0406f42
[c0703f0c] printk at c0421011
[c0703f1c] search_module_extables at c0437e6b
[c0703f34] do_page_fault at c062013c
[c0703f40] search_module_extables at c0437e6b
[c0703f58] do_page_fault at c062013c
[c0703f9c] do_page_fault at c061f857
[c0703fa0] do_page_fault at c061f857
[c0703fa8] error_code at c0405597
[c0703fb8] do_page_fault at c061f857
[c0703fe0] do_page_fault at c061f857
[c0703ff0] L6 at c040007b
[c0703ff8] do_page_fault at c061f87e

log:

BUG: unable to handle kernel paging request at virtual address c01a8012
printing eip:
ee343342
0100a000 -> *pde = 00000011:bf004027
BUG: unable to handle kernel paging request at virtual address 15555000
printing eip:
c0620213
0100a000 -> *pde = 00000011:bf007027
19x: BUG: unable to handle kernel paging request at virtual address 15555550
printing eip:
c0620213
0100a000 -> *pde = 00000011:bf007027
BUG: unable to handle kernel paging request at virtual address 08179a80
printing eip:
c061fe45
0100a000 -> *pde = 00000011:bf007027
21x: BUG: unable to handle kernel paging request at virtual address 08179a80
printing eip:
c0620213
0100a000 -> *pde = 00000011:bf007027
BUG: unable to handle kernel paging request at virtual address 08179a80
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c061fe45
0100a000 -> *pde = 00000011:bf007027
21x: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0620213
0100a000 -> *pde = 00000011:bf007027
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0620213

The problem should have something to do with netback ballooning up for flipping
receivers. When netback balloons up (kernel/drivers/xen/netback/netback.c):
114 static int check_mfn(int nr)
115 {
116 struct xen_memory_reservation reservation = {
117 .extent_order = 0,
118 .domid = DOMID_SELF
119 };
120
121 if (likely(alloc_index >= nr))
122 return 0;
123
124 set_xen_guest_handle(reservation.extent_start, mfn_list + alloc_index);
125 reservation.nr_extents = MAX_MFN_ALLOC - alloc_index;
126 alloc_index += HYPERVISOR_memory_op(XENMEM_increase_reservation,
127 &reservation);
128
129 return alloc_index >= nr ? 0 : -ENOMEM;
130 }
it doesn't explicitly initialize "reservation.address_bits", which then defaults to 0u, which then means (comment from xen/include/public/memory.h, lines 57-59):
"Maximum # bits addressable by the user of the allocated region (e.g., I/O devices often have a 32-bit limitation even in 64-bit systems). If zero then the user has no addressing restriction."
I think that the 64 bit host / netback is okay with whatever frames it gets from the hypervisor, but when the page is flipped to a 32bit pv guest, it may not be able to use it.
I think this theory is supported by the mfn's in comment 12. They all look like
27XXXXX, for example 273e1d1. The latter is binary "10 01110011 11100001
11010001" (26 bits); shifted left by PAGE_SHIFT (12) to form the machine
address, it yields a 38-bit value, far beyond what a 32-bit guest can address.
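The bit-width claim can be sanity-checked with shell arithmetic (a quick sketch; PAGE_SHIFT is 12 on these kernels, and 4 GiB is the 32-bit boundary):

```shell
# Does mfn 0x273e1d1 (26 bits, from comment 12), shifted by PAGE_SHIFT,
# still fit below the 32-bit (4 GiB) boundary?
mfn=$(( 0x273e1d1 ))
addr=$(( mfn << 12 ))
if (( addr >= (1 << 32) )); then
    echo "machine address $addr is above 4 GiB"
else
    echo "machine address $addr is below 4 GiB"
fi
```

The check lands in the "above 4 GiB" branch, consistent with the overflow argument above.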
Netback could set address_bits to 32:
----*----
diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 8fb8927..ad8a581 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -115,6 +115,7 @@ static int check_mfn(int nr)
{
struct xen_memory_reservation reservation = {
.extent_order = 0,
+ .address_bits = 32u,
.domid = DOMID_SELF
};
----*----
This should force "down" all (at most 64) pages that are requested and pooled by netback for flipping purposes. In xen/common/memory.c:
517 if ( (reservation.address_bits != 0) &&
518 (reservation.address_bits <
519 (get_order_from_pages(max_page) + PAGE_SHIFT)) )
520 {
521 if ( reservation.address_bits <= PAGE_SHIFT )
522 return start_extent;
523 args.memflags = MEMF_bits(reservation.address_bits);
524 }
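As a worked instance of the check above (assuming, hypothetically, a host with a flat 1 TiB of machine memory, in the ballpark of intel-e7450-512-1): max_page is 2^40/4096 = 2^28 frames, get_order_from_pages(max_page) + PAGE_SHIFT is 28 + 12 = 40, and address_bits=32 is below that, so MEMF_bits(32) would take effect:

```shell
# Worked instance of the xen/common/memory.c bitsize check, assuming
# a hypothetical host with 1 TiB of machine memory.
max_page=$(( (1 << 40) / 4096 ))      # number of machine frames
order=0                               # get_order_from_pages(max_page)
while (( (1 << order) < max_page )); do order=$(( order + 1 )); done
host_bits=$(( order + 12 ))           # + PAGE_SHIFT
echo "host addressable width: $host_bits bits"
(( 32 < host_bits )) && echo "address_bits=32 takes effect (MEMF_bits(32))"
```

On a smaller host whose width is at most 32 bits, the restriction would be skipped entirely, which may be why the bug only shows up on large-memory machines.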
I've built a -194 host kernel with this patch (https://brewweb.devel.redhat.com/taskinfo?taskID=3046375). Unfortunately, running under this kernel only made things worse on intel-e7450-512-1, because now:
(a) Not even the RHEL56 guest can be installed -- the VM crashes as soon as virt-install tries to boot it. I changed the on_crash action in Guest.py to "preserve", but even so the guest's core couldn't be dumped. (*)
(b) A 64bit PV RHEL6 guest could be created and started, but it hangs during installation.
I'm also brewing a -240 host kernel with this patch (https://brewweb.devel.redhat.com/taskinfo?taskID=3051121) to see if it behaves differently. I think (or hope) this has a chance to work better, because the patch for bug 616827 changed how bitsizes are clamped.
However, I think that to solve this issue completely, we'll need both the bitsize check fix (ie. the solution to bug 616827) *and* probably a smarter setting for address_bits in netback. We must make the hv allocate pages for netback *exactly* the way it (correctly) allocates pages for 32bit pv guests elsewhere. That needs two ingredients: "correctly", ie. the fix for bug 616827, and "netback", that is, instead of a literal 32, we must pass in a bitsize requirement in the vein of "d->arch.physaddr_bitsize", where d is a 32bit pv guest domain.
(*) xm dmesg excerpt:
(XEN) Unhandled page fault in domain 11 on VCPU 0 (ec=0000)
(XEN) Pagetable walk from 000000000174cd1c:
(XEN) L4[0x000] = 0000002fd3348027 000000000000157c
(XEN) L3[0x000] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 11 (vcpu#0) crashed on cpu#94:
(XEN) ----[ Xen-3.1.2-194.el5.32bit_vif_crash_on_256g_bz666908 x86_64 debug=n Not tainted ]----
(XEN) CPU: 94
(XEN) RIP: e019:[<00000000c0718bad>]
(XEN) RFLAGS: 0000000000000203 CONTEXT: guest
(XEN) rax: 000000000174cd1c rbx: 00000000c063f2e2 rcx: 000000000000001a
(XEN) rdx: 0000000002fd3347 rsi: 00000000d3347027 rdi: 000000000000002f
(XEN) rbp: 000000000002d7fe rsp: 00000000c0708ef4 r8: 0000000000000000
(XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
(XEN) r12: 0000000000000000 r13: 0000000000000000 r14: 0000000000000000
(XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000026b0
(XEN) cr3: 0000002fd48cf000 cr2: 000000000174cd1c
(XEN) ds: e021 es: e021 fs: 0000 gs: 0000 ss: e021 cs: e019
(XEN) Guest stack trace from esp=c0708ef4:
(XEN) 00000000 c0718bad 0001e019 00010003 00000000 c062709d c0639b2b 00000000
(XEN) 00000000 c157c000 00000000 c063f74b c0420fd1 c081a360 c081a360 00000000
(XEN) 01190aff c071b86e 00000000 c1579000 0002d7fe 0002d7fe 00000000 00000000
(XEN) 00000000 00100000 c073a07e 00000000 c073a120 c077e862 c0714160 c063b6ae
(XEN) 000002d7 c063b691 00000d30 00000000 c0708fe8 00000000 00000000 00000062
(XEN) 00000000 00000000 75080800 c1579000 00000000 00000000 c0420fd1 c0627000
(XEN) c0708fdc 0000009d c073a083 c0708fdc 75080800 c1579000 00000000 00000000
(XEN) c070d6df c0627000 c0639b28 00000000 00000000 c077e800 75080800 c1579000
(XEN) 00000000 00000000 c040006f c070d6a5 c070d6ad c070d6b5 c04032fc c040338c
(XEN) c04033d9 c040343a c040364d c0403685 c0403777 c061f3f8 c061f409 c061f418
(XEN) c061f422 c04049db c0404ac9 c0404f54 c0404fa3 c0404fc4 c04051da c0621722
(XEN) c0406fd2 c0407517 c040751f c040790a [...]
coredump attempt with xm:
Dumping core of domain: lacos-bz666908-rhel56-32bit-pv ...
Error: Failed to dump core: (1, 'Internal error', 'p2m_size < nr_pages -1 (0 < fffff')
> (b) A 64bit PV RHEL6 guest could be created and started, but it hangs during
> installation.

Turns out the rhel6 guest was hung for a long time (waiting on network?), but
not forever; it eventually completed the install and was bootable.

I completely agree with the path we're going down here, which is attempting to
answer and experiment with a solution to my question in comment 11, but there
are still a couple of things we need to keep in mind.

1) If the fix needs to be in the host (netback), then why do rhel 4.9 32-bit
guests work?

2) Are there any upstream patches along this theme available? I.e. do recent
SLES-11 32-bit guest kernels have this problem?

(In reply to comment #16)
> 1) If the fix needs to be in the host (netback), then why do rhel 4.9 32-bit
> guests work?

The problem is in the flipping-related part of netback. RHEL49 defaults to
copying. A recent RHEL5 netback doesn't try to balloon up (correctly) if the
guest requests copying. Earlier RHEL5 netback versions (like -194, 5.5)
incorrectly *do* try to balloon up, irrespective of whether the guest requested
copying or flipping. But on my patched -194 build, in comment 15, I only cared
about the RHEL49 guest until anaconda showed the first screen (language
selection IIRC). So I didn't go through the installation deep enough to
actually use netback (on the DHCP screen).

> 2) Are there any upstream patches along this theme available? i.e. do recent
> SLES-11 32-bit guest kernels have this problem?
In http://xenbits.xensource.com/linux-2.6.18-xen.hg, in the most recent
revision (1064:3cdab8f7ff58), netback still looks like this:

166 static int check_mfn(int nr)
167 {
168 	struct xen_memory_reservation reservation = {
169 		.extent_order = 0,
170 		.domid = DOMID_SELF
171 	};
172 	int rc;
173
174 	if (likely(alloc_index >= nr))
175 		return 0;
176
177 	set_xen_guest_handle(reservation.extent_start, mfn_list + alloc_index);
178 	reservation.nr_extents = MAX_MFN_ALLOC - alloc_index;
179 	rc = HYPERVISOR_memory_op(XENMEM_increase_reservation, &reservation);
180 	if (likely(rc > 0))
181 		alloc_index += rc;
182
183 	return alloc_index >= nr ? 0 : -ENOMEM;
184 }

That is, no address_bits or memflags setting.

I'm not sure what access we have to SLES-11, but I guess we should try them
out.

Installed the -240xen kernel on the host, with the corresponding hv.

[root@intel-e7450-512-1 ~]# uname -psrn
Linux intel-e7450-512-1.englab.nay.redhat.com 2.6.18-240.el5xen x86_64

*** v-i rhel49 rhel4 nightly/latest-RHEL-4/tree-AS-i386

Installation starts and gets past DHCP initialization.

*** v-i rhel55 rhel5.4 released/RHEL-5-Server/U5/i386/os

Installation starts and breaks when we attempt the DHCP initialization, with
the usual symptoms (spinning, and xm dmesg entries like "Error getting mfn
29d487e (pfn 7de549) from L1 entry 80000029d487e063 for dom4" -- the domid is
correct). IOW, the bug reproduces on a -240 host.

*** v-i-us rhel56 rhel5.4 released/RHEL-5-Server/U6/i386/os

Installation starts and gets past DHCP initialization.

Now I'll check with the patched -240xen.

[root@intel-e7450-512-1 ~]# uname -psrn
Linux intel-e7450-512-1.englab.nay.redhat.com 2.6.18-240.el5.32bit_vif_crash_on_256g_bz666908xen x86_64

The patch doesn't fix the problem; the RHEL55 guest breaks the same way as
before.
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d2 Error pfn 28ecd8c: rd=ffff8300bf21e080, od=0000000000000000, caf=00000000, taf=0000000000000000
(XEN) mm.c:654:d2 Error getting mfn 28ecd8c (pfn 7e318c) from L1 entry 80000028ecd8c063 for dom2

There are two developments. This will be a very long comment.
I.
The suspicion of netback (at least, check_mfn() missing
.address_bits=32) in comment turns out to be invalid. I patched the
function like this, based on -240:
----v----
diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index c96a724..b25a0f8 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -115,16 +115,31 @@ static int check_mfn(int nr)
{
struct xen_memory_reservation reservation = {
.extent_order = 0,
+ .address_bits = 32u,
.domid = DOMID_SELF
};
+ int rc;
+ IPRINTK("check_mfn(%d): alloc_index=%u\n", nr, alloc_index);
if (likely(alloc_index >= nr))
return 0;
set_xen_guest_handle(reservation.extent_start, mfn_list + alloc_index);
reservation.nr_extents = MAX_MFN_ALLOC - alloc_index;
- alloc_index += HYPERVISOR_memory_op(XENMEM_increase_reservation,
- &reservation);
+ rc = HYPERVISOR_memory_op(XENMEM_increase_reservation, &reservation);
+ if (likely(rc > 0)) {
+ unsigned u;
+
+ alloc_index += rc;
+ IPRINTK("XENMEM_increase_reservation rc=%d, alloc_index=%u\n",
+ rc, alloc_index);
+ for (u = alloc_index - rc; u < alloc_index; ++u) {
+ IPRINTK("mfn_list[%2u]=%lx\n", u, mfn_list[u]);
+ }
+ }
+ else {
+ WPRINTK("XENMEM_increase_reservation rc=%d\n", rc);
+ }
return alloc_index >= nr ? 0 : -ENOMEM;
}
----^----
(Brew link is https://brewweb.devel.redhat.com/taskinfo?taskID=3052536.)
The 32bit RHEL55 pv guest died the same way as before, but now the
host dmesg contains the following debug log:
[root@intel-e7450-512-1 ~]# uname -psrn
Linux intel-e7450-512-1.englab.nay.redhat.com
2.6.18-240.el5.32b_pv_bz666908_2xen x86_64
----v----
xen_net: check_mfn(1): alloc_index=0
xen_net: XENMEM_increase_reservation rc=64, alloc_index=64
xen_net: mfn_list[ 0]=bff36
xen_net: mfn_list[ 1]=bff35
xen_net: mfn_list[ 2]=bff34
xen_net: mfn_list[ 3]=bff33
xen_net: mfn_list[ 4]=bff32
xen_net: mfn_list[ 5]=bff31
xen_net: mfn_list[ 6]=bff30
xen_net: mfn_list[ 7]=bff2f
xen_net: mfn_list[ 8]=bff2e
xen_net: mfn_list[ 9]=bff2d
xen_net: mfn_list[10]=bff2c
xen_net: mfn_list[11]=bff2b
xen_net: mfn_list[12]=bff2a
xen_net: mfn_list[13]=bff29
xen_net: mfn_list[14]=bff28
xen_net: mfn_list[15]=bff27
xen_net: mfn_list[16]=bff26
xen_net: mfn_list[17]=bff25
xen_net: mfn_list[18]=bff24
xen_net: mfn_list[19]=bff23
xen_net: mfn_list[20]=bff22
xen_net: mfn_list[21]=bff21
xen_net: mfn_list[22]=bff20
xen_net: mfn_list[23]=bff1f
xen_net: mfn_list[24]=bff1e
xen_net: mfn_list[25]=bff1d
xen_net: mfn_list[26]=bff1c
xen_net: mfn_list[27]=bff1b
xen_net: mfn_list[28]=bff1a
xen_net: mfn_list[29]=bff19
xen_net: mfn_list[30]=bff18
xen_net: mfn_list[31]=bff17
xen_net: mfn_list[32]=bff16
xen_net: mfn_list[33]=bff15
xen_net: mfn_list[34]=bff14
xen_net: mfn_list[35]=bff13
xen_net: mfn_list[36]=bff12
xen_net: mfn_list[37]=bff11
xen_net: mfn_list[38]=bff10
xen_net: mfn_list[39]=bff0f
xen_net: mfn_list[40]=bff0e
xen_net: mfn_list[41]=bff0d
xen_net: mfn_list[42]=bff0c
xen_net: mfn_list[43]=bff0b
xen_net: mfn_list[44]=bff0a
xen_net: mfn_list[45]=bff09
xen_net: mfn_list[46]=bff08
xen_net: mfn_list[47]=bff07
xen_net: mfn_list[48]=bff06
xen_net: mfn_list[49]=bff05
xen_net: mfn_list[50]=bff04
xen_net: mfn_list[51]=bff03
xen_net: mfn_list[52]=bff02
xen_net: mfn_list[53]=bff01
xen_net: mfn_list[54]=bff00
xen_net: mfn_list[55]=bfeff
xen_net: mfn_list[56]=bfefe
xen_net: mfn_list[57]=bfefd
xen_net: mfn_list[58]=bfefc
xen_net: mfn_list[59]=bfefb
xen_net: mfn_list[60]=bfefa
xen_net: mfn_list[61]=bfef9
xen_net: mfn_list[62]=bfef8
xen_net: mfn_list[63]=bfef7
xen_net: check_mfn(1): alloc_index=63
xen_net: check_mfn(1): alloc_index=62
xen_net: check_mfn(1): alloc_index=61
xen_net: check_mfn(1): alloc_index=60
xen_net: check_mfn(1): alloc_index=59
xen_net: check_mfn(1): alloc_index=58
xen_net: check_mfn(1): alloc_index=57
xen_net: check_mfn(1): alloc_index=56
xen_net: check_mfn(1): alloc_index=55
xen_net: check_mfn(1): alloc_index=54
xen_net: check_mfn(1): alloc_index=53
xen_net: check_mfn(1): alloc_index=52
xen_net: check_mfn(1): alloc_index=51
xen_net: check_mfn(1): alloc_index=50
xen_net: check_mfn(1): alloc_index=49
xen_net: check_mfn(1): alloc_index=48
xen_net: check_mfn(1): alloc_index=47
xen_net: check_mfn(1): alloc_index=46
xen_net: check_mfn(1): alloc_index=45
xen_net: check_mfn(1): alloc_index=44
xen_net: check_mfn(1): alloc_index=43
xen_net: check_mfn(1): alloc_index=42
xen_net: check_mfn(1): alloc_index=41
xen_net: check_mfn(1): alloc_index=40
xen_net: check_mfn(1): alloc_index=39
xen_net: check_mfn(1): alloc_index=38
xen_net: check_mfn(1): alloc_index=37
xen_net: check_mfn(1): alloc_index=36
xen_net: check_mfn(1): alloc_index=35
xen_net: check_mfn(1): alloc_index=34
xen_net: check_mfn(1): alloc_index=33
xen_net: check_mfn(1): alloc_index=32
xen_net: check_mfn(1): alloc_index=31
xen_net: check_mfn(1): alloc_index=30
xen_net: check_mfn(1): alloc_index=29
xen_net: check_mfn(1): alloc_index=28
xen_net: check_mfn(1): alloc_index=27
xen_net: check_mfn(1): alloc_index=26
xen_net: check_mfn(1): alloc_index=25
xen_net: check_mfn(1): alloc_index=24
xen_net: check_mfn(1): alloc_index=23
xen_net: check_mfn(1): alloc_index=22
xen_net: check_mfn(1): alloc_index=21
xen_net: check_mfn(1): alloc_index=20
xen_net: check_mfn(1): alloc_index=19
xen_net: check_mfn(1): alloc_index=18
xen_net: check_mfn(1): alloc_index=17
xen_net: check_mfn(1): alloc_index=16
xen_net: check_mfn(1): alloc_index=15
xen_net: check_mfn(1): alloc_index=14
xen_net: check_mfn(1): alloc_index=13
xen_net: check_mfn(1): alloc_index=12
xen_net: check_mfn(1): alloc_index=11
xen_net: check_mfn(1): alloc_index=10
xen_net: check_mfn(1): alloc_index=9
xen_net: check_mfn(1): alloc_index=8
xen_net: check_mfn(1): alloc_index=7
xen_net: check_mfn(1): alloc_index=6
xen_net: check_mfn(1): alloc_index=5
xen_net: check_mfn(1): alloc_index=4
xen_net: check_mfn(1): alloc_index=3
xen_net: check_mfn(1): alloc_index=2
xen_net: check_mfn(1): alloc_index=1
xen_net: check_mfn(1): alloc_index=0
xen_net: XENMEM_increase_reservation rc=64, alloc_index=64
xen_net: mfn_list[ 0]=bfef6
xen_net: mfn_list[ 1]=bfef5
xen_net: mfn_list[ 2]=bfef4
xen_net: mfn_list[ 3]=bfef3
xen_net: mfn_list[ 4]=bfef2
xen_net: mfn_list[ 5]=bfef1
xen_net: mfn_list[ 6]=bfef0
xen_net: mfn_list[ 7]=bfeef
xen_net: mfn_list[ 8]=bfeee
xen_net: mfn_list[ 9]=bfeed
xen_net: mfn_list[10]=bfeec
xen_net: mfn_list[11]=bfeeb
xen_net: mfn_list[12]=bfeea
xen_net: mfn_list[13]=bfee9
xen_net: mfn_list[14]=bfee8
xen_net: mfn_list[15]=bfee7
xen_net: mfn_list[16]=bfee6
xen_net: mfn_list[17]=bfee5
xen_net: mfn_list[18]=bfee4
xen_net: mfn_list[19]=bfee3
xen_net: mfn_list[20]=bfee2
xen_net: mfn_list[21]=bfee1
xen_net: mfn_list[22]=bfee0
xen_net: mfn_list[23]=bfedf
xen_net: mfn_list[24]=bfede
xen_net: mfn_list[25]=bfedd
xen_net: mfn_list[26]=bfedc
xen_net: mfn_list[27]=bfedb
xen_net: mfn_list[28]=bfeda
xen_net: mfn_list[29]=bfed9
xen_net: mfn_list[30]=bfed8
xen_net: mfn_list[31]=bfed7
xen_net: mfn_list[32]=bfed6
xen_net: mfn_list[33]=bfed5
xen_net: mfn_list[34]=bfed4
xen_net: mfn_list[35]=bfed3
xen_net: mfn_list[36]=bfed2
xen_net: mfn_list[37]=bfed1
xen_net: mfn_list[38]=bfed0
xen_net: mfn_list[39]=bfecf
xen_net: mfn_list[40]=bfece
xen_net: mfn_list[41]=bfecd
xen_net: mfn_list[42]=bfecc
xen_net: mfn_list[43]=bfecb
xen_net: mfn_list[44]=bfeca
xen_net: mfn_list[45]=bfec9
xen_net: mfn_list[46]=bfec8
xen_net: mfn_list[47]=bfec7
xen_net: mfn_list[48]=bfec6
xen_net: mfn_list[49]=bfec5
xen_net: mfn_list[50]=bfec4
xen_net: mfn_list[51]=bfec3
xen_net: mfn_list[52]=bfec2
xen_net: mfn_list[53]=bfec1
xen_net: mfn_list[54]=bfec0
xen_net: mfn_list[55]=bfebf
xen_net: mfn_list[56]=bfebe
xen_net: mfn_list[57]=bfebd
xen_net: mfn_list[58]=bfebc
xen_net: mfn_list[59]=bfebb
xen_net: mfn_list[60]=bfeba
xen_net: mfn_list[61]=bfeb9
xen_net: mfn_list[62]=bfeb8
xen_net: mfn_list[63]=bfeb7
xen_net: check_mfn(1): alloc_index=63
xen_net: check_mfn(1): alloc_index=62
xen_net: check_mfn(1): alloc_index=61
xen_net: check_mfn(1): alloc_index=60
xen_net: check_mfn(1): alloc_index=59
xen_net: check_mfn(1): alloc_index=58
xen_net: check_mfn(1): alloc_index=57
xen_net: check_mfn(1): alloc_index=56
----^----
The hypervisor satisfies the bitsize constraint, but the domU still
crashes, entering the infinite loop. This way we can exclude netback (or
at least, check_mfn()) as the culprit. This is the first development.
The hypervisor log contains the following entries (domid 1 is correct):
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d1
Error pfn 28dfb44: rd=ffff8300be7dc080, od=0000000000000000,
caf=00000000, taf=0000000000000000
(XEN) mm.c:654:d1 Error getting mfn 28dfb44 (pfn 7f0344) from L1 entry
80000028dfb44063 for dom1
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d1
Error pfn 28dfb44: rd=ffff8300be7dc080, od=0000000000000000,
caf=00000000, taf=0000000000000000
(XEN) mm.c:516:d1 Could not get page ref for pfn 28dfb44
(XEN) mm.c:2760:d1 Could not get page for mach->phys update
To understand what these mean I had to look at the get_page() function
in xen/include/asm-x86/mm.h. Note in advance that the "pfn" in the above
entries is a misnomer (it is actually an "mfn"), and these values look
nothing like the mfn's allocated and flipped to dom1 by netback.
II. Some quotes from mm.h:
20 struct page_info
21 {
22 /* Each frame can be threaded onto a doubly-linked list. */
23 struct list_head list;
24
25 /* Reference count and various PGC_xxx flags and fields. */
26 u32 count_info;
27
28 /* Context-dependent fields follow... */
29 union {
30
31 /* Page is in use: ((count_info & PGC_count_mask) != 0). */
32 struct {
33 /* Owner of this page (NULL if page is anonymous). */
34 u32 _domain; /* pickled format */
35 /* Type reference count and various PGT_xxx flags and fields. */
36 unsigned long type_info;
37 } __attribute__ ((packed)) inuse;
38
39 /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
40 struct {
41 /* Order-size of the free chunk this page is the head of. */
42 u32 order;
43 /* Mask of possibly-tainted TLBs. */
44 cpumask_t cpumask;
45 } __attribute__ ((packed)) free;
46
47 } u;
and
205 static inline int get_page(struct page_info *page,
206 struct domain *domain)
207 {
208 u32 x, nx, y = page->count_info;
209 u32 d, nd = page->u.inuse._domain;
210 u32 _domain = pickle_domptr(domain);
211
212 do {
213 x = y;
214 nx = x + 1;
215 d = nd;
216 if ( unlikely((x & PGC_count_mask) == 0) || /* Not allocated? */
217 unlikely((nx & PGC_count_mask) == 0) || /* Count overflow? */
218 unlikely(d != _domain) ) /* Wrong owner? */
219 {
220 if ( !_shadow_mode_refcounts(domain) )
221 gdprintk(XENLOG_INFO,
222 "Error pfn %lx: rd=%p, od=%p, caf=%08x, taf=%"
223 PRtype_info "\n",
224 page_to_mfn(page), domain, unpickle_domptr(d),
225 x, page->u.inuse.type_info);
226 return 0;
227 }
228 __asm__ __volatile__(
229 LOCK_PREFIX "cmpxchg8b %3"
230 : "=d" (nd), "=a" (y), "=c" (d),
231 "=m" (*(volatile u64 *)(&page->count_info))
232 : "0" (d), "1" (x), "c" (d), "b" (nx) );
233 }
234 while ( unlikely(nd != d) || unlikely(y != x) );
235
236 return 1;
237 }
This function adds a reference to a page already owned by the domain.
II.1. The meaning of the hypervisor log fields is:
rd -- referencing domain
od -- owner domain, NULL if page is anonymous
caf -- general reference count (and various PGC_xxx flags and fields)
taf -- typed reference count (and various PGT_xxx flags and fields)
(For the difference between general and typed reference count, see the
beginning of xen/arch/x86/mm.c. Basically, there are mutually exclusive
uses for a frame (writable mapping, use as page table, use as page
directory), but a normal read-only reference can coexist with any of
those. The general refcount includes all uses, while the typed refcount
includes only references that pertain to the current use type.)
dom1 (rd=ffff8300be7dc080) asked to add a reference to machine frame
28dfb44. The page is anonymous (od=0000000000000000). The problem is,
there is no reference at all to this page yet (caf=00000000), so
get_page() refuses to increase the general refcount.
My next suspect: netfront attempts to have a bogus machine frame back a
virtual address on the flipping path. This is the second "development".
II.2. Putting the log message aside, what does get_page() do? (I
document it here because I had to understand the "how" first, in order
to get at the "what".) It implements a loop that:
- checks if the frame has at least one reference,
- checks if the increased refcount can still be represented in the
allocated number of bits,
- checks if the domain already owns the frame.
If so, then it increases the refcount atomically (see below). As long as
the atomic increase fails, because the relevant page_info members were
modified during our checks and increase computation, the loop is
retried.
The atomic increase relies on the unpadded layout of struct page_info
-- there must be no gap between .count_info and .u.inuse._domain --, and
it also relies on (or rather, adapts to) x86_64 being little endian.
When the inline assembly is reached,
- x and y hold the general refcount that we started out with,
- nx is the increased refcount, to be stored,
- d and nd contain the pickled representation of the domain pointer;
IOW, the identity of the owner domain,
- and the checks were all successful.
__asm__ __volatile__(
assembler template ... LOCK_PREFIX "cmpxchg8b %3"
output operands
%0 EDX --> nd ... : "=d" (nd),
%1 EAX --> y ... "=a" (y),
%2 ECX --> d ... "=c" (d),
%3 mem operand ... "=m" (*(volatile u64 *)(&page->count_info))
input operands
d --> %0 (EDX) ... : "0" (d),
x --> %1 (EAX) ... "1" (x),
d --> ECX ... "c" (d),
nx --> EBX ... "b" (nx)
);
The "=m" operand seems to imply an extra address-of operator, ie. &, so
that the instruction gets a pointer to the lowest byte of
page->count_info. This is where the structure layout and the
little-endianness matter: the 64-bit unsigned integer pointed to by the
%3 operand has a low u32 of page->count_info, and a high u32 of
page->u.inuse._domain.
The above assembly instruction executes the following, atomically:
Compare EDX:EAX with m64.
ie. (d, x) == (page->u.inuse._domain, page->count_info)
ie. "if neither the owner domain, nor the general refcount changed"
If equal, set ZF, and load ECX:EBX into m64.
ie. (page->u.inuse._domain, page->count_info) = (d, nx)
ie. "then store the increased refcount"
Otherwise, clear ZF and load m64 into EDX:EAX.
ie. (nd, y) = (page->u.inuse._domain, page->count_info)
ie. "otherwise, get back the new owner domain and the new refcount"
In the latter case (when (nd, y) != (d, x)) the loop is retried, now for
the recently retrieved new owner and/or new refcount.
Created attachment 474712 [details]
netfront IPRINTK patch for debugging
I created this patch to see what requests the guest issues to the hypervisor. It is to be applied on top of -194.
Created attachment 474715 [details]
guest console log and hypervisor dmesg
I compiled an i686 kernel with the patch in the previous comment, and started an otherwise RHEL56 32bit pv guest with it, with dhcp turned off at boot time. The host was the one with check_mfn() logging in place. For starting the guest I used "xm create -c", and then on the guest console I stopped klogd message capturing (killall -SIGTSTP klogd) so that everything would go to the console. Then I have brought up the interface (ifup eth0).
I repeated this several times (I even rebooted the host), but only the very first attempt was lucky enough to generate a *complete* guest log. The guest fell into the infinite loop each time, and in all cases, except the very first, some important part of the guest log was missing. I'm attaching a text file with the guest and hypervisor logs (and the host logs, but that doesn't seem important). This comment analyzes the logs. Column 112 (starting with column 0) is missing, because the console output was wrapped like that in my terminal, but it doesn't hinder the analysis.
1. netif_poll() is entered
2. netif_poll() calls xennet_get_responses() in a loop
3. xennet_get_responses() composes an array of virtual->machine remapping operations, and an array of machine->physical mapping operations. It adds one element to each per invocation (although it could loop itself).
4. When the loop is complete, netif_poll() passes this batch of operations to the hypervisor: it flips the entire bunch, and updates the m2p mapping, in one fell swoop.
5. Immediately after this, the guest kernel enters an infinite loop of unhandled page faults. The first faulting virtual address is c09a0012, on page c09a0000.
BUG: unable to handle kernel paging request at virtual address c09a0012
In the output of step 3, we find this page (c09a0000) as the very first operation added to the virt->mach array:
netfront: xennet_get_responses(): enter
netfront: xennet_get_responses(): setting up page remap:
page=(c1159400) pfn=(9a0) np=(c09e8400)
rx_mcl[ 0].args[0..2]=(c09a0000 673ed063 80000027)
mfn=(27673ed)
netfront: xennet_get_responses(): setting up page remap #2:
rx_mmu[ 0]=(ptr=(27673ed001) val=(9a0))
netfront: xennet_get_responses(): exit
The three arguments ([0..2]) of the hv op say: store 80000027673ed063 into the PTE corresponding to the virtual address c09a0000. In other words: "back the page starting at virtual address c09a0000 with machine frame 27673ed, and set the PTE flags to 8000000000000063 (NX|A|PCD|PWT|U/S|R/W|P)". This request is invalid from a 32bit PAE domain with 4K pages, since such a domain can only address machine frames up to 0xffffff (36-bit physical addresses, ie. 64G).
The hypervisor logs the exact same attributes when the guest tries to add a reference to the page:
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:
d12 Error pfn 27673ed: rd=ffff8300bf204080, od=0000000000000000,
caf=00000000, taf=0000000000000000
(XEN) mm.c:654:d12 Error getting mfn 27673ed (pfn 768bed) from L1 entry
80000027673ed063 for dom12
The next step is to investigate where the bad mfn comes from. It is returned by the very first gnttab_end_foreign_transfer_ref(ref) call.
Interestingly, even without trying to bring up the network interface in the guest, it cannot be shut down cleanly. It always leaves behind a zombie domain, and the hypervisor complains about dom0:
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d0 Error pfn
10f3158: rd=ffff8300bf21e080, od=ffff8300bf21e080, caf=00000000,
taf=0000000000000000
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d0 Error pfn
10f3158: rd=ffff8300bf21e080, od=ffff8300bf21e080, caf=00000000,
taf=0000000000000000
(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:225:d0 Error pfn
10f3158: rd=ffff8300bf21e080, od=ffff8300bf21e080, caf=00000000,
taf=0000000000000000
Created attachment 474853 [details]
netfront, netback and hypervisor logs of bug; with netback debug IPRINTK's
interpretation in next comment
Created attachment 474863 [details]
netbk patch for debugging: check_mfn() .address_bits = 32u, plus IPRINTKs
Attachment 474853 [details] contains host logs that were written by netback as patched with this patch (to be applied on top of -240). In the rest of this comment, this logfile will be referred to as "debug4.txt".
My current interpretation of netback's presently relevant workings is the following. "netback.c" line numbers correspond to -240.
** 1. The entry point of the "network layer" is netif_be_start_xmit(). This function is called with an skb that was allocated by the network layer, and with a net_device that also determines the frontend. If the receiver is the flipping kind (netback.c:255), then this function (run by the network layer's thread) creates a deep copy of the skb (hereafter referred to as "nskb"), by calling netbk_copy_skb(), and releases/forgets the original skb.
** 2. nskb is created as a deep copy of skb like this. A socket buffer seems to have two *kinds* of memory areas that are important here. The first kind is "skb->data", which should always be there and consist of a single page (the data pointer should point somewhere inside the page). The second kind is "skb_shinfo(skb)->frags[i]", of which there are zero or more (hereafter "nr_frags"; "i" runs from 0, inclusive, to nr_frags, exclusive). "frags[i].page" points to a "struct page".
netbk_copy_skb() allocates nskb (#A#), which provides us with nskb->data as well. (Follow netback.c:151 to __alloc_skb(), skbuff.c:160 and :169.) Then, as long as there is data left under skb, frag pages are allocated (#B#) for nskb. See netback.c:186 and :193.
** 3. skb is released, nskb takes its place (netback.c:264-265), then nskb is queued to rx_queue, and a kick is given to the tasklet (net_rx_tasklet) that consumes rx_queue (netback.c:291-292). Then the network layer's role ends.
Again, these three steps create a deep copy of the network layer's skb, remove netback's reference to the network layer's skb, and queue the deep copy to netback's rx tasklet. The original (= allocated by the network layer) skb pages/frames are *never* flipped.
** 4. The action routine of net_rx_tasklet is net_rx_action(). It has the following structure: 4a -- setup loop, 4b -- execution, 4c -- release/verify loop.
** 4a. setup loop -- netback.c:551-576. This loop takes "as many as possible" (deeply copied) nskb's from rx_queue. For each nskb (which also identifies the recipient guest), if the recipient netfront is flipping, then the function ensures that we have at least "nr_frags+1" machine frames in our pool (check_mfn() call, netback.c:557), then sets up the remapping/grant operations (gops) for the nskb (netback.c:567), by calling netbk_gop_skb() -- see 4a1 below. Then it queues the skb for the release/verification loop (step 4c), in the "rxq" queue, see netback.c:571.
** 4a1. netbk_gop_skb() iterates over the nr_frags+1 pages of nskb. (nr_frags, that is, possibly zero, frag pages (netback.c:418), and one skb->data page (netback.c:431)). For each page, netbk_gop_frag() is called.
netbk_gop_frag() is where the magic is done (or more precisely, prepared for later execution). The fifth parameter of this function is the page which the machine frame will be flipped out from under, to the corresponding netfront, *and* under which we'll map in another machine frame, from our own machine frame pool. (Note that this function is called in series by netbk_gop_skb() for the nr_frags+1 pages of the nskb. Before netbk_gop_skb() was called however, we made sure our pool contains at least nr_frags+1 machine frames, see the check_mfn() call in step 4a, line netback.c:557.)
netback.c:348 retrieves the current machine frame number. For the flipping receiver, netback.c:375 gets a new machine frame from our pool.
netback.c:385 prepares the hypervisor multicall entry that will *remap* new_mfn in place of old_mfn, still under the same page. Finally, still for the flipping receiver, netback.c:397-398 prepare the "flip", that is, the transfer (grant) of the old_mfn to the target (netfront) domain. (The flipped mfn will be returned to the hypervisor by netfront.)
** 4b. execution -- netback.c:578-619. All the remap and grant (=flip) operations that were collected in the setup loop (4a), for *all* nskbs (and so possibly for multiple target netfronts!) are executed in one enormous multicall to the hypervisor. This handles copying, flipping, and the remapping that makes up for the flipped (transferred) mfns.
** 4c. release/verify loop -- netback.c:621-696. This loop processes the nskb's that were stashed to rxq in 4a. For each nskb that was destined for a flipping netfront, the nskb is released in a tricky fashion. First, all nr_frag frag pages (if any) are "forced off" from the nskb (netback.c:636) and released manually (netback.c:637, calls netbk_free_pages()). Second, the [n]skb itself (and so the page under skb->data) is released in netback.c:694 (this is not specific to flipping receivers). The journey of the (deeply copied) nskb ends here.
--------
Summary: whatever skb's we get from the network layer, if the target netfront is flipping, we create a deep copy, and operate on that later on. The deep copy has nr_frags+1 pages that are freshly allocated by us in step 2, places #A# and #B#. When we flip, we flip the frames backing these pages, and we use our own mfn pool *exclusively* for *release* purposes. That is, when we finally get rid of the (deep copy) nskb, those nr_frags+1 pages *must* be backed by something. We use our pool to provide this backing for released nskb pages. The reason for this is that all allocations and releases involve caching, so whatever we release must be reusable as-is.
In fact if we look at the attachments, for example attachment 474853 [details], we see that each time, except the very first, we get back in netbk_copy_skb() (network layer thread) what we just released at the end of net_rx_action() (netback tasklet) -- both virtual address and backing mfn are identical. This was misleading for me, because it suggested that netback's own mfn pool ("mfn_list", netback.c:106) would be used for flipping. Instead, it is only used for release purposes *directly*, and it's a pure coincidence that we see those same virtual addresses and mfns in #A# and #B#. (NB, debug4.txt, which logs the ifup actions, contains no trace of #B#, ie. allocation of frag pages.)
Therefore, now we know what causes the problem, and we also have an idea how it should be fixed. First, we don't need to set reservation.address_bits to 32 in check_mfn(), because those machine frames are never flipped to netfronts -- we only need those machine frames to put *something* under the just-being-released nskb pages (of which there are nr_frags+1, for each nskb).
The problem is: in netbk_copy_skb(), location #A# (netback.c:151) and location #B# (netback.c:186) retrieve fresh pages -- or cached pages -- that are backed by mfns not necessarily suitable for a 32bit domain:
151 nskb = alloc_skb(SKB_MAX_HEAD(0), GFP_ATOMIC);
186 page = alloc_page(GFP_ATOMIC | __GFP_NOWARN | zero);
("zero" makes a difference on the last, partially filled frag page).
The idea is to "or in" __GFP_DMA32 in these two places additionally, if necessary. (See "include/linux/gfp.h".) The skb links to the net_device, which embeds netif_t, which tells us about the guest end (domain id and flipping/copying choice, for example). netbk_copy_skb() should deeply copy these pieces of information in step 2 as well -- for example, we need the domid to know where to flip to. IOW, we should be able to add __GFP_DMA32 only when needed, because we should know whether the target domain is 32bit or not.
(I must still figure out how precisely.)
The replacement frames from our own pool don't ever need to be "low". The low frames will be returned to the hypervisor by the 32bit netfronts, so we shouldn't run out of them. (Naturally, if we have a hostile netfront, then we might run out of low pages, but ballooning interferes badly with flipping anyway -- it is easy to trigger memory squeeze in netback by ballooning. That's why RHEL5.6 defaults to copying. This (purported) fix should be relevant only for customers running default-behavior (rx_copy=0) 32 bit RHEL55 netfronts, on hosts that have more than 64 GB RAM (exceeding the 36 bit hardware PAE limit), *and* where numa=on is not present on the hv cmdline.)
Obviously, I'm assuming that __GFP_DMA32 is passed on ultimately to the hypervisor by the kmalloc() machinery. I hope this is the case, otherwise dom0 device drivers couldn't ensure their buffers are in the DMA zone.
I wanted to test this idea, but as soon as I installed my host kernel patched with __GFP_DMA32, I realized I wouldn't be able to reboot the intel machine, because the qekvm1 appliance kept barfing "network error" on me and I couldn't get past the BIOS setup dialog, and there was nobody to fix that during the weekend.
*If* __GFP_DMA32 doesn't work out, then I'll close this bug as a WONTFIX, because the fix would involve modifying the netback allocation/transfer patterns, or worse, touching the kmalloc() infrastructure. Workarounds galore exist: (a) make the guest choose copying, by any means (rx_copy=1, upgrade to 5.6), (b) try passing numa=on to the hv, (c) run your 32 bit apps on a 64 bit guest kernel.
Though it might prove necessary to set .address_bits to 32 in addition, ie. make our own pool consist of machine frames that are also in the 32 bit zone.
Without that, we might violate some kmalloc() contract: when we get the page (with __GFP_DMA32), the backing frame *is* low, but when we give it back, with the machine frame remapped under it, that property doesn't necessarily hold anymore. That might constitute lying to the kmalloc() machinery, since it could expect a __GFP_DMA32-issued page to be __GFP_DMA32 when returned as well.
I think I'll keep .address_bits=32. The penalty shouldn't be big, and I wouldn't want to add another source of obscure bugs.
(In reply to comment #25)
> Obviously, I'm assuming that __GFP_DMA32 is passed on ultimately to the
> hypervisor by the kmalloc() machinery. I hope this is the case, otherwise dom0
> device drivers couldn't ensure their buffers are in the DMA zone.
Unfortunately this assumption is wrong. Currently all PV guest memory (including dom0) is in ZONE_DMA (also ZONE_HIGHMEM for i386), so adding the DMA32 flag won't change anything. That's changing for 5.7 as we're allowing PV guests (and dom0) to have a normal memory map (i.e. DMA/DMA32/NORMAL), but none of that matters anyway, since the HV doesn't guarantee you get the address ranges from any particular zone just because you requested it in the guest kernel.
You just can't use DMA or any given address if you're running as a PV guest with a basic configuration. So if it's necessary for that hardware then you need the HV to grant direct access to the specific addresses. dom0 gets pretty much everything granted to it by default, allowing the devices to work. The passthrough (insecure) feature is needed for other PV domains to get direct access to arbitrary addresses.
So we'll have to consider something else to fix this issue, but the analysis you've done here is excellent; it shows us exactly where to look. I also think it will help us to understand other memory bugs that have plagued us, particularly in the network stack.
I. The __GFP_DMA32 idea didn't work.
The host rebooted as soon as the guest was started (without ever getting a chance to ifup eth0 in the guest). This is probably related to the fact that netback flips stuff to netfront even before the netfront interface is brought up in the guest (see attachment 474853 [details] for log messages). Likely the very first allocation in netbk_copy_skb() crashed the host. Unfortunately, the serial console of the machine is not accessible.
II. I believe I've found some historical evidence that corroborates Andrew's point in comment 27.
(a) 03 Apr 2007 -- "linux: GFP_DMA/GFP_DMA32" thread on Xen-devel:
http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00035.html
It states that drivers relying on GFP_DMA/GFP_DMA32 under Xen will break. The possible fixes presented in the second message match those in the last paragraph of comment 27 -- modify the specific driver (netback), or try to hook the allocator. There seems to be no definitive fix.
("drivers/media/video/video-buf.c" is mentioned as GFP_DMA32-reliant, middle-level code that would have to be fixed. Interestingly enough, the latest commit (as of -240) for this file in the RHEL5 tree is b533be8 (from 08 May 2007) to fix bug 221478, backporting kernel upstream commit 10329b9 (12 Jan 2007) -- it adds GFP_DMA32 to video frame buffer allocation so that video capture cards keep working on hosts that have lots of RAM.)
Fiddling with the allocator is plainly out of the question. Adapting netback would mean:
- doing the skb->nskb copy like we do now,
- then an extra get-low-mfns-from-hv / remap-nskb-pages / return-high-mfns-to-hv combo,
- then flipping these pages to domU.
This in fact would implement flipping on top of dom0-internal copying. The current copying path does exactly that, just without flipping, and in inter-domain fashion.
(b) The linux-2.6.18-xen changelog (default branch) mentions GFP_DMA{,32} in two changesets.
In chronological order:
10 Dec 2007 -- 355:b865b15fb54b -- set up memory zones like bare-metal
http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/355
The code comment that c/s 355 removes looked very hopeful at first. Without this patch:
----v----
XEN: Our notion of "DMA memory" is fake when running over Xen. We simply put all RAM in the DMA zone so that those drivers which needlessly specify GFP_DMA do not get starved of RAM unnecessarily. Those drivers that *do* require lowmem are screwed anyway when running over Xen!
----^----
23 Oct 2009 -- 940:6301ebc85480 -- now that GFP_DMA{,32} have real "teeth", remove unnecessary alloc restrictions
http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/940
c/s 940 is not really important. I thought c/s 355 would be very important; it would supply the basis to make GFP_DMA{,32} actually work. Looking at RHEL5 (-240), we don't have it yet in-tree, but Andrew tells me he's already posted the backport and it's pending for RHEL5.7. However, it should only affect pseudo-physical addresses and is still one level too high for our needs.
In summary: of the two possible fixes, one (hacking the allocator) is a big no-no; the second (adding an extra intra-dom0 copy to netback) is wasted effort (both coding and CPU). The superficially related upstream patches are, indeed, superficial; they are not fixes. The solution is to use copying, or to sidestep the entire issue if applicable; see the last paragraph of comment 25.