Bug 1195899

Summary: Live boot sometimes fails with a NULL pointer dereference when mounting /run/initramfs/squashfs
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: kernelAssignee: Mikuláš Patočka <mpatocka>
Status: CLOSED CURRENTRELEASE QA Contact: Storage QE <storage-qe>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 22CC: admiller, agk, bugzilla, gansalmon, harald, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab, mfabian, mschmidt, msnitzer, robatino, samuel-rhbugs, satellitgo, walters
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-20 20:38:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
full journalctl output from an affected boot attempt
none
Full journal from a successful boot with the same image (for comparison)
none
rd.debug log (up to switch root) from an 02-21 (buggy) boot
none
rd.debug log (up to switch root) from an 02-18 (not buggy) boot
none
rd.debug log (up to switch root) from an 02-18 (not buggy) boot none

Description Adam Williamson 2015-02-24 20:13:10 UTC
Created attachment 994825 [details]
full journalctl output from an affected boot attempt

Recent Fedora 22 nightly live builds - at least since 02-21, but likely since the 3.20 kernel(?) - sometimes fail to boot. Very early in the process (when mounting /run/initramfs/squashfs) they hit a kernel NULL pointer dereference in dm_exception_store_set_chunk_size .

This seems to happen frequently for me with the 02-21 and 02-24 32-bit Workstation nightlies. It happens possibly less frequently, but definitely happens, with the 02-24 64-bit Workstation nightly.

Download links:

02-21 i686: https://kojipkgs.fedoraproject.org/work/tasks/5167/9015167/Fedora-Live-Workstation-i686-22-20150221.iso
02-24 i686: https://kojipkgs.fedoraproject.org/work/tasks/2668/9052668/Fedora-Live-Workstation-i686-22-20150224.iso
02-21 x86_64: https://kojipkgs.fedoraproject.org/work/tasks/5170/9015170/Fedora-Live-Workstation-x86_64-22-20150221.iso
02-24 x86_64: https://kojipkgs.fedoraproject.org/work/tasks/2670/9052670/Fedora-Live-Workstation-x86_64-22-20150224.iso

Will attach journalctl from an occurrence of the bug (with the x86_64 02-24 image).

This seems like an Alpha blocker, quasi-random violation of "All release-blocking images must boot in their supported configurations."

Comment 1 Adam Williamson 2015-02-24 20:19:09 UTC
Created attachment 994828 [details]
Full journal from a successful boot with the same image (for comparison)

Comment 2 Adam Williamson 2015-02-24 20:19:45 UTC
There are some suspicious messages from systemd just before the crash, but there are similar messages in a successful boot too. See the attached log from a successful boot with the same image.

Comment 3 Josh Boyer 2015-02-24 20:22:56 UTC
The attached journal shows some issues with the squashfs unit failing to mount but dm probably shouldn't be oopsing in any case.

Feb 24 20:03:30 localhost kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
Feb 24 20:03:30 localhost kernel: IP: [<ffffffff8160735a>] dm_exception_store_set_chunk_size+0x6a/0x140
Feb 24 20:03:30 localhost kernel: PGD 0 
Feb 24 20:03:30 localhost kernel: Oops: 0000 [#1] SMP 
Feb 24 20:03:30 localhost kernel: Modules linked in: isofs squashfs 8021q garp stp llc mrp 8139too virtio_scsi virtio_blk virtio_balloon qxl drm_kms_helper ttm crct10dif_pclmul crc32_pclmul crc32c_intel drm ghash_clmulni_intel sym53c8xx 8139cp scsi_transport_spi virtio_pci virtio_ring virtio mii ata_generic pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc loop
Feb 24 20:03:30 localhost kernel: CPU: 1 PID: 548 Comm: dmsetup Not tainted 4.0.0-0.rc1.git0.1.fc22.x86_64 #1
Feb 24 20:03:30 localhost kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Feb 24 20:03:30 localhost kernel: task: ffff88007fa9cf00 ti: ffff88007ea1c000 task.ti: ffff88007ea1c000
Feb 24 20:03:30 localhost kernel: RIP: 0010:[<ffffffff8160735a>]  [<ffffffff8160735a>] dm_exception_store_set_chunk_size+0x6a/0x140
Feb 24 20:03:30 localhost kernel: RSP: 0018:ffff88007ea1fb08  EFLAGS: 00010246
Feb 24 20:03:30 localhost kernel: RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
Feb 24 20:03:30 localhost kernel: RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88007a6a7e00
Feb 24 20:03:30 localhost kernel: RBP: ffff88007ea1fb28 R08: 0000000000000000 R09: 0000000000000001
Feb 24 20:03:30 localhost kernel: R10: 000000000000000a R11: f000000000000000 R12: ffff88007faf9180
Feb 24 20:03:30 localhost kernel: R13: ffffc900044f6088 R14: ffff88007ea1fbc4 R15: ffff88007faf9180
Feb 24 20:03:30 localhost kernel: FS:  00007f4ced679800(0000) GS:ffff88007d700000(0000) knlGS:0000000000000000
Feb 24 20:03:30 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 24 20:03:30 localhost kernel: CR2: 0000000000000098 CR3: 000000007ea2d000 CR4: 00000000000007e0
Feb 24 20:03:30 localhost kernel: Stack:
Feb 24 20:03:30 localhost kernel:  0000000000000008 ffffc900044f6040 ffff88007a6a7e00 ffff88007faf9550
Feb 24 20:03:30 localhost kernel:  ffff88007ea1fb98 ffffffff816075ef 0000000000000168 000000007ac01600
Feb 24 20:03:30 localhost kernel:  ffffffff81ccbb20 ffff88007a6a7f30 000000087faf9258 00000000fef9ff77
Feb 24 20:03:30 localhost kernel: Call Trace:
Feb 24 20:03:30 localhost kernel:  [<ffffffff816075ef>] dm_exception_store_create+0x1bf/0x230
Feb 24 20:03:30 localhost kernel:  [<ffffffff816052fa>] snapshot_ctr+0x12a/0x650
Feb 24 20:03:30 localhost kernel:  [<ffffffff815f9448>] ? dm_split_args+0x68/0x170
Feb 24 20:03:30 localhost kernel:  [<ffffffff815f96ab>] dm_table_add_target+0x15b/0x3b0
Feb 24 20:03:30 localhost kernel:  [<ffffffff815f9207>] ? dm_table_create+0x87/0x140
Feb 24 20:03:30 localhost kernel:  [<ffffffff815fceab>] table_load+0x14b/0x370
Feb 24 20:03:30 localhost kernel:  [<ffffffff815fcd60>] ? retrieve_status+0x1c0/0x1c0
Feb 24 20:03:30 localhost kernel:  [<ffffffff815fdb52>] ctl_ioctl+0x232/0x520
Feb 24 20:03:30 localhost kernel:  [<ffffffff81317818>] ? SYSC_semtimedop+0x2c8/0xea0
Feb 24 20:03:30 localhost kernel:  [<ffffffff815fde53>] dm_ctl_ioctl+0x13/0x20
Feb 24 20:03:30 localhost kernel:  [<ffffffff81235056>] do_vfs_ioctl+0x2c6/0x4d0
Feb 24 20:03:30 localhost kernel:  [<ffffffff812352e1>] SyS_ioctl+0x81/0xa0
Feb 24 20:03:30 localhost kernel:  [<ffffffff81786029>] system_call_fastpath+0x12/0x17
Feb 24 20:03:30 localhost kernel: Code: 8b 80 98 00 00 00 48 8b 80 78 03 00 00 48 85 c0 74 0c 0f b7 80 28 06 00 00 66 85 c0 75 73 49 8b 7c 24 08 e8 59 cf ff ff 48 8b 00 <48> 8b 80 98 00 00 00 48 8b 80 78 03 00 00 48 85 c0 74 0c 0f b7 
Feb 24 20:03:30 localhost kernel: RIP  [<ffffffff8160735a>] dm_exception_store_set_chunk_size+0x6a/0x140
Feb 24 20:03:30 localhost kernel:  RSP <ffff88007ea1fb08>
Feb 24 20:03:30 localhost kernel: CR2: 0000000000000098
Feb 24 20:03:30 localhost kernel: ---[ end trace c82c3f5c1b456124 ]---

Comment 4 Mike Snitzer 2015-02-24 20:57:07 UTC
This is the only recent change to dm-snapshot (which went in during the 4.0 merge window):  http://git.kernel.org/linus/22aa66a3ee5

Not clear why/if it'd pose a problem on snapshot_ctr's call to dm_exception_store_set_chunk_size though...

Assigning to Mikulas to triage further.

Comment 5 satellitgo 2015-02-25 20:03:35 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1196236
same issue

Comment 6 Adam Williamson 2015-02-25 20:05:10 UTC
*** Bug 1196236 has been marked as a duplicate of this bug. ***

Comment 7 Mikuláš Patočka 2015-02-25 20:11:14 UTC
It is caused by the bdev field of struct dm_dev being NULL.

But I don't know what causes it. I couldn't find how device mapper could set this field to NULL.

Maybe there is some general memory corruption in other kernel component and you need to bisect it.

Comment 8 Adam Williamson 2015-02-25 20:46:21 UTC
So, we're trying to pin it down a bit. We're doing a few more tests, but so far, it looks like the 02-18 nightly live images do not have this bug, while 02-21 nightly lives do. Unfortunately nightly generation failed on 02-19 and 02-20, so that's as tightly as we can pin it down.

The 02-18 nightlies have kernel 3.20.0-0.rc0.git7.3.fc22 . The 02-21 nightlies have 3.20.0-0.rc0.git10.1.fc22 . So that looks to be the delta so far as the kernel goes.

Comment 9 Adam Williamson 2015-02-25 20:58:19 UTC
Note, 02-18 nightly boots do *not* have these systemd messages:

Feb 24 20:03:27 localhost systemd[1]: Reached target Basic System.
Feb 24 20:03:27 localhost systemd[1]: Starting Basic System.
Feb 24 20:03:30 localhost kernel: squashfs: version 4.0 (2009/01/31) Phillip Lougher
Feb 24 20:03:30 localhost kernel: ISO 9660 Extensions: Microsoft Joliet Level 3
Feb 24 20:03:30 localhost kernel: ISO 9660 Extensions: RRIP_1991A
Feb 24 20:03:30 localhost systemd[1]: Unit run-initramfs-squashfs.mount is bound to inactive unit. Stopping, too.
Feb 24 20:03:30 localhost systemd[1]: Unmounting /run/initramfs/squashfs...
Feb 24 20:03:30 localhost umount[526]: umount: /run/initramfs/squashfs: target is busy
Feb 24 20:03:30 localhost umount[526]: (In some cases useful info about processes that
Feb 24 20:03:30 localhost umount[526]: use the device is found by lsof(8) or fuser(1).)
Feb 24 20:03:30 localhost systemd[1]: run-initramfs-squashfs.mount mount process exited, code=exited status=32
Feb 24 20:03:30 localhost systemd[1]: Failed unmounting /run/initramfs/squashfs.
Feb 24 20:03:30 localhost systemd[1]: Unit run-initramfs-squashfs.mount is bound to inactive unit. Stopping, too.
Feb 24 20:03:30 localhost systemd[1]: Unmounting /run/initramfs/squashfs...
Feb 24 20:03:30 localhost umount[529]: umount: /run/initramfs/squashfs: target is busy
Feb 24 20:03:30 localhost umount[529]: (In some cases useful info about processes that
Feb 24 20:03:30 localhost umount[529]: use the device is found by lsof(8) or fuser(1).)
Feb 24 20:03:30 localhost systemd[1]: run-initramfs-squashfs.mount mount process exited, code=exited status=32
Feb 24 20:03:30 localhost systemd[1]: Failed unmounting /run/initramfs/squashfs.
Feb 24 20:03:30 localhost systemd[1]: Unit run-initramfs-squashfs.mount is bound to inactive unit. Stopping, too.
Feb 24 20:03:30 localhost systemd[1]: Unmounting /run/initramfs/squashfs...
Feb 24 20:03:30 localhost systemd[1]: Unmounted /run/initramfs/squashfs.
Feb 24 20:03:30 localhost systemd[1]: Unit run-initramfs-squashfs.mount entered failed state.
Feb 24 20:03:30 localhost systemd[1]: run-initramfs-squashfs.mount failed to run 'mount' task: No such file or directory
Feb 24 20:03:30 localhost systemd[1]: Failed to mount /run/initramfs/squashfs.
Feb 24 20:03:30 localhost systemd[1]: Mounting /run/initramfs/squashfs...

The 02-18 journal around that point just looks like this:

Feb 25 15:48:28 localhost systemd[1]: Starting Basic System.
Feb 25 15:48:28 localhost systemd[1]: Reached target Basic System.
Feb 25 15:48:28 localhost unknown[1]: <audit-1130> pid=1 uid=0 auid=4294967295 s
Feb 25 15:48:30 localhost kernel: squashfs: version 4.0 (2009/01/31) Phillip Lou
Feb 25 15:48:30 localhost kernel: ISO 9660 Extensions: Microsoft Joliet Level 3
Feb 25 15:48:30 localhost kernel: ISO 9660 Extensions: RRIP_1991A
Feb 25 15:48:30 localhost systemd[1]: Started dracut initqueue hook.
Feb 25 15:48:30 localhost kernel: audit_printk_skb: 6 callbacks suppressed
Feb 25 15:48:30 localhost kernel: audit: type=1130 audit(1424897310.802:13): pid
Feb 25 15:48:30 localhost unknown[1]: <audit-1130> pid=1 uid=0 auid=4294967295 s
Feb 25 15:48:30 localhost systemd[1]: Starting Remote File Systems (Pre).
Feb 25 15:48:30 localhost systemd[1]: Reached target Remote File Systems (Pre).

so perhaps there's some kind of dracut or systemd issue involved here too. A dracut build landed on 02-19:

http://koji.fedoraproject.org/koji/buildinfo?buildID=612988

which has two 'interesting' changelog entries:

- fixed shutdown routine for dm devices
- Include squashfs support for rd.live.fsimg

We *also* went from systemd 218 to systemd 219 between 02-18 (the working 02-18 nightly has systemd-218-5, 02-21 has 219-4). systemd changelog:

219-4 - Revive fedora-disable-resolv.conf-symlink.patch to unbreak composes.
219-3 - arm: disabling gold did not help; disable lto instead (#1193212)
219-2 - Update 90-default.present for dbxtool.
219-1 - New upstream release
      - This removes the sysctl/bridge hack, a different solution needs to be found for this (see #634736)
      - This removes the /etc/resolv.conf hack, anaconda needs to fix their handling of /etc/resolv.conf as symlink
      - This enables "%check" - disable gold on arm, as that is broken (see #1193212)
218-6 - aarch64 now has seccomp support

dracut looks like a more likely suspect than systemd to me, but obviously it'd be a good idea to look into both changes. CCing harald and michal - can you guys see if any of your changes could be related to this bug? Especially harald. It will be a proposed Alpha blocker so it's a priority issue.

Comment 10 Adam Williamson 2015-02-25 21:00:56 UTC
Proposing as an Alpha blocker, conditional violation of "All release-blocking images must boot in their supported configurations." Exactly how serious the bug is seems to vary from image to image and possibly system to system, but satellit says 02-25 nightly always fails to boot for him, and I found the 02-21 32-bit Workstation nightly always fails to boot for me in a KVM. 02-24 is more intermittent (sometimes works, sometimes fails). We'll see how badly TC5 is affected when it lands.

Comment 11 Adam Williamson 2015-02-25 21:40:02 UTC
Created attachment 995347 [details]
rd.debug log (up to switch root) from an 02-21 (buggy) boot

This is the log of a boot of the 02-21 32-bit Workstation nightly with 'rd.debug' added. Interestingly, this boot succeeded - the first time I've ever got that image to boot.

This should provide more detail about what's going on with those /run/initramfs/squashfs mount messages. I will attach an equivalent log from 02-18 next for comparison.

Comment 12 Adam Williamson 2015-02-25 21:41:50 UTC
Created attachment 995348 [details]
rd.debug log (up to switch root) from an 02-18 (not buggy) boot

Comment 13 Adam Williamson 2015-02-25 21:50:21 UTC
Created attachment 995351 [details]
rd.debug log (up to switch root) from an 02-18 (not buggy) boot

Comment 14 Mikuláš Patočka 2015-02-25 21:54:25 UTC
It's kernel memory corruption, so it can't be caused by dracut or systemd.

I suggest that you disable kernel modules, until you find out which one caused that crash in dm-snapshot. Maybe it's bug in the squashfs kernel driver.

Comment 15 Adam Williamson 2015-02-25 22:14:38 UTC
Mikulas: sorry, I should clarify. The kernel crasher *per se* can't be caused by dracut or systemd, no. But our theory is that something has changed in dracut or systemd which is causing us to hit a codepath that triggers an underlying kernel bug, whereas we didn't before. As I mentioned above, there's a bunch of stuff happening to a mount  - /run/initramfs/squashfs - in the 02-18 logs *right before the crash is triggered* which does not happen at all in the 02-21 logs.

It'd be good to fix the kernel crasher in any case, clearly, but we suspect there are actually two bugs here:

i) the kernel crasher
ii) whatever changed in the whole 'init the live environment' area which leads to us hitting the kernel crasher

and right now I'm trying to debug ii), not i).

On that path, here's a systemd commit which looks rather like it might be relevant:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=06e97888883e2cc12eb6514e80c7f0014295f59b

Comment 16 Adam Williamson 2015-02-25 23:05:33 UTC
So, turns out Colin Walters had already done a systemd build with that commit reverted in order to fix *another* bug:

http://koji.fedoraproject.org/koji/buildinfo?buildID=614715
https://bugzilla.redhat.com/show_bug.cgi?id=1195761

I just built a live image with that build of systemd, and confirmed that it does not hit any of the "Unit run-initramfs-squashfs.mount is bound to inactive unit" messages. So I believe that reversion also resolves this bug.

Of course, upstream might not want to simply revert the change, presumably they'll want to find a way to keep the benefit without causing the breakage.

So, I'm thinking we can keep this bug for the kernel crash - because presumably the kernel shouldn't crash even when systemd is trying to unmount then remount /run/initramfs/squashfs - but drop the blocker status, and I'll file a new bug on systemd for the systemd part of it.

Comment 17 Adam Williamson 2015-02-25 23:20:52 UTC
Mikulas: so for your purposes, the new information is that this seems to be triggered by systemd doing something it didn't do before - repeatedly trying to unmount /run/initramfs/squashfs , then try to mount it again. That may help you track it down, maybe? the underlying bug may have been in the kernel for some time - we can't say it only appeared in the *kernel* recently, as what changed to make us start seeing this bug was *systemd's* behaviour.

Comment 18 Josh Boyer 2015-02-26 15:05:55 UTC
Mikulas,

I'm familiar at all with the dm code, but is there a possibility that the bdev pointer is freed out from under dm if the block device goes away?  The live image seems to do all kinds of weird things with loop devices, so I'm curious if the destruction of a loopdev after dm code is called could cause the NULL pointer error.

Looking in dm_get_device I see:

       } else {
                /* convert the path to a device */
                struct block_device *bdev = lookup_bdev(path);

                if (IS_ERR(bdev))
                        return PTR_ERR(bdev);
                dev = bdev->bd_dev;
                bdput(bdev);
        }

It gets bdev, and then assigns the local dev based on bdev->bd_dev.  But then it does a bdput on bdev, which I think essentially frees the reference count dm had on that block device.  Or is there another way that bdev is pinned and stored by dm?

Comment 19 Josh Boyer 2015-02-26 15:06:29 UTC
Ugh, my previous comment should have started: "I'm NOT (as in not at all) familiar with..."

Comment 20 Mikuláš Patočka 2015-02-26 16:24:25 UTC
The problem is that there is no clear code path that could set dm_dev->bdev to NULL.

dm_get_device calls dm_get_table_device and dm_get_table_device fills the dm_dev structure. dm_get_table_device performs td->dm_dev.bdev = NULL and then it calls open_table_device. Could open_table_device succeed and leave td->dm_dev.bdev NULL? If you look at the code, it couldn't. That would happen if blkdev_get_by_dev returned NULL, but it can't return NULL.

That's why I assume that the bug is not in dm, but somewhere else in the kernel - something is probably corrupting random memory.

I suggest: compile the kernel with debugging (CONFIG_DEBUG_PAGEALLOC, CONFIG_SLUB_DEBUG_ON and others) and hope that it catches the memory corruption sooner.

Or - try removing modules from the initrd. Remove the module for squashfs (or replace it with a dummy that fails at the beginning of squashfs_fill_super without doing anything) and see if the crash goes away. If it won't help, try removing other modules from the initrd image and find the one that causes it.

Comment 21 Adam Williamson 2015-02-26 16:46:20 UTC
It seems unusual for a bug that involves 'corrupting random memory' to always cause the *exact same thing to get set to NULL*, though? Usually when there's a bug along those lines we get a variety of somewhat different backtraces, whereas in this case we seem to keep getting this same one.

It looks like our OpenQA box actually hit this backtrace in a run *from the x86_64 network install image*, btw - I tried booting it in a VM several times myself and couldn't hit it there, though. It's certainly more common with live images.

Comment 22 Adam Williamson 2015-02-26 17:30:12 UTC
The 2015-02-21 i686 Workstation nightly has a debugging kernel (3.20.0-0.rc0.git10.1.fc22.i686), I tried booting that with _slub_debug=1 but it doesn't crash or show any warnings any earlier, still just hits the same trace.

Comment 23 Josh Boyer 2015-03-01 14:27:09 UTC
*** Bug 1197352 has been marked as a duplicate of this bug. ***

Comment 24 Colin Walters 2015-08-07 17:38:21 UTC
This is still affecting Fedora 23 Atomic:

Aug 07 16:27:26 localhost.localdomain systemd[1]: Unmounted /sysroot.
Aug 07 16:27:25 localhost.localdomain systemd[1]: Unmounting /sysroot...
Aug 07 16:27:25 localhost.localdomain systemd[1]: sysroot.mount: Unit is bound to inactive unit dev-mapper-atomicos\x2droot.device. Stopping, too.
Aug 07 16:27:15 localhost ostree-prepare-root[541]: Resolved OSTree target to: /sysroot/ostree/deploy/fedora-atomic/deploy/d69a7e73c4ccd9aba218f408791e721e46e1924bc2d55884d333a13bf229424e.0
Aug 07 16:27:15 localhost ostree-prepare-root[541]: Examining /sysroot//ostree/boot.0/fedora-atomic/c9dd3ab6bbf7cfc1971ce2ec7b3a436653a1a73c638a03213b65095e46e36ddb/0
Aug 07 16:27:14 localhost systemd[1]: Mounted /sysroot.
Aug 07 16:27:12 localhost systemd[1]: Mounting /sysroot...

Comment 25 Justin M. Forbes 2015-10-20 19:28:49 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Comment 26 Josh Boyer 2015-10-20 20:38:34 UTC
I think we actually fixed this one through a different bug.  We bisected that bug back to 3.20 kernels and Mike and company were able to provide a fix.

I'm closing this out.  If we see it again, we can reopen or just open a new bug.