Bug 27100

Summary: GPU Hung copying a 2048x1152 pixmap
Product: xorg Reporter: John <jvinla>
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: high CC: xunx.fang
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                                               Flags
Bug details.                                                              none
dmesg                                                                     none
log file                                                                  none
gpu dump part one                                                         none
gpu dump part two                                                         none
Requested batch buffer dump - hang #1 - 2.6.34-0.10.rc1.git0.fc14.x86_64  none
Other debug files.                                                        none
Requested debug files - hang #2 - 2.6.34-0.10.rc1.git0.fc14.x86_64        none
debug log files, modified i915_render.c                                   none
Use tiling by default                                                     none
How test was done.                                                        none
How test was done.                                                        none

Description John 2010-03-15 16:13:52 UTC
GPU hangs, black screen, keyboard dead - a few seconds after entering password and pressing <Enter>. Before blackout, can see desktop being drawn (does not complete login).

Intel® G31 + ICH7 Chipset

See attachment for more details.
Comment 1 John 2010-03-15 16:17:48 UTC
Created attachment 34090 [details]
Bug details.
Comment 2 John 2010-03-15 16:19:17 UTC
Created attachment 34091 [details]
dmesg
Comment 3 John 2010-03-15 16:22:19 UTC
Created attachment 34092 [details]
log file

intel_gpu_dump is over 1 MB and system will not let me submit it.
Comment 4 John 2010-03-16 09:28:03 UTC
New information.

runlevel 3, login
su
startx -- works
logout
as user, startx  -- works

restart, login as user = same old GPU hang (was not a permanent fix)  
Comment 5 Chris Wilson 2010-03-18 13:34:02 UTC
John, thanks for the report. We really do need the information contained within the GPU dump to be able to see what is causing the hang. In particular, recent kernels automatically capture the batch buffer dump in /debug/dri/.../i915_error_state [.34-rc1 and drm-intel-next]. Also there is an outside chance that there are relevant bug fixes in more recent xf86-video-intel and libdrm.
Comment 6 John 2010-03-18 14:30:50 UTC
Created attachment 34218 [details]
gpu dump part one
Comment 7 John 2010-03-18 14:31:47 UTC
Created attachment 34219 [details]
gpu dump part two
Comment 8 John 2010-03-18 14:43:23 UTC
Sorry, I guess that I need some hints.

"recent kernels automatically capture the batch buffer dump in
/debug/dri/.../i915_error_state [.34-rc1 and drm-intel-next]"
-- cannot find any such files. Do I have to enable something? Can you tell me a minimum kernel version to look for?

"Also there is an outside chance that there are relevant bug fixes in more recent xf86-video-intel and libdrm."
-- I'm up to date with the standard updates. I'm willing to try these newer versions. Where do I find them and can you give me version numbers?

I just finished putting 64 bit Fedora 12 on an extra partition (including full standard updates). Same error and odd behaviour (old kernel works).
Comment 9 Carl Worth 2010-03-18 17:08:10 UTC
(In reply to comment #8)
> Sorry, I guess that I need some hints.

No problem. We're glad to try to help you help us. :-)

So the trick with intel_gpu_dump is that it's easy for it to emit a
dump that has a lot of text, but no actual information. For example,
if you run intel_gpu_dump when things are working normally you'll
basically get a dump with no useful information in it.

But to be able to debug an error, we need to see the single batch
buffer immediately before an error occurs. Often, an error will cause
the GPU to lockup and not process any further information. So if you
run intel_gpu_dump at that point, then you can get a dump that has
actually useful information in it.

> "recent kernels automatically capture the batch buffer dump in
> /debug/dri/.../i915_error_state [.34-rc1 and drm-intel-next]"
> -- cannot find any such files. Do I have to enable something? Can you tell me a
> minimum kernel version to look for?

Since it's so hard to manually capture a useful dump as described
above (and hard for untrained users to know the difference between an
"empty" and a "useful" dump), Chris has been doing some work (on top
of some useful work from Jesse) to automate all of this.

The idea is to have the driver detect the error and automatically
capture the error-causing batch buffer and make it available.

So if you have a very recent kernel, (such as 2.6.34-rc1), then after
you mount debugfs you can do something like:

	cat /sys/kernel/debug/dri/0/i915_error_state

On my system that's running just fine now, that returns:

	no error state collected

If you can get that to return something else, then *that* will be
interesting and we'll want to see it.

That may require building your own kernel from scratch if you're not
doing that already.
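
To make the "interesting or not" test above concrete, here is a minimal C sketch. It is only an illustration, not part of intel-gpu-tools or the kernel: the function names error_state_is_interesting() and read_error_state() are made up for this example, and the only facts taken from this comment are the debugfs path and the "no error state collected" placeholder text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the text read from i915_error_state
 * looks like a captured error, 0 if it is empty or is the
 * "no error state collected" placeholder quoted above. */
int error_state_is_interesting(const char *text)
{
    static const char placeholder[] = "no error state collected";

    /* Skip leading whitespace so a stray newline does not fool the check. */
    while (*text == ' ' || *text == '\t' || *text == '\n' || *text == '\r')
        text++;

    if (*text == '\0')
        return 0;
    return strncmp(text, placeholder, sizeof(placeholder) - 1) != 0;
}

/* Hypothetical helper: reads up to len-1 bytes of the file at path into
 * buf and NUL-terminates it. Returns 0 on success, -1 if the buffer is
 * empty or the file cannot be opened (e.g. debugfs is not mounted). */
int read_error_state(const char *path, char *buf, size_t len)
{
    FILE *f;
    size_t n;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, len - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would pass "/sys/kernel/debug/dri/0/i915_error_state" to read_error_state() (as root, with debugfs mounted) and hand the buffer to error_state_is_interesting(). Only the start of the file is needed to tell the two cases apart, so a small buffer is enough for the check itself.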

-Carl
Comment 10 John 2010-03-19 07:17:59 UTC
Thanks for the hints.
I did not understand that ".34-rc1" was a kernel version. Still not sure what "drm-intel-next" means. Could not find "i915_error_state" because I was looking after a reboot and it was erased. I got greedy, enabled rawhide and upgraded versions of the kernel, xorg, etc. - and trashed my system. Hope to get you more info after a re-install. 
For what it's worth, attachment 34218 [details] is an intel_gpu_dump taken just after a gpu hang. Also, there was a slight difference between the 32 and 64 bit 2.6.32.9-67 kernels. With the 64 bit, runlevel 3, login, startx worked without hanging.
The instruction page I've been looking at says to use "boot option drm.debug=0x06". Does that still apply?
-John
Comment 11 John 2010-03-20 12:58:46 UTC
Created attachment 34265 [details]
Requested batch buffer dump - hang #1 - 2.6.34-0.10.rc1.git0.fc14.x86_64

Did not finish drawing login screen before hang.
Comment 12 John 2010-03-20 12:59:55 UTC
Created attachment 34266 [details]
Other debug files.
Comment 13 John 2010-03-20 13:03:06 UTC
Created attachment 34267 [details]
Requested debug files - hang #2 - 2.6.34-0.10.rc1.git0.fc14.x86_64

Did not finish drawing login screen. Hung earlier than run #1 did.
Comment 14 Chris Wilson 2010-03-21 09:48:49 UTC
The two hangs crash on an identical batchbuffer, which is seemingly attempting to copy a source image [2048x1152] onto a [2048x2048] pixmap. The 2048 pixel wide, 8192 byte pitch, source and destination is documented as being the largest value handled by the hardware.

The question is: are the docs wrong for this chipset, or have I overlooked something else lurking in the batchbuffer?

To test whether this is a silly off-by-one, you can try this patch:
diff --git a/src/i915_render.c b/src/i915_render.c
index 819b963..eddcd91 100644
--- a/src/i915_render.c
+++ b/src/i915_render.c
@@ -197,7 +197,7 @@ static Bool i915_check_composite_texture(ScrnInfoPtr scrn, PicturePtr picture,
 
                w = picture->pDrawable->width;
                h = picture->pDrawable->height;
-               if ((w > 2048) || (h > 2048)) {
+               if ((w >= 2048) || (h >= 2048)) {
                        intel_debug_fallback(scrn,
                                             "Picture w/h too large (%dx%d)\n",
                                             w, h);
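
For anyone following along, the arithmetic and the effect of the one-character change can be sketched in standalone C. This is not driver code: the function names are invented for illustration, and the 4 bytes per pixel is an assumption (32 bpp) chosen only because 2048 * 4 matches the 8192 byte pitch mentioned above.

```c
#define MAX_TEXTURE_DIM 2048  /* documented limit quoted in this comment */

/* Bytes per row of a linear surface. bytes_per_pixel is an assumption
 * supplied by the caller; 4 corresponds to a 32 bpp format. */
unsigned pitch_bytes(unsigned width, unsigned bytes_per_pixel)
{
    return width * bytes_per_pixel;
}

/* Original check: fall back to software only when a dimension
 * exceeds the documented limit. */
int needs_fallback_original(int w, int h)
{
    return (w > MAX_TEXTURE_DIM) || (h > MAX_TEXTURE_DIM);
}

/* Patched check: also fall back when a dimension equals the limit,
 * which is what the test patch above does. */
int needs_fallback_patched(int w, int h)
{
    return (w >= MAX_TEXTURE_DIM) || (h >= MAX_TEXTURE_DIM);
}
```

With these definitions a 2048x1152 source passes the original check (no fallback) but is rejected by the patched one, while a smaller surface such as 1024x768 is accepted by both. That is the only behavioural difference the test patch is meant to introduce.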
Comment 15 John 2010-03-21 17:21:27 UTC
1. I'm using Fedora 12 & xf86-video-intel-2.9.1 - so i915_render.c looks a little different than your version.
2. Last time I used rawhide to get newer versions of xorg, etc., I had to reinstall.
3. Made changes to xf86-video-intel-2.9.1 and installed. Hung earlier than usual.
4. If a newer version of i915_render.c is compatible with Fedora 12, please let me know so that I can try it.
5. Will attach new debug files.
  --John
========
static Bool i915_check_composite_texture(ScrnInfoPtr pScrn, PicturePtr pPict, int unit)
{
    if (pPict->repeatType > RepeatReflect)
        I830FALLBACK("Unsupported picture repeat %d\n", pPict->repeatType);

    if (pPict->filter != PictFilterNearest &&
        pPict->filter != PictFilterBilinear)
        I830FALLBACK("Unsupported filter 0x%x\n", pPict->filter);

    if (pPict->pDrawable)
    {
	int w, h, i;

	w = pPict->pDrawable->width;
	h = pPict->pDrawable->height;
	if ((w >= 2048) || (h >= 2048))
	    I830FALLBACK("Picture w/h too large -jev (%dx%d)\n", w, h);

	for (i = 0; i < sizeof(i915_tex_formats) / sizeof(i915_tex_formats[0]);
	     i++)
	{
	    if (i915_tex_formats[i].fmt == pPict->format)
		break;
	}
	if (i == sizeof(i915_tex_formats) / sizeof(i915_tex_formats[0]))
	    I830FALLBACK("Unsupported picture format 0x%x\n",
			 (int)pPict->format);
    }

    return TRUE;
}

Comment 16 John 2010-03-21 17:23:06 UTC
Created attachment 34304 [details]
debug log files, modified i915_render.c
Comment 17 John 2010-03-21 18:15:44 UTC
1. The 2048x1152 Chris mentions in comment #14 is my monitor resolution. Made me curious.
2. Tried my old monitor (1600x1200) and it works with kernels 2.6.32.9-67 and 2.6.34-0.10.rc1.git0.fc14.x86_64.
3. Both monitors are run analog.
4. 2048x1152 works very well with kernel 2.6.31.12-174.2.22.fc12.i686.
5. 2048x1152 works (at least for awhile - not tested long term) with 2.6.34-0.10.rc1.git0.fc14.x86_64, runlevel 3, login, startx. kernel 2.6.32.9-67 does not.
6. 2048x1152 does not work with 2.6.34-0.10.rc1.git0.fc14.x86_64, runlevel 5.
7. I really like 2048x1152.
  --John
Comment 18 Chris Wilson 2010-03-22 01:37:24 UTC
Just a quick note: hang-4 has an identical batchbuffer to the previous hangs, so the patch didn't take.

To clarify: in step 3, using startx, you did not launch nautilus or something to paint an image to the root window? Just wondering if you did anything to trigger this specific action of copying 2048x1152 source to a 2048x2048 pixmap. [Another question is what bit of software is creating a 2048x2048 pixmap for a 2048x1152 window...]
Comment 19 John 2010-03-22 07:31:21 UTC
Chris-
Because I was using a different version of i915_render.c, I did not use your patch. It was easier to just mod the file directly. I've checked and rechecked and it looks like everything went smoothly. The only thing I can think of is that another patch overwrote my changes. I will try to check that out. How can we show that the subroutine is **always** being called in the first place?

The test system is a generic gnome install. Most of the time it is not even completing the login screen - hard to see that I'm doing anything to cause the problem. I have no idea where a 2048x2048 image is coming from (generic gnome, not even special wallpaper).

2048x1152 works (at least for awhile - not tested long term) with
2.6.34-0.10.rc1.git0.fc14.x86_64, runlevel 3, login, startx. kernel 2.6.32.9-67
does not. Just those 3 steps, no other commands were used.
Comment 20 John 2010-03-22 09:02:33 UTC
*** Success ** 

Can now login using kernel 2.6.34-0.10.rc1.git0.fc14.x86_64

-Chris
I checked everything again and found a mistake on my part.
Somehow I installed xorg-x11-drv-intel-devel-2.9.1-1.fc12.x86_64.rpm
instead of xorg-x11-drv-intel-2.9.1-1.fc12.x86_64.rpm.

  --John
Comment 21 Carl Worth 2010-03-22 14:23:41 UTC
(In reply to comment #20)
> *** Success ** 
> 
> Can now login using kernel 2.6.34-0.10.rc1.git0.fc14.x86_64

Great!

Can you help me understand the state of the bug, then?

Did you attempt Chris's patch to explore the possibility of an off-by-one bug?

Or did you identify something else that made things work?

Thanks,

-Carl
Comment 22 Carl Worth 2010-03-22 14:44:13 UTC
Chris is clearly taking the lead on this one, so reassigning to reflect
that reality.

-Carl
Comment 23 John 2010-03-22 15:36:45 UTC
I guess that I could have been clearer about what I did. (I tried Chris's patch.)

I decided that the best way to keep changes to a minimum and myself out of trouble was to modify a Fedora rpm. That way the files would be sure to end up in the correct places. I found http://bradthemad.org/tech/notes/patching_rpms.php and set up a directory structure like he describes.

Downloaded xorg-x11-drv-intel-2.9.1-1.fc12.src.rpm
Followed instructions of bradthemad:
rpm -ivh xorg-x11-drv-intel-2.9.1-1.fc12.src.rpm
Edit i915_render.c to make the changes Chris suggested (https://bugs.freedesktop.org/show_bug.cgi?id=27100#c15)
cd src/rpm/
rpmbuild -ba SPECS/xorg-x11-drv-intel.spec

================
Created: /home/jev/src/rpm/SRPMS/xorg-x11-drv-intel-2.9.1-1.fc12.src.rpm
Created: /home/jev/src/rpm/RPMS/x86_64/xorg-x11-drv-intel-2.9.1-1.fc12.x86_64.rpm
Created: /home/jev/src/rpm/RPMS/x86_64/xorg-x11-drv-intel-devel-2.9.1-1.fc12.x86_64.rpm
Created: /home/jev/src/rpm/RPMS/x86_64/intel-gpu-tools-2.9.1-1.fc12.x86_64.rpm
=================

cd /home/jev/src/rpm/RPMS/x86_64/
su
yum localinstall --nogpg xorg-x11-drv-intel-devel-2.9.1-1.fc12.x86_64.rpm <<< oops

The next day found my mistake and did:
rpm -Uhv --force xorg-x11-drv-intel-2.9.1-1.fc12.x86_64.rpm
reboot (runlevel 5) and login -- works!

That is as far as I have gotten. Chris is obviously more knowledgeable than me, and is in a better position to track down the root cause of the problem. Please let me know if I can be of help.
Comment 24 John 2010-03-22 19:10:44 UTC
Chris -

Just for the record, I made a 32-bit, patched, version:  xorg-x11-drv-intel-2.9.1-1.fc12.i386.rpm.
So far, no problems. Can now login (runlevel 5) using kernels 2.6.32.9-67 and 2.6.32.9-70.

I'm up and running - but curious. How long does it usually take for these kinds of patches to make their way into the repositories?

Thanks for your help.

  --John
Comment 25 Chris Wilson 2010-03-23 10:35:56 UTC
Created attachment 34369 [details] [review]
Use tiling by default

John, out of curiosity, does this patch have any impact for you (after reverting the hack to cause s/w fallback for 2048 sized textures)? What it does is switch to using tiling for the large texture, which may be an effective workaround for this gpu...
Comment 26 John 2010-03-23 13:50:50 UTC
Chris, yes I'm willing to try out tiling. But I have a little problem. I cannot find the version of i830_uxa.c that you want to patch. I've found lots of versions, just not the correct one. The latest version I found was inside xf86-video-intel 2.10.903. Can you point me to the correct version? Sorry about being so clueless.

  --John
Comment 27 Chris Wilson 2010-03-23 14:01:15 UTC
2.10.903 is the current release candidate and this patch should apply on top of that. The essence of the patch is really just to set

  priv->tiling = I915_TILING_X;

in i830_uxa_create_pixmap() prior to the call to i830_uxa_compute_size(). That would be enough to test the workaround of using a tiled 2048 texture.
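
To show why the assignment has to come before the size computation, here is a rough standalone C sketch. None of this is the real xf86-video-intel code: the types and functions (pixmap_layout, compute_size, create_pixmap_layout) are hypothetical stand-ins, and the alignment numbers (512 byte wide, 8 row tall X tiles, 64 byte linear pitch alignment) are assumptions for illustration only. The real driver has further constraints that are ignored here.

```c
/* Hypothetical stand-ins; these are not the driver's real types. */
enum tiling_mode { TILING_NONE, TILING_X };

struct pixmap_layout {
    enum tiling_mode tiling;
    unsigned pitch;   /* bytes per row after alignment */
    unsigned size;    /* total bytes needed for the buffer */
};

/* Round v up to the next multiple of a. */
static unsigned align_up(unsigned v, unsigned a)
{
    return (v + a - 1) / a * a;
}

/* The size depends on the tiling mode already stored in p, so the
 * mode must be chosen before this is called. Alignment values are
 * illustrative assumptions, not taken from the driver. */
void compute_size(struct pixmap_layout *p, unsigned w, unsigned h,
                  unsigned bytes_per_pixel)
{
    unsigned pitch = w * bytes_per_pixel;
    unsigned rows = h;

    if (p->tiling == TILING_X) {
        pitch = align_up(pitch, 512);  /* assumed X tile width in bytes */
        rows = align_up(h, 8);         /* assumed X tile height in rows */
    } else {
        pitch = align_up(pitch, 64);   /* assumed linear alignment */
    }
    p->pitch = pitch;
    p->size = pitch * rows;
}

/* Mirrors the order described above: pick tiling first, then size. */
void create_pixmap_layout(struct pixmap_layout *p, unsigned w, unsigned h,
                          unsigned bytes_per_pixel)
{
    p->tiling = TILING_X;
    compute_size(p, w, h, bytes_per_pixel);
}
```

Under these assumptions a 2048x1152 pixmap at 4 bytes per pixel keeps its 8192 byte pitch, since that is already a multiple of the assumed tile width, so the test only changes how the surface is laid out in memory and not how much of it there is.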
Comment 28 John 2010-03-23 18:01:57 UTC
Created attachment 34380 [details]
How test was done.

The tiling patch seems to work OK.
Comment 29 John 2010-03-23 18:02:33 UTC
Created attachment 34381 [details]
How test was done.

The tiling patch seems to work OK.
Comment 30 Chris Wilson 2010-03-24 09:49:42 UTC
I'm satisfied that we can use this workaround:

commit 2eec53d0b9232970fe3d03ce6c8940ebeea44bee
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 23 17:28:22 2010 +0000

    uxa: Default to using TILING_X for pixmaps.
    
    On memory constrained hardware, tiling is vital for good performance as
    it minimizes cache misses.  The downside is that for older hardware
    (which often suffers from the lack of bandwidth) requires the use of
    fences for many operations, which are in short supply and so may cause
    shorter batchbuffers. However our batch buffers are typically short and
    so this is unlikely to be a concern and not affect the performance wins.
    
    A quick bit of testing suggests the effect is inconclusive on
    firefox/i945:
                      linear            tiled
      xcb             205.470           206.219
      xcb-render-0.0  404.704           388.413
      xlib            166.410           170.805
    
    A secondary effect of the patch is to workaround a G31 specific hang
    when attempting to use linear 2048x2048 surfaces. Bonus!
