Bug 22482

Summary: [GM965] Random X freezes
Product: xorg
Reporter: zOOm_ER <zOOmER.gm>
Component: Driver/intel
Assignee: Carl Worth <cworth>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: critical
Priority: high
CC: freedesktop-bugs, kan.liang, pete, quanxian.wang
Version: 7.4 (2008.09)
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (description: flags):
/var/log/Xorg.0.log: none
/etc/X11/xorg.conf: none
Xorg.0.log.old - when hang happened: none
here is compressed version of intel-gpu-dump: none
2d patch: none
mesa patch: none
xorg.0.log of starting X with 2d-driver from git: none

Description zOOm_ER 2009-06-25 12:59:55 UTC
I'm experiencing random X freezes on my Toshiba Satellite L300-110 laptop with an Intel X3100 (GM965) graphics card.

This is the same bug as http://bugs.freedesktop.org/show_bug.cgi?id=20893
I created this separate bug report only at Carl Worth's suggestion.

I'm installing a new kernel now so that I can provide dumps from intel_gpu_dump.
I'll attach them soon.
Comment 1 zOOm_ER 2009-06-25 15:28:00 UTC
I have a question: when should I run the intel_gpu_dump tool?
I mean, how much time do I have after the hang?
As far as I understand, the tool dumps the most recent calls to the GPU. So could the dump fill up with useless data after some time, pushing the data from the moment of the hang out of the dump?
Comment 2 Gordon Jin 2009-06-25 19:14:48 UTC
Just run intel_gpu_dump after the hang; you don't need to be super quick about it (e.g. within 1 second).

A "random freeze" is normally hard to reproduce and fix, so please provide more info according to http://intellinuxgraphics.org/how_to_report_bug.html.
Comment 3 zOOm_ER 2009-06-26 03:15:31 UTC
Created attachment 27156 [details]
/var/log/Xorg.0.log
Comment 4 zOOm_ER 2009-06-26 03:16:08 UTC
Created attachment 27157 [details]
/etc/X11/xorg.conf
Comment 5 zOOm_ER 2009-06-26 03:17:37 UTC
here is my system config:

u1@linux-giq8:~> uname -a
Linux linux-giq8 2.6.30-5-default #1 SMP Fri Jun 26 01:49:07 MSD 2009 i686 i686 i386 GNU/Linux

u1@linux-giq8:~> pkg-config --modversion libdrm
2.4.11

u1@linux-giq8:~> glxinfo | grep vers
server glx version string: 1.2
client glx version string: 1.4
OpenGL version string: 2.0 Mesa 7.4.2
glu version: 1.3


now trying to catch a hang...
Comment 6 zOOm_ER 2009-06-26 14:05:29 UTC
OK, I got it.
Xorg log and intel_gpu_dump output attached.
Comment 7 zOOm_ER 2009-06-26 14:06:22 UTC
Created attachment 27181 [details]
Xorg.0.log.old - when hang happened
Comment 8 zOOm_ER 2009-06-26 14:15:09 UTC
The intel_gpu_dump log is 1.5 MB.
It cannot be uploaded here or to pastebin due to its size.
How can I provide it so that it is easily accessible?
Comment 9 zOOm_ER 2009-06-27 03:19:01 UTC
Created attachment 27187 [details]
here is compressed version of intel-gpu-dump
Comment 10 Liang Kan 2009-07-02 01:55:07 UTC
We just fixed a random hang issue on OpenSolaris/GEM. Since OpenSolaris and Linux use similar X code, I hope these two patches can help with your problem.

Thanks!
Liang Kan 
Comment 11 Liang Kan 2009-07-02 01:57:51 UTC
Created attachment 27327 [details] [review]
2d patch
Comment 12 Liang Kan 2009-07-02 02:00:15 UTC
Created attachment 27328 [details] [review]
mesa patch
Comment 13 zOOm_ER 2009-07-02 04:27:46 UTC
(In reply to comment #10)
> We just fixed a random hang issue on OpenSolaris/GEM. Since OpenSolaris and
> Linux use similar X code, I hope these two patches can help with your problem.
> 
> Thanks!
> Liang Kan 
> 

I have never built X/Mesa myself.
I would be very thankful if you could give me some hints on how to apply those patches.
Comment 14 Carl Worth 2009-07-02 11:10:57 UTC
(In reply to comment #10)
> We just fixed a random hang issue on OpenSolaris/GEM. Since OpenSolaris and
> Linux use similar X code, I hope these two patches can help with your problem.
> 
> Thanks!
> Liang Kan 

Unless I'm misreading things, your patches only change the driver's behavior for 865, but this bug report is concerning the 965 hardware, so the patches won't be relevant.

Meanwhile, have you submitted the patches to the intel-gfx mailing list? That's a good place for them, (or else in an independent bug report).

Thanks, and let me know if you have any questions.

-Carl
Comment 15 zOOm_ER 2009-07-02 13:46:04 UTC
(In reply to comment #14)
> (In reply to comment #10)
> > We just fixed a random hang issue on OpenSolaris/GEM. Since OpenSolaris and
> > Linux use similar X code, I hope these two patches can help with your problem.
> > 
> > Thanks!
> > Liang Kan 
> 
> Unless I'm misreading things, your patches only change the driver's behavior
> for 865, but this bug report is concerning the 965 hardware, so the patches
> won't be relevant.
> 
> Meanwhile, have you submitted the patches to the intel-gfx mailing list? That's
> a good place for them, (or else in an independent bug report).
> 
> Thanks, and let me know if you have any questions.
> 
> -Carl
> 

From what I understand, the patches change the behavior of all chipsets except 865.
Comment 16 Carl Worth 2009-07-02 15:19:18 UTC
(In reply to comment #15)
> > Unless I'm misreading things, your patches only change the driver's behavior
> > for 865, but this bug report is concerning the 965 hardware, so the patches
> > won't be relevant.
> 
> From what I understand, the patches change the behavior of all chipsets except 865.

Yes, I did misread this. Thank you.

Eric Anholt just reviewed the patch and suspects that for OpenSolaris the patch works around a hardware issue that was worked around in Linux (in version 2.6.30) with the below commit.

So we would not expect the patch to help here.

Of course, if the patch included a description of what and why it was changing, then that would help too.

-Carl

commit 13f4c435ebf2a7c150ffa714f3b23b8e4e8cb42f
Author: Eric Anholt <eric@anholt.net>
Date:   Tue May 12 15:27:36 2009 -0700

    drm/i915: Don't allow binding objects into the last page of the aperture.
    
    This should avoid a class of bugs where the hardware prefetches past the
    end of the object, and walks into unallocated memory when the object is
    bound to the last page of the aperture.
    
    fd.o bug #21488
    
    Signed-off-by: Eric Anholt <eric@anholt.net>
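
The placement rule described in this commit message can be illustrated with a minimal sketch. This is not the kernel code: the function names, the 4096-byte page size, and the aperture size used when calling it are assumptions made only for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def touches_last_page(offset: int, size: int, aperture_size: int) -> bool:
    """Return True if [offset, offset + size) overlaps the aperture's last page."""
    last_page_start = aperture_size - PAGE_SIZE
    return offset + size > last_page_start


def can_bind(offset: int, size: int, aperture_size: int) -> bool:
    """Hypothetical placement check: reject bindings that reach the last page."""
    if offset < 0 or size <= 0:
        return False
    return not touches_last_page(offset, size, aperture_size)
```

With a hypothetical 256 MB aperture, an object placed in the final page is rejected, while one ending exactly at the start of that page is accepted, so a prefetch running slightly past the object still lands inside the aperture.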
Comment 17 Liang Kan 2009-07-02 18:49:34 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > > Unless I'm misreading things, your patches only change the driver's behavior
> > > for 865, but this bug report is concerning the 965 hardware, so the patches
> > > won't be relevant.
> > 
> > > From what I understand, the patches change the behavior of all chipsets except 865.
> Yes, I did misread this. Thank you.
> Eric Anholt just reviewed the patch and suspects that for OpenSolaris the patch
> works around a hardware issue that was worked around in Linux (in version
> 2.6.30) with the below commit.
> So we would not expect the patch to help here.
> Of course, if the patch included a description of what and why it was changing,

  Sorry for providing so little info about this patch.
  Quanxian is preparing a patch with a description. I think he is also testing it before sending it out.

> then that would help too.
> -Carl
> commit 13f4c435ebf2a7c150ffa714f3b23b8e4e8cb42f
> Author: Eric Anholt <eric@anholt.net>
> Date:   Tue May 12 15:27:36 2009 -0700
>     drm/i915: Don't allow binding objects into the last page of the aperture.
>     This should avoid a class of bugs where the hardware prefetches past the
>     end of the object, and walks into unallocated memory when the object is
>     bound to the last page of the aperture.
>     fd.o bug #21488
>     Signed-off-by: Eric Anholt <eric@anholt.net>

Yes, I once tried Eric's patch on Solaris. It helps with some hang issues but not all.
While debugging the random hang issue on Solaris, I found that the debug register DMA_FADD_P always oversteps by 0x80 when the hang occurs. E.g. if the hanging batchbuffer object's GTT offset is 0x8a31000 and its length is 0xff0, DMA_FADD_P is 0x8a32080.
So I guess this may be another prefetch-related issue. I therefore don't use the last page of any batchbuffer object. That works on Solaris.
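
The overstep in this example can be checked with a few lines of arithmetic. This is only a sketch: the 4 KiB page size and the helper names are assumptions, and the three numbers are the ones quoted in the comment above.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB GTT page


def page_end(gtt_offset: int, length: int) -> int:
    """Address just past the last page occupied by the buffer."""
    end = gtt_offset + length
    # Round up to the next page boundary.
    return (end + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE


def overstep(gtt_offset: int, length: int, dma_fadd_p: int) -> int:
    """Bytes by which the fetch address lies beyond the buffer's last page."""
    return dma_fadd_p - page_end(gtt_offset, length)


print(hex(overstep(0x8A31000, 0xFF0, 0x8A32080)))  # 0x80
```

The buffer's page ends at 0x8a32000, so a DMA_FADD_P of 0x8a32080 points 0x80 bytes into the following page, which is consistent with a fetch running past the end of the batchbuffer.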
Comment 18 Liang Kan 2009-07-02 20:44:48 UTC
(In reply to comment #13)
> (In reply to comment #10)
> > We just fixed a random hang issue on OpenSolaris/GEM. Since OpenSolaris and
> > Linux use similar X code, I hope these two patches can help with your problem.
> > 
> > Thanks!
> > Liang Kan 
> > 
> I have never built X/Mesa myself.
> I would be very thankful if you could give me some hints on how to apply those
> patches.

Please refer to the Intel Linux Graphics Driver Installation Guide 
http://intellinuxgraphics.org/install.html
Comment 19 qwang13 2009-07-03 02:46:37 UTC
This is the description for the patches in comment 11 and comment 12.
===
    For the random hang of bug 22482, we suspect it is related to a DMA
    prefetch issue. For 865G, we keep 16 as the prefetch size.
    For others, we define 4096 as the reserved size in case
    DMA prefetches outside of its range.
===

As for why we provide this patch: we simply drew this conclusion from a bunch of data on the hang issue. DMA prefetches data from the batch buffer, and the command parser then analyzes the prefetched content. Once it comes across the BATCH BUFFER END symbol, it stops. However, the prefetch size is fixed, which may cause DMA to prefetch outside of the page we allocated. This looks like an error in the parsing logic: the command parser should stop once it comes across the END, and should not need to analyze anything further even if it fetched erroneous content from outside. (It got a mess of zeros.)

For example, if we allocate 4KB and the DMA prefetch size is 1.5KB, the last prefetch will read 0.5KB from outside the batch buffer. This can cause something unexpected if the hardware logic has a fault. After changing the value from 16 to 4096 in OpenSolaris, the hang disappears. (We also tried changing it to 0x100 and other values; that did not work.) This change costs 25% of the space. (We generally allocate 4*4KB, 1KB for reserved.)
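
The numbers in this example can be reproduced with a small sketch. It models the prefetcher as reading fixed-size chunks starting at offset 0, which is a simplification assumed here for illustration; the function names are hypothetical and not taken from the patch.

```python
KB = 1024


def prefetch_overrun(buffer_size: int, prefetch_size: int) -> int:
    """Bytes read past the buffer end when fetching fixed-size chunks from offset 0."""
    remainder = buffer_size % prefetch_size
    return 0 if remainder == 0 else prefetch_size - remainder


def usable_size(alloc_size: int, reserved: int) -> int:
    """Bytes left for commands once `reserved` bytes at the end are kept unused."""
    return alloc_size - reserved


print(prefetch_overrun(4 * KB, 1536))   # 512, i.e. 0.5KB past a 4KB buffer
print(usable_size(4 * 4 * KB, 4096))    # 12288 bytes left of a 16KB allocation
print(4096 / (4 * 4 * KB))              # 0.25, the 25% cost of reserving 4096 bytes
```

Under this model, reserving at least as many bytes as the largest possible overrun keeps every prefetch inside the allocation, at the price of the unused tail.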

It should be a hardware issue; however, I cannot work out the logic behind it. This is just a guess.

zOOm_ER, I suggest you give it a try; after all, there is no other reasonable patch for this.

Comment 20 zOOm_ER 2009-07-04 04:52:50 UTC
OK, I've done some testing with these patches:
The Mesa patch by itself does not seem to change anything (i.e. I still get hangs).
The xf86-video-intel patch works (no hangs, and the clobbered text-input fields disappeared along with some other visible window defects. [I didn't mention them before because I didn't know they were related to the hangs.])

But The Bad News(tm) is that the 2D-driver patch breaks OpenGL.
All OpenGL screensavers give me a blank screen or just a frozen screen (luckily not a complete hang).
And glxgears does not start; it prints this on the console:
#glxgears
X Error of failed request:  BadRequest (invalid request code or no such operation)
  Major opcode of failed request:  136 (DRI2)
  Minor opcode of failed request:  7 ()
  Serial number of failed request:  30
  Current serial number in output stream:  30
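
When comparing error reports like the one above, it can help to pull out the opcode numbers programmatically. The sketch below parses the Xlib error text; the opcode table reflects my understanding of the DRI2 protocol headers (dri2proto) and should be treated as an assumption, as should the function name.

```python
import re

# Minor opcodes as understood from dri2proto; treat this table as an assumption.
DRI2_MINOR_OPCODES = {
    0: "DRI2QueryVersion",
    1: "DRI2Connect",
    2: "DRI2Authenticate",
    3: "DRI2CreateDrawable",
    4: "DRI2DestroyDrawable",
    5: "DRI2GetBuffers",
    6: "DRI2CopyRegion",
    7: "DRI2GetBuffersWithFormat",
}

_OPCODE_RE = re.compile(r"(Major|Minor) opcode of failed request:\s+(\d+)")


def parse_x_error(text: str) -> dict:
    """Extract the major and minor opcode numbers from Xlib error text."""
    return {kind.lower(): int(num) for kind, num in _OPCODE_RE.findall(text)}
```

Applied to the output above, this yields major opcode 136 and minor opcode 7. If the table is right, that is the DRI2GetBuffersWithFormat request being rejected by the server; the thread itself does not establish why.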
Comment 21 Liang Kan 2009-07-05 18:31:52 UTC
(In reply to comment #20)
> OK, I've done some testing with these patches:
> The Mesa patch by itself does not seem to change anything (i.e. I still get
> hangs).
> The xf86-video-intel patch works (no hangs, and the clobbered text-input fields
> disappeared along with some other visible window defects. [I didn't mention
> them before because I didn't know they were related to the hangs.])
> But The Bad News(tm) is that the 2D-driver patch breaks OpenGL.
> All OpenGL screensavers give me a blank screen or just a frozen screen (luckily
> not a complete hang).
> And glxgears does not start; it prints this on the console:
> #glxgears
> X Error of failed request:  BadRequest (invalid request code or no such
> operation)
>   Major opcode of failed request:  136 (DRI2)
>   Minor opcode of failed request:  7 ()
>   Serial number of failed request:  30
>   Current serial number in output stream:  30

Could you please provide the xf86-video-intel/xserver/mesa/libdrm versions that you built?
Comment 22 zOOm_ER 2009-07-06 02:33:32 UTC
> Could you please provide the xf86-video-intel/xserver/mesa/libdrm versions that
> you built?
> 

xf86-video-intel is the git version, with this as the last commit:
###
commit 74227141923a2f5049592219ab80e8733062a5d9
Author: Barry Scott <barry.scott@onelan.co.uk>
Date:   Tue Jun 23 14:14:50 2009 +0100

    Fix segv for clipped movie window
###

Mesa, which I built and which, as I said, does not change anything after the patch, is the git version with this as the last commit:
###
commit 862488075c5537b0613753b0d14c267527fc6199
Merge: 060c7f2... 94e1117...
Author: Jakob Bornecrantz <jakob@vmware.com>
Date:   Fri Jul 3 18:53:58 2009 +0200

    Merge branch 'mesa_7_5_branch'
###

I didn't touch drm and xserver, so they are the same as I reported before:

X.Org X Server 1.6.1
Release Date: 2009-4-14

u1@linux-giq8:~> uname -a
Linux linux-giq8 2.6.30-5-default #1 SMP Fri Jun 26 01:49:07 MSD 2009 i686 i686 i386 GNU/Linux

u1@linux-giq8:~> pkg-config --modversion libdrm
2.4.11
Comment 23 Carl Worth 2009-07-06 11:10:52 UTC
(In reply to comment #20)
> OK, I've done some testing with these patches:
> The Mesa patch by itself does not seem to change anything (i.e. I still get
> hangs).
> The xf86-video-intel patch works (no hangs, and the clobbered text-input fields
> disappeared along with some other visible window defects. [I didn't mention
> them before because I didn't know they were related to the hangs.])

I'm glad you got the hangs to go away. Are you sure it's related to the patch, though? Can you try the driver from git without the patch to see if the hangs and clobbered text fields are also fixed? I'm suspecting that the patch here has nothing to do with the behavior you're seeing.
 
> But The Bad News(tm) is that the 2D-driver patch breaks OpenGL.

And that doesn't make any sense at all. The 2D driver patch can't change OpenGL. But again, if you were updating lots of components, (from packaged versions to git, for example), in order to be able to apply the patch, then one of those updates could cause an issue like this. And if so, then that would need to be a separate bug report.

For here, let's focus on the original hang issue. I'll look forward to hearing whether the current driver from git exhibits the hangs or not.

-Carl
Comment 24 zOOm_ER 2009-07-06 12:08:32 UTC
> I'm glad you got the hangs to go away. Are you sure it's related to the patch,
> though? Can you try the driver from git without the patch to see if the hangs
> and clobbered text fields are also fixed? I'm suspecting that the patch here
> has nothing to do with the behavior you're seeing.
>
I tried the unchanged git 2D driver, and I can confirm that it fixes the clobbered text and the hangs (I will test a little more, but I have already tortured it quite a lot).
 
> > But The Bad News(tm) is that the 2D-driver patch breaks OpenGL.
> 
> And that doesn't make any sense at all. The 2D driver patch can't change
> OpenGL. But again, if you were updating lots of components, (from packaged
> versions to git, for example), in order to be able to apply the patch, then one
> of those updates could cause an issue like this. And if so, then that would
> need to be a separate bug report.
>
> For here, let's focus on the original hang issue. I'll look forward to hearing
> whether the current driver from git exhibits the hangs or not.
> 
> -Carl
> 

So... I ran "make" and "make install" in the source directory right after pulling the 2D driver from git. That is enough to fix the hangs and break glxgears... If I roll back to my distribution's 2D driver, I get hangs again and a working glxgears... I didn't touch Mesa or drm. Is this enough to create a new bug report?
Comment 25 Carl Worth 2009-07-06 12:24:48 UTC
(In reply to comment #24)
> I tried the unchanged git 2D driver, and I can confirm that it fixes the
> clobbered text and the hangs (I will test a little more, but I have already
> tortured it quite a lot).

Great news. I'll mark this bug report as fixed now.

> So... I ran "make" and "make install" in the source directory right after
> pulling the 2D driver from git. That is enough to fix the hangs and break
> glxgears... If I roll back to my distribution's 2D driver, I get hangs again
> and a working glxgears... I didn't touch Mesa or drm. Is this enough to create
> a new bug report?

It's definitely a separate bug report yes. So if you wouldn't mind opening that, that would be appreciated. The behavior does seem very odd to me still, but I'll let the new assignee comment in more detail. :-)

Thanks again for your report and testing,

-Carl

Comment 26 Michel Dänzer 2009-07-07 03:13:36 UTC
(In reply to comment #25)
> > So... I ran "make" and "make install" in the source directory right after
> > pulling the 2D driver from git. That is enough to fix the hangs and break
> > glxgears... If I roll back to my distribution's 2D driver, I get hangs again
> > and a working glxgears... I didn't touch Mesa or drm. Is this enough to
> > create a new bug report?
> 
> It's definitely a separate bug report yes.

glxgears might be broken because the DRI is disabled, which could also explain why the hangs no longer occur.
Comment 27 zOOm_ER 2009-07-07 03:20:09 UTC
(In reply to comment #26)
> glxgears might be broken because the DRI is disabled, which could also explain
> why the hangs no longer occur.
> 

how can i check for this?
Comment 28 Liang Kan 2009-07-07 18:03:51 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > glxgears might be broken because the DRI is disabled, which could also explain
> > why the hangs no longer occur.
> > 
> how can i check for this?

You can check Xorg.0.log, or you can run /usr/X11/bin/glxinfo and look at the "direct rendering" line.
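
The glxinfo check can be scripted. The following is a minimal sketch that only parses text already captured from glxinfo; the function name is hypothetical, and it relies on glxinfo printing a line of the form "direct rendering: Yes" or "direct rendering: No".

```python
import re
from typing import Optional


def direct_rendering_enabled(glxinfo_output: str) -> Optional[bool]:
    """Return True/False from the "direct rendering" line, or None if it is absent."""
    match = re.search(r"^direct rendering:\s*(\w+)", glxinfo_output, re.MULTILINE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

The text could come, for example, from `subprocess.run(["glxinfo"], capture_output=True, text=True).stdout`. A result of None means glxinfo failed before printing the line, in which case Xorg.0.log is the place to look.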
Comment 29 zOOm_ER 2009-07-08 06:32:40 UTC
I tried to run glxinfo, but it throws almost the same error as glxgears:
# glxinfo
name of display: :0.0
X Error of failed request:  BadRequest (invalid request code or no such operation)
  Major opcode of failed request:  136 (DRI2)
  Minor opcode of failed request:  7 ()
  Serial number of failed request:  25
  Current serial number in output stream:  25

i will attach Xorg.0.log
Comment 30 zOOm_ER 2009-07-08 06:35:08 UTC
Created attachment 27497 [details]
xorg.0.log of starting X with 2d-driver from git

There are many warnings, but it seems that DRI is enabled.
