Bug 103509

Summary:

drm/i915 GPU Hang in Artful Aardvark 17.10

Product:

DRI

Reporter:

Luka Paunovic <internetfazoni>

Component:

DRM/Intel

Assignee:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

major

Priority:

high

CC:

elizabethx.de.la.torre.mena, intel-gfx-bugs, internetfazoni, jeparre, marc, omega

Version:

XOrg git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

ReadyForDev

i915 platform:

I965GM

i915 features:

GPU hang

Attachments:

Description	Flags
/sys/class/drm/card0/error	none
DMESG LOG	none
/sys/class/drm/card0/error with X 1.19 and Mesa 17.2.4	none
gzipped dmesg from boot	none

Description Luka Paunovic 2017-10-29 20:38:28 UTC

Created attachment 135158 [details]
/sys/class/drm/card0/error

I have installed a fresh Ubuntu MATE 17.10 (Artful Aardvark).

This is my Intel configuration file:

$ cat /etc/X11/xorg.conf.d/20-intel.conf
Section "Device"
    Identifier  "Intel Graphics"
    Driver      "intel"
    Option      "TearFree"    "true"
    Option      "AccelMethod" "sna"
    Option      "DRI"         "3"
EndSection

I have installed the latest available drivers using the Intel Graphics Update Tool for Linux.

Because the tool wasn't able to run on 17.10, I temporarily changed /etc/lsb-release to correspond to 17.04 Zesty Zapus, and then I successfully installed the drivers using the tool.

After all this trouble I still often have issues. My screen randomly goes black for 10 seconds because of a GPU hang. Also, after some time, elements rapidly flicker or disappear in programs that use hardware acceleration (mostly Chrome; disabling hardware acceleration is not an option because Chrome then performs terribly).

VGA adapter info:

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) (rev 0c)

Dmesg:

[ 1964.877703] [drm] GPU HANG: ecode 4:0:0x54f4e8fb, in Xorg [874], reason: Hang on rcs0, action: reset
[ 1964.877707] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1964.877708] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1964.877709] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1964.877710] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1964.877711] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1964.919361] drm/i915: Resetting chip after gpu hang
[ 1972.939781] drm/i915: Resetting chip after gpu hang
[ 2004.879875] drm/i915: Resetting chip after gpu hang
[ 2258.924142] drm/i915: Resetting chip after gpu hang
[ 2376.394596] perf: interrupt took too long (7689 > 7688), lowering kernel.perf_event_max_sample_rate to 26000
[ 2417.923699] drm/i915: Resetting chip after gpu hang
[ 2708.941780] drm/i915: Resetting chip after gpu hang
[ 2738.869020] drm/i915: Resetting chip after gpu hang
[ 2760.862012] drm/i915: Resetting chip after gpu hang
[ 2770.846041] drm/i915: Resetting chip after gpu hang
[ 2780.862186] drm/i915: Resetting chip after gpu hang

CRASH DUMP IS IN ATTACHMENT
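
For anyone triaging a similar log, the frequency of resets can be read straight off the dmesg output. A minimal sketch, assuming the kernel keeps the "Resetting chip after gpu hang" wording shown above; count_gpu_resets is a hypothetical helper name, not part of any existing tool:

```shell
# count_gpu_resets (hypothetical helper): count i915 reset messages in
# dmesg-style text read from stdin.
count_gpu_resets() {
    # grep -c prints the number of matching lines (0 if none match).
    grep -c 'Resetting chip after gpu hang'
}

# Usage (assumes dmesg is readable by the current user):
#   dmesg | count_gpu_resets
```

Fed the excerpt above, this would count ten reset lines, which gives a rough measure of how often the hang recurs within one session.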

Comment 1 Luka Paunovic 2017-10-29 20:48:24 UTC

I should also add: after I log out and log in again, or restart the lightdm service (which also requires me to log in again), everything works well, but after 10-15 minutes the problems start again :/
So weird, and so annoying, and it's preventing me from using my laptop.
I hope this is fixable. I also forgot to mention that I have disabled my internal monitor from grub with: video=LVDS-1:d
and I am using my external one, VGA1.

Comment 2 Chris Wilson 2017-10-30 09:48:50 UTC

That error state is from -modesetting... It might be worth attaching the xorg.log to confirm.

Comment 3 Luka Paunovic 2017-10-30 11:19:58 UTC

here you are

Xorg.0.log https://pastebin.com/P9UuWhxe 
Xorg.0.log.old https://pastebin.com/uC1zTZCw

Comment 4 Luka Paunovic 2017-10-31 15:25:04 UTC

I just caught this in the Xorg log when I started having issues again:

(EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.


What does this mean?
How can I fix it?

Comment 5 Elizabeth 2017-10-31 21:00:20 UTC

Hello, just in case, what Mesa version do you have?
And if the issue is reproducible, a dmesg with debug info (drm.debug=0xe set on the kernel command line in grub) may be helpful.
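
On Ubuntu the parameter is normally appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. A minimal sketch of that edit, assuming GNU sed and a double-quoted value on that line; add_kernel_param is a hypothetical helper, not an existing command:

```shell
# add_kernel_param FILE PARAM (hypothetical helper): append PARAM to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already on the line (the match is
    # loose: dots in PARAM act as regex wildcards).
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Insert " PARAM" just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Usage (as root; assumes Ubuntu's default grub layout):
#   add_kernel_param /etc/default/grub drm.debug=0xe
#   update-grub
```

After the reboot, the extra drm messages appear in dmesg alongside the hang report.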

Comment 6 Luka Paunovic 2017-11-01 12:40:34 UTC

scorpius@scorpius-Vostro-A860:~$ glxinfo | grep "OpenGL version"
OpenGL version string: 2.1 Mesa 17.2.2

I have enabled debug in grub as you told me. I am now waiting for the issue to come up (if it does :() and then I will send the dmesg output.

Comment 7 Luka Paunovic 2017-11-01 13:11:55 UTC

Created attachment 135201 [details]
DMESG LOG

I have succeeded in making the bug appear again with Geeks3D GpuTest - GPU monitoring.

Here is the dmesg log.

Comment 8 Luka Paunovic 2017-11-20 21:38:00 UTC

@Elizabeth this is a severe issue. Is there going to be a fix for this?

Comment 9 Elizabeth 2017-11-21 15:53:36 UTC

(In reply to Luka Paunovic from comment #8)
> @Elizabeth this is a severe issue. Is there going to be a fix for this?
Good afternoon Luka, 
Please retest with the tip branch https://cgit.freedesktop.org/drm-tip; it has the latest commits that are being developed, particularly https://bugs.freedesktop.org/show_bug.cgi?id=103502#c3

commit 1d033beb20d6d5885587a02a393b6598d766a382
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 31 10:36:07 2017 +0000

    drm/i915: Check incoming alignment for unfenced buffers (on i915gm) 

May help with

Active (rcs0) [59]:
00000000_020ef000  8294400 7e 00 [ 1d404 00 00 00 00 ] 00 X dirty uncached (fence: 7)
00000000_00c4d000   524288 7e 00 [ 1d405 00 00 00 00 ] 00 X dirty uncached (fence: 8)
00000000_00fe6000   327680 7e 00 [ 1d405 00 00 00 00 ] 00 X dirty uncached (fence: 12)

and

commit	e5330ac1f50b897d245753828e8887f297f69dd0 (Patch)
author	Chris Wilson <chris@chris-wilson.co.uk>	2017-10-31 12:22:35 (GMT)
committer	Chris Wilson <chris@chris-wilson.co.uk>	2017-11-01 13:43:14 (GMT)

drm/i915: Check that the breadcrumb wasn't disarmed automatically before parking

with 

[drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x60/0x80 [i915], irq posted? yes, current seqno=185cf, last=185d3

Also, from the error state, this is the last instruction before the GPU hang:
0x00070c98:      0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x00070c9c:      0x03cc1400:    format 8888, pitch 5120, rop 0xcc, clipping disabled,  
0x00070ca0:      0x00000044:    dst (68,0)
0x00070ca4:      0x00010045:    dst (69,1)
0x00070ca8:      0x1da93000:    dst offset 0x1da93000
0x00070cac:      0x00000000:    src (0,0)
0x00070cb0:      0x00000080:    src pitch 128
0x00070cb4:      0x0001c000:    src offset 0x0001c000

BR, Elizabeth.

Comment 10 Luka Paunovic 2017-12-02 11:44:04 UTC

> Please retest with tip branch

I do not know how to do that.
Can you please tell me when the fix will be available for Ubuntu Artful Aardvark from the official repositories?

Comment 11 Chris Wilson 2017-12-02 11:54:31 UTC

You harry along the distribution; they should make sure that the fix is pushed out in a timely manner. To fix it yourself, you either grab a ppa that follows drm-tip (now is not the greatest moment since 4.15-rc1 is proving to be a rough ride), or roll back to the previous kernel.

Comment 12 Chris Wilson 2017-12-02 11:56:30 UTC

(Gah, wrong bug. This is not the 915gm bug who I thought was asking where they could find the fixed kernel.)

Comment 13 Luka Paunovic 2017-12-02 12:00:56 UTC

(In reply to Chris Wilson from comment #12)
> (Gah, wrong bug. This is not the 915gm bug who I thought was asking where
> they could find the fixed kernel.)

Can you please give me ANY ETA for when those fixes will be available in "INTEL GRAPHICS UPDATE TOOL FOR LINUX* OS V2.0.6"?

Will that be when a new version of the tool is released, or is it possible that the fixes will come with the current version?

Comment 14 Luka Paunovic 2017-12-17 13:46:20 UTC

Chris, can you please tell me which kernel version I can downgrade to in order to get my PC working normally again? It's terrible. I am a programmer and Linux sysadmin and I have to work, but this happens CONSTANTLY and I CONSTANTLY have to stop my work and restart lightdm. I have lost the will to work because of this. This taught me to NEVER upgrade the kernel on my desktop PC again unless I'm upgrading to a newer version of the distro! I can't believe something like this (bug) even got into the repo.

Comment 15 Luka Paunovic 2017-12-17 13:52:53 UTC

And this is the latest version I upgraded to and the issue is still present

Linux scorpius-Vostro-A860 4.13.0-19-generic #22-Ubuntu SMP Mon Dec 4 11:58:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Comment 16 omega 2017-12-17 19:10:12 UTC

@Luka: Linux sauron 4.8.0-46-generic #49-Ubuntu SMP Fri Mar 31 13:57:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux works for me. Latest kernel does not.

@Elisabeth: where do I get the package for testing the fix you provided?

Comment 17 omega 2017-12-17 19:11:04 UTC

@Luka: Linux sauron 4.8.0-46-generic #49-Ubuntu SMP Fri Mar 31 13:57:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux works for me. Latest kernel does not.

@Elisabeth: where do I get the package for testing the fix you provided?

Comment 18 Elizabeth 2017-12-18 16:14:35 UTC

(In reply to omega from comment #17)
> @Luka: Linux sauron 4.8.0-46-generic #49-Ubuntu SMP Fri Mar 31 13:57:14 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux works for me. Latest kernel does not.
> 
> @Elisabeth: where do I get the package for testing the fix you provided?
Thanks for the information, Omega. Luka, could you try it?
You can download the latest stable or mainline release from https://www.kernel.org and build it. This release has the latest fixes merged upstream, and you can build the package using this guide, section building kernel, steps 2 to 5. It may take a while to compile though.

FWIW, https://cgit.freedesktop.org/drm-tip has all the latest changes being developed, even the ones that aren't upstream yet; you can also try this branch if you have the time.

Comment 19 omega 2017-12-18 17:18:38 UTC

Installed Debian kernel package from here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.7/

Kernel is latest stable:

Linux version 4.14.7-041407-generic (kernel@gloin) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3)) #201712171031 SMP Sun Dec 17 15:33:35 UTC 2017

X does not fire up. Kernel log says:

Dec 18 18:06:33 sauron kernel: [   25.790562] [drm] GPU HANG: ecode 7:0:0x85dffffc, in Xorg [1181], reason: Hang on rcs0, action: reset
Dec 18 18:06:33 sauron kernel: [   25.790563] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Dec 18 18:06:33 sauron kernel: [   25.790563] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Dec 18 18:06:33 sauron kernel: [   25.790563] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Dec 18 18:06:33 sauron kernel: [   25.790564] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Dec 18 18:06:33 sauron kernel: [   25.790564] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Dec 18 18:06:33 sauron kernel: [   25.790614] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:06:34 sauron kernel: [   26.204998] random: crng init done
Dec 18 18:06:41 sauron kernel: [   33.756066] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:06:49 sauron kernel: [   41.756094] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:06:57 sauron kernel: [   49.754415] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:07:05 sauron kernel: [   57.753701] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:07:13 sauron kernel: [   65.785453] i915 0000:00:02.0: Resetting chip after gpu hang
Dec 18 18:07:27 sauron kernel: [   79.801912] i915 0000:00:02.0: Resetting chip after gpu hang

Comment 20 Elizabeth 2017-12-18 17:29:27 UTC

(In reply to omega from comment #19)
> Installed Debian kernel package from here:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.7/
> 
> Kernel is latest stable:
> 
> Linux version 4.14.7-041407-generic (kernel@gloin) (gcc version 7.2.0
> (Ubuntu 7.2.0-8ubuntu3)) #201712171031 SMP Sun Dec 17 15:33:35 UTC 2017
> 
> X does not fire up. Kernel log says:
> 
> Dec 18 18:06:33 sauron kernel: [   25.790562] [drm] GPU HANG: ecode
> 7:0:0x85dffffc, in Xorg [1181], reason: Hang on rcs0, action: reset...
That looks different. Could you please try latest Mesa 17.3 release? Is your xorg 1.9?

Comment 21 omega 2017-12-18 17:59:22 UTC

Updated X to 1.19.5 and Mesa to 17.2.4 (the latest available for Ubuntu).

Still no joy.

Comment 22 omega 2017-12-18 18:01:25 UTC

Created attachment 136252 [details]
/sys/class/drm/card0/error with X 1.19 and Mesa 17.2.4

Added /sys/class/drm/card0/error with X 1.19 and Mesa 17.2.4

Comment 23 Luka Paunovic 2017-12-18 21:06:38 UTC

I am no kernel developer, but is this so hard to fix? What's the catch here? I mean, is it a lack of funds or what?

Comment 24 Luka Paunovic 2017-12-22 16:05:29 UTC

It looks like this has been fixed in 

Linux scorpius-Vostro-A860 4.13.0-21-generic #24-Ubuntu SMP Mon Dec 18 17:29:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Is it? Or am I just hallucinating?! I tried to trigger the bug with the Geeks3D GpuTest (Stress test) and I couldn't do it.
So probably it's fixed?
Anyone?

Comment 25 Luka Paunovic 2017-12-24 21:12:02 UTC

It's not fixed :(
:((((((((((((

Comment 26 Luka Paunovic 2018-01-10 19:15:28 UTC

Still not fixed in 4.13.0-25-generic

Damn Intel..........

Comment 27 Luka Paunovic 2018-02-09 17:55:28 UTC

Even though this is not fixed, I found a workaround.
I was playing with the settings and I realized that UXA is more stable than SNA.
The bug didn't occur with UXA:

cat /etc/X11/xorg.conf.d/20-intel.conf

Section "Device"
	Identifier  "Intel Graphics"
	Driver      "intel"
	#Option      "TearFree"    "true"
	#Option      "AccelMethod" "sna"
	Option      "AccelMethod" "uxa"
	Option      "DRI"         "3"
EndSection

Comment 28 omega 2018-02-11 06:43:09 UTC

The proposed change of AccelMethod does not work for me.

Intel(R) HD Graphics 4600 in Intel(R) Core(TM) i7-4790K CPU
Linux kernel: 4.15.2-041502-generic #201802072230 SMP

Comment 29 Jani Saarinen 2018-03-29 07:10:23 UTC

First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but with this we are trying to understand whether the bug is still valid or not.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see if the issue is still valid there? If you cannot see the issue there, please comment on the bug.

Comment 30 omega 2018-03-29 08:08:13 UTC

The issue is still present in a fully updated Ubuntu 17.10 with the newest mainline kernel 4.16rc7 from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Comment 31 Luka Paunovic 2018-03-29 15:07:17 UTC

Yes, this is still a bug. I tried switching back to SNA because UXA sucks.
With UXA I do not have the bug, but it is still present with SNA.

Comment 32 Jani Saarinen 2018-04-25 09:21:55 UTC

OK, thanks for the feedback. Chris, any advice here?

Comment 33 Jani Saarinen 2018-05-04 12:09:48 UTC

Chris?

Comment 34 Lakshmi 2018-09-11 09:12:01 UTC

Sorry for the delay...

Luka, do you still have the issue?
Please try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
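
Before reproducing, it can help to confirm that both parameters actually reached the kernel. A small sketch; has_boot_param is a hypothetical helper that checks a command line read from stdin, and /proc/cmdline is where Linux exposes the one used at boot:

```shell
# has_boot_param PARAM (hypothetical helper): succeed if PARAM appears as a
# whole word on the kernel command line read from stdin.
has_boot_param() {
    # Split on spaces, then look for an exact, literal whole-line match.
    tr ' ' '\n' | grep -xF -- "$1" > /dev/null
}

# Usage:
#   has_boot_param drm.debug=0x1e < /proc/cmdline && echo "drm debug enabled"
#   has_boot_param log_buf_len=4M < /proc/cmdline && echo "log buffer enlarged"
#   dmesg > dmesg-full.txt   # output file name is an assumption; attach it
```

Capturing dmesg soon after the hang, with the enlarged log buffer, makes it more likely that the messages from early boot are still present.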

Comment 35 omega 2018-09-14 18:04:08 UTC

Will this be merged into the mainline kernel? Ubuntu does not yet have the latest commit:

https://git.launchpad.net/~ubuntu-kernel-test/ubuntu/+source/linux/+git/mainline-crack/log/

It would be much less effort for me to use a prebuilt kernel from the Ubuntu mainline repo instead of patching and building a kernel myself.

Comment 36 omega 2018-09-22 06:35:33 UTC

Created attachment 141685 [details]
gzipped dmesg from boot

Comment 37 omega 2018-09-22 06:35:51 UTC

I tried Kernel 4.19-rc4 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc4/ with kernel parameters drm.debug=0x1e log_buf_len=4M.

The boot process gets past the console and the X server fires up, but it then starts to hang and crashes.

Attached please find dmesg as requested.

Comment 38 omega 2018-10-01 06:52:24 UTC

I installed kernel 4.19.0-041900rc6-generic #201809301631. This seems to fix the issue.

# dmesg|grep "\(i915\|drm\)"
[    2.324905] fb: switching to inteldrmfb from VESA VGA
[    2.324969] [drm] Replacing VGA console driver
[    2.325456] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    2.325456] [drm] Driver supports precise vblank timestamp query.
[    2.325620] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    2.328116] [drm] Initialized i915 1.6.0 20180719 for 0000:00:02.0 on minor 0
[    2.352467] fbcon: inteldrmfb (fb0) is primary device
[    2.402622] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device

Comment 39 Lakshmi 2018-10-01 11:23:03 UTC

(In reply to omega from comment #38)
> I installed kernel 4.19.0-041900rc6-generic #201809301631. This seems to fix
> the issue.
Thanks for the feedback. Closing this bug as Fixed.
