59982 – Radeon: evergreen Atombios in loop during initialization on ppc64

Status:	RESOLVED WONTFIX

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account
Reported:	2013-01-28 18:37 UTC by Lucas Kannebley Tavares
Modified:	2013-02-28 17:34 UTC

See Also:	59672

Attachments
Log for module insertion errors on mainline (40.72 KB, text/plain) 2013-01-28 18:37 UTC, Lucas Kannebley Tavares
BIOS for the adapter (58.50 KB, application/octet-stream) 2013-01-31 10:57 UTC, Lucas Kannebley Tavares
possible fix (3.80 KB, patch) 2013-01-31 15:12 UTC, Alex Deucher
Dumping registers to investigate values change (1.15 KB, patch) 2013-02-20 14:47 UTC, Lucas Kannebley Tavares
dumping (1.62 KB, patch) 2013-02-20 16:33 UTC, Jerome Glisse
Fixes on the Workaround (1.64 KB, patch) 2013-02-20 18:54 UTC, Lucas Kannebley Tavares
Fixes on the Workaround (1.74 KB, patch) 2013-02-22 12:52 UTC, Lucas Kannebley Tavares
Adding tests for all-1s after every read or write (4.34 KB, patch) 2013-02-27 14:24 UTC, Lucas Kannebley Tavares

Description Lucas Kannebley Tavares 2013-01-28 18:37:52 UTC

Created attachment 73788 [details]
Log for module insertion errors on mainline

During the initialization of an evergreen adapter on ppc64 there are a lot of atombios problems reported, though none seems fatal.

This is a follow-up to bug #59672, though it is now being run on the 3.8.0-rc5 kernel. The problem is basically a series of messages like:

[  854.915511] [drm:drm_vblank_get], enabling vblank on crtc 0, ret: 0
[  854.915520] [drm:drm_calc_vbltimestamp_from_scanoutpos], crtc 0: Noop due to uninitialized mode.
[  854.915526] [drm:drm_update_vblank_count], enabling vblank interrupts on crtc 0, missed -1
[  859.924406] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[  859.924416] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing C898 (len 62, WS 0, PS 0) @ 0xC8B4

A full dump is included as an attachment.

Comment 1 Jerome Glisse 2013-01-29 03:45:21 UTC

Could you please attach your video BIOS to the bug?

Comment 2 Jerome Glisse 2013-01-29 03:46:48 UTC

cd /sys/bus/pci/devices/0000:01:05.0
sudo sh -c "echo 1 > rom"
sudo sh -c "cat rom > ~/bios.rom"

Something like that should do the trick; just change the PCI address to match your device.

Comment 3 Lucas Kannebley Tavares 2013-01-29 12:42:51 UTC

Hi Jerome, I attempted the dump without success:

[root@localhost ~]# lspci
...
0001:01:00.0 VGA compatible controller: ATI Technologies Inc ...
0001:01:00.1 Audio device: ATI Technologies Inc ...
[root@localhost ~]# cd /sys/bus/pci/devices/0001:01:00.0
[root@localhost 0001:01:00.0]# echo 1 > rom
[root@localhost 0001:01:00.0]# cat rom > ~/bios.rom
[  588.381813] pci 0001:01:00.0: Invalid ROM contents
cat: rom: Input/output error
[root@localhost 0001:01:00.0]# lspci
[  637.187942] Kernel panic - not syncing: FAIL
...
[  637.190672] ===============================
[  637.190677] [ INFO: suspicious RCU usage. ]
[  637.190683] 3.8.0-rc5-kotd+ #6 Not tainted
[  637.190687] -------------------------------
[  637.190692] include/linux/rcupdate.h:468 Illegal context switch in RCU read-side critical section!
...

I'm investigating what was wrong with the rom dump, but if you have any ideas what it could be, any help would be appreciated :)
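
The "Invalid ROM contents" message above suggests that the image read through sysfs did not look like a PCI expansion ROM. As a rough illustration (not the kernel's actual check), the sketch below validates a dumped file offline. It assumes the standard expansion ROM layout: a 0x55 0xAA signature at offset 0 and, at offset 0x18, a little-endian 16-bit pointer to a structure that begins with "PCIR". The function name check_pci_rom is hypothetical.

```python
import struct


def check_pci_rom(data):
    """Return a list of problems found in a dumped PCI option ROM image."""
    if len(data) < 0x1A:
        return ["image too short to hold a ROM header"]
    problems = []
    # A PCI expansion ROM image starts with the bytes 0x55 0xAA.
    if data[0:2] != b"\x55\xaa":
        problems.append("missing 0x55 0xAA signature at offset 0")
    # Offset 0x18 holds a little-endian pointer to the PCI data structure.
    (pcir_offset,) = struct.unpack_from("<H", data, 0x18)
    if data[pcir_offset:pcir_offset + 4] != b"PCIR":
        problems.append("no 'PCIR' structure at offset 0x%04x" % pcir_offset)
    return problems
```

A dump that reads back as all 0xFF bytes fails both checks, which would be consistent with the device not decoding the ROM at the time of the read.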

Comment 4 Lucas Kannebley Tavares 2013-01-29 12:50:22 UTC

As a side note, the kernel panic was induced by me; it means there was an access to an invalid address.

Comment 5 Jerome Glisse 2013-01-30 19:49:34 UTC

Can you try dumping the BIOS after booting with KMS disabled and nothing bound to the GPU?

Comment 6 Lucas Kannebley Tavares 2013-01-31 10:57:01 UTC

Created attachment 73987 [details]
BIOS for the adapter

Here's the dump you requested. Thanks for the reminder about modeset; I was trying to achieve it via a pci_enable/io_remap in a quirk.

Comment 7 Alex Deucher 2013-01-31 15:12:09 UTC

Created attachment 73997 [details] [review]
possible fix

For some reason the current crtc enabled bit isn't going low.  It should be low already if the crtc is off, but perhaps it has to have been on previously for it to go low.  I'm guessing that since this is a ppc system, the card was never previously posted, so the display was off before the driver loaded.  This patch checks the current state of the crtcs at driver load time.  That way we can skip disabling the display if it's already off.  That said, the messages are harmless in this case.

Comment 8 Jerome Glisse 2013-02-06 17:07:13 UTC

Here is how we try to figure out where atombios is stuck. We use the atombios disassembler:

git://people.freedesktop.org/~mhopf/AtomDis

To produce a readable file: ./atomdis bios.rom > bios.txt

Then when you get a message such as :

*ERROR* atombios stuck executing C898 (len 62, WS 0, PS 0) @ 0xC8B4

It means it's stuck executing the function at offset 0xc898 (look for c898 in your disasm output; it's EnableCRTC). Inside that atombios function it's stuck in a loop. 0xC8B4 is the offset of the instruction at which the loop was interrupted (from one run to another this offset might point to a different instruction in the same loop).

So when you look at EnableCRTC, it's stuck executing the instruction at 0xC8B4 - 0xC898 = 0x1c, which is:

  001c: 4aa59c1b01        TEST   reg[1b9c]  [.X..]  <-  01
  0021: 491c00            JUMP_NotEqual  001c

So the TEST here checks whether that register (0x1b9c << 2, i.e. register 0x6e70) has a value of:
0x..01.... or if you prefer: (READREG(0x6e70) & 0x00ff0000) == 0x00010000


Lucas, if you hit any more stuck atombios tables, don't hesitate to add them here.

To find the meaning of a register, you can grep the various header files in drivers/gpu/drm/radeon/, mostly the evergreen and modesetting ones.
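
The arithmetic described in this comment can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the regular expression assumes the message format shown in the logs in this bug.

```python
import re

# Matches e.g. "atombios stuck executing C898 (len 62, WS 0, PS 0) @ 0xC8B4"
STUCK_RE = re.compile(
    r"atombios stuck executing ([0-9A-Fa-f]+) "
    r"\(len \d+, WS \d+, PS \d+\) @ 0x([0-9A-Fa-f]+)"
)


def locate_stuck_instruction(log_line):
    """Return (table_offset, offset_within_table) from a 'stuck' message."""
    match = STUCK_RE.search(log_line)
    if match is None:
        raise ValueError("not an 'atombios stuck executing' message")
    table = int(match.group(1), 16)
    address = int(match.group(2), 16)
    return table, address - table


def reg_index_to_mmio_offset(index):
    """The disassembly shows dword indices; the byte offset is index << 2."""
    return index << 2
```

For the message quoted above this yields table 0xC898 and offset 0x1c, and reg[1b9c] maps to byte offset 0x6e70, the value to grep for in the driver headers.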

Comment 9 Jerome Glisse 2013-02-06 17:08:08 UTC

Sorry, that should be: ./atomdis bios.rom F > bios.txt

Comment 10 Lucas Kannebley Tavares 2013-02-08 17:17:52 UTC

Hi Jerome, thanks for the tips.

Well, I followed the next error:
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing CC68 (len 72, WS 0, PS 0) @ 0xCC97

down to the test at offset 0x2f in the table at 0xcc68:
command_table  0000cc68  #2c  (UpdateCRTC_DoubleBufferRegisters):
...
  0027: 5420b51b          CLEAR  reg[1bb5]  [...X]
  002b: 5420bd1b          CLEAR  reg[1bbd]  [...X]
  002f: 4a25b61b01        TEST   reg[1bb6]  [...X]  <-  01

I have a question here: how do I determine what these registers are? I couldn't match 1bb6 to anything in the radeon driver code, so I suppose that's somewhere else... or is there some other way to read that?

Anyway, I backtracked that code back to this call on atombios_crtc.c:

static void atombios_lock_crtc(struct drm_crtc *crtc, int lock)
{
...
	int index =
	    GetIndexIntoMasterTable(COMMAND, UpdateCRTC_DoubleBufferRegisters);
...
	atom_execute_table(rdev->mode_info.atom_context, index, (uint32_t *)&args);
}

which could've come from either of these:
static void atombios_crtc_prepare(struct drm_crtc *crtc)
static void atombios_crtc_commit(struct drm_crtc *crtc)

Since those are callbacks registered as helper funcs, and I'm not sure of their semantics, I ended up getting stuck :) 

static const struct drm_crtc_helper_funcs atombios_helper_funcs = {
	.prepare = atombios_crtc_prepare,
	.commit = atombios_crtc_commit,

Any ideas here?

Thanks! :)

Comment 11 Lucas Kannebley Tavares 2013-02-08 17:20:49 UTC

Never mind the question about the registers; I just re-read your post, which I should've done in the first place :)

Thanks

Comment 12 Alex Deucher 2013-02-09 14:40:59 UTC

UpdateCRTC_DoubleBufferRegisters takes the crtc hardware lock so that updates happen atomically rather than as double-buffered updates during the vupdate period.

You pass parameters to the atom table via a struct, in this case, ENABLE_CRTC_PS_ALLOCATION.

        args.ucCRTC = radeon_crtc->crtc_id;
        args.ucEnable = lock;


  0006: 370000            SET_ATI_PORT  0000  (INDIRECT_IO_MM)

Select the mmio register aperture.

  0009: 5214              CALL_TABLE  14  (ASIC_StaticPwrMgtStatusChange/SetUniphyInstance)

SetUniphyInstance updates the offset for the selected crtc based on args.ucCRTC parameter.

  000b: 0765b61bfe        AND    reg[1bb6]  [..X.]  <-  fe

This enables double buffering.

  0010: 3d650001          COMP   param[00]  [..X.]  <-  01

This checks the params to see if we are enabling the lock (args.ucEnable = ATOM_ENABLE) or disabling the lock (args.ucEnable = ATOM_DISABLE).

  0014: 443b00            JUMP_Equal  003b

If args.ucEnable == ATOM_ENABLE, jump to table offset 0x003b.

Drop the lock (args.ucEnable = ATOM_DISABLE).

  0017: 5430761a          CLEAR  reg[1a76]  [.X..]
  001b: 54306e1a          CLEAR  reg[1a6e]  [.X..]
  001f: 5430271a          CLEAR  reg[1a27]  [.X..]
  0023: 5430111a          CLEAR  reg[1a11]  [.X..]
  0027: 5420b51b          CLEAR  reg[1bb5]  [...X]
  002b: 5420bd1b          CLEAR  reg[1bbd]  [...X]
  002f: 4a25b61b01        TEST   reg[1bb6]  [...X]  <-  01

This tests the CRTC_DOUBLE_BUFFER_CONTROL.CRTC_UPDATE_PENDING bit.

  0034: 492f00            JUMP_NotEqual  002f

If the bit is high, we jump back to 0x002f.  If the bit is low, we're done.

  0037: 3a0000            SET_REG_BLOCK  0000
  003a: 5b                EOT

Take the lock (args.ucEnable = ATOM_ENABLE).

  003b: 0d25bd1b01        OR     reg[1bbd]  [...X]  <-  01
  0040: 54009e1b          CLEAR  reg[1b9e]  [XXXX]
  0044: 3a0000            SET_REG_BLOCK  0000
  0047: 5b                EOT


Just like in the other table, for some reason, the bit never goes low.

Comment 13 Lucas Kannebley Tavares 2013-02-13 13:38:01 UTC

Thanks for clarifying those things!

Well, I ran into a brand new set of questions while pursuing this.

>   0006: 370000            SET_ATI_PORT  0000  (INDIRECT_IO_MM)
> Select the mmio register aperture.

This sounds like selecting BARs, but from what I see, Region 0 would be the framebuffer (256M) and Region 2 would be the MMIO registers. Or how are those addresses mapped from within the adapter? Or does that mean that there are multiple register banks and you're picking one?

>  0009: 5214              CALL_TABLE  14  (ASIC_StaticPwrMgtStatusChange/SetUniphyInstance)
> SetUniphyInstance updates the offset for the selected crtc based on args.ucCRTC parameter.

How are parameters passed here? Does it get the same parameters that the first call received? I take it, the reference for param[00] there means ucCRTC, then. Is that it?

>   0010: 3d650001          COMP   param[00]  [..X.]  <-  01
> This checks the params to see if we are enabling the lock (args.ucEnable = 
> ATOM_ENABLE) or disabling the lock (args.ucEnable = ATOM_DISABLE)

Ok, so, why is now param[00] referencing ucEnable? What is the reference to ucCRTC here?

>  0034: 492f00            JUMP_NotEqual  002f
> If the bit is high, we jump back to 0x002f.  If the bit is low, we're done.

So, the bit being low here means we don't have an update pending. Does it being high mean that the lock is still in effect (i.e. the CLEAR commands didn't take the disables down?)? 

>  0044: 3a0000            SET_REG_BLOCK  0000
>  0047: 5b                EOT

This seems to me like stack cleanup and return (I'm guessing EOT is End Of Table). Is that correct?

On the kernel driver side, I couldn't find who is calling, or what's the purpose of the crtc_prepare and crtc_commit functions, which are the only ones apparently using this call (atombios_lock_crtc). What are they meant to do?

Thanks

Comment 14 Alex Deucher 2013-02-13 14:08:04 UTC

(In reply to comment #13)
> Thanks for clarifying those things!
> 
> Well, I ran into a brand new set of questions while pursuing this.
> 
> >   0006: 370000            SET_ATI_PORT  0000  (INDIRECT_IO_MM)
> > Select the mmio register aperture.
> 
> This sounds like selecting BARs, but from what I see, Region 0 would be the
> framebuffer (256M) and Region 2 would be the MMIO registers. Or how are
> those addresses mapped from within the adapter? Or does that mean that there
> are multiple register banks and you're picking one?

No. There's only one register BAR.  It's for selecting between the register BAR and pci config registers.  See atom_op_setport().  I've never seen a table actually use anything other than the register BAR however.

> 
> >  0009: 5214              CALL_TABLE  14  (ASIC_StaticPwrMgtStatusChange/SetUniphyInstance)
> > SetUniphyInstance updates the offset for the selected crtc based on args.ucCRTC parameter.
> 
> How are parameters passed here? Does it get the same parameters that the
> first call received? I take it, the reference for param[00] there means
> ucCRTC, then. Is that it?

They are passed to the table for execution.  See atom_execute_table().  That function takes an atom context, an index (which table to execute), and pointer to the parameter struct.

> 
> >   0010: 3d650001          COMP   param[00]  [..X.]  <-  01
> > This checks the params to see if we are enabling the lock (args.ucEnable = 
> > ATOM_ENABLE) or disabling the lock (args.ucEnable = ATOM_DISABLE)
> 
> Ok, so, why is now param[00] referencing ucEnable? What is the reference to
> ucCRTC here?

See atombios_lock_crtc().  Use this parameter struct with the UpdateCRTC_DoubleBufferRegisters table:

typedef struct _ENABLE_CRTC_PARAMETERS
{
  UCHAR ucCRTC;
  UCHAR ucEnable;
  UCHAR ucPadding[2];
}ENABLE_CRTC_PARAMETERS;

See atombios.h.

The parameter struct is 1 dword.  The first byte is ucCRTC and the second byte is ucEnable.
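
To make the byte layout concrete, here is a minimal sketch of how the two fields share one dword and how the byte-lane markers in the disassembly pick them out. It assumes the parameter dword is read as little-endian and that the markers are written with the most significant byte on the left (so `[..X.]` is the second byte); the helper names and the LANE_SHIFT table are hypothetical, and how the driver handles byte order on big-endian ppc64 is not modelled here.

```python
import struct

# Byte-lane markers as printed in the disassembly, mapped to a bit shift.
LANE_SHIFT = {"[...X]": 0, "[..X.]": 8, "[.X..]": 16, "[X...]": 24}


def pack_enable_crtc(crtc, enable):
    """Pack ucCRTC, ucEnable and two padding bytes into one 32-bit value."""
    raw = struct.pack("<BBxx", crtc, enable)
    (dword,) = struct.unpack("<I", raw)
    return dword


def select_lane(dword, marker):
    """Return the part of the dword that a lane marker refers to."""
    if marker == "[XXXX]":
        return dword
    return (dword >> LANE_SHIFT[marker]) & 0xFF
```

With crtc 1 and enable 1 the dword is 0x00000101; `select_lane(dword, "[..X.]")` returns the ucEnable byte that `COMP param[00] [..X.] <- 01` compares, while `[...X]` would return ucCRTC.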

> 
> >  0034: 492f00            JUMP_NotEqual  002f
> > If the bit is high, we jump back to 0x002f.  If the bit is low, we're done.
> 
> So, the bit being low here means we don't have an update pending. Does it
> being high mean that the lock is still in effect (i.e. the CLEAR commands
> didn't take the disables down?)? 

If the bit is high it means there is an update pending.  E.g., some change in the crtc state hasn't gone through yet.  I'm not sure why you are seeing it stuck high.

> 
> >  0044: 3a0000            SET_REG_BLOCK  0000
> >  0047: 5b                EOT
> 
> This seems to me like stack cleanup and return (I'm guessing EOT is End Of
> Table). Is that correct?

Yes, correct.

> 
> On the kernel driver side, I couldn't find who is calling, or what's the
> purpose of the crtc_prepare and crtc_commit functions, which are the only
> ones apparently using this call (atombios_lock_crtc). What are they meant to
> do?

crtc_prepare() and crtc_commit() are called before and after a modeset on the crtc object.  See drm_crtc_helper_set_mode() in drm_crtc_helper.c.

In atombios_crtc_prepare() we take the crtc hardware lock so that all updates will happen atomically, then we disable the crtc.  Then in atombios_crtc_mode_set() we set up the pll, set the crtc timing, graphics plane base address, and scaler.  Finally in atombios_crtc_commit() we enable the crtc and drop the crtc hardware lock.

Comment 15 Brian King 2013-02-18 16:22:26 UTC

Can someone clarify something here? Is the bit that we are waiting to go low a bit in the adapter's memory? If so, is it the adapter hardware that we are waiting to set this bit?

Is there any way to dump the adapter to determine its state when we hit the timeout?

Comment 16 Alex Deucher 2013-02-18 16:29:02 UTC

(In reply to comment #15)
> Can someone clarify something here? Is the bit that we are waiting to go low
> a bit in the adapter's memory? If so, is it the adapter hardware that we are
> waiting to set this bit?

It's a memory-mapped register.  We are waiting for one of the display-related bits to go low, so it's the GPU that would be updating that bit.  The timeout is in the driver.  We eventually drop out of the loop in the atom interpreter if we get stuck for more than a certain number of seconds.
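
As an illustration of that behaviour, the sketch below polls a register until a bit goes low and gives up after a timeout. It is not the interpreter's code: the function name, the injectable clock and the 5-second default (taken from the "more than 5secs" log message) are assumptions made for this example.

```python
import time


def wait_for_bit_low(read_reg, mask, timeout_s=5.0, clock=time.monotonic):
    """Poll read_reg() until (value & mask) == 0; return False on timeout."""
    deadline = clock() + timeout_s
    while read_reg() & mask:
        if clock() > deadline:
            # A driver would log "stuck in loop" here and abort the table.
            return False
    return True
```

Note that a register that reads back as 0xffffffff keeps every tested bit high, so a dead device and a bit that is genuinely stuck both end in the same timeout.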

> 
> Is there any way to dump the adapter to determine its state when we hit the
> timeout?

You can dump the mmio registers.

Comment 17 Lucas Kannebley Tavares 2013-02-20 14:47:33 UTC

Created attachment 75176 [details] [review]
Dumping registers to investigate values change

Ok, so now I've tried dumping the register we're waiting for using this patch, and the output looks like this:

   OR_REG @ 0xD8EA
EVERGREEN_CRTC_BLANK_CONTROL: 0001
0x6ED8: 10000
      dst: 
REG[0x19A4].[7:0] -> 0x04
      src: 
PS[0x00,0x0000].[7:0] -> 0x00
      dst: 
REG[0x19A4].[7:0] <- 0x04
   EOT @ 0xD8EF
EVERGREEN_CRTC_BLANK_CONTROL: 0001
0x6ED8: 10000
<<
>> execute E82E (len 91, WS 0, PS 0)
   MOVE_PS @ 0xE834
EVERGREEN_CRTC_BLANK_CONTROL: ffffffff
0x6ED8: ffffffff

I'm dumping 0x6ED8 as it is the register whose bit never goes down. Following this, all references to either register are All F's. I'm wondering if this could be my testing interfering with the adapter operation, or if this is really what's going on, as it could indicate other problems.

Can I be dumping these registers there? Does that interfere with tests?
Should I dump another register for testing? Which one would be best?

From the 0xD8EA address, I can conclude it was executing the DAC1OutputControl function from the atombios, which exited successfully. I'm investigating what happens afterwards that triggers this. Is it interrupt activation? Right now we're having to use LSIs, so it might be a problem there.

Thanks

Comment 18 Jerome Glisse 2013-02-20 15:33:41 UTC

So when all registers return 0xffffffff, it's because something went horribly wrong. Either the GPU memory controller is locked up or in a bad state, or the IOMMU is blocking things.

My guess is that enabling the crtc to start scanning triggers requests to the GPU memory controller, and those requests point to a bad address. I would triple check the memory controller setup and that the crtc base register points to valid vram inside the GPU memory address space.
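
A simple way to tell "everything reads as 0xffffffff" apart from a single register that happens to hold that value is to probe several registers at once. The sketch below is only an illustration; the function name is hypothetical and the choice of which offsets to probe is left to the caller.

```python
ALL_ONES_32 = 0xFFFFFFFF


def device_looks_dead(read_reg, probe_offsets):
    """True only if every probed 32-bit register reads back as all ones."""
    offsets = list(probe_offsets)
    if not offsets:
        return False  # nothing probed, so no evidence either way
    return all(read_reg(offset) == ALL_ONES_32 for offset in offsets)
```

For example, probing both 0x6e70 and 0x6ed8 (the two registers discussed in this bug) flags the device only when both return all ones.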

Comment 19 Jerome Glisse 2013-02-20 15:34:42 UTC

Also, does the ring/ib test that happens prior to any modesetting report success?

Comment 20 Jerome Glisse 2013-02-20 16:33:35 UTC

Created attachment 75183 [details] [review]
dumping

OK, so this dumps the registers that might trigger the GPU memory controller to start faulting.

Comment 21 Lucas Kannebley Tavares 2013-02-20 18:54:28 UTC

Created attachment 75196 [details] [review]
Fixes on the Workaround

Ok, there were some minor issues with the workaround, which are fixed here.

The output is:
[drm] ring test on 3 succeeded in 1 usecs
[drm] GRPH_PRIMARY_SURFACE[ 0]   0x0000000000000000
[drm] GRPH_SECONDARY_SURFACE[ 0] 0x0000000000000000
[drm] GRPH_PRIMARY_SURFACE[ 1]   0x0000000000000000
[drm] GRPH_SECONDARY_SURFACE[ 1] 0x0000000000000000
[drm] GRPH_PRIMARY_SURFACE[ 2]   0x0000000000000000
[drm] GRPH_SECONDARY_SURFACE[ 2] 0x0000000000000000
[drm] GRPH_PRIMARY_SURFACE[ 3]   0x0000000000000000
[drm] GRPH_SECONDARY_SURFACE[ 3] 0x0000000000000000
[drm] ib test on ring 0 succeeded in 0 usecs

And there's no longer a "stuck in loop" message, but the registers do become all f's. I'm investigating exactly where, to see if it's still the same issue.

Comment 22 Jerome Glisse 2013-02-20 22:03:03 UTC

What is weird is that it's showing all the registers with a value of 0, which would mean that my patch does nothing, but you still seem to get further along. It's probably me doing bad casting. Can you add (uint64_t) in front of each RREG32 in my patch and see if it still prints 0000000000000?

Comment 23 Lucas Kannebley Tavares 2013-02-22 12:52:32 UTC

Created attachment 75313 [details] [review]
Fixes on the Workaround

Hi Jerome, this is the patch I actually used.
I had already done what you said and also removed a couple of left shifts you had added upon reading the low words.
The results are still the same, though.

Right now I'm tracing it by instrumenting the code between the call to DAC1OutputControl in radeon_atom_encoder_dpms_avivo and the call to DPEncoderService in either radeon_dp_encoder_service or radeon_dp_link_train (not sure which one yet, but I'm guessing the first). This is to make sure the driver is not doing anything else that could be going wrong in between the calls: after that DAC1OutputControl call everything is still fine, and it's somewhere in between those calls that the adapter goes to hell.

Another thing I'd like to ask: you suggested that I "make sure the pipes are off". I've looked through the registers for something to get the DAC state or disable them, and have not found it. As far as I can tell, your patch would already make sure we're not doing improper accesses, but are there any more interesting registers I should be looking into as well?

Thanks

Comment 24 Alex Deucher 2013-02-22 15:04:13 UTC

If the card is not posted by the sbios, the display hardware is disabled until the driver attempts to initialize it.  The display controller enable bit is bit 0 of CRTC_CONTROL (0x6e70 + crtc_offset).  The DAC enable bit is bit 0 of DAC_ENABLE (0x6790).

Comment 25 Lucas Kannebley Tavares 2013-02-27 14:24:14 UTC

Created attachment 75640 [details] [review]
Adding tests for all-1s after every read or write

Ok, so after applying the attached patch, I got several false WARN_ONs (where the adapter keeps working, so it's just a regular 0xFF), and at one point I start getting real all-1s. That place is this:

WS[0x41].[31:24] <- 0x23
         MOVE_REG @ 0xD99C
EVERGREEN_CRTC_BLANK_CONTROL: 0001
0x6ED8: 10000
            src: 
WS[0x41].[31:0] -> 0x2304FFFF
            dst: 
REG[0x018A]------------[ cut here ]------------
WARNING: at drivers/gpu/drm/radeon/radeon_device.c:111
Modules linked in: radeon(+) drm_kms_helper ttm drm i2c_algo_bit i2c_core autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 sg ibmveth shpchp ext4(F) jbd2(F) mbcache(F) sr_mod(F) cdrom(F) sd_mod(F) crc_t10dif(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
NIP: d0000000069c2110 LR: d0000000069c2104 CTR: c000000000677f00
REGS: c00000000590a540 TRAP: 0700   Tainted: GF       W     (3.8.0+)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 28222482  XER: 0000000b
SOFTE: 1
CFAR: d000000006a029d0
TASK = c0000001ecd0b680[2589] 'modprobe' THREAD: c000000005908000 CPU: 4
GPR00: d0000000069c2104 c00000000590a7c0 d000000006abfd00 00000000ffffffff 
GPR04: 0000000000000001 0000000000000000 0000000000000000 000000002304ffff 
GPR08: 0000000030783031 c000000001067c50 000000000b3193b0 c000000000677f00 
GPR12: d000000006a76f30 c00000000edd0c00 00000080646700a0 0000000000000000 
GPR16: 0000010003bd0100 0000000000000000 c00000000590bc78 0000000000000030 
GPR20: c00000000590aa58 c00000000590aa50 0000000000000001 c0000001e5082000 
GPR24: c000000006fd8c80 000000002304ffff 000000000000018a 00000000ffffffff 
GPR28: 0000000000000000 000000002304ffff d000000006ab66d8 c0000001e5082000 
NIP [d0000000069c2110] .cail_reg_write+0x50/0x70 [radeon]
LR [d0000000069c2104] .cail_reg_write+0x44/0x70 [radeon]
Call Trace:
[c00000000590a7c0] [d0000000069c2104] .cail_reg_write+0x44/0x70 [radeon] (unreliable)
[c00000000590a850] [d0000000069d9530] .atom_put_dst+0x110/0x710 [radeon]
[c00000000590a920] [d0000000069dadd0] .atom_op_move+0xf0/0x1d0 [radeon]
[c00000000590a9e0] [d0000000069db1c4] .atom_execute_table_locked+0x314/0x3a0 [radeon]
[c00000000590aaf0] [d0000000069db5f8] .atom_op_calltable+0x108/0x170 [radeon]
[c00000000590ab80] [d0000000069db1c4] .atom_execute_table_locked+0x314/0x3a0 [radeon]
[c00000000590ac90] [d0000000069db5f8] .atom_op_calltable+0x108/0x170 [radeon]
[c00000000590ad20] [d0000000069db1c4] .atom_execute_table_locked+0x314/0x3a0 [radeon]
[c00000000590ae30] [d0000000069db2a4] .atom_execute_table+0x54/0x80 [radeon]
[c00000000590aed0] [d0000000069db474] .atom_asic_init+0x1a4/0x220 [radeon]
[c00000000590afb0] [d000000006a520e8] .evergreen_init+0x108/0x330 [radeon]
[c00000000590b040] [d0000000069c1d28] .radeon_device_init+0x578/0x6f0 [radeon]
[c00000000590b0e0] [d0000000069c48c0] .radeon_driver_load_kms+0xc0/0x180 [radeon]
[c00000000590b180] [d000000004eef200] .drm_get_pci_dev+0x1e0/0x2d0 [drm]
[c00000000590b240] [d0000000069a023c] .radeon_pci_probe+0xbc/0x100 [radeon]
[c00000000590b2d0] [c000000000359374] .local_pci_probe+0x64/0xb0
[c00000000590b370] [c000000000359488] .pci_call_probe+0xc8/0xf0
[c00000000590b410] [c00000000035a570] .pci_device_probe+0x90/0xb0
[c00000000590b4a0] [c000000000412004] .really_probe+0xb4/0x370
[c00000000590b550] [c000000000412320] .driver_probe_device+0x60/0xe0
[c00000000590b5e0] [c0000000004124ac] .__driver_attach+0x10c/0x110
[c00000000590b670] [c00000000040f7a8] .bus_for_each_dev+0x98/0xf0
[c00000000590b720] [c000000000411b28] .driver_attach+0x28/0x40
[c00000000590b7a0] [c0000000004106a8] .bus_add_driver+0x188/0x320
[c00000000590b840] [c000000000412c7c] .driver_register+0x9c/0x1c0
[c00000000590b8e0] [c00000000035a6b8] .__pci_register_driver+0x48/0x60
[c00000000590b960] [d000000004eef45c] .drm_pci_init+0x16c/0x1a0 [drm]
[c00000000590ba10] [d000000006a76c14] .radeon_init+0x108/0xa414 [radeon]
[c00000000590baa0] [c00000000000acc4] .do_one_initcall+0x64/0x1e0
[c00000000590bb60] [c0000000000fb0c8] .do_init_module+0x68/0x1e0
[c00000000590bc00] [c0000000000fc634] .load_module+0x8b4/0x9c0
[c00000000590bd30] [c0000000000fca18] .SyS_init_module+0x118/0x160
[c00000000590be30] [c000000000009954] syscall_exit+0x0/0x94
Instruction dump:
e9230000 ebe90330 7fe3fb78 480406e5 60000000 7fe3fb78 7fa4eb78 38a00000 
480407c1 60000000 2f83ffff 409e0008 <0fe00000> 38210090 e8010010 eba1ffe8 
---[ end trace 7065b906d56b6c01 ]---
.[31:0] <- 0x2304FFFF
         AND_REG @ 0xD9A1

This seems to imply that things go bad in AtomBIOS function #10 (MemoryPLLInit), when it executes the instruction at offset 0x94:
  0079: 0300418a01        MOVE   WS_REMIND/HI32 [XXXX]  <-  reg[018a]  [XXXX]
  007e: 5e05410000f7dfffff0001  MASK   WS_REMIND/HI32 [XXXX]  &  dff70000  |  0100ffff
  0089: 4ba50102          TEST   param[01]  [.X..]  <-  02
  008d: 449400            JUMP_Equal  0094
  0090: 0fe54120          OR     WS_REMIND/HI32 [X...]  <-  20
  0094: 01028a0141        MOVE   reg[018a]  [XXXX]  <-  WS_REMIND/HI32 [XXXX]

Any thoughts on this? I've been trying to make heads or tails of what exactly this means for a few hours now. I know it's initializing the PLL; what I don't get is why zeroing out bits 29 and 19, and then setting bit 24, would cause invalid memory accesses.

Or, did my test influence the flow of the program, and I shouldn't be reading this register shortly after writing to it?
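
For reference, the effect of the MASK step at offset 0x7e can be worked out mechanically. The sketch below is only a model of the arithmetic, with bits numbered from 0 (least significant); the helper names are hypothetical and the original contents of reg[018a] are not known from this bug.

```python
def bit_positions(value):
    """Return the 0-indexed positions of the set bits in a 32-bit value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def apply_mask(value, and_mask, or_mask):
    """Model the MASK step: keep the and_mask bits, then set the or_mask bits."""
    return (value & and_mask) | or_mask


AND_MASK = 0xDFF70000  # from "&  dff70000" in the disassembly
OR_MASK = 0x0100FFFF   # from "|  0100ffff" in the disassembly

# Bits cleared by the AND and not set again by the OR.
cleared = bit_positions(~AND_MASK & 0xFFFFFFFF & ~OR_MASK)
# Bits forced high regardless of the original register contents.
forced_high = bit_positions(OR_MASK)
```

Under this numbering the AND mask clears bits 29 and 19, and the OR mask forces bit 24 and the low 16 bits high. The value seen in the trace, 0x2304FFFF, is consistent with that followed by the `OR ... [X...] <- 20` at offset 0x90, which sets bit 29 again.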

Comment 26 Lucas Kannebley Tavares 2013-02-27 14:26:21 UTC

Btw, the calling path here seems to be
evergreen_init -> atom_asic_init -> (ASIC_Init) -> (SetMemoryClock) ->	(MemoryPLLInit)

Comment 27 Jerome Glisse 2013-02-27 15:08:44 UTC

I don't know what this register does maybe Alex can shed some light

Comment 28 Alex Deucher 2013-02-27 15:47:37 UTC

I'm trying to find out more internally.

Comment 29 Alex Deucher 2013-02-27 16:29:21 UTC

Does the card work on an x86 system (even just checking to see if the bios post screen is fine)?  I just want to confirm that it's not an issue with the card itself.

Comment 30 Lucas Kannebley Tavares 2013-02-27 17:21:02 UTC

Well, I don't have an x86 system to test that on. I could get one, in time.

What I do have are two different adapters, bought separately, on two different ppc64 systems with the exact same error.

This makes me think the adapters are fine :)

Comment 31 Lucas Kannebley Tavares 2013-02-27 17:44:03 UTC

I just altered the patch, removing the reads that were forced after writes to make it less intrusive and the results are the same.

Comment 32 Lucas Kannebley Tavares 2013-02-28 17:34:16 UTC

After some further investigation, we found that despite the fact that we were disabling MSIs, the adapter was still using them.

After we provided a 32-bit address to it, we got it to work properly. The solution to this will have to be done outside of software, so I'm closing the bug.

Thanks for all the help, guys
