Edgerouter-X JTAG debugging

I'm still playing with my domesticated ER-X to try and see whether I can divert the limited accounting features of the mt7621 to acquire accurate netflow metrics.
Since I'm a total n00b to Linux networking, I'd like to be able to debug the code in action. To that end I'm trying to set up JTAG debugging on the ER-X, which has been a bit of an ordeal.

The JTAG port on the ER-X is the standard MIPS EJTAG pinout:

Pin  Signal   Signal  Pin
1    nRST     GND     2
3    TDI      GND     4
5    TDO      GND     6
7    TMS      GND     8
9    TCK      GND     10
11   nSRST            12
13   DINT     3.3V    14

though I haven't tried wiring up the DINT pin.

After some kerfuffle, I managed to get nice, speedy JTAG communication going through an inexpensive Adafruit FT232H board using the config file I found in this article.
This let me reset the ER-X, halt and resume it, and read registers and memory, but only until partway into the Linux kernel boot. It turns out that the kernel stomps the JTAG pins into GPIO mode because of this stanza in the device tree file:

&state_default {
	gpio {
		groups = "uart2", "uart3", "pcie", "jtag";
		function = "gpio";
	};
};

Removing "jtag" from that groups list allows me to debug the kernel past that point, but then I would get "spontaneous" reboots after I hung out for a bit on a break. Well, that's the watchdog doing its job, as per another device tree entry:

		wdt: watchdog@100 {
			compatible = "mediatek,mt7621-wdt";
			reg = <0x100 0x100>;
			mediatek,sysctl = <&sysc>;
		};

Now, having turned both of those down, I'm having another problem that's likely an OpenOCD issue, though I may well be holding it wrong.
The problem is that I can set breakpoints with gdb and I'll hit them just fine. If I then remove the breakpoint and continue, all is well.
However, if I step or continue without removing the breakpoint, I get stuck, and gdb keeps breaking at the same address each time I try to continue.
I've tried using hardware breakpoints, which doesn't change anything, though I suspect OpenOCD is using a software breakpoint on the successor instruction whether on step or continue. I've started digging into the OpenOCD sources to try and figure out what's going on, but this feels like a bit of an ordeal, and I'm hoping someone here can point me at how I'm holding it wrong :slight_smile: .

Some random thoughts.. (only - no direct answers)

The gdb <-> OpenOCD protocol is called the Remote Serial Protocol, and you can see it by running "set debug remote 1" in gdb before connecting to the target. It's a bit cryptic, but you get used to it, and you might get a clue about what's happening.
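
To make those cryptic packets a little easier to read: each Remote Serial Protocol packet is a `$`, the payload, a `#`, and a two-digit hex checksum (the sum of the payload bytes modulo 256). Here is a minimal Python sketch of that framing. The function names are made up for illustration, the address is a placeholder, the breakpoint kind of 4 is assumed to mean a 4-byte MIPS32 instruction, and escaping and run-length encoding are ignored.

```python
def rsp_frame(payload: str) -> str:
    """Wrap a payload as $payload#cc, with cc the modulo-256 byte sum in hex."""
    checksum = sum(payload.encode("ascii")) % 256
    return f"${payload}#{checksum:02x}"


def rsp_unframe(packet: str) -> str:
    """Return the payload of a framed packet, checking the checksum."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        raise ValueError("not a framed RSP packet")
    payload, received = packet[1:-3], int(packet[-2:], 16)
    if sum(payload.encode("ascii")) % 256 != received:
        raise ValueError("checksum mismatch")
    return payload


# Z0 inserts a software breakpoint, Z1 a hardware breakpoint; "c" continues
# and "s" single-steps. 80001000 is a placeholder address, not a real symbol.
for payload in ("Z0,80001000,4", "Z1,80001000,4", "c", "s"):
    print(rsp_frame(payload))
```

Spotting whether gdb sends `Z0` or `Z1` in the packet log answers the software-versus-hardware breakpoint question directly.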

On desktops gdb ("break") defaults to software breakpoints. You have to explicitly request hardware breakpoints using "hbreak". I don't know which one is the default with gdb over OpenOCD.

gdb apparently requires two breakpoints for some actions (eg. single step) in some cases. Hardware breakpoints seem to behave erratically on some targets: https://forums.sifive.com/t/gdb-breakpoint-and-openocd-problem/1384/2

In OpenOCD you can set breakpoints manually (without gdb). Maybe you can try at this layer to isolate the problem.

I don't have experience with openocd on linux platforms and gdb. Microcontrollers only :confused:

The only other thing I thought of, which I was looking into for another target: kgdb over serial?

I've been trying to parse the OpenOCD -d3 logs, as I'm pretty sure the problem is on that side. It's not a bad idea to also see what gdb thinks it's doing.

Yeah, I set my captive GDB to use HW breakpoints and I see that reflected on the OpenOCD side. That being said, the code is all in RAM, so using SW bps should be workable - I should see whether that works at least as well - or poorly.

Yeah, when you step or continue, the original breakpoint has to be cleared momentarily in order to allow that instruction to execute. I guess gdb will set a breakpoint on the successor instruction(s?) then restore the original bp when the successor hits.
Actually, I wonder if gdb uses the built-in single-stepping, something to look at.
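
The "clear momentarily, step, restore" dance can be sketched with a toy model. This is not gdb or OpenOCD code, just an illustration with made-up names: a four-instruction loop, software breakpoints that overwrite the instruction with a stand-in `sdbbp`, and two ways of continuing.

```python
SDBBP = "sdbbp"  # stand-in for the software breakpoint instruction


class Target:
    """Toy CPU looping forever over a short program, one instruction per slot."""

    def __init__(self, program):
        self.mem = list(program)
        self.pc = 0
        self.saved = {}  # breakpoint address -> original instruction

    def insert_bp(self, addr):
        if addr not in self.saved:
            self.saved[addr] = self.mem[addr]
            self.mem[addr] = SDBBP

    def remove_bp(self, addr):
        self.mem[addr] = self.saved.pop(addr)

    def step(self):
        """Execute one instruction; return False if it trapped on SDBBP."""
        if self.mem[self.pc] == SDBBP:
            return False
        self.pc = (self.pc + 1) % len(self.mem)
        return True

    def run(self, limit=1000):
        """Run until a breakpoint traps (or the step limit is reached)."""
        for _ in range(limit):
            if not self.step():
                break


def naive_continue(target):
    """Resume without touching the breakpoint under the PC."""
    target.run()


def stepping_continue(target):
    """Lift the breakpoint under the PC, step over it, put it back, then run."""
    addr = target.pc
    parked_on_bp = addr in target.saved
    if parked_on_bp:
        target.remove_bp(addr)
    target.step()
    if parked_on_bp:
        target.insert_bp(addr)
    target.run()


def demo():
    target = Target(["nop", "addiu", "lw", "sw"])
    target.insert_bp(1)
    target.insert_bp(3)
    target.run()
    stops = [target.pc]
    naive_continue(target)
    stops.append(target.pc)
    stepping_continue(target)
    stops.append(target.pc)
    stepping_continue(target)
    stops.append(target.pc)
    return stops


print(demo())
```

In this model the naive continue traps again at the same address, which looks a lot like the symptom of gdb breaking at the same spot every time.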

That's a good idea, will try...

Thanks, that's something to look into.
Right now I'm deep into what may turn out to be a sunk cost fallacy, but at least the MIPS m4k is well documented and OpenOCD is OSS, so it should be possible to make this work.

Mkay, so after digging through the OpenOCD logs, what seems to be happening is that OpenOCD's handling of software breakpoints on the m4k (or perhaps on MIPS at large) is simply broken. When a software breakpoint hits, it goes to a bad case in gdb_last_signal, which in turn throws gdb for a loop.
This happens whether I use software breakpoints explicitly, or whether gdb implicitly uses them for the step or continue from a HW breakpoint.
It should be easily fixable.

Nice find! I dredged the OpenOCD bug tracker and found these: https://sourceforge.net/p/openocd/tickets/355/ https://sourceforge.net/p/openocd/tickets/371/

Weirdness: the m4k software breakpoints replace the target instruction with the SDBBP instruction. I'm no expert, but it looks like detecting those requires probing the DBp bit in the Debug register, which is CP0 Register 23, Select 0.
I don't see this read anywhere, meaning either this regressed, or the SW breakpoint implementation was never complete.
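
For reference, here is a small Python sketch of what decoding that register could look like. The bit positions are as I recall them from the EJTAG Debug register layout (DSS, DBp, DDBL, DDBS, DIB, DINT in bits 0 to 5, DM in bit 30), so double-check them against the core's manual; the function names are made up and this is not OpenOCD code.

```python
# Low cause bits of the EJTAG Debug register (CP0 Register 23, Select 0).
# Assumed layout; verify against the EJTAG specification for your core.
DEBUG_CAUSE_BITS = {
    "DSS": 0,   # single step
    "DBp": 1,   # SDBBP instruction executed (software breakpoint)
    "DDBL": 2,  # data break on load
    "DDBS": 3,  # data break on store
    "DIB": 4,   # instruction (hardware) breakpoint
    "DINT": 5,  # debug interrupt
}
DEBUG_DM_BIT = 30  # set while the core is in debug mode


def in_debug_mode(debug_reg: int) -> bool:
    return bool((debug_reg >> DEBUG_DM_BIT) & 1)


def debug_causes(debug_reg: int) -> list:
    """Names of the cause bits that are set, lowest bit first."""
    return [name for name, bit in DEBUG_CAUSE_BITS.items()
            if (debug_reg >> bit) & 1]


# Hypothetical register value: in debug mode because of an SDBBP.
value = (1 << DEBUG_DM_BIT) | (1 << 1)
print(in_debug_mode(value), debug_causes(value))
```

A debugger that sees DBp set knows the halt came from a software breakpoint rather than from a hardware breakpoint (DIB) or a single step (DSS).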

... time passes ...

This is so weird, I seem to be unable to continue execution from where an SDBBP instruction caused entry to debug mode, but then has been replaced with the original instruction.
According to the docs, as best I understand them, there's no need to acknowledge the DBp bit in any way.

"... unable to continue execution from where an SDBBP instruction caused entry to debug mode, but then has been replaced with the original instruction."

Not sure what the memory model of mips4k is, but on ARM it's necessary to flush icache after modifying code.

Good point. The seemingly random behavior I’ve seen could be explained by caching.

Mkay, by hacking around with how SDBBP and the original instructions are written to the target, I've been able to get software breakpoints working. So instead of writing the instructions with target_write_u32, I use mips32_pracc_write_mem. This is a hack, but even there I had to muck with the logic around the cache mode detection to hit the SYNCI case.
I think probably the right thing to do for the m4k case is to still write the instruction directly, but then issue a simple SYNCI instruction with the instruction address. Perhaps this needs a preceding barrier of some sort to make sure the instruction is flushed to dcache for absolute correctness.
The location and length of the breakpoint write(s) are known, and we know the m4k has SYNCI, so there's no need for all the generality in the generic mips32 case.
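
To illustrate why the missing SYNCI bites, here is a deliberately simplified Python model of a core with a write-back dcache and a separate icache. It tracks single words instead of cache lines, all names are invented, and `synci` only mimics what I understand the instruction to do (write the data line back and invalidate the matching instruction line).

```python
class CachedCore:
    """Toy core: stores land in the dcache, fetches are served by the icache."""

    def __init__(self, ram):
        self.ram = dict(ram)
        self.dcache = {}  # dirty words not yet written back: addr -> word
        self.icache = {}  # cached instructions: addr -> word

    def fetch(self, addr):
        """Instruction fetch; fills the icache from RAM on a miss."""
        if addr not in self.icache:
            self.icache[addr] = self.ram[addr]
        return self.icache[addr]

    def store(self, addr, word):
        """Data write; with write-back caching, RAM is not updated yet."""
        self.dcache[addr] = word

    def synci(self, addr):
        """Write back the dirty word and drop the stale icache entry."""
        if addr in self.dcache:
            self.ram[addr] = self.dcache.pop(addr)
        self.icache.pop(addr, None)


BP_ADDR = 0x80001000  # placeholder address
core = CachedCore({BP_ADDR: "addiu"})
core.fetch(BP_ADDR)            # the kernel has already executed this code

core.store(BP_ADDR, "sdbbp")   # breakpoint written without SYNCI
print(core.fetch(BP_ADDR))     # still the old instruction: breakpoint missed

core.synci(BP_ADDR)
print(core.fetch(BP_ADDR))     # now the core sees the breakpoint

core.store(BP_ADDR, "addiu")   # original restored without SYNCI
print(core.fetch(BP_ADDR))     # still sdbbp: stuck on the same address
```

The last case matches the "can't continue from a replaced SDBBP" behaviour: the original instruction is back in memory as far as the debugger can tell, but the core keeps fetching the stale breakpoint.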

I'm not sure why the breakpoint functions cater to 16-bit-aligned instructions by issuing two writes. Does the m4k tolerate 16-bit-aligned instructions?

So at this point I'm able to set SW breakpoints, stepi past them and continue when stopped on one. Everything seems to still go haywire when I try to step, though that's likely a different problem. Either I'm holding gdb wrong, or else my debug info is no good.

https://openocd.org/doc-release/doxygen/targetmips.html Ouch .. I think I have a headache

Mkay, so it turns out target_write_u32 actually bottlenecks in the mips32_pracc_write_mem function on the m4k. So I guess it's just down to figuring out how to probe the cache mode properly.

Yeah, I've read that, the 1004k manual and some of the code several times, and I still only understand how it's supposed to work in principle. I guess it's not crazy to use the CPU's execution unit in this way? It's certainly about as flexible as you could imagine.

And here's a patch that seems to fix it. Now I guess I need to figure out how to land this in OpenOCD.

The patch is uploaded for review, now to get on with what I actually wanted to do :slight_smile: .

Great job! Now you're the expert!! :smiley:

Mkay, next problem I guess.
While I was sorting the software breakpoints, I'd turned SMP down by adding maxcpus=0 to the bootargs. After I reverted that change and configured the 4 "CPUs" into a single SMP target, everything went haywire.

This is due to the behaviour mentioned in the mediatek script for the mt7621:

# Each core can be halted/resumed individually.
# When a VPE is halted, the another VPE within the same core will also be
# halted automatically.

Apparently the mt7621 contains two cores, each with two VPEs, and each VPE has only a single TC.

So, halting the SMP target throws OpenOCD for a loop, as half the targets enter debug mode in time, while the others don't. Then when I try to resume the target, two of them resume while the other two enter debug mode, creating a never-ending leapfrog.
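
The leapfrog can be reproduced in a toy Python model, under the assumption that a VPE cannot take a debug exception while its sibling VPE on the same core is in debug mode, so its halt request stays pending until the sibling resumes. The classes and names are invented for illustration and have nothing to do with OpenOCD's internals.

```python
class Vpe:
    def __init__(self, name):
        self.name = name
        self.in_debug = False
        self.halt_pending = False


class Core:
    def __init__(self, vpes):
        self.vpes = vpes

    def settle(self):
        """Let one pending VPE enter debug mode, unless a sibling is in it."""
        if any(v.in_debug for v in self.vpes):
            return
        for v in self.vpes:
            if v.halt_pending:
                v.halt_pending = False
                v.in_debug = True
                return


def halt_all(cores):
    """Debugger asks every running VPE to halt."""
    for core in cores:
        for v in core.vpes:
            if not v.in_debug:
                v.halt_pending = True
        core.settle()


def resume_all(cores):
    """Debugger resumes every halted VPE; pending halts may now fire."""
    for core in cores:
        for v in core.vpes:
            v.in_debug = False
        core.settle()


def halted(cores):
    return [v.name for core in cores for v in core.vpes if v.in_debug]


def make_soc():
    """Two cores with two VPEs each, like the layout described above."""
    return [Core([Vpe("vpe0"), Vpe("vpe1")]), Core([Vpe("vpe2"), Vpe("vpe3")])]


soc = make_soc()
for action in (halt_all, resume_all, halt_all, resume_all):
    action(soc)
    print(action.__name__, "->", halted(soc))
```

In this model the four VPEs are never all halted or all running at once: each resume releases one VPE per core and immediately lets its sibling's pending halt through.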

I found a hint for how a debugger can work around this in the MIPS Debug Low-Level Bring-Up Guide:

Another VPE on this core is in debug mode
A core implementing the MT-ASE executes single-threaded while in debug mode. If one VPE on a
core is in debug mode then another VPE on that same core will be unable to take a debug exception
until the first VPE leaves debug mode. The DA-net implements independent VPE run control by not
leaving a “halted” VPE in debug mode but instead, offlining all tc bound to that VPE and exiting debug
mode to allow continued debug activity on either VPE.

I guess it would be viable to implement this in OpenOCD, but I'm not sure I have the gumption for it, as I can more easily turn down two of the VPEs, or just turn down SMP entirely.

Super interesting, didn't know about this aspect of MIPS architecture.

If your goal is still the original one (fix iptables accounting), my thoughts are along the same lines as yours: you can continue developing with a single core, and it shouldn't change much about the iptables stuff.