Support for RTL838x based managed switches

Yeah, I am following that.
Looks like an ideal solution for flashing firmware, and potentially a modified U-Boot, straight from the web UI.

Hm ... The WebUI upload should be possible since

Thanks @anon13997276, but nothing works. I tried all modes. Those 4 combo ports only work when U-Boot initializes them (i.e. when loading an initramfs through them).

I tried multiple configurations, using either INTERNAL_PHY or EXTERNAL_PHY, SWITCH_PORT and SWITCH_SFP_PORT.

I also have two conflicting pieces of information about those ports. My device is an RTL8393 and U-Boot reports those ports as RTL8214FC. However, the driver detects them as 2 "External RTL8393 SERDES" and 2 "RTL8218B (external)". I don't know which one is telling me the truth. If I use SWITCH_PORT, it can detect link presence on all ports except lan49, although they still do not pass traffic.

Could you build an image from this branch and try it out? It has janh's latest code for the combo ports in and some more that might help: https://github.com/bkobl/openwrt/tree/rtl8214qf_merge

You can take the DGS-1210-16 .dts as a template.

It still does not work as expected. It is unstable, normally only working after I boot with those combo ports connected.

I was also hoping that the GPIO pins were the same, but in my tests the pin related to the SFP module (mod-def0-gpio?) is a little bit different. It is one pin lower for lan50, for example. Is there a place or method to get those pin settings?

Edgecore ECS-2100 series seems to be a similar device, possibly supportable. Does anyone know how to stop u-boot? I tried many combinations and nothing worked. They also refused to publish the GPL package.

Getting the GPIO pins for the SFP ports correct is necessary for it to work properly.

There are different ways to identify the GPIOs. The easiest is if the OEM firmware offers the show tech-support command. Just issue that and it will show, amongst a lot of other info, the GPIO configuration. If this does not work, the next thing I do is use a special SFP module which has wires soldered to the connector and exiting at the rear end of the module, so that I can trace e.g. the mod-def0-gpio to a pin, typically on the RTL8231. There is a datasheet available for that chip and it will directly give you all the GPIO numbers. The least efficient way is to dump the GPIOs from within U-Boot (rtk pinGet and the like) or from within Linux, once without and once with the module inserted/signal present. This gives only the input pins such as LOS and MOD-DEF0.
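
The last method (dumping the GPIOs once without and once with the module inserted) boils down to diffing two snapshots. A minimal sketch of that comparison follows; the pin numbers and levels are made up for illustration, and how you obtain the snapshots (U-Boot or Linux) is left out entirely.

```python
def changed_pins(without_module, with_module):
    """Return {pin: (before, after)} for every pin whose level differs."""
    changed = {}
    # Only compare pins that appear in both snapshots.
    for pin in sorted(set(without_module) & set(with_module)):
        before, after = without_module[pin], with_module[pin]
        if before != after:
            changed[pin] = (before, after)
    return changed


# Hypothetical readings (pin number -> logic level), not from a real device.
empty_cage = {10: 1, 11: 1, 12: 0, 13: 1}
module_inserted = {10: 0, 11: 1, 12: 0, 13: 1}

candidates = changed_pins(empty_cage, module_inserted)
print(candidates)
```

Any pin that shows up here is only a candidate for MOD-DEF0 (or LOS, if a signal was applied as well); repeating the comparison for each cage helps to rule out unrelated pins that happened to toggle.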

I made a mistake: the pins match exactly the ones used by the DGS-1210-16 (based only on mod-def0-gpio). Anyway, it does not work. As it is a different SoC, those patches might need to be ported to RTL839x.
They might work nicely with the DGS-1210-28. I might test them next week.

I'll submit the DGS-1210-52 anyway, as living without 4 of the 52 ports might be an acceptable price to pay for having OpenWrt. https://github.com/openwrt/openwrt/pull/10227
I added a DTS macro to easily flip the ports on/off for further tests.

I am not sure whether you mentioned it before; if you did, I missed it. The PHY code for the RTL8214FC is heavily dependent on the SoC. It starts with the fact that the PHY ID of the external RTL8218B is identical to that of the RTL8214FC, so in order to identify it, one needs to check which ports it is connected to: only the high ports (20-27) on the RTL838x are linked to a SerDes capable of QSGMII and 1000BX, and it even depends on whether it is an RTL8382 or an RTL8380. On the RTL839x these are different ports (>= 48). But it does not end there. There is a configuration patch for the PHY that depends on its version and on the SoC, and currently it is only applied on the RTL838x. The RTL PHYs are heavily linked to the RTL SoCs: the RTL8221B found on the RTL93xx devices even has a different PHY ID than the allegedly same chip found on 2.5 GBit adapters that are sold individually and supported by the latest Linux kernels.
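
To make the identification rule above concrete, here is a rough sketch of the decision as described in this post. It is not the driver's actual code: the function names and the PHY ID constant are placeholders, and the RTL8380/RTL8382 distinction and the PHY version checks are deliberately left out.

```python
# Placeholder for the PHY ID shared by both chips; not taken from the thread.
EXAMPLE_SHARED_PHY_ID = 0x001CC981


def serdes_capable_port(soc_family, port):
    """Port ranges as stated in the post; RTL8380 vs RTL8382 is ignored."""
    if soc_family == "rtl838x":
        return 20 <= port <= 27
    if soc_family == "rtl839x":
        return port >= 48
    raise ValueError("unknown SoC family: " + soc_family)


def classify_phy(soc_family, port, phy_id):
    """Guess which chip sits behind a port when the PHY ID is ambiguous."""
    if phy_id != EXAMPLE_SHARED_PHY_ID:
        return "other"
    if serdes_capable_port(soc_family, port):
        return "RTL8214FC"
    return "RTL8218B (external)"


print(classify_phy("rtl838x", 24, EXAMPLE_SHARED_PHY_ID))
print(classify_phy("rtl839x", 47, EXAMPLE_SHARED_PHY_ID))
```

The point of the sketch is only that the same PHY ID yields a different answer depending on the SoC family and port number, which is why code written for the RTL838x cannot simply be reused on the RTL839x.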

I just extended my branch with full support for HPE 1920-8G.

This time, the SFP ports are working right away. The only issue with them is that there is no TX disable support. The bootloader has code specifically for this device, which toggles GPIOs 27/28 on the RTL8231, but these aren't actually connected anywhere.

There was another issue though, the built-in watchdog fails to reboot this device and it hangs. As the bootloader uses the external watchdog for rebooting, I added that to the device tree as GPIO watchdog.

However, this resulted in the watchdog triggering during PHY initialization. The problem is that the PHY patching takes longer than the watchdog timeout, and as the kernel is built without preemption, the kernel fails to schedule the watchdog ping in time. After adding a call to cond_resched() in rtl838x_smi_wait_op, this issue is gone. I'm not sure if this is the ideal solution here?
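
The effect can be illustrated with a small simulation. This is only a toy model, not kernel code: time advances per simulated SMI operation, and it assumes that every yield gives the watchdog ping task a chance to run. All numbers are invented.

```python
def watchdog_survives(num_ops, op_ms, timeout_ms, yield_between_ops):
    """Simulate PHY patching on a kernel without preemption.

    Returns True if the gap between watchdog pings never reaches timeout_ms.
    """
    now = 0
    last_ping = 0
    for _ in range(num_ops):
        now += op_ms  # busy-wait for one SMI operation
        if now - last_ping >= timeout_ms:
            return False  # watchdog fired before anything could ping it
        if yield_between_ops:
            last_ping = now  # simplification: each yield lets the ping run
    return True


# Hypothetical figures: 1000 operations of 5 ms each, 1 s watchdog timeout.
print(watchdog_survives(1000, 5, 1000, yield_between_ops=False))
print(watchdog_survives(1000, 5, 1000, yield_between_ops=True))
```

In the model, a patch run longer than the timeout only survives when the loop yields between operations, which matches the behaviour seen after adding the reschedule point to the SMI wait function.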

With that change, both 1920-8G and 1920-16G should be practically fully working, so I'm thinking about submitting the changes soon. The only remaining thing that doesn't work fully is the Power LED. But I don't think there is an easy fix for that, as it is controlled via LED_SW_CTRL (at least the bootloader sets it to be permanently on).

@svanheule: What is the status of your sys-led patch series?

It should very likely be safe: there is a global SMI lock held while you reschedule, and polling by the SoC is disabled for the entire PHY package. So neither the CPU nor the SoC should be doing anything dangerous with the PHY or SMI registers while the context is switched out. But have you tried simply enabling preemptive scheduling? The intention was always to allow preemption; the merely voluntary yielding of the context was an oversight.

That one should be good to go. I was busy with other things in the meantime and decided to wait for some more feedback or Tested-by's.

Removing the setup calls for the LEDs should be fine. I always wanted to keep them because I had hoped to be able to fix the 3/4 byte issue of the NOR driver by restoring e.g. the PLL values for the SPI core, or even by resetting the flash chip before reset. But none of that alone helps to allow the devices to reboot. The only thing that does help is using the original NOR flash driver instead of the current SPI driver. The issue seems to be that the internal NOR flash access hardware, which is used to bring U-Boot into memory during boot, is put into some broken state by the default Linux NOR driver using the SPI driver, and resetting the NOR/SPI core on the SoC does not work. My hope that it was merely the flash chip itself turned out to be wrong: resetting the chip before reboot does not help, and I even tested this using the chip's reset pin.

Do you know the specific difference that causes this? We could write and submit behaviour specific to rtl8380/81/82 devices; that's why the maintainers insist on different compatibles after all.

Figuring out the differences is hard. It is a comparison between apples and pears: the original driver is a NOR flash driver (it could be using an I2C or USB bus to communicate with the flash), while the current driver is an SPI driver. The NOR driver did plenty of stuff specific to the DMA flash access these SoCs provide and made sure there was no collision. It read out the 3/4 byte strapping pins and then adapted the way it worked, always making sure that if there was a parallel DMA-triggered access, the flash was in the state the DMA access controller expected, such as the number of parallel IO pins and 3/4 byte access. This is much easier to control in a NOR driver than in an SPI driver. BTW: we do not even switch this DMA access off after booting. The SDK v.4 only uses a NOR driver doing these DMA accesses for flash access (MIO, memory mapped access, see datasheet, 'rtk_spi_flash_mio.c', 'rtk_nor_flash.c'). The original driver was based on the PIO version (programmable IO, see datasheet) of the SDK v.2, which could do parallel IO, but not DMA. The SPI driver does neither parallel IO nor, obviously, any DMA, while the DMA controller, if it was initially configured by U-Boot to do so, will continue to try parallel IO. There is probably a very good reason the RTL93xx got an additional proper SPI controller, and it is interesting that no other SPI devices are found on RTL838x devices; the only example I know of is the MAX3421 in an RTL839x system. My current plan is to write a NOR driver based on the MIO code in the SDK v.4 which could be used on all the systems that have the 3/4 byte issue. There is no problem in having both a NOR driver and an SPI driver, and the NOR driver might also be considerably faster than the current SPI driver (it could do quad IO) and use much less CPU time (because of DMA).

I'm getting some rare boot locks with this error repeating endlessly in an RTL8393 switch (DGS-1210-52):

[    1.583470] irq 0: nobody cared (try booting with the "irqpoll" option)
[    1.590875] CPU: 0 PID: 1033 Comm: kworker/u4:8 Not tainted 5.10.127 #0
[    1.598250] Stack : 806aaf4c 80725df8 80740000 800805e4 807b0000 806abc84 00000000 00000000
[    1.607638]         8200bdcc 808f0000 80684284 829a3898 80734be7 00000001 8200bd70 d5f33898
[    1.617028]         00000000 00000000 80684284 8200bc08 ffffefff 00000000 00000000 ffffffea
[    1.626408]         0000006b 8200bc14 0000006b 8073a128 00000000 00000000 00000000 80680000
[    1.635790]         80724d0c 00000008 807a96f4 80725df8 00000018 8035de4c 00000000 808f0000
[    1.645178]         ...
[    1.647920] Call Trace:
[    1.650664] [<80006d68>] show_stack+0x30/0x100
[    1.655638] [<80303524>] dump_stack+0x9c/0xcc
[    1.660513] [<80086e8c>] __report_bad_irq+0x5c/0xf4
[    1.665974] [<800873e0>] note_interrupt+0x2cc/0x31c
[    1.671427] [<80083ae4>] handle_irq_event_percpu+0x5c/0x74
[    1.677567] [<8008938c>] handle_percpu_irq+0x88/0xb8
[    1.683124] [<80082fe8>] generic_handle_irq+0x44/0x5c
[    1.688781] [<80335430>] realtek_gpio_irq_handler+0xc8/0x1a8
[    1.695110] [<80082fe8>] generic_handle_irq+0x44/0x5c
[    1.700767] [<8031f634>] realtek_irq_dispatch+0x94/0x158
[    1.706708] [<80082fe8>] generic_handle_irq+0x44/0x5c
[    1.712379] [<805e3ab4>] do_IRQ+0x1c/0x2c
[    1.716862] [<8031ed3c>] plat_irq_dispatch+0x68/0xf0
[    1.722418] [<800020a8>] except_vec_vi_end+0xb8/0xc4
[    1.727981] [<800b8060>] smp_call_function_single+0x158/0x1f0
[    1.734414] [<800b8700>] smp_call_function+0x24/0x30
[    1.739964] [<80010284>] flush_tlb_mm+0x4c/0xec
[    1.745035] [<8016ca84>] tlb_finish_mmu+0x16c/0x1bc
[    1.750485] [<8016a4cc>] exit_mmap+0x100/0x1ec
[    1.755470] [<80029bc8>] mmput+0x48/0xfc
[    1.759868] [<801abe04>] free_bprm+0x28/0xc0
[    1.764651] [<801ae0f8>] kernel_execve+0x150/0x1c8
[    1.770014] [<80043e98>] call_usermodehelper_exec_async+0x110/0x190
[    1.777022] [<80001b98>] ret_from_kernel_thread+0x14/0x1c
[    1.783060] 
[    1.784724] handlers:
[    1.787250] Disabling IRQ #0

Maybe something is not being initialized that should be. These are the system interrupts (/proc/interrupts) once it boots correctly:

           CPU0       CPU1       
  7:     238025     240801      MIPS   7  timer
  8:      20189      22824      MIPS   0  IPI call
  9:        573        639      MIPS   1  IPI resched
 19:          0          0  realtek-rtl-intc  19  realtek-otto-wdt
 20:          5          3  realtek-rtl-intc  20  rtl839x-link-state
 24:       2819       2729  realtek-rtl-intc  24  eth0
 31:        140        148  realtek-rtl-intc  31  ttyS0
ERR:         67

I have seen this a handful of times before, too, in combination with the MAX3421, which already generates IRQs at a very high frequency during boot. There is a race condition somewhere during initialization of the IRQs. I have only seen this happen during boot, though. Once the system is up, it is stable, so this points towards an issue with the initialization.

Same here. It only happens during boot and never again. However, once it happens, it loops and you need to restart the system manually.

I believe my device does not have a MAX3421, as there is no USB port (that I noticed). So it might be something else generating those IRQs. It looks like my device is dual core (I didn't notice before checking /proc/interrupts). Maybe the issue is a race between cores.

Meanwhile, can we have a hack that ignores that kind of IRQ? I was hoping the "Disabling IRQ #0" message would mean exactly that, but they keep coming.

Looking at the code again, it looks like generic_handle_irq() might be called with IRQ 0 if the interrupt line cannot be resolved (for whatever reason). That's obviously a bug and should not happen.

IRQ 0 is not a valid IRQ on this platform. So even if the subsystem tries to disable it, that probably fails.
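
As a rough model of the kind of guard this implies (not the actual gpio-realtek-otto code), the dispatch loop below skips any pending line that cannot be mapped and reports it instead of handing "IRQ 0" to the generic handler. The function name, the mapping table and the IRQ numbers are all hypothetical.

```python
def dispatch_gpio_irqs(pending_mask, line_to_virq, handlers):
    """Dispatch pending GPIO lines; return the lines that could not be mapped."""
    unmapped = []
    line = 0
    while pending_mask:
        if pending_mask & 1:
            virq = line_to_virq.get(line, 0)  # 0 means "no mapping found"
            if virq == 0:
                unmapped.append(line)  # report it, do not handle "IRQ 0"
            else:
                handlers[virq](line)
        pending_mask >>= 1
        line += 1
    return unmapped


# Hypothetical setup: only GPIO line 1 is mapped (to IRQ 40).
seen = []
handlers = {40: seen.append}
unmapped = dispatch_gpio_irqs(0b1010, {1: 40}, handlers)
print(seen, unmapped)
```

Here line 1 is handled normally, while line 3 is pending without a mapping and ends up in the returned list. Knowing which line that is would be the useful piece of information for tracking down the source of the spurious interrupt.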

Anyone who has an RTL839x device and is feeling like doing a few reboots: the top patch in my test/spurious-gpio-irq staging branch will provide feedback if it can't map a HW IRQ.

I've rebooted twice, but the issue didn't trigger. Note that if this is really a bad IRQ, it will now be ignored until it is cleared in some other way. Please make sure the interrupt priority for the main UART is set high enough, so it will actually print. It's entirely possible that this will cause infinite spamming of messages, but at least we'll (hopefully) know the GPIO line triggering the bad IRQ.