[sfp] module transmit faults

  • Master | 19.07
  • ALLNET ALL4781 (small form factor DSL modem)
ethtool -m eth2
    Identifier                                : 0x03 (SFP)
    Extended identifier                       : 0x04 (GBIC/SFP defined by 2-wire interface ID)
    Connector                                 : 0x22 (RJ45)
    Transceiver codes                         : 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00
    Transceiver type                          : Ethernet: 1000BASE-SX
    Encoding                                  : 0x01 (8B/10B)
    BR, Nominal                               : 1300MBd
    Rate identifier                           : 0x00 (unspecified)
    Length (SMF,km)                           : 0km
    Length (SMF)                              : 0m
    Length (50um)                             : 0m
    Length (62.5um)                           : 0m
    Length (Copper)                           : 255m
    Length (OM3)                              : 0m
    Laser wavelength                          : 0nm
    Vendor name                               : ALLNET
    Vendor OUI                                : 00:0f:c9
    Vendor PN                                 : ALL4781
    Vendor rev                                : V3.4
    Option values                             : 0x08 0x00
    Option                                    : Retimer or CDR implemented
    BR margin, max                            : 0%
    BR margin, min                            : 0%
    Vendor SN                                 : 0000000FC9157640
    Date code                                 : 18032900

The following shows up frequently in the logs:

libphy: SFP I2C Bus: probed
sfp sfp: module ALLNET ALL4781 rev V3.4 sn 0000000FC9157640 dc 29-03-18
sfp sfp: unknown connector, encoding 8b10b, nominal bitrate 1.3Gbps +0% -0%
sfp sfp: 1000BaseSX+ 1000BaseLX- 1000BaseCX- 1000BaseT- 100BaseTLX- 1000BaseFX- BaseBX10- BasePX-
sfp sfp: 10GBaseSR- 10GBaseLR- 10GBaseLRM- 10GBaseER-
sfp sfp: Wavelength 0nm, fiber lengths:
sfp sfp: 9µm SM : unsupported
sfp sfp: 62.5µm MM OM1: unsupported/unspecified
sfp sfp: 50µm MM OM2: unsupported/unspecified
sfp sfp: 50µm MM OM3: unsupported/unspecified
sfp sfp: 50µm MM OM4: 2.540km
sfp sfp: Options: retimer
sfp sfp: Diagnostics:
sfp sfp: module transmit fault indicated
sfp sfp: module transmit fault recovered
sfp sfp: module transmit fault indicated
sfp sfp: module persistently indicates fault, disabling

I reckon that the state machine (sm) checks [1] for

  • SFP_S_WAIT_LOS
  • SFP_S_LINK_UP

are failing repeatedly, and once the number of retries (five) is exhausted the driver prints the messages above. To restore connectivity it is often necessary to reboot the node, or to wait some time (> 3 min) before invoking ifupdown for the logical WAN interface.
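To make the retry idea concrete, here is a minimal sketch of a retry budget as I understand it from the log messages. It is a simplified model, not the kernel code: the function name run_fault_handler is made up, each sample stands for one TX_FAULT check, and I assume the budget is not reset when a fault recovers.

```python
from typing import Iterable, List


def run_fault_handler(tx_fault_samples: Iterable[bool], max_retries: int = 5) -> List[str]:
    """Simplified model of a TX_FAULT retry budget (hypothetical, not sfp.c)."""
    log: List[str] = []
    retries = max_retries  # assumption: five retries, as stated in the post
    faulted = False
    for asserted in tx_fault_samples:
        if asserted:
            retries -= 1  # every asserted check consumes one retry
            if retries == 0:
                log.append("module persistently indicates fault, disabling")
                break
            log.append("module transmit fault indicated")
            faulted = True
        elif faulted:
            # Fault went away; assumption: the budget is NOT refilled here.
            log.append("module transmit fault recovered")
            faulted = False
    return log


if __name__ == "__main__":
    for message in run_fault_handler([True, False, True, True, True, True]):
        print("sfp sfp:", message)
```

With a budget that never refills, a module that flaps often enough will eventually be disabled even if each individual fault clears, which would fit what the logs show.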

I am not sure whether the module is at fault, or whether the OS has trouble querying the modem properly or interpreting the signals it reports, such as

  • Loss of Signal
  • Module Transmitter Fault Signal

Is there a way to debug this and eventually remedy it?


[1] https://github.com/torvalds/linux/blob/master/drivers/net/phy/sfp.c#L1831

I got in touch with the developer (R. King) who currently looks after sfp.c in the Linux kernel source, and thought I would share the gist of it. There is still a bit to digest for those interested, and it should perhaps be taken with a grain of salt:

  • SFP MSA conformity:
    • wording such as "may" or "shall" in the various MSA documents is suggestive rather than obligatory and leaves quite some wiggle room
    • conformity that is self-proclaimed by the vendor is not necessarily reliable (there seems to be no certification standard yet for establishing such conformity)
    • examples of conformity misses:
      1. a module takes 40-50 seconds after TX_DISABLE is deasserted to initialise and deassert TX_FAULT, whereas the SFP MSA explicitly states a limit of 300ms (t_init) for TX_FAULT to deassert.
      2. the EEPROM does not respond for 50 seconds after plugging in, whereas the SFP MSA explicitly states a maximum of 300ms (t_serial).
      3. the EEPROM contains incorrect data, such as but not limited to:
        • indicating the module has an LC connector when it has an RJ45, or vice versa.
        • indicating NRZ encoding for an Ethernet SFP, where it should be 8b10b or 64b66b encoding.
        • indicating a single data rate, or even the wrong data rate, when the module is documented as supporting other rates.
        • indicating an extended compliance technology that it doesn't support, presumably chosen originally when the number was still unallocated by SFF-8024.
        • claiming to support 1000BASE-SX, a fiber standard, when the module is actually for VDSL2 over copper.
    • module manufacturers may provide their own tailored drivers that are not upstreamed to the kernel source and/or are not always made available to the end user
    • for that reason, commercial / industrial grade network appliances often implement vendor lock-in
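The kind of EEPROM inconsistency listed above can be checked by hand. The following is only a sketch under my reading of the SFF-8472 base ID layout (connector in byte 2, Ethernet compliance codes in byte 6, laser wavelength in bytes 60-61); the function names and the sample bytes, which are rebuilt from the ethtool dump at the top of this thread, are my own.

```python
from typing import List

CONNECTOR_RJ45 = 0x22    # connector code shown by ethtool for this module
ETH_1000BASE_SX = 0x01   # byte 6, bit 0 (assumed SFF-8472 layout)
ETH_1000BASE_LX = 0x02   # byte 6, bit 1 (assumed SFF-8472 layout)


def check_base_id(eeprom: bytes) -> List[str]:
    """Return plausibility findings for the first bytes of the ID page."""
    if len(eeprom) < 62:
        raise ValueError("need at least 62 bytes of the base ID page")
    findings: List[str] = []
    connector = eeprom[2]
    eth_codes = eeprom[6]
    wavelength_nm = int.from_bytes(eeprom[60:62], "big")
    claims_optical = bool(eth_codes & (ETH_1000BASE_SX | ETH_1000BASE_LX))
    if claims_optical and connector == CONNECTOR_RJ45:
        findings.append("optical 1000BASE-SX/LX claimed, but connector is RJ45")
    if claims_optical and wavelength_nm == 0:
        findings.append("optical 1000BASE-SX/LX claimed, but laser wavelength is 0nm")
    return findings


def sample_eeprom() -> bytes:
    """Bytes rebuilt from the ethtool -m dump above; all other bytes left at 0."""
    data = bytearray(96)
    data[0] = 0x03   # identifier: SFP
    data[1] = 0x04   # extended identifier
    data[2] = 0x22   # connector: RJ45
    data[6] = 0x01   # Ethernet compliance: 1000BASE-SX
    data[11] = 0x01  # encoding: 8B/10B
    data[12] = 13    # nominal bit rate in units of 100 MBd
    return bytes(data)


if __name__ == "__main__":
    for finding in check_base_id(sample_eeprom()):
        print(finding)
```

This does not tell you what the module really is; it only flags fields that contradict each other, which is a hint that the rest of the EEPROM should not be trusted blindly either.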

  • userland tools to query the SFP module
    • mii-tool is not suitable for modules that do not have a PHY; the "PHY" registers are emulated and are there just for compatibility
    • ethtool is better suited
    • cat /sys/kernel/debug/gpio shows the signal state for modules wired via GPIO, provided debugfs is enabled and mounted; alternatively use gpioinfo from gpiod-tools
    • if the I2C GPIO expander is interrupt capable, watching cat /proc/interrupts | grep sfp could help in debugging efforts
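For comparing dumps over time (e.g. before and after a fault), the ethtool -m output can be turned into a dictionary. This is a small sketch; parse_module_dump and read_module are names I made up, and read_module assumes ethtool is installed and the interface has a module plugged in.

```python
import subprocess
from typing import Dict


def parse_module_dump(text: str) -> Dict[str, str]:
    """Parse 'Key : value' lines as printed by ethtool -m into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like 00:0f:c9 stay intact.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue
        if key in fields:
            fields[key] += "; " + value  # keep repeated keys instead of overwriting
        else:
            fields[key] = value
    return fields


def read_module(iface: str) -> Dict[str, str]:
    """Run 'ethtool -m <iface>' and parse its output (needs ethtool and root)."""
    result = subprocess.run(
        ["ethtool", "-m", iface], capture_output=True, text=True, check=True
    )
    return parse_module_dump(result.stdout)


if __name__ == "__main__":
    sample = (
        "    Connector                                 : 0x22 (RJ45)\n"
        "    Vendor OUI                                : 00:0f:c9\n"
    )
    print(parse_module_dump(sample))
```

On the device itself, read_module("eth2") would be the call matching the dump at the top of the thread.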

  • state machine checks implemented by the sfp.c code
    • are written against the SFP MSA and not the GBIC standard
    • do not query/leverage the module's EEPROM
    • provide a check for inverted LOS for big endian arch but not for little endian
    • with the aforementioned conformity issues, this can easily lead to signal states being misread when talking to a non-conforming module
    • check the signal status (asserted / deasserted) of RX_LOS and TX_FAULT and act according to the status received from the module
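For cross-checking by hand what a module declares about these signals, the two "Option values" bytes from ethtool -m can be decoded. The sketch below assumes the SFF-8472 bit layout for bytes 64 and 65 as I understand it; the table and the function name decode_options are my own and not taken from the driver.

```python
from typing import List

# Bit positions in the 16-bit word (byte 64 << 8 | byte 65); assumed SFF-8472 layout.
OPTION_BITS = {
    13: "high power level",
    12: "paging implemented",
    11: "retimer or CDR implemented",
    10: "cooled transceiver",
    9: "power level declaration",
    8: "linear receiver output",
    7: "receiver decision threshold",
    6: "tunable transmitter",
    5: "RATE_SELECT implemented",
    4: "TX_DISABLE implemented",
    3: "TX_FAULT implemented",
    2: "RX_LOS implemented, inverted",
    1: "RX_LOS implemented",
}


def decode_options(byte64: int, byte65: int) -> List[str]:
    """List the option names whose bits are set, highest bit first."""
    word = (byte64 << 8) | byte65
    return [
        name
        for bit, name in sorted(OPTION_BITS.items(), reverse=True)
        if word & (1 << bit)
    ]


if __name__ == "__main__":
    # "Option values : 0x08 0x00" from the dump at the top of the thread.
    print(decode_options(0x08, 0x00))
```

For the dump in this thread (0x08 0x00) only the retimer bit is set, which matches "Retimer or CDR implemented" in the ethtool output and "Options: retimer" in the log. If the layout above is right, the module declares neither TX_FAULT nor RX_LOS as implemented in its EEPROM.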

This particular issue has not been fully debugged, which is a bit difficult to do, so no remedy is available yet.

At least

module transmit fault indicated

is known to be pertinent to the TX_FAULT signal check.

The developer is aware of a number of SFP modules that misbehave from his perspective, and this might be just another one. It could also turn out to be a hardware defect in the module.
