Opened 3 years ago

Last modified 16 months ago

#19085 reopened defect

nanostation m5 loco xw "loses" interface

Reported by: anonymous Owned by: developers
Priority: normal Milestone:
Component: kernel Version: Trunk
Keywords: Cc:

Description

after ~6 hours uptime my nanostation "loses" its lan interface.

means: eth0 is still up, but the link is gone... ethtool reports link down/speed 10m/half-duplex/no autoneg, mii-tool even finds "No MII transceiver present!", and there's no way to get the link up again with ethtool

btw, this doesn't happen with the original firmware (actually something is happening there, too, dmesg shows "AR8032 Hang WAR - Fast Reset..." and right after "... PHY Reset"), so i think this could be solved in software somehow... maybe somebody has an idea

[i also know of some people who have the same problem with their nanostation loco's, so it's not only me or broken cables or sth. like that]

Attachments (6)

nanostation_loco_xw_boot.log (10.6 KB) - added by devel@… 3 years ago.
Output of dmesg for a Nanostation M5 loco XW. At this moment eth0 is still operational.
800-net-phy-fix-at8033-sgmii-mode.patch (2.8 KB) - added by anonymous 3 years ago.
801-net-phy-at803x-add-ath8032.patch (1.8 KB) - added by anonymous 3 years ago.
dmesg-patched.log (10.6 KB) - added by leoso 3 years ago.
dmesg output of firmware with patch
dmesg-not-patched.log (11.0 KB) - added by leoso 3 years ago.
dmesg output of firmware with NO patch applied
loco-with-yesterdays-trunk-without-patches.txt (3.1 KB) - added by ufo@… 3 years ago.
with trunk, when the problems starting.. (btw, loco-m5, but not a new xw)

Change History (69)

comment:1 Changed 3 years ago by devel@…

We (a Freifunk community) encountered the same problem: most of the Nanostation M5 loco XW lose their ethernet connectivity every few hours. It does not happen for (old) XM devices or for M5 High-Power XW devices.

There seem to be two distinct error patterns causing this ethernet failure:

1) kernel log: eth0 entered disabled state

Without any changes regarding the devices connected via ethernet there is a sudden drop of the line as indicated by the dmesg output above. This is confirmed by the output of the following command:

ip link show dev eth0 | grep NO-CARRIER

This error seems to happen much more often (every few hours instead of days) if eth0 is part of a bridge (brctl) - even if eth0 is the only member of this bridge.
Thus this error may be connected with the promiscuous mode (enabled for bridge members).

2) half-duplex -> down

kernel messages:

[33432.010000] eth0: link up (10Mbps/Half duplex)
[33434.010000] eth0: link down

The frequency of this bug seems to be noticeably lower if commit [43776] is applied. Nevertheless, this does not eliminate the bug completely.

The above observations were made with the Barrier Breaker release including commit [42549] (for supporting the loco devices in Barrier Breaker).

So far we have not managed to create the specific circumstances that would trigger the above issues immediately. Thus we need to wait for some hours or days in order to reproduce the bug.

comment:2 Changed 3 years ago by anonymous

I wonder, how hard it would be to implement this "hang-detection", like in ubiquiti's firmware

kernel log shows:

AR8032 Hang WAR - Fast Reset...
AR8032 Hang WAR - PHY Reset...
AR8032 Hang WAR - Complete.

which means they reset the ethernet transceiver chip directly once the link is gone... this points to a hardware bug rather than a bug in the driver

i also tried to get a patch from ubiquiti's firmware developers, but got zero reaction :( it's also not included in the gpl archive

so... as i'm not a kernel hacker, i can only hope that someone will find some time to look at the datasheet (http://www.datasheet4u.com/datasheet/A/R/8/AR8032-Atheros.pdf.html), and implement sth. similar...
[or maybe force ubiquiti to release a proper gpl tarball?]

my conclusion: don't buy those newer nanostation's loco m xw devices, they're buggy

comment:4 Changed 3 years ago by devel@…

We started to collect kernel messages related to the eth0 failure events.

The following log messages are typical for hosts that are failing regularly:

[  202.320000] eth0: link up (10Mbps/Full duplex)
[  204.320000] eth0: link up (100Mbps/Full duplex)
[  844.320000] eth0: link down
[  845.320000] eth0: link up (100Mbps/Full duplex)
[ 3085.330000] eth0: link up (10Mbps/Half duplex)
[ 3087.330000] eth0: link up (100Mbps/Full duplex)
[ 3639.340000] eth0: link down
[ 3640.340000] eth0: link up (100Mbps/Full duplex)
[ 3880.350000] eth0: link down
[ 3881.350000] eth0: link up (100Mbps/Full duplex)
[ 5071.360000] eth0: link down
[ 5072.360000] eth0: link up (100Mbps/Full duplex)
[ 6390.370000] eth0: link down

Thus it seems like there is an event causing a renegotiation of the link (link down) from time to time. If we are lucky, there is a link up event exactly one second later. Otherwise there is no further renegotiation and the link stays down permanently.
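
The pattern described above (a "link down" that is either followed by a "link up" or stays down for good) can be checked automatically on collected logs. The following is only a minimal sketch, assuming the dmesg line format shown above; the names parse_link_event and find_unrecovered_down are made up for this illustration and are not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/* One parsed "eth0: link ..." kernel message. */
struct link_event {
    double timestamp; /* seconds since boot, taken from the [ ... ] prefix */
    int up;           /* 1 for "link up", 0 for "link down" */
};

/* Parse one dmesg line; return 1 on success, 0 if it is not a link event. */
int parse_link_event(const char *line, struct link_event *ev)
{
    const char *bracket = strchr(line, '[');
    const char *msg = strstr(line, "eth0: link ");

    if (!bracket || !msg || sscanf(bracket, "[ %lf", &ev->timestamp) != 1)
        return 0;

    msg += strlen("eth0: link ");
    if (strncmp(msg, "up", 2) == 0)
        ev->up = 1;
    else if (strncmp(msg, "down", 4) == 0)
        ev->up = 0;
    else
        return 0;
    return 1;
}

/*
 * Return the timestamp of a trailing "link down" that was never followed
 * by a "link up", or a negative value if the log ends with the link up.
 */
double find_unrecovered_down(const char *const lines[], size_t n)
{
    double down_at = -1.0;
    struct link_event ev;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!parse_link_event(lines[i], &ev))
            continue; /* unrelated kernel message */
        down_at = ev.up ? -1.0 : ev.timestamp;
    }
    return down_at;
}
```

Fed with the log above, find_unrecovered_down() would report the final "link down" at 6390.37 as the point where the interface was lost.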

Sadly this situation cannot be resolved manually by using ethtool (error message: No MII transceiver present). For now a reboot is the only known workaround for this situation.

comment:5 follow-up: Changed 3 years ago by devel@…

We tried to set a fixed ethernet link speed with ethtool on some devices that are affected by this bug:

ethtool -s eth0 autoneg off
ethtool -s eth0 speed 100
ethtool -s eth0 duplex full

One of these devices went down from an average of 5 failures per day down to zero failures (27 hours up to now). Its kernel log shows a lot of renegotiation events (ca. 100 per day). They all end up with the fixed link speed (100Mbps/Full duplex) after exactly one or two seconds delay:

[90171.670000] eth0: link up (1000Mbps/Full duplex)
[90173.670000] eth0: link up (100Mbps/Full duplex)
[92611.690000] eth0: link up (10Mbps/Half duplex)
[92613.690000] eth0: link up (100Mbps/Full duplex)
[93047.690000] eth0: link down
[93048.690000] eth0: link up (100Mbps/Full duplex)
[93340.690000] eth0: link down
[93341.690000] eth0: link up (100Mbps/Full duplex)
[93789.690000] eth0: link down
[93790.690000] eth0: link up (100Mbps/Full duplex)
[94240.690000] eth0: link up (1000Mbps/Full duplex)
[94242.690000] eth0: link up (100Mbps/Full duplex)
[95298.710000] eth0: link up (10Mbps/Half duplex)
[95300.710000] eth0: link up (100Mbps/Full duplex)

Thus the negotiation somehow works (with hiccups) for one device.

Sadly two other loco XW devices with the same fixed link speed setup still expose the known failure.

comment:6 follow-up: Changed 3 years ago by devel@…

We noticed that we do not have any issues with devices that are continuously receiving incoming traffic on the ethernet interface - e.g. due to olsr routing messages (every 3s) or regular pings (interval of 12s).

Outgoing traffic (without return) does not seem to help.

Thus it feels like this problem is caused by some issue with an energy-saving mode of the ethernet port.

Changed 3 years ago by devel@…

Output of dmesg for a Nanostation M5 loco XW. At this moment eth0 is still operational.

comment:7 in reply to: ↑ 5 Changed 3 years ago by devel@…

Replying to devel@…:

We tried to set a fixed ethernet link speed with ethtool on some devices that are affected by this bug:

ethtool -s eth0 autoneg off
ethtool -s eth0 speed 100
ethtool -s eth0 duplex full

One of these devices went down from an average of 5 failures per day down to zero failures (27 hours up to now).

With hindsight: the above is irrelevant, since the reduction of failures during that day was probably caused by constant traffic. Please ignore the above.

comment:8 Changed 3 years ago by anonymous

Hm, i'm not sure if that applies to my device, too, because it's connected to a box running olsrd, too... it's still hanging.

oh, and i got some code from ubiquiti (how they are resetting the ethernet chip):

https://bpaste.net/show/4fe36e9363c6

but i had no chance yet to look at the GPL-driver and how to port the code

comment:9 Changed 3 years ago by jow

Copy of the paste contents as it expires on the 20th.

Our driver is not GPL, but we can give an idea of how to detect and reset the PHY in the GPL ar71xx ethernet driver:

#define GPIO_OE_ADDRESS                       0x18040000
#define GPIO_OUT                              0x18040008
#define GPIO_SET                              0x1804000C
#define GPIO_CLEAR                            0x18040010
#define AR8032_EXPECTED_ID1                   0x4d
#define ATHR_PHY_ID1                     2

static int ar803x_phy_reset(void* arg) {
    if ((athr_reg_rd(GPIO_OUT) & (1 << 0)) == 0) {
        // Set GPIO0 to 1 (not in reset)
        printk("Setting GPIO0 high (AR803x out of reset)\n");
        athr_reg_wr(GPIO_SET,(1<<0));
    }
    if (athr_reg_rd(GPIO_OE_ADDRESS) & (1 << 0)) {
        // Set GPIO0 as output
        printk("Configuring GPIO0 as Output.\n");
        athr_reg_rmw_clear(GPIO_OE_ADDRESS, (1 << 0));
    }
    athr_reg_wr(GPIO_CLEAR,(1<<0));
    mdelay(2);
    athr_reg_wr(GPIO_SET,(1<<0));
    mdelay(2);
    return 0;
}

static int ar803x_check_reset(void* arg)
{
        int retries;
        athr_gmac_t *mac = (athr_gmac_t *)arg;
        uint16_t phy_id = phy_reg_read(mac->mac_unit, mac->phy->address, ATHR_PHY_ID1);

        if (phy_id == AR8032_EXPECTED_ID1)
        {
            //No PHY hang detected
            return 0;
        }
        printk("AR803x Hang WAR - PHY Reset...\n");
        ar803x_phy_reset(mac);
        retries = 102; //To be sure last try > 10ms after reset
        //Poll until the PHY reports the expected ID again or the retries run out
        while ((phy_reg_read(mac->mac_unit, mac->phy->address, ATHR_PHY_ID1) != AR8032_EXPECTED_ID1) && --retries) {
            udelay(100);
        }
        if (retries) {
            printk("AR803x Hang WAR - Complete.\n");
            return 1;
        }
        return -1;
}

As we are polling the PHY state (link status) in our driver, we periodically call ar803x_check_reset() to check the PHY and reset it if the wrong ID is read via MDIO.
BTW: "AR8032 Hang WAR - Fast Reset" tries MAC layer reset. It is not included in the sample.

-Edmundas
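
For anyone who wants to try the detect/reset/poll logic without hardware, here is a hedged user-space sketch of the same idea. It is not the Ubiquiti code and not a driver patch: struct phy_ops, struct fake_phy and all function names are hypothetical, and the fake PHY simply returns 0 as its ID while "hung" and for a few reads after the reset.

```c
#include <stdint.h>

#define PHY_ID1_REG      2      /* MII register with the upper half of the ID */
#define EXPECTED_PHY_ID1 0x004d /* value quoted for the AR8032 above */

/* Hypothetical hardware hooks; a real driver would use its MDIO/GPIO helpers. */
struct phy_ops {
    uint16_t (*read_reg)(void *ctx, int reg);
    void (*pulse_reset)(void *ctx);
    void (*delay_us)(void *ctx, unsigned int us);
};

/*
 * Returns 0 if the PHY answers with the expected ID (no hang),
 * 1 if it was hung and came back after a reset,
 * -1 if it still does not answer after max_polls reads.
 */
int phy_check_and_reset(const struct phy_ops *ops, void *ctx, int max_polls)
{
    int i;

    if (ops->read_reg(ctx, PHY_ID1_REG) == EXPECTED_PHY_ID1)
        return 0;

    ops->pulse_reset(ctx);
    for (i = 0; i < max_polls; i++) {
        if (ops->read_reg(ctx, PHY_ID1_REG) == EXPECTED_PHY_ID1)
            return 1;
        ops->delay_us(ctx, 100);
    }
    return -1;
}

/* Simulated PHY (an assumption for testing, not real hardware behaviour). */
struct fake_phy {
    int hung;         /* 1 while the PHY does not answer correctly */
    int settle_reads; /* reads after a reset that still return 0 */
    int resets;       /* number of reset pulses seen */
};

static uint16_t fake_read(void *ctx, int reg)
{
    struct fake_phy *phy = ctx;

    if (reg != PHY_ID1_REG || phy->hung)
        return 0x0000;
    if (phy->settle_reads > 0) {
        phy->settle_reads--;
        return 0x0000;
    }
    return EXPECTED_PHY_ID1;
}

static void fake_reset(void *ctx)
{
    struct fake_phy *phy = ctx;

    phy->hung = 0;
    phy->settle_reads = 3;
    phy->resets++;
}

static void fake_delay(void *ctx, unsigned int us)
{
    (void)ctx; /* nothing to wait for in the simulation */
    (void)us;
}

const struct phy_ops fake_phy_ops = { fake_read, fake_reset, fake_delay };
```

Called periodically from the link polling path, a return value of 1 would correspond to the "Hang WAR - Complete" case and -1 to a PHY that did not recover.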

comment:10 follow-up: Changed 3 years ago by cmlara

Question:

Is this actually an issue in trunk? The boot log is from Kernel 3.10.49 which may be BB.

It looks like this may possibly be covered in Kernel 3.16 and newer
http://permalink.gmane.org/gmane.linux.network/318765
http://lxr.free-electrons.com/source/drivers/net/phy/at803x.c?v=3.16#L258

static void at803x_link_change_notify(struct phy_device *phydev)
{
        struct at803x_priv *priv = phydev->priv;

        /*
         * Conduct a hardware reset for AT8030 every time a link loss is
         * signalled. This is necessary to circumvent a hardware bug that
         * occurs when the cable is unplugged while TX packets are pending
         * in the FIFO. In such cases, the FIFO enters an error mode it
         * cannot recover from by software.
         */

Looks to me like this may catch it.

comment:11 in reply to: ↑ 6 Changed 3 years ago by leoso

Replying to devel@…:

We noticed that we do not have any issues with devices that are continuously receiving incoming traffic on the ethernet interface - e.g. due to olsr routing messages (every 3s) or regular pings (interval of 12s).

Outgoing traffic (without return) does not seem to help.

Thus it feels like this problem is caused by some issue with an energy-saving mode of the ethernet port.

We ran several further tests with constant pings sent through the eth0 interface, but all tests failed: the eth0 device still hangs after some time. Therefore the above hypothesis is falsified :)

comment:12 in reply to: ↑ 10 Changed 3 years ago by leoso

Replying to cmlara:

Question:

Is this actually an issue in trunk? The boot log is from Kernel 3.10.49 which may be BB.

It looks like this may possibly be covered in Kernel 3.16 and newer
http://permalink.gmane.org/gmane.linux.network/318765
http://lxr.free-electrons.com/source/drivers/net/phy/at803x.c?v=3.16#L258

static void at803x_link_change_notify(struct phy_device *phydev)
{
        struct at803x_priv *priv = phydev->priv;

        /*
         * Conduct a hardware reset for AT8030 every time a link loss is
         * signalled. This is necessary to circumvent a hardware bug that
         * occurs when the cable is unplugged while TX packets are pending
         * in the FIFO. In such cases, the FIFO enters an error mode it
         * cannot recover from by software.
         */

Looks to me like this may catch it.

That is very interesting. It looks like other chips have the same problem. At least these workarounds indicate that. Sadly the driver ag71xx is not covered by the above files.

Does anybody know which PHY driver is responsible for the ag71xx chip?

comment:13 Changed 3 years ago by devel@…

A commenter above mentioned "AR8032" - does this refer to the PHY driver you mentioned? If not: where could I find this information?

comment:14 Changed 3 years ago by leoso

On startup the boot log of our loco xw states:

Sun May 10 14:27:29 2015 kern.info kernel: [ 1.150000] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.0:01 [uid=004dd023, driver=Generic PHY]

The PHY-ID here is 004dd023.
Which driver is responsible for this ID? I do not know. Some drivers have a list of IDs which they are responsible for. E.g. in ar8216.c you can find:

0x004dd033,
0x004dd034, /* AR8327 */
0x004dd036, /* AR8337 */
0x004dd041,
0x004dd042,
0x004dd043, /* AR8236 */

Therefore these IDs use ar8216.c as driver.

In at803x.c you can find

.phy_id = 0x004dd072,
.name = "Atheros 8035 ethernet",

.phy_id = 0x004dd076,
.name = "Atheros 8030 ethernet",

But searching for our ID (004dd023) or parts of the ID in the kernel source did not return any useful results for us. Therefore I do not know which driver is responsible.
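
To illustrate how such ID lists are used: the kernel compares the uid read from the PHY against each driver's phy_id under that driver's phy_id_mask, and falls back to the Generic PHY driver if nothing matches. The sketch below is a simplified user-space model of that comparison, not kernel code. The two IDs and names are the ones quoted from at803x.c above; the mask value 0xffffffef and the names phy_driver_entry and match_phy_driver are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct phy_driver_entry {
    uint32_t phy_id;
    uint32_t phy_id_mask;
    const char *name;
};

/* IDs and names as quoted from at803x.c above; the mask is an assumption. */
static const struct phy_driver_entry at803x_table[] = {
    { 0x004dd072, 0xffffffef, "Atheros 8035 ethernet" },
    { 0x004dd076, 0xffffffef, "Atheros 8030 ethernet" },
};

/*
 * Return the name of the first entry whose masked ID equals the masked uid,
 * or NULL if no entry matches (the case where Generic PHY would be used).
 */
const char *match_phy_driver(uint32_t uid)
{
    size_t i;

    for (i = 0; i < sizeof(at803x_table) / sizeof(at803x_table[0]); i++) {
        const struct phy_driver_entry *e = &at803x_table[i];

        if ((uid & e->phy_id_mask) == (e->phy_id & e->phy_id_mask))
            return e->name;
    }
    return NULL;
}
```

With this table, match_phy_driver(0x004dd023) finds no entry, which is consistent with the boot log reporting driver=Generic PHY for our uid.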

comment:15 Changed 3 years ago by cmlara

So first off, I can confirm (from having a user local to me test it) that these devices do have an issue in Trunk (tested around a week or so ago).

I had assumed that they used the at803x code but they may not.

I won't claim to be an expert in this area, but it looks to me (ref: https://lists.openwrt.org/pipermail/openwrt-devel/2014-September/027757.html; note: I think the code has changed since that original patch, I just can't find its commit at the moment) that ath79_register_eth(), called in arch/mips/ath79/mach-ubnt-xm.c along with ath79_register_mdio() and the other hardware data listed above it, creates the ethernet device.

ath79_eth0_data.phy_mask appears to play a role in selecting which PHY is used.

source:trunk/target/linux/ar71xx/files/arch/mips/ath79/dev-eth.c is where ath79_register_eth() and ath79_register_mdio() are declared.

I haven't tracked it back further to see how the drivers play together (if at all)

Changed 3 years ago by anonymous

Changed 3 years ago by anonymous

comment:16 Changed 3 years ago by anonymous

could you please try 2 patches attached and see if it fixes the issue

comment:17 Changed 3 years ago by cmlara

Compiling and will report back when I have more information.

comment:18 Changed 3 years ago by leoso

I applied the patch to our OpenWRT firmware but there seems to be a problem with the eth0 interface afterwards. After the patch is applied the eth0 interface seems to be up but no packets are going out or being received. Also the LED "LAN1" on the loco xw is permanently off.
I will attach the dmesg output of a patched and an unpatched firmware. Maybe you can have a look at it. One difference in the output is the missing "eth0: link up (100Mbps/Full duplex)" line when running the patched kernel.

Changed 3 years ago by leoso

dmesg output of firmware with patch

Changed 3 years ago by leoso

dmesg output of firmware with NO patch applied

comment:19 follow-ups: Changed 3 years ago by dman776

having this same issue with a NanoBeam NBE-M5-19. frustrating.

Changed 3 years ago by ufo@…

with trunk, when the problems starting.. (btw, loco-m5, but not a new xw)

comment:20 Changed 3 years ago by ufo@…

crazy: after the next reboot that device is running stable, now for 10h without problems, having transferred some gigabytes over LAN already..
but my m5-loco-xw still has problems (on trunk)

comment:21 Changed 3 years ago by devel@…

@ufo: I never noticed any issues with the old Loco M5 XM. Only the more recent Loco M5 XW exposes this problem (as far as I know).
Both devices seem to be based on different hardware - thus your kernel log (comment:20) is probably not related to this ticket?

comment:22 in reply to: ↑ 19 Changed 3 years ago by bhnyc

Replying to dman776:

having this same issue with a NanoBeam NBE-M5-19. frustrating.

I've found one qMp image that works well with NBE19. Not sure if that is useful for you or not:
https://github.com/nycmeshnet/qMp-XWbin/tree/master/nanobeam

make sure to follow the readme.

comment:23 in reply to: ↑ 19 Changed 3 years ago by bhnyc

Last edited 3 years ago by bhnyc

comment:24 follow-up: Changed 3 years ago by cmlara

@bhnyc comment:22 As far as I can see, qMp doesn't do any build root patching; they just do a direct checkout of OpenWRT.

Since this ticket has been opened against Trunk and has been open before the RC1 for CC I don't see how the qMp firmware would actually solve this unless there is some patch in there that I am missing?

comment:25 in reply to: ↑ 24 Changed 3 years ago by bhnyc

Replying to cmlara:

@bhnyc comment:22 As far as I can see, qMp doesn't do any build root patching; they just do a direct checkout of OpenWRT.

Since this ticket has been opened against Trunk and has been open before the RC1 for CC I don't see how the qMp firmware would actually solve this unless there is some patch in there that I am missing?

My thoughts are that the version of CC used in that image worked, so this bug was introduced at some stage after 2015-01-22. If we did a diff on the relevant CC source files it might give us a clue. (Also, people are looking for working images to get by in the meantime.)

comment:26 follow-up: Changed 3 years ago by bmoffitt@…

I have been testing some LOCO M5 XW units, and both under r42711 (kernel 3.10.49) and a more recent build using 3.18.14 I saw the Access Points losing contact with the router into which they were plugged. I put in a script to ping the router every 5 minutes, and the ping failed with disturbing frequency.

in 3.10.49, however, the same thing was happening at the client end - the LOCO would lose touch with the device attached to it and require a reboot. However, in 3.18.14, I did not observe that particular problem. So it appears something changed, although I'm not sure what.

I would really like to see this get fixed... I can't afford much, but would a bounty be useful?

comment:27 in reply to: ↑ 26 Changed 3 years ago by leoso

Replying to bmoffitt@…:

I have been testing some LOCO M5 XW units, and both under r42711 (kernel 3.10.49) and a more recent build using 3.18.14 I saw the Access Points losing contact with the router into which they were plugged. I put in a script to ping the router every 5 minutes, and the ping failed with disturbing frequency.

in 3.10.49, however, the same thing was happening at the client end - the LOCO would lose touch with the device attached to it and require a reboot. However, in 3.18.14, I did not observe that particular problem. So it appears something changed, although I'm not sure what.

Can you explain your setup in detail? Where is your client connected? To the loco's eth0 interface?
I am asking because if eth0 is down, only the wifi interface is still working on the loco. So if your client and your router are both connected to the loco's eth0, neither device should be able to reach the loco via cable.

comment:28 Changed 3 years ago by bmoffitt@…

Apologies - I should have explained more thoroughly. My testing setup comprises two LOCO M5 XW radios. One is configured as an Access Point (normal config except with strict WPA2 auth/encryption and 10 MHz channel width) connected to a router on its eth0 port. The second is configured as a client device using relayd to make it "transparent." The client is connected to a laptop on its eth0 port, and, of course, the LOCOs are connected together on their wlan0 ports.

I re-tested another radio (to rule out a hardware problem on the radio) last night running OpenWRT r45845, kernel 3.18.14, and can verify that the LOCO did again lose contact with the router on its eth0 port, so the bug seems to remain.

comment:29 Changed 3 years ago by anonymous

do you have a procedure to induce the problem on-demand? or, do you just have to wait?

comment:30 Changed 3 years ago by bmoffitt@…

Alas, I have not discovered a way to induce it - I just have to wait.

comment:31 follow-up: Changed 3 years ago by devel@…

I would like to summarize the current state of this issue:

  • only XW loco devices are affected (AR8032)
  • Linux kernel versions at least up to 3.18.14 are affected
  • we received some quick-reset code that is used by Ubiquiti (#comment:8 and #comment:9)
    • patches based on this code were created, but they did not seem to solve the problem (or: created different problems) (#comment:16 and #comment:18)
  • some participants referred to NanoBeam NBE-M5-19 with similar problems (#comment:22)
    • it seems to be unclear if their problem was solved and if it was related at all

Please correct me if I misunderstood something.

comment:32 in reply to: ↑ 31 ; follow-up: Changed 3 years ago by cmlara

I work with the one gentleman reporting the NanoBeam in #comment:19 as part of a larger project.

As far as I/we can tell it affects the entire XW line (AR803x) EXCEPT the NanoStation, which has a built-in switch. (For those not familiar with the Ubiquiti product lineup: there is a "NanoStation" and a "NanoStation Loco"; the non-Loco "NanoStation" is basically the only device that has a built-in switch.) Either it affects the whole line, or we happen to be hitting multiple bugs at once that cause the same failure.

This would seem to agree with the kernel code already quoted in #comment:12. The question at the moment is (I think) "is that code actually being used?" before we blame the kernel, since the at803x module is never loaded on these devices and the "ag" module is used instead. That code is basically the same as the code in #comment:8 and #comment:9, just written in GPIOD format.

comment:33 in reply to: ↑ 32 ; follow-up: Changed 3 years ago by leoso

Replying to cmlara:

I work with the one gentleman reporting the NanoBeam in #comment:19 as part of a larger project.

As far as I/we can tell it affects the entire XW line (AR803x) EXCEPT the NanoStation, which has a built-in switch. (For those not familiar with the Ubiquiti product lineup: there is a "NanoStation" and a "NanoStation Loco"; the non-Loco "NanoStation" is basically the only device that has a built-in switch.) Either it affects the whole line, or we happen to be hitting multiple bugs at once that cause the same failure.

I see we have some confusion here. When you talk about the Loco XW line you mean AR803x. In my log file from #comment:18 I read the lines

[ 0.000000] CPU0 revision is: 0001974c (MIPS 74Kc)
[ 0.000000] SoC: Atheros AR9342 rev 2

in boot log. Can you please give us your dmesg output with the SoC information? Do we have different SoCs with the same bug?

This would seem to agree with the kernel code already quoted in #comment:12. The question at the moment is (I think) "is that code actually being used?" before we blame the kernel, since the at803x module is never loaded on these devices and the "ag" module is used instead. That code is basically the same as the code in #comment:8 and #comment:9, just written in GPIOD format.

At least for me, with an AR9342, the at803x driver is not used; the driver in linux-3.18.11/arch/mips/ath79/ is used instead.

comment:34 in reply to: ↑ 33 ; follow-up: Changed 3 years ago by cmlara

The 803x is the "ethernet subchip" (if you can call it that; considering it's on the same SoC, I'm confused as to why it gets its own designator, but it apparently does).

The actual SoC for the XW is a 934x (I'm not 100% sure what the x is on the ones I'm working with at the moment, as they are not right in front of me, but I want to say 9342 as well).

Regarding at803x not being used: that is my point. The kernel code that I think solves this lockup issue is in that module (and has been for a while), but we (OpenWRT) are not calling it, if I understand correctly, meaning that when the node locks up it never gets reset.

comment:35 Changed 3 years ago by dman776

I have the Nanobeam M5 19.
If you need any info from me, I'll be happy to provide it.

comment:36 in reply to: ↑ 34 Changed 3 years ago by leoso

Replying to cmlara:

The 803x is the "ethernet subchip" (if you can call it that; considering it's on the same SoC, I'm confused as to why it gets its own designator, but it apparently does).

Okay, thanks for clarification.

The actual SoC for the XW is a 934x (I'm not 100% sure what the x is on the ones I'm working with at the moment, as they are not right in front of me, but I want to say 9342 as well).

Regarding at803x not being used: that is my point. The kernel code that I think solves this lockup issue is in that module (and has been for a while), but we (OpenWRT) are not calling it, if I understand correctly, meaning that when the node locks up it never gets reset.

Regarding the patches above: After the patches are applied you can see in the dmesg output (attached to #comment:18) that the at803x driver is used. This includes the code which resets the chip. BUT as stated in #comment:18 the interface has problems. It seems to be up but there are no packets received or transmitted. At the moment I do not know how to debug this.

comment:37 Changed 3 years ago by leoso

We are now able to detect the bug in the same way as Ubnt does.
The presence of the bug can be detected by reading the PHY ID. When the chip "hangs" it does not return the correct PHY ID. We can now reproduce this. It is important to fetch the ID directly from the hardware every time because the ID is cached by the driver by default.
We use the following code.

--- a/ag71xx_phy.c        2015-06-18 09:00:03.211078658 +0200
+++ b/ag71xx_phy.c        2015-06-27 21:07:07.402581651 +0200
@@ -21,7 +21,11 @@
        int status_change = 0;
 
        spin_lock_irqsave(&ag->lock, flags);
-
+
+       int id1 = phy_read( phydev, 2 );
+       int id2 = phy_read( phydev, 3 );
+       printk("Opennet Hack09: id=%04x %04x",id1, id2); /* should be "004d d023" before bug occured */
+
        if (phydev->link) {
                if (ag->duplex != phydev->duplex
                    || ag->speed != phydev->speed) {

As output of dmesg we get

...
Opennet Hack09: id=004d d023Opennet Hack09: id=004d d023Opennet Hack09:
id=004d d023Opennet Hack09: id=004d d023Opennet Hack09: id=004d
d023Opennet Hack09: id=004d d023Opennet Hack09: id=004d d023Opennet
Hack09: id=004d d023Opennet Hack09: id=0000 0000
[ 2038.330000] eth0: link up (10Mbps/Half duplex)
[ 2040.340000] Opennet Hack09: id=0000 0000
[ 2040.340000] eth0: link down
[ 2040.340000] br-lan: port 1(eth0) entered disabled state
[ 4355.520000] device br-lan entered promiscuous mode
[ 4369.450000] device br-lan left promiscuous mode

The next step is to implement the reset operation as suggested in #comment:9.

In earlier posts there was the question which PHY driver is used by ag71xx in our scenario. It is the "Generic PHY" driver which can be found under linux-3.18.14/drivers/net/phy/phy_device.c.
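
As a small illustration of the check described in this comment: the 32-bit uid from the boot log is MII register 2 in the upper 16 bits and register 3 in the lower 16 bits, so the hang shows up as a freshly read pair that no longer combines to 004dd023. The helper names below are made up for this sketch, and the expected uid is the one from our Loco XW boot log, so other devices would need a different constant.

```c
#include <stdint.h>

#define LOCO_XW_PHY_UID 0x004dd023u /* uid reported in the boot log above */

/* Combine MII registers 2 and 3 (the two PHY ID halves) into the 32-bit uid. */
uint32_t phy_uid_from_regs(uint16_t id1, uint16_t id2)
{
    return ((uint32_t)id1 << 16) | id2;
}

/* A freshly read uid that differs from the expected one indicates the hang. */
int phy_looks_hung(uint16_t id1, uint16_t id2)
{
    return phy_uid_from_regs(id1, id2) != LOCO_XW_PHY_UID;
}
```

For the values in the dmesg output above, the pair 004d/d023 is treated as healthy and the pair 0000/0000 as hung.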

comment:38 Changed 3 years ago by dman776

Awesome! looking forward to getting this nailed down! Keep us posted. thx

comment:39 Changed 3 years ago by anonymous

This is great news - I'm very eager to see this fixed. If I can do any testing, please let me know.

comment:40 Changed 3 years ago by anonymous

This bug happens to me with a Ubiquiti UniFi AP too... Maybe I should file a new bug?

comment:41 Changed 3 years ago by dman776

@leoso, any updates on your testing?

comment:42 Changed 2 years ago by ufo@…

is it possible to mark all these UBNT-devices as BROKEN!?
it's confusing when people buy that UBNT stuff because the openwrt website seems to recommend these devices.. and then want to use ethernet :-/

p.s. (entire XW line (AR803x) EXCEPT the NanoStation-M)

comment:43 Changed 2 years ago by leoso

Sure it is possible. It should be done! Feel free to edit all places you find by yourself. I added a note in http://wiki.openwrt.org/toh/ubiquiti/nanostationm5

comment:44 Changed 2 years ago by anonymous

Will you also remove that warning in all possible places once this is sorted out?

comment:45 Changed 2 years ago by leoso

This is not needed because the text is "See also Bug description (link is here). As of August 2015 there is no workaround in OpenWRT available."
I think it is important to warn users before they buy defective hardware.

comment:46 Changed 2 years ago by bmoffitt

I agree that we need to warn people about buying XW hardware.

However, it's also important to get it fixed. Since nobody has "stepped up" yet, I have started a crowdsource here: https://www.bountysource.com/issues/12898488-nanostation-m5-loco-xw-loses-interface

I started it up with a $200 bounty, and I hope others who want this fixed will contribute to the fund.

I also hope someone who can translate the info in comments 9 and 37 into a patch will step up and submit the patch. It's well beyond my capabilities to build the patch, but I would very much like to see it fixed.

Thanks, folks!

-Bill

comment:47 Changed 2 years ago by bmoffitt

Late last week I downloaded the "released" CC source to see if anything had changed, even though I haven't seen any recent checkins that would make me suspect anything. I have learned to allow for the possibility that things may have slipped by me...

And there has been a significant change: the "station" device now works perfectly. I am able to plug in a laptop to the client and access the network consistently - eth0 no longer "hangs" but, rather, goes down when nothing is connected and comes back up as soon as it detects an Ethernet connection. This is terrific, and I would very much like to thank whoever made the change.

However, there is also bad news: I am running a cron job that simply pings the router every 5 minutes, and it failed on both devices last night. It appears that the Ethernet port on the "access point" device is behaving badly, disconnecting at seemingly random intervals, then taking 4-5 seconds to come back up.

The following is the dmesg text from the AP device, starting from near the end of the init sequence:

[   25.932899] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   25.939525] br-lan: port 2(wlan0) entered listening state
[   25.945097] br-lan: port 2(wlan0) entered listening state
[   26.142407] br-lan: topology change detected, propagating
[   26.147917] br-lan: port 1(eth0) entered forwarding state
[   26.153950] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[   27.942424] br-lan: port 2(wlan0) entered learning state
[   29.942366] br-lan: topology change detected, propagating
[   29.947887] br-lan: port 2(wlan0) entered forwarding state
[   52.882391] random: nonblocking pool is initialized
[  844.704882] eth0: link down
[  844.707814] br-lan: port 1(eth0) entered disabled state
[  845.706023] eth0: link up (100Mbps/Full duplex)
[  845.710682] br-lan: port 1(eth0) entered listening state
[  845.716184] br-lan: port 1(eth0) entered listening state
[  847.714557] br-lan: port 1(eth0) entered learning state
[  849.714547] br-lan: topology change detected, propagating
[  849.720061] br-lan: port 1(eth0) entered forwarding state
[ 3781.699358] eth0: link down
[ 3781.702280] br-lan: port 1(eth0) entered disabled state
[ 3782.700504] eth0: link up (100Mbps/Full duplex)
[ 3782.705169] br-lan: port 1(eth0) entered listening state
[ 3782.710647] br-lan: port 1(eth0) entered listening state
[ 3784.709043] br-lan: port 1(eth0) entered learning state
[ 3786.709029] br-lan: topology change detected, propagating
[ 3786.714531] br-lan: port 1(eth0) entered forwarding state
[ 4306.698876] eth0: link up (100Mbps/Half duplex)
[ 4308.698876] eth0: link up (100Mbps/Full duplex)
[ 6766.694272] eth0: link down
[ 6766.697174] br-lan: port 1(eth0) entered disabled state
[ 6767.695409] eth0: link up (100Mbps/Full duplex)
[ 6767.700095] br-lan: port 1(eth0) entered listening state
[ 6767.705579] br-lan: port 1(eth0) entered listening state
[ 6769.703936] br-lan: port 1(eth0) entered learning state
[ 6771.703924] br-lan: topology change detected, propagating
[ 6771.709427] br-lan: port 1(eth0) entered forwarding state
[ 6871.694837] eth0: link up (100Mbps/Half duplex)
[ 6873.694826] eth0: link up (100Mbps/Full duplex)
[10317.694864] eth0: link down
[10317.697768] br-lan: port 1(eth0) entered disabled state
[10318.696001] eth0: link up (100Mbps/Full duplex)
[10318.700678] br-lan: port 1(eth0) entered listening state
[10318.706157] br-lan: port 1(eth0) entered listening state
[10320.704532] br-lan: port 1(eth0) entered learning state
[10322.704509] br-lan: topology change detected, propagating
[10322.710014] br-lan: port 1(eth0) entered forwarding state
[13704.736625] eth0: link down
[13704.739524] br-lan: port 1(eth0) entered disabled state
[13705.737773] eth0: link up (100Mbps/Full duplex)
[13705.742461] br-lan: port 1(eth0) entered listening state
[13705.747966] br-lan: port 1(eth0) entered listening state
[13707.746292] br-lan: port 1(eth0) entered learning state
[13709.746276] br-lan: topology change detected, propagating
[13709.751795] br-lan: port 1(eth0) entered forwarding state
[20205.849238] eth0: link up (10Mbps/Half duplex)
[20207.849227] eth0: link up (100Mbps/Full duplex)
[23797.893776] eth0: link down
[23797.896676] br-lan: port 1(eth0) entered disabled state
[23798.894921] eth0: link up (100Mbps/Full duplex)
[23798.899617] br-lan: port 1(eth0) entered listening state
[23798.905121] br-lan: port 1(eth0) entered listening state
[23800.903458] br-lan: port 1(eth0) entered learning state
[23802.903440] br-lan: topology change detected, propagating
[23802.908953] br-lan: port 1(eth0) entered forwarding state
[24090.892760] eth0: link down
[24090.895664] br-lan: port 1(eth0) entered disabled state
[24091.893899] eth0: link up (100Mbps/Full duplex)
[24091.898587] br-lan: port 1(eth0) entered listening state
[24091.904073] br-lan: port 1(eth0) entered listening state
[24093.902426] br-lan: port 1(eth0) entered learning state
[24095.902424] br-lan: topology change detected, propagating
[24095.907928] br-lan: port 1(eth0) entered forwarding state
[25479.917544] eth0: link up (100Mbps/Half duplex)
[25481.917542] eth0: link up (100Mbps/Full duplex)
[28265.937745] eth0: link down
[28265.940651] br-lan: port 1(eth0) entered disabled state
[28266.938886] eth0: link up (100Mbps/Full duplex)
[28266.943552] br-lan: port 1(eth0) entered listening state
[28266.949055] br-lan: port 1(eth0) entered listening state
[28268.947417] br-lan: port 1(eth0) entered learning state
[28270.947410] br-lan: topology change detected, propagating
[28270.952915] br-lan: port 1(eth0) entered forwarding state
[28338.937342] eth0: link down
[28338.940241] br-lan: port 1(eth0) entered disabled state
[28339.938506] eth0: link up (100Mbps/Full duplex)
[28339.943167] br-lan: port 1(eth0) entered listening state
[28339.948673] br-lan: port 1(eth0) entered listening state
[28341.947001] br-lan: port 1(eth0) entered learning state
[28343.946980] br-lan: topology change detected, propagating
[28343.952502] br-lan: port 1(eth0) entered forwarding state
[28871.953261] eth0: link down
[28871.956165] br-lan: port 1(eth0) entered disabled state
[28872.954400] eth0: link up (100Mbps/Full duplex)
[28872.959055] br-lan: port 1(eth0) entered listening state
[28872.964558] br-lan: port 1(eth0) entered listening state
[28874.962929] br-lan: port 1(eth0) entered learning state
[28876.962904] br-lan: topology change detected, propagating
[28876.968404] br-lan: port 1(eth0) entered forwarding state
[29182.962099] eth0: link up (100Mbps/Half duplex)
[29184.962085] eth0: link up (100Mbps/Full duplex)
[31764.973069] eth0: link down
[31764.975968] br-lan: port 1(eth0) entered disabled state
[31765.974210] eth0: link up (100Mbps/Full duplex)
[31765.978868] br-lan: port 1(eth0) entered listening state
[31765.984367] br-lan: port 1(eth0) entered listening state
[31767.982735] br-lan: port 1(eth0) entered learning state
[31769.982714] br-lan: topology change detected, propagating
[31769.988232] br-lan: port 1(eth0) entered forwarding state
[33157.984714] eth0: link up (100Mbps/Half duplex)
[33159.984699] eth0: link up (100Mbps/Full duplex)
[33765.989590] eth0: link down
[33765.992493] br-lan: port 1(eth0) entered disabled state
[33766.990729] eth0: link up (100Mbps/Full duplex)
[33766.995414] br-lan: port 1(eth0) entered listening state
[33767.000902] br-lan: port 1(eth0) entered listening state
[33768.999255] br-lan: port 1(eth0) entered learning state
[33770.999242] br-lan: topology change detected, propagating
[33771.004748] br-lan: port 1(eth0) entered forwarding state
[33894.998831] eth0: link down
[33895.001734] br-lan: port 1(eth0) entered disabled state
[33895.999889] eth0: link up (100Mbps/Full duplex)
[33896.004572] br-lan: port 1(eth0) entered listening state
[33896.010052] br-lan: port 1(eth0) entered listening state
[33898.008415] br-lan: port 1(eth0) entered learning state
[33900.008393] br-lan: topology change detected, propagating
[33900.013917] br-lan: port 1(eth0) entered forwarding state
[34876.013830] eth0: link up (10Mbps/Half duplex)
[34878.013785] eth0: link up (100Mbps/Full duplex)

(apologies for the long "code" block)

This link is carrying almost no traffic, outside of the "check" script that runs every 5 minutes and some pings when I connect a laptop every few hours. Note that this is the condition under which the link was failing previously: after a few hours eth0 on the "station" device would "go to sleep" and then fail to wake up when it was connected.

Note: this test was run with Ubiquiti NanoStation LOCO M5 radios running r46834, which, although I checked it out of the "Chaos Calmer" branch, is still identifying itself as "Bleeding Edge."

Thanks,

Bill

comment:48 Changed 2 years ago by psyke83

Please take a look at my comments in bug #18922. My TL-WR842ND's wan port (eth1) sporadically dies from bittorrent activity and fails to recover from the tx timeout condition. I've identified the problematic code already, but I still have to test another change that is probably the proper fix. A quick look at the code shows that my patch would also affect the AR9342.

comment:49 Changed 2 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

fixed in r47892, r47895

comment:50 Changed 2 years ago by tillw

The changes in r47892, r47895 do not fix the "eth0: link down" problems which are indeed caused by a hardware defect in the PHY chip, which can (to my knowledge) only be fixed by a hardware reset of the PHY. Bug #18922 may have similar symptoms but has a very different origin.

To verify, we compiled and flashed latest trunk today to a Nanostation Loco XW. The bug described in this ticket occurred after about 1 hour. :-(

Consequently, I suggest reopening this bug.

Thanks,
Till

comment:51 Changed 2 years ago by nbd

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:52 Changed 2 years ago by psyke83

The patch is also problematic on TL-WR842ND; it recovers the interface but causes a soft irq storm that doesn't stop until the interface is restarted manually.

tillw,

Does manually restarting the interface resolve the tx timeout? If yes, this patch will probably fix both of our issues (but I'm unsure if it will screw up other SOCs, and maybe nbd can find a better fix):

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index 31b38d7..b87a36b 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -870,7 +870,7 @@ static void ag71xx_restart_work_func(struct work_struct *work)
 {
        struct ag71xx *ag = container_of(work, struct ag71xx, restart_work);
 
-       if (ag71xx_get_pdata(ag)->is_ar724x) {
+       if (ag71xx_get_pdata(ag)->is_ar7240) {
                ag->link = 0;
                ag71xx_link_adjust(ag);
                return;

P.S. The existing fix (as well as this one) doesn't *avoid* tx timeout errors from appearing; it only restores the interface after they happen.

Last edited 2 years ago by psyke83

comment:53 Changed 2 years ago by psyke83

I re-read all replies & see that my patch won't help except for loco m5. Regardless, please let me know if restarting the interface (/etc/init.d/network restart) restores connectivity.

If yes, the phy id check can be used to trigger an if down/up event.

Last edited 2 years ago by psyke83

comment:54 Changed 2 years ago by cmlara

Restarting interfaces has not resolved the issue in the past (I have not re-tested since this patch). It really needs a low-level chip reset, and user-level tools (to my knowledge) just do not have the ability to do a full chip reset.

I don't 100% understand the chip layouts but from some of the information I have grabbed from the web (and the labeling given by Ubiquiti) it seems 0x18040000 is a GPIO control register bank, and presumably the pin we are triggering is connected somehow to reset the ethernet chip.

https://wiki.opennet-initiative.de/wiki/Benutzer:Leo seemed to be getting closer. However, I can see that the last misstep in the code was trying to use gpiod to write to the register bit, rather than either setting up a GPIO first and using that (gpiod complained that no such GPIO existed, which is true) or using ar71xx_wr() to write to the register bit.
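To make the "write the register bit directly" idea concrete, here is a minimal user-space sketch of a reset pulse done as a read-modify-write on a GPIO output register. Everything specific in it is an assumption for illustration: the helper names are made up, the pin number is whatever the board actually uses (not stated in this ticket), and the active-low polarity is a guess. It operates on a plain variable standing in for the register, not on real hardware at 0x18040000.

```c
#include <stddef.h>
#include <stdint.h>

/* Clear one bit of a GPIO output register (drive the pin low). */
static void gpio_drive_low(volatile uint32_t *out_reg, unsigned int pin)
{
    *out_reg &= ~(UINT32_C(1) << pin);
}

/* Set one bit of a GPIO output register (drive the pin high). */
static void gpio_drive_high(volatile uint32_t *out_reg, unsigned int pin)
{
    *out_reg |= UINT32_C(1) << pin;
}

/*
 * Pulse a reset line: low, wait, high. Active-low polarity is an
 * assumption. `wait` may be NULL when no delay is needed, for example
 * when running against a simulated register.
 */
static void phy_reset_pulse(volatile uint32_t *out_reg, unsigned int pin,
                            void (*wait)(void))
{
    gpio_drive_low(out_reg, pin);
    if (wait)
        wait();
    gpio_drive_high(out_reg, pin);
}
```

The read-modify-write leaves all other bits of the register untouched, which matters if other pins in the same bank drive LEDs or other peripherals. In a real driver the pointer dereferences would be replaced by the platform's register accessors and the wait by a proper kernel delay.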

My last look was at whether there is a way to call ag71xx_hw_init(), on the assumption that it may be 'cleaner' and would re-initialize the chip. At first I thought that ath79_device_reset_set() and ath79_device_reset_clear() touched a register that is passed in; instead, it looks like they work on the reset register. That may still work with AR71XX_RESET_GE0_MAC (or is it GE1?) in the pdata of the device to trigger a reset of the chip. It wouldn't follow what was suggested by Ubiquiti, so I am not sure whether it's a 'better' or 'worse' place to reset, or what repercussions there are from doing it in the reset register.

I've yet to have time to mock this up or to implement correctly (if it is even the right path)

Speaking for dman776 and myself: We concur that the bug still exists; testing triggered it this morning in our lab as well.

Speaking for myself:
We did see the Ethernet interface going "UP UP" randomly with the r47895 patch. I'm not sure if it was about to get stuck and recovered, or if the patch itself has issues where it responds differently, but eventually it did stick this morning.

comment:55 Changed 2 years ago by bmoffitt

Folks-

I flashed the latest nightly from the trunk (r47896) to test on a pair of Ubiquiti LOCO M5 XW radios. They have been running for a couple of hours now; the client side has been flawless, but I see some very odd behavior in the dmesg for the Access Point side:

[ 52.885082] random: nonblocking pool is initialized
[ 2086.730654] eth0: link down
[ 2086.733554] br-lan: port 1(eth0) entered disabled state
[ 2087.731803] eth0: link up (100Mbps/Full duplex)
[ 2087.736466] br-lan: port 1(eth0) entered forwarding state
[ 2087.742041] br-lan: port 1(eth0) entered forwarding state
[ 2089.740331] br-lan: port 1(eth0) entered forwarding state
[ 2913.722358] eth0: link down
[ 2913.725310] br-lan: port 1(eth0) entered disabled state
[ 2914.723475] eth0: link up (100Mbps/Full duplex)
[ 2914.728137] br-lan: port 1(eth0) entered forwarding state
[ 2914.733696] br-lan: port 1(eth0) entered forwarding state
[ 2916.731995] br-lan: port 1(eth0) entered forwarding state
[ 3792.713089] eth0: link down
[ 3792.715993] br-lan: port 1(eth0) entered disabled state
[ 3793.714170] eth0: link up (100Mbps/Full duplex)
[ 3793.718832] br-lan: port 1(eth0) entered forwarding state
[ 3793.724408] br-lan: port 1(eth0) entered forwarding state
[ 3795.722687] br-lan: port 1(eth0) entered forwarding state
[ 4905.700388] eth0: link down
[ 4905.703287] br-lan: port 1(eth0) entered disabled state
[ 4906.701538] eth0: link up (100Mbps/Full duplex)
[ 4906.706198] br-lan: port 1(eth0) entered forwarding state
[ 4906.711773] br-lan: port 1(eth0) entered forwarding state
[ 4908.710040] br-lan: port 1(eth0) entered forwarding state
[ 7286.674923] eth0: link up (10Mbps/Half duplex)
[ 7288.674905] eth0: link up (100Mbps/Full duplex)
[ 8076.665133] eth0: link down
[ 8076.668051] br-lan: port 1(eth0) entered disabled state
[ 8077.666274] eth0: link up (100Mbps/Full duplex)
[ 8077.670957] br-lan: port 1(eth0) entered forwarding state
[ 8077.676515] br-lan: port 1(eth0) entered forwarding state
[ 8079.674785] br-lan: port 1(eth0) entered forwarding state

As you can see, every so often the Ethernet port just goes down for almost exactly one second.
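The "almost exactly one second" observation can be checked mechanically from the dmesg timestamps. The following is a small sketch, not part of any existing tool; the function names are made up for illustration, and it assumes lines in the "[ seconds] eth0: link down" / "eth0: link up" form shown above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Kind of eth0 link message found in one dmesg line. */
enum link_event { EV_NONE, EV_DOWN, EV_UP };

/*
 * Parse one dmesg line such as "[ 2086.730654] eth0: link down".
 * On a link message, store the timestamp (seconds since boot) in *ts
 * and return EV_DOWN or EV_UP; otherwise return EV_NONE.
 */
static enum link_event parse_link_line(const char *line, double *ts)
{
    double t;

    if (sscanf(line, " [%lf", &t) != 1)
        return EV_NONE;
    if (strstr(line, "eth0: link down")) {
        *ts = t;
        return EV_DOWN;
    }
    if (strstr(line, "eth0: link up")) {
        *ts = t;
        return EV_UP;
    }
    return EV_NONE;
}

/*
 * Return the longest "link down" -> next "link up" gap in seconds,
 * or 0.0 if the lines contain no complete down/up pair. A "link up"
 * without a preceding "link down" (a renegotiation) is ignored.
 */
static double longest_outage(const char *const *lines, size_t n)
{
    double down_at = 0.0, longest = 0.0, ts = 0.0;
    int is_down = 0;

    for (size_t i = 0; i < n; i++) {
        switch (parse_link_line(lines[i], &ts)) {
        case EV_DOWN:
            down_at = ts;
            is_down = 1;
            break;
        case EV_UP:
            if (is_down && ts - down_at > longest)
                longest = ts - down_at;
            is_down = 0;
            break;
        default:
            break;
        }
    }
    return longest;
}
```

For the first flap in the log above, the gap is 2087.731803 - 2086.730654 = 1.001149 seconds. A case where eth0 goes down and never comes back would show up as a trailing "link down" with no matching "link up", which this sketch deliberately does not count as an outage length.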

The configuration of that radio is just a standard Access Point config, with WPA2 PSK authentication/encryption. It is plugged into a switch on a "2Wire" brand router (so I cannot be absolutely sure the router is not at fault).

Question to psyke83 - you noted that "my patch won't help except for loco m5" but tillw noted "we compiled and flashed latest trunk today to a Nanostation Loco XW. The bug described in this ticket occurred after about 1 hour." As far as I know the m5 is the only loco that has the xw architecture currently. Am I missing something here?

-Bill

comment:56 Changed 2 years ago by psyke83

bmoffitt,

I misstated the device; I believe I meant to say the "xm" which has an older AR724x SOC. I looked through the posted attachments, and only ufo's log exhibits tx timeout symptoms, so it's the only device that would be affected by the recent timeout patches.

I occasionally see eth0 going down and up, but honestly I don't use a wired client enough to test if it's really problematic or just a "hiccup" that doesn't really affect connectivity on my device.

Have you guys tried inserting another call to ag71xx_hw_init() somewhere in the driver during link up, perhaps near the beginning of ag71xx_open? Since a tx timeout is not triggered when your chip becomes nonfunctional, you'd have to test its effectiveness by manually bringing the interface down/up after you see the bug manifesting.

comment:57 Changed 2 years ago by cmlara

RE: bmoffitt: comment:55

Saw the same lines (except we were not in bridge mode, so they didn't come from br-lan but from the interface itself) where the interface would go up. This misled us at first (as noted in comment:54) into thinking the issue was resolved, but a day later the link finally hit the initial bug, where the interface went down and would never come back up.

Re: psyke83: comment:56
I haven't put it in yet. I'm not even sure it can be called directly from where I was thinking, which would be inside ag71xx_phy.c: as in comment:37, a simple if() check could make it automatic if it can be placed there, and it might need some pdata as well to make sure it does a reset at the chip. That is one of two paths I'm thinking of (as noted in comment:54); I'm just not sure which is the kernel-appropriate method to handle it.

comment:58 Changed 2 years ago by anonymous

cmlara,

Maybe not, but I was suggesting only for you to test if it clears the condition when a down/up event occurs. So for example, insert the ag71xx_hw_init(ag) call after this line: https://github.com/openwrt-mirror/openwrt/blob/master/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c#L551

Of course, this is not an adequate solution, but if it resets the chip correctly for the next link up event, then you can worry about implementing a proper check by comparing the phy id, etc., afterwards.

comment:59 Changed 2 years ago by bmoffitt

Folks-

I did some testing over the weekend with a pair of Ubiquiti LOCO M5 (Atheros AR9340 Rev:2) units.

The original bug was with the client device - it would take eth0 down if no device was attached, but it would not always bring it up when a device was attached.

Currently, I am seeing the problem on both access point and client sides. On the access point side (as noted earlier), eth0 is going down and then coming back up a second later. However, later in that test, eth0 went down and did not come back up. I repeated the test with a different radio, and it behaved the same way.

The client device did not exhibit the "down for a second, then back up" behavior; at some point it just brought eth0 down and it stayed down - only one message in dmesg.

FWIW,

Bill

comment:60 Changed 2 years ago by WL7COO@…

Just a suggestion as I know virtually nothing about this hardware.

Past experience suggests that time related recurring hardware failures *may* be due to a memory leak.

If anyone is able to reliably run one of these to failure while capturing code execution in a debugger (or device simulator), you may see what memory and execution location status is at the time of failure. If the issue is a memory leak, this would point at which function/call/object may not be releasing all allocated memory during or after each execution.

(apologies if debugging ability is way past this in the current dev environment)

Hoping this might be useful for somebody equipped to work on this
73
...dan WL7COO

comment:61 Changed 23 months ago by anonymous

The solution used by Ubiquiti does not really seem to solve the problem. I just want to mention this thread, as nobody has done so before:
https://community.ubnt.com/t5/Installation-Troubleshooting/Nanostation-XW-Ethernet-issue/m-p/1164752/highlight/false

comment:62 Changed 16 months ago by dman776

we've successfully tested a fix for this! Cleaning up the code now.

comment:63 Changed 16 months ago by dman776

Opening up the code for comment in our Gerrit code review as we finalize this patch:

http://gerrit.aredn.org/#/c/57

We have tested on NBE-M5-19 and NS Loco M5 XW. Other affected ubnt XW devices soon to be tested.

From a NanoStation Loco M5 XW:

[26975.250000] ag71xx_check_reset: expected: 004d, got: 0000
[26975.250000] ag71xx_gpio_reset triggered
[26975.250000] eth0: link down
[26977.260000] eth0: link up (100Mbps/Full duplex)
[42601.260000] ag71xx_check_reset: expected: 004d, got: 0000
[42601.260000] ag71xx_gpio_reset triggered
[42601.260000] eth0: link up (10Mbps/Half duplex)
[42603.280000] eth0: link up (100Mbps/Full duplex)

Add Comment

Modify Ticket

Action
as reopened .
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.