Opened 23 months ago

Last modified 20 months ago

#22075 new defect

Intermittent 802.11 transmission failures in a wireless/WDS bridge, ath9k

Reported by: jmomo
Owned by: developers
Priority: normal
Milestone: Designated Driver (Trunk)
Component: base system
Version: Trunk
Keywords: wds, bridging, 4-address, ath9k
Cc:

Description

I am seeing intermittent 802.11 transmission failures from my OpenWRT router to a bridged client.

I've been chasing this issue for two months now, so please bear with me as I explain boring details which may or may not be of any relevance.

The OpenWRT device is a TP-Link TL-WDR4300 (N750) router, hardware version 1.6.

I am under the impression that the fact I am doing 4-address/WDS bridging is a contributing factor, as I've never seen this issue prior to setting up the bridge two months ago, and non-bridged clients attaching to the same 802.11 phy don't experience the issue during an outage. I have "option wds 1" set in my OpenWRT config, and my bridge client is a Linux system with numerous computers behind it. I have included the wireless config below.

The symptoms are bizarre. During each outage, I can access the OpenWRT router and multiple hops beyond it, but then traffic dies out on the internet near the peering points of my ISP. I assume it is a coincidence that mtr/traceroutes die at the exact same spot out on the internet each time, but it's a strange thing to see.

I've done numerous tcpdump captures to verify that traffic is getting from my bridge client out to the internet, that replies are coming back, and that the OpenWRT system attempts to pass them back to the client. However, they never get there.

Outages occur seemingly at random, and in bursts. Each individual outage usually lasts for 100-130 seconds. Then there is a ~60 second break before another outage occurs. This up/down process can go on for anywhere from three minutes to an hour.

Reauthenticating with the AP clears up the issue instantly, at least for a while ("sudo -u root wpa_cli -i wlan0 reassociate" on the bridge client).

I finally nailed the issue down to OpenWRT last night by watching the output of iw during an outage.

Here is an example of iw output during an outage period. Note the "tx failed" counter, which increments throughout the outage. When I reauthenticate from the client, the interface is re-created, the counter goes to zero, and it stops incrementing until the next outage event.

The command used here was "watch -n 1 iw dev wlan1.sta1 station get 34:de:1a:xx:yy:zz":

Station 34:de:1a:xx:yy:zz (on wlan1.sta1)
        inactive time:  350 ms
        rx bytes:       15362806
        rx packets:     138019
        tx bytes:       107321206
        tx packets:     220692
        tx retries:     42086
        tx failed:      2278
        signal:         -65 [-69, -79, -65] dBm
        signal avg:     -65 [-68, -77, -65] dBm
        tx bitrate:     45.0 MBit/s MCS 2 40MHz short GI
        rx bitrate:     120.0 MBit/s MCS 5 40MHz short GI
        expected throughput:    25.542Mbps
        authorized:     yes
        authenticated:  yes
        preamble:       long
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        connected time: 4556 s
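
For anyone who wants to track these counters programmatically rather than by eye, here is a minimal sketch in Python that pulls the plain integer fields out of the station text shown above. The function name `parse_station` and the shortened `SAMPLE` text are illustrative assumptions, not part of iw, and the parsing assumes the "name: value" layout seen in this ticket.

```python
import re

def parse_station(text):
    """Return a dict of integer counters from `iw dev <if> station get <mac>` text.

    Only lines of the form "<name>: <integer>" are kept; lines with units,
    signal lists, bitrates, or yes/no flags are ignored.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

# Shortened copy of the output above (assumed representative).
SAMPLE = """\
Station 34:de:1a:xx:yy:zz (on wlan1.sta1)
        inactive time:  350 ms
        rx bytes:       15362806
        tx packets:     220692
        tx retries:     42086
        tx failed:      2278
        signal:         -65 [-69, -79, -65] dBm
"""

stats = parse_station(SAMPLE)
# Share of transmitted packets that were reported as failed.
fail_ratio = stats["tx failed"] / stats["tx packets"]
print(stats["tx failed"], f"{fail_ratio:.4f}")
```

Sampling this once per second and comparing successive "tx failed" values would show whether the counter is climbing during an outage.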

Additional clues which may or may not be relevant. Note that these readings were taken at an earlier time than the output above. I diffed the output of everything under /sys/kernel/debug/ieee80211/phy1, with 10 seconds between snapshots:

 /sys/kernel/debug/ieee80211/phy1/ath9k/ani:
-    OFDM ERRORS: 170026
+    OFDM ERRORS: 170461
 
 /sys/kernel/debug/ieee80211/phy1/ath9k/reset:
-   Fatal HW Error: 22
+   Fatal HW Error: 24
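
The same kind of snapshot comparison can be sketched in Python. This is only an illustration under assumptions: the helper names `parse_counters` and `changed_counters` are hypothetical, the snapshots are passed in as text that was already read from the debugfs files, and the input is assumed to use the "Name: value" layout shown in this ticket.

```python
def parse_counters(text):
    """Parse debugfs-style "Name: value" lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        # Split on the last colon so the name part is kept whole.
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters

def changed_counters(before, after):
    """Return {name: (old, new)} for every counter whose value changed."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}

# Two snapshots taken some seconds apart (values taken from the diff above).
BEFORE = "    Baseband Hang:  0\n   Fatal HW Error: 22\n      TX HW error:  0\n"
AFTER = "    Baseband Hang:  0\n   Fatal HW Error: 24\n      TX HW error:  0\n"

print(changed_counters(BEFORE, AFTER))
```

Non-numeric lines such as "ANI: ENABLED" are skipped, so only counters that actually moved between the two snapshots are reported.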


Here is some info about my system:

root@openwrt:~# cat /etc/openwrt_version 
r49034
root@openwrt:~# cat /etc/openwrt_release 
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='Bleeding Edge'
DISTRIB_REVISION='r49034'
DISTRIB_CODENAME='designated_driver'
DISTRIB_TARGET='ar71xx/generic'
DISTRIB_DESCRIPTION='OpenWrt Designated Driver r49034'
DISTRIB_TAINTS='no-all busybox'
root@openwrt:~# cat /tmp/sysinfo/*
tl-wdr4300
TP-Link TL-WDR4300 v1
root@openwrt:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
    Baseband Hang:  0
Baseband Watchdog:  0
   Fatal HW Error: 50
      TX HW error:  0
 Transmit timeout:  0
     TX Path Hang:  0
      PLL RX Hang:  0
         MAC Hang:  0
     Stuck Beacon:  0
        MCI Reset:  0
Calibration error:  0
Tx DMA stop error:  0
Rx DMA stop error:  0
root@openwrt:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani
            ANI: ENABLED
      ANI RESET: 188
     OFDM LEVEL: 0
      CCK LEVEL: 0
        SPUR UP: 120
      SPUR DOWN: 120
 OFDM WS-DET ON: 0
OFDM WS-DET OFF: 0
     MRC-CCK ON: 0
    MRC-CCK OFF: 0
    FIR-STEP UP: 120
  FIR-STEP DOWN: 172
 INV LISTENTIME: 0
    OFDM ERRORS: 404358
     CCK ERRORS: 0
root@openwrt:~# cat /etc/config/wireless
# 2.4Ghz
config wifi-device 'radio0'
	option type 'mac80211'
	option macaddr '10:fe:ed:aa:bb:cc'
	option hwmode '11g'
	option htmode 'HT40'
	list ht_capab 'LDPC'
	list ht_capab 'SHORT-GI-20'
	list ht_capab 'SHORT-GI-40'
	list ht_capab 'TX-STBC'
	list ht_capab 'RX-STBC1'
	list ht_capab 'DSSS_CCK-40'
	option channel 'auto'
	option country 'US'
	option txpower '27'
	option log_level 1 # Default 2

config wifi-iface
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option ssid 'XXXXXXXXXXXXX'
	option encryption 'psk2+ccmp'
	option key 'ZZZZZZZZZZZZZZ'
	option wds 1

# 5Ghz
config wifi-device 'radio1'
	option type 'mac80211'
	option macaddr '10:fe:ed:xx:yy:zz'
	option hwmode '11a'
	option htmode 'HT40'
	list ht_capab 'LDPC'
	list ht_capab 'SHORT-GI-20'
	list ht_capab 'SHORT-GI-40'
	list ht_capab 'TX-STBC'
	list ht_capab 'RX-STBC1'
	list ht_capab 'DSSS_CCK-40'
	option channel 'auto'
	option country 'US'
	option txpower '17'
	option log_level 1 # Default 2

config wifi-iface
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option encryption 'psk2+ccmp'
	option ssid 'CCCCCCCCCCCCCC'
	option key 'DDDDDDDDDDDDDD'
	option wds 1

root@openwrt:~#


If you require more info, please ask and I'll get it. I have a USB flash ext-root on this system and I have plenty of troubleshooting tools installed to work with.

Attachments (0)

Change History (7)

comment:1 Changed 23 months ago by jmomo

I forgot to mention there is nothing in logread/syslog associated with the xmit failures, which seems somewhat surprising given there are "Fatal HW Errors" occurring.

comment:2 Changed 23 months ago by jmomo

My problems appear to be much more serious than I initially thought. My wlan1 (5Ghz) radio is sometimes not coming up on boot/reconfig and I am seeing problems on 2.4Ghz as well, though the 2.4Ghz spectrum in my area is so crowded and noisy that I'm not sure if the counters I'm looking at are caused by spectral congestion or an internal issue.

I think my next step is to fall back to the official CC release and see how that works out. There are a couple of things from DD/Trunk I need, but I can backport them.

I'm not sure if this is a hardware problem or a driver/kernel issue. I regret that I don't have spare hardware to play with.

comment:3 Changed 23 months ago by jmomo

I downgraded to CC/15.05.1 today and changed the 5Ghz channel from 36 to 44, but the problem persists.

I wrote a script to watch output and it puts out things like this:

20160324173005: tx failed:	0 (+0)	Fatal HW Error:  4
20160324173006: tx failed:	6 (+6)	Fatal HW Error:  4
20160324173007: tx failed:	42 (+36)	Fatal HW Error:  4
20160324173008: tx failed:	69 (+27)	Fatal HW Error:  4
20160324173009: tx failed:	94 (+25)	Fatal HW Error:  4
20160324173011: tx failed:	119 (+25)	Fatal HW Error:  4
20160324173012: tx failed:	152 (+33)	Fatal HW Error:  4
20160324173013: tx failed:	191 (+39)	Fatal HW Error:  4
20160324173014: tx failed:	227 (+36)	Fatal HW Error:  4
20160324173015: tx failed:	265 (+38)	Fatal HW Error:  4
20160324173016: tx failed:	305 (+40)	Fatal HW Error:  4
20160324173017: tx failed:	346 (+41)	Fatal HW Error:  4
20160324173018: tx failed:	384 (+38)	Fatal HW Error:  4
20160324173019: tx failed:	421 (+37)	Fatal HW Error:  4
20160324173020: tx failed:	459 (+38)	Fatal HW Error:  4
20160324173021: tx failed:	480 (+21)	Fatal HW Error:  4
20160324173022: tx failed:	498 (+18)	Fatal HW Error:  4
20160324173023: tx failed:	525 (+27)	Fatal HW Error:  4
20160324173024: tx failed:	556 (+31)	Fatal HW Error:  4
20160324173025: tx failed:	594 (+38)	Fatal HW Error:  4
20160324173027: tx failed:	631 (+37)	Fatal HW Error:  4
20160324173028: tx failed:	657 (+26)	Fatal HW Error:  4
20160324173029: tx failed:	688 (+31)	Fatal HW Error:  4
20160324173030: tx failed:	716 (+28)	Fatal HW Error:  4
20160324173031: tx failed:	756 (+40)	Fatal HW Error:  4
20160324173032: tx failed:	797 (+41)	Fatal HW Error:  4
20160324173033: tx failed:	836 (+39)	Fatal HW Error:  4
20160324173034: tx failed:	873 (+37)	Fatal HW Error:  4
20160324173035: tx failed:	913 (+40)	Fatal HW Error:  4
20160324173036: tx failed:	953 (+40)	Fatal HW Error:  4
20160324173037: tx failed:	992 (+39)	Fatal HW Error:  4
20160324173038: tx failed:	1032 (+40)	Fatal HW Error:  4
20160324173039: tx failed:	1071 (+39)	Fatal HW Error:  4
20160324173040: tx failed:	1110 (+39)	Fatal HW Error:  4
20160324173041: tx failed:	1150 (+40)	Fatal HW Error:  4
20160324173043: tx failed:	1187 (+37)	Fatal HW Error:  4
20160324173044: tx failed:	1227 (+40)	Fatal HW Error:  4
20160324173045: tx failed:	1267 (+40)	Fatal HW Error:  4
20160324173046: tx failed:	1305 (+38)	Fatal HW Error:  4
20160324173047: tx failed:	1343 (+38)	Fatal HW Error:  4
20160324173048: tx failed:	1376 (+33)	Fatal HW Error:  4
20160324173049: tx failed:	1404 (+28)	Fatal HW Error:  4
20160324173050: tx failed:	1432 (+28)	Fatal HW Error:  4 <-- Client bridge dies right about here.
20160324173051: tx failed:	1439 (+7)	Fatal HW Error:  4
20160324173052: tx failed:	1451 (+12)	Fatal HW Error:  4
20160324173053: tx failed:	1459 (+8)	Fatal HW Error:  4
20160324173054: tx failed:	1462 (+3)	Fatal HW Error:  4
20160324173055: tx failed:	1463 (+1)	Fatal HW Error:  4
20160324173056: tx failed:	1464 (+1)	Fatal HW Error:  4
20160324173057: tx failed:	1476 (+12)	Fatal HW Error:  4
20160324173058: tx failed:	1480 (+4)	Fatal HW Error:  4
20160324173100: tx failed:	1484 (+4)	Fatal HW Error:  4
20160324173101: tx failed:	1491 (+7)	Fatal HW Error:  4
20160324173102: tx failed:	1506 (+15)	Fatal HW Error:  4
20160324173103: tx failed:	1517 (+11)	Fatal HW Error:  4
20160324173104: tx failed:	1522 (+5)	Fatal HW Error:  4
20160324173105: tx failed:	1531 (+9)	Fatal HW Error:  4
20160324173106: tx failed:	1540 (+9)	Fatal HW Error:  4
20160324173107: tx failed:	1545 (+5)	Fatal HW Error:  4
20160324173108: tx failed:	1551 (+6)	Fatal HW Error:  4
20160324173109: tx failed:	1560 (+9)	Fatal HW Error:  4
20160324173110: tx failed:	1564 (+4)	Fatal HW Error:  4
20160324173111: tx failed:	1571 (+7)	Fatal HW Error:  4
20160324173112: tx failed:	4 (+-1567)	Fatal HW Error:  4 <-- interface reset on client reassociation.
20160324173113: tx failed:	4 (+0)	Fatal HW Error:  4
20160324173114: tx failed:	4 (+0)	Fatal HW Error:  4
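
The original watch script was not posted. As a rough idea of how lines in this layout could be produced, here is a hypothetical Python sketch. The function name `format_samples` and the sample tuples are assumptions; a real watcher would take the timestamp from something like `time.strftime("%Y%m%d%H%M%S")` and read the two counters from iw and the ath9k reset file.

```python
def format_samples(samples):
    """Yield one log line per (timestamp, tx_failed, fatal_hw_errors) sample.

    The delta is relative to the previous sample; the first sample has no
    predecessor, so its delta is reported as 0.
    """
    previous = None
    for stamp, tx_failed, fatal in samples:
        delta = 0 if previous is None else tx_failed - previous
        previous = tx_failed
        # A literal "+" is always printed, so a counter reset shows as "+-N".
        yield f"{stamp}: tx failed:\t{tx_failed} (+{delta})\tFatal HW Error:  {fatal}"

# Illustrative samples copied from the first three lines of the log above.
SAMPLES = [
    ("20160324173005", 0, 4),
    ("20160324173006", 6, 4),
    ("20160324173007", 42, 4),
]

lines = list(format_samples(SAMPLES))
for line in lines:
    print(line)
```

Because the delta is a plain subtraction, the drop back to 4 after reassociation comes out as a negative number, which matches the "(+-1567)" seen in the log.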


However, only about half of the outages I experience look like this. In the other half, there are no tx failures when an outage occurs. The Fatal HW Errors happen at random rather than when I see tx failures, so there is no apparent correlation there.

I wanted to try a USB 5Ghz adapter but when I did I ran into bug #22090, so fuck my life.

comment:4 follow-up: Changed 20 months ago by gonghuan98@…

I have the same issue as you described. Your messages offer more information than I was able to gather. I also found that the outage event and the client bridge death show up in the hostapd debug logs: when an outage happens, all WDS stations lose their connection to the WDS AP. I forgot to mention that I have about 6-10 stations, all connected to one WDS AP. Do you think this could be a hostapd-related issue? If you have found the origin of the issue or a solution, please send me an email. Thanks a lot.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 20 months ago by jmomo

Replying to gonghuan98@…:

Are you sure it's the same issue? Do you even have the same hardware?

I had thought about coming back and closing this. It is my opinion now that what I am seeing is either a hardware failure or some kind of fault/bug in the software for this wireless hardware.

The problem has nothing to do with bridging. I just started seeing it more often on a bridged host, but I am able to reproduce the problem on non-bridged hosts as well. It took a lot of troubleshooting to figure that out though.

I started using a USB attached Realtek adapter and the problems on that interface went away, so it's something specific to the built-in Atheros hardware. It's either bad hardware or some bugs in software specific to the hardware.

I had given up on this issue anyway. The hardware is old and I plan to replace it.

comment:6 in reply to: ↑ 5 Changed 20 months ago by gonghuan98@…

Replying to jmomo:

I tried setting up the wireless bridge with relayd instead of WDS, and the problem went away. The connection has been stable for more than two days. My hardware is a QCA9531; I guess yours is an AR9344? The two chips are very similar, right? Since you do not want to pursue this issue, I will have to solve it on my own, which is sad. One more thing: I found some old posts describing similar problems going back 10 years. Thanks for your reply anyway.

comment:7 Changed 20 months ago by anonymous

Completely different chip. Totally different driver.
