Modify

Opened 3 years ago

Last modified 16 months ago

#18922 reopened defect

eth1 (ag71xx): transmit queue 0 timed out

Reported by: fpv2000@… Owned by: developers
Priority: normal Milestone: Chaos Calmer 15.05
Component: kernel Version: Trunk
Keywords: Cc:

Description

Hello,

I have a TP-Link WR-851N v9 configured for NAT and WiFi AP.

I use OpenWrt Chaos Calmer r44296.

Under a load of ~15mbits it got this crash :

[33034.010000] ------------[ cut here ]------------
[33034.010000] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x1e8/0x26c()
[33034.020000] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
[33034.030000] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_nat_irc nf_nat_ftp nf_nat nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_irc nf_conntrack_ftp nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_REJECT ip_tables crc_ccitt compat ledtrig_usbdev ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables zram lzo_decompress lzo_compress zsmalloc ipv6 arc4 crypto_blkcipher ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
[33034.100000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.30 #1
[33034.100000] Stack : 00000006 00000000 00000000 00000000 00000000 00000000 803ac98e 00000032
[33034.100000] 	  803411e8 000000b9 802f8778 80341613 00000000 803a3b5c 803411e8 000000b9
[33034.100000] 	  803ac304 00000001 00000004 8029aecc 00000003 801ffc1c 00000108 000000b9
[33034.100000] 	  802fb93c 80331c74 00000000 00000000 00000000 00000000 00000000 00000000
[33034.100000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[33034.100000] 	  ...
[33034.140000] Call Trace:
[33034.140000] [<80241d4c>] show_stack+0x48/0x70
[33034.150000] [<802ab404>] warn_slowpath_common+0x84/0xb4
[33034.150000] [<802ab460>] warn_slowpath_fmt+0x2c/0x38
[33034.160000] [<80111740>] dev_watchdog+0x1e8/0x26c
[33034.160000] [<800eecc8>] call_timer_fn.isra.38+0x24/0x84
[33034.170000] [<8022ae9c>] run_timer_softirq+0x17c/0x1bc
[33034.170000] [<8008f5e8>] __do_softirq+0xf8/0x228
[33034.180000] [<80186c80>] irq_exit+0x54/0x70
[33034.180000] [<80060830>] ret_from_irq+0x0/0x4
[33034.190000] [<80060a80>] __r4k_wait+0x20/0x40
[33034.190000] [<80100c54>] cpu_startup_entry+0xa4/0x104
[33034.200000] [<8034e94c>] start_kernel+0x3c8/0x3e0
[33034.200000] 
[33034.200000] ---[ end trace 52508bb6bbfb5380 ]---

Thanks for your help

Attachments (4)

kernel.log (13.9 KB) - added by anonymous 3 years ago.
dmesg-wl842n.log (17.8 KB) - added by psyke83 3 years ago.
logread-wl842n.log (17.4 KB) - added by psyke83 3 years ago.
dmesg-wl842n-recovered.txt (16.5 KB) - added by psyke83 2 years ago.

Download all attachments as: .zip

Change History (48)

Changed 3 years ago by anonymous

Changed 3 years ago by psyke83

Changed 3 years ago by psyke83

comment:1 Changed 3 years ago by psyke83

I'm having the same problem with my TP Link-WL842N, r45922.

[35974.010000] ------------[ cut here ]------------
[35974.010000] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x1d8/0x25c()
[35974.020000] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
[35974.020000] Modules linked in: ifb ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat fuse act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_usbdev ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ipv6 arc4 crypto_blkcipher usb_storage sd_mod scsi_mod ext4 jbd2 mbcache crc16 crypto_hash ohci_platform ohci_hcd ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common [last unloaded: ifb]
[35974.130000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.14 #1
[35974.130000] Stack : 00000000 00000000 00000000 00000000 803e4272 00000032 00000000 80262490
[35974.130000] 	  00000001 00000029 8032bbcc 80384ba3 00000000 803e349c 80384ed8 00000029
[35974.130000] 	  803e38e0 00000001 00000004 800a3720 00000003 800803d0 0000012f 00000029
[35974.130000] 	  8032f1d8 8037fc6c 00000000 00000000 00000000 00000000 00000000 00000000
[35974.130000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[35974.130000] 	  ...
[35974.170000] Call Trace:
[35974.170000] [<80071058>] show_stack+0x48/0x70
[35974.180000] [<800804d8>] warn_slowpath_common+0x84/0xb4
[35974.180000] [<80080534>] warn_slowpath_fmt+0x2c/0x38
[35974.190000] [<80262490>] dev_watchdog+0x1d8/0x25c
[35974.190000] [<800ab19c>] call_timer_fn.isra.38+0x24/0x84
[35974.200000] [<800ab9b4>] run_timer_softirq+0x1bc/0x1f8
[35974.200000] [<800829d8>] __do_softirq+0xf8/0x230
[35974.210000] [<80082d3c>] irq_exit+0x54/0x70
[35974.210000] [<80060830>] ret_from_irq+0x0/0x4
[35974.220000] [<80060a80>] __r4k_wait+0x20/0x40
[35974.220000] [<8009f6b8>] cpu_startup_entry+0xf0/0x13c
[35974.220000] [<8039eb64>] start_kernel+0x434/0x44c
[35974.230000] 
[35974.230000] ---[ end trace e8e90892f46c2d1e ]---

This happens every 1-2 days, and the internet is unreachable unless I reboot the router or restart networking. The error doesn't affect wifi connectivity to the router, so SSH is always available. Due to the frequency of the problem I was forced to write a script to automatically recover connectivity: https://gist.github.com/psyke83/a2476609557f58ad946d

comment:2 Changed 2 years ago by psyke83

Could this be an oversight that might solve the tx timeout issues for the original reporter (QCA9553) and my own (AR7241) device? Both are set with the "is_ar724x" flag, but not "is_ar7240".

--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -914,7 +914,7 @@ static int ag71xx_tx_packets(struct ag71xx *ag)
 		struct sk_buff *skb = ring->buf[i].skb;
 
 		if (!ag71xx_desc_empty(desc)) {
-			if (pdata->is_ar7240 &&
+			if (pdata->is_ar724x &&
 			    ag71xx_check_dma_stuck(ag, ring->buf[i].timestamp))
 				schedule_work(&ag->restart_work);
 			break;

I've completed a local compile with this change but I've never tried flashing self-compiled build before. so I'm apprehensive. I'll follow up soon (if the build flashes/works correctly).

Edit: I neglected to mention that the bug is almost exclusively triggered by bittorrent activity on my device (via a WAN client), but it's difficult to reproduce reliably. I haven't had any luck triggering a tx timeout via ping flooding or iperf stress testing between a LAN / WAN clients.

Edit 2: the patched build flashed successfully - I'll download a few ISOs over bittorrent and see if I can trigger the bug.

Last edited 2 years ago by psyke83 (previous) (diff)

comment:3 Changed 2 years ago by anonymous

Update: the patch in the previous comment wasn't sufficient to solve the issue, but this *did*:

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index dc77699..28669ef 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -489,6 +489,8 @@ static void ag71xx_hw_init(struct ag71xx *ag)
 	ag71xx_hw_setup(ag);
 
 	ag71xx_dma_reset(ag);
+
+	printk("ag71xx_hw_init\n");
 }
 
 static void ag71xx_fast_reset(struct ag71xx *ag)
@@ -865,13 +867,19 @@ static void ag71xx_restart_work_func(struct work_struct *work)
 {
 	struct ag71xx *ag = container_of(work, struct ag71xx, restart_work);
 
+#if 0
 	if (ag71xx_get_pdata(ag)->is_ar724x) {
 		ag->link = 0;
 		ag71xx_link_adjust(ag);
 		return;
 	}
+#endif
 
 	ag71xx_stop(ag->dev);
+
+	if (ag71xx_get_pdata(ag)->is_ar724x)
+		ag71xx_hw_init(ag);
+
 	ag71xx_open(ag->dev);
 }
 
@@ -914,7 +922,7 @@ static int ag71xx_tx_packets(struct ag71xx *ag)
 		struct sk_buff *skb = ring->buf[i].skb;
 
 		if (!ag71xx_desc_empty(desc)) {
-			if (pdata->is_ar7240 &&
+			if (pdata->is_ar724x &&
 			    ag71xx_check_dma_stuck(ag, ring->buf[i].timestamp))
 				schedule_work(&ag->restart_work);
 			break;

Log snippet:

[66994.010000] ------------[ cut here ]------------
[66994.010000] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x1d8/0x25c()
[66994.020000] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
[66994.020000] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_ftp nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat fuse act_ipt em_nbyte sch_teql sch_prio sch_pie sch_codel sch_gred em_meta sch_htb cls_basic em_text sch_tbf sch_red sch_sfq sch_fq act_police em_cmp sch_dsmark act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_usbdev ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ifb zram lzo_decompress lzo_compress lz4_decompress lz4_compress zsmalloc ipv6 arc4 crypto_blkcipher usb_storage ohci_platform ohci_hcd ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 jbd2 mbcache usbcore nls_base usb_common crc16 crypto_hash
[66994.140000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.23 #1
[66994.150000] Stack : 00000000 00000000 00000000 00000000 803b4292 00000032 00000000 8024d7e8
[66994.150000]    00000001 00000059 80305cd8 8035c9e3 00000000 803b34d0 8035cd18 00000059
[66994.150000]    803b3918 00000001 00000004 800a35cc 00000003 800803bc 0000012f 00000059
[66994.150000]    803092d0 80357c6c 00000000 00000000 00000000 00000000 00000000 00000000
[66994.150000]    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[66994.150000]    ...
[66994.190000] Call Trace:
[66994.190000] [<80070fe0>] show_stack+0x50/0x84
[66994.190000] [<800804c4>] warn_slowpath_common+0x84/0xb4
[66994.200000] [<80080520>] warn_slowpath_fmt+0x2c/0x38
[66994.200000] [<8024d7e8>] dev_watchdog+0x1d8/0x25c
[66994.210000] [<800ab068>] call_timer_fn.isra.38+0x24/0x84
[66994.210000] [<800ab880>] run_timer_softirq+0x1bc/0x1f8
[66994.220000] [<800829c4>] __do_softirq+0xf8/0x230
[66994.220000] [<80082d30>] irq_exit+0x54/0x70
[66994.230000] [<80060830>] ret_from_irq+0x0/0x4
[66994.230000] [<80060a80>] __r4k_wait+0x20/0x40
[66994.240000] [<8009f564>] cpu_startup_entry+0xf0/0x13c
[66994.240000] [<80375b5c>] start_kernel+0x42c/0x444
[66994.250000] 
[66994.250000] ---[ end trace 46214145c634dfe5 ]---
[66994.250000] eth1: tx timeout
[66994.260000] eth1: link down
[66994.940000] ag71xx_hw_init
[66997.650000] eth1: link up (100Mbps/Half duplex)

With the above patch, a tx timeout triggered while I was simultaneously downloading via bittorrent *and* via http, and the http download was not interrupted during recovery from the timeout event.

I'll try another build that only disables that same code block in the ag71xx_restart_work_func, as I'm not sure if it was the ag71xx_hw_init addition that was the reason for the successful recovery. This may take some time, as it took a *long time* to trigger the bug.

Changed 2 years ago by psyke83

comment:4 Changed 2 years ago by psyke83

OK, I can confirm that this was the only change necessary for my device to recover successfully from the tx timeout:

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index dc77699..10d3269 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -865,11 +865,13 @@ static void ag71xx_restart_work_func(struct work_struct *work)
 {
        struct ag71xx *ag = container_of(work, struct ag71xx, restart_work);
 
+#if 0
        if (ag71xx_get_pdata(ag)->is_ar724x) {
                ag->link = 0;
                ag71xx_link_adjust(ag);
                return;
        }
+#endif
 
        ag71xx_stop(ag->dev);
        ag71xx_open(ag->dev);

I'll keep poking to see if I can find a better fix than disabling that code block.

Last edited 2 years ago by psyke83 (previous) (diff)

comment:5 Changed 2 years ago by psyke83

OK, I found a 100% reproducible test case on my device. It is a slight change to uysiquebien's script at https://dev.openwrt.org/ticket/10269#comment:25:

#!/usr/bin/env python

import socket

TCP_IP = '4.testdebit.info'
TCP_PORT = 5200

print("Connecting")
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.connect((TCP_IP, TCP_PORT))
print("Connected")

while True:
    s.send(b"012")

Two conditions are necessary to trigger the timeout:

  1. The only three characters should be sent (e.g. "012") - any more and it doesn't work.
  2. The script must be run from a wired client. I can't trigger the bug when executed over wifi.

I used a public iperf3 server address/port for the script, which is listed.

Using the above script & some debug output I discovered that the tx timeout was being triggered when the tx queue became full. It seems that the netif_stop_queue is executed under this circumstance, but it never resumed, which evidently causes the interface to stop responding.

This seems to be the correct fix:

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index dc77699..86a5e06 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -522,6 +522,8 @@ static void ag71xx_fast_reset(struct ag71xx *ag)
        ag71xx_wr(ag, AG71XX_REG_MII_CFG, mii_reg);
 
        ag71xx_hw_set_macaddr(ag, dev->dev_addr);
+
+       netif_wake_queue(dev);
 }
 
 static void ag71xx_hw_start(struct ag71xx *ag)
Last edited 2 years ago by psyke83 (previous) (diff)

comment:6 Changed 2 years ago by nbd

@psyke83: Thanks for the analysis of the bug and your proposed patch. However, I looked at the code and I think your fix is incomplete. So I made a patch which I think should fix the issue properly. Please test if it works for you: http://nbd.name/ag71xx-fix.patch

comment:7 Changed 2 years ago by anonymous

nbd,

Thanks, your patch also works fine. Just to be clear what happens with your patch (and did with my incomplete version):

  1. Start a http download (to ensure interface recovers correctly)
  2. Execute script.
  3. The tx timeout triggers & the http download stalls.
  4. After a few seconds, the interface goes back up and the http download continues.
  5. Keeping the script running still continuously triggers the tx timeout, but the script usually stops after one or two recovery events (possibly due to a tcp timeout?)

I wonder if there's a better way to mitigate tx queue full issue so as to avoid tx timeouts entirely?

Two queries loosely related to the tx timeout issue:

  1. Is the ag71xx_check_dma_stuck check not applicable to AR7241? The pdata->is_ar7240 part of this code excludes my device from availing of it properly:
    		if (!flush && !ag71xx_desc_empty(desc)) {
    			if (pdata->is_ar7240 &&
    			    ag71xx_check_dma_stuck(ag, ring->buf[i].timestamp))
    				schedule_work(&ag->restart_work);
    			break;
    		}
    
  1. Is your descriptor splitting patch not applicable to AR7241? This check excludes it from being used on my device:
    if (!pdata->is_ar724x && !pdata->is_ar91xx) {
    		ag->tx_ring.desc_split = AG71XX_TX_RING_SPLIT;
    		ag->tx_ring.size *= AG71XX_TX_RING_DS_PER_PKT;
    	}
    

As long as these two points are irrelevant to my device, your patch is 100% fine. Thanks for the assistance.

comment:8 Changed 2 years ago by psyke83

The last comment was from me - I always neglect to check if I'm signed in, apologies.

comment:9 Changed 2 years ago by nbd

The DMA stuck check was not applied to AR7241 because it works around a bug specific to AR7240. Not sure if it could help with your issue on AR7241, you could give it a try.

The descriptor splitting is a workaround for a throughput issue on AR71xx, it should not be necessary on other soc types.

comment:10 Changed 2 years ago by nbd

Patch committed in r47892

comment:11 Changed 2 years ago by psyke83

I enabled & added some debug to the loop that uses ag71xx_check_dma_stuck, and it seems that the function was not triggering restart_work at least for this testcase, so you're most likely correct. Thanks for clarifying re: the splitting patch, too.

comment:12 Changed 2 years ago by VittGam

Hello,

I'm having the same exact problem on several Ubiquiti UniFi APs that have an AR7240 rev 2 SoC.

I'm now going to test your patch (r47892) to see if it solves the issue or not.

Felix, could you please also backport the patch to CC? It should apply cleanly.

Cheers,
Vittorio G

comment:13 Changed 2 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

backported in r47895

comment:14 Changed 2 years ago by anonymous

nbd,

It seems that your patch may have issues. After a tx timeout and recovery, while the interface works properly, ksoftirqd will peg the CPU until the interface is manually brought down/up. I'll try to isolate what's causing the soft irq storm when I have more time.

comment:15 Changed 2 years ago by makedir

Anything new on this what anon above said? Is it save to install latest trunk build?

comment:16 Changed 2 years ago by makedir

Does this affect Netgear WNDR3700 and 3800 too?

comment:17 Changed 2 years ago by psyke83

  • Resolution fixed deleted
  • Status changed from closed to reopened

nbd,

The soft irq storm after tx recovery is caused by this goto being called continuously:

https://github.com/openwrt-mirror/openwrt/blob/master/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c#L1073

I enabled some debug related to this routine, and this line is spammed (only after tx recovery):

eth1: stay in polling mode, rx=0, tx=0, limit=64

Additionally, eth1 will never disable polling mode after the tx recovery has taken place.

I'm not sure how to fix this correctly, but for the time being this workaround solves the sirq storm:

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index 31b38d7..b87a36b 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -870,7 +870,7 @@ static void ag71xx_restart_work_func(struct work_struct *work)
 {
        struct ag71xx *ag = container_of(work, struct ag71xx, restart_work);
 
-       if (ag71xx_get_pdata(ag)->is_ar724x) {
+       if (ag71xx_get_pdata(ag)->is_ar7240) {
                ag->link = 0;
                ag71xx_link_adjust(ag);
                return;

comment:18 Changed 2 years ago by psyke83

I'm thinking that it may be an issue with small DMA transfers. Look at this line:

	if (skb->len <= 4) {
		DBG("%s: packet len is too small\n", ag->dev->name);
		goto err_drop;
	}

However, a few lines later:

n = ag71xx_fill_dma_desc(ring, (u32) dma_addr, skb->len & ag->desc_pktlen_mask);

The ag->desc_pktlen_mask is 4095 on my platform, so if skb->len is between 4096-4100 (and so on), packet lengths below 5 will be processed erroneously.

My attempt to fix this issue:

--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -747,32 +747,33 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
        struct ag71xx_desc *desc;
        dma_addr_t dma_addr;
        int i, n, ring_min;
+       int len = skb->len & ag->desc_pktlen_mask;
 
        if (ag71xx_has_ar8216(ag))
                ag71xx_add_ar8216_header(ag, skb);
 
-       if (skb->len <= 4) {
+       if (len <= 4) {
                DBG("%s: packet len is too small\n", ag->dev->name);
                goto err_drop;
        }
 
-       dma_addr = dma_map_single(&dev->dev, skb->data, skb->len,
+       dma_addr = dma_map_single(&dev->dev, skb->data, len,
                                  DMA_TO_DEVICE);
 
        i = ring->curr % ring->size;
        desc = ag71xx_ring_desc(ring, i);
 
        /* setup descriptor fields */
-       n = ag71xx_fill_dma_desc(ring, (u32) dma_addr, skb->len & ag->desc_pktlen_mask);
+       n = ag71xx_fill_dma_desc(ring, (u32) dma_addr, len);
        if (n < 0)
                goto err_drop_unmap;
 
        i = (ring->curr + n - 1) % ring->size;
-       ring->buf[i].len = skb->len;
+       ring->buf[i].len = len;
        ring->buf[i].skb = skb;
        ring->buf[i].timestamp = jiffies;
 
-       netdev_sent_queue(dev, skb->len);
+       netdev_sent_queue(dev, len);
 
        desc->ctrl &= ~DESC_EMPTY;
        ring->curr += n;
@@ -797,7 +798,7 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
        return NETDEV_TX_OK;
 
 err_drop_unmap:
-       dma_unmap_single(&dev->dev, dma_addr, skb->len, DMA_TO_DEVICE);
+       dma_unmap_single(&dev->dev, dma_addr, len, DMA_TO_DEVICE);
 
 err_drop:
        dev->stats.tx_dropped++;

However, the tx timeout still occurred with this patch, so I'm thinking
a) the 4 byte minimum may be too low for my hardware?
b) the dma descritor 4-byte len check needs to be moved out of the (cur_len > split) check (and the value increased, if point (a) is correct):

--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -710,16 +710,15 @@ static int ag71xx_fill_dma_desc(struct ag71xx_ring *ring, u32 addr, int len)
                if (!ag71xx_desc_empty(desc))
                        return -1;
 
-               if (cur_len > split) {
+               if (cur_len > split)
                        cur_len = split;
 
-                       /*
-                        * TX will hang if DMA transfers <= 4 bytes,
-                        * make sure next segment is more than 4 bytes long.
-                        */
-                       if (len <= split + 4)
-                               cur_len -= 4;
-               }
+               /*
+                * TX will hang if DMA transfers <= 4 bytes,
+                * make sure next segment is more than 4 bytes long.
+                */
+               if (len <= split + 4)
+                       cur_len -= 4;
 
                desc->data = addr;
                addr += cur_len;

I'll do some more tests, but I need to wait until people aren't using the internet here.

Last edited 2 years ago by psyke83 (previous) (diff)

comment:19 Changed 2 years ago by anonymous

anything new?

comment:20 Changed 2 years ago by psyke83

The first patch in my previous post probably does fix a genuine bug with large packet sizes, but they weren't the cause of the tx timeouts, so it doesn't solve anything. Testing with larger DMA descriptor lengths didn't help, either.

comment:21 Changed 2 years ago by anonymous

Does this bug also happen on all ag71xx devices, or just the TP-Link WR-851N? What about the Netgear WNDR3700v2 and WNDR3800? Because I've noticed some other maybe related phenomena, like wifi becoming slow then suddenly working properly again at 24MB/s (no other wifi networks around at 5ghz), or wan speeds dropping to 0mbit then going up again.

comment:22 follow-up: Changed 2 years ago by dizzy

I am running DESIGNATED DRIVER(Bleeding Edge, r48016) on WR941ND_v5 and having the same problem.
My dmesg goes as:
[ 24.857657] eth0: link up (100Mbps/Full duplex)
[ 24.862385] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 25.095483] br-lan: port 2(wlan0) entered forwarding state
[ 44.555479] random: nonblocking pool is initialized
[ 6056.204855] eth1: link down
[ 6056.207808] br-lan: port 1(eth1) entered disabled state
[ 6056.705953] eth1: link up (1000Mbps/Full duplex)
[ 6056.710786] br-lan: port 1(eth1) entered forwarding state
[ 6056.716480] br-lan: port 1(eth1) entered forwarding state
[ 6058.714209] br-lan: port 1(eth1) entered forwarding state
[18610.676741] eth1: link down
[18610.679693] br-lan: port 1(eth1) entered disabled state
[18611.177835] eth1: link up (1000Mbps/Full duplex)
[18611.182688] br-lan: port 1(eth1) entered forwarding state
[18611.188367] br-lan: port 1(eth1) entered forwarding state
[18613.186107] br-lan: port 1(eth1) entered forwarding state
[68589.590802] eth1: link down
[68589.593772] br-lan: port 1(eth1) entered disabled state
[68590.091962] eth1: link up (1000Mbps/Full duplex)
[68590.096798] br-lan: port 1(eth1) entered forwarding state
[68590.102476] br-lan: port 1(eth1) entered forwarding state
[68592.099886] br-lan: port 1(eth1) entered forwarding state
[89785.554919] eth1: link down
[89785.557880] br-lan: port 1(eth1) entered disabled state
[89786.055996] eth1: link up (1000Mbps/Full duplex)
[89786.060863] br-lan: port 1(eth1) entered forwarding state
[89786.066556] br-lan: port 1(eth1) entered forwarding state
[89788.064237] br-lan: port 1(eth1) entered forwarding state
[98091.536793] eth1: link down
[98091.539788] br-lan: port 1(eth1) entered disabled state
[98092.037904] eth1: link up (1000Mbps/Full duplex)
[98092.042759] br-lan: port 1(eth1) entered forwarding state
[98092.048449] br-lan: port 1(eth1) entered forwarding state
[98094.046142] br-lan: port 1(eth1) entered forwarding state

comment:23 in reply to: ↑ 22 Changed 2 years ago by psyke83

Replying to dizzy:

I am running DESIGNATED DRIVER(Bleeding Edge, r48016) on WR941ND_v5 and having the same problem.

The original reporter and I have a device using the AR7241 chipset, but yours is based on AR9341. Additionally, your log shows no "tx timeout" messages, so it's probably not related.

comment:24 Changed 2 years ago by anonymous

Is the AR7161 affected by this too?

comment:25 Changed 2 years ago by psyke83

There's no way to be sure except to check your dmesg log and see if the timeouts are occurring. Recent builds will recover from the timeout condition, but the recovery causes an irq storm that will most likely cause instability or reboots (as I noticed with mine).

comment:26 Changed 2 years ago by anonymous

How do I enable those debug messages? My dmesg is pretty empty all the time.

comment:27 Changed 2 years ago by psyke83

You don't need any special debug enabled for the messages to show. You can also try to manually trigger a timeout using the script in comment #5 (but it can be difficult to reproduce; it rarely works over a wifi link, and SQM/QOS makes it much harder to trigger).

comment:28 Changed 2 years ago by psyke83

This patch solves the sirq storm that occurs after recovery from tx timeout:

Edit: I've submitted a revised patch to openwrt-devel for feedback.

Last edited 2 years ago by psyke83 (previous) (diff)

comment:29 Changed 2 years ago by nbd

please try latest trunk

comment:30 Changed 2 years ago by anonymous

Again, sorry for asking, I am no export like all the other people on here. Just to clerify this: If I dont see any kernel panic messages in dmesg, I am safe from this bug and I dont need to update to trunk? I safe problems with wifi bandwidth though too sometimes and I have no idea why. Sometimes it works fine and I get really good rates of about 24MB/s but then sometimes just 12-15MB/s, over 5GHz (there is no other 5GHz network in range and I am just using one WIFI client).

comment:31 Changed 2 years ago by psyke83

nbd,

It seems the latest trunk with your two commits are working perfectly. I've manually triggered the bug lots of times (I added some debug to log when the dma stuck condition occurs); the recovery happens successfully without a link state change or sirq storm, and the tx timeout is usually not triggered. Occasionally, if I leave testcase script running, a tx timeout will actually trigger, but it also recovers just fine if this happens.

I've set the old patches to deprecated/archived on patchwork; hopefully that's all that was needed to minimize your workload. Thanks!

To the last anonymous person: this bug would have caused your wan port to stop responding completely when a tx timeout occurred, and not recover until you manually rebooted/restarted the interface. If you never have internet connectivity loss with a tx timeout in your log, your problem is not related.

comment:32 Changed 2 years ago by nbd

  • Resolution set to fixed
  • Status changed from reopened to closed

great, thanks for testing

comment:33 Changed 2 years ago by VittGam

Hello,

At the end I still haven't had the opportunity to test the patches on the UniFi APs...

Anyway, Felix, could you please backport r48227 and r48228 to CC, so they could get into 15.05.1?

Thank you,
Vittorio G

comment:34 Changed 2 years ago by nbd

Done

comment:35 Changed 2 years ago by ivaradi@…

Hi,

I am using a TP-Link WR741ND, which exhibited the timeout problem (although on eth0). I have upgraded to the latest on the CC branch after the two commits mentioned above had been ported to it, and the problem went away.

However, ksoftirqd can still start producing a load hovering about ~60%, which sometimes can go away on its own, sometimes it cannot. Generally, networking continues to work normally. However, yesterday the load of ksoftirqd went up to near 100%, and my PC (connected via cable) could not receive a reply for repeated DHCPDISCOVER requests (while other network traffic seemd unaffected). I tried restarting dnsmasq, etc. but it did not help. The only thing that helped was to reset eth0 (ifconfig down/up), after which the ksoftirqd load went back to around 0, and DHCP worked again. (It might be noted that I experienced this DHCP problem earlier, when the router had Gargoyle 1.6 installed, and the cure was the same there, although I did not notice a high ksoftirqd load there).

I am not sure if this is related to this bug and the fix, but as soft IRQ storms happened earlier, something may not be entirely correct with the patch.

It seems that a bittorrent download can relatively easily cause this problem to appear. I am willing to debug it, but not being familiar with the code, I would appreciate some hints as to where to start (apart from starting to read and understand the code :))

comment:36 Changed 2 years ago by anonymous

As an addition to the above observation, while ksoftirqd is around 60%, the NET_RX value in /proc/softirqs increases at a rate of ~70000/sec.

comment:37 Changed 2 years ago by ivaradi

  • Resolution fixed deleted
  • Status changed from closed to reopened

I have debugged the ksoftirqd storms I experienced (described earlier) with the latest fixes.

It turned out, that during such storms ag71xx_poll() is called repeatedly. It does nothing most of the time, however, the TX_STATUS_PS bit is constantly set, so it stays in polling mode.

Further analysis revealed the reason to be that ag71xx_tx_packets() is called with flush set to true. In this case it basically sets the dirty pointer to the current one. However, the DMA descriptors still remain there in a non-empty state and are eventually transmitted. This increments the sent packet counter, which is then never decremented, because the ring is empty. Of course, if new packets arrive, the counter is incremented then decremented, but the initial increment remains there. This causes the TX_STATUS_PS bit to remain in 1. I guess, the counter sometimes overflows and thus the bit can eventually reset, causing the softirq storm to disappear. But it depends on what kind of traffic there is and how fast the kernel processes the packets, so the storm can last indefinitely even.

Now, I think a possible solution could be to mark the descriptors EMPTY when flushing, then issuing a wmb() and decrement the counter as many times as the number of descriptors processed.

However, I am wondering if this flushing is actually needed. The packets seem to be transmitted anyway, and do not seem to cause any trouble.

comment:38 Changed 2 years ago by nbd

please try trunk r48753

comment:39 Changed 2 years ago by ivaradi

I applied the patch to CC and have tested it for 4 days with heavy traffic. So far the softirq storm has not appeared (and neither was the TX_STATUS_PS getting stuck phenomenon detected). So I think the patch solves this issue.

What I detected though was the DHCPv4 problem described above and eth0 (the LAN ports) getting stuck completely (i.e. no traffic at all, though wifi and the WAN port continued to work properly). They can both be resolved by resetting the interface with ifconfig up/down. They also appeared earlier as well, so they are not caused by the patch and are probably unrelated to this issue.

I guess the issue of eth0 getting completely stuck might be a case of DMA getting stuck not detected by the current check. Do you think this could be the case, i.e. is it worth trying to debug with this assumption?

The DHCPv4 problem is quite baffling as other traffic seems to be unaffected, and it also appears quite rarely...

comment:40 Changed 2 years ago by ivaradi

I am running a cron job on the router that detects eth0 getting stuck by pinging an address on my LAN. When ping fails, it logs the values of the registers also examined by ag71xx_check_dma_stuck() (and then resets the eth0). I found the following:

  • RX_SM is always set to 0x00100113
  • FIFO_DEPTH is aways set to 0x02000000
  • TX_SM: the following bits were always set: 0x000e0500, the bits set in the following value were always zero: 0xfff002ff, but otherwise the actual values were different, though some values occured more than once. However, I have a limited number of samples (14), so it probably does not mean anything.

Unfortunately the document I have (about AR9331) does not contain anything about RX_SM or TX_SM, and according to it FIFO_DEPTH should not have a value like that. But perhaps that is a sure sign of the problem. Which bits should be checked to detect this condition?

comment:41 Changed 18 months ago by thorstenhirsch

I have the same problem...

  • with trunk
  • router: TP Link WR841NDv7
  • directly after booting (no load required to cause the problem)
  • the message "eth1: tx timeout" appears repeatedly every 25 seconds
  • eth1 really doesn't work at all (my cable modem is attached to eth1)

comment:42 Changed 18 months ago by anonymous

thorstenhirsch - what type of cable modem and what does dmesg say for your link negotiation? Is flow control enabled?

comment:43 Changed 18 months ago by thorstenhirsch

It's a Motorola SBV5121 cable modem. The link between router and cable modem is 100Mbps full-duplex and I'm pretty sure I saw a line like...

[12345.67890] eth1: link up (100Mbps/Full duplex)

...in the output of dmesg. Sorry, I cannot check it right now. But here's what I tried and what I am absolutely sure of, even though I'm writing this all down from my memory: I called "ifconfig eth1 down" and "ifconfig eth1 up", then I saw in dmesg:

[00000.000000] eth1: link down
[00001.000000] eth1: link up (100Mbps/Full duplex)
[00026.000000] eth1: tx timeout
[00051.000000] eth1: tx timeout
[00076.000000] eth1: tx timeout
[00101.000000] eth1: tx timeout
[...]

comment:44 Changed 16 months ago by thorstenhirsch

The problem persists in trunk (2016/09/03). Here is a more complete output of dmesg:

    [   14.771257] ip_tables: (C) 2000-2006 Netfilter Core Team
    [   14.801897] nf_conntrack version 0.5.0 (434 buckets, 1736 max)
    [   15.055551] xt_time: kernel timezone is -0000
    [   15.229634] PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
    [   15.240603] ath: phy0: Ignoring endianness difference in EEPROM magic bytes.
    [   15.249215] ath: EEPROM regdomain: 0x0
    [   15.249231] ath: EEPROM indicates default country code should be used
    [   15.249244] ath: doing EEPROM country->regdmn map search
    [   15.249271] ath: country maps to regdmn code: 0x3a
    [   15.249287] ath: Country alpha2 being used: US
    [   15.249300] ath: Regpair used: 0x3a
    [   15.263744] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
    [   15.268962] ieee80211 phy0: Atheros AR9287 Rev:2 mem=0xb0000000, irq=40
    [   24.874995] device eth0 entered promiscuous mode
    [   24.892488] IPv6: ADDRCONF(NETDEV_UP): br-lan: link is not ready
    [   24.957456] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [   26.490153] eth0: link up (1000Mbps/Full duplex)
    [   26.494868] br-lan: port 1(eth0) entered forwarding state
    [   26.500401] br-lan: port 1(eth0) entered forwarding state
    [   26.558263] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
    [   27.290629] eth1: link up (100Mbps/Full duplex)
    [   27.302355] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [   28.498892] br-lan: port 1(eth0) entered forwarding state
    [   30.969553] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
    [   31.057605] device wlan0 entered promiscuous mode
    [   33.440343] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
    [   33.446944] br-lan: port 2(wlan0) entered forwarding state
    [   33.452577] br-lan: port 2(wlan0) entered forwarding state
    [   33.510430] IPv6: ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
    [   35.448914] br-lan: port 2(wlan0) entered forwarding state
    [   35.952322] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready
    [   47.068889] ------------[ cut here ]------------
    [   47.073599] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:306 dev_watchdog+0x1dc/0x260()
    [   47.082103] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
    [   47.088673] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_pptp nf_nat_ipv4 nf_nat_amanda nf_conntrack_pptp nf_conntrack_ipv4 nf_conntrack_amanda mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ts_kmp ts_fsm ts_bm nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_nat_h323 nf_nat nf_log_ipv4 nf_log_common nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables compat sch_teql sch_tbf sch_sfq sch_red sch_prio sch_pie sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp cls_basic act_police act_ipt x_tables act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ifb gpio_button_hotplug
    [   47.195712] CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.14 #1
    [   47.201506] Stack : 803e0724 00000000 00000001 80430000 8042b2d8 8042af63 803c1e10 00000000
    [   47.201506]    80493790 804295dc 00000200 00100000 00000100 800a6834 803c744c 80420000
    [   47.201506]    00000003 804295dc 803c5858 80425c2c 00000100 800a4800 00000006 00000000
    [   47.201506]    00000000 801f5300 00000000 00000000 00000000 00000000 00000000 00000000
    [   47.201506]    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    [   47.201506]    ...
    [   47.237538] Call Trace:
    [   47.240025] [<80071a24>] show_stack+0x50/0x84
    [   47.244420] [<80081628>] warn_slowpath_common+0xa0/0xd0
    [   47.249691] [<80081684>] warn_slowpath_fmt+0x2c/0x38
    [   47.254692] [<8028b9c0>] dev_watchdog+0x1dc/0x260
    [   47.259455] [<800af948>] call_timer_fn.isra.5+0x24/0x80
    [   47.264728] [<800afbc4>] run_timer_softirq+0x1ac/0x1f4
    [   47.269917] [<80083a84>] __do_softirq+0x164/0x290
    [   47.274672] [<8006a678>] plat_irq_dispatch+0xbc/0xf4
    [   47.279677] 
    [   47.281185] ---[ end trace 6d3f2a95e532138a ]---
    [   47.285830] eth1: tx timeout
    [   53.958926] random: nonblocking pool is initialized
    [   72.068878] eth1: tx timeout
    [  102.068881] eth1: tx timeout
    [  132.068888] eth1: tx timeout
    [  162.068893] eth1: tx timeout

As you can see, the link status is "link up (100Mbps/Full duplex)".

By the way - the problem is resolved in LEDE's trunk.

Last edited 16 months ago by thorstenhirsch (previous) (diff)

Add Comment

Modify Ticket

Action
as reopened .
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.