Opened 5 years ago

Last modified 3 years ago

#13072 reopened defect

ag71xx WNDR3700 slow vlan routing

Reported by: severn@… Owned by: developers
Priority: normal Milestone: Barrier Breaker 14.07
Component: kernel Version: Attitude Adjustment 12.09 Beta
Keywords: Cc:

Description

The issue I'm having is that either the kernel or the ag71xx driver seems to be slow at routing packets in and back out the same device.

Test components:
eth0 - rtl8366s
eth1 - phy
managed HP switch
server (connected to HP switch on an untagged vlan 2 port)
client (connected to HP switch on an untagged vlan 1 port)
single iptables rule: iptables -t nat -I POSTROUTING -j MASQUERADE

Speed was tested by wget'ing a large file to the client from the server.

Test setup 1:
eth0.1 - rtl8366s port 0 tagged vlan 1 - hp switch
eth1.2 - hp switch tagged vlan 2
speeds normal (50MB/s)

Test setup 2:
eth0.1 - rtl8366s port 0 tagged vlan 1 - hp switch
eth0.2 - rtl8366s port 0 tagged vlan 2 - hp switch
speeds poor (15MB/s)

Test setup 3:
eth0.1 - rtl8366s port 0 tagged vlan 1 - hp switch
eth0.2 - rtl8366s port 1 tagged vlan 2 /w faked MAC address - hp switch
speeds poor (15MB/s)

Test setup 4:
eth0.1 - rtl8366s port 0 tagged vlan 1 - hp switch
eth0.2 - rtl8366s port 1 to another router, then back to the hp switch on vlan 2
speeds poor (15MB/s)

Test setup 5:
eth1.1 - tagged vlan 1 - hp switch
eth1.2 - tagged vlan 2 - hp switch
speeds poor (15MB/s)

Test setup 6:
Replace the WNDR3700 with an atom with a single gigabit NIC, tagged vlan1 and vlan2
speeds normal (50MB/s)

Test setup 7:
Same as test setup 1, except this time wget in both directions (from client to server, and server to client, at the same time)
speeds normal for both transfers (50MB/s)

So it seems to me that if traffic comes in and leaves on the same NIC, the performance goes to crap. I've tried playing around with the tx/rx ring sizes, resulting in either reboots (rx ring 1) or no performance difference.

All NICs are gigabit, all cables cat6.

I also tried adding a router between the server and the HP switch, adding 50ms of latency. Oddly enough, this causes the speeds in the poor cases to drop to 500KB/s, while the speeds in the good cases remain the same.

When the speeds are poor, packet captures on the server and the client show that the client misses some packets, so the server is forced to retransmit.
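The description reports wget rates in MB/s, while later comments in this ticket quote Mbit/s. A minimal conversion sketch for comparing the two follows. It assumes decimal units (1 MB = 10^6 bytes, 1 Mbit = 10^6 bits); wget may actually use binary prefixes, so treat the results as approximate. The function name is made up for illustration.

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert MB/s to Mbit/s.

    Assumption: decimal units on both sides, so the factor is just 8 bits/byte.
    """
    return mbytes_per_s * 8


if __name__ == "__main__":
    # Rates taken from the test setups in the description.
    for label, rate in [("normal (setup 1)", 50), ("poor (setup 2)", 15)]:
        print(f"{label}: {rate} MB/s is about {mbytes_to_mbits(rate):.0f} Mbit/s")
```

Under that assumption the "normal" case is roughly 400 Mbit/s and the "poor" case roughly 120 Mbit/s.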

Attachments (2)

ag71xx_dma.patch (6.4 KB) - added by severn@… 3 years ago.
256 byte DMA/descriptor
a1.patch (1.8 KB) - added by severn 3 years ago.

Change History (121)

comment:1 Changed 5 years ago by severn@…

I just saw https://forum.openwrt.org/viewtopic.php?id=32420

I am not using jumbo frames, and based on the "normal" speeds I saw, it looks like it is capable of routing through the SoC, just doesn't seem to like it when both endpoints are over the same NIC.

comment:2 Changed 5 years ago by nbd

please try current trunk

comment:3 Changed 5 years ago by severn@…

I've re-run two of the test setups with trunk snapshot r36507.

The server and client machines have changed since I originally ran the tests, which may account for the drop between the original run and the re-run, but the speed difference between setup 1 and setup 2 still exists.

Re-ran test setup 1: 34 MB/s

Re-ran test setup 2: 11 MB/s

comment:4 Changed 5 years ago by severn@…

Re-ran test setup 5 with the stock Netgear kernel + ag7100 driver, and it gets me normal speeds.

comment:5 Changed 5 years ago by anonymous

Same issue here!

Backfire 10.03.1 on wr1043nd..

Is there any update?

comment:6 Changed 4 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted

comment:7 Changed 3 years ago by anonymous

Same issue on wndr3800. I was perplexed because my buffalo g300nh with much less cpu speed does better: https://forum.openwrt.org/viewtopic.php?id=52215

comment:8 Changed 3 years ago by dustbowl@…

This still seems to exist on current trunk, on both the wndr3700 and wndr3800. It would be nice to hear something about it from the devs. Thank you.

comment:9 Changed 3 years ago by severn@…

Years later! =)

I've been messing around with the driver (in current trunk). I was able to get about 300 Mbps capping the CPU in test setup 2 by changing the TX DMA size to 256 bytes per descriptor. The code's really hacky right now, but I can put up some binaries (and/or the patch) somewhere if people are interested.

comment:10 Changed 3 years ago by nbd

Please post your patch here.

Changed 3 years ago by severn@…

256 byte DMA/descriptor

comment:11 follow-up: Changed 3 years ago by severn@…

I'm pretty sure it's not "clean" =)

Those are the changes I've been testing (vs git master-ish).

  • cap FIFO size/DMA per descriptor to 256 bytes
  • split tx packets into multiple descriptors
  • ack descriptors only when entire packet is sent - I get tx hangs if I don't do this

I only have a WNDR3700v2, so that's all I've tested with.
Let me know if you have any questions.
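The second and third bullets above can be modelled roughly as follows. This is a sketch, not the driver code: the function names, the tuple layout, and the idea of a per-descriptor "done" flag are assumptions made for illustration. Only the 256-byte cap comes from the comment above.

```python
TX_DESC_SIZE = 256  # bytes per TX descriptor, as described in the comment above


def split_tx(length: int, desc_size: int = TX_DESC_SIZE):
    """Split one packet of `length` bytes into descriptor-sized chunks.

    Returns a list of (offset, chunk_len, is_last) tuples (hypothetical layout).
    """
    if length <= 0:
        raise ValueError("packet length must be positive")
    chunks = []
    offset = 0
    while offset < length:
        chunk = min(desc_size, length - offset)
        chunks.append((offset, chunk, offset + chunk == length))
        offset += chunk
    return chunks


def can_ack(done_flags) -> bool:
    """Acknowledge a packet only once every one of its descriptors is done."""
    return all(done_flags)


if __name__ == "__main__":
    for off, size, last in split_tx(600):
        print(f"offset={off:4d} len={size:3d} last={last}")
    print("ack with one chunk pending:", can_ack([True, True, False]))
```

In this model a 600-byte packet occupies three descriptors (256 + 256 + 88 bytes), and it is not acknowledged while any of them is still pending.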

comment:12 Changed 3 years ago by anonymous

I'm not an expert, but @severn, if you think you've found a solution, please send it to the openwrt mailing list; patches added here are mostly ignored by the devs. It would also be nice if you could try to contact the maintainer of the driver. I have tried, but he never mailed back.

comment:13 Changed 3 years ago by anonymous

How do I apply this patch for testing?

comment:14 Changed 3 years ago by anonymous

I submitted this before and it so far seems to help but I am testing double natted with another router in production.
/ticket/17569.html

Any idea how I could apply this patch to the AA drivers (to use gargoyle based on AA?)

comment:15 Changed 3 years ago by anonymous

How do I patch trunk myself? I don't know how to do this. I know how to compile, but not how to apply these patches. Is it compatible with the wndr3800 too?

comment:16 Changed 3 years ago by anonymous

Do I just copy the file ag71xx_dma.patch into "trunk/target/linux/ar71xx/patches-3.10" and compile? Why are there numbers in front of the names of the patches already in that directory? Don't I have to change the name of ag71xx_dma.patch? Thank you

comment:17 Changed 3 years ago by robnitro@…

What I did:
put patch in openwrt folder
terminal go to openwrt folder.
cd trunk
patch -p1 -i ../ag71xx_dma.patch
should say 2 files patched
then you run make or whatever you use to build openwrt from the trunk dir.

comment:18 Changed 3 years ago by anonymous

This patch broke my three wndr3800:

Wan port is resetting every few minutes:

Thu Aug 28 16:25:11 2014 daemon.notice odhcp6c[1383]: carrier => 0 event on eth1
Thu Aug 28 16:25:11 2014 daemon.notice odhcp6c[1383]: (re)starting transaction on eth1
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Network device 'eth1' link is down
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Network alias 'eth1' link is down
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Interface 'wan6' has link connectivity loss
Thu Aug 28 16:25:11 2014 daemon.notice odhcp6c[1383]: Starting SOLICIT transaction (timeout 4294967295s, max rc 0)
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Network device 'eth1' link is up
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Network alias 'eth1' link is up
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Interface 'wan6' has link connectivity
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Interface 'wan6' is setting up now
Thu Aug 28 16:25:11 2014 daemon.notice netifd: Interface 'wan' has link connectivity
Thu Aug 28 16:25:19 2014 daemon.notice netifd: Network device 'tun1' link is down
Thu Aug 28 16:25:19 2014 daemon.notice netifd: Interface 'vpn2' has link connectivity loss
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Network device 'eth1' link is down
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Network alias 'eth1' link is down
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Interface 'wan6' has link connectivity loss
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Network device 'eth1' link is up
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Network alias 'eth1' link is up
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Interface 'wan6' has link connectivity
Thu Aug 28 16:25:22 2014 daemon.notice netifd: Interface 'wan' has link connectivity
Thu Aug 28 16:25:38 2014 daemon.notice netifd: Network device 'eth1' link is down
Thu Aug 28 16:25:38 2014 daemon.notice netifd: Network alias 'eth1' link is down
Thu Aug 28 16:25:38 2014 daemon.notice netifd: Interface 'wan6' has link connectivity loss
Thu Aug 28 16:25:38 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss

and

Thu Aug 28 15:45:02 2014 daemon.notice netifd: Interface 'wan6' is enabled
Thu Aug 28 15:45:02 2014 daemon.notice netifd: Network alias 'eth1' link is up
Thu Aug 28 15:45:02 2014 daemon.notice netifd: Interface 'wan6' has link connectivity
Thu Aug 28 15:45:02 2014 daemon.notice netifd: Interface 'wan6' is setting up now
Thu Aug 28 15:45:02 2014 daemon.notice netifd: Interface 'wan' is now up
Thu Aug 28 15:45:02 2014 daemon.warn odhcpd[1120]: DHCPV6 SOLICIT IA_NA from ...
Thu Aug 28 15:45:19 2014 daemon.notice netifd: Network device 'tun1' link is up
Thu Aug 28 15:45:19 2014 daemon.notice netifd: Interface 'vpn2' has link connectivity
Thu Aug 28 15:45:48 2014 daemon.notice netifd: Network device 'tun0' link is down
Thu Aug 28 15:45:48 2014 daemon.notice netifd: Interface 'vpn1' has link connectivity loss
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Network device 'eth1' link is down
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Network alias 'eth1' link is down
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Interface 'wan6' has link connectivity loss
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Interface 'wan6' is now down
Thu Aug 28 15:45:52 2014 daemon.notice netifd: Interface 'wan6' is disabled
Thu Aug 28 15:45:52 2014 daemon.notice netifd: wan (19506): Received SIGTERM
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Network device 'eth1' link is up
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan' has link connectivity
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan' is setting up now
Thu Aug 28 15:45:53 2014 daemon.notice netifd: wan (20045): udhcpc (v1.22.1) started
Thu Aug 28 15:45:53 2014 daemon.notice netifd: wan (20045): Sending discover...
Thu Aug 28 15:45:53 2014 daemon.notice netifd: wan (20045): Sending select for ......
Thu Aug 28 15:45:53 2014 daemon.notice netifd: wan (20045): Lease of ... obtained, lease time 2864
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan6' is enabled
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Network alias 'eth1' link is up
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan6' has link connectivity
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan6' is setting up now
Thu Aug 28 15:45:53 2014 daemon.notice netifd: Interface 'wan' is now up
Thu Aug 28 15:45:58 2014 daemon.notice netifd: Network device 'tun0' link is up

comment:19 Changed 3 years ago by robnitro@…

Weird, I have no issues. I built from hnyman's configs on both latest BB 42313, and a few versions back trunk. I'll be testing it on a friend's ag300h soon, which also has the issue UNLESS i force 100 full duplex with ethtool -s eth1 speed 100 duplex full autoneg on (some reason autoneg off causes issues).

Oh and on my wndr3800, when I set ethtool -s eth1 speed 100 duplex full, it would take but the speeds would be horribly slow, even just a single speed test. Now with the patch, no issues.

Are you trying to run 3.14 kernel?
Double check you have the wan port plugged in well. My 3800 has one that doesn't click too solidly- I end up wrapping the wire around the router to prevent issues.

I still can't get it to work on AA 42171 gargoyle. Doh.

comment:20 Changed 3 years ago by anonymous

Reflashed without this patch and it's working again. I don't know why, but this patch broke the wan interface on two of my wndr3800, which kept resetting/losing wan connectivity every few minutes.

comment:21 Changed 3 years ago by robnitro@…

Found a bug with this driver on wndr3800, using BB r42314.
ethtool shows a high TX buffer, but I think it's not real,
but as a temp fix I can run ethtool -G eth1 tx 32 to fix.
root@OpenWrt:~# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 32
Current hardware settings:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 1024

root@OpenWrt:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 32
Current hardware settings:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 1024
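The inconsistency in the output above (current TX of 1024 against a pre-set maximum of 32) can be checked mechanically. The following is a minimal sketch that parses text in the `ethtool -g` layout shown above and reports any "current" value that exceeds its maximum. The function names are made up for illustration, and the sample is copied from this comment rather than read from a live device.

```python
SAMPLE = """\
Ring parameters for eth1:
Pre-set maximums:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 32
Current hardware settings:
RX: 128
RX Mini: 0
RX Jumbo: 0
TX: 1024
"""


def parse_ring_params(text: str) -> dict:
    """Parse `ethtool -g` style text into {'max': {...}, 'current': {...}}."""
    sections = {"max": {}, "current": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Pre-set maximums"):
            section = "max"
        elif line.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():  # skip anything that is not a plain count
                sections[section][key.strip()] = int(value)
    return sections


def over_limit(params: dict) -> dict:
    """Return {name: (current, maximum)} for settings above their maximum."""
    return {
        name: (cur, params["max"][name])
        for name, cur in params["current"].items()
        if name in params["max"] and cur > params["max"][name]
    }


if __name__ == "__main__":
    print(over_limit(parse_ring_params(SAMPLE)))
```

On the sample, only the TX entry is reported, matching what was observed by eye.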

comment:22 Changed 3 years ago by anonymous

Does this have something to do with what I wrote about? Why is it stable for you while two out of three of my wndr3800 showed weird connectivity issues? I didn't notice it at first, only when one of the wndr3800 users called to tell me he had disconnects every few minutes on his pc (behind the router). I connected to it and didn't see anything wrong, but then noticed the ssh connection was suddenly gone 5 minutes later. And then I noticed the wan port seemed to toggle in a random, broken way: it always seemed to come alive shortly, but a few times the device was completely dead. Now both of them run without any problem since I flashed without this patch again.

comment:23 Changed 3 years ago by anonymous

I can confirm the wan disconnects on my wndr3700 (v1) as well with this patch. They seem to be random, and logread tells me it detects a link down and then back up at 100mbit (it should do gigabit to the modem but it won't, even without this patch). The performance goes up to 220mbit for vlan routing though; without the patch I can do max 120mbit.

comment:24 Changed 3 years ago by anonymous

What does the author of this patch say about this? It's obvious now that I am not the only one who has this broken wan toggle behavior with this patch. And where's the official ag71xx maintainer for the driver :'( Maybe he can help with the problem.

comment:25 in reply to: ↑ 11 Changed 3 years ago by robnitro@…

So with my ethtool buffer issue:
The patch may have an issue that explains why ethtool -G shows 1024 for tx:

Basically tx ring size default is 128* (2048/256) = 1024.
Why is it defined that way instead of the previous default of 32, which the patch removes?

-#define AG71XX_TX_RING_SIZE_DEFAULT	32
+#define AG71XX_FIFO_LEN		2048
+
+/* Number of bytes to transfer per TX descriptor */
+#define AG71XX_TX_DESC_SIZE 256
+
+#define AG71XX_DESCS_PER_FIFO (AG71XX_FIFO_LEN/AG71XX_TX_DESC_SIZE)
+
+#define AG71XX_TX_RING_SIZE_DEFAULT	128*AG71XX_DESCS_PER_FIFO
 #define AG71XX_RX_RING_SIZE_DEFAULT	128
 
 #define AG71XX_TX_RING_SIZE_MAX		32
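The arithmetic behind the 1024 figure can be reproduced from the defines quoted above. This is only a Python model of the C preprocessor values; the constants are copied from the diff, and integer division is used because that is what the C expression performs on these values.

```python
# Values copied from the diff quoted above.
AG71XX_FIFO_LEN = 2048
AG71XX_TX_DESC_SIZE = 256
AG71XX_TX_RING_SIZE_MAX = 32

# Derived values, mirroring the macro definitions.
AG71XX_DESCS_PER_FIFO = AG71XX_FIFO_LEN // AG71XX_TX_DESC_SIZE
AG71XX_TX_RING_SIZE_DEFAULT = 128 * AG71XX_DESCS_PER_FIFO

if __name__ == "__main__":
    print("descriptors per FIFO:", AG71XX_DESCS_PER_FIFO)
    print("default TX ring size:", AG71XX_TX_RING_SIZE_DEFAULT)
    print(
        "default exceeds advertised max:",
        AG71XX_TX_RING_SIZE_DEFAULT > AG71XX_TX_RING_SIZE_MAX,
    )
```

So the new default is larger than the unchanged AG71XX_TX_RING_SIZE_MAX, which would be consistent with ethtool reporting a current TX value above the pre-set maximum.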

comment:26 Changed 3 years ago by severn

Original author of the patch here. Sorry, been away for a bit.

Can people who have this problem:

  1. Just to confirm: does it only happen on the WAN interface (eth1)? Can you test and see if eth0 (switch interface) is fine?
  2. Can you post dmesg output immediately after the WAN interface disappears and comes back?

I increased the TX descriptors because in the VLAN routing scenario, you have 1 packet coming in (uses 1 descriptor) -- and if this was a full FIFO packet, it can use up to 8 descriptors when it's sent, since each TX descriptor is now 256 bytes instead of 2048 - so you'd be queuing/dropping packets that had been read because you wouldn't be able to send them.

i.e. full FIFO is 2048 bytes, now split into 256 byte chunks = 8 descriptors. Of course this is worst case.
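The worst-case reasoning above can be sketched as a ceiling division. This is a simplified model, not driver code: the function names are made up, 1514 bytes is assumed as a typical full-size Ethernet frame (1500-byte MTU plus 14-byte header), and the ring-capacity estimate ignores everything except descriptor count.

```python
def descs_for_packet(length: int, desc_size: int = 256) -> int:
    """Number of TX descriptors needed for one packet (ceiling division)."""
    if length <= 0:
        raise ValueError("packet length must be positive")
    return -(-length // desc_size)


def packets_in_ring(ring_size: int, length: int, desc_size: int = 256) -> int:
    """How many packets of `length` bytes fit in a ring of `ring_size` descriptors."""
    return ring_size // descs_for_packet(length, desc_size)


if __name__ == "__main__":
    for length in (64, 1514, 2048):
        print(f"{length:5d} bytes -> {descs_for_packet(length)} descriptors")
    for ring in (32, 1024):
        print(f"ring of {ring:4d}: {packets_in_ring(ring, 2048)} full-FIFO packets")
```

In this model a ring of 1024 descriptors holds 128 worst-case packets, the same count as the RX ring size of 128, while a ring of 32 holds only 4.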

Does changing the FIFO size to 32 help with the disconnects? Or do anything at all?

I'll test my WAN interface as well later tonight - I've only been running it routing VLANs on the switch side.

Re: AA - not sure if this will work, but you can try patching the BB driver and putting it in the AA tree (replacing its driver).

@anons and @robnitro - thanks, your feedback and help testing most appreciated.

comment:27 Changed 3 years ago by nbd

fyi, i'm currently working on a cleaned up version of that patch

comment:28 Changed 3 years ago by anonymous

@severn, I can confirm it was only the WAN interface (eth1) and not the lan side (eth0) but I don't have any logs at the moment as I reverted the patches for now. I might be able to test some more this weekend

comment:29 Changed 3 years ago by robnitro@…

Hey thank you! I learned how to compile because of this stupid bug... and knowing from testing that it wasn't due to ACK priority or latency of the fiber connection.

No, not fifo for 32, but the tx ring setting- which sets what ethtool sees. However, I changed it to 4*AG71XX_DESCS_PER_FIFO and the wan started to reboot.. even the lan was bugging out where I'd lose ability to ping or ssh! I went on wireless and saw dmesg say eth0 tx timeout, then eth0 link down, then pll_reg , link up.

comment:30 Changed 3 years ago by robnitro@…

Oh and this is not just wndr routers, I can say I have found the same issue on buffalo ag300h, which was fixed by setting eth1 to 100 fd (this on the wndr would make it persistently SLOW- so it's not a consistent thing from model to model that has these similar chips).

Meanwhile an older g300nh, did not have this issue, but supposedly uses the same switch???

comment:31 Changed 3 years ago by nbd

@robnitro: Thanks for writing that patch, here's my rewritten version: http://nbd.name/ag71xx-dma.patch

I'm currently running it on a WNDR3800. While it does seem to work just fine, I'm not seeing any performance improvement in either of the two patches compared to the unpatched version.

comment:32 Changed 3 years ago by nbd

Did some more testing: eth0 -> eth1 routing performance is slightly worse with this patch (340 mbit/s instead of 380 mbit/s), but eth0.1 -> eth0.2 routing performance is much better (~340 mbit/s instead of 170 mbit/s).

comment:33 Changed 3 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

fix committed in r42328

comment:34 Changed 3 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

Questions ... @nbd.

1.) Why was this committed while it breaks wan? Was this already fixed? I can't see any comment right now confirming that.

2.) If you say it doesn't work and may even break routing more, again, why did you commit it?

3.) Why set to fixed?

comment:35 Changed 3 years ago by nbd

First of all, looks like I copy&pasted the wrong name when mentioning the original author of the patch, it's severn, not robnitro.

As for the questions above:
1) I extensively tested the normal wan port on my device with my rewrite of the patch and it works just fine

2) I later revised that. It slightly reduces unidirectional routing throughput, but greatly enhances bidirectional routing throughput (especially when going in and out of the same device).

3) It works just fine in my tests.

comment:36 Changed 3 years ago by robnitro@…

Now I have problems with the official patch!
Performance is fine, but then this happens when I run internet speed tests.
Tried different cables, even forced 100 full duplex on eth1, same issue.

[ 1424.010000] eth1: tx timeout
[ 1424.010000] eth1: link down
[ 1424.800000] ar71xx: pll_reg 0xb8050014: 0x11110000
[ 1424.800000] eth1: link up (1000Mbps/Full duplex)
[ 1435.010000] eth1: tx timeout
[ 1435.010000] eth1: link down
[ 1435.800000] ar71xx: pll_reg 0xb8050014: 0x11110000
[ 1435.800000] eth1: link up (1000Mbps/Full duplex)
[ 1446.010000] eth1: tx timeout
[ 1446.010000] eth1: link down
[ 1446.800000] ar71xx: pll_reg 0xb8050014: 0x11110000
[ 1446.800000] eth1: link up (1000Mbps/Full duplex)
[ 1457.010000] eth1: tx timeout
[ 1457.010000] eth1: link down
[ 1457.800000] ar71xx: pll_reg 0xb8050014: 0x11110000
[ 1457.800000] eth1: link up (1000Mbps/Full duplex)
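The timeouts in the log above recur at a fixed spacing, which can be confirmed by extracting the timestamps. A minimal sketch follows; the function name is made up, and the sample is an excerpt of the dmesg lines above.

```python
import re

DMESG = """\
[ 1424.010000] eth1: tx timeout
[ 1424.010000] eth1: link down
[ 1424.800000] eth1: link up (1000Mbps/Full duplex)
[ 1435.010000] eth1: tx timeout
[ 1435.010000] eth1: link down
[ 1446.010000] eth1: tx timeout
[ 1457.010000] eth1: tx timeout
"""


def timeout_intervals(log: str, dev: str = "eth1") -> list:
    """Return the gaps in seconds between consecutive 'tx timeout' lines for `dev`."""
    pattern = re.compile(
        r"^\[\s*(\d+\.\d+)\]\s+" + re.escape(dev) + r": tx timeout$",
        re.MULTILINE,
    )
    stamps = [float(m.group(1)) for m in pattern.finditer(log)]
    # Round to hide floating-point noise in the subtraction.
    return [round(later - earlier, 2) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(timeout_intervals(DMESG))
```

For this excerpt every gap comes out as 11 seconds. The log alone does not say what sets that cadence.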

comment:37 Changed 3 years ago by robnitro@…

oh forgot to mention, used both 3.14 and 3.10 kernels and arokh and hnyman's build environments

comment:38 Changed 3 years ago by hnyman

I already posted to #14035, as the stack trace there matches my error, but I suspect that changes made by r42328 may be the reason for errors that have now surfaced with my wndr3700v2. I don't remember seeing errors with earlier builds upto 42886, and r42328 is roughly the only significant driver change since that.

[   32.930000] br-lan: port 2(wlan0) entered forwarding state
[  105.050000] ------------[ cut here ]------------
[  105.050000] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1e8/0x26c()
[  105.060000] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
[  105.060000] Modules linked in: ifb ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pptp pppox ppp_mppe ppp_generic nf_nat_ipv4 nf_conntrack_ipv4 mac80211 ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usbserial ts_kmp ts_fsm ts_bm slhc nf_nat_rtsp nf_nat_irc nf_nat_ftp nf_nat nf_defrag_ipv4 nf_conntrack_rtsp nf_conntrack_irc nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_ah ipt_REJECT ipt_ECN ip_tables crc_ccitt compat act_connmark act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress leds_wndr3700_usb ledtrig_usbdev ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 msdos ip_gre gre sit tunnel4 ip_tunnel tun vfat fat ntfs hfsplus cifs nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 ipv6 sha256_generic sha1_generic md5 md4 hmac ecb des_generic arc4 crypto_blkcipher usb_storage ohci_hcd ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 crc16 jbd2 mbcache usbcore nls_base usb_common crypto_hash [last unloaded: ifb]
[  105.180000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.49 #1
[  105.190000] Stack : 00000000 00000000 00000000 00000000 8039bf1e 00000032 8031ef50 000000d9
[  105.190000] 	  802d7cc4 8031f37b 00000000 803939dc 8031ef50 000000d9 8039b874 00000001
[  105.190000] 	  00000004 8027e310 00000003 801e332c 802f04bc 000000d9 802d9354 8030fc74
[  105.190000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  105.190000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 8030fc00
[  105.190000] 	  ...
[  105.230000] Call Trace:
[  105.230000] [<80223fcc>] show_stack+0x48/0x70
[  105.230000] [<8028d8d8>] warn_slowpath_common+0x78/0xa8
[  105.240000] [<8028d934>] warn_slowpath_fmt+0x2c/0x38
[  105.240000] [<80100fa8>] dev_watchdog+0x1e8/0x26c
[  105.250000] [<800e0dc4>] call_timer_fn.isra.38+0x24/0x84
[  105.250000] [<8020cf70>] run_timer_softirq+0x17c/0x1bc
[  105.260000] [<8008bc78>] __do_softirq+0xd0/0x1bc
[  105.260000] [<80115b2c>] do_softirq+0x48/0x68
[  105.270000] [<80173488>] irq_exit+0x54/0x70
[  105.270000] [<80060830>] ret_from_irq+0x0/0x4
[  105.280000] [<80060a80>] __r4k_wait+0x20/0x40
[  105.280000] [<800f0a14>] cpu_startup_entry+0xa4/0x104
[  105.280000] [<8032b910>] start_kernel+0x38c/0x3a4
[  105.290000] 
[  105.290000] ---[ end trace 80ba0d8c257057ce ]---
[  105.300000] eth0: tx timeout
[  105.300000] eth0: link down
[  105.310000] ar71xx: pll_reg 0xb8050010: 0x11110000

comment:39 Changed 3 years ago by hnyman

Reverting r42328 makes my 42332 build work normally again.

comment:40 Changed 3 years ago by nbd

temporarily disabled the change in r42333 until a fix is found

comment:41 Changed 3 years ago by robnitro@…

What is strange is that I recompiled with the old patch, and now it fails!
I hope it can be rectified... it was a good fix while it lasted. I wish I saved the bin files I made before :)

comment:42 Changed 3 years ago by anonymous

It never worked, not with the first patch, not with the rewritten one. The wan port was always unstable, as I kept reporting. There's also some kind of weird bug: if you connect "too fast" after a reboot via ssh, the router crashes and doesn't respond anymore. I also had some dhcp problems with latest trunk...

comment:43 Changed 3 years ago by severn

I'm at a loss as to why it apparently works for nbd and me all the time, only somewhat works for robnitro, and doesn't work for anon...

I've tested with r42328 and my original patch - same behaviour:

  • VLAN routing (in eth0.x out eth0.y), (in eth1.x out eth1.y)
  • "normal" routing in eth0.x out eth1, in eth1 out eth0.x
  • with DHCP, over PPPoE, PPPoE over VLAN (on eth0.x or eth1)
  • saturating the CPU with iperf TCP mode or UDP mode, simultaneous bidirectional or just normal 1 direction upload
  • speed test sites, regular browsing, etc.

...for the last few hours and I can't seem to reproduce the issues you guys see.

In all cases, I had bridges on the eth0.x with a wifi, no bridges on the eth1.x side, and I'm routing between interfaces (not bridging).

I have a laptop with gige on one port, and a managed switch on the other port (either WAN/eth1 port or another vlan on a switch port) - tried setting one or both to auto-negotiate 100meg full, or auto-negotiate gig full, doesn't seem to make a difference.

  1. Can you guys post full 'dmesg' with or without patch?
  2. Post 'ifconfig' before and after the tx timeout error?
  3. Can you tell me anything about your network setup (speed, what's on each port, if you're auto-negotiating or forced speeds, etc.) and what kind of traffic you're passing that causes the error to show up?
  4. Can you guys try using r42328 but change ag71xx.h AG71XX_TX_RING_SPLIT to some other numbers, e.g. 128, 512, 1024?
  5. Can you try commenting out lines 707 and 708 from r42328 and see if it makes a difference?

This is my dmesg for comparison -

[    0.000000] MyLoader: sysp=aaaa5554, boardp=aaaa5554, parts=aaaa5554
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU revision is: 00019374 (MIPS 24Kc)
[    0.000000] SoC: Atheros AR7161 rev 2
[    0.000000] Clocks: CPU:680.000MHz, DDR:340.000MHz, AHB:170.000MHz, Ref:40.000MHz
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 04000000 @ 00000000 (usable)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x00000000-0x03ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00000000-0x03ffffff]
[    0.000000] On node 0 totalpages: 16384
[    0.000000] free_area_init_node: node 0, pgdat 803109b0, node_mem_map 81000000
[    0.000000]   Normal zone: 128 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 16384 pages, LIFO batch:3
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0 
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
[    0.000000] Kernel command line:  board=WNDR3700 console=ttyS0,115200 mtdparts=spi0.0:320k(u-boot)ro,128k(u-boot-env)ro,15872k(firmware),64k(art)ro rootfstype=squashfs,jffs2 noinitrd
[    0.000000] PID hash table entries: 256 (order: -2, 1024 bytes)
[    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Writing ErrCtl register=00000000
[    0.000000] Readback ErrCtl register=00000000
[    0.000000] Memory: 60888k/65536k available (2239k kernel code, 4648k reserved, 603k data, 228k init, 0k highmem)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS:51
[    0.060000] Calibrating delay loop... 452.19 BogoMIPS (lpj=2260992)
[    0.060000] pid_max: default: 32768 minimum: 301
[    0.060000] Mount-cache hash table entries: 512
[    0.070000] NET: Registered protocol family 16
[    0.070000] MIPS: machine is NETGEAR WNDR3700/WNDR3800/WNDRMAC
[    2.680000] registering PCI controller with io_map_base unset
[    2.690000] bio: create slab <bio-0> at 0
[    2.690000] PCI host bridge to bus 0000:00
[    2.700000] pci_bus 0000:00: root bus resource [mem 0x10000000-0x16ffffff]
[    2.700000] pci_bus 0000:00: root bus resource [io  0x0000]
[    2.710000] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[    2.710000] pci 0000:00:11.0: [168c:ff1d] type 00 class 0x020000
[    2.710000] pci 0000:00:11.0: fixup device configuration
[    2.720000] pci 0000:00:11.0: reg 10: [mem 0x00000000-0x0000ffff]
[    2.720000] pci 0000:00:11.0: PME# supported from D0 D3hot
[    2.720000] pci 0000:00:12.0: [168c:ff1d] type 00 class 0x020000
[    2.720000] pci 0000:00:12.0: fixup device configuration
[    2.720000] pci 0000:00:12.0: reg 10: [mem 0x00000000-0x0000ffff]
[    2.720000] pci 0000:00:12.0: PME# supported from D0 D3hot
[    2.720000] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
[    2.720000] pci 0000:00:11.0: BAR 0: assigned [mem 0x10000000-0x1000ffff]
[    2.730000] pci 0000:00:12.0: BAR 0: assigned [mem 0x10010000-0x1001ffff]
[    2.730000] pci 0000:00:11.0: using irq 40 for pin 1
[    2.740000] pci 0000:00:12.0: using irq 41 for pin 1
[    2.740000] Switching to clocksource MIPS
[    2.750000] NET: Registered protocol family 2
[    2.750000] TCP established hash table entries: 512 (order: 0, 4096 bytes)
[    2.750000] TCP bind hash table entries: 512 (order: -1, 2048 bytes)
[    2.760000] TCP: Hash tables configured (established 512 bind 512)
[    2.760000] TCP: reno registered
[    2.760000] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    2.770000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    2.780000] NET: Registered protocol family 1
[    2.780000] PCI: CLS 0 bytes, default 32
[    2.790000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    2.800000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    2.810000] msgmni has been set to 118
[    2.810000] io scheduler noop registered
[    2.820000] io scheduler deadline registered (default)
[    2.820000] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    2.850000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11) is a 16550A
[    2.860000] console [ttyS0] enabled, bootconsole disabled
[    2.870000] ath79-spi ath79-spi: master is unqueued, this is deprecated
[    2.880000] m25p80 spi0.0: found mx25l12805d, expected m25p80
[    2.880000] m25p80 spi0.0: mx25l12805d (16384 Kbytes)
[    2.890000] 4 cmdlinepart partitions found on MTD device spi0.0
[    2.890000] Creating 4 MTD partitions on "spi0.0":
[    2.900000] 0x000000000000-0x000000050000 : "u-boot"
[    2.910000] 0x000000050000-0x000000070000 : "u-boot-env"
[    2.910000] 0x000000070000-0x000000ff0000 : "firmware"
[    2.920000] 2 netgear-fw partitions found on MTD device firmware
[    2.920000] 0x000000070000-0x000000175440 : "kernel"
[    2.930000] mtd: partition "kernel" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    2.940000] 0x000000175440-0x000000ff0000 : "rootfs"
[    2.950000] mtd: partition "rootfs" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    2.960000] mtd: device 4 (rootfs) set to be root filesystem
[    2.970000] 1 squashfs-split partitions found on MTD device rootfs
[    2.970000] 0x0000003a0000-0x000000ff0000 : "rootfs_data"
[    2.980000] 0x000000ff0000-0x000001000000 : "art"
[    2.990000] Realtek RTL8366S ethernet switch driver version 0.2.2
[    2.990000] rtl8366s rtl8366s: using GPIO pins 5 (SDA) and 7 (SCK)
[    3.000000] rtl8366s rtl8366s: RTL8366 ver. 1 chip found
[    3.040000] libphy: rtl8366s: probed
[    3.350000] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode:RGMII
[    3.650000] ag71xx ag71xx.1: connected to PHY at rtl8366s:04 [uid=001cc960, driver=Generic PHY]
[    3.660000] eth1: Atheros AG71xx at 0xba000000, irq 5, mode:RGMII
[    3.670000] TCP: cubic registered
[    3.670000] NET: Registered protocol family 17
[    3.680000] 8021q: 802.1Q VLAN Support v1.8
[    3.690000] VFS: Mounted root (squashfs filesystem) readonly on device 31:4.
[    3.700000] Freeing unused kernel memory: 228K (80327000 - 80360000)
[    5.690000] usbcore: registered new interface driver usbfs
[    5.690000] usbcore: registered new interface driver hub
[    5.700000] usbcore: registered new device driver usb
[    5.710000] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    5.710000] ehci-platform: EHCI generic platform driver
[    5.720000] ehci-platform ehci-platform: EHCI Host Controller
[    5.720000] ehci-platform ehci-platform: new USB bus registered, assigned bus number 1
[    5.730000] ehci-platform ehci-platform: irq 3, io mem 0x1b000000
[    5.760000] ehci-platform ehci-platform: USB 2.0 started, EHCI 1.00
[    5.760000] hub 1-0:1.0: USB hub found
[    5.770000] hub 1-0:1.0: 2 ports detected
[    5.770000] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    5.780000] ohci-platform ohci-platform: Generic Platform OHCI Controller
[    5.780000] ohci-platform ohci-platform: new USB bus registered, assigned bus number 2
[    5.790000] ohci-platform ohci-platform: irq 14, io mem 0x1c000000
[    5.860000] hub 2-0:1.0: USB hub found
[    5.860000] hub 2-0:1.0: 2 ports detected
[    6.210000] ar71xx: pll_reg 0xb8050010: 0x11110000
[    6.210000] eth0: link up (1000Mbps/Full duplex)
[    9.460000] jffs2: notice: (327) jffs2_build_xattr_subsystem: complete building xattr subsystem, 20 of xdatum (1 unchecked, 19 orphan) and 33 of xref (0 dead, 19 orphan) found.
[    9.500000] eth0: link down
[   10.680000] NET: Registered protocol family 10
[   10.690000] nf_conntrack version 0.5.0 (954 buckets, 3816 max)
[   10.700000] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   10.710000] Loading modules backported from Linux version master-2014-05-22-0-gf2032ea
[   10.720000] Backport generated by backports.git backports-20140320-37-g5c33da0
[   10.730000] ip_tables: (C) 2000-2006 Netfilter Core Team
[   10.770000] xt_time: kernel timezone is -0000
[   10.790000] cfg80211: Calling CRDA to update world regulatory domain
[   10.790000] cfg80211: World regulatory domain updated:
[   10.800000] cfg80211:  DFS Master region: unset
[   10.800000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   10.810000] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[   10.820000] cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[   10.830000] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
[   10.840000] cfg80211:   (5170000 KHz - 5250000 KHz @ 160000 KHz), (N/A, 2000 mBm), (N/A)
[   10.840000] cfg80211:   (5250000 KHz - 5330000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[   10.850000] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[   10.860000] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[   10.870000] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
[   10.900000] PPP generic driver version 2.4.2
[   10.930000] NET: Registered protocol family 24
[   10.960000] PCI: Enabling device 0000:00:11.0 (0000 -> 0002)
[   10.970000] ath: EEPROM regdomain: 0x0
[   10.970000] ath: EEPROM indicates default country code should be used
[   10.970000] ath: doing EEPROM country->regdmn map search
[   10.970000] ath: country maps to regdmn code: 0x3a
[   10.970000] ath: Country alpha2 being used: US
[   10.970000] ath: Regpair used: 0x3a
[   10.990000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   11.000000] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
[   11.010000] cfg80211: Calling CRDA for country: US
[   11.010000] cfg80211: Regulatory domain changed to country: US
[   11.020000] cfg80211:  DFS Master region: FCC
[   11.020000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   11.030000] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   11.040000] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 1700 mBm), (N/A)
[   11.050000] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz), (N/A, 2300 mBm), (0 s)
[   11.050000] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 3000 mBm), (N/A)
[   11.060000] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 4000 mBm), (N/A)
[   11.070000] PCI: Enabling device 0000:00:12.0 (0000 -> 0002)
[   11.080000] ath: EEPROM regdomain: 0x0
[   11.080000] ath: EEPROM indicates default country code should be used
[   11.080000] ath: doing EEPROM country->regdmn map search
[   11.080000] ath: country maps to regdmn code: 0x3a
[   11.080000] ath: Country alpha2 being used: US
[   11.080000] ath: Regpair used: 0x3a
[   11.120000] ieee80211 phy1: Selected rate control algorithm 'minstrel_ht'
[   11.120000] ieee80211 phy1: Atheros AR9280 Rev:2 mem=0xb0010000, irq=41
[   17.420000] ar71xx: pll_reg 0xb8050010: 0x11110000
[   17.420000] eth0: link up (1000Mbps/Full duplex)
[   17.420000] device eth0.1 entered promiscuous mode
[   17.430000] device eth0 entered promiscuous mode
[   17.440000] br-lan: port 1(eth0.1) entered forwarding state
[   17.440000] br-lan: port 1(eth0.1) entered forwarding state
[   17.460000] device eth0.2 entered promiscuous mode
[   17.480000] br-lan2: port 1(eth0.2) entered forwarding state
[   17.480000] br-lan2: port 1(eth0.2) entered forwarding state
[   17.520000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   18.370000] cfg80211: Calling CRDA for country: CA
[   18.370000] cfg80211: Regulatory domain changed to country: CA
[   18.380000] cfg80211:  DFS Master region: FCC
[   18.380000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   18.390000] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   18.400000] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 1700 mBm), (N/A)
[   18.410000] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz), (N/A, 2400 mBm), (0 s)
[   18.410000] cfg80211:   (5490000 KHz - 5730000 KHz @ 80000 KHz), (N/A, 2400 mBm), (0 s)
[   18.420000] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 3000 mBm), (N/A)
[   19.130000] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[   19.160000] device wlan1 entered promiscuous mode
[   19.160000] br-lan: port 2(wlan1) entered forwarding state
[   19.170000] br-lan: port 2(wlan1) entered forwarding state

comment:44 Changed 3 years ago by nbd

@severn: I could not reproduce the issues with iperf, but with sites like speedtest.net they do show up on my device as well (with both patches).
I've reproduced it with the plain default config, routing from LAN (eth0.1) to WAN (eth1).

comment:45 Changed 3 years ago by severn

How odd: websites work for me, and so does speedtest.net.

@nbd - can you take packet captures on both sides of the router (LAN and WAN) and see which packets don't show up on the other side (i.e. when TX stops)?

Can you try enabling/disabling tx/rx pause frames on the router? (I think ethtool should be able to do it)

comment:46 Changed 3 years ago by severn

@nbd Can you also see whether it helps to not clear transmitted descriptors until the entire packet has been transmitted?

i.e. line 907

 if (!ag71xx_desc_empty(desc) || !ag71xx_desc_empty(last_desc_of_packet)) {

comment:47 follow-up: Changed 3 years ago by anonymous

I have three WNDR3800s, and all three behaved differently with the patch. One worked (I didn't see any problems), one failed completely (I could no longer connect to it via LAN, and all switch ports toggled), and one half failed (WAN kept toggling on and off). All three have the same config and the same image build. The only difference is how each of them connects to the internet. The working one is connected via WAN to another OpenWRT router (static IP, NAT routed); the second, which broke completely, connects to a Telekom VDSL router (also static IP and NAT routed); the third, with the toggling WAN, is connected via DHCP to a cable modem. The first one doesn't use any IPv6; the other two do in some form, although one of them doesn't have IPv6 from the ISP.

comment:48 Changed 3 years ago by anonymous

Here's a dmesg from one of the non-working routers. I don't really know if it helps (I don't think so), and note that it was taken without the patch; I don't really want to patch again, because it was a mess to get the routers working again each time:

[    0.000000] Linux version 3.10.49 (openwrt@Dice-Server) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r42321) ) #1 Thu Aug 28 19:57:20 CEST 2014
[    0.000000] MyLoader: sysp=aaaa5554, boardp=aaaa5554, parts=aaaa5554
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU revision is: 00019374 (MIPS 24Kc)
[    0.000000] SoC: Atheros AR7161 rev 2
[    0.000000] Clocks: CPU:680.000MHz, DDR:340.000MHz, AHB:170.000MHz, Ref:40.000MHz
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 08000000 @ 00000000 (usable)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x00000000-0x07ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00000000-0x07ffffff]
[    0.000000] On node 0 totalpages: 32768
[    0.000000] free_area_init_node: node 0, pgdat 803129b0, node_mem_map 81000000
[    0.000000]   Normal zone: 256 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 32768 pages, LIFO batch:7
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32512
[    0.000000] Kernel command line:  board=WNDR3700 console=ttyS0,115200 mtdparts=spi0.0:320k(u-boot)ro,128k(u-boot-env)ro,15872k(firmware),64k(art)ro rootfstype=squashfs,jffs2 noinitrd
[    0.000000] PID hash table entries: 512 (order: -1, 2048 bytes)
[    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
[    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
[    0.000000] Writing ErrCtl register=00000000
[    0.000000] Readback ErrCtl register=00000000
[    0.000000] Memory: 126184k/131072k available (2253k kernel code, 4888k reserved, 597k data, 284k init, 0k highmem)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS:51
[    0.060000] Calibrating delay loop... 452.19 BogoMIPS (lpj=2260992)
[    0.060000] pid_max: default: 32768 minimum: 301
[    0.060000] Mount-cache hash table entries: 512
[    0.070000] NET: Registered protocol family 16
[    0.070000] MIPS: machine is NETGEAR WNDR3700/WNDR3800/WNDRMAC
[    2.680000] registering PCI controller with io_map_base unset
[    2.690000] bio: create slab <bio-0> at 0
[    2.690000] PCI host bridge to bus 0000:00
[    2.700000] pci_bus 0000:00: root bus resource [mem 0x10000000-0x16ffffff]
[    2.700000] pci_bus 0000:00: root bus resource [io  0x0000]
[    2.710000] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[    2.710000] pci 0000:00:11.0: [168c:ff1d] type 00 class 0x020000
[    2.710000] pci 0000:00:11.0: fixup device configuration
[    2.720000] pci 0000:00:11.0: reg 10: [mem 0x00000000-0x0000ffff]
[    2.720000] pci 0000:00:11.0: PME# supported from D0 D3hot
[    2.720000] pci 0000:00:12.0: [168c:ff1d] type 00 class 0x020000
[    2.720000] pci 0000:00:12.0: fixup device configuration
[    2.720000] pci 0000:00:12.0: reg 10: [mem 0x00000000-0x0000ffff]
[    2.720000] pci 0000:00:12.0: PME# supported from D0 D3hot
[    2.720000] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
[    2.720000] pci 0000:00:11.0: BAR 0: assigned [mem 0x10000000-0x1000ffff]
[    2.730000] pci 0000:00:12.0: BAR 0: assigned [mem 0x10010000-0x1001ffff]
[    2.730000] pci 0000:00:11.0: using irq 40 for pin 1
[    2.740000] pci 0000:00:12.0: using irq 41 for pin 1
[    2.740000] Switching to clocksource MIPS
[    2.750000] NET: Registered protocol family 2
[    2.750000] TCP established hash table entries: 1024 (order: 1, 8192 bytes)
[    2.750000] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    2.760000] TCP: Hash tables configured (established 1024 bind 1024)
[    2.760000] TCP: reno registered
[    2.770000] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    2.770000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    2.780000] NET: Registered protocol family 1
[    2.780000] PCI: CLS 0 bytes, default 32
[    2.790000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    2.800000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    2.810000] msgmni has been set to 246
[    2.810000] io scheduler noop registered
[    2.820000] io scheduler deadline registered (default)
[    2.820000] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    2.850000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11) is a 16550A
[    2.860000] console [ttyS0] enabled, bootconsole disabled
[    2.870000] ath79-spi ath79-spi: master is unqueued, this is deprecated
[    2.880000] m25p80 spi0.0: found mx25l12805d, expected m25p80
[    2.880000] m25p80 spi0.0: mx25l12805d (16384 Kbytes)
[    2.890000] 4 cmdlinepart partitions found on MTD device spi0.0
[    2.890000] Creating 4 MTD partitions on "spi0.0":
[    2.900000] 0x000000000000-0x000000050000 : "u-boot"
[    2.910000] 0x000000050000-0x000000070000 : "u-boot-env"
[    2.910000] 0x000000070000-0x000000ff0000 : "firmware"
[    2.920000] 2 netgear-fw partitions found on MTD device firmware
[    2.930000] 0x000000070000-0x000000176440 : "kernel"
[    2.930000] mtd: partition "kernel" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    2.940000] 0x000000176440-0x000000ff0000 : "rootfs"
[    2.950000] mtd: partition "rootfs" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    2.960000] mtd: device 4 (rootfs) set to be root filesystem
[    2.970000] 1 squashfs-split partitions found on MTD device rootfs
[    2.970000] 0x0000007f0000-0x000000ff0000 : "rootfs_data"
[    2.980000] 0x000000ff0000-0x000001000000 : "art"
[    2.990000] Realtek RTL8366S ethernet switch driver version 0.2.2
[    3.000000] rtl8366s rtl8366s: using GPIO pins 5 (SDA) and 7 (SCK)
[    3.000000] rtl8366s rtl8366s: RTL8366 ver. 1 chip found
[    3.040000] libphy: rtl8366s: probed
[    3.350000] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode:RGMII
[    3.650000] ag71xx ag71xx.1: connected to PHY at rtl8366s:04 [uid=001cc960, driver=Generic PHY]
[    3.660000] eth1: Atheros AG71xx at 0xba000000, irq 5, mode:RGMII
[    3.670000] TCP: cubic registered
[    3.670000] NET: Registered protocol family 17
[    3.680000] 8021q: 802.1Q VLAN Support v1.8
[    3.690000] VFS: Mounted root (squashfs filesystem) readonly on device 31:4.
[    3.700000] Freeing unused kernel memory: 284K (80329000 - 80370000)
[    6.320000] usbcore: registered new interface driver usbfs
[    6.320000] usbcore: registered new interface driver hub
[    6.330000] usbcore: registered new device driver usb
[    6.360000] SCSI subsystem initialized
[    6.370000] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    6.380000] ehci-platform: EHCI generic platform driver
[    6.380000] ehci-platform ehci-platform: EHCI Host Controller
[    6.390000] ehci-platform ehci-platform: new USB bus registered, assigned bus number 1
[    6.400000] ehci-platform ehci-platform: irq 3, io mem 0x1b000000
[    6.420000] ehci-platform ehci-platform: USB 2.0 started, EHCI 1.00
[    6.420000] hub 1-0:1.0: USB hub found
[    6.430000] hub 1-0:1.0: 2 ports detected
[    6.430000] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    6.440000] ohci-platform ohci-platform: Generic Platform OHCI Controller
[    6.450000] ohci-platform ohci-platform: new USB bus registered, assigned bus number 2
[    6.450000] ohci-platform ohci-platform: irq 14, io mem 0x1c000000
[    6.520000] hub 2-0:1.0: USB hub found
[    6.520000] hub 2-0:1.0: 2 ports detected
[    6.530000] usbcore: registered new interface driver usb-storage
[    6.770000] usb 1-1: new high-speed USB device number 2 using ehci-platform
[    6.980000] hub 1-1:1.0: USB hub found
[    6.980000] hub 1-1:1.0: 4 ports detected
[    7.050000] ar71xx: pll_reg 0xb8050010: 0x11110000
[    7.050000] eth0: link up (1000Mbps/Full duplex)
[    7.270000] usb 1-1.4: new high-speed USB device number 3 using ehci-platform
[    7.400000] usb-storage 1-1.4:1.0: USB Mass Storage device detected
[    7.400000] scsi0 : usb-storage 1-1.4:1.0
[    8.410000] scsi 0:0:0:0: Direct-Access              Flash Disk       5.00 PQ: 0 ANSI: 2
[    8.420000] sd 0:0:0:0: [sda] 974848 512-byte logical blocks: (499 MB/476 MiB)
[    8.420000] sd 0:0:0:0: [sda] Write Protect is off
[    8.430000] sd 0:0:0:0: [sda] Mode Sense: 0b 00 00 08
[    8.430000] sd 0:0:0:0: [sda] No Caching mode page found
[    8.440000] sd 0:0:0:0: [sda] Assuming drive cache: write through
[    8.440000] sd 0:0:0:0: [sda] No Caching mode page found
[    8.450000] sd 0:0:0:0: [sda] Assuming drive cache: write through
[    8.460000]  sda: sda1
[    8.460000] sd 0:0:0:0: [sda] No Caching mode page found
[    8.470000] sd 0:0:0:0: [sda] Assuming drive cache: write through
[    8.470000] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   10.640000] jffs2: notice: (369) jffs2_build_xattr_subsystem: complete building xattr subsystem, 43 of xdatum (1 unchecked, 42 orphan) and 70 of xref (0 dead, 47 orphan) found.
[   10.830000] jffs2: notice: (366) jffs2_build_xattr_subsystem: complete building xattr subsystem, 43 of xdatum (1 unchecked, 42 orphan) and 70 of xref (0 dead, 47 orphan) found.
[   10.870000] eth0: link down
[   11.280000] EXT4-fs (sda1): warning: maximal mount count reached, running e2fsck is recommended
[   11.290000] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts:
[   13.390000] NET: Registered protocol family 10
[   13.400000] tun: Universal TUN/TAP device driver, 1.6
[   13.410000] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
[   13.420000] nf_conntrack version 0.5.0 (1976 buckets, 7904 max)
[   13.430000] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   13.440000] Netfilter messages via NETLINK v0.30.
[   13.450000] ip_set: protocol 6
[   13.470000] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   13.490000] u32 classifier
[   13.490000]     input device check on
[   13.490000]     Actions configured
[   13.500000] Mirror/redirect action on
[   13.510000] Loading modules backported from Linux version master-2014-05-22-0-gf2032ea
[   13.520000] Backport generated by backports.git backports-20140320-37-g5c33da0
[   13.530000] ip_tables: (C) 2000-2006 Netfilter Core Team
[   13.560000] usbcore: registered new interface driver usblp
[   13.570000] usbcore: registered new interface driver usbserial
[   13.570000] usbcore: registered new interface driver usbserial_generic
[   13.580000] usbserial: USB Serial support registered for generic
[   13.620000] xt_time: kernel timezone is -0000
[   13.650000] cfg80211: Calling CRDA to update world regulatory domain
[   13.650000] cfg80211: World regulatory domain updated:
[   13.660000] cfg80211:  DFS Master region: unset
[   13.660000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   13.670000] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[   13.680000] cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[   13.690000] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (N/A, 2000 mBm), (N/A)
[   13.700000] cfg80211:   (5170000 KHz - 5250000 KHz @ 160000 KHz), (N/A, 2000 mBm), (N/A)
[   13.700000] cfg80211:   (5250000 KHz - 5330000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[   13.710000] cfg80211:   (5490000 KHz - 5730000 KHz @ 160000 KHz), (N/A, 2000 mBm), (0 s)
[   13.720000] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[   13.730000] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
[   13.740000] usbcore: registered new interface driver ftdi_sio
[   13.740000] usbserial: USB Serial support registered for FTDI USB Serial Device
[   13.800000] PPP generic driver version 2.4.2
[   13.810000] NET: Registered protocol family 24
[   13.850000] PCI: Enabling device 0000:00:11.0 (0000 -> 0002)
[   13.860000] ath: EEPROM regdomain: 0x0
[   13.860000] ath: EEPROM indicates default country code should be used
[   13.860000] ath: doing EEPROM country->regdmn map search
[   13.860000] ath: country maps to regdmn code: 0x3a
[   13.860000] ath: Country alpha2 being used: US
[   13.860000] ath: Regpair used: 0x3a
[   13.880000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   13.890000] cfg80211: Calling CRDA for country: US
[   13.890000] cfg80211: Regulatory domain changed to country: US
[   13.900000] cfg80211:  DFS Master region: FCC
[   13.900000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   13.910000] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   13.920000] cfg80211:   (5170000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 1700 mBm), (N/A)
[   13.930000] cfg80211:   (5250000 KHz - 5330000 KHz @ 80000 KHz), (N/A, 2300 mBm), (0 s)
[   13.940000] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 3000 mBm), (N/A)
[   13.950000] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 4000 mBm), (N/A)
[   13.950000] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
[   13.980000] PCI: Enabling device 0000:00:12.0 (0000 -> 0002)
[   13.990000] ath: EEPROM regdomain: 0x0
[   13.990000] ath: EEPROM indicates default country code should be used
[   13.990000] ath: doing EEPROM country->regdmn map search
[   13.990000] ath: country maps to regdmn code: 0x3a
[   13.990000] ath: Country alpha2 being used: US
[   13.990000] ath: Regpair used: 0x3a
[   14.010000] ieee80211 phy1: Selected rate control algorithm 'minstrel_ht'
[   14.010000] ieee80211 phy1: Atheros AR9280 Rev:2 mem=0xb0010000, irq=41
[   21.060000] ar71xx: pll_reg 0xb8050010: 0x11110000
[   21.070000] eth0: link up (1000Mbps/Full duplex)
[   21.080000] device eth0.1 entered promiscuous mode
[   21.080000] device eth0 entered promiscuous mode
[   21.110000] br-lan: port 1(eth0.1) entered forwarding state
[   21.110000] br-lan: port 1(eth0.1) entered forwarding state
[   21.170000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   22.540000] cfg80211: Calling CRDA for country: DE
[   22.600000] cfg80211: Regulatory domain changed to country: DE
[   22.610000] cfg80211:  DFS Master region: ETSI
[   22.610000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   22.620000] cfg80211:   (2400000 KHz - 2483000 KHz @ 40000 KHz), (N/A, 2000 mBm), (N/A)
[   22.630000] cfg80211:   (5150000 KHz - 5250000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
[   22.640000] cfg80211:   (5250000 KHz - 5350000 KHz @ 80000 KHz), (N/A, 2000 mBm), (0 s)
[   22.640000] cfg80211:   (5470000 KHz - 5725000 KHz @ 80000 KHz), (N/A, 2700 mBm), (0 s)
[   22.650000] cfg80211:   (57240000 KHz - 65880000 KHz @ 2160000 KHz), (N/A, 4000 mBm), (N/A)
[   23.110000] br-lan: port 1(eth0.1) entered forwarding state
[   23.790000] ar71xx: pll_reg 0xb8050014: 0x1099
[   23.790000] eth1: link up (100Mbps/Full duplex)
[   23.810000] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   24.600000] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   24.660000] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[   24.670000] device wlan0 entered promiscuous mode
[   24.680000] br-lan: port 2(wlan0) entered forwarding state
[   24.680000] br-lan: port 2(wlan0) entered forwarding state
[   24.710000] device wlan1 entered promiscuous mode
[   24.710000] br-lan: port 3(wlan1) entered forwarding state
[   24.720000] br-lan: port 3(wlan1) entered forwarding state
[   24.860000] br-lan: port 2(wlan0) entered disabled state
[   25.240000] br-lan: port 3(wlan1) entered disabled state
[   27.020000] br-lan: port 2(wlan0) entered forwarding state
[   27.030000] br-lan: port 2(wlan0) entered forwarding state
[   27.030000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   27.080000] br-lan: port 3(wlan1) entered forwarding state
[   27.080000] br-lan: port 3(wlan1) entered forwarding state
[   27.090000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[   29.030000] br-lan: port 2(wlan0) entered forwarding state
[   29.080000] br-lan: port 3(wlan1) entered forwarding state

comment:49 Changed 3 years ago by anonymous

Nothing new on this?

comment:50 in reply to: ↑ 47 Changed 3 years ago by robnitro@…

Replying to anonymous:

I have three WNDR3800s, and all three behaved differently with the patch. One worked (I didn't see any problems), one failed completely (I could no longer connect to it via LAN, and all switch ports toggled), and one half failed (WAN kept toggling on and off). All three have the same config and the same image build. The only difference is how each of them connects to the internet. The working one is connected via WAN to another OpenWRT router (static IP, NAT routed); the second, which broke completely, connects to a Telekom VDSL router (also static IP and NAT routed); the third, with the toggling WAN, is connected via DHCP to a cable modem. The first one doesn't use any IPv6; the other two do in some form, although one of them doesn't have IPv6 from the ISP.

That gives a good clue!
1 - working) direct LAN connection to another router, no ISP.
2 - fail) DSL, PPPoE?
3 - partial up/down) normal Ethernet, cable ISP.

It seems like there may be some timeout that gets triggered which isn't going to happen on a direct WAN link.
The DSL one may be failing because, in my own experience, DSL is less reliable than a cable ISP: noise on the line causes retransmits and lost frames.

So maybe the issue is that this patch works in ideal situations, like nbd's iperf test, but on the web your results may vary, and it only seems to happen when the load goes up (with a higher possibility of lost packets/retransmits?).

comment:51 Changed 3 years ago by severn

I've managed to reproduce the issue consistently by sending pings with a payload size of 214, 218, 470 or 474 bytes. These values correspond to 214 (payload) + 28 (IP + ICMP headers) + 14 (Ethernet header) + 4 (VLAN tag) = 256+4 bytes, or 218 + 28 + 14 + 0 (no VLAN) = 256+4 bytes; likewise, 470 and 474 give 512+4 bytes.

The TX engine seems to hang after it DMAs a chunk of 4 bytes or less. I discovered this when I changed the minimum SKB length but made a mistake in setting up the split.
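
The arithmetic above can be checked with a short sketch. This is only an illustration of the byte counting: the header sizes are the ones quoted in this comment, while `SEGMENT_LEN`, `SHORT_TAIL` and the function names are assumptions made up for the example and are not taken from the ag71xx driver. The thread only confirms hangs for a 4-byte tail; flagging tails of 1 to 3 bytes as well follows the "4 bytes or less" wording.

```python
# Illustrative byte counting only; not driver code.
ETH_HEADER = 14      # Ethernet header, bytes
VLAN_TAG = 4         # 802.1Q tag, bytes
IP_ICMP_HEADER = 28  # 20-byte IPv4 header + 8-byte ICMP header

SEGMENT_LEN = 256    # assumed DMA segment length (hypothetical name)
SHORT_TAIL = 4       # a trailing chunk of this size or less is treated as suspect


def frame_len(payload, vlan):
    """Frame length in bytes (without FCS) for an ICMP echo payload."""
    return payload + IP_ICMP_HEADER + ETH_HEADER + (VLAN_TAG if vlan else 0)


def tail_len(length, segment_len=SEGMENT_LEN):
    """Bytes left over after cutting the frame every segment_len bytes."""
    return length % segment_len


def is_suspect(payload, vlan, segment_len=SEGMENT_LEN):
    """True if the frame spans several segments and ends in a very short tail."""
    length = frame_len(payload, vlan)
    tail = tail_len(length, segment_len)
    return length > segment_len and 0 < tail <= SHORT_TAIL


if __name__ == "__main__":
    for payload, vlan in [(214, True), (218, False), (470, True), (474, False)]:
        length = frame_len(payload, vlan)
        print(payload, "vlan" if vlan else "no vlan", length, tail_len(length))
```

All four reported payload sizes come out at 260 or 516 bytes, which leaves a 4-byte tail after a 256-byte boundary.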

I also did some more performance testing and found that a segment length of 512 seems to perform better than 256. Coincidentally, 512 is also what the stock firmware uses.

From: http://wiki.openwrt.org/toh/netgear/wndr3800

AG7100: Length per segment 512
AG7100: Max segments per packet 4
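
To make the segment figures above concrete, here is a rough sketch of cutting a frame into 512-byte segments while keeping the last segment longer than 4 bytes. It is an illustration only and is not the patch referred to in this comment: `split_frame`, `MIN_TAIL` and the idea of moving bytes from the previous segment are assumptions for the example, and a real driver may have alignment or segment-count constraints that this ignores.

```python
SEGMENT_LEN = 512  # "Length per segment" reported by the stock firmware
MIN_TAIL = 5       # assumption: shortest last segment considered safe (more than 4 bytes)


def split_frame(length, segment_len=SEGMENT_LEN, min_tail=MIN_TAIL):
    """Return the list of segment sizes for a frame of `length` bytes.

    If the last segment would be shorter than min_tail, bytes are moved
    over from the previous segment so that no segment ends up too short.
    """
    if length <= 0:
        raise ValueError("length must be positive")

    sizes = []
    remaining = length
    while remaining > 0:
        cur = min(remaining, segment_len)
        sizes.append(cur)
        remaining -= cur

    # Rebalance only when there is a previous segment to borrow from.
    if len(sizes) > 1 and sizes[-1] < min_tail:
        shift = min_tail - sizes[-1]
        sizes[-2] -= shift
        sizes[-1] += shift
    return sizes
```

With these assumptions, a 516-byte frame is cut as 511 + 5 bytes instead of 512 + 4, and a full-size 1518-byte frame still fits within the four segments the stock firmware reports.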

Following this is a patch against current trunk... let me know if it helps.

Changed 3 years ago by severn

comment:52 Changed 3 years ago by nbd

@severn: Thanks for tracking this one down. I ran tests myself and can confirm that the speed test issues are gone.
I also ran a script that tested all possible payload sizes. Without the off-by-one fix it ran into a hang; with the fix, all payload sizes work.
Fixes committed in r42427-r42429.
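
A payload-size sweep of the kind described above could be set up roughly as follows. This is not the script that was actually used, which is not shown in the ticket; the function name, the default range (0 to 1472 bytes, the largest ICMP payload that fits a 1500-byte MTU unfragmented) and the example address are assumptions. The sketch only builds the `ping` command lines (`-c` count, `-W` reply timeout, `-s` payload size) and leaves running them to the caller.

```python
def ping_sweep_commands(host, first=0, last=1472, count=1, timeout_s=1):
    """Build one ping command per ICMP payload size, without running anything."""
    if first < 0 or last < first:
        raise ValueError("need 0 <= first <= last")
    return [
        ["ping", "-c", str(count), "-W", str(timeout_s), "-s", str(size), host]
        for size in range(first, last + 1)
    ]


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; replace it with a host behind the router.
    for cmd in ping_sweep_commands("192.0.2.1", first=210, last=220):
        print(" ".join(cmd))
```

Each command can then be passed to `subprocess.run`, stopping at the first size that gets no reply, so that the payload which stalls the TX path is easy to spot.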

comment:53 Changed 3 years ago by anonymous

I'm not sure if this is related, but I am on r42429 and I am still getting this error:

[ 6878.010000] ------------[ cut here ]------------
[ 6878.010000] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1e8/0x26c()
[ 6878.020000] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
[ 6878.020000] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv4 mac80211 ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_nat_irc nf_nat_ftp nf_nat nf_defrag_ipv4 nf_conntrack_irc nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_REJECT ip_tables crc_ccitt compat ledtrig_usbdev ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 tun ipv6 arc4 crypto_blkcipher ohci_hcd ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common crypto_hash
[ 6878.090000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.49 #10
[ 6878.090000] Stack : 00000000 00000000 00000000 00000000 8038bf1e 00000033 8031af40 00000089
[ 6878.090000] 	  802d3984 8031b36b 00000000 803839dc 8031af40 00000089 8038b874 00000001
[ 6878.090000] 	  00000004 8027aa6c 00000003 801e2bfc 802ec14c 00000089 802d5014 8030bc74
[ 6878.090000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 6878.090000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 8030bc00
[ 6878.090000] 	  ...
[ 6878.130000] Call Trace:
[ 6878.130000] [<802237a0>] show_stack+0x48/0x70
[ 6878.140000] [<80289ff0>] warn_slowpath_common+0x78/0xa8
[ 6878.140000] [<8028a04c>] warn_slowpath_fmt+0x2c/0x38
[ 6878.150000] [<80100bec>] dev_watchdog+0x1e8/0x26c
[ 6878.150000] [<800e0a94>] call_timer_fn.isra.38+0x24/0x84
[ 6878.160000] [<8020c788>] run_timer_softirq+0x17c/0x1bc
[ 6878.160000] [<8008bc70>] __do_softirq+0xd0/0x1bc
[ 6878.170000] [<80115770>] do_softirq+0x48/0x68
[ 6878.170000] [<80172d80>] irq_exit+0x54/0x70
[ 6878.170000] [<80060830>] ret_from_irq+0x0/0x4
[ 6878.180000] [<80060a80>] __r4k_wait+0x20/0x40
[ 6878.180000] [<800f06e4>] cpu_startup_entry+0xa4/0x104
[ 6878.190000] [<80327910>] start_kernel+0x38c/0x3a4
[ 6878.190000] 
[ 6878.190000] ---[ end trace 477f36915f3808c0 ]---
[ 6878.200000] eth1: tx timeout
[ 6878.200000] eth1: link down
[ 6878.210000] ar71xx: pll_reg 0xb8050014: 0x11110000
[ 6878.210000] eth1: link up (1000Mbps/Full duplex)

comment:54 Changed 3 years ago by nbd

Did you clean your kernel tree after updating?

comment:55 Changed 3 years ago by anonymous

No, I did not. I will clean and rebuild.

comment:56 Changed 3 years ago by robnitro@…

Works well on my WNDR3800CH; I am using it double-NATed right now until I feel it is stable.
netperfrunner.sh from the CeroWrt scripts page gives me a reasonable 40/40.
It's similar when I run a speedtest.net download/upload, and an FTP download doesn't drop much.

Great job guys! Good teamwork and reverse engineering of Netgear's original settings!

comment:57 Changed 3 years ago by anonymous

Thanks guys! I can also confirm this speeds things up quite a bit: iperf performance on my WNDR3700 v1 went up to 250 Mbit/s from about 100 Mbit/s (eth0 <-> eth0.20), with not a single disconnect on the WAN interface.

comment:58 Changed 3 years ago by nbd

  • Resolution set to fixed
  • Status changed from reopened to closed

backported to BB in r42433, thanks for testing

comment:59 Changed 3 years ago by anonymous

Awesome news! I will test this now too on my three wndr3800 (hope it stays stable :p). What's meant by cleaning the kernel tree? I've seen it a few times in the comments now. Does it matter if you sysupgrade too?

comment:60 Changed 3 years ago by anonymous

@robnitro Where can I get netperf for OpenWrt? There's no package in trunk right now.

comment:61 Changed 3 years ago by robnitro@…

For netperf, I copied the package link from BB and ran
opkg install (the bb netperf pkg address)

Same for memtester, which is only in AA.

comment:62 Changed 3 years ago by anonymous

@robnitro But is this test scenario really what counts, running that script on the router itself? When I tested before, I noticed that a download/upload on the router itself worked; the problem only showed up when doing it from a PC behind the router on the switch. Hope this is fixed with the patch.

comment:63 Changed 3 years ago by robnitro@…

Oh yeah, same here. But before the patch, netperfrunner.sh was horribly unbalanced... I was getting 60/8 on a 75/75 line.

Also, to whoever asked: sysupgrade is fine. Compiling is when it's best to make clean.

comment:64 Changed 3 years ago by anonymous

@nbd Latest trunk is crashing PPPoE wan interface for me until reboot. Sometimes it reconnects, but it seems broken.

comment:65 Changed 3 years ago by robnitro@…

To narrow it down, maybe you can try installing ethtool to change the speed.
E.g. if you have 1000 mbit full duplex to the pppoe,
then try ethtool -s eth1 speed 100 duplex full autoneg on
If that's bad, try with autoneg off instead of on.

You can also try ethtool -s eth1 speed 1000 duplex full autoneg off

comment:66 Changed 3 years ago by anonymous

The loss of the WAN PPPoE interface seems to be related to the 1508 bytes MTU.

The problem seemed to go away after switching to the default 1500 bytes MTU.

Latency has increased a bit after upgrading to trunk with these slicing patches.

comment:67 Changed 3 years ago by anonymous

@robnitro Changing the tx ring size makes the wan interface disconnect right away and it takes several attempts to restore it. Something like this is probably going to happen if I change link speed with ethtool.

I was able to crash the router by changing the tx ring size on eth0 (wan interface) once. I couldn't ping the router on the lan interface.

I'm able to get 10-40 mbps more in speedtest with these slicing changes, but that seems to fluctuate a lot. That is probably because of PPPoE which can't be handled efficiently by the slow ar71xx.

comment:68 Changed 3 years ago by anonymous

Wait so newest patch is unstable again?? I did a speedtest with that script and get

Download: 34.48 Mbps
Upload: 7.57 Mbps

But it's a 50/10 line. I wonder if the patch doesn't work for me or if it's the server in that script?

comment:69 Changed 3 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

Sigh... I can confirm there are horrible latency problems with this newest build:

PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=50 time=73.062 ms
64 bytes from 8.8.8.8: seq=1 ttl=50 time=25.389 ms
64 bytes from 8.8.8.8: seq=2 ttl=50 time=25.531 ms
64 bytes from 8.8.8.8: seq=3 ttl=50 time=53.505 ms
64 bytes from 8.8.8.8: seq=4 ttl=50 time=56.439 ms
64 bytes from 8.8.8.8: seq=5 ttl=50 time=25.493 ms
64 bytes from 8.8.8.8: seq=6 ttl=50 time=25.315 ms
64 bytes from 8.8.8.8: seq=7 ttl=50 time=58.981 ms
64 bytes from 8.8.8.8: seq=8 ttl=50 time=63.122 ms
64 bytes from 8.8.8.8: seq=9 ttl=50 time=26.129 ms
64 bytes from 8.8.8.8: seq=10 ttl=50 time=25.988 ms
PING heise.de (193.99.144.80): 56 data bytes
64 bytes from 193.99.144.80: seq=0 ttl=249 time=11.285 ms
64 bytes from 193.99.144.80: seq=1 ttl=249 time=11.323 ms
64 bytes from 193.99.144.80: seq=2 ttl=249 time=11.447 ms
64 bytes from 193.99.144.80: seq=3 ttl=249 time=55.736 ms
64 bytes from 193.99.144.80: seq=4 ttl=249 time=11.423 ms
64 bytes from 193.99.144.80: seq=5 ttl=249 time=11.544 ms
64 bytes from 193.99.144.80: seq=6 ttl=249 time=13.831 ms
64 bytes from 193.99.144.80: seq=7 ttl=249 time=42.619 ms
64 bytes from 193.99.144.80: seq=8 ttl=249 time=11.645 ms
64 bytes from 193.99.144.80: seq=9 ttl=249 time=11.507 ms
64 bytes from 193.99.144.80: seq=10 ttl=249 time=11.370 ms
64 bytes from 193.99.144.80: seq=11 ttl=249 time=42.038 ms
64 bytes from 193.99.144.80: seq=12 ttl=249 time=11.636 ms

Before patch I had constant 11ms pings.
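Ping logs like the ones in this comment can be summarized instead of compared by eye. The following is only a rough sketch, not something used on this ticket: the function names and the 1.5x-median spike threshold are arbitrary assumptions, and the sample lines are copied from the heise.de log above.

```python
import re
from statistics import median

# Matches busybox/Linux output ("time=11.285 ms") as well as the
# German Windows output that appears later in this ticket ("Zeit=52ms").
RTT_RE = re.compile(r"(?:time|Zeit)[=<]([\d.]+)\s*ms")

def parse_rtts(ping_output):
    """Extract all round-trip times (in ms) from raw ping output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]

def find_spikes(rtts, factor=1.5):
    """Return (baseline, spikes); a spike is any RTT above factor * median.

    The factor of 1.5 is an arbitrary choice for illustration.
    """
    baseline = median(rtts)
    return baseline, [r for r in rtts if r > factor * baseline]

sample = """\
64 bytes from 193.99.144.80: seq=0 ttl=249 time=11.285 ms
64 bytes from 193.99.144.80: seq=1 ttl=249 time=11.323 ms
64 bytes from 193.99.144.80: seq=2 ttl=249 time=11.447 ms
64 bytes from 193.99.144.80: seq=3 ttl=249 time=55.736 ms
64 bytes from 193.99.144.80: seq=4 ttl=249 time=11.423 ms
"""

rtts = parse_rtts(sample)
baseline, spikes = find_spikes(rtts)
print(f"baseline {baseline:.3f} ms, {len(spikes)} of {len(rtts)} samples spiked: {spikes}")
```

Running the same summary on a patched and an unpatched image, with the same background load, would make the before/after comparison less subjective.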

comment:70 Changed 3 years ago by anonymous

Pinging heise.de [193.99.144.80] with 32 bytes of data:
Reply from 193.99.144.80: bytes=32 time=52ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=20ms TTL=248
Reply from 193.99.144.80: bytes=32 time=14ms TTL=248
Reply from 193.99.144.80: bytes=32 time=55ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=51ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=50ms TTL=248
Reply from 193.99.144.80: bytes=32 time=59ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=57ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=73ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=100ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=91ms TTL=248 <-
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248

comment:71 Changed 3 years ago by anonymous

The latency jumps every few packets only appear when there is a little bit of load on the line, 2-4 mbit for example.

comment:72 Changed 3 years ago by anonymous

Here are some notes from my testing over the last few hours with ar71xx:

PPPoE drops often and the eth0 interface seems to go down frequently.

speedof.me tests don't finish 9 out of 10 times. speedtest.net hangs sometimes. TCP retransmissions are much higher than before. Browsing feels slower. Connecting to TCP services takes a lot longer and loading anything seems much slower.

speedof.me can now only go up to 1-2 mbps. It used to go up to 30 mbps or more. speedof.me starts the upload test at 50 kbps and barely manages to go up to 2mbps by the time it ends. It doesn't even manage to do anything after the download sometimes.

Latencies have gone up a lot for me as well. I'm getting 50-90 ms pings for a path where I was getting 3-4 ms.

The only improvement I was able to see is for download, but ping jumps to 100 ms or more, retransmissions are high and the latency for starting new TCP connections is higher.

This ar71xx hardware is probably very old, but this hardware used to provide better stability, latency and throughput.

comment:73 follow-up: Changed 3 years ago by nbd

did you make sure it was a clean build? and when eth0 "seems to go down", are there any messages in the log?

comment:74 Changed 3 years ago by anonymous

Would be nice hearing something about the latency spikes though. Reverted back to non patched image and pings are low and stable again.

comment:75 in reply to: ↑ 73 Changed 3 years ago by anonymous

Replying to nbd:

did you make sure it was a clean build? and when eth0 "seems to go down", are there any messages in the log?

The build should be a clean one, I've run make clean to clean up the build directory and I've seen make build actually build the compiler, the kernel and the image. The build I have is OpenWrt Chaos Calmer r42435 / LuCI Trunk (svn-r10527) with kernel 3.10.49.

The logs were saying something about the loss of connectivity and that eth0 was now up again.

Here's what I get in the system logs when changing the tx ring size for eth0 to 32:

daemon.notice netifd: Interface 'wan' is now down
kern.info kernel: [ 4020.899000] eth0: link down
daemon.notice netifd: Interface 'wan' is disabled
kern.info kernel: [ 4020.910000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
daemon.notice netifd: Interface 'wan' is enabled
daemon.notice netifd: Interface 'wan' is setting up now
daemon.notice netifd: Network device 'eth0' link is down
daemon.notice netifd: Interface 'wan' has link connectivity loss
daemon.info pppd[7664]: Plugin rp-pppoe.so loaded.
daemon.info pppd[7664]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.6
daemon.notice pppd[7664]: pppd 2.4.6 started by root, uid 0

and the kernel logs:

[ 4020.899000] eth0: link down
[ 4020.910000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 4021.682000] ar71xx: pll_reg 0xb8050010: 0x110000
[ 4021.682000] eth0: link up (1000Mbps/Full duplex)
[ 4021.691000] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

This is similar to what I was seeing when the wan interface was dropping on its own.

The network interface came back and the PPPoE connection was re-established, but this doesn't always happen. Sometimes it seems the kernel panics or both network interfaces get stuck and there's no way to debug without serial console.

I'll try another build with a fresh git clone and report back. I can provide more information gathered from this ar7161 if it helps.

comment:76 Changed 3 years ago by hnyman

I have been happily surfing with wndr3700v2 with the new code, achieving steady 60/8 Mbit speeds (with torrents generating the traffic volume). Browsing is also smooth, but I haven't really checked latency for spikes. No events in the logs, straight 100/10 wired connection.

I am writing this because I wonder if this new code is really mature enough to include it in the final BB 14.07 (when it has not been in the rc versions).

comment:77 Changed 3 years ago by anonymous

And what's the problem with just doing a ping test, which takes about 10 seconds? Put a constant load of around 3-5 mbit on the line, for example a source-quality stream on twitch, then do a ping. Every few packets there are spikes. And this only happens with the patched drivers.

comment:78 follow-up: Changed 3 years ago by nbd

did you also test the ping without messing around with ethtool? please post what device and network config you're using.

comment:79 in reply to: ↑ 78 Changed 3 years ago by anonymous

Replying to nbd:

did you also test the ping without messing around with ethtool? please post what device and network config you're using.

network config:

config interface 'loopback'
        option ifname 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix '*********************'

config interface 'lan'
        option ifname 'eth1'
        option proto 'static'
        option ipaddr '***.***.***.***'
        option netmask '255.255.255.0'
        option ip6assign '60'

config interface 'wan'
        option ifname 'eth0'
        option _orig_ifname 'eth0'
        option _orig_bridge 'false'
        option proto 'pppoe'
        option username '*****************'
        option password '*****************'
        option delegate '0'

config interface 'wan6'
        option ifname '@wan'
        option proto 'dhcpv6'

config switch
        option name 'switch0'
        option reset '1'
        option enable_vlan '1'

config switch_vlan
        option device 'switch0'
        option vlan '1'
        option ports '0 1 2 3 4'

device: UBNT RSPRO ar7161

I was having problems without messing with ethtool. I've tried the ethtool hacks to check if it helped improve latency.

comment:80 Changed 3 years ago by anonymous

I never messed around with anything. You're mistaking me for the other three anons talking here. And I don't think he meant you (anon). wndr3800 ...

config interface 'loopback'
	option ifname 'lo'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'

config interface 'lan'
	option type 'bridge'
	option proto 'static'
	option netmask '255.255.255.0'
	option ip6assign '64'
	option ipaddr '10.0.0.201'
	option _orig_ifname 'eth0.1 wlan0 wlan1'
	option _orig_bridge 'true'
	option ifname 'eth0.1'

config interface 'wan'
	option _orig_ifname 'eth1'
	option _orig_bridge 'false'
	option ifname 'eth1'
	option proto 'static'
	option netmask '255.255.255.0'
	option dns '8.8.8.8 8.8.4.4'
	option ipaddr '10.0.1.201'
	option gateway '10.0.1.203'
	option delegate '0'

config interface 'wan6'
	option ifname '@wan'
	option proto 'dhcpv6'

config globals 'globals'
	option ula_prefix 'fdd9:7b88:29a8::/48'

config switch
	option name 'rtl8366s'
	option reset '1'
	option enable_vlan '1'
	option blinkrate '2'

config switch_vlan
	option device 'rtl8366s'
	option vlan '1'
	option ports '0 1 2 3 5t'

config switch_port
	option device 'rtl8366s'
	option port '1'
	option led '6'

config switch_port
	option device 'rtl8366s'
	option port '2'
	option led '9'

config switch_port
	option device 'rtl8366s'
	option port '5'
	option led '2'

config interface 'vpn1'
	option proto 'static'
	option ifname 'tun0'
	option auto '0'
	option delegate '0'

config interface 'vpn2'
	option proto 'static'
	option ifname 'tun1'
	option auto '0'
	option delegate '0'

config interface 'vpn3'
	option proto 'static'
	option auto '0'
	option _orig_ifname 'tap0'
	option _orig_bridge 'false'
	option ifname 'tap0'
	option delegate '0'

comment:81 Changed 3 years ago by anonymous

I don't know why, but I can only reproduce the latency spikes when I do the following:

1.) open a stream on twitch.tv, quality source, chat activated
2.) do a ping
3.) every few packets spike (on router, and on pc behind router)

The CPU of router is under 10%. The bandwidth of my connection isn't even near limit. I have a 2nd wndr3800 with older unpatched image too and there pings are normal.

comment:82 follow-up: Changed 3 years ago by robnitro@…

 Wait so newest patch is unstable again?? I did a speedtest with that script and get

Download: 34.48 Mbps
Upload: 7.57 Mbps

But it's a 50/10 line. I wonder if the patch doesn't work for me or if it's the server in that script?

If you have qos, set the max to 90% of your max download and 90% of your max upload. What kind of pings did the script report? You can also try the switch -n 2 (run fewer threads). It's hard to tell, but pre-patch, if I ran -n 1 or 2, it was much more lopsided than with the -n 4 default. E.g. I used to get 70/7 but now get 60/32 or 45/45 or so. My issue was that any load of half or more, e.g. 40 up or 40 down, would affect the traffic in the other direction, slowing it down to below 8 mbit.

@otheranon: I can confirm that ethtool can sometimes make things buggy... sorry for that advice.
And I don't think it's wise to mess with the tx ring buffer... even pre-patch, I could get the interface to die messing with that.

@otheranon, NBD probably is also asking you to explain what kind of connection you are using and what speed it uses.
For example I have dhcp only, 1000 full duplex to my fiber service (75/75).
You can safely run ethtool eth1 just to list what speed and duplex you have.

comment:83 in reply to: ↑ 82 Changed 3 years ago by anonymous

Replying to robnitro@…:

 Wait so newest patch is unstable again?? I did a speedtest with that script and get

Download: 34.48 Mbps
Upload: 7.57 Mbps

But it's a 50/10 line. I wonder if the patch doesn't work for me or if it's the server in that script?

If you have qos, set the max to 90% of your max download and 90% of your max upload. What kind of pings did the script report? You can also try the switch -n 2 (run fewer threads). It's hard to tell, but pre-patch, if I ran -n 1 or 2, it was much more lopsided than with the -n 4 default. E.g. I used to get 70/7 but now get 60/32 or 45/45 or so. My issue was that any load of half or more, e.g. 40 up or 40 down, would affect the traffic in the other direction, slowing it down to below 8 mbit.

@otheranon: I can confirm that ethtool can sometimes make things buggy... sorry for that advice.
And I don't think it's wise to mess with the tx ring buffer... even pre-patch, I could get the interface to die messing with that.

@otheranon, NBD probably is also asking you to explain what kind of connection you are using and what speed it uses.
For example I have dhcp only, 1000 full duplex to my fiber service (75/75).
You can safely run ethtool eth1 just to list what speed and duplex you have.

I'm connected at 1 gbps full duplex on wan and lan.

QoS doesn't make a difference. I'm getting worse latency at any throughput.

comment:84 Changed 3 years ago by anonymous

Other anon here. No QoS, no ethtool, just the patched image => latency problems / spikes / no stable low pings while browsing, under some load, or while routing. Switching back to the non-patched image => stable pings again. Device is wndr3800, double NATed via another router (static wan).

comment:85 Changed 3 years ago by anonymous

2014-09-09 02:02:40 Testing netperf.bufferbloat.net (ipv4) with 4 streams down and up while pinging gstatic.com. Takes about 60 seconds.
 Download:  9.45 Mbps
   Upload:  0.15 Mbps
  Latency: (in msec, 22 pings, 0.00% packet loss)
      Min: 114.518
    10pct: 114.788
   Median: 639.413
      Avg: 6775.876
    90pct: 19004.956
      Max: 25832.256

I don't think that's a good sign? I've read on the bufferbloat site that OpenWrt should prevent this from happening?

comment:86 Changed 3 years ago by anonymous

You have to have qos-scripts or sqm-scripts enabled and set to the right settings.

comment:87 Changed 3 years ago by anonymous

There is no qos or sqm on trunk anymore... why? How do I set up qos on a trunk build?

comment:88 Changed 3 years ago by dtaht

qos-scripts appears to be being built in barrier breaker.

http://downloads.openwrt.org/barrier_breaker/14.07-rc3/ar71xx/generic/packages/

for example has the qos-scripts in it. do a

opkg install qos-scripts luci-app-qos

comment:89 Changed 3 years ago by anonymous

Will qos scripts also work if you have many custom snat/vpn iptables rules and mangled routing like I have? Or is it just for a standard build with wan/lan scenario?

comment:90 Changed 3 years ago by anonymous

Those from BB don't work on trunk, btw:

kernel (= 3.10.49-1-94831e5bcf361d1c03e87a15e152b0e8) *
 * opkg_install_cmd: Cannot install package qos-scripts.

comment:91 Changed 3 years ago by robnitro@…

Qos won't change your forwarding/iptables rules.
http://downloads.openwrt.org/snapshots/trunk/ar71xx/packages/base/
It's here, strange.
Well, any dependencies failed? If so, you may need to install those first.
If it's a kmod- package, it's tricky: you have to use the file you compiled yourself.

Then, install luci-app-qos.

Inside qos, set 90% of your max download and 90% of your max upload.
Customize rules if you need. It will fix the latency issues! If it's still kind of high, try 85% of your max. Some lines need more "space"
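The 90% / 85% rule of thumb above is simple arithmetic. A small sketch follows, assuming the shaper expects its limits in kbit/s (check the unit your QoS package actually uses); the function name is made up for illustration.

```python
def shaper_limits(down_mbps, up_mbps, fraction=0.90):
    """Return (download, upload) shaper limits in kbit/s.

    down_mbps / up_mbps are the measured maximum line speeds in Mbit/s;
    fraction is the share of that speed to allow (0.90, or 0.85 if
    latency is still high). kbit/s as the output unit is an assumption.
    """
    return (round(down_mbps * 1000 * fraction),
            round(up_mbps * 1000 * fraction))

# A 50/10 line at 90%, and a 75/75 line at 85%.
print(shaper_limits(50, 10))
print(shaper_limits(75, 75, fraction=0.85))
```

The headroom keeps the queue inside the router, where the QoS scripts can manage it, instead of in the modem or at the ISP.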

comment:92 Changed 3 years ago by anonymous

There's still something wrong with these patched drivers, with or without QoS. It seems like a buffer runs full after a few seconds and then falls back to normal: pings work for a few seconds, go high, then return to normal, at the same intervals:

Reply from 193.99.144.80: bytes=32 time=14ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=23ms TTL=248
Reply from 193.99.144.80: bytes=32 time=51ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=45ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=19ms TTL=248
Reply from 193.99.144.80: bytes=32 time=17ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=52ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=48ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=14ms TTL=248
Reply from 193.99.144.80: bytes=32 time=542ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=34ms TTL=248
Reply from 193.99.144.80: bytes=32 time=48ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=47ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=44ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=44ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=44ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=50ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248

comment:93 Changed 3 years ago by robnitro@…

I just had a bug happen related to the eth1 interface.
Rebooted router after doing some wifi changes, to see if it sticks.

Booted up and no wan IP.
ethtool eth1 showed eth1 as 10mb/s half duplex ?!?!
So I ran ethtool -s eth1 autoneg off
ethtool -s eth1 autoneg on
Then the interface came to 1000 full duplex and worked.

Nothing with ethtool was being done, or changed.. just wifi, unrelated.

Something weird with the auto negotiation- like it will sometimes fail and default to a dead state.

Here's a log showing boot, and what happened after I hit "connect" in luci interfaces to try to see if it can get an IP (before ethtool autoneg mentioned above).

OpenWrt:~# dmesg|grep eth1
[    3.720000] eth1: Atheros AG71xx at 0xba000000, irq 5, mode:RGMII
[   21.090000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   21.980000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[  541.670000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
OpenWrt:~# logread |grep eth1
Tue Sep  9 19:48:25 2014 kern.info kernel: [    3.720000] eth1: Atheros AG71xx at 0xba000000, irq 5, mode:RGMII
Tue Sep  9 19:48:29 2014 kern.info kernel: [   21.090000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
Tue Sep  9 19:48:30 2014 kern.info kernel: [   21.980000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
Tue Sep  9 19:57:09 2014 kern.info kernel: [  541.670000] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready

comment:94 Changed 3 years ago by anonymous

@robnitro someone deleted my comment so I'm trying again: qos-scripts totally broke my routing, just like I thought it would; neither ipset nor mangled routing worked anymore with qos-scripts. I think it will only work in a normal wan/lan scenario without any custom iptables rules.

comment:95 follow-up: Changed 3 years ago by nbd

Please try r42457 (again after cleaning the kernel tree using make target/linux/clean)

comment:96 in reply to: ↑ 95 ; follow-up: Changed 3 years ago by anonymous

Replying to nbd:

Please try r42457 (again after cleaning the kernel tree using make target/linux/clean)

This seems to fix the issues I was having. TCP connections are getting established quicker.

I'm still getting some dropped packets and overruns on eth1:

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8131128 errors:0 dropped:123 overruns:118 frame:0
          TX packets:8784387 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

comment:97 in reply to: ↑ 96 ; follow-up: Changed 3 years ago by anonymous

Replying to anonymous:

Replying to nbd:

Please try r42457 (again after cleaning the kernel tree using make target/linux/clean)

This seems to fix the issues I was having. TCP connections are getting established quicker.

I'm still getting some dropped packets and overruns on eth1:

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8131128 errors:0 dropped:123 overruns:118 frame:0
          TX packets:8784387 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

eth0 is also seeing overruns

          RX packets:10274037 errors:0 dropped:0 overruns:1152 frame:0
          TX packets:10843485 errors:0 dropped:0 overruns:0 carrier:0

I didn't have QoS before these ag71xx updates. It's not enabled now either.
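To judge whether counters like these matter, it helps to put them in proportion to the packet count and to compare two samples taken some time apart. This is only a sketch: the helper names are made up, and the regex assumes the classic ifconfig layout pasted above.

```python
import re

# Matches the RX counter line printed by ifconfig, e.g.
# "RX packets:8131128 errors:0 dropped:123 overruns:118 frame:0"
RX_RE = re.compile(r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)")

def rx_counters(ifconfig_text):
    """Parse the RX counters out of ifconfig output for one interface."""
    match = RX_RE.search(ifconfig_text)
    if match is None:
        raise ValueError("no RX counter line found")
    packets, errors, dropped, overruns = (int(g) for g in match.groups())
    return {"packets": packets, "errors": errors,
            "dropped": dropped, "overruns": overruns}

def counter_delta(before, after):
    """Difference between two samples; zero deltas mean the counters are stable."""
    return {key: after[key] - before[key] for key in before}

# The eth1 RX line from the report above.
eth1 = rx_counters("RX packets:8131128 errors:0 dropped:123 overruns:118 frame:0")
print(f"overrun share: {eth1['overruns'] / eth1['packets']:.6%}")
```

Calling counter_delta() on a second sample taken a few minutes later, under load, shows whether the dropped/overrun numbers are still growing or were a one-off.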

comment:98 Changed 3 years ago by robnitro@…

weird, I have no issues (see below).
Only thing is one time it rebooted and came up with 10 mbit half duplex, dead wan connection. I put ethtool eth1 autoneg off , ethtool eth1 autoneg on in my rc.local.
[Maybe it's because I am running a 3.14 kernel config from Arokh in forum (optimized build for wndr)]
Maybe try to change your txqueuelen
ifconfig eth1 txqueuelen 5 (or 10 or 32, etc).
I don't know why mine comes up as 5.. maybe it's part of the qos script.

OpenWrt:~# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:BD:43:AC:9C:6F
          inet addr:72.69.213.125  Bcast:72.69.213.255  Mask:255.255.255.0
          inet6 addr: fe80::abd:43ff:feac:9c6f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1761988 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1454019 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:5
          RX bytes:1902667483 (1.7 GiB)  TX bytes:1504739477 (1.4 GiB)
          Interrupt:5

OpenWrt:~# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 0A:BD:43:AC:9C:6E
          inet6 addr: fe80::8bd:43ff:feac:9c6e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:201915 errors:0 dropped:7 overruns:3 frame:0
          TX packets:287705 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:36138944 (34.4 MiB)  TX bytes:374759513 (357.3 MiB)
          Interrupt:4

comment:99 Changed 3 years ago by robnitro@…

Sorry, that should be ethtool -s for the autoneg off and autoneg on commands...

comment:100 in reply to: ↑ 97 Changed 3 years ago by severn

Replying to anonymous:
Are the dropped/overrun numbers constantly increasing? Is the CPU maxed?

txqueuelen shouldn't matter as long as you're not getting TX overruns.

Check the eth0/eth1 TX ring size though (ethtool -g) and make sure they're not too small. Default is 192. Anything 6 or less is likely to cause hangs. If you want to lower it, take whatever number you used to have and triple it, since each full size packet takes up 3 descriptors now. I can reproduce ping spikes under load with a TX ring size of 8, but not with the default. If you're double-NATing, do the ping spikes happen when you ping your second router?
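The ring-size arithmetic described here can be written down as follows. The numbers (3 descriptors per full-size packet, default of 192, trouble at 6 or less) are taken from this comment; the function names are made up for the sketch.

```python
DESCRIPTORS_PER_FULL_PACKET = 3   # per this comment, after the patch
DEFAULT_TX_RING = 192             # default reported in this comment
HANG_THRESHOLD = 6                # "anything 6 or less is likely to cause hangs"

def scaled_tx_ring(old_size):
    """Ring size that holds as many full-size packets as old_size did pre-patch."""
    return old_size * DESCRIPTORS_PER_FULL_PACKET

def full_packets_in_ring(ring_size):
    """How many full-size packets fit in a ring of ring_size descriptors."""
    return ring_size // DESCRIPTORS_PER_FULL_PACKET

def likely_to_hang(ring_size):
    """True for ring sizes this comment flags as likely to cause hangs."""
    return ring_size <= HANG_THRESHOLD

print(scaled_tx_ring(32))                      # equivalent of a pre-patch ring of 32
print(full_packets_in_ring(DEFAULT_TX_RING))   # full-size packets in the default ring
print(full_packets_in_ring(8))                 # the size that reproduced ping spikes
```

Seen this way, a ring of 8 leaves room for only two full-size packets, which is consistent with the spikes reproduced at that size.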

I noticed that sometimes my WAN/eth1 port won't come up or will come up as 10/half, even before this change - usually fixes itself if I reset the port or replug the cable. TBH it just never bothered me enough to get around to filing a ticket for it...

comment:101 Changed 3 years ago by robnitro@…

severn, the 10 half can be fixed
opkg install ethtool

in rc.local
ethtool -s eth1 autoneg off
ethtool -s eth1 autoneg on

thanks again tons for the hard work you and nbd did!

comment:102 Changed 3 years ago by anonymous

Someone is deleting my comments all the time. There is no "-g" for my version of ethtool on my wndr3800. Also, I never changed anything from the defaults or used ethtool before. Yes, I am indeed double NATed, but I can't say for sure whether the ping spikes also occur to the 2nd router, because the numbers are much smaller, but it seems they are there:

64 bytes from 10.0.1.203: seq=0 ttl=64 time=0.693 ms ++
64 bytes from 10.0.1.203: seq=1 ttl=64 time=0.484 ms
64 bytes from 10.0.1.203: seq=2 ttl=64 time=0.542 ms ++
64 bytes from 10.0.1.203: seq=3 ttl=64 time=0.492 ms
64 bytes from 10.0.1.203: seq=4 ttl=64 time=0.496 ms
64 bytes from 10.0.1.203: seq=5 ttl=64 time=0.481 ms
64 bytes from 10.0.1.203: seq=6 ttl=64 time=0.558 ms ++
64 bytes from 10.0.1.203: seq=7 ttl=64 time=0.674 ms ++
64 bytes from 10.0.1.203: seq=8 ttl=64 time=0.474 ms
64 bytes from 10.0.1.203: seq=9 ttl=64 time=0.482 ms
64 bytes from 10.0.1.203: seq=10 ttl=64 time=0.479 ms
64 bytes from 10.0.1.203: seq=11 ttl=64 time=0.611 ms ++
64 bytes from 10.0.1.203: seq=12 ttl=64 time=0.481 ms
64 bytes from 10.0.1.203: seq=13 ttl=64 time=0.471 ms
64 bytes from 10.0.1.203: seq=14 ttl=64 time=0.785 ms ++
64 bytes from 10.0.1.203: seq=15 ttl=64 time=0.474 ms

comment:103 follow-up: Changed 3 years ago by nbd

Sub-millisecond latency variations are quite normal. Does anybody still have any issues relevant to this ticket, or can it be closed?

comment:104 Changed 3 years ago by anonymous

Nope. Ping spikes still happen with latest trunk. Every few pings there is a rise of 30-50%. And those sub-millisecond rises (for pings to the 2nd router on the LAN) are in the same proportion.

Reply from 193.99.144.80: bytes=32 time=34ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=31ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=32ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=34ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=14ms TTL=248
Reply from 193.99.144.80: bytes=32 time=14ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=36ms TTL=248 ++
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=15ms TTL=248
Reply from 193.99.144.80: bytes=32 time=16ms TTL=248
Reply from 193.99.144.80: bytes=32 time=33ms TTL=248 ++

comment:105 Changed 3 years ago by robnitro@…

from 16 to 32, ok, fine, it happens to me too sometimes, and I have no issues and get full speed... No idea what it is...
Are you pinging from a client or the router? If from the router, do it from a client PC instead, because on the router itself under load, the shell and userspace have lower priority than routing. I can run tests and get barely any ping spikes, but the router shell is so slow because the CPU maxes out!

Now run the same test with a build that doesn't have this patch... make sure the speeds come up the same, and see if you get the same....

So, pre patch, no ping spikes, netperfrunner.sh gets great pings too but guess what netperf and real life got? 0.5 download, 70 upload. Pings min 25 avg 30, max 50
Cpu use was around 25%.
No qos, same thing on speed but much higher pings min 45 avg 143 max 182
Router cpu never above 75% with qos on.

Post patch, now I get tiny spikes here and there (even with qos), but netperfrunner.sh can do 68 mbit down/ 35 mbit upload. pings min 26 avg 29 max 40 (wowza nice)
[CPU maxed out (overclocked to 800mhz too!)]
Qos off, 50 down/86 up pings min 44 avg 122 max 174
Cpu use w/ qos off, around 50%.

It also depends on your line, I have fiber, so even with qos off, my average ping doesn't go up much. When I had cable internet of max 10 down 0.5 up, it would have more problems with ping. Different physical systems have different issues of bloat (and why qos helps a lot- set to 85% or 90% of tested max speeds ul and dl).

comment:106 in reply to: ↑ 103 Changed 3 years ago by anonymous

Replying to nbd:

Sub-millisecond latency variations are quite normal. Does anybody still have any issues relevant to this ticket, or can it be closed?

I didn't require QoS and I wasn't using QoS before these patches. Post-patch, it seems QoS is absolutely necessary and it's still unable to restore performance to previous levels.

I wasn't getting dropped packets and overruns before. I've tried decreasing the tx ring size from 48 to 32, but that didn't seem to help either.

Ping replies are slightly more stable than with the previous broken patch, but they're still worse than before any of these patches.

I'm seeing dropped packets on the WAN and LAN interfaces, but they're not increasing. The dropped packets are probably not a big deal, but they might indicate there's a subtle problem.
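
One way to check whether drop counters are actually increasing is to compare two snapshots of /proc/net/dev taken some time apart. The Python sketch below assumes the standard /proc/net/dev column layout (the 4th receive column and the 4th transmit column are the drop counters). The helper names and the sample counter values are made up for illustration.

```python
def parse_drops(proc_net_dev_text):
    """Map interface name -> (rx_dropped, tx_dropped) from /proc/net/dev text."""
    drops = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 16:
            continue  # not a full counter line
        # Fields 0-7 are receive counters, 8-15 are transmit counters.
        drops[name.strip()] = (int(fields[3]), int(fields[11]))
    return drops


def drop_deltas(before_text, after_text):
    """Return interface -> (rx_drop_increase, tx_drop_increase)."""
    before = parse_drops(before_text)
    after = parse_drops(after_text)
    return {
        name: (after[name][0] - before[name][0], after[name][1] - before[name][1])
        for name in after
        if name in before
    }


HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast|"
    "bytes    packets errs drop fifo colls carrier compressed\n"
)
# Made-up snapshots: eth0 drops stay flat, eth1 gains two tx drops.
SAMPLE_BEFORE = HEADER + (
    "  eth0: 1000 10 0 5 0 0 0 0 2000 20 0 1 0 0 0 0\n"
    "  eth1: 3000 30 0 0 0 0 0 0 4000 40 0 0 0 0 0 0\n"
)
SAMPLE_AFTER = HEADER + (
    "  eth0: 1500 15 0 5 0 0 0 0 2600 26 0 1 0 0 0 0\n"
    "  eth1: 3900 39 0 0 0 0 0 0 4800 48 0 2 0 0 0 0\n"
)

for name, (rx, tx) in sorted(drop_deltas(SAMPLE_BEFORE, SAMPLE_AFTER).items()):
    print(name, "rx drops +%d" % rx, "tx drops +%d" % tx)
```

On a real device the two text snapshots would come from reading /proc/net/dev before and after a test transfer. A delta of zero on both counters means the drops happened earlier and are not tied to the current load.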

This is what kind of latency I was getting, sorted from the lowest latency to greatest:
before patching < current post-patch < previous broken patch

These patches have increased download bandwidth by 10 mbps, but latency has increased and this looks like bloated behavior.

I'm trying to find alternative hardware to replace my ar71xx-based device.

comment:107 follow-up: Changed 3 years ago by nbd

How much traffic are you pushing through the device (up and down) when the issue occurs? What is your maximum WAN connection speed?

comment:108 in reply to: ↑ 107 ; follow-up: Changed 3 years ago by anonymous

Replying to nbd:

How much traffic are you pushing through the device (up and down) when the issue occurs? What is your maximum WAN connection speed?

My WAN is connected at 1000 full duplex. My bandwidth is 300 mbps download and 100 mbps upload. The device can go up to 190-200 mbps download at 100% CPU after patching. It was able to do 180-190 mbps before, but there were no problems with latency and packets dropped at the router level on the network interface.

Do you think there's something I should try to figure out if this is really something caused by the driver change and not by the increased traffic?

speedof.me seems to be having a strange problem with the upload test, it doesn't work at all sometimes. I'm not sure if this is a network related problem or something related to the ag71xx driver changes.

I've been performing these tests with speedof.me, speedtest.net and downloads from local FTPs.

comment:109 in reply to: ↑ 108 Changed 3 years ago by anonymous

Replying to anonymous:

Replying to nbd:

How much traffic are you pushing through the device (up and down) when the issue occurs? What is your maximum WAN connection speed?

My WAN is connected at 1000 full duplex. My bandwidth is 300 mbps download and 100 mbps upload. The device can go up to 190-200 mbps download at 100% CPU after patching. It was able to do 180-190 mbps before, but there were no problems with latency and packets dropped at the router level on the network interface.

Do you think there's something I should try to figure out if this is really something caused by the driver change and not by the increased traffic?

speedof.me seems to be having a strange problem with the upload test, it doesn't work at all sometimes. I'm not sure if this is a network related problem or something related to the ag71xx driver changes.

I've been performing these tests with speedof.me, speedtest.net and downloads from local FTPs.

I think there might also be a problem with QoS on PPPoE devices. QoS on bare devices seems to be much better than on PPPoE on top of an ethernet device.

Latency and QoS seem worse for upload than for download. Pinging a remote host without QoS on the router from another machine on the LAN during the download phase of speedtest.net yields ping reply times of 3-8ms. Pinging from the same machine during an upload yields ping reply times of 6-50ms at 100mbps without any downstream traffic.

comment:110 follow-ups: Changed 3 years ago by anonymous

I am sorry that I don't have an account on here; it's hard for you guys to know which real person you are talking to, right? I will see if I can register in time. Anyway...

1.) Those "doubled pings", latency spiking from 12ms to about 3x ms, are new since this patch was applied; pings were stable before.

2.) It only happens when there is some decent load on the line, but nothing near the maximum. I have a 12/1 line, and the spikes begin when I, for example, open a twitch.tv stream, which uses about 3mbit. I have checked this with bmon too.

3.) Those ping spikes can be observed on the device itself (wndr3800) and behind it, for example from a PC connected to the 3800's switch.

4.) I don't use any QoS on the wndr3800 itself. It is connected via WAN to another OpenWRT router (wbmr-hp-g300h), so it is double NATed.

5.) I am using mangled routing on the wndr too; maybe that's part of the problem. But the CPU of the router never reaches anything critical. For example, I use IPSet DHCP rules to route twitch traffic through my wan device, because all other traffic is going through an OpenVPN tunnel.

comment:111 in reply to: ↑ 110 Changed 3 years ago by hnyman

Replying to anonymous:

I am sorry that I don't have an account on here; it's hard for you guys to know which real person you are talking to, right? I will see if I can register in time. Anyway...

You don't need a proper account here to identify yourself. You can just enter your email address there.

Having a conversation is much easier if all participants have unique identifiers. This conversation is rather funny but must be frustrating for the devs, as there are so many "anonymous" users being active and also having questions targeted to them.

comment:112 in reply to: ↑ 110 Changed 3 years ago by notanon

Replying to anonymous:

I am sorry that I don't have an account on here; it's hard for you guys to know which real person you are talking to, right? I will see if I can register in time. Anyway...

1.) Those "doubled pings", latency spiking from 12ms to about 3x ms, are new since this patch was applied; pings were stable before.

2.) It only happens when there is some decent load on the line, but nothing near the maximum. I have a 12/1 line, and the spikes begin when I, for example, open a twitch.tv stream, which uses about 3mbit. I have checked this with bmon too.

3.) Those ping spikes can be observed on the device itself (wndr3800) and behind it, for example from a PC connected to the 3800's switch.

4.) I don't use any QoS on the wndr3800 itself. It is connected via WAN to another OpenWRT router (wbmr-hp-g300h), so it is double NATed.

5.) I am using mangled routing on the wndr too; maybe that's part of the problem. But the CPU of the router never reaches anything critical. For example, I use IPSet DHCP rules to route twitch traffic through my wan device, because all other traffic is going through an OpenVPN tunnel.

I'm the one who posted regarding the latency, QoS, PPPoE and tx ring sizes (the one who replied to nbd a lot).

I'm not seeing the issue with twitch, but connections seem to be establishing slower than before. It feels like some of the snappiness has disappeared. Everything seemed to load instantly before the patch.

comment:113 Changed 3 years ago by notanon

@nbd Here's what happens when the connection drops

:23 2014 kern.info kernel: [560359.318000] eth0: link down
:23 2014 daemon.notice netifd: Network device 'eth0' link is down
:23 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss
:23 2014 daemon.info pppd[1026]: Terminating on signal 15
:23 2014 daemon.info pppd[1026]: Connect time 9338.9 minutes.
:23 2014 daemon.info pppd[1026]: Sent 2302432708 bytes, received 932410639 bytes.
:23 2014 daemon.err miniupnpd[1455]: ioctl(s, SIOCGIFADDR, ...): Cannot assign requested address
:23 2014 daemon.err miniupnpd[1455]: Failed to get IP for interface pppoe-wan
:23 2014 daemon.warn miniupnpd[1455]: SendNATPMPPublicAddressChangeNotification: cannot get public IP address, stopping
:23 2014 daemon.notice netifd: Network device 'pppoe-wan' link is down
:26 2014 kern.debug kernel: [560362.412000] ar71xx: pll_reg 0xb8050010: 0x110000
:26 2014 kern.info kernel: [560362.412000] eth0: link up (1000Mbps/Full duplex)
:26 2014 daemon.notice netifd: Network device 'eth0' link is up
:26 2014 daemon.notice netifd: Interface 'wan' has link connectivity 
:26 2014 daemon.notice netifd: Interface 'wan' is setting up now
:26 2014 daemon.notice pppd[1026]: Connection terminated.
:26 2014 daemon.info pppd[1026]: Connect time 9338.9 minutes.
:26 2014 daemon.info pppd[1026]: Sent 2302432708 bytes, received 932410639 bytes.
:26 2014 daemon.info pppd[1026]: Exit.
:27 2014 daemon.notice netifd: Interface 'wan' is now down
:27 2014 kern.info kernel: [560362.572000] eth0: link down
:27 2014 daemon.notice netifd: Interface 'wan' is disabled
:27 2014 kern.info kernel: [560362.578000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
:27 2014 daemon.notice netifd: Interface 'wan' is enabled
:27 2014 daemon.notice netifd: Interface 'wan' is setting up now
:27 2014 daemon.notice netifd: Network device 'eth0' link is down
:27 2014 daemon.notice netifd: Interface 'wan' has link connectivity loss
:27 2014 daemon.info pppd[3669]: Plugin rp-pppoe.so loaded.
:27 2014 daemon.info pppd[3669]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
:27 2014 daemon.notice pppd[3669]: pppd 2.4.7 started by root, uid 0
:27 2014 kern.debug kernel: [560363.502000] ar71xx: pll_reg 0xb8050010: 0x110000
:27 2014 kern.info kernel: [560363.502000] eth0: link up (1000Mbps/Full duplex)
:27 2014 kern.info kernel: [560363.512000] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
:27 2014 daemon.notice netifd: Interface 'wan' has link connectivity 
:29 2014 daemon.warn dnsmasq[2115]: no servers found in /tmp/resolv.conf.auto, will retry
:32 2014 daemon.info pppd[3669]: PPP session is 1625

Packet loss and failing to open a TCP connection are still issues I'm dealing with.
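
To put a number on how long the link was down in a log like the one above, the kernel timestamps in square brackets can be paired up. This is a minimal Python sketch, assuming the "ethX: link down" / "ethX: link up" message format shown in the log. The function name `link_outages` is made up, and the sample lines are abbreviated copies of the kernel lines above.

```python
import re

# Matches e.g. "[560359.318000] eth0: link down" and captures time, iface, state.
LINK_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(\w+): link (down|up)")


def link_outages(log_lines, iface="eth0"):
    """Return a list of (down_timestamp, seconds_until_link_up) for iface."""
    outages = []
    down_at = None
    for line in log_lines:
        match = LINK_RE.search(line)
        if not match or match.group(2) != iface:
            continue
        stamp = float(match.group(1))
        if match.group(3) == "down":
            if down_at is None:
                down_at = stamp  # remember the first "down" of this outage
        elif down_at is not None:
            outages.append((down_at, stamp - down_at))
            down_at = None
    return outages


SAMPLE_LOG = [
    "kern.info kernel: [560359.318000] eth0: link down",
    "kern.info kernel: [560362.412000] eth0: link up (1000Mbps/Full duplex)",
    "kern.info kernel: [560362.572000] eth0: link down",
    "kern.info kernel: [560362.578000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready",
    "kern.info kernel: [560363.502000] eth0: link up (1000Mbps/Full duplex)",
]

for down_at, duration in link_outages(SAMPLE_LOG):
    print("link down at %.3f for %.3f s" % (down_at, duration))
```

The durations only reflect when the kernel logged the link events. They say nothing about how long it took pppd to bring the PPPoE session back up afterwards.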

comment:114 follow-up: Changed 3 years ago by robnitro@…

Confusing, why is eth0 your wan port? On my wndr3800, eth0 is connected to the 4 port switch.

I've always had problems messing with ring sizes... are you changing it with ethtool? If so, don't.
Only thing I have in my rc.local is
ethtool -s eth1 autoneg off
ethtool -s eth1 autoneg on
Because the 3.14 arokh kernel builds I use have this bug where sometimes the wan port doesn't get a proper negotiation at boot.

comment:115 in reply to: ↑ 114 Changed 3 years ago by notanon

Replying to robnitro@…:

Confusing, why is eth0 your wan port? On my wndr3800, eth0 is connected to the 4 port switch.

I've always had problems messing with ring sizes... are you changing it with ethtool? If so, don't.
Only thing I have in my rc.local is
ethtool -s eth1 autoneg off
ethtool -s eth1 autoneg on
Because the 3.14 arokh kernel builds I use have this bug where sometimes the wan port doesn't get a proper negotiation at boot.

I'm not changing any ring size. It's eth0 because it's a different ar71xx device.

comment:116 Changed 3 years ago by nbd

  • Resolution set to worksforme
  • Status changed from reopened to closed

should be fixed in current versions

comment:117 Changed 3 years ago by anonymous

  • Resolution worksforme deleted
  • Status changed from closed to reopened

I have installed BB on a WNDR3800. Switch-to-switch traffic within a VLAN (so eth0.1 -> eth0.1) is slow as hell (1Mbit max). I'm currently doing a test transfer at 380kiB/s.
If BB is rebooted, a disconnect occurs, but then the speed jumps up to 80MiB/s, only to drop back the second bootup is complete.

(this was also posted to ticket #14120 as i'm not sure whether it is this issue or the other one)

comment:118 Changed 3 years ago by berni@…

Same here. Buffalo WZR-AG300H freshly upgraded to 15.05-RC1 (same issue with 14.07). I have Dual-WAN, with one PPPoE connected to WAN (eth1), and a cable modem connected to a port in another VLAN on eth0. Routing from eth0.100 to eth0.1 is bad (packet loss starts at around 6Mbit/s throughput), while performance from eth1 (PPPoE) to eth0.1 is fine.

curl running directly on the OpenWRT router and downloading something through eth0.100 achieves wire speed for the link.

comment:119 Changed 3 years ago by berni@…

Additional info, eth0.100 to wlan0 is fine as well, so it is not limited by routing per se.
