
Opened 7 years ago

Last modified 4 years ago

#9689 reopened defect

AG7240 Huge packet loss on light traffic

Reported by: p.titera@… Owned by: juhosg
Priority: normal Milestone: Barrier Breaker 14.07
Component: kernel Version: Trunk
Keywords: Cc:

Description

I have problems with a TP-LINK TL-WR741ND. With current trunk I see huge packet loss on ping (up to 40%). I've traced it to a wrong destination MAC address on received packets: the first two bytes of the destination address are replaced by FF:FF. It looks like this (sniffed on the WAN port of the router):

20:26:16.875058 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 43, length 40
20:26:16.960766 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 43, length 40
20:26:21.862439 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 44, length 40
20:26:21.889128 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 44, length 40
20:26:22.868478 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 45, length 40
20:26:22.953214 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 45, length 40
20:26:27.861793 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 46, length 40
20:26:27.911830 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 46, length 40
20:26:32.862060 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 47, length 40
20:26:32.888729 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 47, length 40
20:26:33.869147 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 48, length 40
20:26:33.931927 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 48, length 40
20:26:38.866044 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 49, length 40
20:26:38.883814 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 49, length 40
20:26:39.867554 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.92.149.98: ICMP echo request, id 1, seq 50, length 40
20:26:39.882871 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 50, length 40

As you can see, from time to time a packet has the wrong address. I've sniffed the traffic before the WAN port of the router and it seems to be correct there. The other point is that I see those packets even when I don't put tcpdump into promiscuous mode, so I think the packets are correct on the media layer but somehow get modified after (or during) receive.

The problem is greatly reduced when the router is under load (in that case I see only about 4% packet loss to the same target).
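To put a number on the symptom, a capture like the one above can be scanned for echo replies whose destination MAC is not the router's. This is a minimal sketch, assuming `tcpdump -e` output in the format shown in this ticket; the function name `corrupted_replies` and the `sample` list are illustrative, not part of any tool.

```python
import re

# Matches the link-level header that `tcpdump -e` prints: "src > dst, ethertype ..."
FRAME_RE = re.compile(
    r"(?P<src>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) > "
    r"(?P<dst>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), ethertype"
)

def corrupted_replies(lines, router_mac):
    """Return (corrupted, total) counts for ICMP echo replies in the capture.

    A reply counts as corrupted when its destination MAC is not router_mac.
    """
    corrupted = total = 0
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None or "ICMP echo reply" not in line:
            continue
        total += 1
        if match.group("dst") != router_mac.lower():
            corrupted += 1
    return corrupted, total

# Two reply lines copied from the capture in the description.
sample = [
    "20:26:16.960766 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 43, length 40",
    "20:26:21.889128 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.92.149.98 > 81.200.57.13: ICMP echo reply, id 1, seq 44, length 40",
]
print(corrupted_replies(sample, "94:0c:6d:fc:13:20"))  # (1, 2)
```

Running it over a longer capture at idle and under load would give comparable ratios for the two cases described above.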

Attachments (0)

Change History (25)

comment:1 Changed 7 years ago by jow

  • Owner changed from developers to juhosg
  • Status changed from new to assigned

comment:2 Changed 7 years ago by anonymous

Yes! I have the same problem on a TP741. Wireless may be OK, but when using the LAN over wire I sometimes get wrong packets. At that time, the wired PC can ping any client linked to the router via wireless.

comment:3 Changed 7 years ago by p.titera@…

Just one more point, in case it helps. I have just tested it on the internal bridge and it seems to be working correctly there. That is, I do not see any packets with incorrect destination MAC addresses.

It seems to me that there could be an alignment problem between the packet buffer and the DMA controller.

comment:4 follow-up: Changed 7 years ago by nbd

please try applying http://nbd.name/ag7240-align.patch onto your kernel tree and check if it helps with this issue

comment:5 Changed 7 years ago by Petr Titera <p.titera@…>

Thanks for the tip. I will do this ASAP (in fact I already have a new image built), but I have to keep the router up for now. Will keep you informed.

comment:6 follow-up: Changed 7 years ago by Petr Titera <p.titera@…>

I finally got to test the patch from http://nbd.name/ag7240-align.patch. It's better and worse. Better, because the bug does not happen as often now; worse, because when it happens, four bytes of the MAC address are wrong:

13:23:51.919305 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 98: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 19150, seq 31, length 64
13:23:51.934581 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 98: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 19150, seq 31, length 64
13:23:52.920904 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 98: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 19150, seq 32, length 64
13:23:52.983756 00:01:5c:32:f3:c1 > ff:ff:ff:ff:13:20, ethertype IPv4 (0x0800), length 98: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 19150, seq 32, length 64
13:23:53.921181 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 98: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 19150, seq 33, length 64
13:23:53.934619 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 98: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 19150, seq 33, length 64
13:23:54.922865 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 98: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 19150, seq 34, length 64
13:23:54.939459 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 98: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 19150, seq 34, length 64
13:23:55.924697 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 98: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 19150, seq 35, length 64
13:23:55.937926 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 98: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 19150, seq 35, length 64

Packet loss is down to something like 10-15% from 30-40%.
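The difference between the two captures (two corrupted bytes before the patch, four after) can be checked mechanically. This is a small sketch; `corrupted_prefix_len` is a hypothetical helper name, and it assumes the damage always starts at the first byte, as in the traces in this ticket.

```python
def corrupted_prefix_len(expected_mac, seen_mac):
    """Count how many leading bytes of seen_mac differ from expected_mac."""
    expected = expected_mac.lower().split(":")
    seen = seen_mac.lower().split(":")
    count = 0
    for want, got in zip(expected, seen):
        if want == got:
            break  # stop at the first byte that is still intact
        count += 1
    return count

print(corrupted_prefix_len("94:0c:6d:fc:13:20", "ff:ff:6d:fc:13:20"))  # 2
print(corrupted_prefix_len("94:0c:6d:fc:13:20", "ff:ff:ff:ff:13:20"))  # 4
```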

comment:7 in reply to: ↑ 6 Changed 7 years ago by Petr Titera <p.titera@…>

OK, scratch the part about this bug happening less often. After some time it seems to be more or less the same as before. Maybe a little better, but definitely not solved.

comment:8 in reply to: ↑ 4 Changed 7 years ago by Petr Titera <p.titera@…>

Replying to nbd:

please try applying http://nbd.name/ag7240-align.patch onto your kernel tree and check if it helps with this issue

Another point: it apparently breaks DHCP, or maybe broadcasts in general, for wired clients. Wireless is OK. I've managed to cut my entire home network off from the internet.

comment:9 Changed 7 years ago by Petr Titera <p.titera@…>

Luckily I had kept an older firmware image. I'm now on r25835 and everything seems to be working, so the bug was introduced in the last 4 months.

comment:10 follow-up: Changed 7 years ago by nbd

broadcast issues fixed in r27705, r27706. other issues might be fixed by the commits before those.

please test latest trunk or backfire

comment:11 in reply to: ↑ 10 Changed 7 years ago by Petr Titera <p.titera@…>

Replying to nbd:

broadcast issues fixed in r27705, r27706. other issues might be fixed by the commits before those.

please test latest trunk or backfire

Unfortunately I must inform you that on r27711 I still cannot get DHCP over a wired connection (no problem with wireless), and I still see packet loss on a simple ping (tcpdump still shows the same behavior as reported). Is there anything I can do to help?

22:33:20.981087 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 1, seq 25356, length 40
22:33:20.990785 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 173: 89.187.236.173 > 81.200.57.13: ICMP 89.187.236.173 udp port 39857 unreachable, length 139
22:33:21.047234 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 1, seq 25356, length 40
22:33:21.054725 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 173: 91.122.174.24 > 81.200.57.13: ICMP 91.122.174.24 udp port 24909 unreachable, length 139
22:33:21.983174 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 1, seq 25357, length 40
22:33:22.006931 00:01:5c:32:f3:c1 > ff:ff:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 1, seq 25357, length 40
22:33:26.978446 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 74: 81.200.57.13 > 81.200.57.226: ICMP echo request, id 1, seq 25358, length 40
22:33:27.044801 00:01:5c:32:f3:c1 > 94:0c:6d:fc:13:20, ethertype IPv4 (0x0800), length 74: 81.200.57.226 > 81.200.57.13: ICMP echo reply, id 1, seq 25358, length 40
22:33:27.171647 94:0c:6d:fc:13:20 > 00:01:5c:32:f3:c1, ethertype IPv4 (0x0800), length 118: 81.200.57.13 > 195.212.29.180: ICMP 81.200.57.13 udp port 123 unreachable, length 84

Petr Titera

comment:12 follow-up: Changed 7 years ago by nbd

Other people have reported that my commits fix at least the broadcast issues. Did you clean your kernel tree after updating?

comment:13 in reply to: ↑ 12 ; follow-up: Changed 7 years ago by Petr Titera <p.titera@…>

Replying to nbd:

Other people have reported that my commits fix at least the broadcast issues. Did you clean your kernel tree after updating?

Yes, but I can try to build from scratch again. I've replaced this router with another one so I can test without disrupting my internet connection.

Petr

comment:14 in reply to: ↑ 13 Changed 7 years ago by Petr Titera <p.titera@…>

Replying to Petr Titera <p.titera@…>:

Replying to nbd:

Other people have reported that my commits fix at least the broadcast issues. Did you clean your kernel tree after updating?

Yes, but I can try to build from scratch again. I've replaced this router with another one so I can test without disrupting my internet connection.

Petr

So I've tested it once more. I've tested current backfire and current trunk and both seem to be working correctly. I don't know why it did not work the first time. So I think you may consider this solved.

Petr

comment:15 Changed 7 years ago by nbd

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:16 Changed 7 years ago by Cheng Li <hanshuiys@…>

  • Resolution fixed deleted
  • Status changed from closed to reopened

I have a WR941NDv5 device, which seems to be equivalent to WR941NDv4. I use the WR941NDv4 profile and it works fine on r25690 trunk.

However, when I upgraded to r27844 I got a similar result: huge packet loss on the wan interface. With tcpdump-mini I found that in some packets the first 2 bytes of my mac address are filled with random stuff. Here is a capture log from the wan interface of the router, exported by tcpdump & wireshark:

 No.     Time        Source                Destination           Protocol Info                                                            Hw Source             Hw Dest
    826 10.340733   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12149/29999, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
    827 10.341428   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12149/29999, ttl=255) Cisco_cb:18:40        Tp-LinkT_4a:9f:1c
    897 11.342784   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12150/30255, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
    899 11.345987   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12150/30255, ttl=255) Cisco_cb:18:40        Meiko_4a:9f:1c
   1027 12.822859   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12151/30511, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
   1029 12.844735   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12151/30511, ttl=255) Cisco_cb:18:40        AcsipTec_4a:9f:1c
   1151 14.323969   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12152/30767, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
   1152 14.324287   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12152/30767, ttl=255) Cisco_cb:18:40        Tp-LinkT_4a:9f:1c
   1278 15.326055   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12153/31023, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
   1279 15.326401   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12153/31023, ttl=255) Cisco_cb:18:40        Tp-LinkT_4a:9f:1c
   1481 16.328118   10.111.7.109          10.111.7.1            ICMP     Echo (ping) request  (id=0x0001, seq(be/le)=12154/31279, ttl=63) Tp-LinkT_4a:9f:1c     Cisco_cb:18:40
   1482 16.328586   10.111.7.1            10.111.7.109          ICMP     Echo (ping) reply    (id=0x0001, seq(be/le)=12154/31279, ttl=255) Cisco_cb:18:40        Tp-LinkT_4a:9f:1c

my real mac address is 54:e6:fc:4a:9f:1c, while in some packets 54:e6 is replaced by random bytes (Meiko_4a:9f:1c or AcsipTec_4a:9f:1c above).
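Wireshark resolves the first three bytes of a MAC address to a vendor label and prints the remaining three bytes after an underscore, so a damaged prefix shows up as an unexpected vendor with the familiar suffix. The sketch below classifies such resolved names; `classify` and its return strings are illustrative, and the vendor label `Tp-LinkT` is simply taken from the capture above.

```python
def classify(resolved_name, real_mac, real_vendor="Tp-LinkT"):
    """Classify a Wireshark-style 'Vendor_xx:yy:zz' name against our own MAC."""
    vendor, _, suffix = resolved_name.rpartition("_")
    own_suffix = ":".join(real_mac.lower().split(":")[3:])  # last three bytes
    if suffix.lower() != own_suffix:
        return "other host"
    return "intact" if vendor == real_vendor else "corrupted prefix"

REAL_MAC = "54:e6:fc:4a:9f:1c"
for name in ("Tp-LinkT_4a:9f:1c", "Meiko_4a:9f:1c", "AcsipTec_4a:9f:1c", "Cisco_cb:18:40"):
    print(name, "->", classify(name, REAL_MAC))
```

A suffix match is only a heuristic: another device could share the same last three bytes, so the result is a hint rather than proof.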

if mac address in reply packet is incorrect, the ping program will not see the reply, even though tcpdump can see it. putting eth1 into promisc mode does not solve the problem.

another strange behavior is, even if eth1 is not in promisc mode, tcpdump can see all packets regardless of the destination mac address (i use tcpdump -p).

i have tried to use binary search to figure out which commit caused this problem, and found that the problem already existed in r26716.
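The binary search over revisions mentioned above can be sketched as follows. The bounds r25690 (reported working) and r27844 (reported broken) come from this comment; the `is_bad` predicate is a hypothetical stand-in for building a revision, flashing it, and watching for packet loss, and its threshold is made up purely for demonstration.

```python
def first_bad_revision(good, bad, is_bad):
    """Binary-search the revision range (good, bad] for the first failing build.

    Assumes `good` passes, `bad` fails, and that every revision after the
    first bad one also fails.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # failure is at mid or earlier
        else:
            good = mid  # failure was introduced after mid
    return bad

# Hypothetical test predicate; the real check is a manual build-and-ping cycle.
def is_bad(rev):
    return rev >= 26000  # made-up threshold, not a finding from this ticket

print(first_bad_revision(25690, 27844, is_bad))  # 26000 with the made-up predicate
```

Because the bug here shows up only after some minutes of uptime, a revision that looks good in a short test may still be bad, so each step would need a long enough soak before it is marked as passing.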

comment:17 follow-up: Changed 7 years ago by nbd

how easy is it for you to reproduce this issue, how frequently does it happen?
i can't reproduce it on my own devices, so if i come up with any patches, can you test them for me?

comment:18 in reply to: ↑ 17 Changed 7 years ago by Cheng Li <hanshuiys@…>

Replying to nbd:

how easy is it for you to reproduce this issue, how frequently does it happen?
i can't reproduce it on my own devices, so if i come up with any patches, can you test them for me?

it always happens several minutes after boot. if you have any patches, i would like to test them for you on my device.

ps. i am not sure if i need to try the above ag7240-align.patch. i can attach more information here for my device if you need.

comment:19 follow-ups: Changed 7 years ago by nbd

  • Resolution set to fixed
  • Status changed from reopened to closed

fixed in r27894-r27897

comment:20 in reply to: ↑ 19 Changed 6 years ago by Cheng Li <hanshuiys@…>

Replying to nbd:

fixed in r27894-r27897

it works on my device. thank you very much :)

comment:21 in reply to: ↑ 19 Changed 6 years ago by Cheng Li <hanshuiys@…>

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to nbd:

fixed in r27894-r27897

maybe the bug is not completely fixed. the router behaves normally just after boot, but the bug still appears after several hours on my device.

i don't know what information i need to upload for you to debug. it looks like a mysterious bug, and it is a little harder to reproduce than before, because i need to wait longer after boot for it to appear.

comment:22 Changed 6 years ago by nbd

OK, looks like I need to write some code to detect the DMA glitches and do a quick reset. I'll let you know when I have a new patch for testing.

comment:23 follow-up: Changed 6 years ago by nbd

Please try current trunk (r27975), as usual: this requires cleaning the kernel tree after updating.

With these changes it should properly detect DMA issues and do a fast reset. When this happens, the link status will change to down and then right back up again, you can see this in your logs.
Please run this until this link state change happens (maybe a few times) and see if the latency is still normal afterwards.

comment:24 in reply to: ↑ 23 Changed 6 years ago by Cheng Li <hanshuiys@…>

Replying to nbd:

Please try current trunk (r27975), as usual: this requires cleaning the kernel tree after updating.

With these changes it should properly detect DMA issues and do a fast reset. When this happens, the link status will change to down and then right back up again, you can see this in your logs.
Please run this until this link state change happens (maybe a few times) and see if the latency is still normal afterwards.

on r27979 the problem still exists on my device. packet loss happened before i could see any link state changes in syslog (i expected messages such as "kernel: eth1: link up").
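Checking the logs for the expected down/up pair can be automated. This is a rough sketch that assumes messages shaped like the "kernel: eth1: link up" example above; the exact log format on a given build may differ, and the sample lines are invented for illustration, not taken from a real device.

```python
import re

# Assumed message shape, based on the "kernel: eth1: link up" example;
# real kernel log lines may carry extra fields after the state.
LINK_RE = re.compile(r"kernel: (?P<iface>\w+): link (?P<state>up|down)")

def link_events(log_lines, iface="eth1"):
    """Return the ordered list of 'up'/'down' states logged for iface."""
    events = []
    for line in log_lines:
        match = LINK_RE.search(line)
        if match and match.group("iface") == iface:
            events.append(match.group("state"))
    return events

def count_fast_resets(events):
    """Count 'down' events that are immediately followed by 'up'."""
    return sum(1 for a, b in zip(events, events[1:]) if (a, b) == ("down", "up"))

# Invented sample lines, for illustration only.
sample_log = [
    "Jan  1 00:10:01 OpenWrt kernel: eth1: link down",
    "Jan  1 00:10:01 OpenWrt kernel: eth1: link up",
    "Jan  1 00:12:30 OpenWrt kernel: eth0: link up",
]
print(count_fast_resets(link_events(sample_log)))  # 1
```

An empty result while packet loss is already visible would match the observation in this comment that no link state change was logged.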

I can try to set up a proxy to the SSH server of my device for you, if it is helpful for debugging.

comment:25 Changed 4 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted
