Opened 5 years ago
Last modified 5 years ago
#13612 new defect
Client on wndr3700 suddenly stops receiving ARP replies from LAN side
Reported by: | syzop@… | Owned by: | developers |
---|---|---|---|
Priority: | normal | Milestone: | Chaos Calmer 15.05 |
Component: | packages | Version: | Trunk |
Keywords: | Cc: |
Description
==[ SUMMARY ]==
For some reason my WNDR3700 access points occasionally stop forwarding ARP replies from a server on the LAN side to one particular (wireless) client. It's not just ARP: all Ethernet traffic destined for that (wireless) client is affected.
Which client (laptop) gets affected is random. Other clients on the same AP are (almost?) always unaffected. A laptop with identical hardware sitting next to the affected one works perfectly fine.
Bringing wifi down & up doesn't fix the issue.
See below for more information.
==[ NETWORK MAP ]==
Laptop )wifi)) AP <-wired-> SWITCH <-wired-> SERVER
                              `---- SERVER2
==[ TYPE OF NETWORK ]==
I have 14 of these APs. All kinds of laptops (different brands, etc.) experience this issue, and it happens with all the APs (all of which are WNDR3700).
Around 350 clients in total are associated with the 14 APs at any given time. There are many (really MANY) associate/disassociate events at certain times, as this is a high school and people move from one place to another multiple times a day. Just in case it matters...
==[ WHAT WORKS ]==
- Pinging from laptop to AP (and vice versa)
- Pinging from AP to SERVER & SERVER2
- Traffic from laptops on the same AP to SERVER & SERVER2
==[ WHAT DOESN'T WORK ]==
From affected client (laptop):
- Pinging SERVER (10.0.0.1)
- Pinging SERVER2
- Any traffic from SERVER to LAPTOP
==[ TCPDUMP @ ACCESS POINT ]==
11:31:52.506153 08:3e:8e:a2:f2:2d > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1, p 0, ethertype ARP, Request who-has 10.0.0.1 tell 10.0.4.6, length 28
11:31:53.506151 08:3e:8e:a2:f2:2d > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 1, p 0, ethertype ARP, Request who-has 10.0.0.1 tell 10.0.4.6, length 28
==[ TCPDUMP @ SERVER ]==
11:34:03.221254 08:3e:8e:a2:f2:2d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.1 tell 10.0.4.6, length 46
11:34:03.221265 00:01:03:c1:d4:30 > 08:3e:8e:a2:f2:2d, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 10.0.0.1 is-at 00:01:03:c1:d4:30, length 28
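A rough way to compare the two captures is to count ARP requests and replies on each side. The sketch below assumes each capture was saved as text (for example from something like `tcpdump -e -n arp`; the exact invocation used above is not stated), and `count_arp` is a hypothetical helper name, not an existing tool.

```shell
# Count ARP requests and replies in a saved tcpdump text capture.
# Assumption: the file holds lines in the format shown above.
count_arp() {
    # $1: path to the text capture
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    requests=$(grep -c 'Request who-has' "$1" || true)
    replies=$(grep -c 'Reply .* is-at' "$1" || true)
    echo "requests=$requests replies=$replies"
}

# Usage (hypothetical file names):
#   count_arp ap.txt
#   count_arp server.txt
```

Requests with zero replies at the AP, against matching requests and replies at the server, would point at the path between the switch and the AP rather than at the server.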
==[ JUST ARP? NO ]==
Actually it's not just ARP. When I manually add the MAC address of SERVER on the wifi client, the ARP request/reply trouble goes away, but traffic still doesn't work.
For example, when pinging from LAPTOP to SERVER, I see only ping requests at the AP, but both ping requests and replies at SERVER. That is very similar to the ARP story, and it proves that ARP itself is not the problem.
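For anyone repeating this test, a static neighbour entry can be pinned roughly as follows. This is a sketch that assumes a Linux client with iproute2 and a wireless interface called `wlan0`; the report does not say which OS or tool was used. `pin_neighbour` is a hypothetical helper that only prints the command unless `RUN=1` is set.

```shell
# Pin a static ARP (neighbour) entry so the client needs no ARP exchange.
# Assumptions: Linux client with iproute2; interface name is a guess.
pin_neighbour() {
    # $1: server IP, $2: server MAC, $3: client interface
    cmd="ip neigh replace $1 lladdr $2 dev $3 nud permanent"
    if [ "${RUN:-0}" = 1 ]; then
        $cmd          # needs root
    else
        echo "$cmd"   # dry run: just show what would be executed
    fi
}

# Usage, with the addresses from the server-side capture:
#   pin_neighbour 10.0.0.1 00:01:03:c1:d4:30 wlan0
```

If pings still fail with the entry in place, ARP resolution itself is ruled out.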
==[ WHAT I'VE TRIED ]==
What works:
- Rebooting the access point
- /etc/init.d/network restart
What does not work:
- kill hostapd and restart wifi (or through wifi down & wifi up)
- brctl to remove/add everything
- in addition to that, bringing all interfaces down and up again, which is the same as /etc/init.d/network reload (reload! not restart). That didn't help either.
In short: it SEEMS related to a driver(?), since 'network restart' fixes the issue and 'network reload' doesn't.
==[ WHEN IT OCCURS / WHO IS AFFECTED ]==
When it occurs and who is affected is random. It happens both to existing users (who are already connected) and users that bring their laptop that very same day and never were able to get online.
Other laptops (wifi clients) on the access point are almost always unaffected.
==[ OTHER ]==
I don't have a reproducible test case, so I rely on staff to bring affected laptops. There are always a few people per day who experience this issue. I've already been hunting this down for hours, without much success.
Any help or suggestions would be greatly appreciated.
Attachments (2)
Change History (6)
comment:1 Changed 5 years ago by anonymous
comment:2 Changed 5 years ago by syzop@…
Today I changed the following on 21 AP's:
uci set network.@switch[0].enable_learning=0
uci delete network.@switch[0].enable_vlan
uci commit
& restart
I hope this helps, as I suspect the internal switch to be the source of this issue.
comment:3 Changed 5 years ago by syzop@…
It seems we are no longer having any problems with disappearing traffic.
So:
uci set network.@switch[0].enable_learning=0
uci commit
# and reboot..
(the vlan change I mentioned earlier is unlikely to be related, though I haven't double checked)
Does this indicate a hardware problem in the WNDR3700v2 internal switch? Or the driver?
Well anyway, I'm happy this workaround works for me :)
I've added a note about this in the WNDR3700 wiki so others won't have to waste tens of hours of time on this.
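With this many APs it may help to confirm that the setting really is in place on each one. A minimal sketch: `check_learning` is a hypothetical helper that interprets the value printed by `uci -q get`, and it assumes the switch is the first `switch` section, as in the commands above.

```shell
# Interpret the stored enable_learning value.
# Assumption: the switch is network.@switch[0], as in the workaround.
check_learning() {
    # $1: output of: uci -q get 'network.@switch[0].enable_learning'
    case "$1" in
        0) echo "learning disabled (workaround active)" ;;
        *) echo "learning enabled or option unset" ;;
    esac
}

# Usage on the AP:
#   check_learning "$(uci -q get 'network.@switch[0].enable_learning')"
```

This only checks the stored configuration; the switch picks it up after the restart or reboot mentioned above.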
comment:4 Changed 5 years ago by syzop@…
By the way, I see now that I didn't mention this before: when I change the MAC address of the affected wifi client, that client works perfectly again. That's (another reason) why I highly suspect the internal switch.
component should probably be kernel