Opened 7 years ago

Closed 7 years ago

#7988 closed defect (fixed)

RTL8366s / DIR-825 dropped packets, FCS errors

Reported by: yatakama
Owned by: juhosg
Priority: high
Milestone: Backfire 10.03.1
Component: kernel
Version: Trunk
Keywords: rtl8366 packet loss drop fcs errors dir 825
Cc:

Description

There is an issue with the RTL8366s driver in Backfire trunk (last tested in r23115) which affects all traffic going through switch port 5, which is connected to the CPU. Basically this affects all traffic to/from the Realtek switch, i.e. switch <-> cpu, switch <-> wan, switch <-> wlan, while everything else works just fine. This causes transfers through port 5 to be very unstable and at extremely low speeds (at most 1 Mbit/s, tested with iperf and samba file transfers).

The errors show up when running:

swconfig dev rtl8366s port 5 show

The Dot3StatsFCSErrors and EtherStatsDropEvents counters increase steadily. There is another ticket open for this issue under the Kamikaze milestone (#7738), but it seems to have gone unnoticed.
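For anyone who wants to quantify how fast these counters grow, here is a minimal Python sketch. It assumes the command prints each counter as a `Name : value` line, as in the excerpts quoted later in this ticket. The function names and the sample numbers are made up for illustration and are not part of swconfig.

```python
import re

# Assumption: counters appear one per line as "Name : value".
COUNTER_LINE = re.compile(r"^\s*(\w+)\s*:\s*(\d+)\s*$")

ERROR_COUNTERS = ("Dot3StatsFCSErrors", "EtherStatsDropEvents")


def parse_counters(text):
    """Return a dict of counter name -> integer value from captured output."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after, names=ERROR_COUNTERS):
    """Return how much each named counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


if __name__ == "__main__":
    # Hypothetical snapshots, e.g. saved output from two runs of the
    # command above, taken some time apart.
    first = "Dot3StatsFCSErrors : 4407\nEtherStatsDropEvents : 4407\n"
    second = "Dot3StatsFCSErrors : 4519\nEtherStatsDropEvents : 4519\n"
    print(counter_deltas(parse_counters(first), parse_counters(second)))
```

Comparing two snapshots taken around a single iperf run makes it easier to tell whether the errors track the traffic through port 5 or accumulate independently of it.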

I own a DIR-825 and I'm willing to help developers investigate the issue in every way that I can.

Change History (30)

comment:1 Changed 7 years ago by anonymous

is QoS enabled?

comment:2 Changed 7 years ago by jow

  • Owner changed from developers to juhosg
  • Status changed from new to assigned

comment:3 in reply to: ↑ 1 Changed 7 years ago by yatakama

Replying to anonymous:

is QoS enabled?

QoS is not installed. It's a default config with ext3 and usb-storage modules added.

What I've noticed is that the fcs errors only affect traffic generated by the router CPU.

Running iperf on the router in client mode (client generates traffic) exhibits poor and unstable transfer rates. CPU usage in this case is < 2% so that's clearly not affecting performance.
Running iperf on the router in server mode (receives traffic generated by the pc) shows transfer rates of ~225 MBit/s, with iperf using 100% CPU, so the CPU is clearly the bottleneck in this case.

This is also consistent with the tests made using samba file transfers: consider client A connected to wlan and client B connected to the switch. Copying a file from client A to client B is affected by this issue; copying the other way around works fine, limited only by the wireless link speed.

Tests were made with two client PCs with Gbit Realtek cards, using various settings (link speed 100/1000, full/half duplex, checksum offloads on/off), and also with an Asus WL-500gPv2 router. I observed the same results every time, so I believe the issue is independent of the client device used.

comment:4 Changed 7 years ago by jeromepoulin@…

I confirm the problem on 3 DIR-825 units, even with every module unloaded except switch, GPIO, iptables, iptables masquerade and the non-unloadable modules usbcore and ipv6.

Data from router to PC is OK, data from Internet to router is OK, data routed from the Internet to the PC via router gets very very slow.

comment:5 Changed 7 years ago by Jérôme Poulin <jeromepoulin@…>

Just tried on a TP-Link WR1043ND with rtl8366rb switch on latest backfire and the problem isn't showing, so it is specific to rtl8366s.

comment:6 Changed 7 years ago by t.p.northover@…

Does anyone know where the problem is likely to be?

I'm not really a kernel developer, but I've looked at rtl8366s.c and it seems mostly concerned with reading and writing registers on the switch. I'd vaguely expect that the registers don't have a "generate frequent FCS errors" setting so the problem is elsewhere. Is that hopelessly naive of me?

If not, should I be trying to poke around the ethernet driver (AG71xx?)? Is there some way to enable debugging of the packets causing problems?

comment:7 Changed 7 years ago by Jérôme Poulin <jeromepoulin@…>

My guess was that an initialization is missing in the driver, maybe a register that controls how the switch talks to the CPU interface. It could be either that or the ag71xx driver; I'm really lost on this one. I searched all the drivers and registers I could find on the net and tried some rtl8366s initialization registers from other drivers, but it didn't help :(.

comment:8 Changed 7 years ago by yatakama

I also suspect this is most likely related to device-specific code; we have the WNDR3700 with rtl8366s and the TP-Link WR1043ND with rtl8366rb not exhibiting this behavior. These platforms share a whole lot of code, yet the D-Link is the only one affected by this issue.

comment:9 Changed 7 years ago by osku@…

I ran iperf tests with my DIR-825 (backfire 10.03-rc3). With the iperf server on the router side, throughput was 150 Mbit/s; with the server on the PC side, throughput was 250 Mbit/s.

I have some FCS errors, but clearly they are not affecting throughput. I'm not using the default switch config, so could that be the difference between the DIR-825 and other models (WNDR3700, WR1043ND)?

My switch config

config switch
        option name     rtl8366s
        option reset    1
        option enable_vlan 1
        option enable_vlan4k 1

config switch_vlan
        option device   rtl8366s
        option vlan     1
        option pvid     1
        option ports    "1 2 3 5t"

config switch_vlan
        option device   rtl8366s
        option vlan     2
        option pvid     2
        option ports    "0 5t"

comment:10 in reply to: ↑ 9 Changed 7 years ago by Tim Northover

Replying to osku@…:

I ran iperf tests with my DIR-825 (backfire 10.03-rc3). With the iperf server on the router side, throughput was 150 Mbit/s; with the server on the PC side, throughput was 250 Mbit/s.

I have some FCS errors, but clearly they are not affecting throughput. I'm not using the default switch config, so could that be the difference between the DIR-825 and other models (WNDR3700, WR1043ND)?

I've just run a bunch of similar tests, and seen essentially the same thing when everything is over a LAN:

96Mbits desktop <-> laptop, even if traffic is going through port 5 (i.e. LAN <-> WAN ports, which I think is necessary for the FCS errors). Unfortunately my laptop doesn't have gigabit ethernet, so that's what I'd expect.

~230Mbits desktop <-> router. FCS errors are occurring, so I assume the suspected issue is being triggered, but I wouldn't swear to it.

However, as soon as I try to route traffic to the internet, speed drops catastrophically (normal 19Mbits to 3Mbits down) and the FCS errors seem to occur at a far larger rate. I assume they're not being picked up as quickly and cause more damage, though in this configuration my main router is still in the middle (should these be detected link-locally, or at the end-points?). Things get even worse if I configure the main router to simply forward packets blindly.

comment:11 Changed 7 years ago by glenno@…

I am not seeing the above issues on my DIR-825 B1 running trunk r23581 on 2.6.32.24 kernel. I have not modified the switch config from stock standard.

IfInOctets : 3597967916 (that's 3.5GB I assume)
EtherStatsDropEvents : 4407
Dot3StatsFCSErrors : 4407

That appears to be a low error rate for that amount of traffic.

On the 2.4 GHz wifi I can copy at a sustained 1.9 MB/s from a LAN-connected PC (500 MB test file using wget).

I will keep an eye on this when I move up to 2.6.36 and r23988
Hope that helps.

comment:12 in reply to: ↑ 11 Changed 7 years ago by anonymous

Replying to glenno@…:

I am not seeing the above issues on my DIR-825 B1 running trunk r23581 on 2.6.32.24 kernel. I have not modified the switch config from stock standard.

I think there's always been the suspicion that this was a B2 issue, though annoyingly it doesn't seem to have been mentioned here yet. Sorry for that.

IfInOctets : 3597967916 (that's 3.5GB I assume)
EtherStatsDropEvents : 4407
Dot3StatsFCSErrors : 4407

Interesting, that's roughly one every 800k. I seem to have one per 50k.

I've just upgraded to the 2.6.36 kernel on trunk and see no difference unfortunately.
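The "roughly one every 800k" figure can be reproduced from the quoted counters. A small Python check, assuming IfInOctets counts bytes received on the port and that each FCS error corresponds to one bad frame (the helper name is made up for illustration):

```python
def bytes_per_error(octets, errors):
    """Average number of bytes received per counted error."""
    if errors == 0:
        return float("inf")  # no errors observed in this sample
    return octets / errors


if __name__ == "__main__":
    # Counter values quoted from comment 11.
    ratio = bytes_per_error(3597967916, 4407)
    print(f"about {ratio / 1000:.0f} kB per FCS error")
```

With the values above this works out to roughly 816 kB of received traffic per FCS error, which matches the estimate and is about sixteen times better than one error per 50 kB.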

comment:13 Changed 7 years ago by glenno@…

Well, I've just upgraded from r23581 to r24011 and upgraded the kernel from 2.6.32 to 2.6.36.

Previously I could max out my Internet connection at about 750 KB/s. Now I can't get beyond 100 KB/s. This is the case for wireless (5 GHz and 2.4 GHz) and Gigabit Ethernet connected PCs.

Transfers from the LAN to wireless clients max out at 2.6 MB/s, which is as usual.

Config is straight out of the box, using pppoe to an ADSL router (ADSL1 8000/384k).

That's about the limit of my fault-finding ability (sorry).

comment:14 Changed 7 years ago by glenno@…

OK. Well my testing was far from clinical.....

I updated to r24037 on 2.6.37-rc2 and it was also stuck at 100KB/s max.

I disabled IPV6* on my PPPOE and now it's back to running at full speed (750 KB/s).

*my ISP supplies a dual stack IPV4/IPV6 service.

So my speeds may have just been an issue creeping in with the IPV6 code (which I must admit was flawless in Backfire 10.03).

comment:15 Changed 7 years ago by jow

So you had IPv6 enabled on the ppp client? (option ipv6 1)

comment:16 in reply to: ↑ 15 Changed 7 years ago by Glenno <glenno@…>

Replying to jow:

So you had IPv6 enabled on the ppp client? (option ipv6 1)

Correct, under "config interface wan" in /etc/config/network.

I took the line out, typed "wifi" and the speed instantly increased. If I get some time over the weekend I will try and be a bit more clinical with my testing, as I would like to try turning on/off IPV6 across a few different trunk builds and kernel builds to see if I can narrow down when it crept in.

comment:17 Changed 7 years ago by eric@…

Same problem observed with two DIR-825 B1 running 10.03 release.

comment:18 Changed 7 years ago by eric@…

Following up to my last post, I'm not using PPPoE or a PPP client at all on either DIR-825. The WAN port on each is connected to a local network. Unless it is enabled by default, I don't think IPv6 is my problem.

I primarily notice the bandwidth and FCS problem with traffic between the LAN and WAN ports, but it seems to also happen between Wifi and WAN.

comment:19 in reply to: ↑ 18 Changed 7 years ago by glenno@…

Replying to eric@…:

Following up to my last post, I'm not using PPPoE or a PPP client at all on either DIR-825. The WAN port on each is connected to a local network. Unless it is enabled by default, I don't think IPv6 is my problem.

Check QOS.

I have discovered just now that the QOS is clamping the download at 1Mb download (based on the downstream setting in the default QOS config). This only happens on my router when I have "option IPV6 1" on my WAN port. Interesting how it clamps IPV4 downloads, when IPV6 is enabled. Puzzling.

Change QOS downstream to a higher figure or disable IPV6 and the downloads run at full speed again.

comment:20 Changed 7 years ago by eric@…

Is there any QoS in a default installation? I haven't done anything to install or enable QoS.

Are you saying that you get FCS errors reported on port 5 when QoS is enabled, but not when QoS is disabled? I can believe that misconfigured QoS would limit the data rate, but it's hard to believe that it would cause FCS errors. Maybe these are two different problems?

comment:21 Changed 7 years ago by yatakama

I think there may be two separate issues here indeed. The original problem I reported was only related to the four LAN ethernet ports. Any workstation connected to one of these ports is experiencing severe packet loss in all test scenarios: lan <-> wan, lan <-> wifi, lan <-> usb (all this traffic flows through the switch-cpu link - port 5). Wifi is not affected so there is no packet loss between wifi <-> wan and wifi <-> usb.

I'm using a rev B2, I can't verify whether B1 is also affected.

comment:22 Changed 7 years ago by anonymous

Since the FCS errors only happen on port 5, the CPU port, and reportedly didn't happen with older kernels (I haven't personally verified that), it occurs to me that the problem could potentially be with the Ethernet MAC configuration in the Atheros SoC, rather than with the Realtek switch configuration.

comment:23 Changed 7 years ago by Jérôme Poulin <jeromepoulin@…>

After testing, the old firmware (10.03) is slower than the new one (10.03.1-rc4) but has fewer FCS errors, probably because it is slower anyway.

http://pastebin.ca/2016728

comment:24 Changed 7 years ago by anonymous

This is also happening to me on a DIR-825 B1 with Backfire 10.03.1-RC4. I'm using port 5 to trunk VLAN ID 600 to my IPTV set-top box (VLAN ID 500 is for internet), and Dot3StatsFCSErrors and EtherStatsDropEvents keep increasing every second. I have also noticed that the IPTV video stream keeps getting corrupted.
My network config is as follows:

config interface loopback
        option ifname   lo
        option proto    static
        option ipaddr   127.0.0.1
        option netmask  255.0.0.0

config interface lan
        option ifname   eth0.1
        option type     bridge
        option proto    static
        option ipaddr   192.168.1.1
        option netmask  255.255.255.0

config interface wan
        option ifname   eth1.500
        option proto    pppoe

config interface iptv
        option ifname   eth1.600 eth0.600
        option type     bridge
        option proto    none

config switch rtl8366s
        option enable   1
        option reset    1
        option enable_vlan 1
        option enable_vlan4k 1

config switch_vlan
        option device   rtl8366s
        option vlan     1
        option ports    "1 2 3 5t"

config switch_vlan
        option device   rtl8366s
        option vlan     500
        option ports    "5t"

config switch_vlan
        option device   rtl8366s
        option vlan     600
        option ports    "0 5t"

comment:25 Changed 7 years ago by Dave Dubreuil <dave.dubreuil@…>

Has anybody tried locking the switch port and eth0 to 100/Full? Is it possible to do that? Could somebody point me to docs for that?

comment:26 Changed 7 years ago by fred.gillette@…

I am also seeing this with the trunk. Any news on when we might see a fix?

comment:27 Changed 7 years ago by yatakama

User masa posted a fix in the forums: https://forum.openwrt.org/viewtopic.php?pid=126410#p126410
It involves initializing the VCR1 register (0x0006) to 0x100, so as to leave VLAN tags untouched for all ports and strip them for port 4, and also commenting out the RTL8366S_VLAN_MEMBERINGRESS_REG initialization. It seems that this prevents the WAN port from negotiating gigabit speeds, but I can't confirm that for the moment.

Maybe someone could refine this further and produce a patch.

comment:28 Changed 7 years ago by yatakama

Issue was fixed in r25121; ticket can be closed. Thanks to everyone involved.

comment:29 Changed 7 years ago by anonymous

Hope the fix can be backported to backfire.
Thanks.

comment:30 Changed 7 years ago by juhosg

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixed in r25121 (trunk) and r25257 (backfire).
