Opened 2 years ago

Last modified 23 months ago

#22024 new defect

RTL8366S switch hang

Reported by: bolvan
Owned by: developers
Priority: normal
Milestone:
Component: packages
Version: Trunk
Keywords:
Cc:

Description

I use the WNDR3800's switch with VLANs.
After executing 'swconfig dev switch0 load network', the switch hangs with 20-30% probability.
It looks like this:

root@wifi:~# swconfig dev switch0 load network_reset_switch
root@wifi:~# ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_req=1 ttl=128 time=0.373 ms

root@wifi:~# swconfig dev switch0 load network_reset_switch
root@wifi:~# ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_req=1 ttl=128 time=0.373 ms

root@wifi:~# swconfig dev switch0 load network_reset_switch
root@wifi:~# ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_req=1 ttl=128 time=0.373 ms
From 192.168.2.2 icmp_seq=1 Destination Host Unreachable

root@wifi:~# swconfig dev switch0 load network_reset_switch
root@wifi:~# ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_req=1 ttl=128 time=0.373 ms

/etc/config/network_reset_switch:

config switch
	option name 'switch0'
	option reset '1'
	option enable_vlan '1'
	option enable_vlan4k '1'
	option max_length '3'

config switch_vlan
	option device 'switch0'
	option vlan '1'
	option ports '0 1 2t 5'

config switch_vlan
	option device 'switch0'
	option vlan '2'
	option ports '2t 3'

Only the reset operation causes the problem. When reset is 0, the switch does not hang.
When the switch is hung, 'swconfig show' outputs the correct configuration. No switch-specific messages appear in dmesg.

The problem is very painful for me because the reset is executed on every boot.
I had to write a workaround script that pings nearby hosts and resets the switch until it works.
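
The workaround script itself is not attached to the ticket. A minimal sketch of such a loop could look like the following; PROBE_HOST, SWITCH_CONFIG, MAX_TRIES and the function names are illustrative assumptions, and the 2 second settle time is a guess rather than a measured value.

```shell
#!/bin/sh
# Sketch of a recovery loop: reload (and thereby reset) the switch until a
# known neighbour answers. All names and values below are assumptions.

PROBE_HOST="${PROBE_HOST:-192.168.2.1}"                 # host expected to answer pings
SWITCH_CONFIG="${SWITCH_CONFIG:-network_reset_switch}"  # config containing 'option reset 1'
MAX_TRIES="${MAX_TRIES:-10}"                            # give up after this many resets

# Returns 0 if the probe host answers at least one of three pings.
switch_alive() {
    ping -c 3 -W 1 "$PROBE_HOST" >/dev/null 2>&1
}

# Reloads the switch configuration until the probe succeeds or MAX_TRIES
# resets have been issued. Returns 0 on success, non-zero otherwise.
heal_switch() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if switch_alive; then
            return 0
        fi
        tries=$((tries + 1))
        swconfig dev switch0 load "$SWITCH_CONFIG"
        sleep 2   # assumed settle time after the reset
    done
    switch_alive
}
```

Calling `heal_switch` late in the boot sequence would retry the reset up to MAX_TRIES times. It depends on PROBE_HOST being reachable, which is the weakness of this approach.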

Attachments (4)

dump_good_1.txt (29.0 KB) - added by bolvan 23 months ago.
switch register dump in good state
dump_good_2.txt (29.0 KB) - added by bolvan 23 months ago.
switch register dump in good state (2)
dump_bad_1.txt (29.0 KB) - added by bolvan 23 months ago.
switch register dump in bad state
dump_bad_2.txt (29.0 KB) - added by bolvan 23 months ago.
switch register dump in bad state (2)

Change History (19)

comment:1 Changed 2 years ago by bolvan

Tested the same on a D-Link DIR-825. It has exactly the same chipset.
The problem exists there too.
If only one VLAN is present, the switch hangs with very low probability.

comment:2 Changed 23 months ago by anonymous

Do both problems also occur with a clean config on recent trunk?

comment:3 Changed 23 months ago by anonymous

Maybe this has the same cause:
/ticket/19630.html

comment:4 Changed 23 months ago by anonymous

Could you compile and install version 12.09 for testing:
/ticket/18106.html

This bug seems to be reported often, but no one has explained it as well as you did:
https://dev.openwrt.org/search?q=RTL8366S

Thanks

comment:5 Changed 23 months ago by bolvan

I flashed the 12.09 image from downloads.openwrt.org.
The problem exists.
The only difference I found is the probability of a hang.
In 12.09 it is lower, about 10%.

I did not change anything from the default install. I only copied my network, dhcp, firewall, wireless and network_reset_switch files.

Also, replacing "load" with the manual steps "set reset", "vlan 1 set ports", "vlan 2 set ports" changes nothing, no matter how much time passes between commands. After a "bad" reset, only another reset (or more than one) can unhang the switch.
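
For reference, the manual sequence might be written out roughly as below. This is a sketch, not the exact commands that were run: the enable_vlan and enable_vlan4k steps are mirrored from the config file in the description, the final "apply" step and the STEP_DELAY variable are assumptions, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch of the "load" operation split into individual swconfig calls.
# The attribute set and the final apply step are assumptions.

STEP_DELAY="${STEP_DELAY:-1}"   # hypothetical pause between steps, in seconds

manual_reload() {
    swconfig dev switch0 set reset 1
    sleep "$STEP_DELAY"
    swconfig dev switch0 set enable_vlan 1
    sleep "$STEP_DELAY"
    swconfig dev switch0 set enable_vlan4k 1
    sleep "$STEP_DELAY"
    swconfig dev switch0 vlan 1 set ports '0 1 2t 5'
    sleep "$STEP_DELAY"
    swconfig dev switch0 vlan 2 set ports '2t 3'
    sleep "$STEP_DELAY"
    swconfig dev switch0 set apply 1   # assumed: commit the settings to the hardware
}
```

Raising STEP_DELAY is a quick way to repeat the observation that the time between commands makes no difference.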

comment:6 Changed 23 months ago by anonymous

The devs of the switch driver are listed here. It would be great if you could fix this problem together with them:
https://dev.openwrt.org/browser/trunk/target/linux/generic/files/drivers/net/phy/rtl8366s.c

comment:7 Changed 23 months ago by anonymous

I looked into the source and the changes for you.
Check this change: https://dev.openwrt.org/changeset/32943/

Try raising the msleep values to 100 or so. Play with the code a bit. It could just be a timing problem.

comment:8 Changed 23 months ago by anonymous

Changing:

//#define RTL8366_SMI_HW_STOP_DELAY		25	/* msecs */
//#define RTL8366_SMI_HW_START_DELAY		100	/* msecs */
#define RTL8366_SMI_HW_STOP_DELAY		100	/* msecs */
#define RTL8366_SMI_HW_START_DELAY		200	/* msecs */

did not help.

comment:9 Changed 23 months ago by bolvan

I traced the flow of execution with some printk calls. I found that the hw_reset function pointer is never initialized and is always NULL, so the code with the mentioned delays is never executed.
I added msleep(100) at the end of rtl8366s_reset_chip().
It did not help.

comment:10 Changed 23 months ago by anonymous

The problem is probably somewhere around there. This is already a great bug report with many ideas. Could you contact the two developers listed in the .c file? It should be easy for them to fix this.

comment:11 Changed 23 months ago by anonymous

I already sent them an email. Hopefully they will respond.
Meanwhile I wrote a simple hardware register dumper. It calls rtl8366_smi_read_reg() for registers 0 to 0x11FF and prints them with printk in hexdump format.
I didn't find any specific differences that are present in the hung state and absent in the normal state, or vice versa. The dumps seem almost identical.
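
One way to narrow down such a comparison is to keep only the lines on which both good dumps agree, both bad dumps agree, and good differs from bad. The sketch below assumes the four dumps are line-aligned (the same register range on the same line number in every file). The function name stable_diff is hypothetical.

```shell
#!/bin/sh
# Sketch: print lines that are stable within each state but differ between
# states. Assumes all four files have the same line layout.
# Usage: stable_diff GOOD1 GOOD2 BAD1 BAD2   (four distinct file names)
stable_diff() {
    awk '
        FILENAME == ARGV[1] { g1[FNR] = $0; next }   # first good dump
        FILENAME == ARGV[2] { g2[FNR] = $0; next }   # second good dump
        FILENAME == ARGV[3] { b1[FNR] = $0; next }   # first bad dump
        {
            # Fourth file: compare this line against the three stored copies.
            if (g1[FNR] == g2[FNR] && b1[FNR] == $0 && g1[FNR] != $0)
                printf "%d: good=[%s] bad=[%s]\n", FNR, g1[FNR], $0
        }
    ' "$1" "$2" "$3" "$4"
}
```

With the attachments from this ticket, the call would be `stable_diff dump_good_1.txt dump_good_2.txt dump_bad_1.txt dump_bad_2.txt`. Registers that change on every dump, such as counters, are filtered out because the two dumps of the same state disagree on them.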

comment:12 Changed 23 months ago by bolvan

Now I have compiled it with debugfs support. I can read and write registers on the fly, and my own full register dumper can dump on request. Since Ethernet connectivity dies, I use Wi-Fi for access.

comment:14 Changed 23 months ago by bolvan

More observations.
1) Rarely, in the "bad" state the switch passes pings of size 78 but nothing larger.

2) Writing 0x02 to register 0x100 causes a soft reset. It wipes the 4k VLAN table (mode 3) but leaves the mc configuration (16 VLANs, mode 2). If vlan_4k is disabled, the switch in most cases (but not always; a hang is still possible) remains in working condition after the soft reset. If vlan_4k is enabled, a reload without reset is necessary. A soft reset also heals the hung state.

comment:15 Changed 23 months ago by bolvan

I found a reliable way of detecting a switch hang without relying on the presence of nearby hosts.
After a reset the MIB counters are zero.
Port 5 is connected to the router's CPU.
Try sending something out. It can be a broadcast ping.
Wait a few seconds.

swconfig dev switch0 port 5 get mib | grep -q '^IfOutOctets[[:space:]]*: 0$' || echo "switch works"

If IfOutOctets remains zero, the switch is hung.
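
The steps above can be wrapped into a small check to run right after a reset. This is a sketch: BCAST_ADDR, WAIT_SECS and the function names are assumptions. The `-b` flag is what iputils ping needs for broadcast targets, and other ping implementations may differ.

```shell
#!/bin/sh
# Sketch of the MIB-based hang check. Intended to run right after a reset,
# while the counters are still zero. Names and values are assumptions.

BCAST_ADDR="${BCAST_ADDR:-192.168.2.255}"   # assumed LAN broadcast address
WAIT_SECS="${WAIT_SECS:-3}"                 # "a few seconds"

# Reads 'get mib' output on stdin; succeeds if IfOutOctets is still zero.
out_octets_zero() {
    grep -q '^IfOutOctets[[:space:]]*: 0$'
}

# Returns 0 if the switch looks hung (nothing left CPU port 5), non-zero otherwise.
switch_hung() {
    ping -b -c 3 -W 1 "$BCAST_ADDR" >/dev/null 2>&1   # generate outbound traffic
    sleep "$WAIT_SECS"
    swconfig dev switch0 port 5 get mib | out_octets_zero
}
```

A boot script could then loop with `while switch_hung; do swconfig dev switch0 load network_reset_switch; done`, ideally with an upper bound on the number of retries.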
