Opened 8 years ago

Closed 7 years ago

#7697 closed defect (duplicate)

WNDR3700 (ath9k) 2,4GHz N incompatibility with Intel AGN (4965)

Reported by: Alex
Owned by: developers
Priority: normal
Milestone:
Component: packages
Version: Trunk
Keywords:
Cc:

Description

The client which corrupts the connection is an Intel 4965AGN.

Disabling 11n on this client is enough to make the problem go away. (Other clients do not show the problem even with 11n enabled.)

On older builds this client showed the problem only after some time.
But I have now found that it can currently be reproduced instantly when minstrel is enabled.

To replicate:

Use n (here channel 6) on (at least) the WNDR3700.

Connect with an Intel AGN (at least the 4965) to 2,4GHz n.

Then try to copy a shared file from the wired side to the wireless client. As soon as the transfer starts, the connection becomes extremely slow and unresponsive. (This is easy to check, e.g. with a continuous ping running in the background.)
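
The background-ping check can be turned into numbers with a short script. The following is only a minimal sketch, not part of the original report: it assumes the ping output comes from a Linux client using iputils-style reply lines (Windows ping prints a different format), and the function name `summarize_ping` is hypothetical.

```python
import re

# Matches iputils-style reply lines, e.g.
# "64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=1.42 ms"
REPLY_RE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def summarize_ping(lines):
    """Return (sent, received, loss_percent, max_rtt_ms) from ping reply lines.

    'sent' is inferred from the span of sequence numbers seen, so probes
    lost before the first reply or after the last reply are not counted.
    """
    rtts = {}
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            rtts[int(match.group(1))] = float(match.group(2))
    if not rtts:
        return 0, 0, 0.0, None
    sent = max(rtts) - min(rtts) + 1
    received = len(rtts)
    loss_percent = 100.0 * (sent - received) / sent
    return sent, received, loss_percent, max(rtts.values())


if __name__ == "__main__":
    # Illustrative lines only; replace with captured ping output.
    sample = [
        "64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=1.00 ms",
        "64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=250.00 ms",
        "64 bytes from 192.168.1.1: icmp_seq=5 ttl=64 time=2.00 ms",
    ]
    print(summarize_ping(sample))
```

Comparing the loss percentage and the maximum round-trip time before and during the file transfer gives a simple way to report how badly the link degrades.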

Extensive ath9k debug logs captured while the failure happened can be found in ticket #6667 (https://dev.openwrt.org/attachment/ticket/6667/ath9kdebug%3D0xffffffff_1.log and https://dev.openwrt.org/attachment/ticket/6667/ath9kdebug%3D0xffffffff_2.log).

Attachments (1)

log1.txt (15.7 KB) - added by Alex 7 years ago.
standard log of a crash

Change History (49)

comment:1 in reply to: ↑ description Changed 8 years ago by Alex

With the current branch there is even a 30% chance that this will reboot the router.
So which debug option should I use to get a useful log of this?
Thank you.

comment:2 follow-up: Changed 8 years ago by Jan

I can confirm this bug (specs: Lenovo T61 w/ 4965AGN and WNDR3700) and it happens on both 11n and 11g speeds (tested on 2,4GHz). Yesterday it seemed that the transfer lags and disconnects are much worse on Windows 7 (x64) than on Ubuntu 10.04 (x64). On Windows, wireless is practically useless for me at the moment. On Ubuntu, I was at least able to stream LastFM radio for 30 minutes. I'll investigate this during the weekend. @Alex: Which OS are you using?

comment:3 in reply to: ↑ 2 Changed 8 years ago by Alex

Hi Jan!

I'm very happy that you've confirmed this bug.

I had already heard that it also happens on Linux, so I haven't paid much attention to the OS since then.

But I'm using Vista (x86). And yes, the connection is useless on that for me too. We should be aware, though, that the problem is the wireless connection (not just the Vista file transfer method), because transfers work flawlessly with the same router when I connect another wireless client (acting as a gateway, so to speak) to the LAN port of the same laptop (a Lenovo X300, by the way).

comment:4 Changed 7 years ago by Frenchie

I have exactly the same symptoms with the same card in an HP laptop. Two spontaneous reboots in the last two hours after the better part of two days of solid uptime while that particular laptop has been out of the house.

I've disabled 11n on that particular card, will report back if things improve. Unfortunately, I don't have atheros debugging enabled in my current build. I'm happy enough to turn it on and take some logs but it may take a few days until I'm in a position where I'm able to.

comment:5 follow-up: Changed 7 years ago by Jan

Just a short update: Disabling 11n in favor of 11g *and* a reboot of the router (instead of just "wifi" to reload the config) works for me. 11n is unusable on Windows and Linux however. Like Frenchie, my current build lacks debugging features.

comment:6 in reply to: ↑ 5 Changed 7 years ago by Frenchie

I've also had the opportunity to borrow and test a client with an Intel 5100: exactly the same router crash, until 11n is disabled for that card as well. In the absence of logs, it would seem that Intel's 11n cards are doing something special, which I suspect is causing the ath9k driver to kernel panic.

comment:7 Changed 7 years ago by Alex

Tested it with a Thinkpad with an Intel (Centrino) 1000 BGN and the same driver version; it seems to work. (By the way, this one can also do 40MHz bandwidth on 2,4GHz.)
So at least not the whole Intel line is affected.
Hopefully someone will pay attention to this soon. I want to start debugging, but have had no assistance so far :-/.

comment:8 Changed 7 years ago by Andy Lutomirski <luto@…>

I may be affected as well. My client is an Intel 5350 running recent Linux drivers. This could also be related to #7790.

comment:9 Changed 7 years ago by dsvensson@…

Can confirm this problem on my Intel Corporation Ultimate N WiFi Link 5300.

comment:10 Changed 7 years ago by anonymous

I can confirm this with a Netgear WNDR3700 and the following wifi interfaces: a brand new N Intel "Pro Wireless 6200", an old (2006, draft-N) Atheros AR5008 and a very old (no N) Intel 3945ABG. The connections fail all the time and after a few failures the AP has to be rebooted. It's even worse with 10.03.1-rc1 and rc2 than with 10.03.

comment:11 follow-up: Changed 7 years ago by nbd

Please try current backfire SVN.

comment:12 in reply to: ↑ 11 ; follow-up: Changed 7 years ago by Alex

Sorry, still there.

So please, how should we debug it?

Last time I killed the wifi-phy? interface(s), ran rmmod ath9k, then insmod ath9k debug=?, and then wifi.

If there is a more professional way or we need other output, please just comment.

And at least provide the debug value to set.

Thank you

comment:13 in reply to: ↑ 12 Changed 7 years ago by anonymous

At least the router doesn't reboot instantly now.
But the connection becomes unusable as soon as a transfer tries to start :-(.

comment:14 follow-up: Changed 7 years ago by nbd

Maybe it's related to the STBC capability bug that I just committed a fix for (in hostapd). Please try a current version.
As for the debugging stuff: as long as I don't have an idea where the problem might be hiding, I can't give you a decent set of debug flags. The debug stuff is only useful if you know what you're looking for, especially on embedded hw, where it can easily have an influence on the test behaviour.

comment:15 in reply to: ↑ 14 Changed 7 years ago by anonymous

I'll try the current trunk. (At least wireless-testing got updated too.)
But hey, this bug has made backfire unusable for at least half a year now, with very widespread chipsets on both sides.
So if we want people to stay with OpenWrt, we should really give this high priority!

So we need to find out what we're looking for!
It's such a waste that so many people are affected and nobody tells them how they can help to solve it.
Even if it magically resolves itself soon, I don't think this approach of just hoping can be recommended for the future.
Because you should be able to get what you need to solve it, and the users should be able to use it (and if not, be able to provide useful feedback).

Don't get me wrong, I appreciate your good work so much. But no usable WLAN for such a long time is one of the worst things I can imagine for this project.

By the way, we also have tickets #7750 and #7790.

comment:16 follow-up: Changed 7 years ago by nbd

Whining about the seriousness of the problem does not help in any way. I've been working on stabilizing ath9k for a long time now, and I keep finding and fixing more bugs. These things take time, and the only truly helpful thing that you guys can do at the moment is to just keep testing my changes whenever I add them and report back.
The driver has made a lot of progress over time, and it will continue to do so.

comment:17 in reply to: ↑ 16 Changed 7 years ago by anonymous

I totally agree with you that the driver made a lot of progress!

But I believe it's a little bit hard to change something without knowing whether there is any connection to the specific bug. (Given that, I'm impressed by how much you've already improved.)
And I thought the people here were technically skilled enough to help find the direction the bug might come from. Without that, I just wonder how it is supposed to be localized and then fixed.
But okay, so I'm condemned to wait, and not able to help.

comment:18 follow-up: Changed 7 years ago by nbd

Don't worry, if I come up with any ideas on how this could be tracked down, I'll let you know. By the way, did you test whether the hostapd change that I checked in helped in your case as well? It did improve stability for other users (though I have no feedback on IWL4965 yet).

comment:19 in reply to: ↑ 18 ; follow-up: Changed 7 years ago by Alex

Hi!

For the first time I can do the transfer again! That's such an improvement since I started using OpenWrt on this router.
It still became unstable at times, and reconnects were needed to "repair" the connection. But I think we are on the way again :-)!
I used the current trunk, so I will compile the branch now to see whether the new wireless-testing or the STBC fix resolved it.
Thank you very much in the meantime.

comment:20 in reply to: ↑ 19 Changed 7 years ago by Alex

Before, I could not complete even one file copy, but now problems appear only after several tries or after some time, so just irregularly.
The connection then "stutters", so timeouts appear too often. Sometimes it gets so bad that a reconnect is needed before it works again.
Transfers also work with the branch version, so it was indeed the STBC fix which made this possible!
But I still got an unstable connection with the trunk version very quickly (all this after an hour or so of testing). Could it be that some of the patches didn't make it into the new wireless-testing? Or do I just have to use the branch version for more than 20 minutes to get the same results with it ;-)?
Good work!

comment:21 Changed 7 years ago by Alex

So it just works for some time.
Looks like there are several errors left :-(.
I still get a dead connection after a few tries.
Even a reconnect is not always enough.
So reboot/router access is still needed..

Changed 7 years ago by Alex

Attachment log1.txt added: standard log of a crash

comment:22 follow-up: Changed 7 years ago by nbd

Please enable 'Compile the kernel with symbol table information' under 'Global build settings' and then make a new crash log for me.

comment:23 follow-up: Changed 7 years ago by Frenchie

Things are much more stable after the most recent set of changes (I'm now running backfire r22842). I wasn't able to make the router reboot in an hour of running transfers with the Intel client, which previously would've been an unpleasant test.

Still seeing (less frequent) dropouts but that may even be potentially down to range. In any case, things are getting there.

Thanks.

comment:24 in reply to: ↑ 23 Changed 7 years ago by anonymous

I also had to wait much longer "to get a reboot" (not that I really wanted one) than before. So it looks like there are different causes for that.
The dropouts are not due to range in my environment. They look similar to before, in that a reconnect helps. But they just don't appear every time.

comment:25 in reply to: ↑ 22 Changed 7 years ago by anonymous

Didn't get a reboot but at least something:

Jan  2 00:13:38 OpenWrt user.warn kernel: ------------[ cut here ]------------
Jan  2 00:13:38 OpenWrt user.warn kernel: WARNING: at /home/user/backfire/backfire/build_dir/linux-ar71xx/compat-wireless-2010-07-29/drivers/net/wireless/ath/ath9k/xmit.c:149 0x83168788()
Jan  2 00:13:38 OpenWrt user.warn kernel: Modules linked in: usb_storage leds_wndr3700_usb ohci_hcd nf_nat_tftp nf_conntrack_tftp nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4
Jan  2 00:13:38 OpenWrt user.warn kernel: Call Trace:[<8007dc44>] 0x8007dc44
Jan  2 00:13:38 OpenWrt user.warn kernel: [<80068374>] 0x80068374
Jan  2 00:13:38 OpenWrt user.warn kernel: [<80068374>] 0x80068374
Jan  2 00:13:38 OpenWrt user.warn kernel: [<8007cb50>] 0x8007cb50
Jan  2 00:13:38 OpenWrt user.warn kernel: [<83168788>] 0x83168788
Jan  2 00:13:38 OpenWrt user.warn kernel: [<83168788>] 0x83168788
Jan  2 00:13:38 OpenWrt user.warn kernel: [<83168fa8>] 0x83168fa8
Jan  2 00:13:38 OpenWrt user.warn kernel: [<8316a4f0>] 0x8316a4f0
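
The trace above contains only raw addresses, presumably because this kernel was built without symbol table information. As a minimal sketch (not part of the original report), the unique call-trace addresses can be extracted from such a log so they can later be compared against a build that does include symbols. The function name `trace_addresses` is hypothetical, and the pattern assumes 32-bit addresses in the `[<...>]` form shown in this log.

```python
import re

# Matches call-trace entries such as "[<83168788>] 0x83168788".
TRACE_RE = re.compile(r"\[<([0-9a-fA-F]{8})>\]")


def trace_addresses(log_lines):
    """Return the unique call-trace addresses, in first-seen order, as ints."""
    seen = {}
    for line in log_lines:
        for hex_addr in TRACE_RE.findall(line):
            # dicts preserve insertion order, so this de-duplicates in order
            seen.setdefault(int(hex_addr, 16), None)
    return list(seen)


if __name__ == "__main__":
    # Two lines copied from the log in this comment.
    sample = [
        "Jan  2 00:13:38 OpenWrt user.warn kernel: Call Trace:[<8007dc44>] 0x8007dc44",
        "Jan  2 00:13:38 OpenWrt user.warn kernel: [<83168788>] 0x83168788",
    ]
    for addr in trace_addresses(sample):
        print(hex(addr))
```

Working with a de-duplicated address list keeps repeated frames from cluttering the comparison.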

comment:26 Changed 7 years ago by Jan

I agree that things are more stable with r23000+ as I was able to use 11n for at least 30 minutes. Then the connection dropped dead again. I'll try to get a log.

comment:27 Changed 7 years ago by nbd

Please try r23097, as it contains some significant aggregation fixes.

comment:28 Changed 7 years ago by luto

I pounded on r23097 for half an hour or so (including a few netperf runs) from my Intel 5350 box, and it hasn't fallen over yet (although it did get a few of the "Failed to stop TX DMA in 100 msec after killing last frame" warnings). That's with and without STBC enabled in hostapd.conf.

I'll let you know if I have problems tomorrow.

comment:29 Changed 7 years ago by Jan

Running r23099, the situation has improved a lot for me :) I had only one disconnect on 11n/2.4GHz during the whole afternoon. Great work, nbd!

comment:30 follow-up: Changed 7 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

I guess this is fixed now :)

comment:31 in reply to: ↑ 30 Changed 7 years ago by Alex

  • Resolution fixed deleted
  • Status changed from closed to reopened

Sorry, but within less than an hour of transfers the packet loss appears again and doesn't stop until a reconnect. In the beginning it looks very good, though!
So I don't think these few hours are enough to make a clear statement like "fixed".

comment:32 follow-up: Changed 7 years ago by nbd

What kind of packet loss? Complete loss or just many dropped packets that bring down throughput?

comment:33 follow-up: Changed 7 years ago by nbd

And what client driver are you using?

comment:34 follow-up: Changed 7 years ago by luto

FWIW, on earlier versions of openwrt, the AP dies permanently after awhile (days), but when actively using my Intel laptop, it happens in maybe half an hour instead of days.

Intel seems to have some nasty bugs on the client side, though, and my connection to just about any AP will die and need either a reconnect or a reload of the client driver after awhile. I don't think that's the AP's fault. It seems worse on Linux than Windows, but someone I work with has the same problem on a different Intel-based laptop running Windows. (This might be related to an acknowledged but as yet unfixed Intel firmware bug, at least on the Intel WiFi Link 5350.)

comment:35 in reply to: ↑ 33 Changed 7 years ago by Alex

I'm using the current Intel client driver in Vista. (I've tried several versions over time.)
By the way, the problem is more pronounced there than in XP.
But I can't confirm luto's statement that this problem exists with every AP: a reconnect is not needed with the Netgear firmware on this AP, nor on at least some other APs.

Regarding the packet loss:
At first everything is fine, with only about one packet in a hundred dropped.
But once the packet loss starts, packets always go missing several in a row. For example, within a span of 5 seconds, 5 packets are lost, then 5 come through, then 5 are lost again, and so on.

So it's not just that throughput goes down; usability (e.g. ssh, rdp, ...) is lost.
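
The difference between occasional single drops and the alternating pattern described above can be made explicit by looking at runs of consecutive lost probes. This is only an illustrative sketch: the function names and the `min_run` threshold of 3 are assumptions, not something taken from the ticket.

```python
from itertools import groupby


def loss_runs(received):
    """Lengths of each run of consecutive lost probes.

    'received' is a sequence of booleans, one per probe, True if a reply
    arrived and False if the probe was lost.
    """
    return [sum(1 for _ in group) for ok, group in groupby(received) if not ok]


def is_bursty(received, min_run=3):
    """True if any run of consecutive losses is at least min_run long.

    min_run=3 is an arbitrary illustrative threshold.
    """
    return any(run >= min_run for run in loss_runs(received))


if __name__ == "__main__":
    # Alternating blocks of 5 delivered and 5 lost probes.
    pattern = ([True] * 5 + [False] * 5) * 2
    print(loss_runs(pattern), is_bursty(pattern))
```

Isolated single losses give runs of length 1, while the failure described here shows up as repeated longer runs, which is what breaks interactive sessions even when average throughput still looks acceptable.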

Thanks

Alex

comment:36 Changed 7 years ago by Alex

Okay, I got better results today (probably because of the now really current Intel driver 13.3.0.137 instead of 13.2.1.5?! But that never helped before, so I'll continue the test.).

comment:37 in reply to: ↑ 32 Changed 7 years ago by anonymous

The dropout got so bad that nothing came through at all; even a reconnect only brought back the connection itself with authentication, but nothing else. Only "wifi" solved the problem, for ALL clients (not just the problematic Intel device).

comment:38 in reply to: ↑ 34 Changed 7 years ago by glenno@…

Agree 100% luto. I dual boot Ubuntu / Vista, and when running Ubuntu I can get a solid hour of data transfer before it dies. On Vista it may only crash after a day or so. I have to execute "wifi" on the router (DIR-825) to restart the AP.

Running 10.03.1rc-3 on the DIR-825 and my Dell Laptop has Intel Corporation PRO/Wireless 4965 AG wifi card.

I suspect the Intel card is the issue as none of my other devices seem to cause it any grief (2 other laptops with non Intel Wifi card, several iphones and a wifi media player).

comment:39 follow-up: Changed 7 years ago by dsvensson@…

The Intel cards do have an issue. I've been tracking this ticket for a while, but still waiting for a new firmware blob.

https://bugzilla.kernel.org/show_bug.cgi?id=16691

comment:40 Changed 7 years ago by Alex

I wonder why nobody has mentioned the minstrel rate selection.
Did you have the problems just as badly without it?
Or did you all have it enabled?

Because nbd has now taken minstrel out for ath9k.

So the problem does not occur instantly here (that was also the behaviour 2 months ago, when the ticket was created), but is that a full workaround?

Or did you, nbd, find out that minstrel needs a correction?
By the way, I don't get better transfer rates with it at my current location.

So was the reason bug control, or were there other reasons, such as the normal rate selection having improved? Because it could be disabled anyway (before).

comment:41 follow-up: Changed 7 years ago by nbd

I did not take minstrel out, I did quite the opposite. I forced it to be enabled unconditionally and removed the ath9k rate control.

comment:42 in reply to: ↑ 41 Changed 7 years ago by anonymous

Oh, I should have looked further. I didn't realize that it becomes the chosen one this way.
I was just wondering because I'm still stable.
There is probably a way to see which type is currently running.

comment:43 in reply to: ↑ 39 Changed 7 years ago by anonymous

Experimental uCode is available now for Intel cards; it might be worth testing to make sure we're not debugging two different problems here:
http://marc.info/?l=linux-wireless&m=128658122408096&w=2

comment:44 follow-up: Changed 7 years ago by Treibholz

I experience the same behaviour (https://forum.openwrt.org/viewtopic.php?pid=125059).

What can I do to help solve this problem?

comment:45 in reply to: ↑ 44 ; follow-up: Changed 7 years ago by Alex

Nbd finally got an Intel card! So I am currently compiling r24962, because #8446 mentions the same "deauthenticated due to local deauth request" - "authenticated" - "associated" symptom. (And just to be safe, I'm knocking on wood ;-).)

(But as it normally does not recur instantly here, I'm being careful with predictions for now.)

comment:46 in reply to: ↑ 45 Changed 7 years ago by Alex

Not resolved for me :-(

comment:47 Changed 7 years ago by anonymous

For the record: worksforme.

I compiled "Backfire (10.03, r24928)" and I have already copied ~6GB at 3-7 MB/s over the wireless network from different positions in my apartment, and the connection was stable.

Thanks a lot!

comment:48 Changed 7 years ago by nbd

  • Resolution set to duplicate
  • Status changed from reopened to closed

Stability issues are tracked in #8830
