Opened 3 years ago

Last modified 2 years ago

#18966 reopened defect

WPA-EAP TLS broken on Buffalo WZR-HP-AG300H/ath9k - workaround included

Reported by: alexander.wetzel@… Owned by: developers
Priority: normal Milestone:
Component: kernel Version: Barrier Breaker 14.07
Keywords: athk9 eap Cc:

Description

I've been using EAP-TLS on my Buffalo WZR-HP-AG300H and have had connection problems for at least one year, up to at least 14.07 (r42625). I've finally tracked the issue down, and it now looks like a problem with the ath9k driver or firmware.

How it looks from a user perspective:

With EAP-TLS, the initial connection works fine. Some time later (n*60min) the connection freezes while still claiming to be connected, but only if the connection is not idle at that moment. To be hit by the bug you have to transfer data above an as yet undetermined threshold, or maybe just send some packets at a very bad moment.

A simple ping, for example, is not sufficient to trigger the issue; downloading something at around 2.5MiB/s, on the other hand, triggers it for sure.

There is also nothing in the logs at a normal log level on either the router or the client, and even with the highest debug settings everything still looks fine.

Looking a bit closer:

When the bug hits, the client is unable to reach any IP, and after some minutes even the ARP entry for the wlan router expires. Tcpdump shows no incoming data on the client; you only see the outgoing packets.
Running tcpdump on the wlan router, on the other hand, still shows both incoming and outgoing packets. Disconnecting and reconnecting to the wlan fixes the issue. (If you are really patient, waiting one hour also fixes it.)

Trying the same with WPA-PSK (on a separate SSID on the same card) works perfectly; I can't reproduce the issue in this mode!

I opened a Linux kernel bug for this, assuming it to be an issue with the iwlwifi driver of my client; see https://bugzilla.kernel.org/show_bug.cgi?id=92451
You'll find considerably more information there about what I've tested, including a wlan capture from a monitoring station and a better description of what I have done.

What's really going on:

With the feedback from that ticket that this is (probably) a security issue, and the fact that another client using a different wlan card had the same issue, it became obvious that this can't be an iwlwifi driver problem.
Also, a closer look showed that the connection was not failing somewhere around the re-key but exactly at the re-key, one hour after the initial connect.

So a re-key is somehow preventing the client from decrypting the packets from the router, and the network connection freezes.

As confirmation, the issue can be reproduced much faster by changing the default re-key interval to e.g. 5min:

uci set wireless.@wifi-iface[0].eap_reauth_period=300
uci commit
reboot

With the shorter re-key interval it's much simpler to debug the problem. (I did verify that the pattern stays the same, only now at 5min intervals instead of the 1h with the default settings. And yes, you still must have a download running to trigger it during the re-key.)

The workaround:

The real breakthrough was setting the "nohwcrypt=1" module parameter for ath9k.

/etc/modules.d/ath9k:

ath9k nohwcrypt=1

and reboot the router.

With this setting I'm now unable to reproduce the issue, strongly indicating that either the driver or the firmware for the wlan card is having an issue with EAP re-keys during load.

(Since the firmware seems to be "included" in the card I could find no way to try different firmware images for this card.)


Here are some times for the attached logs, accurate to roughly one second, with a download running at roughly 2.5MiB/s whenever possible and the re-key interval set to 5min:

21:46:00 initial connect
21:51:02 control ping fails
21:56:02 ping resumes
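
The gaps between these events can be checked against the 5min re-key interval. The following is only a small sketch of that arithmetic; the helper name `seconds_between` and the variable names are made up for illustration, and only the three timestamps come from the log above.

```python
from datetime import datetime

REKEY_PERIOD_S = 300  # eap_reauth_period used for this test run

# Timestamps copied from the list above.
events = [
    ("21:46:00", "initial connect"),
    ("21:51:02", "control ping fails"),
    ("21:56:02", "ping resumes"),
]


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds from start to end (same day, HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Gap between each event and the one before it.
gaps = [seconds_between(a[0], b[0]) for a, b in zip(events, events[1:])]

for (_, label), gap in zip(events[1:], gaps):
    offset = gap - REKEY_PERIOD_S
    print(f"{label}: {gap} s after the previous event ({offset:+d} s vs. re-key period)")
```

Both gaps land within a couple of seconds of the 300 s period, which fits the observation that the freeze starts at one re-key and ends at the next.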

Some router details:

All tests were done with 802.11n completely disabled on the router.

Here is the current config for wireless:

wireless.radio0=wifi-device
wireless.radio0.type=mac80211
wireless.radio0.macaddr=10:6f:3f:0e:33:3c
wireless.radio0.hwmode=11ng
wireless.radio0.ht_capab=SHORT-GI-40 TX-STBC RX-STBC1 DSSS_CCK-40
wireless.radio0.country=DE
wireless.radio0.channel=9
wireless.radio0.distance=10
wireless.radio0.txpower=20
wireless.radio0.log_level=0
wireless.@wifi-iface[0]=wifi-iface
wireless.@wifi-iface[0].device=radio0
wireless.@wifi-iface[0].mode=ap
wireless.@wifi-iface[0].network=WLAN
wireless.@wifi-iface[0].ssid=mordor
wireless.@wifi-iface[0].encryption=wpa2+ccmp
wireless.@wifi-iface[0].auth_server=127.0.0.1
wireless.@wifi-iface[0].auth_port=1812
wireless.@wifi-iface[0].auth_secret=<deleted>
wireless.@wifi-iface[0].acct_server=127.0.0.1
wireless.@wifi-iface[0].acct_port=1813
wireless.@wifi-iface[0].acct_secret=<deleted>
wireless.@wifi-iface[0].eap_reauth_period=300
wireless.@wifi-iface[2]=wifi-iface
wireless.@wifi-iface[2].device=radio0
wireless.@wifi-iface[2].mode=ap
wireless.@wifi-iface[2].ssid=mordor-g
wireless.@wifi-iface[2].encryption=psk2+ccmp
wireless.@wifi-iface[2].key=<deleted>
wireless.@wifi-iface[2].network=GWLAN

The second wlan card (5GHz) is disabled and unused.

lspci -v

00:11.0 Network controller: Qualcomm Atheros AR922X Wireless Network Adapter (rev 01)
        Subsystem: Qualcomm Atheros Device a097
        Flags: bus master, 66MHz, medium devsel, latency 168, IRQ 40
        Memory at 10000000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
        Kernel driver in use: ath9k

00:12.0 Network controller: Qualcomm Atheros AR922X Wireless Network Adapter (rev 01)
        Subsystem: Qualcomm Atheros Device a096
        Flags: bus master, 66MHz, medium devsel, latency 168, IRQ 41
        Memory at 10010000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
        Kernel driver in use: ath9k

Attachments (2)

mylog (337.8 KB) - added by alexander.wetzel@… 3 years ago.
Openwrt and Client debug logs
wireshark-eapdecode.patch (12.8 KB) - added by alexander.wetzel@… 3 years ago.
Add EAP Rekey support in wireshark and fix group rekey handling

Change History (10)

Changed 3 years ago by alexander.wetzel@…

Openwrt and Client debug logs

comment:1 follow-up: Changed 3 years ago by nbd

please test current trunk to check if the issue is still there.

comment:2 in reply to: ↑ 1 Changed 3 years ago by alexander.wetzel@…

First, setting nohwcrypt=1 has turned out to be ineffective after all.
I kept the rekey timeout at 5min and, during normal operation, had the same problem as without the module parameter.

It looks like the "workaround" at best reduces the chance of being hit by the problem, and at worst its apparent success was caused by some testing issue.

I've now updated to trunk (r44510), compiled from sources into a custom image, and gave it a quick test today. Unfortunately this version shows the same problem as the stable version. Setting nohwcrypt=1 does not help either.

I would really appreciate some pointers on how to debug this further. Is there maybe a way to dump the encryption keys from wpad on openwrt?
It looks like wpa_supplicant on the client is able to output keys, so we could compare whether they agree and whether this is really an issue of the wifi card or a wrong assumption...

comment:3 Changed 3 years ago by alexander.wetzel@…

Now this is getting stranger all the time:

To verify that the issue is the Openwrt router I connected another Win7 notebook with EAP-TLS to the network. (Again using the 5min rekey interval.)

And that one stays operational!

At the rekey time it only loses one or two pings. (I would have expected the rekey to be seamless, but then that could well be normal...)
So this seems to indicate that this may not be the fault of the router but of the client.

That's hard to accept, since it would indicate that EAP-TLS rekey is broken in some generic way on the clients, affecting at least two different distributions and wlan drivers. (Maybe something in nl80211 or mac80211?)
But that would affect (nearly?) all wlan cards and at least many EAP-TLS users would have the same problem.

Guess I should try to find another EAP-TLS AP I can connect to which is not running openwrt to verify that.
That will be hard, though...

comment:4 follow-up: Changed 3 years ago by nbd

please try current trunk

comment:5 in reply to: ↑ 4 Changed 3 years ago by anonymous

Replying to nbd:

please try current trunk

The issue is still reproducible with r44747
In fact it seems to be worse, since I now get the freezes without downloading anything, a ping seems to be enough. (It still repairs itself on the next rekey.)

comment:6 Changed 3 years ago by alexander.wetzel@…

I'm still investigating this, but it looks like I found hard evidence that it's not a bug in Openwrt, so you may want to close the bug:

I was able to patch wireshark so that it follows the EAP rekeys. When I now enter the PMKs from the radius server or the wpa_supplicant debug log, I can see the cleartext of the encrypted packets.
(I'll upload my current version of the patch here, so if someone is interested in it you can have a look. May take some time till I finalize that and try to get it included in wireshark.)
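
For readers wondering why the PMK alone is enough to follow rekeys in a capture: each 4-way handshake derives a fresh pairwise key from the PMK, both MAC addresses and both nonces, all of which except the PMK are visible in the capture. The sketch below assumes the HMAC-SHA1 based PRF from IEEE 802.11 (the classic WPA2 key derivation with CCMP); it is not code from the wireshark patch, the function names are made up, and all input values are dummy data.

```python
import hashlib
import hmac


def prf_sha1(key: bytes, label: bytes, data: bytes, n_bytes: int) -> bytes:
    """802.11 PRF: concatenate HMAC-SHA1(key, label || 0x00 || data || i)."""
    out = b""
    counter = 0
    while len(out) < n_bytes:
        msg = label + b"\x00" + data + bytes([counter])
        out += hmac.new(key, msg, hashlib.sha1).digest()
        counter += 1
    return out[:n_bytes]


def derive_ptk(pmk: bytes, aa: bytes, spa: bytes,
               anonce: bytes, snonce: bytes) -> bytes:
    """Derive the 48-byte PTK (KCK, KEK, TK) used with CCMP."""
    data = (min(aa, spa) + max(aa, spa) +
            min(anonce, snonce) + max(anonce, snonce))
    return prf_sha1(pmk, b"Pairwise key expansion", data, 48)


# Dummy values for illustration only (not taken from the attached logs).
pmk = bytes(range(32))
ap_mac = bytes.fromhex("020000000001")
sta_mac = bytes.fromhex("020000000002")

# Two handshakes with the same PMK but different nonces.
ptk_first = derive_ptk(pmk, ap_mac, sta_mac, b"\x01" * 32, b"\x02" * 32)
ptk_rekey = derive_ptk(pmk, ap_mac, sta_mac, b"\x03" * 32, b"\x04" * 32)

# The last 16 bytes are the temporal key (TK) that encrypts data frames.
tk_first, tk_rekey = ptk_first[32:], ptk_rekey[32:]
print(tk_first.hex())
print(tk_rekey.hex())
```

The two temporal keys differ even though the PMK is unchanged, so a decoder has to redo the derivation at every handshake it sees in order to keep decrypting.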

And the first result is that all packets are decoded correctly in wireshark; both the Openwrt router and the Linux client seem to generate valid packets. The packets are all there, and according to the capture the communication should work.

So I have to conclude that the issue is indeed the Linux client (as the test with Win7 was already indicating).
Since wpa_supplicant is reporting the correct key (which wireshark can use to decode the packets from the router), this seems to be a driver/kernel issue. I'll mess around a bit more and then address this to the correct audience (probably a Linux kernel bug again).

As unlikely as it seems, EAP-rekey under load seems to be broken, at least for the wlan drivers iwlwifi (with Centrino Ultimate-N 6300) and iwl3945 (PRO/Wireless 3945ABG) up to at least kernel 3.19.

Changed 3 years ago by alexander.wetzel@…

Add EAP Rekey support in wireshark and fix group rekey handling

comment:7 Changed 3 years ago by nbd

  • Resolution set to not_a_bug
  • Status changed from new to closed

Thanks for the info.

comment:8 Changed 2 years ago by alexander.wetzel@…

  • Resolution not_a_bug deleted
  • Status changed from closed to reopened

The root cause of the problem was found, and it potentially affects every Linux device and basically all supported openwrt versions. It's a fundamental issue without a good fix, yet...
The original spec is broken for rekeys and needs workarounds in all current implementations.

This is just a heads up that rekeys with Linux clients are currently flawed and can produce very interesting errors, which are next to impossible to debug even for experienced users who have never heard about it. It's an unresolved bug upstream, caused by the broken IEEE 802.11 spec, which happens to also affect openwrt. (IEEE Std 802.11-2012 has a solution, but there seem to be no implementations of it. And for that to work both the AP and the STA must use the new method.)

I've decided to reopen the bug here to make you aware of the issue.
The correct bug description would be something like: "Rekey with linux clients broken"

In one sentence:
Any (unicast) rekey under load (no matter if EAP or PSK is used) has a good chance of freezing the connection till the next rekey, as long as at least one end of the connection is using the mac80211 stack and hardware encryption.

The root of the evil in this case was the openwrt router sending out packets encrypted with the new key but still using a PN based on the old key. This then triggers the replay attack protection on the client, which drops all frames until a rekey starts the counters over again. There is no log, debug or trace message for that... (The RX drop counter will be counting up, but I simply cannot find a way to read this value on a normal system...)
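
To make the mechanism more concrete, here is a deliberately simplified model of the receive-side replay check. It is a sketch only: the class `ReplayChecker` and the PN values are invented for illustration, it is not mac80211 code, and it assumes a single replay counter that is reset whenever a new key is installed.

```python
class ReplayChecker:
    """Simplified receive-side CCMP replay check: one counter per installed key."""

    def __init__(self) -> None:
        self.last_pn = 0   # highest PN accepted under the current key
        self.dropped = 0   # frames silently discarded as replays

    def install_key(self) -> None:
        # A re-key installs a fresh key, so the replay counter starts over.
        self.last_pn = 0

    def receive(self, pn: int) -> bool:
        # Frames whose PN is not strictly increasing are treated as replays.
        if pn <= self.last_pn:
            self.dropped += 1
            return False
        self.last_pn = pn
        return True


rx = ReplayChecker()

# Normal traffic under the old key: the PN counts up, everything is accepted.
before = [rx.receive(pn) for pn in range(1, 5001)]

# Re-key: the receiver installs the new key and resets its counter.
rx.install_key()

# Race on the sender: one frame is encrypted with the new key but
# still carries a PN continued from the old key's counter.
stale = rx.receive(5001)

# The sender then restarts its PN at 1 for the new key.
after = [rx.receive(pn) for pn in range(1, 1001)]

# The next re-key resets the counter and traffic flows again.
rx.install_key()
recovered = [rx.receive(pn) for pn in range(1, 11)]

print(all(before), stale, any(after), rx.dropped, all(recovered))
```

In this model the single frame with the stale PN is accepted and pushes the replay counter far ahead, so every correctly numbered frame after it is dropped until the next re-key. That matches the freeze-until-next-rekey pattern described above.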

There may be differences depending on the driver in use (if the PN is added by the hardware instead of mac80211 we should be safe), but with ath9k as sender and any mac80211 driver on the receiving end you will be able to reproduce this. (Other combinations will likely also be affected.)

This is caused by races between the mac80211 kernel code and the hardware encryption on the wlan card, on both the sending and the receiving side.

The only simple workaround is to either disable (unicast) rekeys for the WLAN or disable hardware encryption - falling back to software encryption - on the station AND the (linux) clients.
Windows is also affected by the problem, but has special code to recover from the situation. It only stalls for roughly one second and then accepts packets again. (You will lose a ping during rekey with a Windows client.)

There was a discussion about this on the linux-wireless list, but without a final solution:
http://www.spinics.net/lists/linux-wireless/msg136625.html

My really ugly and incomplete hack, which works only for CCMP as it is, is also posted there. It does not break the security properties as was stated there; that was a misinterpretation of the patch which was sorted out later.
A slightly updated version has been working perfectly for me for months now.
This is nothing which should be added to the kernel or openwrt by default, but it is potentially useful if you do not want to disable hardware acceleration and need a workaround now. (I've patched the Openwrt kernel and the normal Linux clients, excluding only the Samsung SmartTV. The latter still shows all the symptoms of the issue when streaming video unless I disable rekey.)
