#11894 closed defect (fixed)
Hostapd stop working - EAPOL-Key timeout
| Reported by: | raptor2101@… | Owned by: | nbd |
|---|---|---|---|
| Priority: | normal | Milestone: | Barrier Breaker 14.07 |
| Component: | packages | Version: | Trunk |
| Keywords: | hostapd | Cc: |
Description
I'm running a trunk version of OpenWRT. Since two months I'm suffering Wifi-Problems
OpenWRT Version: r32739 (installed 17.07.2012)
Router: Buffalo WZR-HP-AG300H
WLAN: 2.4 GHz WPA2 EAP-TLS via Freeradius running on Ubuntu 12.04
Problem: after starting hostapd all runs fine. After a time (24h till 3 days) suddenly clients aren't able to connect. Connected Clients still can use use the wifi-connect and stay connected till next rekeying. Rekeying also fails.
The problem can be solved by restarting hostapd (wifi down && wifi up). Clients can reconnect immediately.
Note: cause i write the logs into mysql-db the logfile is a csv-file.
/etc/config/wireless
config wifi-device radio0
option log_level 0
option type mac80211
option channel 10
option macaddr 4c:e6:76:c2:5a:e8
option hwmode 11ng
option htmode HT20
option beacon_int 150
list ht_capab SHORT-GI-40
list ht_capab TX-STBC
list ht_capab RX-STBC1
list ht_capab DSSS_CCK-40
config wifi-iface
option device radio0
option network trustedlan
option mode ap
option ssid atlantic.aqua
option encryption wpa2
option auth_server 192.168.99.2
option auth_secret #############
option nasid access-point
# option iapp_interface routerlan
if a client tries to connect wpa_supplicant logs
wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 1/3) wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 2/3) wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 3/3) wlan0: direct probe to 4c:e6:76:c2:5a:e8 timed out
last log message from hostapd
wlan0: STA 18:87:96:18:c8:4e WPA: sending 1/2 msg of Group Key Handshake wlan0: STA 18:87:96:18:c8:4e WPA: EAPOL-Key timeout wlan0: STA 1c:b0:94:1c:78:a5 WPA: sending 1/2 msg of Group Key Handshake wlan0: STA 1c:b0:94:1c:78:a5 WPA: EAPOL-Key timeout
Attachments (6)
Change History (78)
Changed 6 years ago by raptor2101@…
comment:1 Changed 6 years ago by anonymous
Problem still exist in r32878. Same log messages. THis could correlate with another error. If i try to setup a Kerberos-db on the router, the initilialisation stucks because /dev/random don't have enough entropy
comment:2 Changed 6 years ago by raptor2101@…
Ok done some further debugging. I'am running the hostapd with -dd flags and capturing managment-traffic with tcpdump.
Here are my resultes: It takes currently less then 24 hours till hostapd stop working. Logmessages don't show anything even with extended debugging. THis waths drives me crazy. The monitor-device captures incoming Probe-Requestes, Authentication-Requestes and (intern?) Acknownledge-Packages but nothing of this appears in the tracefile.
i attached some more files
comment:3 Changed 5 years ago by Del <delonly@…>
It seems we have the same problem with the same router. I will be on travel for a week, but should be able to contribute after that. Thanks for your efforts!
comment:4 Changed 5 years ago by Del <delonly@…>
This issue affects both WPA and WPA2 encryption. It only affects the 2.4GHz radio, the 5GHz radio works fine.
Seems like it is a bug related to the Atheros AR9223. The same chip is found in the Netgear routers WNDR3700 and WNDR3800 both supported by OpenWrt. This really puzzles me, why don't we see more reports?
Any hints on what to look for or test is appreciated. The fact that the problem occurs after a couple of days makes trial and error a time consuming task.
comment:5 Changed 5 years ago by raptor2101@…
You can shorten the time between two failures by increasing the "reconnect" counts and/or using Radius.
If i have "many" reconnects on my network, it need less than 24h to get a "stalled" hostapd.
Could that caused by the "broken" /dev/random? On the WZR-HP-AG300H there are no entropy supplied by this device ... (f.e. Kerberos isn't able to create a new database)
comment:6 Changed 5 years ago by jow
Could you try to hook onto hostapd with strace -p when it hangs?
Alternatively could you compile hostapd with debugging enabled, then run it in foreground in e.g. a screen session until it stalls?
comment:7 Changed 5 years ago by anonymous
at the next stall you will get a strace log...
comment:8 Changed 5 years ago by raptor2101@…
Regarding to the strace output, hostapd hangs on a "sending"-request. In my case this took over 8 hours without returning... is this kernel/driver oder hostapd related?
sendto(9, "\2\3\0\177\2\23\202\0\20\0\0\0\0\0\0\0O#\362R\206\344\301\217\242\26 \211L\27f\32"..., 131, 0, {sa_family=AF_PACKET, proto=0x888e, if9, pkttype=PACKET_HOST, addr(6)={0, 00042028ef0a}, 20
with "kill -INT" leads to a crash of hostapd, all other signals are ignored..
comment:9 Changed 5 years ago by nbd
please try the latest version of openwrt to see if it happens there as well.
comment:10 Changed 5 years ago by Del
Latest trunk is on. I opted for keeping settings when flashing, please inform me is I need to wipe the settings to get the updated hostapd config.
comment:11 Changed 5 years ago by nbd
You don't need to wipe your settings
comment:12 Changed 5 years ago by anonymous
Updated to latest version last night, waiting for next hang/crash...
comment:13 Changed 5 years ago by Del <delonly@…>
Running r33482 now since Sunday, today the wireless refuses connections on the 2.4GHz radio. The 5GHz radio still works, but that may be related to the fact that I only have one device connecting there, while a number of devices connect to the 2.4GHz radio.
comment:14 Changed 5 years ago by raptor2101@…
running r33482 it takes 24 till freeze
hostapd hangs on the following:
sendto(9, "\2\3\0\177\2\23\202\0\20\0\0\0\0\0\0\0\6\274\245\267e\305\247\356\365\362+\367\262dA+"..., 131, 0, {sa_family=AF_PACKET, proto=0x888e, if9, pkttype=PACKET_HOST, addr(6)={0, 1cb0941c78a5}, 20
comment:15 Changed 5 years ago by jow
raptor2101, can you paste a log with more context? (e.g. the last 100-200 lines leading to this specific hang). In particular it would be interesting to see what fd 9 refers to.
comment:16 Changed 5 years ago by nbd
next time it gets into that state, please paste the output of
cat /sys/kernel/debug/ieee80211/phy0/queues
comment:17 Changed 5 years ago by raptor2101@…
strace runs in background... next stale you will get both more detailed logs and the cat of queues...
comment:18 Changed 5 years ago by raptor2101@…
Ok hostapd hangs again - here is the requested output
cat /sys/kernel/debug/ieee80211/phy0/queues 00: 0x00000000/0 01: 0x00000000/0 02: 0x00000000/0 03: 0x00000000/0 cat /sys/kernel/debug/ieee80211/phy0/ath9k/regdump Segmentation fault
the newly attached file contains 500 rows of hostapd straces and logmessages, if you need more, I currently storring ~ 1GB of hostapd logtraces ;)
comment:19 Changed 5 years ago by nbd
please try the latest version, i fixed a bug that might be related to this issue.
comment:20 Changed 5 years ago by raptor2101@…
can you give me a source
http://downloads.openwrt.org/snapshots/ is empty
comment:21 Changed 5 years ago by nigelr
I'm still having this problem with r33624 on a TP-LINK TL-WDR4300, and it's only on the 2.4GHz interface.
comment:22 Changed 5 years ago by nbd
found more issues that might have been causing this issue, please try r33635 or later
comment:23 Changed 5 years ago by Del <delonly@…>
Thanks, compiling r33666 as we speak. Raptor, missing images is a good opportunity to test the excellent openwrt build system ;-) I always enjoy the make menuconfig step.
comment:24 Changed 5 years ago by Del <delonly@…>
Sorry, the problem is still there. Unable to connect this evening after a couple of days. Thanks for the efforts, just keep them coming :-) Let me know if there is anything else I can do.
comment:25 Changed 5 years ago by nbd
followup fix committed in r33695, should work now.
comment:26 Changed 5 years ago by Del
r33725 is installed, will report back in couple of days. Thanks!
comment:27 Changed 5 years ago by Del <delonly@…>
Sorry, the connection was not able to connect a couple of minutes ago. This time I actually needed to reboot the router to get everything up again, /etc/init.d/network restart did not work. This has not happened before.
comment:28 Changed 5 years ago by nigelr
Yep, I'm still having the same problem after a couple of days on r33780.
comment:29 Changed 5 years ago by Del <delonly@…>
I have no idea why my router needed a reboot a week ago. Right now I get this:
root@OpenWrt:~# uptime
15:04:32 up 7 days, 23:59, load average: 0.00, 0.01, 0.04
and wireless is working nicely without my intervention. Fantastic, it seems like this issue is finally resolved. You did it again nbd, huge thanks! Please close the report.
comment:30 Changed 5 years ago by nbd
- Resolution set to fixed
- Status changed from new to closed
thanks for testing.
comment:31 Changed 5 years ago by Del <delonly@…>
- Resolution fixed deleted
- Status changed from closed to reopened
Whatever you did (I haven't looked at the commits), it did have an effect. However, after ten days of flawless wireless operation the problem re-surfaced. I am unable to connect this evening. Immediately after running /etc/init.d/network restart every client can connect again. Still it is the 2.4GHz radio giving problems, while the 5GHz (which is far less used) still accepts connections.
comment:32 Changed 5 years ago by Del <delonly@…>
Had a look at the patch, it seems you changed dev_kfree_skb_irq(skb); in favour of ieee80211_free_txskb(&local->hw, skb);
Not sure what effect it has, but there is still two places left where dev_kfree_skb_irq(skb); is used. One in iface.c and one in status.c
Does it make sense to change those too?
comment:33 Changed 5 years ago by nbd
No, those should not be changed.
comment:34 Changed 5 years ago by matt@…
I'm still seeing this issue in the latest Attitude Adjustment. After about 24 hours wireless clients can no longer connect to the 2.4ghz network. As reported by others, the 5ghz network is fine.
root@brooklyn:~# cat /sys/kernel/debug/ieee80211/phy0/queues
00: 0x00000000/0
01: 0x00000000/0
02: 0x00000000/0
03: 0x00000000/0
comment:35 Changed 5 years ago by nbd
I committed a fix today, try updating to latest attitude_adjustment svn.
comment:36 Changed 5 years ago by matt@…
Compiling and installing now. I'll check back in a couple of days.
Thanks!
comment:37 Changed 5 years ago by nigelr
I'm still having the problem with r33939 on AA branch.
comment:38 Changed 5 years ago by nbd
please try latest trunk or AA, the fix in r33991/r33992 might be relevant
comment:39 Changed 5 years ago by nbd
- Owner changed from developers to nbd
- Status changed from reopened to accepted
comment:40 Changed 5 years ago by Del <delonly@…>
Just installed a freshly baked r33991 (seems that was the latest in trunk). Thanks!
comment:41 Changed 5 years ago by nigelr
I've been running AA r34018 for 3 days with no problems so far. Fingers crossed it stays that way.
comment:42 Changed 5 years ago by syzop
I was about to post a new bug report, but then noticed this one.
I've 15 WNDR3700 access points deployed at a high school and we're experiencing the same issue: clients can no longer connect to the access point after a while. Even with hourly wifi down/up cronjobs it's not working well enough, apparently sometimes they even go unreachable within the hour.
I should point out that, since it's a highschool, we have a lot of associate/disassociate events when students migrate from class to class. And even when they are all IN class we still have 1 login every few seconds (globally).
We're running r33288 at the moment.
I'll see if I can upgrade the access points next week(s) and let you know if latest trunk fixes things for us. If so, that would be great!
comment:43 Changed 5 years ago by Del <delonly@…>
Five days of stable wireless and counting.
comment:44 Changed 5 years ago by Del <delonly@…>
Sorry, had to restart wireless this evening. The bug is still present as of r33991.
comment:45 Changed 5 years ago by syzop
comment:46 Changed 5 years ago by matt@…
Just had to restart after 5 days uptime running r34056.
comment:47 Changed 5 years ago by nbd
found more issues of the same type. please try trunk or AA r34134, it should hopefully finally fix this.
comment:48 Changed 5 years ago by matt@…
Things seem to have gotten a bit worse for some 2.4ghz devices. Others seem to be working fine. I have an Android phone that can no longer ping the gateway after about 15 minutes of inactivity. I'm seeing a bunch of deauthenticated due to inactivity (timer DEAUTH/REMOVE) messages in the logs, which is something I haven't seen before.
comment:49 Changed 5 years ago by syzop
comment:50 Changed 5 years ago by matt@…
It got worse with r34134. Pretty sure nothing changed on the Android device, and I've tried rebooting it with no change. One weird thing I noticed is that another desktop was unable to detect the 2.4ghz network while the Android device was unable to connect. Restarting wireless on the Android device restored connectivity, but it went away a few minutes later. I've switched my laptop over to the 2.4ghz network to try and replicate, but my guess is the issue is related to idle clients, so I'll need to let my laptop sit for a bit.
comment:51 Changed 5 years ago by syzop
comment:52 Changed 5 years ago by syzop
No improvement here with r34122.
I removed my hourly wifi restart cron job, so I could see how long it would survive without restarting wireless.
Access point worked fine, but within minutes when the 800+ students arrived you couldn't connect anymore (still a 5/5 signal).
Instead of restarting wifi to 'fix' things, I tried to debug a little.
This is what I found:
- The access point still broadcasts it's SSID (both on 2.4 and 5GHz)
- Connecting with one laptop works
- Connecting with another laptop didn't work
- Then I noticed that the laptop where things did work was only associated on the 5GHz side. So I did a little test: when I disabled the 2.4GHz radio on the laptop, it would still connect. But when I enable the 2.4 and disable the 5GHz it would no longer connect. So clearly a 2.4GHz issue.
- I did not see any hostapd messages in syslog when I was unable to connect, nor any other syslog message.
So, and I think this is just confirming what everyone else already said, it seems hostapd on the 2.4GHz radio stops accepting new clients, it doesn't log anything.
Then when I killed (-15) hostapd of the 2.4GHz interface (wlan0) and started it in debug mode on the console, things worked again.
What can I do to help?
I wanted to strace hostapd but strace isn't installed, and there's no package by that name.
Off-topic: nothing new, but this dev site is often unreachable, which is quite annoying.
comment:53 Changed 5 years ago by syzop@…
Don't think it helps, but, file attached with the last hostapd messages on this particular access point.
After this nothing is logged anymore with regards to the wlan0* interface.
comment:54 Changed 5 years ago by syzop@…
On second thought, I've attached the full syslog of the access point since last night.
As said, the last message about wlan0 was on 10:11:06, then nothing anymore until I restart hostapd on 2.4GHz (wlan0) on 14:14:XX.
There are a number of ath: phy0: keycache entry 228 out of range messages. I don't know if they are related to this problem.
If you need anything else, let me know!
comment:55 Changed 5 years ago by Kufat
I've also noticed that this problem occurs on 2.4 but not 5. Of course, this might be an issue with some slightly misbehaving 2.4GHz client(s) common to both of our networks causing the state rather than being an issue with the 2.4 driver/firmware itself, but it could be worth investigating further.
If anybody needs any additional information gathered to help resolve this ticket, I'm almost always in #openwrt on freenode.
comment:56 Changed 5 years ago by anonymous
Clarification:
It is a bug in ath9k, but my point was that it might not be a bug specific to 2.4GHz code in ath9k. It could be an issue in code that's common to 2.4 and 5 that's only triggered in 2.4GHz due to client behavior.
comment:57 Changed 5 years ago by syzop@…
Could be that some specific 2.4GHz-client(s) trigger this.
Another possibility is that, because there are more clients using 2.4GHz (which don't have 5GHz capability), this larger number clients, and thus higher amount of associate/disassociate events, might cause the problem.
Just a few theories. Just to say that I concur that it doesn't have to be band related at all (code-wise / driver-wise), it just gets triggered on the 2.4GHz band all the time.
I'd also be happy to try any suggestions by a dev or anyone else. I'm really committed to solving this issue. Just tell me what you need / want me to do.
comment:58 Changed 5 years ago by Kufat
syzop, I should've mentioned that I usually only have one client on my 2.4GHz network (HTC Droid Incredible, an Android phone.)
comment:59 Changed 5 years ago by syzop@…
Oh really, that's interesting... maybe it isn't related at all to number of clients then. Or there are multiple issues.
I added my restart wifi cron job again (every 30 minutes) but now I got kernel dumps and wireless going down sometimes, I've opened a new bug #12480 for this since I don't think it's the same issue: it happens within seconds after wifi is restart.
comment:60 Changed 5 years ago by Aaaron16
I'm also experiencing this issue using ath9k_htc devices. In my application I have vehicle access points where most of the time no one is attempting to connect. Then after several days when someone tries to attach, it only works by restarting hostapd.
comment:61 Changed 5 years ago by syzop@…
Just adding some references to previous bugs which *seem* related:
#7363, #12144, #12159,
https://forum.openwrt.org/viewtopic.php?id=24350&p=3
And these may be unrelated, but are current ath9k issues on WNDR3700:
#11862 (Failed to stop TX DMA..) and #12480 (WARNING..at ath9k_beacon_tasklet)
Is it already clear if this is a driver issue or a hostapd issue?
To my knowledge it's still broadcasting it's SSID when this happens, does this mean anything? (eg: does this rule out a driver issue? or not?)
comment:62 Changed 5 years ago by Del <delonly@…>
I believe it is difficult to make categorical statements as to where the bug is until it is resolved. Again I have now had five days of stable wireless (and counting), this time with r34165. I do not believe I saw more the three days until nbd started working on it, so my best guess is to look at his patches to see where the problem is at. I just pointed another user to the relevant documentation here: https://dev.openwrt.org/ticket/12349#comment:8 Seems you are in a better position than the rest of us with respect to testing patches (trial and error is rather cumbersome with five days to wait), so feel free to dig in the code and test stuff out :)
comment:63 Changed 5 years ago by syzop@…
I can test latest trunk next week. I've succesfully generated 'my own image' using the openwrt build system earlier today.
My plan is to run hostapd in debugging mode and (s)trace it as was done earlier.
Not sure what else I can do at this point...
And yes, you're right, I can reproduce this issue within hours, or at least every working day for sure.
I didn't notice any improvement, though, between versions, unlike you (?).
comment:64 Changed 5 years ago by Del <delonly@…>
I am carefully optimistic, wireless still stable and:
root@OpenWrt:~# uptime
18:05:40 up 8 days, 19:31, load average: 0.00, 0.01, 0.04
Normal usage, five clients connecting multiple times each day on the 2.4GHz radio. Still running r34165.
comment:65 Changed 5 years ago by Del <delonly@…>
Fourteen days of stable wireless today. Hence I belive this bug is fixed, and the report can be closed. Syzop, it will be interesting to hear if this also solves your problems.
comment:66 Changed 5 years ago by syzop@…
I've upgraded two access points to r34200, they remained OK for 2 working days (now it's weekend).
I'm going to upgrade at least 6 more today. If by the end of the week they are still good, then I'm confident that this issue has been nailed (and then I'd be a very happy man ;p).
I'm just being cautiously optimistic, given that previously it also seemed resolved a number of times and then turned out not to be. This time it *looks* different, but 2 days is quite short to be sure, even for me ;)
I also wonder about Kufat and matt.
comment:67 Changed 5 years ago by Kufat
I've upgraded to r34336 and will let everyone know how it goes.
comment:68 Changed 5 years ago by matt@…
Things have stabilized here. I'm not sure what to say about the Android device as I've not upgraded since installing r34134.
comment:69 Changed 5 years ago by syzop@…
I have two users who can no longer connect after the firmware upgrade, but I'm not convinced yet if it's really related. Could be another problem, like maybe I made a config mistake when I did the upgrades.
Anyway: when I check out the logs of hostapd and look at the associate/disassociate events I can see the radio of all AP's are up throughout the day (both 2.4 and 5.X GHz). Previously hostapd would just stop logging any activity on some interfaces (usually in the morning already).
So, I'm pretty confident that the bug has been resolved. Thanks everyone! :)
comment:70 Changed 5 years ago by nbd
- Resolution set to fixed
- Status changed from accepted to closed
comment:71 Changed 5 years ago by Kufat
Well, it's post-resolution, but I wanted to say that the problem also seems to be resolved over here. Thanks for all the hard work, nbd (and all others who've helped!)
comment:72 Changed 4 years ago by jow
- Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07
Milestone Attitude Adjustment 12.09 deleted

CSV LogMessages from hostapd