Modify

Opened 22 months ago

Last modified 5 months ago

#22156 new defect

Race condition causes netifd to fail when bringing up network interface

Reported by: alan.christopher.jenkins@… Owned by: developers
Priority: normal Milestone:
Component: base system Version: Chaos Calmer 15.05
Keywords: Cc:

Description

I finally figured out what caused the 5Ghz network on my box to fail to start! I've shown to my satisfaction a race condition, which must be a bug in the base system. Next challenge: work out what race condition there is, that could cause the error. "Could not set interface wlan1-1 flags (UP): Device or resource busy".

Maybe netifd isn't serializing something it should be?

This is a Netgear WNDR3800. I have two 5Ghz networks (SSIDs) configured on the same radio.

It's happening on 15.05. I've now tried upgrading to 15.05.1 for the netifd update, but it didn't help.

I think I've included all the important details here. The configuration is described in more detail at http://superuser.com/questions/1060845/openwrt-on-wndr3800-5ghz-wifi-shows-as-disabled/1060846#1060846

Error log:

Sun Apr 3 15:02:19 2016 user.notice SQM: Starting simple.qos
Sun Apr 3 15:02:19 2016 user.notice SQM: ifb associated with interface pppoe-wan:
Sun Apr 3 15:02:19 2016 user.notice SQM: Currently no ifb is associated with pppoe-wan, this is normal during starting of the sqm system.
Sun Apr 3 15:02:19 2016 daemon.notice netifd: radio1 (9031): wlan1: ACS-COMPLETED freq=5320 channel=64
Sun Apr 3 15:02:19 2016 daemon.notice netifd: radio1 (9031): Using interface wlan1 with hwaddr 74:44:01:86:42:d6 and ssid "VOYAGER2091-90-jenkins"
Sun Apr 3 15:02:20 2016 user.notice SQM: Squashing differentiated services code points (DSCP) from ingress.
Sun Apr 3 15:02:21 2016 kern.info kernel: [ 199.510000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): Could not set interface wlan1-1 flags (UP): Device or resource busy
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): Failed to add BSS (BSSID=76:44:01:86:42:d6)
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): Interface initialization failed
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): wlan1: interface state ACS->DISABLED
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): wlan1: AP-DISABLED
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): ACS: Possibly channel configuration is invalid, please report this along with your config file.
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): ACS: Failed to start
Sun Apr 3 15:02:21 2016 daemon.notice netifd: radio1 (9031): wlan1: AP-DISABLED

Running /etc/init.d/network restart on the router generates the same error.

Running ifdown wifi_a_guest and then ifup wifi_a_guest seems to fix everything until the next reboot.

Disabling sqm-scripts (improved version of qos-scripts) resolved the 5Ghz problem permanently. Of course, I would like to find a way to have sqm working :). I know that sqm has a fairly slow script (it can take a number of seconds) that runs when its network interface is brought up.

Since sqm is not configured to touch the wireless interfaces, I was inclined to blame a race condition in OpenWRT's homegrown netifd. Indeed, I was able to reproduce the failure even after replacing the implementation of sqm with a busy loop of about 3 seconds. It's not just the delay: using sleep 3 didn't reproduce the failure.

It also turned out that network restart is slightly different from boot. It only seems to fail when I have both sqm _and_ miniupnpd enabled.

Attachments (2)

log.txt (18.7 KB) - added by alan.christopher.jenkins@… 22 months ago.
Full output from logread
log.2.txt (18.2 KB) - added by alan.christopher.jenkins@… 22 months ago.
logread, after disabling ntp to avoid jump in timestaps (only pointed out because pppd detected it, *grr*)

Download all attachments as: .zip

Change History (6)

Changed 22 months ago by alan.christopher.jenkins@…

Full output from logread

comment:1 Changed 22 months ago by alan.christopher.jenkins@…

Looking at the full system log, there may be a reason this happens with 5Ghz and not 2.4Ghz.

5Ghz warns "this may take a bit" [of time] for automatic channel selection (presumably radar detection). The timing is inherently different. It's the sort of thing that could affect a race condition.

I'm not saying it leads to a longer window of vulnerability to races. I can't see how that would be the case. But it means the 5Ghz radio comes up distinctly later than the 2.4Ghz one.

Mon Apr 4 08:38:14 2016 daemon.notice netifd: radio1 (1316): Configuration file: /var/run/hostapd-phy1.conf
Mon Apr 4 08:38:14 2016 kern.info kernel: [ 25.350000] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
Mon Apr 4 08:38:14 2016 daemon.notice netifd: radio1 (1316): wlan1: interface state UNINITIALIZED->COUNTRY_UPDATE
Mon Apr 4 08:38:14 2016 daemon.notice netifd: radio1 (1316): ACS: Automatic channel selection started, this may take a bit
Mon Apr 4 08:38:14 2016 daemon.notice netifd: radio1 (1316): wlan1: interface state COUNTRY_UPDATE->ACS
Mon Apr 4 08:38:14 2016 daemon.notice netifd: radio1 (1316): wlan1: ACS-STARTED

2.4Ghz:

Mon Apr 4 08:38:13 2016 daemon.notice netifd: radio0 (1315): Configuration file: /var/run/hostapd-phy0.conf
Mon Apr 4 08:38:13 2016 kern.info kernel: [ 24.780000] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Mon Apr 4 08:38:13 2016 daemon.notice netifd: radio0 (1315): wlan0: interface state UNINITIALIZED->COUNTRY_UPDATE
Mon Apr 4 08:38:13 2016 daemon.notice netifd: radio0 (1315): Using interface wlan0 with hwaddr 74:44:01:86:42:d4 and ssid "VOYAGER2091-90-jenkins"

Changed 22 months ago by alan.christopher.jenkins@…

logread, after disabling ntp to avoid jump in timestaps (only pointed out because pppd detected it, *grr*)

comment:2 Changed 22 months ago by sourcejedi

Arg. Now on 15.05.1, I can't get 5Ghz working anymore, either with "ifup wifi_a_guest", or by disabling SQM and miniupnpd. This means there's no evidence of a race-condition anymore. :(

I _can_ get 5Ghz working if I only enable one SSID. (Doesn't matter which one). That's still pretty freaky, so hopefully it's a good starting point.

I also learned that the error message comes from hostapd, and it's printed immediately after an ioctl() call which is supposed to bring the device up.

Apologies for any confusion. At this point I would close this bug and start again, but I can't. (I created it without logging in, so that might be why).

comment:3 Changed 22 months ago by sourcejedi

If I set a fixed channel e.g. 36, I can enable multiple SSIDs.

Summary so far:

On 15.05, creating a second, "guest" network a.k.a. VAP on 5Ghz breaks, if you've set up sqm-scripts (the improved version of qos-scripts). The conflict is not specific to sqm-scripts, because it can be reproduced by replacing them with a busy-loop. So it must be a race condition in the base system.

On 15.05.1, creating the guest network on 5Ghz breaks, unless you set a fixed channel. This is a regression. It seems an incredibly odd constraint; hopefully it's a bug that can be resolved.

comment:4 Changed 5 months ago by sourcejedi

I'm now able to use multiple 5Ghz SSIDs, on "auto" or fixed channels, after upgrading to LEDE 17.02.1.

Add Comment

Modify Ticket

Action
as new .
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.