Modify

Opened 6 years ago

Closed 6 years ago

Last modified 4 years ago

#10630 closed defect (fixed)

Router crashes when trying to log in LuCI after idle time

Reported by: hnyman <hannu.nyman@…> Owned by: jow
Priority: response-needed Milestone: Barrier Breaker 14.07
Component: base system Version: Trunk
Keywords: login crash hang reboot Cc:

Description

My ar71xx/WNDR3700v2 router with recent trunk builds has crashed several times, when I have tried to access the router's LuCI GUI with browser. This has now happened with builds 29520, 29529 and 29557.

Build environment has been completely rebuilt and settings reseted, so it should not be due bad compile due to untidy build environment.

The scenario: Router has been running happily for several hours. The network might have had some traffic or been almost idle. When I then try to access LuCI GUI with browser, the router loads the black background with "LuCI - Lua configuration interface", but the main login screen never surfaces. Browser waits a few seconds, during which the router is non-accessible also with SSH. Then the router reboots.

As the router reboots, there are no useful logs.

Might be something with uhttpd? Or maybe related to the recently disappeared realtime graphs (http://luci.subsignal.org/trac/ticket/363)?

Attachments (2)

OpenwrtMemUsageCapture2.JPG (52.2 KB) - added by hnyman <hannu.nyman@…> 6 years ago.
Memory usage starts to grow after enabling iwinfo section in statistics
OpenWrtMemUsage.txt (8.1 KB) - added by hnyman <hannu.nyman@…> 6 years ago.
htop log showing memory usage growth

Download all attachments as: .zip

Change History (20)

comment:1 in reply to: ↑ description ; follow-up: Changed 6 years ago by Nilfred <nilfred@…>

Replying to hnyman:

happened with builds 29520, 29529 and 29557.

That's suspicious: r29558, r29560. Maybe already fixed as #10622

comment:2 in reply to: ↑ 1 Changed 6 years ago by hnyman <hannu.nyman@…>

Replying to Nilfred <nilfred@…>:

That's suspicious: r29558, r29560. Maybe already fixed as #10622

Apparently not the reason. Build 29560 seems to have crashed a few minutes ago by itself. (When I logged into the router, it said uptime to be 7 minutes and logs supported that.) No traffic, no load to the router. Might be that my initial thinking it to be directly related to LuCI login to be wrong.

(And as I am using wired connection to the router, the crash is most likely not due to the WLAN drivers.)

comment:3 follow-up: Changed 6 years ago by jow

I think you're simply out of memory or heavy disk io forces the router to reboot. In my case luci-statistics caused it, and no it cannot be "fixed" as its a simple resource constraint issue imo.

comment:4 in reply to: ↑ 3 Changed 6 years ago by hnyman <hannu.nyman@…>

Replying to jow:

I think you're simply out of memory or heavy disk io forces the router to reboot. In my case luci-statistics caused it, and no it cannot be "fixed" as its a simple resource constraint issue imo.

Out of memory or heavy disk IO sounds a bit strange, as the router had been sitting idle for some hours before the crash, with practically no traffic (no active LAN devices) and no additional disk activity, just the load caused by OpenWRT.

I do have the new iwinfo based luci-statistics, (and that is mayde the largest change compared to the December 7th version, which has been the last one not to crash.)

In normal conditions, the router says that there is 44% free memory and load is between 0.00 and 0.05. I have a rather minimal system, and it has been rock solid until last week.

comment:5 Changed 6 years ago by hnyman <hannu.nyman@…>

After disabling the iwinfo wireless tab in luci-statistics, the router was quite stable for several days with revision 29571.

Changed 6 years ago by hnyman <hannu.nyman@…>

Memory usage starts to grow after enabling iwinfo section in statistics

Changed 6 years ago by hnyman <hannu.nyman@…>

htop log showing memory usage growth

comment:6 Changed 6 years ago by hnyman <hannu.nyman@…>

@jow:
you might be right about a possible memory constraint.

But it seems to be related to having the iwinfo wireless stats section enabled in Luci statistics. Without that section, the memory usage was quite stable for two days.

However, after enabling iwinfo wireless stats, the amount of resident memory used by collectd processes (and also their virtual size) kept steadily creeping up. The following chart shows 24 hours, of which 2.5 last hours has iwinfo wireless stats enabled and constantly growing memory usage.

Memory usage starts to grow after enabling iwinfo section in statistics

Key output from htop:

root@OpenWrt:~# htop

  CPU[||||                         7.2%]     Tasks: 20, 4 thr; 2 running
  Mem[||||||||||||||||||||||    20/60MB]     Load average: 0.14 0.53 0.37
  Swp[                            0/0MB]     Uptime: 2 days, 03:00:55

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    1 root      20   0  1488   424   352 S  0.0  0.7  0:00.79 init
 7650 root      20   0 15556  5288   688 S  0.0  8.5  0:05.46  `- /usr/sbin/collectd
 7654 root      20   0 15556  5288   688 S  0.0  8.5  0:02.39  |   `- /usr/sbin/collect
 7653 root      20   0 15556  5288   688 S  0.0  8.5  0:02.31  |   `- /usr/sbin/collect
 7652 root      20   0 15556  5288   688 S  0.0  8.5  0:00.66  |   `- /usr/sbin/collect
 7651 root      20   0 15556  5288   688 S  0.0  8.5  0:00.08  |   `- /usr/sbin/collect
 2594 root      20   0   792   248   192 S  0.0  0.4  0:00.40  `- /sbin/rngd -r /dev/ur
 2591 root      20   0  1484   304   228 S  0.0  0.5  0:00.60  `- /sbin/watchdog -t 5 /
 2557 nobody    20   0   948   468   364 S  0.0  0.8  0:23.96  `- /usr/sbin/dnsmasq -K
 2530 root      20   0  1000   216   152 S  0.0  0.3  0:00.00  `- /usr/sbin/vsftpd
 2526 root      20   0  1048   440   320 S  0.0  0.7  0:08.13  `- /usr/sbin/uhttpd -f -


  CPU[||||||||||||||||||          52.0%]     Tasks: 20, 4 thr; 1 running
  Mem[||||||||||||||||||||||    20/60MB]     Load average: 0.01 0.05 0.16
  Swp[                            0/0MB]     Uptime: 2 days, 03:15:52

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    1 root      20   0  1488   424   352 S  0.0  0.7  0:00.79 init
 7650 root      20   0 16048  5780   688 S  0.0  9.3  0:06.04  `- /usr/sbin/collectd
 7654 root      20   0 16048  5780   688 S  0.0  9.3  0:02.69  |   `- /usr/sbin/collect
 7653 root      20   0 16048  5780   688 S  0.0  9.3  0:02.58  |   `- /usr/sbin/collect
 7652 root      20   0 16048  5780   688 S  0.0  9.3  0:00.68  |   `- /usr/sbin/collect
 7651 root      20   0 16048  5780   688 S  0.0  9.3  0:00.08  |   `- /usr/sbin/collect
 2594 root      20   0   792   248   192 S  0.0  0.4  0:00.46  `- /sbin/rngd -r /dev/ur
 2591 root      20   0  1484   304   228 S  0.0  0.5  0:00.60  `- /sbin/watchdog -t 5 /
 2557 nobody    20   0   948   468   364 S  0.0  0.8  0:24.44  `- /usr/sbin/dnsmasq -K
 2530 root      20   0  1000   216   152 S  0.0  0.3  0:00.00  `- /usr/sbin/vsftpd
 2526 root      20   0  1048   440   320 S  0.0  0.7  0:08.73  `- /usr/sbin/uhttpd -f -

###
### After turning iwinfo section off
###

root@OpenWrt:~# htop

  CPU[||                           2.6%]     Tasks: 20, 4 thr; 1 running
  Mem[|||||||||||||||||||       16/60MB]     Load average: 0.12 0.05 0.09
  Swp[                            0/0MB]     Uptime: 2 days, 03:33:21

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    1 root      20   0  1488   424   352 S  0.0  0.7  0:00.79 init
26173 root      20   0 11008   908   600 S  0.0  1.5  0:00.05  `- /usr/sbin/collectd
26177 root      20   0 11008   908   600 S  0.0  1.5  0:00.00  |   `- /usr/sbin/collect
26176 root      20   0 11008   908   600 S  0.0  1.5  0:00.00  |   `- /usr/sbin/collect
26175 root      20   0 11008   908   600 S  0.0  1.5  0:00.02  |   `- /usr/sbin/collect
26174 root      20   0 11008   908   600 S  0.0  1.5  0:00.03  |   `- /usr/sbin/collect
 2594 root      20   0   792   248   192 S  0.0  0.4  0:00.56  `- /sbin/rngd -r /dev/ur
 2591 root      20   0  1484   304   228 S  0.0  0.5  0:00.60  `- /sbin/watchdog -t 5 /
 2557 nobody    20   0   948   468   364 S  0.0  0.8  0:24.48  `- /usr/sbin/dnsmasq -K
 2530 root      20   0  1000   216   152 S  0.0  0.3  0:00.00  `- /usr/sbin/vsftpd
 2526 root      20   0  1048   440   320 S  0.0  0.7  0:09.50  `- /usr/sbin/uhttpd -f -

root@OpenWrt:~# htop

  CPU[|||||||||||||               37.9%]     Tasks: 20, 4 thr; 1 running
  Mem[|||||||||||||||||||       16/60MB]     Load average: 0.04 0.04 0.05
  Swp[                            0/0MB]     Uptime: 2 days, 03:47:05

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    1 root      20   0  1488   424   352 S  0.0  0.7  0:00.79 init
26173 root      20   0 11008   908   600 S  0.0  1.5  0:00.06  `- /usr/sbin/collectd
26177 root      20   0 11008   908   600 S  0.0  1.5  0:00.00  |   `- /usr/sbin/collect
26176 root      20   0 11008   908   600 S  0.0  1.5  0:00.00  |   `- /usr/sbin/collect
26175 root      20   0 11008   908   600 S  0.0  1.5  0:00.03  |   `- /usr/sbin/collect
26174 root      20   0 11008   908   600 S  0.0  1.5  0:00.03  |   `- /usr/sbin/collect
 2594 root      20   0   792   248   192 S  0.0  0.4  0:00.61  `- /sbin/rngd -r /dev/ur
 2591 root      20   0  1484   304   228 S  0.0  0.5  0:00.60  `- /sbin/watchdog -t 5 /
 2557 nobody    20   0   948   468   364 S  0.0  0.8  0:24.76  `- /usr/sbin/dnsmasq -K
 2530 root      20   0  1000   216   152 S  0.0  0.3  0:00.00  `- /usr/sbin/vsftpd
 2526 root      20   0  1048   440   320 S  0.0  0.7  0:10.01  `- /usr/sbin/uhttpd -f -

comment:7 Changed 6 years ago by jow

  • Owner changed from developers to jow
  • Status changed from new to accepted

it appears to be a leak in libiwinfo then, thanks for looking into it. I'll investigate this

comment:8 Changed 6 years ago by jow

Apprently libnl-tiny had a leak, I couldn't find any in libiwinfo.
Please test whether the same setup with libnl-tiny after r29720 still experiences the same memory increase.

comment:9 Changed 6 years ago by hnyman <hannu.nyman@…>

Sorry, it didn't help with build 29722. Memory usage by collectd process starts to creep up when the wireless plugin is turned on in Luci statistics. When it is turned off, the memory usage of collectd is stable.

(I didn't change anything in my config, and libnl-tiny seems to be included. I refreshed the source and then rebuilt everything after make dirclean.)

I have tried to find a leak in iwinfo sources, but no luck so far.

I thought that the next step might be to separate the wireless stats queries from each other, so that the possibly leaking query might be identified (power, stationcount, etc.). What would be the correct place to do that?

comment:10 follow-up: Changed 6 years ago by jow

  • Priority changed from normal to response-needed

Well there was a well hidden leak... please see if r29723 finally solves it.
At least in my local test vm the memory consumption remains stable while previously my testcase used to consume 4KB / interface / invocation.

comment:11 follow-up: Changed 6 years ago by Valent Turkovic <valent.turkovic@…>

Hi,
I have setup openwrt on tp-link TL-WR1043ND and I'm extremely satisfied. One thing I noticed is that it reboots every 11 hours. Luckily I have collected some info so it looks like a memory issue.

Take look at this graph:
http://dl.dropbox.com/u/184632/wr1043-free-memory.png

You can see how memory gets used up and when there is no more free memory it crashes and reboots, it needs exactly 11 hours for memory to run out.

Is anybody else seeing this issue?

I'm running luci, luci-statistics with collectd and rrrdtool, is some of them having memory issues and could be causing this?

Could stopping and starting collectd service every 6 hours fix this issue? I could experiment further and see if some specific module in collectd is causing the issue.

comment:12 in reply to: ↑ 11 Changed 6 years ago by Valent Turkovic <valent.turkovic@…>

Replying to Valent Turkovic <valent.turkovic@…>:

My issue was with one collectd plugin and it is fixed in svn now.

comment:13 in reply to: ↑ 10 Changed 6 years ago by hnyman <hannu.nyman@…>

Replying to jow:

Well there was a well hidden leak... please see if r29723 finally solves it.
At least in my local test vm the memory consumption remains stable while previously my testcase used to consume 4KB / interface / invocation.

I made a new build 29844 and the memory usage seems to be stable. So far just a quick test, but htop shows constant numbers. So, I guess that you got it fixed ;-)

comment:14 Changed 6 years ago by hnyman <hannu.nyman@…>

@jow:

is that memleak patch needed also for the Backfire version of iwinfo? I didn't check the logic any further, but the function nl80211_close looks pretty similar as in trunk, so the patch might be necessary.

(Backfire iwinfo has been synced with trunk only upto r29417, so the patch might be unnecessary, if it is only needed with stuff introduced after that.)

comment:15 Changed 6 years ago by hnyman

@jow:
Additional issue:

I have noticed twice or thrice while testing this, that for some reason either collectd or rrdtool stops saving data, so that the statistics charts start to show empty data from that point onward. Graph timeline goes normally.

That happened also yesterday when testing with the newest version. I started the router with iwinfo disabled in statistics options, then after a while I turned it on through LuCI options, Luci automatically restarted at least the collectd process and after working normally for a while, maybe after an hour later the statistics started to show empty.

Previously, during my vacation, the statistics had been nicely colledted during the week, so I find it hard to believe that there would be hidden instability. I am wondering if this could be related to the Luci statistics restart. Should also rrdtool be restarted after statstics options are changed? (Or should that restart happen automatically in any case?)

System log shows nothing.

My pure guess is that something happens to the data connection. Either collectd does not send data, or rrdtool does not save it.

Any debug advice or ideas?
Have you noticed similar to happen?

comment:16 Changed 6 years ago by jow

No idea on this, sorry. I suggest to simply not touch it if you want it reliable.

comment:17 Changed 6 years ago by jow

  • Resolution set to fixed
  • Status changed from accepted to closed

Closing this ticket as the original issue has been resolved.

comment:18 Changed 4 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted

Add Comment

Modify Ticket

Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.