
Opened 10 years ago

Closed 10 years ago

Last modified 9 years ago

#3019 closed defect (fixed)

AR7 keeps hanging after a while

Reported by: anonymous
Owned by: developers
Priority: highest
Milestone: Kamikaze 8.09 RC1
Component: kernel
Version:
Keywords:
Cc:

Description

I thought this problem was something I did, but since a few people on #openwrt are having the same problem, I decided to open a ticket.
I have a DG834v2 and it keeps hanging after a while [once or twice a day]; I can never get a full day of uptime, it always hangs. People with D-Link devices are having the same problem. Is anything wrong with the ar7 port? I am using r10082.

Attachments (2)

dlink504T-boot-r10444.txt (8.2 KB) - added by War3333 10 years ago.
Boot Log with 2.6.24.2
140-cpmac_fix.patch (6.9 KB) - added by matteo 10 years ago.


Change History (64)

comment:1 Changed 10 years ago by anonymous

Is there a way to see the output of the console?
Maybe the kernel is panicking...

comment:2 Changed 10 years ago by nabcore

I'm running r10141 on a DG834G(v2) and have not had any stability issues. This bug report has too little information in it to do anything useful with it, and the user has not given his/her name, so it's not possible to contact them on IRC. I think this should be closed for these reasons and re-opened if more concrete information comes to light.

comment:3 Changed 10 years ago by silenec

My dlink G664T hangs every 2-4 hours. I'm running r10135, here is the output of the serial console before it hangs: http://openwrt.pastebin.com/m5a4d2cf9

comment:4 Changed 10 years ago by nbd

Are you using any program that uses a lot of memory? The crash seems to happen because you run out of it.
Please leave your device running for a while and then run 'cat /proc/meminfo', so that we can find out whether the memory is leaked in kernel or in user space.
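A minimal sketch of how the /proc/meminfo output could be condensed for that comparison. The function name `meminfo_summary` and the output layout are illustrative only, and it assumes an awk is available on the device. As a rough guide, growth in Slab/SUnreclaim points at kernel allocations, while growth in AnonPages points at user-space processes.

```shell
#!/bin/sh
# Summarise a meminfo-style file: kernel-side slab usage versus
# user-side anonymous pages. All values are in kB, as in /proc/meminfo.
# The function name and output format are hypothetical, not part of OpenWrt.
meminfo_summary() {
    file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemFree:"    { free = $2 }
        $1 == "Slab:"       { slab = $2 }
        $1 == "SUnreclaim:" { unrecl = $2 }
        $1 == "AnonPages:"  { anon = $2 }
        END {
            printf "free=%d slab=%d sunreclaim=%d anon=%d\n", free, slab, unrecl, anon
        }
    ' "$file"
}
```

Running `meminfo_summary` with no argument reads the live /proc/meminfo; passing a file name reads a saved copy instead, so samples taken at different uptimes can be compared side by side.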

comment:5 Changed 10 years ago by silenec

I use x-wrt with qos-scripts and pppoe connection, nothing else.
cat /proc/meminfo after 2:00 hours of uptime: http://openwrt.pastebin.com/m72a6cad2

comment:6 Changed 10 years ago by silenec

after 2:15 of uptime http://openwrt.pastebin.com/m496a7973

  • it crashed again ~5 mins after that...

comment:7 Changed 10 years ago by colchaodemola <colchaodemola@…>

Hi, I also have a DG834v2 [not G] and I have given up on OpenWrt because it was always hanging too. Since the watchdog does not work on it, I had to reboot manually, which I cannot do every time.
I was using pppoe + qos-scripts.
If someone can show me how to build a serial adapter, I would try to make one and post the output here. [any information, full layout :) ]

Looking at silenec's log, the problem appears to be at:
[<c01275cc>] tn7atm_allocate_rx_skb+0x40/0x954 [tiatm]
[<c011f1c4>] cpsarInitModule+0x514/0x5618 [tiatm]

I am really interested in having openwrt back to my dg834...

comment:8 Changed 10 years ago by matteo

Try defining TI_STATIC_ALLOCATIONS in the sangam Makefile

comment:9 Changed 10 years ago by colchaodemola <colchaodemola@…>

Well, I cannot be sure my problem is exactly the same, though I think it is, since my box keeps hanging after a few hours. I have defined TI_STATIC_ALLOCATIONS in the sangam Makefile but the box still hangs :/
Any other ideas?

comment:10 Changed 10 years ago by nabcore

Go back in time, trying out each revision, and work out which changeset "broke" this.

comment:11 follow-up: Changed 10 years ago by matteo

Please have a try with r10140

comment:12 in reply to: ↑ 11 ; follow-up: Changed 10 years ago by colchaodemola <colchaodemola@…>

Replying to matteo:

Please have a try with r10140

adding TI_STATIC_ALLOCATIONS to the sangam Makefile ?

comment:13 Changed 10 years ago by colchaodemola <colchaodemola@…>

Well, I just installed r10140; let's see if it holds :)
I also found a few bugs in the LEDs, for example:

The power LED is always off after boot.
The pre_init function in /etc/diag.sh is never executed; I think the leds directory has not been created yet when it would run.

comment:14 in reply to: ↑ 12 Changed 10 years ago by anonymous

Replying to colchaodemola <colchaodemola@gmail.com>:

Replying to matteo:

Please have a try with r10140

adding TI_STATIC_ALLOCATIONS to the sangam Makefile ?

CFLAGS = ...... -DTI_STATIC_ALLOCATIONS

comment:15 Changed 10 years ago by War3333

I have the same problem on a D-link 504T... I'm trying to define -DTI_STATIC_ALLOCATIONS but where?

sangam makefile == package/ar7-atm/Makefile ?

In that file I don't see any CFLAGS...
I had even tried adding the definition at the "define Build/Compile" section but the only result is a compile error.

To add some info: it lasts between a few hours and a couple of days. The more torrents I have downloading (maybe other p2p too, but I only use torrent), the more frequently it hangs.

I can't post a console log because I don't have a serial port on the router, but if I can get it some other way...

I was thinking it was a conntrack-related issue, but it seems not to be.

War3333

comment:16 Changed 10 years ago by anonymous

build_dir/linux-ar7/sangam_atm-D7.03.01.00/Makefile

maybe this?

comment:17 Changed 10 years ago by colchaodemola <colchaodemola@…>

probably :)

comment:18 Changed 10 years ago by colchaodemola <colchaodemola@…>

I tried it here, but it won't compile:

In file included from /media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:81:
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7api.h:176:8: warning: extra tokens at end of #endif directive
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c: In function 'tn7atm_irq_request':
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:651: warning: 'deprecated_irq_flag' is deprecated (declared at include/linux/interrupt.h:64)
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:669: warning: 'deprecated_irq_flag' is deprecated (declared at include/linux/interrupt.h:64)
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c: In function 'tn7atm_send_complete':
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:1305: warning: unused variable 'ledticks'
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c: In function 'tn7atm_allocate_rx_skb':
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:1374: error: implicit declaration of function 'ti_alloc_skb'
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:1374: warning: assignment makes pointer from integer without a cast
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c: In function 'tn7atm_receive':
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:1420: warning: unused variable 'ledticks'
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c: In function 'tn7atm_detect':
/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.c:1809: warning: unused variable 'residual'
make[5]: *** [/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/tn7atm.o] Error 1
make[4]: *** [_module_/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00] Error 2
make[4]: Leaving directory `/media/win_d/kamikaze/build_dir/linux-ar7/linux-2.6.23.1'
make[3]: *** [/media/win_d/kamikaze/build_dir/linux-ar7/sangam_atm-D7.03.01.00/.built] Error 2
make[3]: Leaving directory `/media/win_d/kamikaze/package/ar7-atm'
make[2]: *** [package/ar7-atm/compile] Error 2
make[2]: Leaving directory `/media/win_d/kamikaze'
make[1]: *** [/media/win_d/kamikaze/staging_dir/mipsel/stamp/.package_compile] Error 2
make[1]: Leaving directory `/media/win_d/kamikaze'
make: *** [world] Error 2

comment:19 Changed 10 years ago by matteo

replace "ti_alloc_skb" with "kmalloc"

comment:20 Changed 10 years ago by colchaodemola <colchaodemola@…>

That makes it compile, but it does not work.

dmesg shows:
registered device TI Avalanche SAR
Sangam detected
failed to setup channel =15.
requesting firmware image "ar0700xx.bin"
avsar firmware released
Creating new root folder avalanche in the proc for the driver stats
Texas Instruments ATM driver: version:[7.03.01.00]
failed to setup channel =0 with return code 2097447
failed to activate hw channel

comment:21 follow-up: Changed 10 years ago by colchaodemola <colchaodemola@…>

no more ideas ?

comment:22 Changed 10 years ago by War3333

I'm trying older revisions as suggested.

I'm now compiling r9000. If all goes well for 3 days I will try r9500, and so on until the problem reappears; then I will go back little by little until I find the changeset that introduced the problem. This test needs some weeks, so don't expect any info from me soon ;)
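A sketch of a quicker variant of this search: rather than stepping forward in fixed increments, halve the interval between the last revision known to be stable and the first one known to hang (a plain bisection, not the stepping procedure described above). The helper name `next_rev` is hypothetical, and the revision numbers used with it are only example inputs.

```shell
#!/bin/sh
# Print the revision halfway between a known-good and a known-bad revision.
# Prints nothing and returns 1 once the two revisions are adjacent, which
# means the bad revision itself contains the offending changeset.
# The helper name is hypothetical; it is not part of any OpenWrt tooling.
next_rev() {
    good=$1
    bad=$2
    if [ $((bad - good)) -le 1 ]; then
        return 1
    fi
    echo $(( (good + bad) / 2 ))
}
```

For example, `next_rev 9000 10082` prints 9541. If that build stays up, it becomes the new "good" value; if it hangs, it becomes the new "bad" value. Each multi-day test then halves the remaining range, so roughly ten builds are enough to cover a span of a thousand revisions.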

comment:23 Changed 10 years ago by colchaodemola <colchaodemola@…>

Well, I will do this too.
I am trying r7632, because someone told me this revision runs OK on the DG834v2.
It has been up for 12 hours now, though I don't like this revision very much because QoS does not work.

comment:24 Changed 10 years ago by anonymous

did you ever try the latest version without qos?

comment:25 Changed 10 years ago by colchaodemola <colchaodemola@…>

what you mean by latest version without qos ?
If it is the trunk with qos disabled , yes i tried.
If it is the oldest revision where there is no qos available , no, i have no idea which revision this is.

comment:26 Changed 10 years ago by colchaodemola <colchaodemola@…>

Just to keep you updated: r7632 works fine, no locks after 24h.
Then I updated to r9234 [the last 2.6.22 revision] and it worked OK too.
Now I have updated just the packages directory, so I am running 2.6.22 with all packages updated [except busybox, which I had to downgrade to 1.81 because 1.82 wouldn't compile].
I will try to make trunk use 2.6.22 instead of 2.6.23 and see if it works.

comment:27 Changed 10 years ago by War3333

r9234 works fine for me too.
I can confirm that r9333 and 9665 with 2.6.23 have the problem.

I will wait for 2.6.24 to test :)

comment:28 in reply to: ↑ 21 Changed 10 years ago by colchaodemola <colchaodemola@…>

Replying to colchaodemola <colchaodemola@gmail.com>:

no more ideas ?

Any news on why TI_STATIC_ALLOCATIONS does not work?

comment:29 Changed 10 years ago by matteo

because it was not the right fix

comment:30 Changed 10 years ago by colchaodemola <colchaodemola@…>

I think you haven't understood my question; I mean, why can I never connect when TI_STATIC_ALLOCATIONS is enabled?
failed to setup channel =0 with return code 2097447 failed to activate hw channel
[I was not talking about the hang]

comment:31 Changed 10 years ago by anonymous

Shame on me, please don't use TI_STATIC_ALLOCATIONS.

comment:32 Changed 10 years ago by colchaodemola <colchaodemola@…>

well , it does not work anyway :)
I tried the new 2.6.24 but my unit does not boot.

comment:33 Changed 10 years ago by War3333

Here too, 2.6.24 does not seem to boot at this time (ADSL LED blinking forever). I'm waiting for an RS232 -> TTL adapter for some console logging; it should be here tomorrow.

comment:34 Changed 10 years ago by colchaodemola <colchaodemola@…>

Well, sometimes I get telnet access with 2.6.24, but even an ifconfig locks the terminal [not the unit], and the done script is never executed, because the LED is always blinking.

comment:35 Changed 10 years ago by War3333

Maybe this patch for .24 it's to be fixed as written?
http://ozlabs.org/pipermail/linuxppc-dev/2008-January/050330.htm

Bah! I will think about this problem when I can do some real analysis; without console access it's impossible to even guess at something ;)

comment:36 Changed 10 years ago by colchaodemola <colchaodemola@…>

War3333, any news ?

comment:37 Changed 10 years ago by War3333

Sorry, I don't have enough time for testing.

For now I can say that iptables does not seem to work with the new kernel (there is a stack trace about it during boot, which I will post), Avalanche shows strange errors on startup, and ifconfig freezes the machine even on the console (it only freezes, it does not crash, so there is no stack trace for this :()

Changed 10 years ago by War3333

Boot Log with 2.6.24.2

comment:38 Changed 10 years ago by colchaodemola <colchaodemola@…>

news ?

comment:39 Changed 10 years ago by Stavros Korokithakis <stavros@…>

I can confirm this as well, my D-Link DSL-502T with rev 10117 crashes about once a day.

comment:40 Changed 10 years ago by colchaodemola <colchaodemola@…>

I kind of found a workaround, by mistake :), for kernel 2.6.23. I have been up for 30h now with no locks.
I created a file at /tmp/swap and used it as swap [I know /tmp is tmpfs and uses RAM, but as I said it was a mistake; I wanted to use /swap :)]. The system is running fine and started using swap after 9h of uptime, where it would probably have locked. Very strange, but it has worked so far. I will post more info tomorrow, if it does not lock :)

comment:41 Changed 10 years ago by colchaodemola <colchaodemola@…>

It didn't work. It locked after about 2 days :/
I'm now trying kernel 2.6.23.17 instead of 2.6.23.1; so far it's holding, 28h uptime, no swap.
I need to get my serial connection working fast to see whether all of these are the same error.

comment:42 Changed 10 years ago by matteo

  • Resolution set to fixed
  • Status changed from new to closed

Seems to be fixed in r10588, closing bug

comment:43 Changed 10 years ago by matteo

  • Resolution fixed deleted
  • Status changed from closed to reopened

I had this:

Kernel bug detected[#1]:
Cpu 0
$ 0   : 00000000 10008400 00000001 00000000
$ 4   : 00000002 94473bd8 00000202 00000001
$ 8   : 10008400 1000001f 94e0b368 00000000
$12   : 00000070 00000014 94c8c504 000001f4
$16   : 943df520 00000000 94473b80 00000000
$20   : 94473800 00010000 94915cd4 00000000
$24   : 00000000 9427edf0
$28   : 94d8a000 94d8bc10 00430918 94154f88
Hi    : 000002ee
Lo    : 001c0677
epc   : 94256e34 cpmac_irq+0x34c/0x658     Not tainted
ra    : 94154f88 handle_IRQ_event+0x74/0xe4
Status: 10008403    KERNEL EXL IE
Cause : 10800034
PrId  : 00018448 (MIPS 4KEc)
Modules linked in: tiatm nf_nat_tftp nf_conntrack_tftp ipt_NETMAP ipt_recent xt_limit xt_helper xt_connmark xt_connbytes ppp_async ch
Process nc (pid: 674, threadinfo=94d8a000, task=94900000)
Stack : 94473800 00000000 94e0b368 94e0b368 94473800 9427eed4 94d0f280 00000000
        00000000 0000001b 0000036d 00000000 94915cd4 00000000 00430918 94154f88
        00000000 00000000 94473800 942b9060 943b1048 0000001b 000001cc 94d8be50
        94157148 00000014 00000000 00000000 00000000 94d8bd7c 00000013 94e32de0
        94100be0 94d8bcb8 00430918 942d7e00 10008403 000002ee 00000000 94101c44
        ...
Call Trace:
[<94256e34>] cpmac_irq+0x34c/0x658
[<94154f88>] handle_IRQ_event+0x74/0xe4
[<94157148>] handle_level_irq+0x84/0x10c
[<94100be0>] plat_irq_dispatch+0x134/0x144
[<94101c44>] ret_from_irq+0x0/0x4
[<942c5068>] tcp_sendmsg+0x3f0/0xff0
[<94269520>] sock_aio_write+0xf0/0x118
[<94182820>] do_sync_write+0xd8/0x15c
[<9418396c>] sys_write+0x50/0xbc
[<9410aaf0>] stack_done+0x20/0x3c


Code: 26450058  38420001  30420001 <00028036> 8e020008  3c030001  00431024  10400046  00000000
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 3 seconds..

Maybe it's the reboot issue that many users are reporting

comment:44 Changed 10 years ago by anonymous

reboot still much better than lock up :)

comment:45 Changed 10 years ago by matteo

I was saving some useful things to my pc:
nc 192.168.1.3 3000 < /dev/zero

comment:46 Changed 10 years ago by Robert.Siemer-openwrt.org@…

I can report the same problem with kamikaze r10408 (kernel 2.6.23.1) on the D-Link DSL-524T router. I thought it was overheating at first, but it hangs after a pretty constant uptime of about 20h.

comment:47 Changed 10 years ago by Robert.Siemer-openwrt.org@…

Update: time-to-crash is not constant (here either). I was running "uptime" and "cat /proc/meminfo" every 10 seconds. The last sample showed:

 09:57:06 up  9:57, load average: 0.00, 0.00, 0.00
MemTotal:        12716 kB
MemFree:          1408 kB
Buffers:          1096 kB
Cached:           3620 kB
SwapCached:          0 kB
Active:           3672 kB
Inactive:         2308 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:        1280 kB
Mapped:           1148 kB
Slab:             4172 kB
SReclaimable:      480 kB
SUnreclaim:       3692 kB
PageTables:        248 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:      6356 kB
Committed_AS:     4020 kB
VmallocTotal:  1048404 kB
VmallocUsed:       828 kB
VmallocChunk:  1047408 kB

The output one hour earlier was similar. So it does not look like a memory problem to me.
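A minimal sketch of how two saved samples could be compared field by field rather than by eye. It assumes each sample was written to its own file in /proc/meminfo format; the helper names `meminfo_field` and `meminfo_delta` and any file names are hypothetical.

```shell
#!/bin/sh
# Print the value (in kB) of one field from a meminfo-style file.
# Both helper names are illustrative, not existing tools.
meminfo_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "$2"
}

# Print how much a field changed between an older and a newer snapshot.
# Usage: meminfo_delta FIELD OLD_FILE NEW_FILE
# A field missing from a snapshot is treated as 0.
meminfo_delta() {
    old=$(meminfo_field "$1" "$2")
    new=$(meminfo_field "$1" "$3")
    echo $(( ${new:-0} - ${old:-0} ))
}
```

For instance, `meminfo_delta SUnreclaim hour08.txt hour09.txt` prints the growth of unreclaimable slab memory over that hour. A value that keeps climbing between samples would suggest a kernel-side leak, while deltas near zero support the conclusion that memory is not the cause.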

comment:48 Changed 10 years ago by matteo

Please report the kernel output from a serial console; we need more info than "crashes every 2-10 hours" to debug it.

comment:49 Changed 10 years ago by Robert <Robert.Siemer-openwrt.org@…>

Matteo, you don't have any problems with AR7 boards and kernel 2.6.23? Apart from that, I just got a Linksys WAG354G. I'll post the serial output if it crashes the next two days, but I'm afraid that it won't... It served also as a test for my serial cable as I have at the moment no clue where the serial port on my D-Link DSL-524T is...

comment:50 Changed 10 years ago by Robert <Robert.Siemer-openwrt.org@…>

Update: I found out how to connect the serial cable to the D-Link DSL-524T (see wiki page). Bad news: there was _no_ kernel output (or any other output) when it crashed after about 13 hours. -- On the other hand the Linksys WAG354G is still up and running (but I have nothing connected to it apart from the power supply.)

comment:51 Changed 10 years ago by anonymous

The problem, as far as I remember, appears to be in the sangam ADSL module, so you actually have to connect the WAG354G to an ADSL line to see whether it locks up.

comment:52 Changed 10 years ago by Robert <Robert.Siemer-openwrt.org@…>

Okay, I got the second crash and no output on the serial line (that link is okay otherwise (e.g. console login)). That was the D-Link DSL-524T. The "unconnected" Linksys WAG354G has still heartbeat.

I still want to have this issue resolved, but does anybody know a working ADSL2+ (Annex A) solution more or less open source based? I don't even know of sole ADSL2+ modems. Something like a USB modem with linux drivers connected to one of known OpenWRT routers...

Apart from that: any hints on how to proceed here?

comment:53 Changed 10 years ago by anonymous

You sure the serial is working ? can you see the boot process ?

comment:54 Changed 10 years ago by Robert <Robert.Siemer-openwrt.org@…>

Yes, I can see the boot process, influence it and later issue commands and see their output once OpenWRT is running. - "echo hi > /dev/console" prints 'hi'.

comment:55 Changed 10 years ago by matteo

BUG is napi_enable at linux-2.6.24/include/linux/netdevice.h:415

static inline void napi_enable(struct napi_struct *n)
{
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
	smp_mb__before_clear_bit();
	clear_bit(NAPI_STATE_SCHED, &n->state);
}

comment:56 Changed 10 years ago by matteo

and it is called from cpmac_check_status, which was added by 140-cpmac_fix.patch in r10424.
Please retry without that patch.

Changed 10 years ago by matteo

comment:57 Changed 10 years ago by matteo

  • Resolution set to fixed
  • Status changed from reopened to closed

It was a race condition, followed by an IRQ storm, fixed in r10747

comment:58 Changed 10 years ago by Robert <Robert.Siemer-openwrt.org@…>

Matteo, if the problem is napi_enable(), why is r10408 affected where cpmac.c never called napi_enable() at all (kernel 2.6.23.1), while with kernel 2.6.24 (with or without the patch) it is in cpmac_open() at least!?

Anonymous, as far as I remember, my D-Link DSL-524T locked up without any ADSL line connected, too. When I started playing with it and OpenWRT I still used my old router for internet access, but it already locked up after some hours... But I had _ethernet_ connected (to talk to the device).

I will try out r10747 anyway... <-:

comment:59 Changed 10 years ago by anonymous

Is it normal to get Apr 6 19:45:20 OpenWrt user.warn kernel: eth0: rx dma ring overrun ?
i am running 10748.

comment:60 Changed 10 years ago by matteo

Yes, that's another (harmless) bug: #3047

comment:61 Changed 9 years ago by anonymous

Ok, I've got a DSL 524T,
Wiki page still says there are problems, here the bug appears to be fixed...
Does it still hang every few hours or is it ok now?
Can anyone confirm DSL 524T is stable?

comment:62 Changed 9 years ago by matteo

It seems to be fixed now
