Modify

Opened 3 years ago

Last modified 3 years ago

#18483 reopened defect

DIR-825 running CC r43485 - WIFI cannot handle IRQ 40

Reported by: papaj0e Owned by: developers
Priority: normal Milestone:
Component: packages Version: Trunk
Keywords: Cc:

Description

Hello,

After running CC r43485 for about two days, suddenly WIFI throughput drops and the following kernel error is logged:

[217492.920000] irq 40: nobody cared (try booting with the "irqpoll" option)
[217492.920000] CPU: 0 PID: 24331 Comm: uptime Not tainted 3.14.25 #4
[217492.920000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000035
[217492.920000] 83bf36d8 00000000 803018d8 803521ff 00005f0b 803b3b5c 83bf36d8 00000000
[217492.920000] 80340000 80305190 803051a4 802a2ca4 00000000 80206510 00000006 801a967c
[217492.920000] 80304fd8 825436dc 00000000 00000000 00000000 00000000 00000000 00000000
[217492.920000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[217492.920000] ...
[217492.920000] Call Trace:
[217492.920000] [<802488c8>] show_stack+0x48/0x70
[217492.920000] [<800a6ca0>] report_bad_irq.isra.7+0x44/0xf8
[217492.920000] [<801daa8c>] note_interrupt+0x224/0x2d8
[217492.920000] [<80160d2c>] handle_irq_event_percpu+0x1b8/0x1ec
[217492.920000] [<80160b50>] handle_irq_event+0x3c/0x60
[217492.920000] [<80160e40>] handle_level_irq+0xe0/0xf8
[217492.920000] [<80153d50>] generic_handle_irq+0x28/0x44
[217492.920000] [<80153d50>] generic_handle_irq+0x28/0x44
[217492.920000] [<8011c700>] do_IRQ+0x1c/0x2c
[217492.920000] [<80060830>] ret_from_irq+0x0/0x4
[217492.920000] [<83361274>] ath_start_rfkill_poll+0x10c/0x37c [ath9k]
[217492.920000] [<83363d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[217492.920000] [<8026c670>] tasklet_action+0x84/0xcc
[217492.920000] [<8008f688>]
do_softirq+0xf8/0x228
[217492.920000] [<8018be14>] irq_exit+0x54/0x70
[217492.920000] [<80060830>] ret_from_irq+0x0/0x4
[217492.920000] [<801b4d58>] lzma_main+0x82c/0xb84
[217492.920000] [<802b71cc>] xz_dec_lzma2_run+0x578/0x834
[217492.920000] [<802b7808>] xz_dec_run+0x324/0x8fc
[217492.920000] [<80261d18>] squashfs_xz_uncompress+0x10c/0x1e8
[217492.920000] [<8025d5e0>] squashfs_decompress+0x50/0x88
[217492.920000] [<8025f2ac>] squashfs_read_data+0x5d0/0x73c
[217492.920000] [<8025ceac>] squashfs_cache_get+0x1a0/0x32c
[217492.920000] [<80261414>] squashfs_readpage_block+0x164/0x450
[217492.920000] [<8026113c>] squashfs_readpage+0x750/0x8c4
[217492.920000] [<8008eab4>] do_page_cache_readahead+0x1ec/0x24c
[217492.920000] [<801438b8>] filemap_fault+0x198/0x3e8
[217492.920000] [<8008e544>]
do_fault+0xd0/0x454
[217492.920000] [<80161048>] handle_mm_fault+0x1f0/0x754
[217492.920000] [<8008ec44>] do_page_fault+0x130/0x388
[217492.920000] [<80060820>] ret_from_exception+0x0/0x10
[217492.920000]
[217492.920000] handlers:
[217492.920000] [<83363864>] ath_isr [ath9k]
[217492.920000] Disabling IRQ #40

The only way I found to recover from this situation is to reload the ath9k module - simple wifi command doesn't solve it:

rmmod ath9k; modprobe ath9k; wifi

Last version that has been working correctly without this error was r43228 running the 3.10 kernel.

Could you please have a look at it?

Kind regards,
Peter

Attachments (0)

Change History (42)

comment:1 Changed 3 years ago by papaj0e

Forgot to mention that the router is DLINK DIR-825 rev.B using Atheros AR7161 SOC

Also about an hour after the recovery by reloading ath9k module similar error appeared:

[231686.010000] irq 40: nobody cared (try booting with the "irqpoll" option)
[231686.010000] CPU: 0 PID: 18901 Comm: iptables Not tainted 3.14.25 #4
[231686.010000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000037
[231686.010000] 	  838b5098 00000000 803018d8 803521ff 000049d5 803b3b5c 838b5098 00000000
[231686.010000] 	  80340000 80305190 803051a4 802a2ca4 00000000 80206510 00000006 801a967c
[231686.010000] 	  80304fd8 825fd774 00000000 00000000 00000000 00000000 00000000 00000000
[231686.010000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[231686.010000] 	  ...
[231686.010000] Call Trace:
[231686.010000] [<802488c8>] show_stack+0x48/0x70
[231686.010000] [<800a6ca0>] __report_bad_irq.isra.7+0x44/0xf8
[231686.010000] [<801daa8c>] note_interrupt+0x224/0x2d8
[231686.010000] [<80160d2c>] handle_irq_event_percpu+0x1b8/0x1ec
[231686.010000] [<80160b50>] handle_irq_event+0x3c/0x60
[231686.010000] [<80160e40>] handle_level_irq+0xe0/0xf8
[231686.010000] [<80153d50>] generic_handle_irq+0x28/0x44
[231686.010000] [<80153d50>] generic_handle_irq+0x28/0x44
[231686.010000] [<8011c700>] do_IRQ+0x1c/0x2c
[231686.010000] [<80060830>] ret_from_irq+0x0/0x4
[231686.010000] [<8008f61c>] __do_softirq+0x8c/0x228
[231686.010000] [<8018be14>] irq_exit+0x54/0x70
[231686.010000] [<80060830>] ret_from_irq+0x0/0x4
[231686.010000] [<801b4ea4>] lzma_main+0x978/0xb84
[231686.010000] [<802b71cc>] xz_dec_lzma2_run+0x578/0x834
[231686.010000] [<802b7808>] xz_dec_run+0x324/0x8fc
[231686.010000] [<80261d18>] squashfs_xz_uncompress+0x10c/0x1e8
[231686.010000] [<8025d5e0>] squashfs_decompress+0x50/0x88
[231686.010000] [<8025f2ac>] squashfs_read_data+0x5d0/0x73c
[231686.010000] [<8025ceac>] squashfs_cache_get+0x1a0/0x32c
[231686.010000] [<80261168>] squashfs_readpage+0x77c/0x8c4
[231686.010000] [<8008eab4>] __do_page_cache_readahead+0x1ec/0x24c
[231686.010000] [<801438b8>] filemap_fault+0x198/0x3e8
[231686.010000] [<8008e544>] __do_fault+0xd0/0x454
[231686.010000] [<80161048>] handle_mm_fault+0x1f0/0x754
[231686.010000] [<8008ec44>] __do_page_fault+0x130/0x388
[231686.010000] [<80060820>] ret_from_exception+0x0/0x10
[231686.010000] 
[231686.010000] handlers:
[231686.010000] [<82d03864>] ath_isr [ath9k]
[231686.010000] Disabling IRQ #40
Last edited 3 years ago by papaj0e (previous) (diff)

comment:2 Changed 3 years ago by hnyman

The error is apparently still there as of OpenWrt Chaos Calmer r43589.
Clean ar71xx build for WNDR3700v1.

[ 3499.330000] irq 40: nobody cared (try booting with the "irqpoll" option)
[ 3499.330000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.26 #1
[ 3499.330000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000032
[ 3499.330000] 	  803453b0 00000000 802fb434 803457d7 00000000 803b3b5c 803453b0 00000000
[ 3499.330000] 	  80340000 802fe7b0 802fe7c4 8029e040 00000000 801ffe30 00000006 801a4404
[ 3499.330000] 	  802fe5f8 80335b24 00000000 00000000 00000000 00000000 00000000 00000000
[ 3499.330000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 3499.330000] 	  ...
[ 3499.330000] Call Trace:
[ 3499.330000] [<80241ef8>] show_stack+0x48/0x70
[ 3499.330000] [<800a66b0>] __report_bad_irq.isra.7+0x44/0xf8
[ 3499.330000] [<801d4fcc>] note_interrupt+0x224/0x2d8
[ 3499.330000] [<8015c7d8>] handle_irq_event_percpu+0x1b8/0x1ec
[ 3499.330000] [<8015c5fc>] handle_irq_event+0x3c/0x60
[ 3499.330000] [<8015c8ec>] handle_level_irq+0xe0/0xf8
[ 3499.330000] [<8014ff5c>] generic_handle_irq+0x28/0x44
[ 3499.330000] [<8014ff5c>] generic_handle_irq+0x28/0x44
[ 3499.330000] [<8011902c>] do_IRQ+0x1c/0x2c
[ 3499.330000] [<80060830>] ret_from_irq+0x0/0x4
[ 3499.330000] [<8299bd3c>] ath9k_hw_enable_interrupts+0x140/0x194 [ath9k_hw]
[ 3499.330000] [<82883d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[ 3499.330000] [<80264fdc>] tasklet_action+0x84/0xcc
[ 3499.330000] [<8008f5e0>] __do_softirq+0xf8/0x228
[ 3499.330000] [<801871c0>] irq_exit+0x54/0x70
[ 3499.330000] [<80060830>] ret_from_irq+0x0/0x4
[ 3499.330000] [<80060a80>] __r4k_wait+0x20/0x40
[ 3499.330000] [<80100920>] cpu_startup_entry+0xa4/0x104
[ 3499.330000] [<8035294c>] start_kernel+0x3c8/0x3e0
[ 3499.330000] 
[ 3499.330000] handlers:
[ 3499.330000] [<82883864>] ath_isr [ath9k]
[ 3499.330000] Disabling IRQ #40

comment:3 Changed 3 years ago by papaj0e

It looks like the same bug has been hit again after about day and a half uptime running trunk r43600:

[128559.690000] irq 40: nobody cared (try booting with the "irqpoll" option)
[128559.690000] CPU: 0 PID: 9052 Comm: hotplug-call Not tainted 3.14.26 #3
[128559.690000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 0000003a
[128559.690000] 	  81edc240 00000000 80301938 803521ff 0000235c 803b3b5c 81edc240 00000000
[128559.690000] 	  80340000 803051f0 80305204 802a2d5c 00000000 802065ec 00000006 801a96ec
[128559.690000] 	  80305038 81c5f75c 00000000 00000000 00000000 00000000 00000000 00000000
[128559.690000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[128559.690000] 	  ...
[128559.690000] Call Trace:
[128559.690000] [<80248980>] show_stack+0x48/0x70
[128559.690000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[128559.690000] [<801dab68>] note_interrupt+0x224/0x2d8
[128559.690000] [<80160d9c>] handle_irq_event_percpu+0x1b8/0x1ec
[128559.690000] [<80160bc0>] handle_irq_event+0x3c/0x60
[128559.690000] [<80160eb0>] handle_level_irq+0xe0/0xf8
[128559.690000] [<80153dc0>] generic_handle_irq+0x28/0x44
[128559.690000] [<80153dc0>] generic_handle_irq+0x28/0x44
[128559.690000] [<8011c770>] do_IRQ+0x1c/0x2c
[128559.690000] [<80060830>] ret_from_irq+0x0/0x4
[128559.690000] [<8008f624>] __do_softirq+0x8c/0x228
[128559.690000] [<8018be84>] irq_exit+0x54/0x70
[128559.690000] [<80060830>] ret_from_irq+0x0/0x4
[128559.690000] [<801b48fc>] lzma_main+0x2f4/0xb84
[128559.690000] [<802b7238>] xz_dec_lzma2_run+0x578/0x834
[128559.690000] [<802b7874>] xz_dec_run+0x324/0x8fc
[128559.690000] [<80261dd0>] squashfs_xz_uncompress+0x10c/0x1e8
[128559.690000] [<8025d698>] squashfs_decompress+0x50/0x88
[128559.690000] [<8025f364>] squashfs_read_data+0x5d0/0x73c
[128559.690000] [<8025cf64>] squashfs_cache_get+0x1a0/0x32c
[128559.690000] [<80261220>] squashfs_readpage+0x77c/0x8c4
[128559.690000] [<8008eabc>] __do_page_cache_readahead+0x1ec/0x24c
[128559.690000] [<801e6620>] page_cache_sync_readahead+0x54/0x60
[128559.690000] [<80152a40>] generic_file_aio_read+0x2c0/0x720
[128559.690000] [<8012a924>] do_sync_read+0x7c/0xb4
[128559.690000] [<802a95e4>] vfs_read+0xa4/0x168
[128559.690000] [<801a0ad8>] kernel_read+0x2c/0x40
[128559.690000] [<802055d8>] prepare_binprm+0x104/0x114
[128559.690000] [<8011f794>] do_execve+0x2b4/0x4c8
[128559.690000] [<8006283c>] handle_sys+0x11c/0x140
[128559.690000] 
[128559.690000] handlers:
[128559.690000] [<83243864>] ath_isr [ath9k]
[128559.690000] Disabling IRQ #40

comment:4 Changed 3 years ago by mroek

I can also confirm that this bug is still there. This is a very annoying and serious bug, and it really needs some serious attention. Is there anything we can do to help debugging it?

comment:5 follow-up: Changed 3 years ago by anonymous

Just noticed this on r43457 on Netgear WNDR3700v1:

[111947.690000] irq 40: nobody cared (try booting with the "irqpoll" option)
[111947.690000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.25 #1
[111947.690000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803ac8ce 00000032
[111947.690000] 	  80341398 00000000 802f7034 803417bf 00000000 803a3b5c 80341398 00000000
[111947.690000] 	  80330000 802fa3b0 802fa3c4 8029a0ac 00000000 801ff02c 00000006 801a3694
[111947.690000] 	  802fa1f8 80331b44 00000000 00000000 00000000 00000000 00000000 00000000
[111947.690000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[111947.690000] 	  ...
[111947.690000] Call Trace:
[111947.690000] [<80241018>] show_stack+0x48/0x70
[111947.690000] [<800a64d4>] __report_bad_irq.isra.7+0x44/0xf8
[111947.690000] [<801d41f0>] note_interrupt+0x224/0x2d8
[111947.690000] [<8015baf0>] handle_irq_event_percpu+0x1b8/0x1ec
[111947.690000] [<8015b914>] handle_irq_event+0x3c/0x60
[111947.690000] [<8015bc04>] handle_level_irq+0xe0/0xf8
[111947.690000] [<8014f274>] generic_handle_irq+0x28/0x44
[111947.690000] [<8014f274>] generic_handle_irq+0x28/0x44
[111947.690000] [<80118644>] do_IRQ+0x1c/0x2c
[111947.690000] [<80060830>] ret_from_irq+0x0/0x4
[111947.690000] [<82c61274>] ath_start_rfkill_poll+0x10c/0x37c [ath9k]
[111947.690000] [<82c63d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[111947.690000] [<802640f4>] tasklet_action+0x84/0xcc
[111947.690000] [<8008f5c0>] __do_softirq+0xf8/0x228
[111947.690000] [<80186450>] irq_exit+0x54/0x70
[111947.690000] [<80060830>] ret_from_irq+0x0/0x4
[111947.690000] [<80060a80>] __r4k_wait+0x20/0x40
[111947.690000] [<801004e4>] cpu_startup_entry+0xa4/0x104
[111947.690000] [<8034e94c>] start_kernel+0x3c8/0x3e0
[111947.690000] 
[111947.690000] handlers:
[111947.690000] [<82c63864>] ath_isr [ath9k]
[111947.690000] Disabling IRQ #40

Interestingly, /proc/interrupts shows that IRQ #40 fired exactly 27,600,000 times:

           CPU0       
  3:          0      MIPS  ehci_hcd:usb1
  4:   20827731      MIPS  eth0
  5:   20548640      MIPS  eth1
  7:  104675986      MIPS  timer
 11:         40      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:   27600000  AR71XX PCI  ath9k
 41:  123064245  AR71XX PCI  ath9k
ERR:          0

comment:6 in reply to: ↑ 5 Changed 3 years ago by hnyman

Replying to anonymous:

Interestingly, /proc/interrupts shows that IRQ #40 fired exactly 27,600,000 times:

 40:   27600000  AR71XX PCI  ath9k
 41:  123064245  AR71XX PCI  ath9k

That might be just a coincidence, as it doesn't look like any "natural" limit in style of power of 2:

2^7 * 5^5 * 3 * 23 = 27600000

But I will check my router's counter when I run into this bug next time.

comment:7 Changed 3 years ago by mroek

This bug just happened again on my router also, and this is what I got in /proc/interrupts

root@WNDR3700:~# cat /proc/interrupts
           CPU0
  3:          0      MIPS  ehci_hcd:usb1
  4:   15480050      MIPS  eth0
  5:          0      MIPS  eth1
  7:   15791398      MIPS  timer
 11:         43      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:   32600000  AR71XX PCI  ath9k
 41:   14523689  AR71XX PCI  ath9k
ERR:          0

As you can see, I got a different number, but strangely enough it was also a rather round number, at least when viewed base 10 (like here).

comment:8 Changed 3 years ago by papaj0e

I am pretty sure it is not coincidence that IRQ 40 interrupt counter is always divisible by 100000 - once again my router hit the bug and see the statistics:

root@OpenWrt:~# cat /proc/interrupts 
           CPU0       
  3:   11259327      MIPS  ehci_hcd:usb1
  4:    8662071      MIPS  eth0
  5:  154312243      MIPS  eth1
  7:   37271673      MIPS  timer
 11:         54      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  115900000  AR71XX PCI  ath9k
 41:   21295825  AR71XX PCI  ath9k
ERR:          0

And the bug itself in the kernel log:

[251441.410000] irq 40: nobody cared (try booting with the "irqpoll" option)
[251441.410000] CPU: 0 PID: 27067 Comm: sh Not tainted 3.14.26 #3
[251441.410000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000031
[251441.410000] 	  824b3210 00000000 80301938 803521ff 000069bb 803b3b5c 824b3210 00000000
[251441.410000] 	  80340000 803051f0 80305204 802a2d5c 00000000 802065ec 00000006 801a96ec
[251441.410000] 	  80305038 81d135a4 00000000 00000000 00000000 00000000 00000000 00000000
[251441.410000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[251441.410000] 	  ...
[251441.410000] Call Trace:
[251441.410000] [<80248980>] show_stack+0x48/0x70
[251441.410000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[251441.410000] [<801dab68>] note_interrupt+0x224/0x2d8
[251441.410000] [<80160d9c>] handle_irq_event_percpu+0x1b8/0x1ec
[251441.410000] [<80160bc0>] handle_irq_event+0x3c/0x60
[251441.410000] [<80160eb0>] handle_level_irq+0xe0/0xf8
[251441.410000] [<80153dc0>] generic_handle_irq+0x28/0x44
[251441.410000] [<80153dc0>] generic_handle_irq+0x28/0x44
[251441.410000] [<8011c770>] do_IRQ+0x1c/0x2c
[251441.410000] [<80060830>] ret_from_irq+0x0/0x4
[251441.410000] [<8008f624>] __do_softirq+0x8c/0x228
[251441.410000] [<8018be84>] irq_exit+0x54/0x70
[251441.410000] [<80060830>] ret_from_irq+0x0/0x4
[251441.410000] [<801b49c0>] lzma_main+0x3b8/0xb84
[251441.410000] [<802b7238>] xz_dec_lzma2_run+0x578/0x834
[251441.410000] [<802b7874>] xz_dec_run+0x324/0x8fc
[251441.410000] [<80261dd0>] squashfs_xz_uncompress+0x10c/0x1e8
[251441.410000] [<8025d698>] squashfs_decompress+0x50/0x88
[251441.410000] [<8025f364>] squashfs_read_data+0x5d0/0x73c
[251441.410000] [<8025cf64>] squashfs_cache_get+0x1a0/0x32c
[251441.410000] [<80261220>] squashfs_readpage+0x77c/0x8c4
[251441.410000] [<8008eabc>] __do_page_cache_readahead+0x1ec/0x24c
[251441.410000] [<80143928>] filemap_fault+0x198/0x3e8
[251441.410000] [<8008e54c>] __do_fault+0xd0/0x454
[251441.410000] [<801610b8>] handle_mm_fault+0x1f0/0x754
[251441.410000] [<8008ec4c>] __do_page_fault+0x130/0x388
[251441.410000] [<80060820>] ret_from_exception+0x0/0x10
[251441.410000] [<80064b74>] __bzero+0x38/0x164
[251441.410000] [<801e5f8c>] padzero+0x4c/0x70
[251441.410000] [<801ae86c>] load_elf_binary+0xaec/0x124c
[251441.410000] [<80234ab8>] search_binary_handler+0x90/0x1d0
[251441.410000] [<8011f814>] do_execve+0x334/0x4c8
[251441.410000] [<8006283c>] handle_sys+0x11c/0x140
[251441.410000] 
[251441.410000] handlers:
[251441.410000] [<80403864>] ath_isr [ath9k]
[251441.410000] Disabling IRQ #40

I'm recovering from this error by full reload of the wireless network stack using this script:

#!/bin/sh
rmmod ath9k; rmmod ath9k_common; rmmod ath9k_hw; rmmod ath; rmmod mac80211; rmmod cfg80211; rmmod compat; modprobe ath9k; wifi

Hope this information would help resolving this bug.

comment:9 Changed 3 years ago by papaj0e

And once again my wireless gets broken at '00000 boundary:

root@OpenWrt:~# cat /proc/interrupts 
           CPU0       
  3:   13573663      MIPS  ehci_hcd:usb1
  4:   11627373      MIPS  eth0
  5:  173781352      MIPS  eth1
  7:   43564517      MIPS  timer
 11:         54      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  127300000  AR71XX PCI  ath9k
 41:   24681995  AR71XX PCI  ath9k
ERR:          0

Which makes exactly 11400000 interrupts (127300000-115900000) before the crash.
I guess my next crash is going to be at 138700000 interrupts (=127300000+11400000).

Here is the log:

[274529.840000] irq 40: nobody cared (try booting with the "irqpoll" option)
[274529.840000] CPU: 0 PID: 27597 Comm: hostapd Not tainted 3.14.26 #3
[274529.840000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000036
[274529.840000] 	  824b1ef0 00000000 80301938 803521ff 00006bcd 803b3b5c 824b1ef0 00000000
[274529.840000] 	  80340000 803051f0 80305204 802a2d5c 00000000 802065ec 00000006 801a96ec
[274529.840000] 	  80305038 8075943c 00000000 00000000 00000000 00000000 00000000 00000000
[274529.840000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[274529.840000] 	  ...
[274529.840000] Call Trace:
[274529.840000] [<80248980>] show_stack+0x48/0x70
[274529.840000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[274529.840000] [<801dab68>] note_interrupt+0x224/0x2d8
[274529.840000] [<80160d9c>] handle_irq_event_percpu+0x1b8/0x1ec
[274529.840000] [<80160bc0>] handle_irq_event+0x3c/0x60
[274529.840000] [<80160eb0>] handle_level_irq+0xe0/0xf8
[274529.840000] [<80153dc0>] generic_handle_irq+0x28/0x44
[274529.840000] [<80153dc0>] generic_handle_irq+0x28/0x44
[274529.840000] [<8011c770>] do_IRQ+0x1c/0x2c
[274529.840000] [<80060830>] ret_from_irq+0x0/0x4
[274529.840000] [<8027c3b8>] tcp_sacktag_write_queue+0x5d0/0x8a8
[274529.840000] [<8026ea20>] tcp_ack+0x2fc/0xd4c
[274529.840000] [<80278af0>] tcp_rcv_established+0x2d0/0x5fc
[274529.840000] [<802837e4>] tcp_v4_do_rcv+0x90/0x2a4
[274529.840000] [<8028471c>] tcp_v4_rcv+0x3a8/0x718
[274529.840000] [<8017f448>] ip_local_deliver_finish+0xd4/0x198
[274529.840000] [<800a0d8c>] __netif_receive_skb_core+0x474/0x530
[274529.840000] [<800bb5f4>] ag71xx_poll+0x434/0x608
[274529.840000] [<801d12d4>] net_rx_action+0x10c/0x1e0
[274529.840000] [<8008f690>] __do_softirq+0xf8/0x228
[274529.840000] [<8018be84>] irq_exit+0x54/0x70
[274529.840000] [<80060830>] ret_from_irq+0x0/0x4
[274529.840000] [<801b49b0>] lzma_main+0x3a8/0xb84
[274529.840000] [<802b7170>] xz_dec_lzma2_run+0x4b0/0x834
[274529.840000] [<802b7874>] xz_dec_run+0x324/0x8fc
[274529.840000] [<80261dd0>] squashfs_xz_uncompress+0x10c/0x1e8
[274529.840000] [<8025d698>] squashfs_decompress+0x50/0x88
[274529.840000] [<8025f364>] squashfs_read_data+0x5d0/0x73c
[274529.840000] [<8025cf64>] squashfs_cache_get+0x1a0/0x32c
[274529.840000] [<802614cc>] squashfs_readpage_block+0x164/0x450
[274529.840000] [<802611f4>] squashfs_readpage+0x750/0x8c4
[274529.840000] [<8008eabc>] __do_page_cache_readahead+0x1ec/0x24c
[274529.840000] [<80143928>] filemap_fault+0x198/0x3e8
[274529.840000] [<8008e54c>] __do_fault+0xd0/0x454
[274529.840000] [<801610b8>] handle_mm_fault+0x1f0/0x754
[274529.840000] [<8008ec4c>] __do_page_fault+0x130/0x388
[274529.840000] [<80060820>] ret_from_exception+0x0/0x10
[274529.840000] 
[274529.840000] handlers:
[274529.840000] [<804e3864>] ath_isr [ath9k]
[274529.840000] Disabling IRQ #40

comment:10 Changed 3 years ago by anonymous

I also just got this on my wndr3700 running r43602

[191111.700000] irq 40: nobody cared (try booting with the "irqpoll" option)
[191111.700000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.14.26 #1
[191111.700000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803dd83e 00000036
[191111.700000] 	  83828bd0 00000000 8031e834 80369f87 00000003 803d3b60 83828bd0 00000000
[191111.700000] 	  80360000 80321bc4 80321bd8 802b4e20 00000000 80214f10 00000006 801b81e0
[191111.700000] 	  80321a0c 83833bec 00000000 00000000 00000000 00000000 00000000 00000000
[191111.700000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[191111.700000] 	  ...
[191111.700000] Call Trace:
[191111.700000] [<80258830>] show_stack+0x48/0x70
[191111.700000] [<800ab914>] __report_bad_irq.isra.7+0x44/0xf8
[191111.700000] [<801e9714>] note_interrupt+0x224/0x2d8
[191111.700000] [<8016ada8>] handle_irq_event_percpu+0x1b8/0x1ec
[191111.700000] [<8016abcc>] handle_irq_event+0x3c/0x60
[191111.700000] [<8016aebc>] handle_level_irq+0xe0/0xf8
[191111.700000] [<8015e3d8>] generic_handle_irq+0x28/0x44
[191111.700000] [<8015e3d8>] generic_handle_irq+0x28/0x44
[191111.700000] [<80125ee8>] do_IRQ+0x1c/0x2c
[191111.700000] [<80060830>] ret_from_irq+0x0/0x4
[191111.700000] [<83361274>] ath_start_rfkill_poll+0x10c/0x37c [ath9k]
[191111.700000] [<83363d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[191111.700000] [<8027ba6c>] tasklet_action+0x84/0xcc
[191111.700000] [<80093d9c>] __do_softirq+0xf8/0x228
[191111.700000] [<802409a0>] run_ksoftirqd+0x2c/0x58
[191111.700000] [<802662c0>] smpboot_thread_fn+0x134/0x164
[191111.700000] [<801b7910>] kthread+0xd8/0xe4
[191111.700000] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[191111.700000] 
[191111.700000] handlers:
[191111.700000] [<83363864>] ath_isr [ath9k]
[191111.700000] Disabling IRQ #40

comment:11 Changed 3 years ago by papaj0e

After upgrade to trunk r43731 here we are again - the same bug appeared after less than a day uptime:

[98838.200000] irq 40: nobody cared (try booting with the "irqpoll" option)
[98838.200000] CPU: 0 PID: 1408 Comm: hostapd Not tainted 3.14.26 #2
[98838.200000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000035
[98838.200000] 	  82ec0240 00000000 80301938 803521ff 00000580 803b3b5c 82ec0240 00000000
[98838.200000] 	  80340000 803051f0 80305204 802a2d8c 00000000 8020661c 00000006 801a971c
[98838.200000] 	  80305038 824a371c 00000000 00000000 00000000 00000000 00000000 00000000
[98838.200000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[98838.200000] 	  ...
[98838.200000] Call Trace:
[98838.200000] [<802489b0>] show_stack+0x48/0x70
[98838.200000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[98838.200000] [<801dab98>] note_interrupt+0x224/0x2d8
[98838.200000] [<80160dcc>] handle_irq_event_percpu+0x1b8/0x1ec
[98838.200000] [<80160bf0>] handle_irq_event+0x3c/0x60
[98838.200000] [<80160ee0>] handle_level_irq+0xe0/0xf8
[98838.200000] [<80153df0>] generic_handle_irq+0x28/0x44
[98838.200000] [<80153df0>] generic_handle_irq+0x28/0x44
[98838.200000] [<8011c7a0>] do_IRQ+0x1c/0x2c
[98838.200000] [<80060830>] ret_from_irq+0x0/0x4
[98838.200000] [<8008f624>] __do_softirq+0x8c/0x228
[98838.200000] [<8018beb4>] irq_exit+0x54/0x70
[98838.200000] [<80060830>] ret_from_irq+0x0/0x4
[98838.200000] [<801b4884>] lzma_main+0x24c/0xb84
[98838.200000] [<802b7268>] xz_dec_lzma2_run+0x578/0x834
[98838.200000] [<802b78a4>] xz_dec_run+0x324/0x8fc
[98838.200000] [<80261e00>] squashfs_xz_uncompress+0x10c/0x1e8
[98838.200000] [<8025d6c8>] squashfs_decompress+0x50/0x88
[98838.200000] [<8025f394>] squashfs_read_data+0x5d0/0x73c
[98838.200000] [<8025cf94>] squashfs_cache_get+0x1a0/0x32c
[98838.200000] [<802614fc>] squashfs_readpage_block+0x164/0x450
[98838.200000] [<80261224>] squashfs_readpage+0x750/0x8c4
[98838.200000] [<8008eabc>] __do_page_cache_readahead+0x1ec/0x24c
[98838.200000] [<80143958>] filemap_fault+0x198/0x3e8
[98838.200000] [<8008e54c>] __do_fault+0xd0/0x454
[98838.200000] [<801610e8>] handle_mm_fault+0x1f0/0x754
[98838.200000] [<8008ec4c>] __do_page_fault+0x130/0x388
[98838.200000] [<80060820>] ret_from_exception+0x0/0x10
[98838.200000] 
[98838.200000] handlers:
[98838.200000] [<83363864>] ath_isr [ath9k]
[98838.200000] Disabling IRQ #40

And not surprisingly cat /proc/interrupts returns

           CPU0       
  3:    5091311      MIPS  ehci_hcd:usb1
  4:     875982      MIPS  eth0
  5:   64443502      MIPS  eth1
  7:   15338880      MIPS  timer
 11:         38      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:   49700000  AR71XX PCI  ath9k
 41:    9230325  AR71XX PCI  ath9k
ERR:          0

As you can see the count of interrupts for IRQ 40 again is divisible exactly by 100000

This pattern is clearly repeatable - it should help developers to find where this bug is hidden.

comment:12 Changed 3 years ago by papaj0e

After reload of the wireless modules here is the next failure log that occured about 27 hours later:

[196621.090000] irq 40: nobody cared (try booting with the "irqpoll" option)
[196621.090000] CPU: 0 PID: 29970 Comm: iptables Not tainted 3.14.26 #2
[196621.090000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000037
[196621.090000] 	  80c98bd0 00000000 80301938 803521ff 00007512 803b3b5c 80c98bd0 00000000
[196621.090000] 	  80340000 803051f0 80305204 802a2d8c 00000000 8020661c 00000006 801a971c
[196621.090000] 	  80305038 82f015a4 00000000 00000000 00000000 00000000 00000000 00000000
[196621.090000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[196621.090000] 	  ...
[196621.090000] Call Trace:
[196621.090000] [<802489b0>] show_stack+0x48/0x70
[196621.090000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[196621.090000] [<801dab98>] note_interrupt+0x224/0x2d8
[196621.090000] [<80160dcc>] handle_irq_event_percpu+0x1b8/0x1ec
[196621.090000] [<80160bf0>] handle_irq_event+0x3c/0x60
[196621.090000] [<80160ee0>] handle_level_irq+0xe0/0xf8
[196621.090000] [<80153df0>] generic_handle_irq+0x28/0x44
[196621.090000] [<80153df0>] generic_handle_irq+0x28/0x44
[196621.090000] [<8011c7a0>] do_IRQ+0x1c/0x2c
[196621.090000] [<80060830>] ret_from_irq+0x0/0x4
[196621.090000] [<8008f624>] __do_softirq+0x8c/0x228
[196621.090000] [<8018beb4>] irq_exit+0x54/0x70
[196621.090000] [<80060830>] ret_from_irq+0x0/0x4
[196621.090000] [<801b4804>] lzma_main+0x1cc/0xb84
[196621.090000] [<802b7268>] xz_dec_lzma2_run+0x578/0x834
[196621.090000] [<802b78a4>] xz_dec_run+0x324/0x8fc
[196621.090000] [<80261e00>] squashfs_xz_uncompress+0x10c/0x1e8
[196621.090000] [<8025d6c8>] squashfs_decompress+0x50/0x88
[196621.090000] [<8025f394>] squashfs_read_data+0x5d0/0x73c
[196621.090000] [<8025cf94>] squashfs_cache_get+0x1a0/0x32c
[196621.090000] [<80261250>] squashfs_readpage+0x77c/0x8c4
[196621.090000] [<8008eabc>] __do_page_cache_readahead+0x1ec/0x24c
[196621.090000] [<80143958>] filemap_fault+0x198/0x3e8
[196621.090000] [<8008e54c>] __do_fault+0xd0/0x454
[196621.090000] [<801610e8>] handle_mm_fault+0x1f0/0x754
[196621.090000] [<8008ec4c>] __do_page_fault+0x130/0x388
[196621.090000] [<80060820>] ret_from_exception+0x0/0x10
[196621.090000] [<80064b74>] __bzero+0x38/0x164
[196621.090000] [<801e5fbc>] padzero+0x4c/0x70
[196621.090000] [<801ae504>] load_elf_binary+0x754/0x124c
[196621.090000] [<80234ae8>] search_binary_handler+0x90/0x1d0
[196621.090000] [<8011f844>] do_execve+0x334/0x4c8
[196621.090000] [<8006283c>] handle_sys+0x11c/0x140
[196621.090000] 
[196621.090000] handlers:
[196621.090000] [<806e3864>] ath_isr [ath9k]
[196621.090000] Disabling IRQ #40
root@OpenWrt:~# cat /proc/interrupts 
           CPU0       
  3:   11649972      MIPS  ehci_hcd:usb1
  4:    2698590      MIPS  eth0
  5:  122543399      MIPS  eth1
  7:   33826486      MIPS  timer
 11:         38      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:   93500000  AR71XX PCI  ath9k
 41:   19834394  AR71XX PCI  ath9k
ERR:          0

comment:13 Changed 3 years ago by hnyman

The bug seems to be rather elusive. I have now tried to burden my wndr3700v1 rev43682 with wifi traffic, but after a week and 136 million interrupts, no bug...

When looking at the kernel logs above, one thing jumps out:
with papajoe the call stack trace seems always include filesystem operations (squashfs, lzma), while with me and anonymous the trace more directly includes ath9k.

@developers:
when that error surfaces, what would be the useful debug info in addition to the kernel logs above & /proc/interrupts stats?
Something else that would help in finding the possible bug?

comment:14 Changed 3 years ago by mroek

This is a rather elusive bug. My WNDR3700 has now been running for 8 days without being hit, so it clearly takes some special circumstances for it to appear (but it has occured on my router with the same version it is running now, just not for quite a few days).

How do we know if any dev is actually following this ticket?

comment:15 Changed 3 years ago by mroek

And after 12 days mine (running r43589) crashed again:

[1051020.490000] irq 40: nobody cared (try booting with the "irqpoll" option)
[1051020.490000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.26 #4
[1051020.490000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803bc8ce 00000032
[1051020.490000]          803453b0 00000000 802fb434 803457d7 00000000 803b3b5c 803453b0 00000000
[1051020.490000]          80340000 802fe7b0 802fe7c4 8029e014 00000000 801ffe04 00000006 801a43d8
[1051020.490000]          802fe5f8 80335b44 00000000 00000000 00000000 00000000 00000000 00000000
[1051020.490000]          00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[1051020.490000]          ...
[1051020.490000] Call Trace:
[1051020.490000] [<80241ecc>] show_stack+0x48/0x70
[1051020.490000] [<800a66b0>] __report_bad_irq.isra.7+0x44/0xf8
[1051020.490000] [<801d4fa0>] note_interrupt+0x224/0x2d8
[1051020.490000] [<8015c7ac>] handle_irq_event_percpu+0x1b8/0x1ec
[1051020.490000] [<8015c5d0>] handle_irq_event+0x3c/0x60
[1051020.490000] [<8015c8c0>] handle_level_irq+0xe0/0xf8
[1051020.490000] [<8014ff30>] generic_handle_irq+0x28/0x44
[1051020.490000] [<8014ff30>] generic_handle_irq+0x28/0x44
[1051020.490000] [<80119000>] do_IRQ+0x1c/0x2c
[1051020.490000] [<80060830>] ret_from_irq+0x0/0x4
[1051020.490000] [<83321274>] ath_start_rfkill_poll+0x10c/0x37c [ath9k]
[1051020.490000] [<83323d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[1051020.490000] [<80264fb0>] tasklet_action+0x84/0xcc
[1051020.490000] [<8008f5e0>] __do_softirq+0xf8/0x228
[1051020.490000] [<80187194>] irq_exit+0x54/0x70
[1051020.490000] [<80060830>] ret_from_irq+0x0/0x4
[1051020.490000] [<80060a80>] __r4k_wait+0x20/0x40
[1051020.490000] [<801008f4>] cpu_startup_entry+0xa4/0x104
[1051020.490000] [<8035294c>] start_kernel+0x3c8/0x3e0
[1051020.490000]
[1051020.490000] handlers:
[1051020.490000] [<83323864>] ath_isr [ath9k]
[1051020.490000] Disabling IRQ #40

root@WNDR3700:~# cat /proc/interrupts
           CPU0
  3:          0      MIPS  ehci_hcd:usb1
  4:  155252913      MIPS  eth0
  5:          0      MIPS  eth1
  7:  107039095      MIPS  timer
 11:         43      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  241700000  AR71XX PCI  ath9k
 41:   87132137  AR71XX PCI  ath9k
ERR:          0

comment:16 Changed 3 years ago by anonymous

After running r43757 more than 4 days the bug appeared again:

[328914.630000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[328914.730000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[328914.840000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[328914.950000] ath: phy0: Failed to stop TX DMA, queues=0x102!
[344233.640000] irq 40: nobody cared (try booting with the "irqpoll" option)
[344233.640000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.26 #1
[344233.640000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803cc8ce 00000032
[344233.640000] 	  80353e48 00000000 80301ab8 8035426f 00000000 803c3b5c 80353e48 00000000
[344233.640000] 	  80340000 80305370 80305384 802a2d98 00000000 80206488 00000006 801a9464
[344233.640000] 	  803051b8 80341b44 00000000 00000000 00000000 00000000 00000000 00000000
[344233.640000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[344233.640000] 	  ...
[344233.640000] Call Trace:
[344233.640000] [<80248980>] show_stack+0x48/0x70
[344233.640000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[344233.640000] [<801daa04>] note_interrupt+0x224/0x2d8
[344233.640000] [<80160b14>] handle_irq_event_percpu+0x1b8/0x1ec
[344233.640000] [<80160938>] handle_irq_event+0x3c/0x60
[344233.640000] [<80160c28>] handle_level_irq+0xe0/0xf8
[344233.640000] [<80153b38>] generic_handle_irq+0x28/0x44
[344233.640000] [<80153b38>] generic_handle_irq+0x28/0x44
[344233.640000] [<8011c4f0>] do_IRQ+0x1c/0x2c
[344233.640000] [<80060830>] ret_from_irq+0x0/0x4
[344233.640000] [<82c61274>] ath_start_rfkill_poll+0x10c/0x37c [ath9k]
[344233.640000] [<82c63d68>] ath9k_tasklet+0x214/0x230 [ath9k]
[344233.640000] [<8026c728>] tasklet_action+0x84/0xcc
[344233.640000] [<8008f690>] __do_softirq+0xf8/0x228
[344233.640000] [<8018bbfc>] irq_exit+0x54/0x70
[344233.640000] [<80060830>] ret_from_irq+0x0/0x4
[344233.640000] [<80060a80>] __r4k_wait+0x20/0x40
[344233.640000] [<801027d0>] cpu_startup_entry+0xa4/0x104
[344233.640000] [<8036194c>] start_kernel+0x3c8/0x3e0
[344233.640000] 
[344233.640000] handlers:
[344233.640000] [<82c63864>] ath_isr [ath9k]
[344233.640000] Disabling IRQ #40
[366320.240000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366596.270000] ath: phy0: Failed to stop TX DMA, queues=0x00b!
[366597.540000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366597.730000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366597.940000] ath: phy0: Failed to stop TX DMA, queues=0x10a!
[366598.970000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366599.990000] ath: phy0: Failed to stop TX DMA, queues=0x00b!
[366601.020000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366602.040000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366602.240000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[366602.450000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366602.640000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[366603.660000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[366605.670000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[366606.450000] ath: phy0: Failed to stop TX DMA, queues=0x10a!
[366606.640000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367138.660000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367139.650000] ath: phy0: Failed to stop TX DMA, queues=0x00b!
[367139.840000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367140.040000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367141.060000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367143.080000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367143.840000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367144.040000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367144.250000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367146.270000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367148.150000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367148.350000] ath: phy0: Failed to stop TX DMA, queues=0x102!
[367148.540000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367150.560000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[367152.440000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367152.640000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367152.840000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367154.850000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367156.640000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367156.840000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367157.040000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[367159.050000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[369014.070000] ath: phy0: Failed to stop TX DMA, queues=0x00a!

cat /proc/interrupts

           CPU0       
  3:     691949      MIPS  ehci_hcd:usb1
  4:  148903151      MIPS  eth0
  5:  322457889      MIPS  eth1
  7:   41157503      MIPS  timer
 11:        104      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  222400000  AR71XX PCI  ath9k
 41:   38268933  AR71XX PCI  ath9k
ERR:          0

Again as you may see the number of interrupts is exactly divisible by 100000.

comment:17 Changed 3 years ago by papaj0e

And once again after two more days of uptime:

[521684.990000] ath: phy0: Failed to stop TX DMA, queues=0x00a!
[557188.050000] irq 40: nobody cared (try booting with the "irqpoll" option)
[557188.050000] CPU: 0 PID: 9832 Comm: kworker/u2:0 Not tainted 3.14.26 #1
[557188.050000] Workqueue: phy0 ath_reset_work [ath9k]
[557188.050000] Stack : 00000001 ffffffff 81d21e00 80301ab8 80356f68 00000028 00000000 00000000
[557188.050000] 	  80340000 80305370 80305384 802a2d98 00002668 80206bd0 82ecb210 00000000
[557188.050000] 	  80310fe8 82df5b84 82df5b84 802a2d98 00000000 80206698 00000006 00000000
[557188.050000] 	  00000005 80238e18 00000000 00000000 00000000 00000000 00000000 00000000
[557188.050000] 	  70687930 00000000 00000000 00000000 00000000 00000000 825c6a00 825c6900
[557188.050000] 	  ...
[557188.050000] Call Trace:
[557188.050000] [<80248980>] show_stack+0x48/0x70
[557188.050000] [<800a6cb8>] __report_bad_irq.isra.7+0x44/0xf8
[557188.050000] [<801daa04>] note_interrupt+0x224/0x2d8
[557188.050000] [<80160b14>] handle_irq_event_percpu+0x1b8/0x1ec
[557188.050000] [<80160938>] handle_irq_event+0x3c/0x60
[557188.050000] [<80160c28>] handle_level_irq+0xe0/0xf8
[557188.050000] [<80153b38>] generic_handle_irq+0x28/0x44
[557188.050000] [<80153b38>] generic_handle_irq+0x28/0x44
[557188.050000] [<8011c4f0>] do_IRQ+0x1c/0x2c
[557188.050000] [<80060830>] ret_from_irq+0x0/0x4
[557188.050000] [<804a85a0>] ieee80211_stop_queues_by_reason+0x8c/0xb0 [mac80211]
[557188.050000] [<832c2370>] ath9k_deinit_device+0x27c/0x88c [ath9k]
[557188.050000] [<832c4480>] ath9k_calculate_summary_state+0x550/0x648 [ath9k]
[557188.050000] [<8010dec0>] dequeue_task_fair+0x0/0x12c
[557188.050000] 
[557188.050000] handlers:
[557188.050000] [<832c3864>] ath_isr [ath9k]
[557188.050000] Disabling IRQ #40

cat /proc/interrupts

           CPU0       
  3:     691949      MIPS  ehci_hcd:usb1
  4:  232258906      MIPS  eth0
  5:  495000517      MIPS  eth1
  7:   57494081      MIPS  timer
 11:        125      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  329600000  AR71XX PCI  ath9k
 41:   53448736  AR71XX PCI  ath9k
ERR:          0

Last two times the router crashed it has been lightly loaded and have about half the RAM available:

root@OpenWrt:~# cat /proc/meminfo 
MemTotal:          61264 kB
MemFree:           22452 kB
MemAvailable:      37016 kB
Buffers:            2028 kB
Cached:            13208 kB
SwapCached:            0 kB
Active:            10160 kB
Inactive:           7568 kB
Active(anon):       2524 kB
Inactive(anon):     1032 kB
Active(file):       7636 kB
Inactive(file):     6536 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          2512 kB
Mapped:             1616 kB
Shmem:              1064 kB
Slab:              11176 kB
SReclaimable:       4088 kB
SUnreclaim:         7088 kB
KernelStack:         336 kB
PageTables:          288 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       30632 kB
Committed_AS:       6728 kB
VmallocTotal:    1048372 kB
VmallocUsed:        1880 kB
VmallocChunk:    1023528 kB

comment:18 Changed 3 years ago by anonymous

It's not just IRQ 40. WNDR3700 running Chaos Calmer r43592 having the same problem for IRQ 41.

[595324.220000] irq 41: nobody cared (try booting with the "irqpoll" option)
[595324.220000] CPU: 0 PID: 4254 Comm: kworker/u2:0 Not tainted 3.14.26 #1
[595324.220000] Workqueue: phy1 ath_tx_complete_poll_work [ath9k]
[595324.220000] Stack : 00000001 ffffffff 8249b080 802f7094 80344538 00000029 00000000 00000000
[595324.220000] 	  80330000 802fa410 802fa424 8029a164 0000109e 801ff850 83b44240 00000000
[595324.220000] 	  80304f38 81d15844 81d15844 8029a164 00000000 801ff318 00000006 00000000
[595324.220000] 	  00000005 802315d4 00000000 00000000 00000000 00000000 00000000 00000000
[595324.220000] 	  70687931 00000000 00000000 00000000 00000000 00000000 8339a400 8339a500
[595324.220000] 	  ...
[595324.220000] Call Trace:
[595324.220000] [<802410d0>] show_stack+0x48/0x70
[595324.220000] [<800a64ec>] __report_bad_irq.isra.7+0x44/0xf8
[595324.220000] [<801d42cc>] note_interrupt+0x224/0x2d8
[595324.220000] [<8015bb60>] handle_irq_event_percpu+0x1b8/0x1ec
[595324.220000] [<8015b984>] handle_irq_event+0x3c/0x60
[595324.220000] [<8015bc74>] handle_level_irq+0xe0/0xf8
[595324.220000] [<8014f2e4>] generic_handle_irq+0x28/0x44
[595324.220000] [<8014f2e4>] generic_handle_irq+0x28/0x44
[595324.220000] [<801186b4>] do_IRQ+0x1c/0x2c
[595324.220000] [<80060830>] ret_from_irq+0x0/0x4
[595324.220000] [<800f7a0c>] codel_should_drop.isra.18+0x10/0x9c
[595324.220000] [<80146f88>] fq_codel_dequeue+0x11c/0x60c
[595324.220000] [<800a4a08>] __qdisc_run+0xb0/0x174
[595324.220000] [<8008cef0>] __dev_queue_xmit+0x30c/0x468
[595324.220000] [<802a28fc>] vlan_dev_hard_start_xmit+0x98/0x128
[595324.220000] [<8010e778>] dev_hard_start_xmit+0x338/0x490
[595324.220000] [<8008cf68>] __dev_queue_xmit+0x384/0x468
[595324.220000] [<800df154>] br_dev_queue_push_xmit+0x6c/0x80
[595324.220000] [<800df52c>] br_dev_xmit+0x1e0/0x218
[595324.220000] [<8010e778>] dev_hard_start_xmit+0x338/0x490
[595324.220000] [<8008cf68>] __dev_queue_xmit+0x384/0x468
[595324.220000] [<80178224>] ip_finish_output+0x404/0x484
[595324.220000] [<800a0614>] __netif_receive_skb_core+0x474/0x530
[595324.220000] [<800bac48>] ag71xx_poll+0x434/0x608
[595324.220000] [<801caa38>] net_rx_action+0x10c/0x1e0
[595324.220000] [<8008f5c8>] __do_softirq+0xf8/0x228
[595324.220000] [<801262dc>] do_softirq+0x48/0x68
[595324.220000] [<8009c6f4>] __local_bh_enable_ip+0x98/0xb8
[595324.220000] [<832ebaec>] ath_tx_complete_poll_work+0xf8/0x188 [ath9k]
[595324.220000] [<80207e20>] process_one_work+0x228/0x364
[595324.220000] [<802ab8e8>] worker_thread+0x234/0x388
[595324.220000] [<801a2e34>] kthread+0xd8/0xe4
[595324.220000] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[595324.220000] 
[595324.220000] handlers:
[595324.220000] [<832e3f3c>] ath_isr [ath9k]
[595324.220000] Disabling IRQ #41

Happens normally after 6 days of uptime and thus assuming the below /proc/interrupts would be the same re IRQ 41.

           CPU0
  3:          0      MIPS  ehci_hcd:usb1
  4:   13854285      MIPS  eth0
  5:    9814582      MIPS  eth1
  7:   61342387      MIPS  timer
 11:         38      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:   53206499  AR71XX PCI  ath9k
 41:   47000000  AR71XX PCI  ath9k
ERR:          0

As this impacts only 5G I left it earlier unattended and eventually it kills the IRQ 40 as well.

comment:19 Changed 3 years ago by cmtsij

confirm that this bug is still there on r43793.
My device is XWR100.

[177314.420000] irq 40: nobody cared (try booting with the "irqpoll" option)
[177314.420000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.26 #1
[177314.420000] Stack : 00000006 ffffffff 00000000 00000000 00000000 00000000 803ac8ae 00000032
	  8033ee28 00000000 802f5144 8033f24f 00000000 803a3b5c 8033ee28 00000000
	  80330000 802f84c0 802f84d4 802976c4 00000000 801fcb10 00000006 801a1444
	  802f8308 8032fb24 00000000 00000000 00000000 00000000 00000000 00000000
	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	  ...
[177314.420000] Call Trace:
[177314.420000] [<8023e9e4>] show_stack+0x48/0x70
[177314.420000] [<800a5b08>] __report_bad_irq.isra.7+0x44/0xf8
[177314.420000] [<801d2130>] note_interrupt+0x224/0x2d8
[177314.420000] [<80159f20>] handle_irq_event_percpu+0x1b8/0x1ec
[177314.420000] [<80159d44>] handle_irq_event+0x3c/0x60
[177314.420000] [<8015a034>] handle_level_irq+0xe0/0xf8
[177314.420000] [<8014d6a4>] generic_handle_irq+0x28/0x44
[177314.420000] [<8014d6a4>] generic_handle_irq+0x28/0x44
[177314.420000] [<80116ab4>] do_IRQ+0x1c/0x2c
[177314.420000] [<80060830>] ret_from_irq+0x0/0x4
[177314.420000] [<80d9bd40>] ath9k_hw_enable_interrupts+0x134/0x194 [ath9k_hw]
[177314.420000] [<80de3d88>] ath9k_tasklet+0x214/0x230 [ath9k]
[177314.420000] [<80261ac0>] tasklet_action+0x84/0xcc
[177314.420000] [<8008f340>] __do_softirq+0xf8/0x228
[177314.420000] [<80183f94>] irq_exit+0x54/0x70
[177314.420000] [<80060830>] ret_from_irq+0x0/0x4
[177314.420000] [<80060a80>] __r4k_wait+0x20/0x40
[177314.420000] [<800fe954>] cpu_startup_entry+0xa4/0x104
[177314.420000] [<8034c94c>] start_kernel+0x3c8/0x3e0

comment:20 follow-up: Changed 3 years ago by anonymous

With all due respect, I think that the best course of action would be to systematically flash previous trunk builds in order to find out the exact build within which the bug was introduced.

comment:21 in reply to: ↑ 20 Changed 3 years ago by anonymous

anonymous here:

Replying to anonymous:

With all due respect, I think that the best course of action would be to systematically flash previous trunk builds in order to find out the exact build within which the bug was introduced.

That seems like a good idea, if anyone can repro frequently enough.

Also, it might help to log *why* ath9k didn't care about the interrupt, i.e. which condition in ath_isr() caused it to return IRQ_NONE. I might work up a patch for this in a few days.

The issue has not yet recurred on my system (WNDR3700v1 still running r43457).

comment:22 follow-up: Changed 3 years ago by anonymous

Just a hunch but is anyone able to test with 43342 reverted?

https://dev.openwrt.org/changeset/43342/

comment:23 in reply to: ↑ 22 Changed 3 years ago by bittorf@…

Replying to anonymous:

Just a hunch but is anyone able to test with 43342 reverted?

https://dev.openwrt.org/changeset/43342/

i can see this with r43238 too, so I guess not:
(Ubiquiti Bullet M)

root@box:~ cat /proc/interrupts 
           CPU0       
  4:     245701      MIPS  eth0
  7:    2539839      MIPS  timer
 11:       4718      MISC  serial
 40:     100000  AR724X PCI   ath9k
ERR:          0
[   27.090000] cfg80211: Calling CRDA to update world regulatory domain
[   27.090000] cfg80211: World regulatory domain updated:
[   27.100000] cfg80211:  DFS Master region: unset
[   27.100000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   27.110000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   27.120000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   27.240000] PPP generic driver version 2.4.2
[   27.260000] NET: Registered protocol family 24
[   27.360000] PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
[   27.820000] irq 40: nobody cared (try booting with the "irqpoll" option)
[   27.820000] CPU: 0 PID: 1524 Comm: kmodloader Not tainted 3.10.58 #20
[   27.820000] Stack : 00000000 00000000 00000000 00000000 8039bf1e 00000039 81ad9b00 00000000
[   27.820000] 	  802d7d14 8031f373 000005f4 803939dc 81ad9b00 00000000 80310000 802dbaa0
[   27.820000] 	  802dbab4 8027e3ec 00000000 801e3224 00000006 8018cff4 802d93a4 80c518a4
[   27.820000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   27.820000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 80c51830
[   27.820000] 	  ...
[   27.820000] Call Trace:
[   27.820000] [<80223f6c>] show_stack+0x48/0x70
[   27.820000] [<800a17e0>] __report_bad_irq.isra.6+0x44/0xf8
[   27.820000] [<801bb614>] note_interrupt+0x1d4/0x2cc
[   27.820000] [<8014a73c>] handle_irq_event_percpu+0x1b8/0x1ec
[   27.820000] [<8014a560>] handle_irq_event+0x3c/0x60
[   27.820000] [<8014a850>] handle_level_irq+0xe0/0xf8
[   27.820000] [<8013f1a0>] generic_handle_irq+0x28/0x44
[   27.820000] [<8013f1a0>] generic_handle_irq+0x28/0x44
[   27.820000] [<80107c50>] do_IRQ+0x1c/0x2c
[   27.820000] [<80060830>] ret_from_irq+0x0/0x4
[   27.820000] [<8008bbc4>] __do_softirq+0x84/0x1bc
[   27.820000] [<801158d8>] do_softirq+0x48/0x68
[   27.820000] [<801732d4>] irq_exit+0x54/0x70
[   27.820000] [<80060830>] ret_from_irq+0x0/0x4
[   27.820000] [<800a3c50>] __setup_irq+0x3d0/0x4e0
[   27.820000] [<801fed04>] request_threaded_irq+0xf4/0x154
[   27.820000] [<80e2d994>] ath_chanctx_set_channel+0x5f0/0x81c [ath9k]
[   27.820000] [<80241fd4>] sysfs_do_create_link_sd+0x150/0x1e4
[   27.820000] [<8011b8c8>] driver_sysfs_add+0x6c/0xa0
[   27.820000] [<8011b564>] driver_probe_device+0xc8/0x21c
[   27.820000] [<8008bf34>] __driver_attach+0x7c/0xb4
[   27.820000] [<800df488>] bus_for_each_dev+0x98/0xa8
[   27.820000] [<800df150>] bus_add_driver+0x104/0x248
[   27.820000] [<8011b78c>] driver_register+0xbc/0x16c
[   27.820000] [<80e78028>] init_module+0x28/0x58 [ath9k]
[   27.820000] [<80111a7c>] do_one_initcall+0xf0/0x198
[   27.820000] [<80193ac4>] load_module+0x156c/0x1b34
[   27.820000] [<800733b8>] SyS_init_module+0xbc/0xd4
[   27.820000] [<800625b0>] stack_done+0x20/0x44
[   27.820000]
[   27.820000] handlers:
[   27.820000] [<80e23ed4>] ath_isr [ath9k]
[   27.820000] Disabling IRQ #40
[   28.040000] ath: EEPROM regdomain: 0x0
[   28.040000] ath: EEPROM indicates default country code should be used
[   28.040000] ath: doing EEPROM country->regdmn map search
[   28.040000] ath: country maps to regdmn code: 0x3a
[   28.040000] ath: Country alpha2 being used: US
[   28.040000] ath: Regpair used: 0x3a
[   28.190000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   28.300000] cfg80211: Calling CRDA for country: US
[   28.310000] cfg80211: Regulatory domain changed to country: US
[   28.310000] cfg80211:  DFS Master region: unset
[   28.320000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   28.330000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   28.330000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   28.340000] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
[   38.210000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   38.340000] eth0: link up (100Mbps/Full duplex)
[   38.340000] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   40.160000] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   40.290000] wlan0: Created IBSS using preconfigured BSSID 02:ca:ff:ee:00:40
[   40.290000] wlan0: Creating new IBSS network, BSSID 02:ca:ff:ee:00:40
[   40.320000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

comment:24 Changed 3 years ago by mroek

I built r43823 yesterday, and was hit after a couple of hours. Not surprising, since nothing specific has been done to fix this, but it was time to do a rebuild anyway.

comment:25 Changed 3 years ago by cmtsij

Please help to check compact driver version on the device, and report the issue is exist or not.
example:

[   12.020000] Loading modules backported from Linux version master-2014-10-08-0-g0275925
[   12.030000] Backport generated by backports.git backports-20140905-1-gde42785

comment:26 Changed 3 years ago by bittorf@…

can please somebody post which interface combinations are used? we can see it at least with adhoc and adhoc+ap - anybody can see it with ap-only?

comment:27 Changed 3 years ago by mroek

Mine is running AP only, serving two SSIDs on each band.

comment:28 Changed 3 years ago by bluse

Hi all,

From todays troubleshooting on several devices (UBNT bullet, Nanostation, TPLink 4300, 4900) and the help of nbd, I provide the following status update of this issue:

OBSERVATIONs with CHAOS CALMER (Bleeding Edge, r43853):

-the IRQ error can be triggered on all my platforms with different IRQ numbers (40,17,18) in the log
-a 'cat /proc/interrupts' show which IRQ ath9k uses
-the actual WiFi mode does not matter to trigger the problem (adhoc, ap, sta)
-after the error happens, latency increases up to 200-300ms and throughput decreases to ~250kByte/s
-it seems that while ath9k does a hw reset, the kernel handler ignors IRQs but the hw fires an IRQ
-this leads probably to a temporary IRQ storm and the kernel solves this by disabling this IRQ

STEPs to REPRODUCE the ERROR:
1) -generate high traffic load on your ath9k wireless interface (iperf, ect.)
2) -enforce several ath9k hw resets by issuing:

while :;do echo 4 >/sys/kernel/debug/ieee80211/phy0/ath9k/diag;I=100;while let I-=1; do :;done;done

3) -check "dmesg" for the error to appear

ATH9K SOURCE Code where the race condition is most likely to happen:
../compat-wireless../driver/net/wireless/ath/ath9k/main.c
../compat-wireless../driver/net/wireless/ath/ath9k/mac.c

PATCH HISTORY of this two files (with some IRQ fixes already in place)
http://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git/log/drivers/net/wireless/ath/ath9k/main.c
http://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git/log/drivers/net/wireless/ath/ath9k/mac.c

TEMPORARY WORKAROUND for productive routers:
-daily reboot at night via crond


Still no fix yet, but work in progress.

Greetings Thomas

Last edited 3 years ago by bluse (previous) (diff)

comment:29 Changed 3 years ago by bluse

UPDATE: potential WORKAROUND

By reverting the ath9k patch "ath9k: do not access hardware on IRQs during reset" I can not trigger the IRQ bug anymore on my plaforms while doing hw resets via the diag debugfs.

In order to validate if the mentioned patch is responsible for the IRQ error, you can help testing on your side by doing:

1) Download my patch to revert "ath9k: do not access hardware on IRQs during reset" into your Openwrt build env (../packages/kernel/mac80211/patches) from http://net.t-labs.tu-berlin.de/~thomas/999-revert_ath9k_do-not-access-hw-on-IRQ-reset.patch

2) rebuild (make clean && make) your OpenWrt trunk

3) flash you OpenWRT trunk image or update your ath9k driver by opkg

4) test if the STEPs to REPRODUCE the ERROR still trigger the IRQ issue.

If we get confidence that the patch in question is responsible for the IRQ issue, our troubleshooting is sufficient to point to the root cause and ath9k developers could rework/fix this patch.

Greetings Thomas

comment:30 Changed 3 years ago by papaj0e

I can confirm that after reverting the patch "ath9k: do not access hardware on IRQs during reset" and running trunk r43901 for almost two days I'm not able to trigger this bug anymore.

And this is after heavy wireless traffic load with almost 200 million interrupts fired:

root@OpenWrt:/tmp# cat /proc/interrupts 
           CPU0       
  3:          0      MIPS  ehci_hcd:usb1
  4:   34298818      MIPS  eth0
  5:  192964035      MIPS  eth1
  7:   16214486      MIPS  timer
 11:         48      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  192117084  AR71XX PCI  ath9k
 41:   12490162  AR71XX PCI  ath9k
ERR:          0

The only issue I have is the infamous "Failed to stop TX DMA" one:

[ 1831.230000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.170000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.280000] ath: phy0: Failed to stop TX DMA, queues=0x102!
[46139.380000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.480000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.580000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.680000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.790000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.890000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[46139.990000] ath: phy0: Failed to stop TX DMA, queues=0x002!
[144410.320000] ath: phy0: Failed to stop TX DMA, queues=0x002!

So it seems to me that this patch is the culprit and it should be revised by ath9k developers.

comment:31 follow-up: Changed 3 years ago by anonymous

Interesting. bittorf reported in comment:23 seeing this issue on r43238, the revision immediately prior to that patch being applied.

comment:32 in reply to: ↑ 31 ; follow-up: Changed 3 years ago by bluse

The provided error log from bittorf in comment:23 shows the IRQ issues with ath9k while his router is in its booting phase.
So rigth after the PCI system registers the ath9k card it shows:

...
[   27.360000] PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
[   27.820000] irq 40: nobody cared (try booting with the "irqpoll" option)
.
. <- IRQ Issue happens before WiFi interface is up
.
[   27.820000] Disabling IRQ #40
[   28.040000] ath: EEPROM regdomain: 0x0
[   28.040000] ath: EEPROM indicates default country code should be used
[   28.040000] ath: doing EEPROM country->regdmn map search
[   28.040000] ath: country maps to regdmn code: 0x3a
[   28.040000] ath: Country alpha2 being used: US
[   28.040000] ath: Regpair used: 0x3a
[   28.190000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   28.300000] cfg80211: Calling CRDA for country: US
[   28.310000] cfg80211: Regulatory domain changed to country: US
[   28.310000] cfg80211:  DFS Master region: unset
[   28.320000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   28.330000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   28.330000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   28.340000] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
...

In contrast to this observation, I am refering to the IRQ issue that got triggert while normal transmit operations of the ath9k WiFi card, long after the router booted.

I have contacted the author Bittorf to recheck this on his side. As it is either a different root cause in r43238 with his UBNT Bullet M or our current troubleshooting with the ath9k patch in question is not sufficient.

Last edited 3 years ago by bluse (previous) (diff)

comment:33 follow-up: Changed 3 years ago by mroek

I haven't gotten around to test the provided patch yet, but it does sound promising.
However, what I don't really understand, is why ath9k would perform hw resets on a running system? Is that normal behaviour?

comment:34 in reply to: ↑ 33 Changed 3 years ago by bluse

However, what I don't really understand, is why ath9k would perform hw resets on a running system? Is that normal behaviour?

Short answer: yes
Many Wifi chips are reseted under certain circumstances by their driver core.. for instance if the chip hangs.. a watchdog routine will trigger a hw reset. Just look at the the relevant ath9k code in main.c and mac.c to understand which conditions trigger a hw reset.

I haven't gotten around to test the provided patch yet, but it does sound promising.

Please help with testing here and take the needed time, as this issues seems to affect many OpenWRT & ath9k users.

comment:35 Changed 3 years ago by mroek

I get it, but a hardware reset is nevertheless a sign that something happened which shouldn't have...

For some reason, I haven't seen this issue on my TP-Link WDR4300, only on my Netgear WNDR3700. I haven't tried provoking the error on the WDR4300 though, as that's my main router.

Anyway:
I have now recompiled r43823 with your patch applied, and I can confirm that I couldn't cause the issue to appear when performing the steps you outlined. I did get a bunch of the usual ath9k DMA-errors though, but it didn't crash. With the original r43823 the error was easily reproduced, so I think it is safe to say that the patch works as advertised.

comment:36 in reply to: ↑ 32 Changed 3 years ago by bittorf@…

I have contacted the author Bittorf to recheck this on his side. As it is either a different root cause in r43238 with his UBNT Bullet M or our current troubleshooting with the ath9k patch in question is not sufficient.

i think you are right, it seems to be another reason - but this device with this revision reliable has this irq-thingy just on bootup. i will flash 4 devices tonight with the patch of 'bluse' and report again...

comment:37 Changed 3 years ago by cmtsij

I remove "package/kernel/mac80211/patches/311-ath9k-do-not-access-hardware-on-IRQs-during-reset.patch" and rebuild based on r43942.
I can't this issue anymore.

 40:   18344814  AR71XX PCI  ath9k
 41:   11020420  AR71XX PCI  ath9k

comment:38 follow-up: Changed 3 years ago by anonymous

Bluse, are you able to push your patch to the devs?

comment:39 in reply to: ↑ 38 Changed 3 years ago by bluse

I chatted with nbd about our current findings/status this morning.
He is planning to develop a proper fix in the next days. I am not confident with changes to the irq top half handler routine.

ps: Sadly badly we can not trigger the irq error with the diag register with a single router on our desk. It just works reliably fast in my mesh network, where all routers have multiple neighbours and different service loads. Any other way to trigger the error with a single node on the desk ?

comment:40 Changed 3 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

fixed in r43971

comment:41 Changed 3 years ago by papaj0e

I have been running trunk r43972 for more than 3 days now and it is rock-solid under heavy wireless load - no more disabled IRQ 40 - thanks to @nbd for the fix!

root@OpenWrt:/tmp# uptime 
 23:35:15 up 3 days, 18 min,  load average: 0.09, 0.14, 0.35
root@OpenWrt:/tmp# cat /proc/interrupts 
           CPU0       
  3:          0      MIPS  ehci_hcd:usb1
  4:   65176087      MIPS  eth0
  5:  222361507      MIPS  eth1
  7:   28045126      MIPS  timer
 11:         38      MISC  serial
 14:          0      MISC  ohci_hcd:usb2
 40:  175248819  AR71XX PCI  ath9k
 41:  147324226  AR71XX PCI  ath9k
ERR:          0

comment:42 Changed 3 years ago by bittorf@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

still happening with r45686 on 'Ubiquiti Bullet M5'

[   28.530000] irq 40: nobody cared (try booting with the "irqpoll" option)
[   28.530000] CPU: 0 PID: 1865 Comm: kmodloader Not tainted 3.18.11 #1
[   28.530000] Stack : 00000000 00000000 00000000 00000000 803b4272 00000038 00000000 00000028
[   28.530000] 	  00000000 80365a84 803046f4 8035a9e3 00000749 803b349c 818fb2b8 80365a84
[   28.530000] 	  80360000 80307e90 80307ea4 800a3400 00000006 80080344 00000000 00000000
[   28.530000] 	  80307cec 81bab86c 00000000 00000000 00000000 00000000 00000000 00000000
[   28.530000] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   28.530000] 	  ...
[   28.530000] Call Trace:
[   28.530000] [<80070fd8>] show_stack+0x48/0x70
[   28.530000] [<800a6794>] __report_bad_irq.isra.7+0x44/0xf8
[   28.530000] [<800a6adc>] note_interrupt+0x224/0x2d8
[   28.530000] [<800a4e2c>] handle_irq_event_percpu+0x154/0x188
[   28.530000] [<800a4e9c>] handle_irq_event+0x3c/0x60
[   28.530000] [<800a75e0>] handle_level_irq+0xcc/0x110
[   28.530000] [<800a46b0>] generic_handle_irq+0x28/0x44
[   28.530000] [<800a46b0>] generic_handle_irq+0x28/0x44
[   28.530000] [<8006df54>] do_IRQ+0x1c/0x2c
[   28.530000] [<80060830>] ret_from_irq+0x0/0x4
[   28.530000] [<800828e0>] __do_softirq+0x8c/0x230
[   28.530000] [<80082cb0>] irq_exit+0x54/0x70
[   28.530000] [<80060830>] ret_from_irq+0x0/0x4
[   28.530000] [<800a60ec>] __setup_irq+0x420/0x530
[   28.530000] [<800a6394>] request_threaded_irq+0xf4/0x154
[   28.530000] [<8096ded8>] ath_chanctx_set_channel+0x5f4/0x820 [ath9k]
[   28.530000]
[   28.530000] handlers:
[   28.530000] [<80963fe0>] ath_isr [ath9k]
[   28.530000] Disabling IRQ #40
[   28.680000] ath: EEPROM regdomain: 0x0
[   28.680000] ath: EEPROM indicates default country code should be used
[   28.680000] ath: doing EEPROM country->regdmn map search
[   28.680000] ath: country maps to regdmn code: 0x3a
[   28.680000] ath: Country alpha2 being used: US
[   28.680000] ath: Regpair used: 0x3a
[   28.690000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   28.700000] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
[   28.700000] cfg80211: Calling CRDA for country: US
[   28.720000] cfg80211: Regulatory domain changed to country: US
[   28.720000] cfg80211:  DFS Master region: unset
[   28.730000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   28.740000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   28.740000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   31.680000] zram0: detected capacity change from 0 to 14680064
[   31.700000] Adding 14332k swap on /dev/zram0.  Priority:0 extents:1 across:14332k SS
[   37.300000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   37.700000] eth0: link up (100Mbps/Full duplex)
[   37.700000] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   39.700000] eth0: link down
[   40.700000] eth0: link up (100Mbps/Full duplex)

Add Comment

Modify Ticket

Action
as reopened .
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.