
Opened 4 years ago

Last modified 3 years ago

#14864 reopened defect

procd crash / r39307

Reported by: bittorf@…
Owned by: developers
Priority: normal
Milestone: Barrier Breaker 14.07
Component: packages
Version: Trunk
Keywords: procd crash
Cc:

Description

For a long time now I have seen procd crashes from time to time.
We monitor why nodes restart and how often; if there is nothing
in the syslog/crashlog, we record the reason as "unknown".
Digging deeper, the (custom) serial log shows:

up: 10699.03 load: 0.10 rest: 0.30 0.39 2/38 9958
EG-labor-AP
up: 10759.89 load: 0.04 rest: 0.25 0.36 2/38 10183
EG-labor-AP
up: 10819.32 load: 0.17 rest: 0.27 0.37 2/38 10413
EG-labor-AP
up: 10999.76 load: 0.57 rest: 0.58 0.49 2/38 11723
EG-labor-AP
up: 11059.76 load: 0.34 rest: 0.53 0.48 2/38 11952
EG-labor-AP
up: 11120.12 load: 0.33 rest: 0.52 0.48 2/38 12177
EG-labor-AP
up: 11180.29 load: 0.26 rest: 0.48 0.47 2/38 12404
EG-labor-AP
up: 11240.62 load: 0.23 rest: 0.45 0.46 2/38 12631
EG-labor-AP
up: 11299.66 load: 0.18 rest: 0.41 0.45 2/38 12859
EG-labor-AP
up: 11359.30 load: 0.24 rest: 0.41 0.45 1/39 13085
EG-labor-AP
up: 11419.30 load: 0.09 rest: 0.33 0.42 1/39 13310
EG-labor-AP
up: 11479.31 load: 0.03 rest: 0.27 0.39 2/39 13541
EG-labor-AP
up: 11539.31 load: 0.01 rest: 0.22 0.37 2/39 13766
EG-labor-AP
up: 11599.31 load: 0.00 rest: 0.18 0.35 2/38 13993
EG-labor-AP
up: 11659.31 load: 0.00 rest: 0.15 0.32 2/36 14220
EG-labor-AP
up: 11719.31 load: 0.00 rest: 0.12 0.30 1/36 14447
procd: Rebooting as procd has crashed
procd: reboot
[11733.320000] ath79_wdt: device closed unexpectedly, watchdog timer will not stop!
[11733.350000] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[11733.350000]
[11733.350000] Rebooting in 10 seconds..
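For what it is worth, the exit code in the panic line points to a segmentation fault: when a process dies from a signal, the kernel reports the signal number as its exit code, and 0x0b is 11, i.e. SIGSEGV. A quick sanity check (works in busybox ash and bash):

kill -l 11    # prints SEGV: exitcode 0x0b == signal 11 == SIGSEGV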

The monitoring lines above are produced by:

echo $HOSTNAME
cat /proc/uptime /proc/loadavg

I will deliver more output with procd-debug '4'.
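For reference, the "up: ... load: ... rest: ..." lines in the serial log come from a small monitor loop around those two commands. A minimal sketch (the real script is not attached to this ticket, so the 60-second interval and variable names are assumptions inferred from the timestamps above):

# print uptime, load average and hostname to the serial console once a minute
while true; do
    read UP IDLE < /proc/uptime
    read LOAD1 REST < /proc/loadavg
    echo "up: $UP load: $LOAD1 rest: $REST"
    echo "$HOSTNAME"
    sleep 60
done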


Change History (9)

comment:1 Changed 4 years ago by bittorf@…

Here it is again with r39404 and procd-debug '4' (the last lines only).
The SquashFS errors are, IMHO, due to a missing or forced unmount.

procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
up: 17144.50 load: 0.02 rest: 0.07 0.13 2/42 19600
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
EG-labor-AP
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
up: 17204.50 load: 0.01 rest: 0.06 0.13 2/42 19645
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: watchdog_timeout_cb(37): Ping
procd: Rebooting as procd has crashed
procd: reboot
[17267.120000] Removing MTD device #3 (rootfs_data) with use count 1
[17267.150000] end_request: I/O error, dev mtdblock2, sector 1872
[17267.150000] end_request: I/O error, dev mtdblock2, sector 1874
[17267.160000] end_request: I/O error, dev mtdblock2, sector 1876
[17267.160000] end_request: I/O error, dev mtdblock2, sector 1878
[17267.170000] end_request: I/O error, dev mtdblock2, sector 1880
[17267.180000] end_request: I/O error, dev mtdblock2, sector 1882
[17267.180000] end_request: I/O error, dev mtdblock2, sector 1884
[17267.190000] end_request: I/O error, dev mtdblock2, sector 1886
[17267.190000] end_request: I/O error, dev mtdblock2, sector 1888
[17267.200000] end_request: I/O error, dev mtdblock2, sector 1890
[17267.210000] SQUASHFS error: squashfs_read_data failed to read block 0xe80fe
[17267.220000] SQUASHFS error: Unable to read fragment cache entry [e80fe]
[17267.220000] SQUASHFS error: Unable to read page, block e80fe, size 111e4
[17267.230000] SQUASHFS error: Unable to read fragment cache entry [e80fe]
[17267.240000] SQUASHFS error: Unable to read page, block e80fe, size 111e4
[17267.240000] SQUASHFS error: Unable to read fragment cache entry [e80fe]
[17267.250000] SQUASHFS error: Unable to read page, block e80fe, size 111e4
[17267.260000] SQUASHFS error: Unable to read fragment cache entry [e80fe]
[17267.260000] SQUASHFS error: Unable to read page, block e80fe, size 111e4

comment:2 Changed 4 years ago by nbd

Does this still happen with current trunk?

comment:3 Changed 4 years ago by nbd

  • Resolution set to no_response
  • Status changed from new to closed

comment:4 Changed 4 years ago by bittorf@…

I have seen no more crashes from procd in our testnet - thank you!

comment:5 Changed 4 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted

comment:6 Changed 3 years ago by bittorf@…

  • Resolution no_response deleted
  • Status changed from closed to reopened

This still happens with r45621.

How can this be debugged further without a serial console?

comment:7 Changed 3 years ago by nbd

You could disable the hardware watchdog and attach gdbserver to PID 1, then connect using remote-gdb.
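For the record, a rough sketch of that procedure (assumptions: the gdbserver package is installed on the router, the router is reachable at 192.168.1.1, the port number 9000 is arbitrary, and this procd build accepts the ubus watchdog "stop" argument):

# on the router: stop procd from feeding the hardware watchdog first,
# so the board is not reset while PID 1 sits stopped under the debugger
ubus call system watchdog '{ "stop": true }'

# attach gdbserver to init (PID 1) and listen on TCP port 9000
gdbserver --attach :9000 1

# on the build host: point the toolchain's cross gdb at the unstripped
# procd binary (exact path depends on the build tree), then inside gdb:
#   target remote 192.168.1.1:9000
# OpenWrt's scripts/remote-gdb helper wraps the same steps.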

comment:8 Changed 3 years ago by nbd

Any news?

comment:9 Changed 3 years ago by bittorf@…

I need more time for debugging.
