Modify

Opened 3 years ago

Closed 3 years ago

#19564 closed defect (invalid)

after reboot - reset to defaults / firstboot by accident

Reported by: bittorf@… Owned by: developers
Priority: highest Milestone:
Component: packages Version: Trunk
Keywords: Cc:

Description

with r45568 and before (maybe introduced with r44942)
i can see (with ar71xx and mpc85xx), that sometimes
after a normal reboot the router is dumb and has all
values set to defaults - so it boots up with telnet and has no password / wifi is off.

i mark this as prio highest, because it's mission critical
in a mesh network and of course also not good from a security view 8-)

Attachments (0)

Change History (50)

comment:1 follow-up: Changed 3 years ago by guenti

on mpc85xx (WDR4900v1) (r45581) no such issue.
maybe you have not enough space available?

comment:2 in reply to: ↑ 1 Changed 3 years ago by bittorf@…

Replying to guenti:

on mpc85xx (WDR4900v1) (r45581) no such issue.
maybe you have not enough space available?

thanks for feedback:
there is enough space free (>3 megabyte) and it only happens seldom.
i'am working on a setup, which makes it reproduceable.

comment:3 Changed 3 years ago by ufo@…

we also saw that behaviour a few times within our mesh (cc-trunk with ar71)
our first idea was, that some process on the device fills the flash (maybe our asterisk)

and then the device may does a firstboot? but, that also would be new for me ( a feature?)

comment:4 Changed 3 years ago by anonymous

I have this problem too. It sometimes happened when I was testing asterisk before reboot.

comment:5 Changed 3 years ago by bittorf@…

ok, can reproduce this now with r45588

root@box:~ dd if=/dev/urandom of=/www/test.bin bs=16k count=99999
dd: writing '/www/test.bin': No space left on device
178+0 records in
176+1 records out

root@box:~ df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.6M      3.6M     60.0K  98% /
/dev/root                 3.1M      3.1M         0 100% /rom
tmpfs                    61.6M      3.1M     58.5M   5% /tmp
/dev/mtdblock3            3.6M      3.6M     60.0K  98% /overlay
overlayfs:/overlay        3.6M      3.6M     60.0K  98% /
tmpfs                   512.0K         0    512.0K   0% /dev

root@box:~ reboot && exit

after reboot the box comes up "dumb" - all config is lost 8-(((

comment:6 Changed 3 years ago by guenti

this is a normal behaviour when you fill up the flash.

comment:7 Changed 3 years ago by bittorf@…

i have never seen this behavior before. even if this is 'normal', it should be fixed - this is nothing a user can be aware of. also i just tried it with r43239 and it does not happen there. everything is fine after a reboot with full flash.

comment:8 Changed 3 years ago by bittorf@…

when it happens, i can see this during bootup on serial console:

[    2.130000] Freeing unused kernel memory: 244K (80373000 - 803b0000)
[    2.850000] init: failed to symlink /tmp -> /var
[    2.860000] init: Console is alive
[    2.860000] init: - watchdog -
[    4.870000] init: - preinit -
[    5.210000] random: procd urandom read with 8 bits of entropy available
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
[    6.790000] eth0: link up (1000Mbps/Full duplex)
[    8.420000] jffs2: notice: (311) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (1 unchecked, 0 orphan) and 1 of xref (0 dead, 
0 orphan) found.
[    8.440000] mount_root: switching to jffs2 overlay
[    9.790000] overlayfs: failed to create directory /overlay/work/work
[    9.800000] mount_root: mount failed: No such device, options lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work
[    9.810000] mount_root: switching to jffs2 failed - fallback to ramoverlay
6+0 records in
6+0 records out
2104+0 records in
2104+0 records out
[    9.930000] eth0: link down
[    9.940000] procd: - early -
[    9.940000] procd: - watchdog -
[   10.570000] procd: - ubus -
[   11.580000] procd: - init -
Please press Enter to activate this console.
[   12.710000] NET: Registered protocol family 10

comment:9 Changed 3 years ago by bittorf@…

when this happens i can see the "real" filesystem under

/overlay/upper/www/test.bin

when i delete the file and bootup, everything is good again.

comment:10 follow-up: Changed 3 years ago by hnyman

It sounds like the problem comes from the rather new requirement in overlayfs v23 and above that there is "upperdir" and "workdir". They are created at mount. If flash is 100% full, that new directory can not be created and overlayfs fails to mount.

Change in fstools in October 2014:
http://nbd.name/gitweb.cgi?p=fstools.git;a=commitdiff;h=d71297353dc45eaf8f7c252246490746708530f9;hp=04f863c7d831a18d388db980cc85b2bac37b3c8d

(but that should have been in trunk fstools already since 42890, so it may be a combination of some additional changes)

Deleting the one leftover file from the previous session frees one block and at the next boot the directory can be created, and then the overlayfs filesystem mounts ok again.

For a fix, I ponder if 1-2 blocks should be reserved in jffs2 under normal usage for overlayfs workdirs, so that there would always be room for creating that workdir at boot.

comment:11 in reply to: ↑ 10 Changed 3 years ago by bittorf@…

Replying to hnyman:

It sounds like the problem comes from the rather new requirement in overlayfs v23 and above that there is "upperdir" and "workdir". They are created at mount. If flash is 100% full, that new directory can not be created and overlayfs fails to mount.

"Overlay FS v23 and later requires both a upper and
a work directory, both on the same filesystem, but
not part of the same subtree."

i wonder, if this can be within the squashfs? e.g:

/overlay/upper
/overlay/work

also:
if these dir's are created in the jffs2-space, why are they lost?
if they are not created in jffs2 (and are just part of the virtual fs),
why they need space?

comment:12 Changed 3 years ago by hnyman

"upper" is the overlaying r/w part, so I think that it needs to be in the writable jffs2 part. And "work" needs to be in the same filesystem.

Your two questions are nice ;-)

Some thoughts:

  • Log contains three error messages. Two are from Openwrt fstools, but the deepest one is from overlayfs itself. The error is not actually the missing /overlay/work, but that subdirectory /overlay/work/work can't be created.
  • The discussion at Linux file systems discussion group indicates that similar error may surface when the upper filesystem that should be r/w is actually r/o. http://thread.gmane.org/gmane.linux.file-systems/91881/focus=175
  • As far as I understand, overlayfs may return that error if the filesystem is read-only or creating a real directory inside the working dir fails.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/overlayfs/super.c#n925
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/overlayfs/super.c#n635
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/overlayfs/dir.c#n72

  • Is it possible that the filesystem is somehow recognised as r/o if there are no free blocks at boot? Or something similar?
  • Linux 3.18.x introduced multi-level lower layers in overlayfs, and that may be the change that has triggered this behaviour in Openwrt in the last few weeks.
  • The changes done between Linux 3.18.11 and .12 may clarify the situation, as the overlayfs error messages have been tweaked a bit in March. But .12 has not yet been imported into Openwrt, so we are still running .11 (if at 3.18 like ar71xx). Commits below:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/fs/overlayfs
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/fs/overlayfs?id=71cbad7e694ee81233b3be3a38b81c3d5872cc6f

In any case, this sounds like something coming from rather deep in Linux. It might not be directly related to Openwrt, but might be a "feature" in Linux overlayfs itself :-(

comment:13 follow-up: Changed 3 years ago by bittorf@…

i can see this behaviour also without full flash, this is the relevant klog
directly after a manual reboot 8-( (only telnet is running). the only thing which is maybe relevant: i replaced some small scripts (mv /tmp/myscript /www/myscript && reboot)

[    5.260000] ar71xx: pll_reg 0xb8050014: 0x1a000000
[    5.260000] eth0: link up (1000Mbps/Full duplex)
[    5.290000] random: procd urandom read with 8 bits of entropy available
[    8.480000] mount_root: loading kmods from internal overlay
[    8.780000] jffs2: notice: (346) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 
of xref (1 dead, 0 orphan) found.
[    8.800000] block: attempting to load /tmp/jffs_cfg/upper/etc/config/fstab
[    8.810000] block: extroot: not configured
[    8.870000] jffs2: notice: (342) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 
of xref (1 dead, 0 orphan) found.
[    8.890000] mount_root: loading kmods from internal overlay
[    9.070000] block: attempting to load /tmp/jffs_cfg/upper/etc/config/fstab
[    9.080000] block: extroot: not configured
[    9.080000] mount_root: overlay filesystem has not been fully initialized yet
[   10.680000] mount_root: switching to jffs2 overlay
[   10.980000] eth0: link down
[   10.990000] procd: - early -
[   10.990000] procd: - watchdog -
[   11.670000] procd: - ubus -
[   12.690000] procd: - init -
root@OpenWrt:/# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.6M    300.0K      3.3M   8% /
/dev/root                 3.1M      3.1M         0 100% /rom
tmpfs                    14.0M    732.0K     13.3M   5% /tmp
/dev/mtdblock3            3.6M    300.0K      3.3M   8% /overlay
overlayfs:/overlay        3.6M    300.0K      3.3M   8% /

root@OpenWrt:/# ls -l /overlay/
drwxr-xr-x    4 root     root             0 Jan  1  1970 upper
drwxr-xr-x    3 root     root             0 Jan  1  1970 work

root@OpenWrt:/# ls -l /overlay/work/
d---------    2 root     root             0 Mar 23 06:20 work

root@OpenWrt:/# ls -l /overlay/work/work/
root@OpenWrt:/# 

comment:14 in reply to: ↑ 13 Changed 3 years ago by bmrmorph

Replying to bittorf@…:

i can see this behaviour also without full flash, this is the relevant klog
directly after a manual reboot 8-( (only telnet is running). the only thing which is maybe relevant: i replaced some small scripts (mv /tmp/myscript /www/myscript && reboot)

Same behavior here.
I've dit an "mtd -r erase rootfs_data" for an "fresh" jffs2.
After that i copy my config back (/etc/config), did an "sync" and did a cupple of reboots.
After some cycles i had an factory default openwrt.

The log show's nothing special, the only difference i detect is, tha the log line:

[    8.210000] jffs2: notice: (314) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 0 orphan) and 1 of xref (0 dead, 0 orphan) found.

changed to:

[    7.930000] jffs2: notice: (327) jffs2_build_xattr_subsystem: complete building xattr subsystem, 3 of xdatum (0 unchecked, 1 orphan) and 3 of xref (1 dead, 0 orphan) found.

So the orphan and dead counter rises.
I googled a little bit but can not find something to explain the couters.
Maybe the filesystem got corrupted?

comment:15 Changed 3 years ago by bittorf@…

here is a working patch for the 'full'-jffs2/overlay-not-working problem:
https://www.mail-archive.com/linux-unionfs@vger.kernel.org/msg00246.html

this still does not explain, why it can happen without a full flash.
this should be go to another ticket.

comment:16 Changed 3 years ago by bittorf@…

here another box, where flash does not seem full - but config could not be loaded (comes up with telnet and 192.168.1.1):

[    7.960000] eth0: link up (1000Mbps/Full duplex)
[    9.790000] jffs2: notice: (323) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 of xref (1 dead, 0 orphan) found.
[   10.110000] mount_root: overlay filesystem has not been fully initialized yet
[   10.320000] mount_root: switching to jffs2 overlay
root@OpenWrt:/# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                  576.0K    252.0K    324.0K  44% /
/dev/root                 2.3M      2.3M         0 100% /rom
tmpfs                    14.0M    764.0K     13.3M   5% /tmp
/dev/mtdblock3          576.0K    252.0K    324.0K  44% /overlay
overlayfs:/overlay      576.0K    252.0K    324.0K  44% /
tmpfs                   512.0K         0    512.0K   0% /dev

root@OpenWrt:/# mount
rootfs on / type rootfs (rw)
/dev/root on /rom type squashfs (ro,relatime)
proc on /proc type proc (rw,noatime)
sysfs on /sys type sysfs (rw,noatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noatime)
/dev/mtdblock3 on /overlay type jffs2 (rw,noatime)
overlayfs:/overlay on / type overlay (rw,noatime,lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work)
tmpfs on /dev type tmpfs (rw,relatime,size=512k,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600)
debugfs on /sys/kernel/debug type debugfs (rw,noatime)

comment:17 Changed 3 years ago by christian@…

I do have the same problem in CC. Unconditional loss of overlay. Can't reproduce with a script that reboots the device after 60 sec.

Seems that this occurs in ~2% of all reboots after heavy writing. In my setup the overlay is ~80% full but never 100%.

I have this behavior with kernel 3.14, too.

comment:18 Changed 3 years ago by bittorf@…

again with a "TP-LINK TL-WR1043ND" and r45455.

disc free was: 2588 Kilobytes and after a normal reboot everything is lost/fresh:
here dmesg and some more. I can not see anything obvious.

[    0.000000] Linux version 3.18.11 (automat@bittorf-wireless.de) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r45453) ) #2 Thu Apr 16 08:20:44 CEST 2015
[    0.000000] MyLoader: sysp=aaaaaaaa, boardp=abbaaa2a, parts=abaaaaa2
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 00019374 (MIPS 24Kc)
[    0.000000] SoC: Atheros AR9132 rev 2
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 02000000 @ 00000000 (usable)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x00000000-0x01ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00000000-0x01ffffff]
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x01ffffff]
[    0.000000] On node 0 totalpages: 8192
[    0.000000] free_area_init_node: node 0, pgdat 8035a0f0, node_mem_map 81000000
[    0.000000]   Normal zone: 64 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 8192 pages, LIFO batch:0
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8128
[    0.000000] Kernel command line:  board=TL-WR1043ND  oops=panic panic=10  console=ttyS0,115200 rootfstype=squashfs,jffs2 noinitrd
[    0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
[    0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
[    0.000000] Writing ErrCtl register=00000000
[    0.000000] Readback ErrCtl register=00000000
[    0.000000] Memory: 28456K/32768K available (2485K kernel code, 125K rwdata, 528K rodata, 244K init, 188K bss, 4312K reserved)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS:51
[    0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, AHB:200.000MHz, Ref:5.000MHz
[    0.000000] Calibrating delay loop... 265.42 BogoMIPS (lpj=1327104)
[    0.080000] pid_max: default: 32768 minimum: 301
[    0.080000] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.090000] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.100000] NET: Registered protocol family 16
[    0.100000] MIPS: machine is TP-LINK TL-WR1043ND
[    0.390000] Switched to clocksource MIPS
[    0.390000] NET: Registered protocol family 2
[    0.400000] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[    0.400000] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    0.410000] TCP: Hash tables configured (established 1024 bind 1024)
[    0.420000] TCP: reno registered
[    0.420000] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.420000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.430000] NET: Registered protocol family 1
[    0.440000] PCI: CLS 0 bytes, default 32
[    0.440000] futex hash table entries: 256 (order: -1, 3072 bytes)
[    0.450000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.460000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.470000] msgmni has been set to 55
[    0.480000] io scheduler noop registered
[    0.490000] io scheduler deadline registered (default)
[    0.490000] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    0.500000] console [ttyS0] disabled
[    0.520000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12500000) is a 16550A
[    0.530000] console [ttyS0] enabled
[    0.540000] bootconsole [early0] disabled
[    0.550000] m25p80 spi0.0: found s25sl064p, expected m25p80
[    0.560000] m25p80 spi0.0: s25sl064p (8192 Kbytes)
[    0.560000] 5 tp-link partitions found on MTD device spi0.0
[    0.570000] Creating 5 MTD partitions on "spi0.0":
[    0.570000] 0x000000000000-0x000000020000 : "u-boot"
[    0.580000] 0x000000020000-0x000000137e00 : "kernel"
[    0.580000] mtd: partition "kernel" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    0.600000] 0x000000137e00-0x0000007f0000 : "rootfs"
[    0.600000] mtd: partition "rootfs" must either start or end on erase block boundary or be smaller than an erase block -- forcing read-only
[    0.620000] mtd: device 2 (rootfs) set to be root filesystem
[    0.620000] 1 squashfs-split partitions found on MTD device rootfs
[    0.630000] 0x000000450000-0x0000007f0000 : "rootfs_data"
[    0.640000] 0x0000007f0000-0x000000800000 : "art"
[    0.640000] 0x000000020000-0x0000007f0000 : "firmware"
[    0.650000] Realtek RTL8366RB ethernet switch driver version 0.2.4
[    0.770000] rtl8366rb rtl8366rb: using GPIO pins 18 (SDA) and 19 (SCK)
[    0.770000] rtl8366rb rtl8366rb: RTL5937 ver. 3 chip found
[    0.960000] libphy: rtl8366rb: probed
[    1.280000] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode:RGMII
[    1.280000] TCP: cubic registered
[    1.290000] NET: Registered protocol family 17
[    1.290000] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
[    1.310000] 8021q: 802.1Q VLAN Support v1.8
[    1.320000] VFS: Mounted root (squashfs filesystem) readonly on device 31:2.
[    1.330000] Freeing unused kernel memory: 244K (80373000 - 803b0000)
[    2.250000] init: failed to symlink /tmp -> /var
[    2.250000] init: Console is alive
[    2.260000] init: - watchdog -
[    3.760000] usbcore: registered new interface driver usbfs
[    3.770000] usbcore: registered new interface driver hub
[    3.770000] usbcore: registered new device driver usb
[    3.780000] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    3.790000] ehci-platform: EHCI generic platform driver
[    3.800000] ehci-platform ehci-platform: EHCI Host Controller
[    3.800000] ehci-platform ehci-platform: new USB bus registered, assigned bus number 1
[    3.810000] ehci-platform ehci-platform: irq 3, io mem 0x1b000000
[    3.840000] ehci-platform ehci-platform: USB 2.0 started, EHCI 1.00
[    3.840000] hub 1-0:1.0: USB hub found
[    3.850000] hub 1-0:1.0: 1 port detected
[    4.170000] usb 1-1: new high-speed USB device number 2 using ehci-platform
[    4.270000] init: - preinit -
[    4.480000] random: procd urandom read with 7 bits of entropy available
[    4.800000] ar71xx: pll_reg 0xb8050014: 0x1a000000
[    4.800000] eth0: link up (1000Mbps/Full duplex)
[    8.080000] jffs2: notice: (336) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 of xref (1 dead, 0 orphan) found.
[    8.100000] mount_root: overlay filesystem has not been fully initialized yet
[    9.080000] mount_root: switching to jffs2 overlay
[    9.380000] eth0: link down
[    9.400000] procd: - early -
[    9.400000] procd: - watchdog -
[   10.060000] procd: - ubus -
[   11.080000] procd: - init -
[   13.220000] NET: Registered protocol family 10
[   13.250000] zram: Created 1 device(s) ...
[   13.260000] tun: Universal TUN/TAP device driver, 1.6
[   13.270000] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
[   13.290000] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   13.340000] u32 classifier
[   13.340000]     input device check on
[   13.350000]     Actions configured
[   13.350000] Mirror/redirect action on
[   13.490000] Loading modules backported from Linux version master-2015-03-09-0-g141f155
[   13.490000] Backport generated by backports.git backports-20150129-0-gdd4a670
[   13.500000] ip_tables: (C) 2000-2006 Netfilter Core Team
[   13.540000] nf_conntrack version 0.5.0 (448 buckets, 1792 max)
[   13.680000] xt_time: kernel timezone is -0000
[   13.730000] cfg80211: Calling CRDA to update world regulatory domain
[   13.760000] cfg80211: World regulatory domain updated:
[   13.760000] cfg80211:  DFS Master region: unset
[   13.760000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   13.770000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   13.780000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   14.080000] PPP generic driver version 2.4.2
[   14.080000] NET: Registered protocol family 24
[   14.220000] ath: EEPROM regdomain: 0x0
[   14.220000] ath: EEPROM indicates default country code should be used
[   14.220000] ath: doing EEPROM country->regdmn map search
[   14.220000] ath: country maps to regdmn code: 0x3a
[   14.220000] ath: Country alpha2 being used: US
[   14.220000] ath: Regpair used: 0x3a
[   14.240000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   14.250000] ieee80211 phy0: Atheros AR9100 MAC/BB Rev:7 AR2133 RF Rev:a2 mem=0xb80c0000, irq=2
[   14.260000] cfg80211: Calling CRDA for country: US
[   14.260000] cfg80211: Regulatory domain changed to country: US
[   14.270000] cfg80211:  DFS Master region: unset
[   14.280000] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
[   14.280000] cfg80211:   (2400000 KHz - 2494000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   14.290000] cfg80211:   (4910000 KHz - 5895000 KHz @ 40000 KHz), (N/A, 3000 mBm), (N/A)
[   18.360000] zram0: detected capacity change from 0 to 14680064
[   18.380000] Adding 14332k swap on /dev/zram0.  Priority:0 extents:1 across:14332k SS
[   24.960000] ar71xx: pll_reg 0xb8050014: 0x1a000000
[   24.980000] eth0: link up (1000Mbps/Full duplex)
[   25.060000] br-lan: port 1(eth0.1) entered forwarding state
[   25.060000] br-lan: port 1(eth0.1) entered forwarding state
[   27.060000] br-lan: port 1(eth0.1) entered forwarding state
[   30.060000] random: nonblocking pool is initialized
root@OpenWrt:/# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.6M    312.0K      3.3M   8% /
/dev/root                 3.1M      3.1M         0 100% /rom
tmpfs                    14.0M    724.0K     13.3M   5% /tmp
/dev/mtdblock3            3.6M    312.0K      3.3M   8% /overlay
overlayfs:/overlay        3.6M    312.0K      3.3M   8% /
tmpfs                   512.0K         0    512.0K   0% /dev

root@OpenWrt:/# cat /proc/mtd 
dev:    size   erasesize  name
mtd0: 00020000 00010000 "u-boot"
mtd1: 00117e00 00010000 "kernel"
mtd2: 006b8200 00010000 "rootfs"
mtd3: 003a0000 00010000 "rootfs_data"
mtd4: 00010000 00010000 "art"
mtd5: 007d0000 00010000 "firmware"

comment:19 Changed 3 years ago by bittorf@…

i just checked: the last change to JFFS2 was
upstream 164c24063a3eadee11b46575c5482b2f1417be49
"jffs2: fix handling of corrupted summary length"
and got applied to 3.10.71 and 3.18.9 (7th of march).

because we have reports, that loss of config has also been
seen with x86 i'am unsure if this is related...

comment:20 Changed 3 years ago by bittorf@…

i just checked: the last change to JFFS2 was
upstream 164c24063a3eadee11b46575c5482b2f1417be49
"jffs2: fix handling of corrupted summary length"
and got applied to 3.10.71 and 3.18.9 (7th of march).

because we have reports, that loss of config has also been
seen with x86 i'am unsure if this is related...

comment:21 Changed 3 years ago by bittorf@…

again on r45711 - dmesg directly after reboot on again "dumb" device:

[    3.810000] hub 1-0:1.0: USB hub found
[    3.820000] hub 1-0:1.0: 1 port detected
[    4.140000] usb 1-1: new high-speed USB device number 2 using ehci-platform
[    4.240000] init: - preinit -
[    4.450000] random: procd urandom read with 7 bits of entropy available
[    4.710000] ar71xx: pll_reg 0xb8050014: 0x1a000000
[    4.710000] eth0: link up (1000Mbps/Full duplex)
[    7.970000] jffs2: notice: (335) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unc
hecked, 1 orphan) and 1 of xref (1 dead, 0 orphan) found.
[    7.990000] mount_root: overlay filesystem has not been fully initialized yet
[    9.050000] mount_root: switching to jffs2 overlay
[    9.130000] eth0: link down
[    9.140000] procd: - early -

see: "overlay filesystem has not been fully initialized yet"
this message comes from "libfstools/overlay.c" and it
seems there is a wrong detection in line 334++

fs_state_get("/tmp/overlay") returns
"FS_STATE_UNKNOWN" and sets "FS_STATE_PENDING" so it fires 'overlay_delete'.

maybe JFFS2 needs some time and we query too early?

comment:22 Changed 3 years ago by christian@…

I can reproduce the "loss" behavior with

I=1000; while let I-=1; do echo "foo" > /www/file$I; done && rm -rf /www/file* && reboot

on my system with ar71xx.

Steps to reproduce:
1) install openwrt trunk
2) wait for early boot to finish
3) run the command
4) after reboot you are back in early boot (in case overlay is ok -> goto 3)

Those steps are reproducing the problem in 30-50% of the retries.

THIS IS A SERIOUS ISSUE IN Chaos Calmer.

I think one should search in fstools at commits newer

0b99adb02f2eb822fbfc4efcb8ebf5fecbd74974

2015-03-22 14:08 Felix Fietkau o overlay: create a common function for deleting all overlay data (optionally excluding sysupgrade.tgz

seems to be a candidate for such an error.

comment:23 Changed 3 years ago by jow

Can you please try the following patch?

  • libfstools/overlay.c

    diff --git a/libfstools/overlay.c b/libfstools/overlay.c
    index 0677c34..acac3db 100644
    a b jffs2_switch(struct volume *v) 
    255255       if (ret) 
    256256               return ret; 
    257257 
    258        sync(); 
    259258       fs_state_set("/overlay", FS_STATE_READY); 
     259       sync(); 
    260260       return 0; 
    261261} 
    262262 

comment:24 Changed 3 years ago by christian@…

@jow

not fixed with your patch, the 4th reboot gave me:

[    8.390000] jffs2: notice: (373) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 of xref (1 dead, 0 orphan) found.
[    8.400000] mount_root: overlay filesystem has not been fully initialized yet
[    8.560000] mount_root: switching to jffs2 overlay

I'm testing 45712 with

target/linux/ar71xx/patches-3.18/910-overlayfs-fallback-to-readonly-when-full.patch

and

the patch from @jow

comment:25 Changed 3 years ago by jow

When you issued the reboot, was /etc/init.d/done completed already?

comment:26 Changed 3 years ago by christian@…

@jow done script was completed for the test.

My code is in any case running after done has completed. Testcases and production logic.

I'm using a higher runlevel:

done uses 95
my custom stuff is using 96+.

comment:27 Changed 3 years ago by jow

Okay. So for now it looks like the xattr which tracks the completion state gets corrupted in jffs2's "/" leading the system to believe that the overlay needs to cleared and rebuilt again on next boot.

To gain additional clues it would make sense to add a little bit more tracking around the xattr handling, like that:

  • libfstools/overlay.c

    diff --git a/libfstools/overlay.c b/libfstools/overlay.c
    index 0677c34..b6c2bd2 100644
    a b static bool keep_sysupgrade; 
    3838static ssize_t 
    3939fs_getxattr(const char *path, const char *name, void *value, size_t size) 
    4040{ 
    41        return syscall(__NR_getxattr, path, name, value, size); 
     41       ssize_t rv = syscall(__NR_getxattr, path, name, value, size); 
     42       ULOG_INFO("fs_getxattr(%s) = %d (val=%u, errno=%d)\n", name, rv, *(uint32_t *)value, errno); 
     43       return rv; 
    4244} 
    4345 
    4446static ssize_t 
    4547fs_setxattr(const char *path, const char *name, const void *value, size_t size, int flags) 
    4648{ 
    47        return syscall(__NR_setxattr, path, name, value, size, flags); 
     49       ssize_t rv = syscall(__NR_setxattr, path, name, value, size, flags); 
     50       ULOG_INFO("fs_setxattr(%s, %u) = %d (errno=%d)\n", name, *(uint32_t *)value, rv, errno); 
     51       return rv; 
    4852} 
    4953 
    5054static int 

comment:28 Changed 3 years ago by christian@…

regular:

[    8.120000] jffs2: notice: (373) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (1 unchecked, 0 orphan) and 1 of xref (0 dead, 0 orphan) found.
[    8.170000] mount_root: fs_getxattr(user.fs_state) = 4 (val=2, errno=2)
[    8.170000] mount_root: switching to jffs2 overlay

on fail:

[    8.460000] jffs2: notice: (373) jffs2_build_xattr_subsystem: complete building xattr subsystem, 1 of xdatum (0 unchecked, 1 orphan) and 1 of xref (1 dead, 0 orphan) found.
[    8.480000] mount_root: fs_getxattr(user.fs_state) = -1 (val=0, errno=61)
[    8.490000] mount_root: fs_setxattr(user.fs_state, 1) = 0 (errno=61)
[    8.500000] mount_root: fs_getxattr(user.fs_state) = 4 (val=1, errno=61)
[    8.500000] mount_root: overlay filesystem has not been fully initialized yet

comment:29 Changed 3 years ago by jow

Thank you for testing, this leads us to https://dev.openwrt.org/ticket/19538#comment:4

"I am not sure if having xattr orphans is a long term problem for jffs2?"
I'd say yes :)

comment:30 Changed 3 years ago by jow

There probably needs to be a proper handling of xattrs in https://dev.openwrt.org/browser/trunk/target/linux/generic/patches-3.18/111-jffs2-add-RENAME_EXCHANGE-support.patch - but I'm not a VFS guru so no immediate suggestion. Will look further.

Last edited 3 years ago by jow (previous) (diff)

comment:31 Changed 3 years ago by christian@…

@bittorf and I are based on much older openwrt trunk revisions (r44974) and we do have this problem.

comment:32 Changed 3 years ago by jow

The next try would be adding something like usleep(250000); before the getxattr syscall, to rule out a race condition. Maybe we're indeed querying the xattr too early directly after mount.

comment:33 Changed 3 years ago by christian@…

will try on 27.5, sorry for the gap.

it's actually easy to reproduce with trunk and ar71xx, I used a tplink 4300.

comment:34 Changed 3 years ago by jow

Will see if I find time until then, I'm not near any hw atm.

comment:35 Changed 3 years ago by nbd

please test if r45751 fixes the problem

comment:36 Changed 3 years ago by bittorf@…

it does not. after successful flashing, running for ~15mins and executing

I=1000; while let I-=1; do echo "foo" > /www/file$I; done && rm -rf /www/file* && reboot

the box doesnt boot anymore. will solder a serial console and report...

comment:37 Changed 3 years ago by christian@…

Test setup: r45772 on ar71xx + https://patchwork.ozlabs.org/patch/477135/

I can't reproduce this issue anymore. I think @ndb fixed it in r45751.

I ran 10 reboots on ar71xx without loosing the overlay.

@bittorf probably had problems with /ticket/19742.html

comment:38 Changed 3 years ago by bittorf@…

please close - fixed with r45833 and r45848

(
my dead device was because of some unrelated jffs2-hickup:

[   83.120000] jffs2: iget() failed for ino #114
[   83.130000] jffs2: warning: (1416) jffs2_get_inode_nodes: Eep. No valid nodes for ino #114.
[   83.140000] jffs2: warning: (1416) jffs2_do_read_inode_internal: no data nodes found for ino #114

)

comment:39 Changed 3 years ago by florian

  • Resolution set to fixed
  • Status changed from new to closed

comment:40 follow-up: Changed 3 years ago by anonymous

What does that mean now, I'm on r45812, is it safe? Or do I need to reflash?

comment:41 in reply to: ↑ 40 Changed 3 years ago by bittorf@…

Replying to anonymous:

What does that mean now, I'm on r45812, is it safe? Or do I need to reflash?

this is a dumb question! see: "fixed with r45833 and r45848".
your revision is older, so you dont have the fix.

comment:42 Changed 3 years ago by ufo@…

ok, doesnt firstboot for me with trunk.new odd behaviour (when installing some stuff to fill up the flash.. then reboot..

[    8.850000] jffs2: notice: (404) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[    8.870000] mount_root: switching to jffs2 overlay
[    9.120000] overlayfs: failed to create work directory (EROFS), mounting read-only

..

root@234-48:~# rm /etc/config/alfred 
rm: remove '/etc/config/alfred'? y
rm: can't remove '/etc/config/alfred': Read-only file system

root@234-48:~# mount
rootfs on / type rootfs (rw)
/dev/root on /rom type squashfs (ro,relatime)
proc on /proc type proc (rw,noatime)
sysfs on /sys type sysfs (rw,noatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noatime)
/dev/mtdblock3 on /overlay type jffs2 (rw,noatime)
overlayfs:/overlay on / type overlay (ro,noatime,lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work)
tmpfs on /dev type tmpfs (rw,relatime,size=512k,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600)
debugfs on /sys/kernel/debug type debugfs (rw,noatime)

so i cant remove some file from flash anymore :-(

comment:43 Changed 3 years ago by christian@…

ok, doesnt firstboot for me with trunk.

I don't know what do you mean by that? can you elaborate?

so i cant remove some file from flash anymore :-(

that's correct, in case your flash is full, the fs is mounted read only and you need to reinstall
The real problem is filling up the flash that it's full...

Did you ever try to fill up your linux hdd? it's not booting anymore even in bigger server like systems.

How would you imagine the behavior in case the flash is full and the squashfs can't get mounted writeable?

comment:44 follow-ups: Changed 3 years ago by ufo@…

why the flash is mounted read-only? if its full, its should stay read/write, so anyone would be able to delete some stuff, to get some free space.

without reboot i can do that this way:

Signature check passed.
root@234-48:/# df
Filesystem           1K-blocks      Used Available Use% Mounted on
rootfs                     448       240       208  54% /
/dev/root                 2304      2304         0 100% /rom
tmpfs                    14260       712     13548   5% /tmp
/dev/mtdblock3             448       240       208  54% /overlay
overlayfs:/overlay         448       240       208  54% /
tmpfs                      512         0       512   0% /dev
root@234-48:/# opkg install alfred
Installing alfred (2015.0-0) to root...
Downloading http://firmware.leipzig.freifunk.net:8006/firmware/buildroots/ar71xx_generic_20150602/packages/routing/alfred_2015.0-0_ar71xx.ipk.
Installing librt (0.9.33.2-1) to root...
Downloading http://firmware.leipzig.freifunk.net:8006/firmware/buildroots/ar71xx_generic_20150602/packages/base/librt_0.9.33.2-1_ar71xx.ipk.
Configuring librt.
Configuring alfred.
root@234-48:/# df
Filesystem           1K-blocks      Used Available Use% Mounted on
rootfs                     448       272       176  61% /
/dev/root                 2304      2304         0 100% /rom
tmpfs                    14260       712     13548   5% /tmp
/dev/mtdblock3             448       272       176  61% /overlay
overlayfs:/overlay         448       272       176  61% /
tmpfs                      512         0       512   0% /dev
root@234-48:/# opkg install tcpdump-mini
Installing tcpdump-mini (4.5.1-4) to root...
Downloading http://firmware.leipzig.freifunk.net:8006/firmware/buildroots/ar71xx_generic_20150602/packages/base/tcpdump-mini_4.5.1-4_ar71xx.ipk.
Installing libpcap (1.5.3-1) to root...
Downloading http://firmware.leipzig.freifunk.net:8006/firmware/buildroots/ar71xx_generic_20150602/packages/base/libpcap_1.5.3-1_ar71xx.ipk.
Configuring libpcap.
Collected errors:
 * pkg_write_filelist: Failed to open //usr/lib/opkg/info/tcpdump-mini.list: No space left on device.
 * opkg_install_pkg: Failed to extract data files for tcpdump-mini. Package debris may remain!
 * opkg_install_cmd: Cannot install package tcpdump-mini.
 * opkg_conf_write_status_files: Can't open status file //usr/lib/opkg/status: No space left on device.
root@234-48:/# df
Filesystem           1K-blocks      Used Available Use% Mounted on
rootfs                     448       388        60  87% /
/dev/root                 2304      2304         0 100% /rom
tmpfs                    14260       712     13548   5% /tmp
/dev/mtdblock3             448       388        60  87% /overlay
overlayfs:/overlay         448       388        60  87% /
tmpfs                      512         0       512   0% /dev
root@234-48:/# opkg remove alfred
Removing package alfred from root...
sed: can't create temp file '/etc/crontabs/rootRrZbCH': No space left on device
Not deleting modified conffile /etc/config/alfred.
root@234-48:/# df
Filesystem           1K-blocks      Used Available Use% Mounted on
rootfs                     448       352        96  79% /
/dev/root                 2304      2304         0 100% /rom
tmpfs                    14260       712     13548   5% /tmp
/dev/mtdblock3             448       352        96  79% /overlay
overlayfs:/overlay         448       352        96  79% /
tmpfs                      512         0       512   0% /dev
root@234-48:/# rm /etc/config/alfred 
root@234-48:/# opkg remove librt
Refusing to remove essential package librt.
        Removing an essential package may lead to an unusable system, but if
        you enjoy that kind of pain, you can force opkg to proceed against
        its will with the option: --force-removal-of-essential-packages
No packages removed.
root@234-48:/# opkg remove librt --force-removal-of-essential-packages
Removing essential package librt under your coercion.
        If your system breaks, you get to keep both pieces
Removing package librt from root...
root@234-48:/# df
Filesystem           1K-blocks      Used Available Use% Mounted on
rootfs                     448       348       100  78% /
/dev/root                 2304      2304         0 100% /rom
tmpfs                    14260       712     13548   5% /tmp
/dev/mtdblock3             448       348       100  78% /overlay
overlayfs:/overlay         448       348       100  78% /
tmpfs                      512         0       512   0% /dev
root@234-48:/# 

comment:45 in reply to: ↑ 44 Changed 3 years ago by bittorf@…

why the flash is mounted read-only? if its full, its should stay read/write, so anyone would be able to delete some stuff, to get some free space.

we should really add a message/warning for that via /etc/profile

when the flash is full, overlay-fs is mounted readonly. you can still
remove stuff via JFFS2, e.g.: rm '/overlay/upper/myfile' - please try and report.

comment:46 in reply to: ↑ 44 Changed 3 years ago by hnyman

Replying to ufo@…:

why the flash is mounted read-only? if its full, its should stay read/write, so anyone would be able to delete some stuff, to get some free space.

Because the current version of the overlayfs requires a tiny amount of free space to mount the partition as r/w. Mounting it read-only is a fallback option created upstream when it became apparent that 100% full file system could not be mounted at all: https://www.mail-archive.com/linux-unionfs@vger.kernel.org/msg00246.html

comment:47 Changed 3 years ago by bittorf@…

i tried to enforce a readonly-mount with filling up JFFS2 and afterwards deleting something: this works:

root@box:~# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                  448.0K    260.0K    188.0K  58% /
/dev/root                 2.3M      2.3M         0 100% /rom
tmpfs                    13.9M    128.0K     13.8M   1% /tmp
/dev/mtdblock3          448.0K    260.0K    188.0K  58% /overlay
overlayfs:/overlay      448.0K    260.0K    188.0K  58% /
tmpfs                   512.0K         0    512.0K   0% /dev

root@box:~# dd if=/dev/urandom of=/www/test.bin bs=16k count=999
dd: writing '/www/test.bin': No space left on device
11+0 records in
9+1 records out

root@box:~# reboot && exit

# (after login again)

root@box:~# mount   # (see ro-mount of overlayfs)
rootfs on / type rootfs (rw)
/dev/root on /rom type squashfs (ro,relatime)
proc on /proc type proc (rw,noatime)
sysfs on /sys type sysfs (rw,noatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noatime)
/dev/mtdblock3 on /overlay type jffs2 (rw,noatime)
overlayfs:/overlay on / type overlay (ro,noatime,lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work)
tmpfs on /dev type tmpfs (rw,relatime,size=512k,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600)
debugfs on /sys/kernel/debug type debugfs (rw,noatime)

root@box:~# ls -l /overlay/upper/www/test.bin 
-rw-r--r--    1 root     root        159744 Jun  3 20:51 /overlay/upper/www/test.bin

root@box:~# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                  448.0K    408.0K     40.0K  91% /
/dev/root                 2.3M      2.3M         0 100% /rom
tmpfs                    13.9M    120.0K     13.8M   1% /tmp
/dev/mtdblock3          448.0K    408.0K     40.0K  91% /overlay
overlayfs:/overlay      448.0K    408.0K     40.0K  91% /
tmpfs                   512.0K         0    512.0K   0% /dev
root@box:~# rm /overlay/upper/www/test.bin

root@box:~# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                  448.0K    252.0K    196.0K  56% /
/dev/root                 2.3M      2.3M         0 100% /rom
tmpfs                    13.9M    120.0K     13.8M   1% /tmp
/dev/mtdblock3          448.0K    252.0K    196.0K  56% /overlay
overlayfs:/overlay      448.0K    252.0K    196.0K  56% /
tmpfs                   512.0K         0    512.0K   0% /dev

i will send a patch to the mailinglist with as a proposal...

comment:48 Changed 3 years ago by gnutella

Running today's snapshot.

Same problem here, with a TP-LINK 3600. After one day of running quite, the overlayFS filled 100% for no reason and now my system is readonly.

So I copied /overlay to a USB key (ext4 partition), now I am planning to remove something in /overlay/upper and reboot and migrate overlay to a USB key.

Will tell you if this works.

Last edited 3 years ago by gnutella (previous) (diff)

comment:49 Changed 3 years ago by gnutella

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:50 Changed 3 years ago by jow

  • Resolution set to invalid
  • Status changed from reopened to closed

"My system is readonly" is not the same as "system resets to defaults". The readonly mount is the expected behaviour.

Add Comment

Modify Ticket

Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.