Modify

Opened 6 years ago

Last modified 22 months ago

#10225 accepted defect

Issue with UDP connections

Reported by: anonymous Owned by: jow
Priority: response-needed Milestone: Chaos Calmer 15.05
Component: packages Version: Backfire 10.03.1 RC5
Keywords: Cc:

Description

I upgraded from 10.03.1 RC4 to 10.03.1 RC5 and have the following problem. This problem was not present in RC4.

After the upgrade, my Asterisk server (Not in OpenWRT, but separate system) didn't work anymore out of the box, but with the installation of "iptables-mod-conntrack-extra" and "kmod-ipt-conntrack-extra" it could successfully register again with my IPSP.
This is UDP traffic on port 5060.

After about 4-5 days, Asterisk is unable to register to the ITSP.
I register with several ITSPs and the problem does not occur with all of them simultaniously, but in the timespan of several hours.
It seems something goes wrong with handling this UDP traffic.
I have a fixed external IP address.

It is hard to pinpoint the problem, as other connections (webbrowsing, etc...) continue to work.

Restarting the firewall service does not solve the problem, rebooting OpenWRT does solve it for a few days.

When the problem happens again, I will test if UDP DNS traffic still passes the firewall, as I don't know if this is a general issue with UDP or not.

Is there anything else I can do to troubleshoot this issue?

Attachments (0)

Change History (32)

comment:1 Changed 6 years ago by anonymous

It seems the issue does not affect new UDP connections, but connections that are open for a long time.
It looks like my connections die after more or less the same amount of packets have been sent:

IPV4 UDP <INTERNAL IP>:5060 <ITSP 1>:5060 20.65 MB (36776 Pkts.)
IPV4 UDP <INTERNAL IP>:5060 <ITSP 2>:6060 16.23 MB (36777 Pkts.)
IPV4 UDP <INTERNAL IP>:5060 <ITSP 3>:5060 20.36 MB (36778 Pkts.)
IPV4 UDP <INTERNAL IP>:5060 <ITSP 4>:5060 20.44 MB (36781 Pkts.)

I will comment again the next time to see if the amount of packets is the same.

comment:2 Changed 5 years ago by tom

Got same problem here with r33132 trunk+svn9008, at some point it cannot register to sip server anymore. Maybe its related to QoS ? I have one rule defined with conntrack/number of bytes. But after disabling QoS it still doesnt work. As soon i replace the router against non Openwrt, everthing works in Asterisk.

I can see with nmap, that i can reach UDP 5060 on the remote server. Maybe the UDP packets from the remote side dont go through openwrt?

I noticed i have also tos bits set in sip.conf. Sip packets are defined as cs3 and starting version 1.4.18 its possible to use TOS without root user.

maybe related:
/ticket/11929.html
/ticket/7410.html

comment:3 Changed 5 years ago by tom

Disabled all Tos settings, no change.
When the connection in asterisk disappears it says "Registration timed out" whereas in Openwrt i still can see the sip port as connected.
I tried ip address und register as in ticket 11929, but didnt work.

comment:4 Changed 5 years ago by tom

When the problem appears i can still see the udp connection as active under realtime connections in openwrt. The packets are still increasing. So there is some connection there, but with problems.
If i then stop asterisk and wait couple of seconds (1min or so) until the connection disappears in openwrt, start asterisk again and it works.
The problem happens when the provider is doing a pppoE reconnect every 24h.
If i do simple "ifup wan" on openwrt i cannot always reproduce the problem.
It may has to do when the external ip does *not* or *does* change after reconnect.

strace.log from asterisk:
1407 <... poll resumed> ) = 0 (Timeout)
1407 gettimeofday({1347196030, 765447}, NULL) = 0
1407 read(3, "H\324\372\240", 4) = 4
1407 read(3, "\342KiD", 4) = 4
1407 read(3, "sD\4\304", 4) = 4
1407 read(3, "\273}\21\220", 4) = 4
1407 read(3, "}_\277\7", 4) = 4
1407 read(3, "\215\t\246\274", 4) = 4
1407 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 24
1407 connect(24, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 0
1407 getsockname(24, {sa_family=AF_INET, sin_port=htons(39770), sin_addr=inet_addr("192.168.2.2")}, [16]) = 0
1407 close(24) = 0
1407 read(3, "\347\330\226a", 4) = 4
1407 read(3, "D\250e\362", 4) = 4
1407 read(3, "\31\254\257\334", 4) = 4
1407 read(3, "W\205|\265", 4) = 4
1407 read(3, "\220\222c\252", 4) = 4
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 767518}, NULL) = 0
1407 sendto(20, "OPTIONS sip:08192.sip.arcor.de S"..., 528, 0, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 528
1407 gettimeofday({1347196030, 767830}, NULL) = 0
1407 gettimeofday({1347196030, 767926}, NULL) = 0
1407 gettimeofday({1347196030, 768022}, NULL) = 0
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 768209}, NULL) = 0
1407 poll([{fd=20, events=POLLIN}], 1, 8) = 0 (Timeout)
1407 gettimeofday({1347196030, 776432}, NULL) = 0
1407 sendto(20, "REGISTER sip:08192.sip.arcor.de "..., 411, 0, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 411
1407 gettimeofday({1347196030, 776690}, NULL) = 0
1407 gettimeofday({1347196030, 776786}, NULL) = 0
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 776984}, NULL) = 0
1407 poll([{fd=20, events=POLLIN}], 1, 43 <unfinished ...>
1396 <... poll resumed> ) = 1 ([{fd=14, revents=POLLIN}])
1396 gettimeofday({1347196030, 786805}, NULL) = 0
1396 read(14, "\1\0\0\0\0\0\0\0", 8) = 8
1396 poll([{fd=15, events=POLLIN}, {fd=14, events=POLLIN|POLLPRI}], 2, -1 <unfinished ...>
1407 <... poll resumed> ) = 0 (Timeout)
1407 gettimeofday({1347196030, 820219}, NULL) = 0
1407 sendto(20, "REGISTER sip:08192.sip.arcor.de "..., 417, 0, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 417
1407 gettimeofday({1347196030, 820488}, NULL) = 0
1407 gettimeofday({1347196030, 820589}, NULL) = 0
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 820788}, NULL) = 0
1407 poll([{fd=20, events=POLLIN}], 1, 159 <unfinished ...>
1396 <... poll resumed> ) = 1 ([{fd=14, revents=POLLIN}])
1396 gettimeofday({1347196030, 836794}, NULL) = 0
1396 read(14, "\1\0\0\0\0\0\0\0", 8) = 8
1396 poll([{fd=15, events=POLLIN}, {fd=14, events=POLLIN|POLLPRI}], 2, -1) = 1 ([{fd=14, revents=POLLIN}])
1396 gettimeofday({1347196030, 886796}, NULL) = 0
1396 read(14, "\1\0\0\0\0\0\0\0", 8) = 8
1396 poll([{fd=15, events=POLLIN}, {fd=14, events=POLLIN|POLLPRI}], 2, -1) = 1 ([{fd=14, revents=POLLIN}])
1396 gettimeofday({1347196030, 936805}, NULL) = 0
1396 read(14, "\1\0\0\0\0\0\0\0", 8) = 8
1396 poll([{fd=15, events=POLLIN}, {fd=14, events=POLLIN|POLLPRI}], 2, -1 <unfinished ...>
1407 <... poll resumed> ) = 0 (Timeout)
1407 gettimeofday({1347196030, 980014}, NULL) = 0
1407 read(3, "W:\325\320", 4) = 4
1407 read(3, "V\323\264>", 4) = 4
1407 read(3, "\f\213\366\200", 4) = 4
1407 read(3, "\320!\371%", 4) = 4
1407 read(3, "=7\251g", 4) = 4
1407 read(3, "\277\244?\334", 4) = 4
1407 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 24
1407 connect(24, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 0
1407 getsockname(24, {sa_family=AF_INET, sin_port=htons(53222), sin_addr=inet_addr("192.168.2.2")}, [16]) = 0
1407 close(24) = 0
1407 read(3, " \223\322\315", 4) = 4
1407 read(3, "\277\236\275\240", 4) = 4
1407 read(3, "/\2\225_", 4) = 4
1407 read(3, "\373Z\311F", 4) = 4
1407 read(3, "\333\352\342\276", 4) = 4
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 981830}, NULL) = 0
1407 sendto(20, "OPTIONS sip:08192.sip.arcor.de S"..., 528, 0, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("88.79.152.249")}, 16) = 528
1407 gettimeofday({1347196030, 982053}, NULL) = 0
1407 gettimeofday({1347196030, 982130}, NULL) = 0
1407 gettimeofday({1347196030, 982213}, NULL) = 0
1407 time(NULL) = 1347196030
1407 gettimeofday({1347196030, 982373}, NULL) = 0
1407 poll([{fd=20, events=POLLIN}], 1, 100 <unfinished ...>

comment:5 Changed 5 years ago by Damian Kaczkowski <damian.kaczkowski+openwrt@…>

Hi. I was experiencing the same issue (see #11259). This bug is caused by OpenWrt firewall implementation and iptables behavior. When your IP is changing or your WAN interface is down and sip re-register event occurs then iptables changes source IP of your NAT'ed sip connection to your LAN interface IP (effectively disabling NAT for this connection). Now when WAN interface goes up again iptables won't do the opposite, leaving non NAT'ed connection in conntrack table. Now the only way to restore your badly NAT'ed connection is to flush conntrack table or wait until conntrack udp timeout expires (e.g. shutdown Asterisk for above a minute).

I don't know how to properly fix this error, this is beyond my knowledge. However you can try this as a working workaround:

  1. Delete your current MASQUERADE rule and add the following rule:
    iptables -t nat -I POSTROUTING -o pppoe-wan -j MASQUERADE
    

iptables will now work as it should, rewriting your sip connection conntrack table entry properly when your WAN interface comes up again.

  1. Alternatively try to higher the re-register timer in case of register failure or timeout in your Asterisk to something above 60 seconds.
  1. Alternatively you can try to play with /proc/sys/net/netfilter/nf_conntrack_udp_timeout and lower the timeout value. But I don't recommend this solution.

Could any developer please look closer into this? Are there any chances for proper fix?

comment:6 Changed 5 years ago by tom

In the meantime i found multiple forum threads about this issue.

The command:
iptables -L -t nat | grep 5060
is showing correct values.

Whereas cat /proc/net/nf_conntrack | grep 5060 was showing my local ip as destination instead of the public ip.

nf_conntrack_udp_timeout is set to 60 seconds. Thats exactly the timespan which has to be wait after asterisk was closed, to be started again and it works after.

In asterisk there is one setting in sip.conf:
registertimeout = Number : Number of seconds to wait for a response from a SIP Registrar before classifying the SIP REGISTER has timed out. Default 20 seconds. (New in v1.2.x).

So as already above said, maybe it works to just increase the asterisk value to >60sec.
The other way round: to decrease nf_conntrack_udp_timeout needs too small values i think.

I did now an workaround by installing conntrack-tools.
In cron i have placed: ifup wan to control when the reconnection takes place. And some minutes later the command: conntrack -F is issued.

But it would be good to know why other routers dont have this problem.

comment:7 Changed 5 years ago by Damian Kaczkowski <damian.kaczkowski+openwrt@…>

Other routers don't have this problem cause they don't have such firewall package as OpenWrt do.

The problem is caused by:

iptables -t nat -A zone_wan_nat -j MASQUERADE

iptables got trouble with this for some reasons.

If you use:

iptables -t nat -I POSTROUTING -o <WAN_IFACE> -j MASQUERADE

instead, then iptables will update conntrack properly when your wan ip changes or your wan interface flaps.

comment:8 Changed 5 years ago by jow

Wrong conclusion. All rules jumping into zone_wan_nat have -o <WAN_IFACE>.

comment:9 Changed 5 years ago by Damian Kaczkowski <damian.kaczkowski+openwrt@…>

Yes they do have -o <WAN_IFACE>. But I don't know why iptables just don't like zone_wan_nat chain... If you masquerade directly in POSTROUTING then problem is gone. Tested and confirmed.

comment:10 Changed 5 years ago by tom

Instead of putting flush into cron its better to create a script. Because if the reconnect doesnt happen as planned, the flush will not be done at once.

Everytime a reconnect happens it will execute hotplug:
http://wiki.openwrt.org/doc/uci/firewall#hotplug.hooks.8.09.2

create:
/etc/hotplug.d/firewall/11-flush-conntrack

if [ "$ACTION" = "add" ] && [ "$INTERFACE" = "wan" ]; then
/usr/sbin/conntrack -F
fi

comment:11 Changed 5 years ago by Stefan.Nehlsen@…

Same problem here with outgoing udp-sessions from fastd. After reconnecting nat fails:

root@pluto:~# tcpdump -ni pppoe-wan port 10000
...............
22:53:31.027475 IP 86.103.161.250.57946 > 86.103.210.233.10000: UDP, length 116
22:53:31.027707 IP 86.103.161.250.57946 > 89.27.152.40.10000: UDP, length 116
22:53:31.081945 IP 89.27.152.40.10000 > 86.103.161.250.57946: UDP, length 64
22:53:31.106292 IP 86.103.172.156.10000 > 86.103.161.250.57946: UDP, length 64
tcpdump: pcap_loop: The interface went down
3650 packets captured
3798 packets received by filter
146 packets dropped by kernel
root@pluto:~# tcpdump -ni pppoe-wan port 10000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe-wan, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
22:53:37.357189 IP 192.168.242.180.57946 > 86.103.172.156.10000: UDP, length 64
22:53:37.357323 IP 192.168.242.180.57946 > 86.103.210.233.10000: UDP, length 64
22:53:37.357507 IP 192.168.242.180.57946 > 89.27.152.40.10000: UDP, length 64
22:53:37.457318 IP 192.168.242.180.57946 > 86.103.172.156.10000: UDP, length 64
22:53:37.457477 IP 192.168.242.180.57946 > 86.103.210.233.10000: UDP, length 64

My (temporary) solution is to drop the firewall stuff and set up my own rules.

Is there any documentation about the organization of queues and rules of the OpenWRT-firewall?

comment:12 Changed 5 years ago by anonymous

I confirm this annoying situation, there has been any progress? upgrading to attitude adjustem does fix the problem?

comment:13 Changed 5 years ago by Stefan.Nehlsen@…

I don't know if it is fixed now, it wasn't in 12.09-beta (r33312)) that I am still using.

Ticket #10225 had a woraround:

create a file /etc/hotplug.d/firewall/11-flush-conntrack with the content:

if [ "$ACTION" = "add" ] && [ "$INTERFACE" = "wan" ]; then
  /usr/sbin/conntrack -F
fi

comment:14 Changed 5 years ago by jow

  • Priority changed from normal to response-needed

Should be fixed with r35330 and r35348 - please test.

comment:15 Changed 5 years ago by tom

i dont have anymore problems in r35412

comment:16 Changed 5 years ago by jogo

  • Resolution set to fixed
  • Status changed from new to closed

comment:17 follow-up: Changed 4 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

It seems this has crept up again somehow... I am on r39765 and need to flush the conntrack-tables manually with "conntrack -F" every few days for asterisk to register with my SIP provider.

comment:18 Changed 4 years ago by anonymous

If this bug still exists, it will affect all UDP flows (e.g. OpenVPN udp/1194, SIP udp/5060 etc), see ticket #15779

comment:19 in reply to: ↑ 17 ; follow-up: Changed 4 years ago by anonymous

Replying to anonymous:

It seems this has crept up again somehow... I am on r39765 and need to flush the conntrack-tables manually with "conntrack -F" every few days for asterisk to register with my SIP provider.

Do you have custom scripts or rules which might be interfering with the firewall operation ?

comment:20 in reply to: ↑ 19 Changed 4 years ago by anonymous

Replying to anonymous:

Replying to anonymous:

It seems this has crept up again somehow... I am on r39765 and need to flush the conntrack-tables manually with "conntrack -F" every few days for asterisk to register with my SIP provider.

Do you have custom scripts or rules which might be interfering with the firewall operation ?

No, none.

comment:21 Changed 4 years ago by anonymous

If this error re-occurs every few days in your setup, perhaps next time it happens you could check (e.g. using "conntrack -G conntrack -p udp -s your_asterisk_ip -d your_itsp_ip --sport=5060 --dport=5060") and provide more info about the flow.

Also check possibly related ticket #15779

comment:22 Changed 4 years ago by nbd

please test if this still happens in current versions. if so, please mention the exact revision you tested.

comment:23 Changed 4 years ago by anonymous

Apparently this is still an issue:

https://forum.openwrt.org/viewtopic.php?id=51871

comment:24 Changed 4 years ago by anonymous

One solution requires (recent) conntrack:

http://serverfault.com/questions/367626/delete-specific-conntrack-entries

conntrack -D --src-nat --reply-dst old.wan.ip.addr

comment:25 Changed 4 years ago by jow

  • Milestone changed from Backfire 10.03.2 to Chaos Calmer (trunk)

Milestone Backfire 10.03.2 deleted

comment:26 Changed 4 years ago by anonymous

So is this going to be fixed before the final release of OpenWRT BB 14.07 as suggested by jow in https://forum.openwrt.org/viewtopic.php?pid=242176#p242176 or is it postponed for CC ?

comment:27 Changed 3 years ago by anonymous

I have this problem too. We use a VDSL connection with static IP, everytime the connection forcefully drops by the provider our VoIP telephones lose the connection and cannot reconnect.

Version: TP-LINK TL-WDR3600/4300/4310 Firmware Version OpenWrt Attitude Adjustment 12.09 / LuCI 0.11.1 Release (0.11.1)

comment:28 Changed 3 years ago by anonymous

I just checked and the (old) conntrack-tools v1.0.0 included in BB 14.07 seems to support the

conntrack -D --src-nat --reply-dst old.wan.ip.addr

syntax:

...
conntrack v1.0.0 (conntrack-tools): 13 flow entries have been deleted.

comment:29 Changed 3 years ago by jow

  • Owner changed from developers to jow
  • Status changed from reopened to accepted

r42114 (http://nbd.name/gitweb.cgi?p=firewall3.git;a=commitdiff;h=2807cc26b8e46eef5f23c06534a853dd48183331;hp=91953d6a6e90df988f442f53097bd208784a295d) should resolve this issue by clearing out conntrack entries that refer to the old wan ip. I'll keep this ticket open until more people confirm that it is working properly, I cannot test it under real life conditions since my cable ISP always hands out the same address.

If the fix is confirmed working I'll backport all required parts to BB.

comment:30 Changed 3 years ago by anonymous

Works for me.

comment:31 Changed 2 years ago by fthiery

Using 15.05, also using voip phones going out. I checked and the phones have a keep-alive feature (20s), and the current default udp timeout is 60s in openwrt. Am i right assuming that the udp connexion will never expire as long as the 20s keep alive is running ?

Is there a way to whitelist an IP address so that the udp connections to/from this ip never expire ?

comment:32 Changed 22 months ago by anonymous

I have problem with corrupted UDP packets too:

This is what opposite side gets:

"15:07:06.465213 IP public-gprsxxxxxxxx.centertel.pl.24755 > alerts0.xxxxxxxx.com.4021: UDP, length 32

and here is the error log

Apr 11 15:07:06 alerts0 kernel: [1xxxxxxxx4.517485] UDP: bad checksum. From 37.xx.xxx.1xx:24755 to 1xx.xx.xxx.xx:4021 ulen 40"

Normally packets have about 500bytes.

After booting router everything run fine but only for few hours.

My WAN connection is 3G modem, and IP is changing quite often.

Add Comment

Modify Ticket

Action
as accepted .
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.