[Solved] I can't get FireHOL to work on latest LEDE trunk

dlakelan · December 5, 2017, 8:02pm

Hmm, that error seems to mean that you already have it in your bridge, so that's fine.

I have to say there's something confusing about this. I'm so in the dark about what your network looks like that the packet captures are hard to interpret. I see a ton of UDP traffic between 192.168.1.162 and 18.195.65.160 over your wlan1.

If you have "no internet", why is this traffic making it through? Who is using that traffic, and where does it come from? Perhaps the QoS is working fine and choking off your laptop because some higher-priority data stream is hogging the bandwidth.

dlakelan · December 5, 2017, 8:16pm

Our special policy routing table only affects packets that are not from 192.168.1.0/24 and are going to 192.168.1.0/24, so any packets going out from your LAN should take the very same route they took before we ran the script.

Looking carefully at your wlan1 packet capture, there are a LOT of UDP packets, and many of them are tagged CS6 and AF41. They come and go from the internet without trouble. I think this means that you do have internet access; you just don't have bandwidth for the thing you want.

It's actually the outbound packets from 192.168.1.162 that are marked CS6; I guess the inbound packets had their DSCP zeroed out by our nice rules. I don't know what all that UDP traffic is, but it certainly seems to be taking up all the bandwidth, and the TCP connections you're trying to make just trickle through.
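For reference, the CS6 and AF41 labels in the capture are just names for values of the 6-bit DSCP field, which sits in the top six bits of the IPv4 TOS byte. A minimal Python sketch of the decoding (the name table covers only a few standard code points and is not exhaustive):

```python
# A few standard DSCP code points: class selectors (RFC 2474),
# AF41 (RFC 2597) and EF (RFC 3246). Not an exhaustive table.
DSCP_NAMES = {
    0: "CS0",
    8: "CS1",
    34: "AF41",
    46: "EF",
    48: "CS6",
    56: "CS7",
}


def dscp_from_tos(tos: int) -> int:
    """Return the 6-bit DSCP value stored in the top bits of the IPv4 TOS byte."""
    return (tos >> 2) & 0x3F


def describe_tos(tos: int) -> str:
    """Name the DSCP class of a TOS byte, falling back to the raw number."""
    dscp = dscp_from_tos(tos)
    return DSCP_NAMES.get(dscp, f"DSCP {dscp}")


if __name__ == "__main__":
    # 0xC0 and 0x88 are the TOS bytes a capture shows for CS6 and AF41.
    for tos in (0xC0, 0x88, 0x00):
        print(f"TOS 0x{tos:02X} -> {describe_tos(tos)}")
```

CS6 (48) is normally meant for network control traffic (RFC 4594), so seeing bulk UDP from a phone tagged that way is unusual.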

hisham2630 · December 5, 2017, 8:46pm

No one is using the internet or streaming.

dlakelan · December 5, 2017, 8:59pm

May I suggest you check whether you are part of a botnet? What device on your LAN has the address 192.168.1.162? I hate to say it, but you've got an awful lot of bandwidth going to and from 18.195.65.160, which is an amazonaws location, possibly a botnet control server. Also, outbound it's tagged CS6, which is high priority, so whatever is generating it is trying to crowd out other traffic.

Figure out what that device is and turn it off, then test.

EDIT: suggested methodology

Reboot the router from scratch with no QoS running, and look at the bandwidth charts in LuCI: what's using bandwidth? Power off the other devices on your network, such as phones, tablets, IP cameras, smart TVs, streaming media boxes and IPTV. Plug your laptop into the router with an Ethernet cable if possible. Quit your browser and open only one or two tabs, at least one of which shows the LuCI bandwidth graphs. See how much bandwidth you are using under these circumstances, and try to get it down to basically zero.

Then start QoS. Stay connected to the router so you can see how much bandwidth you are using under the new QoS conditions, then check whether you have a connection. If you do, then something was using up all the bandwidth, and we can debug the QoS settings. If you have no connection at all, it could still be a routing, bridging or firewall problem. But I definitely see a lot of UDP bandwidth hitting your wlan, so I don't think the firewall or routing is the problem.
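If you'd rather read numbers than the LuCI graphs, the same per-interface byte counters live in /proc/net/dev. A rough sketch, assuming a Python interpreter is available where you run it (stock LEDE images don't ship one, so you may need to copy snapshots of /proc/net/dev elsewhere); the function names are made up for this example:

```python
import time
from typing import Dict, Tuple

Counters = Dict[str, Tuple[int, int]]  # interface -> (rx_bytes, tx_bytes)


def parse_proc_net_dev(text: str) -> Counters:
    """Parse /proc/net/dev: two header lines, then 'iface: rx fields... tx fields...'."""
    counters: Counters = {}
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is the first transmit field (tx bytes).
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mbit(before: Counters, after: Counters,
               seconds: float) -> Dict[str, Tuple[float, float]]:
    """Average (rx, tx) rate in Mbit/s per interface between two snapshots.

    Interfaces missing from the first snapshot are reported as 0.
    """
    rates = {}
    for name, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(name, (rx1, tx1))
        rates[name] = ((rx1 - rx0) * 8 / seconds / 1e6,
                       (tx1 - tx0) * 8 / seconds / 1e6)
    return rates


def sample(interval: float = 5.0) -> Dict[str, Tuple[float, float]]:
    """Take two snapshots of /proc/net/dev `interval` seconds apart."""
    with open("/proc/net/dev") as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        after = parse_proc_net_dev(f.read())
    return rates_mbit(before, after, interval)
```

Calling sample() with everything else switched off should show br-lan and pppoe-wan close to zero; anything that isn't is the traffic to chase.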

One issue might be that ACK packets fail to get back to the servers because of upstream QoS, so the servers stop sending data and you never get your web pages. Upstream bandwidth shaping can be important here.

hisham2630 · December 5, 2017, 9:33pm

192.168.1.162 is an Android phone, a Hotwav V9.
OK, I will test without it and post back.

hisham2630 · December 5, 2017, 9:39pm

Still no internet, even with all other devices disconnected.
I see a little bandwidth, around 1 Mbit/s or less, and it's just a spike.
I also tried disabling all FireQoS rules: still no internet, and I'm still not able to access my nano bridge on 20.20.10.10.

dlakelan · December 5, 2017, 9:47pm

OK, but you say that 1 Mbps is the entirety of your download, and QoS limits your download to that amount except for Google video. So what seems to be going on is that you initially get a spike of data, then ACK messages don't make it back to the source (which explains the many retransmits I see), so the transfer slows down and eventually just stops.
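To make the "spike, then stall" picture concrete: a TCP sender may only have a window's worth of unacknowledged data in flight, so if ACKs stop coming back it sends one burst and then goes quiet. A deliberately crude Python model (fixed window counted in segments, no retransmission timers or backoff; all names are hypothetical):

```python
from typing import Callable, List


def segments_sent_per_round(rounds: int, window: int,
                            ack_returns: Callable[[int], bool]) -> List[int]:
    """Segments a window-limited sender transmits in each round trip.

    The sender may keep at most `window` unacknowledged segments in flight.
    If ack_returns(r) is False, the ACKs for round r never arrive, so those
    segments stay "in flight" and keep the window closed.
    """
    in_flight = 0
    sent_log = []
    for r in range(rounds):
        sent = window - in_flight  # fill whatever room the window has left
        in_flight += sent
        sent_log.append(sent)
        if ack_returns(r):
            in_flight = 0  # everything outstanding was acknowledged
    return sent_log


if __name__ == "__main__":
    print("ACKs arrive:", segments_sent_per_round(5, 10, lambda r: True))
    print("ACKs lost:  ", segments_sent_per_round(5, 10, lambda r: False))
```

Real TCP would also retransmit after a timeout and back off exponentially, which is roughly the trickle of retransmits visible in the capture rather than a clean zero.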

20.20.10.10 is a public IP belonging to Microsoft; are you sure this is the nano bridge IP? Maybe you were thinking of 10.10.20.20, which is a private IP.
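A quick way to check which side of that line an address falls on is Python's standard ipaddress module; is_private covers the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) along with other reserved blocks:

```python
import ipaddress


def scope(addr: str) -> str:
    """Label an IPv4 address as 'private' or 'public' using the stdlib's registry."""
    return "private" if ipaddress.ip_address(addr).is_private else "public"


if __name__ == "__main__":
    for addr in ("20.20.10.10", "10.10.20.20", "192.168.1.162", "18.195.65.160"):
        print(addr, scope(addr))
```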

As for why ACKs don't return, that is mysterious, but it's something we can try to debug. A new capture with that phone and the other devices off would be helpful. Please capture on pppoe-wan: the last tcpdump still used "-i any", so I was getting packets from all interfaces, not just the pppoe-wan output.

dlakelan · December 5, 2017, 9:51pm

That also explains why UDP works fine: it just sends and doesn't wait for ACK packets.

Here's my suggestion: from a clean router reboot, run the veth-creation and route-creation commands one at a time, and after each command check whether you can surf. Which command makes it stop? My guess is the ip rule command, which starts using the new routing table. If it isn't that one, at what point does it break?
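The one-command-at-a-time test is essentially a linear search for the first step that breaks connectivity. A small helper sketch: run and can_surf are hypothetical callbacks you would supply, and the step list is whatever your script contains; nothing here is taken from the actual script.

```python
from typing import Callable, Optional, Sequence


def first_breaking_step(steps: Sequence[str],
                        run: Callable[[str], None],
                        can_surf: Callable[[], bool]) -> Optional[int]:
    """Apply steps in order and return the index of the first one after which
    connectivity is lost, or None if everything still works at the end."""
    if not can_surf():
        raise RuntimeError("no connectivity before the first step; fix that first")
    for index, step in enumerate(steps):
        run(step)
        if not can_surf():
            return index
    return None


if __name__ == "__main__":
    # Dry run with fake callbacks: pretend the third step breaks things.
    applied = []
    steps = ["step 1", "step 2", "step 3", "step 4"]  # placeholders, not real commands
    result = first_breaking_step(steps, applied.append, lambda: len(applied) < 3)
    print("first breaking step:", steps[result] if result is not None else None)
```

On the router, run could wrap subprocess.run(cmd, shell=True, check=True). The check is the part to think about: what matters here is whether the laptop can browse, not just whether the router itself can reach the internet.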

dlakelan · December 5, 2017, 11:22pm

I had another idea. The firewall lets existing connections through via connection tracking (conntrack). The problem is that we tell the firewall to restart, so your browser, which believes it already has a connection open, is sending data over a connection that is no longer tracked, and the firewall rejects it.

This is a definite possibility. Please remove the line from the script that restarts the firewall and see if that helps.
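A toy model of that hypothesis, assuming (as the post does) that the restart leaves the firewall with no record of connections opened before it, and that only connection-opening packets may create new entries. This is not how netfilter is implemented, just the logic of the argument:

```python
from typing import Set, Tuple

Flow = Tuple[str, int, str, int]  # (src ip, src port, dst ip, dst port)


class ToyStatefulFirewall:
    """Accept packets of tracked flows; only a SYN may start tracking a new flow."""

    def __init__(self) -> None:
        self.tracked: Set[Flow] = set()

    def packet(self, flow: Flow, syn: bool = False) -> str:
        if flow in self.tracked:
            return "ACCEPT"  # part of a connection we already know about
        if syn:
            self.tracked.add(flow)  # a new connection is being opened
            return "ACCEPT"
        return "REJECT"  # mid-stream packet of a connection we never saw open

    def restart(self) -> None:
        self.tracked.clear()  # assumption: restart forgets existing connections


if __name__ == "__main__":
    fw = ToyStatefulFirewall()
    flow = ("192.168.1.234", 50000, "52.0.252.94", 443)
    print(fw.packet(flow, syn=True), fw.packet(flow))  # ACCEPT ACCEPT
    fw.restart()
    print(fw.packet(flow))                             # REJECT
```

In this model, a brand-new connection opened after the restart would work again, which is one way to tell this failure apart from a routing problem.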

Here is a screenshot of packets from one of the connections in your last packet dump:

[Screenshot from 2017-12-05 15-56-59]

On the left is your WAN interface; on the right is your wlan1 interface.

As you can see, some packets from 52.0.252.94 destined for 192.168.1.234 do reach your wlan1, but the connection doesn't progress. Your machine sends a Client Hello, and the server responds with a Server Hello and some application data, but that data doesn't reach wlan1. ACKs, however, do reach wlan1. Eventually things time out and reset.

When I look at veth0, the data from the WAN connection does go across it, but evidently it doesn't wind up on the WLAN. This makes me think something is weird about the LAN bridge: only some packets are forwarded to wlan1 and not others. This isn't a routing issue, because the destination MAC is the final device: the packet is sent down veth0 and received at veth1, and the bridge should know where to send it on wlan1. Yet br-lan doesn't forward it to the destination on wlan1, although it does forward the ACKs, which seems weird. I wonder if stopping and restarting br-lan would help:

ip link br-lan set down
ip link br-lan set up
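For reference, a learning bridge decides where to send a frame purely from its destination MAC and the forwarding table it builds from the source MACs it has seen, which is why data and ACKs addressed to the same laptop would normally leave through the same port. A simplified Python model (no STP, no VLANs; broadcast and multicast destinations are treated like unknown unicast and flooded); the port names match br-lan in this thread and the MAC strings are made up:

```python
from typing import Dict, Iterable, Set


class LearningBridge:
    """Minimal MAC-learning bridge: learn on ingress, forward by destination MAC."""

    def __init__(self, ports: Iterable[str]) -> None:
        self.ports: Set[str] = set(ports)
        self.fdb: Dict[str, str] = {}  # MAC address -> port it was last seen on

    def receive(self, in_port: str, src_mac: str, dst_mac: str) -> Set[str]:
        """Return the set of ports the frame is sent out of."""
        self.fdb[src_mac] = in_port  # learn (or refresh) where the sender lives
        out_port = self.fdb.get(dst_mac)
        if out_port is None:
            return self.ports - {in_port}  # unknown destination: flood
        if out_port == in_port:
            return set()  # destination is on the ingress side: drop
        return {out_port}


if __name__ == "__main__":
    br = LearningBridge(["eth0.1", "veth1", "wlan0", "wlan1"])
    # The laptop talks first, so the bridge learns it lives on wlan1.
    print(sorted(br.receive("wlan1", "laptop-mac", "router-mac")))
    # A frame arriving from veth1 for the laptop now goes only to wlan1.
    print(sorted(br.receive("veth1", "router-mac", "laptop-mac")))
```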

dlakelan · December 6, 2017, 5:20am

Various web sources suggest a bridge can take tens of seconds to start forwarding traffic reliably after a new interface is added. Try stopping and restarting the bridge, then wait a full minute before trying to surf to new sites. You can also use "brctl showstp br-lan" to see the state of the ports.

hisham2630 · December 6, 2017, 12:14pm

Sorry, that was a typo; I meant 10.10.20.20. I use the following commands to reach my nano in bridge mode:

iptables -t nat -I postrouting_rule -s 192.168.1.0/24 -d 10.10.20.20 -j SNAT --to 10.10.20.21
iptables -I zone_lan_forward -s 192.168.1.0/24 -d 10.10.20.20 -j ACCEPT

I will do another capture with that phone off.
I will run the veth commands one at a time and see when I lose internet access.
I think it's the ip rule command that causes the problem; I'll try it and see.
I will remove the firewall restart from the script.
I will try restarting the bridge.
OK, I will post back with the results.

hisham2630 · December 6, 2017, 12:39pm

I'm still not able to surf the internet or access my nano on 10.10.20.20; I can't even access the ISP's IPTV page on 10.6.6.3. I removed the firewall restart command and also restarted the bridge. The bridge commands should be "ip link set down br-lan" and "ip link set up br-lan"; you wrote "ip link br-lan set down" and "ip link br-lan set up".

The new captures are here: https://expirebox.com/download/7989f1f8addad5d5bd6e2cd317663f05.html

I waited more than 5 minutes, with no luck.
The "brctl showstp br-lan" command isn't working, so I used "brctl show" and got:

bridge name     bridge id               STP enabled     interfaces
br-lan          7fff.c04a00e72346       no              eth0.1
                                                        veth1
                                                        wlan0
                                                        wlan1

hisham2630 · December 6, 2017, 1:38pm

The problem is the command "ip rule add not from 192.168.1.0/24 to 192.168.1.0/24 table 100": when I run it, I can't access the internet or the nano bridge.
But when I ping 10.10.20.20 from my laptop:

Pinging 10.10.20.20 with 32 bytes of data:
Reply from 10.10.20.20: bytes=32 time=2ms TTL=63
Request timed out.
Request timed out.
Reply from 10.10.20.20: bytes=32 time=1ms TTL=63

Ping statistics for 10.10.20.20:
    Packets: Sent = 4, Received = 2, Lost = 2 (50% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 2ms, Average = 1ms

When I run "ip rule del not from 192.168.1.0/24 to 192.168.1.0/24 table 100", everything goes back to normal.

If you compare table 100 with table 254 (the default LEDE main routing table), you can see:
"ip route show table 100":
192.168.1.0/24 dev veth0 scope link
192.168.1.1 dev br-lan scope link

"ip route show table 254":
default via 10.16.172.1 dev pppoe-wan proto static
10.10.20.0/24 dev eth0.4 proto kernel scope link src 10.10.20.21
10.16.172.1 dev pppoe-wan proto kernel scope link src 10.16.172.185
192.168.1.0/24 dev br-lan proto kernel scope link src 192.168.1.1

There's no route to pppoe-wan in our table 100, so there's no route to the internet.
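The comparison can be checked with a longest-prefix-match lookup over the two tables as printed above. A minimal Python sketch that models only destination-prefix matching (no metrics, scopes or policy rules):

```python
import ipaddress
from typing import List, Optional, Tuple

Table = List[Tuple[str, str]]  # (destination prefix, what "ip route" prints after it)

# Transcribed from the "ip route show" output above; a bare address means /32.
TABLE_100: Table = [
    ("192.168.1.0/24", "dev veth0"),
    ("192.168.1.1/32", "dev br-lan"),
]
TABLE_MAIN: Table = [
    ("0.0.0.0/0", "via 10.16.172.1 dev pppoe-wan"),
    ("10.10.20.0/24", "dev eth0.4"),
    ("10.16.172.1/32", "dev pppoe-wan"),
    ("192.168.1.0/24", "dev br-lan"),
]


def lookup(table: Table, dst: str) -> Optional[str]:
    """Return the most specific matching route for dst, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    best: Optional[Tuple[int, str]] = None
    for prefix, target in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, target)
    return best[1] if best else None


if __name__ == "__main__":
    for dst in ("192.168.1.234", "10.10.20.20", "52.0.252.94"):
        print(f"{dst:15} table 100: {lookup(TABLE_100, dst)!s:14} main: {lookup(TABLE_MAIN, dst)}")
```

One caution when reading the result: as far as I know, when the table chosen by a rule has no matching route, the kernel moves on to the next rule (usually ending at main) instead of giving up, which is why VPN setups add an explicit unreachable default to their tables. So the missing default in table 100 may not be the whole story.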
Can you explain why we need veth0 and veth1 for QoS? Is it for upload shaping or download shaping?
Also, whichever it is, can we just use some iptables commands to route from br-lan to veth1, or something else?

dlakelan · December 6, 2017, 3:06pm

In usual inbound shaping, a packet is copied from the interface it arrives on to an IFB device and goes through a queue there before being routed. However, it's copied before iptables has seen the packet, so it is impossible to match on marks or new DSCP tags. With veth, instead, the packet goes through the original interface's input queue, then through all the iptables rules where it can be marked, and then it is routed to the veth, where it can be shaped into different queues and so on.
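The ordering argument can be sketched as two pipelines that differ only in where the shaper sits relative to the firewall. This is a conceptual Python model, not kernel code; the stage names and the mark/DSCP values set by classify are hypothetical:

```python
from typing import Dict, List, Optional

Packet = Dict[str, int]

# IFB: ingress traffic is redirected to the ifb device and shaped there,
# before netfilter (iptables) has looked at it.
IFB_PATH = ["wan ingress", "shaper", "iptables", "routing", "lan"]
# veth: the packet is marked by iptables, routed to veth0, and shaped on egress.
VETH_PATH = ["wan ingress", "iptables", "routing", "shaper", "lan"]


def classify(pkt: Packet) -> Packet:
    """Stand-in for the firewall rules: set a mark and a DSCP value."""
    return {**pkt, "mark": 2, "dscp": 34}


def what_shaper_sees(path: List[str], pkt: Packet) -> Optional[Packet]:
    """Walk the stages in order and return the packet as the shaper saw it."""
    seen = None
    for stage in path:
        if stage == "iptables":
            pkt = classify(pkt)
        elif stage == "shaper":
            seen = dict(pkt)  # snapshot at the moment it is queued
    return seen


if __name__ == "__main__":
    arriving = {"mark": 0, "dscp": 0}
    print("IFB :", what_shaper_sees(IFB_PATH, arriving))   # mark/DSCP not set yet
    print("veth:", what_shaper_sees(VETH_PATH, arriving))  # classified before shaping
```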

The fact that our routing rule breaks connectivity is very useful. This rule is supposed to affect only packets coming from somewhere other than 192.168.1.0/24 and going to 192.168.1.0/24, so table 254 should be used for all other traffic. Since traffic going to 10.10.20.21 doesn't match the rule, the regular table should be used!

I will try to figure it out and post in half an hour or so.

hisham2630 · December 6, 2017, 3:08pm

OK, thank you, now it's clear.

dlakelan · December 6, 2017, 3:30pm

If I understand correctly, while the routing rule was in place, half of the ping packets worked and half didn't. Is that right?

hisham2630 · December 6, 2017, 3:31pm

Yup, right.
And the command "ip rule add not from 192.168.1.0/24 to 192.168.1.0/24 table 100" causes no internet and no nano bridge access.

dlakelan · December 6, 2017, 3:33pm

Was there any traffic shaping at all during that period when half the ping packets worked? Could a traffic-shaping rule have been responsible for dropping packets, or is this a problem with wacky packet delivery?

hisham2630 · December 6, 2017, 3:34pm

I tried with and without traffic shaping; the result is the same.

dlakelan · December 6, 2017, 4:14pm

I'm thinking we may not understand how the word "not" works in the routing rule. I've been assuming it means

(not from) 192.168.1.0/24 AND to 192.168.1.0/24

but perhaps it's like

not (from 192.168.1.0/24 AND to 192.168.1.0/24)

Let me read up on ip rule.
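The two readings can be compared on a few flows from this thread with a small Python truth table (the LAN prefix comes from the rule; the addresses are the laptop, server, nano and phone mentioned above). For what it's worth, my understanding is that ip rule's "not" sets a single invert flag that the kernel applies to the result of the whole selector, i.e. the second reading, which would also send LAN traffic bound for non-LAN destinations into table 100:

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.1.0/24")


def in_lan(addr: str) -> bool:
    return ipaddress.ip_address(addr) in LAN


def reading_1(src: str, dst: str) -> bool:
    """(not from 192.168.1.0/24) AND (to 192.168.1.0/24): what the script intended."""
    return not in_lan(src) and in_lan(dst)


def reading_2(src: str, dst: str) -> bool:
    """not (from 192.168.1.0/24 AND to 192.168.1.0/24): whole selector inverted."""
    return not (in_lan(src) and in_lan(dst))


FLOWS = [
    ("52.0.252.94", "192.168.1.234", "server -> laptop"),
    ("192.168.1.234", "52.0.252.94", "laptop -> server"),
    ("192.168.1.234", "10.10.20.20", "laptop -> nano"),
    ("192.168.1.234", "192.168.1.162", "laptop -> phone"),
]

if __name__ == "__main__":
    for src, dst, label in FLOWS:
        print(f"{label:17} reading 1: {reading_1(src, dst)!s:5}  reading 2: {reading_2(src, dst)}")
```

Only the two "laptop -> outside" flows differ between the readings, and those are exactly the ones that break while the rule is installed.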