OK, thank you.
That doesn't tell us why it is nondeterministic though: why 50% of the pings go through and 50% don't. It's similar in your packet captures. It's not the case that you are locked out completely; some packets flow, but not enough to actually complete an HTTP request. The idea that ACKs are being lost is my favorite one, because it explains why TCP shuts down after an initial brief burst. It also explains why lots of UDP makes it through, since UDP doesn't wait for ACKs.
I'm sure there is something small here; a single command or a small set of commands will get us what we want. We just need to figure out what that set is.
I think there's something missing in the routing, or we need to enable forwarding in the firewall between the interfaces that we created.
Let's assume we are misunderstanding "not" and try again. Here are all the rules/tables we will use:
## force local traffic to go direct to bridge, table 99
ip route add 192.168.1.0/24 dev br-lan table 99
ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
## nonlocal traffic to LAN goes via table 100, we already matched priority 99
## to local traffic
ip route add 192.168.1.0/24 dev veth0 table 100
ip rule add to 192.168.1.0/24 table 100 priority 100
I suggest doing this in a temporary script so you can reboot the router if you get locked out.
OK, I will try now and post the results here.
Still the same, nothing new.
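Here is a minimal sketch of what such a temporary script could look like, assuming you want the rules undone automatically if you can't confirm that things still work. `DRY_RUN`, `CONFIRM`, `WAIT` and the function names are made up for illustration; with the default `DRY_RUN=1` it only prints the commands.

```sh
#!/bin/sh
# Sketch only: apply the trial rules, then undo them unless you confirm in time.
# DRY_RUN, CONFIRM, WAIT and the function names are illustrative assumptions.
DRY_RUN=${DRY_RUN:-1}                 # 1 = only print the commands
CONFIRM=${CONFIRM:-/tmp/routing-ok}   # touch this file to keep the rules
WAIT=${WAIT:-120}                     # seconds to wait before rolling back

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

apply_rules() {
    run ip route add 192.168.1.0/24 dev br-lan table 99
    run ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
    run ip route add 192.168.1.0/24 dev veth0 table 100
    run ip rule add to 192.168.1.0/24 table 100 priority 100
}

undo_rules() {
    run ip rule del priority 100
    run ip rule del priority 99
    run ip route flush table 100
    run ip route flush table 99
}

# Apply, wait, and roll back unless the confirm file has appeared.
trial() {
    rm -f "$CONFIRM"
    apply_rules
    sleep "$WAIT"
    if [ -e "$CONFIRM" ]; then echo "confirmed, keeping rules"; else undo_rules; fi
}
```

To use it for real you would add a `trial` call at the end, start the script in the background with `DRY_RUN=0`, and `touch /tmp/routing-ok` only if browsing still works. A reboot still clears everything, since none of this is saved in the config.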
OK, so I think the routing is correct. If it weren't correct you wouldn't get any pings back, right? Also, the packet captures show packets flowing from the internet to wlan, but mostly UDP. TCP tends to stop after ACKs get lost. Let's use this routing table in our script because it's very explicit about what it's doing.
What we have here is either some goofiness in the bridge, or some firewall problem preventing data from getting back to servers on the internet, so they don't see ACKs.
Question for you right now: do you see a quiet network in netdata, or is it still something like a continuous 1 Mbps on br-lan, as in the graphs above?
On veth0 I see about 50 kbit sent and something near zero received; veth1 is the opposite.
How about on br-lan and pppoe-wan? Also, do you have some default fireqos running from before we started with the veth stuff? Please run
fireqos clear_all_qos
then post output of
ip rule show
ip route show
and then look at the interface traffic in netdata and give me approximate numbers for br-lan, veth0, and pppoe-wan.
EDIT: the point is that it's hard for me to know what state your router and network are in at any given time. I don't know when you rebooted, what scripts run on boot, etc., so I want to understand the state.
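If reading the numbers off netdata is awkward, roughly the same figures can be taken from the kernel's byte counters. This is just a sketch, assuming the standard Linux counters under /sys/class/net/<dev>/statistics; `SYSFS`, `kbit_per_s` and `iface_rate` are names I made up.

```sh
#!/bin/sh
# Sketch: approximate rx/tx rate of an interface from the sysfs byte counters.
# SYSFS, kbit_per_s and iface_rate are illustrative names, not an existing tool.
SYSFS=${SYSFS:-/sys/class/net}

# kbit/s from two byte-counter readings taken a whole number of seconds apart.
# Usage: kbit_per_s BYTES_BEFORE BYTES_AFTER SECONDS
kbit_per_s() {
    echo $(( ($2 - $1) * 8 / 1000 / $3 ))
}

# Usage: iface_rate DEV [SECONDS]   (default interval: 5 seconds)
iface_rate() {
    dev=$1
    secs=${2:-5}
    rx0=$(cat "$SYSFS/$dev/statistics/rx_bytes") || return 1
    tx0=$(cat "$SYSFS/$dev/statistics/tx_bytes") || return 1
    sleep "$secs"
    rx1=$(cat "$SYSFS/$dev/statistics/rx_bytes") || return 1
    tx1=$(cat "$SYSFS/$dev/statistics/tx_bytes") || return 1
    echo "$dev rx $(kbit_per_s "$rx0" "$rx1" "$secs") kbit/s tx $(kbit_per_s "$tx0" "$tx1" "$secs") kbit/s"
}
```

For example, `for d in br-lan veth0 pppoe-wan; do iface_rate "$d" 5; done` would print one line per interface. The division is integer arithmetic, so treat the result as approximate.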
The only thing I can think of that would give nondeterministic performance is that somewhere packets are being dropped because a queue is filling up. This is particularly likely if tons of UDP packets tagged DSCP CS6 are being streamed back and forth. Without any QoS, the ACKs would eventually get through. With the QoS, the ACKs would just sit there waiting for an opportunity and never get it.
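One way to check the queue-drop idea is to look at the drop counters that `tc -s qdisc show` prints. A small sketch, assuming the usual statistics line of the form `Sent ... (dropped N, overlimits ...)`; `sum_dropped` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: add up the "dropped N" counters found in `tc -s qdisc show` output.
# Reads the tc output on stdin; sum_dropped is an illustrative name.
sum_dropped() {
    sed -n 's/.*(dropped \([0-9][0-9]*\),.*/\1/p' |
        awk '{ s += $1 } END { print s + 0 }'
}
```

Something like `tc -s qdisc show dev veth0 | sum_dropped`, run twice a few seconds apart, should show whether the drop count on that interface is climbing while browsing hangs.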
We could change the interface definitions to classify everything in one default class for testing:
interface veth0 lanin output rate 15500kbit qdisc fq_codel overhead 28
    class default
        match all
interface pppoe-wan wanout output rate 5500kbit qdisc fq_codel overhead 28
    class default
        match all
Then at least we'd have fq_codel working for us, and no data would have extra priority. Together with our veth routing, this should eliminate the possibility that high-priority traffic is squelching your ability to browse.
I did a reboot and didn't use fireqos, so we can be sure that fireqos isn't causing any problems.
For pppoe-wan there's 19 received and 29 sent; for br-lan, 99 received and 129 sent.
There are no special scripts running at boot, only 4 commands in the custom firewall rules: 2 to force clients to use the LEDE DNS, and the others to access my nano bridge in bridged mode.
Output of "ip rule show":
0: from all lookup local
99: from 192.168.1.0/24 to 192.168.1.0/24 lookup 99
100: from all to 192.168.1.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
Output of "ip route show":
default via 10.16.172.1 dev pppoe-wan proto static
10.10.20.0/24 dev eth0.4 proto kernel scope link src 10.10.20.21
10.16.172.1 dev pppoe-wan proto kernel scope link src 10.16.172.185
192.168.1.0/24 dev br-lan proto kernel scope link src 192.168.1.1
This is the script I use now:
ip link show | grep veth0 || ip link add type veth
ip link set veth0 up
ip link set veth1 up
ip link set veth1 master br-lan ## place the veth1 into br-lan bridge (We forgot this last time!)
## force local traffic to go direct to bridge, table 99
ip route add 192.168.1.0/24 dev br-lan table 99
ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
## nonlocal traffic to LAN goes via table 100, we already matched priority 99
## to local traffic
ip route add 192.168.1.0/24 dev veth0 table 100
ip rule add to 192.168.1.0/24 table 100 priority 100
The web browser said err_timeout.
Great, very helpful info. Since fireqos was not running, you had a connection before running the script, and you had problems after running the new routing commands, clearly we have a routing, firewall, or bridge problem. Can you run the routing commands one at a time and tell me at what point it disconnects? Is it OK to have the rule:
ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
but not rule
ip rule add to 192.168.1.0/24 table 100 priority 100
for example?
This command causes no internet: "ip route add 192.168.1.0/24 dev veth0 table 100".
Even before you enable the rule? Or was the rule already in place?
Before I enable it. I also tried when it's already in place.
So even if
ip rule show
does not show anything except default rules, and no rule for matching table 100
running
ip route add 192.168.1.0/24 dev veth0 table 100
causes your internet to hang?
EDIT: please check this and confirm. It seems impossible: adding a route to a table without any rule saying to use that table should not affect anything.
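To confirm which case you are in before adding the route, you can check whether any rule currently points at the table. A minimal sketch that reads `ip rule show` output on stdin; `table_in_use` is a name I made up.

```sh
#!/bin/sh
# Sketch: succeed if any line of `ip rule show` output (on stdin) says
# "lookup TABLE". table_in_use is an illustrative helper, not part of iproute2.
table_in_use() {
    awk -v t="$1" '
        { for (i = 1; i < NF; i++) if ($i == "lookup" && $(i + 1) == t) found = 1 }
        END { exit found ? 0 : 1 }'
}
```

For example: `ip rule show | table_in_use 100 && echo "a rule already uses table 100"`. If that prints nothing, routes added to table 100 should never be consulted.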
Sorry, it was "ip rule add to 192.168.1.0/24 table 100 priority 100" that causes no internet.
Output of "ip rule show":
0: from all lookup local
99: from 192.168.1.0/24 to 192.168.1.0/24 lookup 99
100: from all to 192.168.1.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
When I ran "ip route add 192.168.1.0/24 dev veth0 table 100", my internet was disconnected from my ISP.
I'm still confused. If you run these commands one by one I think it will shut off at location B. Please try it:
ip rule del priority 99 ## to be sure
ip rule del priority 100
ip route add 192.168.1.0/24 dev br-lan table 99
ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
## nonlocal traffic to LAN goes via table 100, we already matched priority 99
## to local traffic
ip route add 192.168.1.0/24 dev veth0 table 100
## LOCATION A
ip rule add to 192.168.1.0/24 table 100 priority 100
## LOCATION B
If it shuts off at location B, this makes sense to me, and we need to understand why. But if it shuts off at location A, then I think there is something like a bug in the kernel, since there is no rule to make us use table 100 until location B.
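If typing the commands one at a time is tedious, the test can be scripted so that it stops at the first command that breaks connectivity. This is a sketch under assumptions: `run_steps` and `CHECK` are hypothetical names, and 8.8.8.8 is just an example address to ping.

```sh
#!/bin/sh
# Sketch: run commands read from stdin one at a time and test connectivity
# after each, stopping at the first step that breaks it.
# run_steps, CHECK and the ping target are illustrative assumptions.
CHECK=${CHECK:-"ping -c 3 -W 2 8.8.8.8"}

run_steps() {
    n=0
    while IFS= read -r step; do
        [ -z "$step" ] && continue
        n=$((n + 1))
        echo "step $n: $step"
        # </dev/null keeps the step from eating the remaining lines on stdin
        sh -c "$step" </dev/null || echo "  (command returned non-zero)"
        if sh -c "$CHECK" </dev/null >/dev/null 2>&1; then
            echo "  connectivity OK"
        else
            echo "  connectivity LOST after step $n"
            return 1
        fi
    done
}

# Example (on the router): feed it the commands above, one per line, e.g.
#   run_steps <<'EOF'
#   ip route add 192.168.1.0/24 dev veth0 table 100
#   ip rule add to 192.168.1.0/24 table 100 priority 100
#   EOF
```

If it reports the loss after the `ip route add ... table 100` step, that is location A; after the `ip rule add ... table 100` step, that is location B. Pinging from the router itself only tests the router's own path, so browsing from the laptop is still the real test.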
I'm going to assume I'm right above, please test. In the meantime I will think out loud for a bit.
A packet comes from the internet through pppoe-wan and goes through the input tables, where it's marked and then NATted. It's from, say, 1.1.1.1 and it needs to go to 192.168.1.111, for example. So we look up rule 99, see that it only applies to packets from 192.168.1.0/24, and skip it. Then we look up rule 100 and see that it applies to all packets destined for 192.168.1.0/24, so we look in our ARP cache to see which MAC we need to send the packet to. We then route it to the output queue of veth0, destined for the MAC of our laptop, 192.168.1.111 in this example.
It goes through this veth0 queue and arrives at veth1, which is part of a bridge. The bridge looks up the MAC, sees it's on wlan1, and sends the packet out. It arrives at our laptop.
Now the laptop needs to send a packet acknowledging receipt. It knows to route via 192.168.1.1 and it has an ARP cache, so it sends a packet destined for the MAC of the bridge br-lan, with IP address 1.1.1.1 as the destination. The bridge receives this packet on wlan1, sends it through various iptables ***, and decides it needs to route it. It looks at rule 99 and sees the traffic doesn't match. It looks at rule 100 and sees the traffic doesn't match. It looks at rule 32766, sees that the "main" table applies, and routes this packet via the default route to pppoe-wan.
This is my model of how things work. Clearly, this is not what is going on, because if that were what was going on... we'd have connectivity.
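To make that model concrete, here is a toy version of the rule walk in shell. It is only a sketch of my mental model, not of what the kernel really does: it ignores the local table at priority 0 and the default table, and `in_lan` and `which_table` are made-up names.

```sh
#!/bin/sh
# Toy model of the rule walk described above (priorities 99, 100, 32766).
# Simplification: the "local" table at priority 0 and the "default" table
# are ignored. in_lan and which_table are illustrative names.

# True if the address is inside 192.168.1.0/24.
in_lan() {
    case $1 in
        192.168.1.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: which_table SRC DST   -> prints the table the model would use
which_table() {
    if in_lan "$1" && in_lan "$2"; then
        echo 99      # rule 99: from LAN to LAN, direct to br-lan
    elif in_lan "$2"; then
        echo 100     # rule 100: anything else to the LAN, via veth0
    else
        echo main    # rule 32766: everything else, default route
    fi
}
```

In this model `which_table 1.1.1.1 192.168.1.111` gives 100 for the incoming packet and `which_table 192.168.1.111 1.1.1.1` gives main for the ACK. On the router, `ip route get 192.168.1.111 from 1.1.1.1 iif pppoe-wan` should ask the kernel the same question, so the two answers can be compared.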
If we're right about the routing table, and we think that the place things go wrong is the ACK, then I see that this packet could get filtered by iptables at the point labeled ***, and this could be what's going wrong.
Now, why would you be able to send and receive 50% of pings if iptables were the problem? It seems that iptables can't be the problem, or I misunderstand what is going on.
So, here's an alternative problem: instead of the bridge receiving the packet and routing it, sometimes the bridge forwards the packet somewhere into a black hole. Where? Why?
EDIT: possible solution... use ebtables to force the bridge to route the return traffic. I am not super familiar with ebtables, but we could try that.
EDIT AGAIN: Looking into your packet captures, I'm seeing some packets that are very weird. Will upload a picture.
It shuts off at location B.
ebtables is installed now.
Hey, I noticed that in our script we don't add a route to deliver traffic for 192.168.1.1 to br-lan. Is that correct? Perhaps this is the problem.
ip rule del priority 99 ## to be sure
ip rule del priority 100
ip route add 192.168.1.0/24 dev br-lan table 99
ip rule add from 192.168.1.0/24 to 192.168.1.0/24 table 99 priority 99
## nonlocal traffic to LAN goes via table 100, we already matched priority 99
## to local traffic
## NOTE NEW COMMAND HERE
ip route add 192.168.1.1 dev br-lan table 100
ip route add 192.168.1.0/24 dev veth0 table 100
## LOCATION A
ip rule add to 192.168.1.0/24 table 100 priority 100
## LOCATION B
When you run this, and we have the route to br-lan, does it still shut off?