Okay, so at a measured goodput of 62 Mbps you will need at least:
62 * 65/64 * (1526/(1500-8-20-20)) = 66.1779011708 Mbps Sync
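If you want to double-check that arithmetic, here it is as a small Python snippet; the per-packet numbers are just the assumptions from the formula above (PTM 64/65 encoding, 1526 bytes on the wire per full-MTU frame, and 8 bytes PPPoE + 20 bytes IPv4 + 20 bytes TCP inside the 1500 byte MTU):

```python
# Sketch of the gross-rate estimate above; the overheads are the assumptions
# from the formula (PTM 64/65 encoding, 1526 bytes on the wire per full-MTU
# frame, 8 B PPPoE + 20 B IPv4 + 20 B TCP inside the 1500 B MTU).
goodput_mbps = 62.0
ptm_factor   = 65.0 / 64.0              # PTM 64/65 encoding overhead
wire_bytes   = 1526                     # on-the-wire size of a full frame
payload      = 1500 - 8 - 20 - 20       # TCP payload per full frame = 1452

required_sync = goodput_mbps * ptm_factor * (wire_bytes / payload)
print(round(required_sync, 2))          # -> 66.18 Mbps gross sync rate
```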
Your 67 Mbps is close, but the difference might indicate that your ISP does some traffic shaping at its BRAS/BNG level. Anyway, I would take this 66 Mbps as 100% of the gross rate. For ingress shaping one typically needs a larger bandwidth sacrifice to keep control over the queueing than for egress (in the limit you can set the egress shaper gross rate to 100% of the true link gross rate, as all you need to do is avoid overfilling the modem's buffers). 95% seems awfully tight for decent ingress/download shaping; we typically recommend using 85-90% of the ingress gross rate. BUT this is a trade-off between latency-under-load and bandwidth utilization which all users/networks need to figure out for themselves, or to put it differently, this is a policy question.
That said, 100 * 63.65/66.18 = 96.18% of the gross link rate seems tight enough to expect "back-spill" into the ISP's DSLAM buffers when data rushes in, and hence occasional ingress latency spikes. Mind you, these spikes can also occur when the shaper is set to lower percentages of the true link speed, as they depend on the difference between the accumulated incoming rate at the DSLAM and the bandwidth from the DSLAM to the modem, so one cannot fully rule them out (e.g. in a DoS situation our post-bottleneck ingress shaping is basically useless).
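Just to make those percentages concrete (purely illustrative, using the numbers from this thread and the 85-90% rule of thumb from above):

```python
# Same numbers expressed as percentages/absolute rates (illustrative only).
gross  = 66.18    # estimated gross link rate from above, in Mbps
shaper = 63.65    # currently configured ingress shaper rate, in Mbps

print(round(100 * shaper / gross, 2))           # -> 96.18 % of gross: quite tight
for pct in (85, 90):                            # the 85-90 % rule of thumb, in Mbps
    print(pct, "% ->", round(gross * pct / 100, 2), "Mbps")
```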
This is most likely the effect of all measurement flows sending something like 10 packets towards the end host all at once; while these are queued at the DSLAM, everything else experiences delays. Unfortunately that cannot be fully avoided when instantiating a shaper downstream of the true bottleneck.
This tends to be something else, potentially transient CPU overload on your router (caused by, say, WiFi processing), which means the shaper is not serviced for too long a period, and that immediately causes delay as well.
Which just means that this shaper bandwidth is well matched to the load generated by the number of flows from the test. Cake actually offers the "ingress" keyword to better deal with shaping ingress links (for example, it will not scale its packet dropping so that the shaper's egress rate matches the set rate, but rather so that the ingress rate matches the set rate, which works better for post-bottleneck shaping). IIRC, "ingress" also deals better with the number of concurrent flows on the ingress link (more flows typically require a higher bandwidth sacrifice so that all flows get a sufficiently strong signal to slow down), and it does this transiently depending on the number of active flows. In short, if you use the ingress keyword you might get away with a smaller permanent bandwidth sacrifice, but that is still policy and you need to balance it for your own network according to your preferences.
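To illustrate what the "ingress" keyword changes conceptually, here is a toy sketch; this is not cake's actual code, and the drop fraction is a made-up example. The point is only that with the default egress accounting the bottleneck link carries the set rate plus whatever gets dropped, while with ingress accounting the arriving bytes themselves are held to the set rate:

```python
# Toy sketch of the accounting difference (assumptions, not cake's code):
# with drop fraction p, how much traffic crosses the physical downlink
# (forwarded plus dropped bytes) for a given shaper setting?
def link_rate_mbps(set_rate_mbps, drop_fraction, ingress_mode):
    if ingress_mode:
        # cake "ingress": arriving bytes (including drops) are held to the
        # set rate, so the bottleneck link carries exactly the set rate
        # (the forwarded goodput is correspondingly a bit lower).
        return set_rate_mbps
    # default egress accounting: only forwarded bytes are paced to the set
    # rate, so the link additionally has to carry the dropped bytes on top.
    return set_rate_mbps / (1.0 - drop_fraction)

print(round(link_rate_mbps(60.0, 0.03, ingress_mode=False), 1))  # ~61.9 Mbps on the link
print(round(link_rate_mbps(60.0, 0.03, ingress_mode=True), 1))   # 60.0 Mbps on the link
```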
Sort of. My take on this is that the pretty much synchronized nature of the flows in a speedtest results in, say, 16 flows dumping 10 packets each (assuming an initial window of 10) into the link simultaneously, resulting in packets worth up to
(16 * 10 * 1526 * 8) / (63.65 * 1000 * 1000) * 1000 = 30.6878240377 ms
of transfer time on the bottleneck link piling up at the upstream end of said bottleneck link. These will be queued by the DSLAM, and any other packets, like the packets used to probe latency, will be stuck behind them and experience an additional ~30 ms of delay. Does this make sense to you?
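The same back-of-the-envelope as a Python snippet, with the same assumptions (16 flows, initial window of 10, 1526 bytes per packet on the wire, 63.65 Mbps shaper rate):

```python
# Back-of-the-envelope for the burst above (assumptions: 16 flows, initial
# window 10, 1526 bytes per packet on the wire, 63.65 Mbps shaper rate).
flows, iw, pkt_bytes = 16, 10, 1526
rate_bps = 63.65e6

burst_bits = flows * iw * pkt_bytes * 8
delay_ms = burst_bits / rate_bps * 1000
print(round(delay_ms, 1))   # -> ~30.7 ms queued behind the burst at the DSLAM
```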
You tell me. This is a policy decision every network operator needs to make individually; there is no true answer to this. If you ask me, sure, I would happily sacrifice those 4 Mbps, but I would also not be ashamed to completely disable my shaper if I temporarily needed maximum bandwidth.