Scrape https://downloads.openwrt.org for profiles.json

Hi,

We need to scrape https://downloads.openwrt.org for all profiles.json files for the firmware-selector.
This is done using wget:

wget -c -r -P /tmp/ -A "profiles.json" --reject-regex "kmods|packages" --no-parent https://downloads.openwrt.org

This fetches all profiles.json files for SNAPSHOT, 22.03.5 and 23.05.0-rc3, e.g.:

/tmp/downloads.openwrt.org/releases/22.03.5/targets/lantiq/xway/profiles.json
/tmp/downloads.openwrt.org/releases/23.05.0-rc3/targets/apm821xx/sata/profiles.json
/tmp/downloads.openwrt.org/snapshots/targets/at91/sama7/profiles.json

But all other release folders contain only empty directories, e.g.:

/tmp/downloads.openwrt.org/releases/19.07.10/targets/imx6/

But profiles.json files do exist on the server for that release, for example here: https://downloads.openwrt.org/releases/19.07.10/targets/apm821xx/nand/

Does anyone have an idea what is going wrong here?
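To see at a glance which directories wget created but left empty, a small check over the local mirror can help. This is only a sketch: the helper names `list_empty_dirs` and `count_profiles` are made up for illustration, and `-empty` is a find extension (available in GNU, BSD and usually BusyBox find) rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helper: print every empty directory below a local mirror root.
# Note: -empty is not POSIX, but GNU and BSD find support it.
list_empty_dirs() {
    find "$1" -type d -empty | sort
}

# Hypothetical helper: count the profiles.json files that were actually fetched.
count_profiles() {
    find "$1" -type f -name profiles.json | wc -l | tr -d ' '
}

# Example usage (the path follows from the -P /tmp/ option used above):
# list_empty_dirs /tmp/downloads.openwrt.org/releases
# count_profiles /tmp/downloads.openwrt.org
```

Comparing the two outputs between a run on a single release and a run on the whole site shows which releases were affected.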

@aparcar fyi

Some of the profiles.json files simply don't exist, maybe due to changes in the build system, failed builds, or something else. You need to check each one individually, like asu does: https://github.com/openwrt/asu/blob/main/asu/janitor.py
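Checking each location individually could look roughly like the sketch below. It is not taken from janitor.py; the function names are hypothetical, and the URL layout is assumed from the paths quoted earlier in this thread. `wget --spider` only checks whether the URL exists and does not download the body.

```shell
#!/bin/sh
# Hypothetical helper: build the URL where a target's profiles.json is expected.
# Layout assumed from the paths in this thread:
#   <base>/releases/<version>/targets/<target>/<subtarget>/profiles.json
#   <base>/snapshots/targets/<target>/<subtarget>/profiles.json
profiles_url() {
    base=$1
    version=$2
    target=$3
    case $version in
        snapshot|SNAPSHOT)
            printf '%s/snapshots/targets/%s/profiles.json\n' "$base" "$target" ;;
        *)
            printf '%s/releases/%s/targets/%s/profiles.json\n' "$base" "$version" "$target" ;;
    esac
}

# Hypothetical helper: return 0 if the URL exists on the server.
profiles_exists() {
    wget -q --spider "$1"
}

# Example usage:
# url=$(profiles_url https://downloads.openwrt.org 19.07.10 apm821xx/nand)
# if profiles_exists "$url"; then echo "found: $url"; else echo "missing: $url"; fi
```

This approach needs a list of versions and targets up front, which is the drawback raised in the next reply.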

@efahl that looks tedious, and it expects the profiles.json files to be in specific locations.

But to return to my original observation, wget also misses profiles.json files that exist and are linked in the index.html. Even the directory is created, but it stays empty.

This works for OpenWrt 19.07.10:

wget -c -r -P /tmp/ -A "profiles.json" --reject-regex "kmods|packages" --no-parent https://downloads.openwrt.org/releases/19.07.10/

The profiles.json files are there, e.g.:
/tmp/downloads.openwrt.org/releases/19.07.10/targets/apm821xx/nand/profiles.json

But go one folder up and it no longer works:

wget -c -r -P /tmp/ -A "profiles.json" --reject-regex "kmods|packages" --no-parent https://downloads.openwrt.org/releases/

rsync --bwlimit="8M" --del -m -r -t -v \
--include="*/" \
--include="profiles.json" \
--exclude="*" \
rsync://downloads.openwrt.org/downloads/ \
/tmp/openwrt/

https://openwrt.org/downloads#how_to_mirror

@vgaetera hey, that works. Thank you.

But this only works for downloads.openwrt.org. Other communities that use the firmware selector do not have rsync support on their download servers. It would be nice to have it working with wget. :)

Are you sure about this?
Wget seems terribly inefficient, being several dozen times slower.

How much load does Wget generate on the server?
How much will the load increase when scaling to multiple communities?

Consider adding rsync to the list of dependencies.
Or run it somewhere for caching and then make the clients use it.

Also consider contacting the server admin to save effort and power.
Aggregating the files on the server on a schedule might be your best option.

I do not know how other communities have their download servers configured, but I doubt that they offer rsync. I would like to keep the number of dependencies low. Performance and efficiency are not an issue, because this is only run once per release.

So you're assuming it's a rate limit from the server side?

I'm wondering if you could make the script slightly "smarter" by only re-downloading the snapshots, which change. A 22.03.x release, for example, will never change again; once you have downloaded it, you never have to traverse that specific path again.
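One way this idea could be sketched is with a marker file per release. Everything here is an assumption for illustration: the function names, the `.fetched` marker, and the local `releases/<version>` layout are not part of any existing script.

```shell
#!/bin/sh
# Hypothetical helper: return 0 (true) if a version should be fetched again.
# Snapshots always change, so they are always refetched.
# Releases are skipped once a marker file exists for them.
needs_fetch() {
    mirror_root=$1
    version=$2
    case $version in
        snapshot|SNAPSHOT) return 0 ;;
    esac
    [ ! -f "$mirror_root/releases/$version/.fetched" ]
}

# Hypothetical helper: record that a release was fetched completely.
mark_fetched() {
    mkdir -p "$1/releases/$2" && : > "$1/releases/$2/.fetched"
}

# Example usage:
# if needs_fetch /tmp/downloads.openwrt.org 22.03.5; then
#     ... run the download for that release here ...
#     mark_fetched /tmp/downloads.openwrt.org 22.03.5
# fi
```

A marker file is used instead of testing for existing profiles.json files because, as seen in this thread, a run can leave a release only partially downloaded.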

@aparcar for wget this seems to be a rate limit problem. Adding --limit-rate=8M to the wget call solves the issue. :)

Anyway, IMHO, making the script smarter would add too much complexity for a minor gain.
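For reference, the wget call from earlier in the thread with the rate limit added could be wrapped like this. The wrapper name `fetch_profiles` and the `DRY_RUN` switch are hypothetical; the wget options are the ones already used above plus `--limit-rate=8M`.

```shell
#!/bin/sh
# Hypothetical wrapper: run (or, with DRY_RUN=1, only print) the wget call
# from this thread with --limit-rate=8M added.
fetch_profiles() {
    url=$1
    dest=${2:-/tmp/}
    # Build the argument list once so the dry run prints what would be executed.
    set -- wget -c -r -P "$dest" -A "profiles.json" \
        --reject-regex "kmods|packages" --no-parent \
        --limit-rate=8M "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example usage:
# fetch_profiles https://downloads.openwrt.org/releases/ /tmp/
```

Whether 8M is the right value for other download servers is not known; it is simply the value reported to work here.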
