Get a list of file download links
Overview
This download method provides you with a text file containing URLs for every file in the collection you requested. After every URL is an “option” line that is compatible with the software aria2, which helps preserve the folder structure of the collection when you download it. With a few commands you can transform the file into a format that is compatible with curl if you prefer.
For many people, S3-compatible software or Globus File Transfer will be easier to use than this method. However, those options are not available for older versions of some larger collections in the Data Access Portal (DAP). In that case, “Get a list of file download links” is the only practical way to download the collection.
Downloading files using this method requires the use of command line software. If you’re not familiar with using command line software, we have provided some simple commands you can run.
Installing Software
aria2 lets you download all the files using a single command, and is available for Windows, macOS and Unix-like operating systems.
Alternatively you can use curl, ideally version 7.66.0 or later, which introduced the --parallel option. This is a good option when using macOS or a Unix-like system. macOS should have curl installed by default. Unix-like operating systems vary in whether they have curl installed by default, but will almost certainly have a package that can be installed. Our guide to using curl will suit more advanced users, but includes instructions that anyone can copy and paste with minor changes.
It’s also possible to use curl with Windows, but due to the formatting of the text file containing the URLs the steps are different.
Aria2
Installing aria2 on Windows
Windows users can download aria2 from the aria2 website: https://aria2.github.io/. If you’re not able to install aria2, you can follow our guide for how to download the files using PowerShell.
On the aria2 website, follow the link under the “Download” section and then scroll to the bottom of the page to the “Assets” list. In most instances you’ll want to download the file with “win-64bit” in the name.
Open the zip file you downloaded and drag the folder it contains to a place on your hard drive, e.g. “C:\apps”.
Select the address bar in File Explorer and copy the path to the folder, e.g. “C:\apps\aria2-1.37.0-win-64bit-build1”.
Open the Start menu of Windows and type the word “environment” to search for the option “Edit environment variables for your account”. Select this option.
In the Environment Variables window that appears, under the “User variables for {username}” section, find the “Path” option. Select this and click “Edit…”
Click “New” to add a new item to your path, then paste in the address you copied before. Then click “OK”
Click “OK” again on the “Environment Variables” window. You’re now ready to use aria2.
Installing aria2 on macOS
The easiest way to install aria2 on macOS is to use a package manager like Homebrew or MacPorts. If you’re not able to install this software on your system, you can follow our guide to using curl instead of aria2.
If you don’t have a package manager installed, first follow the installation instructions for one of the package managers. Once this is done you can use one of the methods below to install aria2.
Using Homebrew: https://formulae.brew.sh/formula/aria2
brew install aria2
Using MacPorts: https://ports.macports.org/port/aria2/
sudo port install aria2
Installing aria2 on Unix-like Systems
Many distributions can install aria2 via their package manager, e.g. for Debian based systems:
sudo apt install aria2
See your operating system documentation for more information on how to install packages. Alternatively, you can download compiled binaries or source code from the aria2 web site.
If you don’t have permission to install software on your system you can follow the guide on using curl instead.
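Before moving on, it can help to confirm which of the two tools is already available on your system. The following is a minimal sketch for macOS and Unix-like shells; the function name tool_status is only illustrative and is not part of either tool.

```shell
# Report whether a command is available on the PATH.
# "tool_status" is a hypothetical helper name used only for this example.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: installed"
    else
        echo "$1: not found"
    fi
}

tool_status aria2c
tool_status curl
```

If aria2c is reported as not found straight after installing it, opening a new terminal window so that the updated path is picked up is usually worth trying first.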
Downloading Files
- From the “Files” tab of the collection, click the “Download” button. It is not necessary to select any files beforehand.
- Select the option “Get a list of file download links”. Note that some collections will not have all the options displayed in the following image.
- Click “Request files”
- If you agree to the licence conditions, click the checkbox and then “Accept”.
- Click the “Download file links” button.
- A text file will be downloaded using your web browser’s default settings, e.g. many browsers will automatically download the file to your local account’s “Downloads” folder.
- Open a file browser and locate the text file. If necessary, move the file to a folder that you want to download the files to. For this example on Windows we are using “C:\downloads”, but you should select a location you want to use.
- Now you need to open a command line or terminal.
Windows – Open a Command Line
With your file browser displaying the folder you have saved the text file to, click inside the address bar.
Type in “cmd” without quotes and press <ENTER>. A new command line window will open from the folder you were viewing, e.g. “C:\downloads”.
macOS – Open a Terminal
With the Finder window open at the location you have saved the text file, ensure that you have the “Path Bar” displayed. You can do this by clicking “View” and then looking for the option “Show Path Bar”. If you see the option, click it. If the option says “Hide Path Bar” then the Path bar is already being displayed and you can leave the “View” menu.
At the bottom of the Finder window, in the Path Bar, right click on the folder you have open and select “Open in Terminal”.
A new terminal window will open from the folder you selected, e.g. “/Users/Shared/downloads”
Unix-like OS – Open a Terminal
Since there are many different desktop environments, it’s challenging to cover this in detail. If you’re using a Linux distribution we hope you already know how to open a terminal and navigate to the folder you want. Some desktop environments give you a context menu option, e.g. in XFCE there is an “Open Terminal Here” option when you right-click.
A new terminal window will open from the folder you selected, e.g. “~/Downloads/DAP”
Download Using aria2
When you selected the download method and downloaded the text file, a sample command was provided, e.g.
aria2c -x 16 -j 16 -s 16 --input-file 9572v001.txt --continue
Note that while the software package is called “aria2”, the command line program is called “aria2c”.
You should be able to copy and paste this command from the DAP webpage into your terminal and press enter to download the files. If you’re new to command line software, often you can copy the command to your clipboard and then right-click inside your terminal window to paste the command.
If you need to type in the command manually, ensure that you use the correct file name.
When the files have downloaded, a new sub-folder will be inside the folder you ran the command from, e.g. on Windows this might be “C:\downloads\000009572v001”. The new folder will contain two sub-folders, “data” and “metadata”. The “data” folder will contain the files of the collection, and the “metadata” folder will contain some files that include the licence, citation details, and checksums for validating the download.
If the collection has a folder structure, this will be preserved within the “data” folder.
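Once the command finishes, a quick count can indicate whether everything arrived. The sketch below assumes that every URL in the list corresponds to one downloaded file, and relies on aria2 keeping a control file ending in “.aria2” next to each unfinished download. The file and folder names are examples, and the helper function names are hypothetical.

```shell
# Sketch: compare the number of URLs in the list with the number of files
# that were downloaded. File and folder names below are examples only.
list_file="9572v001.txt"
download_dir="000009572v001"

count_urls() {
    # URL lines start with "https"; aria2 option lines are indented.
    grep -c '^https' "$1"
}

count_files() {
    # Count downloaded files, leaving out aria2 control files.
    find "$1" -type f ! -name '*.aria2' | wc -l | tr -d ' '
}

count_unfinished() {
    # A leftover ".aria2" control file marks a download that has not finished.
    find "$1" -type f -name '*.aria2' | wc -l | tr -d ' '
}

if [ -f "$list_file" ] && [ -d "$download_dir" ]; then
    echo "URLs listed:      $(count_urls "$list_file")"
    echo "Files downloaded: $(count_files "$download_dir")"
    echo "Unfinished:       $(count_unfinished "$download_dir")"
fi
```

If the two counts differ, or any unfinished downloads are reported, running the same aria2c command again should pick up the remaining files.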
The sample command adds some options to the command that are not default settings for aria2. The provided options mean:
- -x is the number of connections per host. -x 16 means that if you’re downloading a single very large file you can have up to 16 concurrent connections to download different parts of the same file, which should improve speeds. Note that for a single very large file you should also use this with the -s option, which would otherwise default to 5 and you would only get 5 connections.
- -j is the number of files to download simultaneously. If you’re downloading a large number of files this should improve speeds, regardless of whether the files are small or large.
- -s is the maximum number of file parts to “split” a single file download into. If you’re using 16 connections (-x 16) and downloading 16 files (-j 16) then you will only get one connection per file, but if you are downloading fewer than sixteen files then the number of concurrent parts per file will increase. This can increase speeds when you are downloading a single file or multiple large files.
- --continue means that aria2 will continue downloading a partially downloaded file where the connection was previously interrupted. When you first run the command it makes no difference, but if your download is interrupted and fails, you can run the exact same command, including --continue, and partially completed downloads will resume instead of restarting.
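Because --continue makes it safe to run the same command again, the re-running can be automated on an unreliable connection. The following is a minimal sketch for macOS and Unix-like shells; “retry” is a hypothetical helper written for this example, not an aria2 feature, and it assumes the command returns a non-zero exit status when a download fails.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# "retry" is a hypothetical helper, not part of aria2.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example usage (uncomment and change the file name to match your download):
# retry 5 aria2c -x 16 -j 16 -s 16 --input-file 9572v001.txt --continue
```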
Aria2 has many other options that you can use, although many of them are not relevant to downloading from the DAP. Advanced users can read about them on the aria2 documentation site.
The sample command should improve your download speeds over default settings if you have a fast network connection to CSIRO (e.g. you’re at an Australian academic institution using AARNet). If you are encountering problems, consider removing the following options from the command:
-x 16 -j 16 -s 16
If the download is using too much of your bandwidth, you can use the --max-overall-download-limit=<SPEED> option, e.g. if you wanted to limit the download to 10 MiB/s then you could use:
--max-overall-download-limit=10M
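To show where the limit fits, the sketch below assembles a full command that drops the extra connection options and adds the speed limit. The input file name is the example used earlier in this guide; change it to match the file you downloaded.

```shell
# Sketch: the sample command without the extra connection options,
# plus an overall speed limit. Change input_file to match your download.
input_file="9572v001.txt"
speed_limit="10M"   # 10 MiB/s

download_cmd="aria2c --input-file $input_file --continue --max-overall-download-limit=$speed_limit"

# Print the command for review; paste it into your terminal to run it.
echo "$download_cmd"
```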
curl
While the format of the file you download is not compatible with curl as a --config file, it is possible to create a compatible file using grep and some shell commands. Users of macOS and Unix-like systems can run the same commands. Windows users will need to run some PowerShell commands instead.
It’s best to use curl 7.66.0 or later, the version that introduced the --parallel argument. Since this was released in 2019 you probably don’t need to worry about this unless you’re using a very old system. You can run curl --version to see what version you have.
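If you would rather not compare version numbers by eye, the check can be scripted. This is a minimal sketch for macOS and Unix-like shells; “version_ge” is a hypothetical helper written for this example, and it assumes the first line of the curl --version output starts with the word “curl” followed by the version number.

```shell
# Return success if version $1 is at least version $2.
# "version_ge" is a hypothetical helper; it compares three numeric fields.
version_ge() {
    have=$1
    want=$2
    for field in 1 2 3; do
        h=${have%%.*}
        w=${want%%.*}
        # Keep only the leading digits, so "1-DEV" is treated as 1.
        h=${h%%[!0-9]*}
        w=${w%%[!0-9]*}
        h=${h:-0}
        w=${w:-0}
        if [ "$h" -gt "$w" ]; then return 0; fi
        if [ "$h" -lt "$w" ]; then return 1; fi
        # Move on to the next field; a missing field counts as 0.
        case $have in *.*) have=${have#*.} ;; *) have=0 ;; esac
        case $want in *.*) want=${want#*.} ;; *) want=0 ;; esac
    done
    return 0
}

if command -v curl >/dev/null 2>&1; then
    # The first line of "curl --version" starts with "curl <version>".
    installed=$(curl --version | head -n 1 | awk '{print $2}')
    if version_ge "$installed" 7.66.0; then
        echo "curl $installed supports --parallel"
    else
        echo "curl $installed is older than 7.66.0, so --parallel is unavailable"
    fi
else
    echo "curl is not installed"
fi
```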
curl – macOS and Unix-like Systems
To create a --config file that is compatible with curl you need to do a few things:
- Get lines from the text file that start with “https” and have no leading whitespace.
- Derive the destination file path from the URL. By default curl will ignore folder structure and also include query parameters in output file names, so this makes dealing with the downloaded files much easier.
- URL decode any special characters in the derived destination path.
This can be done using grep and some shell commands. Open a terminal in the folder that contains the text file of URLs, e.g. “~/downloads”.
You can copy and paste the below code sample into the terminal, but ensure that you change the name of the input text file on the first line, e.g. here it’s 9572v001.txt, but the text file you download will have a different name. You can also change the name of the output file on the final line if you prefer.
# Ensure you change the name of the input text file
grep -o "^https.*" 9572v001.txt | while IFS= read -r line ; \
do \
filename=$(grep -Eo "[0-9]{9}v[0-9]{3}/(data|metadata)/[^?]*" <<< "$line"); \
filepath="./${filename//%/\\x}"; \
printf 'url = "%s"\n' "$line"; \
printf 'output = "%b"\n\n' "$filepath"; \
done > curl_config.txt
One way to do this is to copy the above code sample and paste it into a text editor first. In the text editor, modify the file name on the first line to match the text file you have downloaded. Now copy the modified code from the text editor and right-click in your terminal window to paste it, then press <ENTER> to run the command. This will produce a text file called “curl_config.txt” in the same folder that you were working from.
The contents of “curl_config.txt” will look similar to the following, except in this example the URL query parameters have been removed for brevity:
url = "https://s3.data.csiro.au/dapprd/000009572v001/data/164.a2.391.tif?...
output = "./000009572v001/data/164.a2.391.tif"
url = "https://s3.data.csiro.au/dapprd/000009572v001/data/164.a2.393.tif?...
output = "./000009572v001/data/164.a2.393.tif"
# etc...
The use of a blank line between URLs with their options conforms to the examples provided in the curl documentation.
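Before starting a large download it can be worth checking that the generated file is complete. The sketch below counts the “url” and “output” lines and flags any output path that came out as just “./”, which is what the conversion would produce for a URL that did not match the expected path pattern. “check_config” is a hypothetical helper name used only for this example.

```shell
# Sketch: check that every "url" line in a curl config file has a usable
# "output" line. "check_config" is a hypothetical helper name.
check_config() {
    urls=$(grep -c '^url = ' "$1")
    outputs=$(grep -c '^output = ' "$1")
    # An output of just "./" means no file path could be derived from the URL.
    empty=$(grep -c '^output = "\./"$' "$1")
    echo "$urls url lines, $outputs output lines, $empty empty output paths"
    [ "$urls" -gt 0 ] && [ "$urls" -eq "$outputs" ] && [ "$empty" -eq 0 ]
}

if [ -f curl_config.txt ]; then
    check_config curl_config.txt || echo "curl_config.txt looks incomplete"
fi
```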
Now copy and paste the following command into the terminal window and press <ENTER> to run it.
curl -C - --create-dirs --parallel --config curl_config.txt
When the files have downloaded, a new sub-folder will be inside the folder you ran the command from, e.g. “~/downloads/000009572v001”. The new folder will contain two sub-folders, “data” and “metadata”. The “data” folder will contain the files of the collection, and the “metadata” folder will contain some files that include the licence, citation details, and checksums for validating the download.
This example uses the following options:
- -C - means “continue-at”, with the dash at the end instructing curl to automatically detect where to continue downloading a file. This is useful if your download is interrupted. If your download encounters errors, you can run the same command again and the file downloads will continue from where they were instead of restarting.
- --create-dirs will create any missing folders in the destination path. Without this you will get many errors reported.
- --parallel instructs curl to download more than one file at a time. The default number of concurrent downloads is 50, which you can modify using the --parallel-max option. This helps improve download speeds when you are downloading multiple files, but does not appear to help when downloading a single very large file.
- --config lets you specify a list of options. In this case we are specifying the file created earlier with a list of URLs and their output destination.
curl – Windows
If you’re using Windows you have some options for using curl:
- Use the Microsoft-provided version of curl.exe to download the files. Note that if you want to run curl from a PowerShell terminal you need to call curl.exe instead of just curl, due to the PowerShell alias for curl. You will need to run some PowerShell commands to convert the text file you downloaded first.
- Download and install the official curl binaries for Windows.
- Use Windows Subsystem for Linux and follow the guide for macOS and Unix-like systems. You should be able to run the same commands, although if the distribution you select does not have curl installed by default you may need to then install it, e.g. if you installed Ubuntu onto WSL you could run:
sudo apt install curl
This guide will focus on using the Microsoft-provided version of curl and PowerShell.
To create a --config file that is compatible with curl you need to do a few things:
- Get lines from the text file that start with “https” and have no leading whitespace.
- Derive the destination file path from the URL. By default curl will ignore folder structure and also include query parameters in output file names, so this makes dealing with the downloaded files much easier.
- URL decode any special characters in the derived destination path.
This can be done using PowerShell commands. You can copy and paste the below code sample, but ensure that you change the name of the input text file on the first line, e.g. here it’s 9572v001.txt, but the text file you download will have a different name. To do this you might want to copy and paste the code into a text editor first so you can edit the value and then copy/paste from your text editor to PowerShell.
Open a PowerShell window and navigate to the folder where you downloaded the text file with URLs. A simple way to do this is to:
- open the location in your File Explorer
- click in the address bar, type in “powershell” and then press <ENTER>
- A PowerShell window should open at the correct location.
Now run the following commands after modifying the name of the input file:
$input_file = "9572v001.txt"
$output_file = "curl_config.txt"
# Find the lines that contain the download URLs
$url_list = Select-String -Path .\$input_file -Pattern "^http.*"
# Load System.Web so that HttpUtility.UrlDecode is available
Add-Type -AssemblyName System.Web
$output = ""
foreach($url in $url_list){
    $url_string = $url.Line
    # Only handle URLs that contain the expected collection path
    if ($url_string -match "[0-9]{9}v[0-9]{3}/(data|metadata)/[^\?]*") {
        # $matches[0] holds the matched path within the URL
        $file_path = "./" + $matches[0]
        $decoded_file_path = [System.Web.HttpUtility]::UrlDecode($file_path)
        $output += "url = `"$url_string`"`n"
        $output += "output = `"$decoded_file_path`"`n`n"
    }
}
# Create (or overwrite) the config file, then write the collected entries to it
New-Item -Path . -Name "$output_file" -Force
Add-Content -Path $output_file -Value $output
One way to do this is to copy the above code sample and paste it into a text editor first, e.g. Notepad. In the text editor, modify the file name on the first line to match the text file you have downloaded. Now copy the modified code from the text editor and right-click in the PowerShell window to paste it.
When you paste in this code most of the commands will run automatically, but the final line may require you to press <ENTER> in order for it to run. This will produce a text file called “curl_config.txt” in the same folder that you were working from.
The contents of “curl_config.txt” will look similar to the following, except in this example the URL query parameters have been removed for brevity:
url = "https://s3.data.csiro.au/dapprd/000009572v001/data/164.a2.391.tif?...
output = "./000009572v001/data/164.a2.391.tif"
url = "https://s3.data.csiro.au/dapprd/000009572v001/data/164.a2.393.tif?...
output = "./000009572v001/data/164.a2.393.tif"
# etc...
Now copy and paste the following command into the PowerShell window and press <ENTER> to run it.
curl.exe -C - --create-dirs --parallel --config curl_config.txt
Note that in PowerShell on Windows it’s essential to run the command curl.exe instead of just curl, because these are technically two different commands. If you try to run just curl you will see an error:
Invoke-WebRequest : Parameter cannot be processed because the parameter name 'C' is ambiguous.
When you run curl.exe you can verify that the download is working by watching the progress that is displayed.
When the files have downloaded, a new sub-folder will be inside the folder you ran the command from, e.g. “E:\demo\000009572v001”. The new folder will contain two sub-folders, “data” and “metadata”. The “data” folder will contain the files of the collection, and the “metadata” folder will contain some files that include the licence, citation details, and checksums for validating the download.
The curl command example provided uses the following options:
- -C - means “continue-at”, with the dash at the end instructing curl to automatically detect where to continue downloading a file. This is useful if your download is interrupted. If your download encounters errors, you can run the same command again and the file downloads will continue from where they were instead of restarting.
- --create-dirs will create any missing folders in the destination path. Without this you will get many errors reported.
- --parallel instructs curl to download more than one file at a time. The default number of concurrent downloads is 50, which you can modify using the --parallel-max option. This helps improve download speeds when you are downloading multiple files, but does not appear to help when downloading a single very large file.
- --config lets you specify a list of options. In this case we are specifying the file created earlier with a list of URLs and their output destination.