From time to time I am asked to help colleagues with something that should be very trivial. In this case it was to download genomes from the NCBI FTP site, specifically the Bacteria genomes. The problem with NCBI is that they keep changing how they store the genomes on their servers. Some time ago, when I had previously done this task, all the genomes were in a single tar.gz file. It was a huge file, but it was one file. If you weren't so lucky, e.g. with the Fungi genomes, every single genome was put in a separate folder as an uncompressed GenBank file. This, per se, was not a problem. To download all of them you just needed the base folder and a simple wget command:
wget -r -A '*.gbk' --no-parent --no-verbose ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/
Sure, wget creates a local folder structure that mirrors the one on the FTP server, including a base ftp.ncbi.nlm.nih.gov folder. But that is a minor annoyance because, after all, you have successfully downloaded the data.
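If even that annoys you, wget can trim the extra levels. A possible variant of the command above (just two extra flags, otherwise unchanged):
# -nH drops the ftp.ncbi.nlm.nih.gov folder, --cut-dirs=1 drops the genomes level,
# so everything lands directly under Fungi/
wget -r -nH --cut-dirs=1 -A '*.gbk' --no-parent --no-verbose ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/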
And then someone at NCBI was clever enough to decide they should have more info for each genome. Now for a single organism you have a folder structure that looks like this:
genomes
+---genbank
    +---bacteria
        +---organism_name
            +---all_assembly_versions
            +---latest_assembly_versions
            +---representative
And all the folders at the last level contain folders named like [Assembly accession.version]_[assembly name], without the square brackets. And yes, because some of those folders would contain the same data, they use symlinks that actually point into a folder named all, located 4 levels down, and that is the place where the real genomes are stored. The problem is that wget doesn't know how to deal with symlinked folders. Symlinked files it downloads without a fuss, but for folders it doesn't work. The docs clearly state: "Currently, Wget does not traverse symbolic links to directories to download them recursively, though this feature may be added in the future." You also can't just download all the files from the all folder, because it contains data for obsolete genomes that you don't need and, worst of all, all the genomes are mixed together inside it (fungi, protozoa, bacteria, etc.).
After spending a couple of hours googling for something useful I found that wget can retrieve the symlinks themselves using the --retr-symlinks=no option. Using it causes wget to create the same symlink locally. Once the symlink is created you can use the readlink command to see where it points. Using this information and the base path of the symlinked folder you can obtain the folder that really contains this organism. Even better, as they use a naming convention, you can deduce the file name, skipping the retrieval of the entire folder.
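For a single symlink, after that first wget run has recreated it locally, the whole trick looks roughly like this (the organism and the accession here are hypothetical, just to show the shape of the data):
SYMLINK=ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions/GCA_000005845.2_ASM584v2
readlink "$SYMLINK"   # prints the link target, i.e. the path of the real assembly folder inside all
# and the naming convention tells us the file we want there is GCA_000005845.2_ASM584v2_genomic.gbff.gz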
The following code demonstrates the retrieval of the Bacteria genomes in the gbff (GenBank flat file) format.
URL=ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/
# Strip the protocol to get the local folder that wget will create
DIRPATH=$(echo "$URL" | sed 's|ftp://||')

# Mirror the tree, keeping the symlinks as symlinks instead of skipping them
wget -r --retr-symlinks=no -R '*.txt' --no-parent --no-verbose "$URL"

# Keep only the symlinks that point to the latest assembly version of each organism
SYMLINKS=$(find "$DIRPATH" -type l | grep latest_assembly_versions)

for SYMLINK in $SYMLINKS
do
    LINK_FOLDER=$(dirname "$SYMLINK")   # folder that contains the symlink
    FILE_NAME=$(basename "$SYMLINK")    # [Assembly accession.version]_[assembly name]
    LINK_TARGET=$(readlink "$SYMLINK")  # where the symlink points to
    # The naming convention lets us build the URL of the genome file directly
    REAL_FILE_URL="ftp://${LINK_FOLDER}/${LINK_TARGET}/${FILE_NAME}_genomic.gbff.gz"
    wget "$REAL_FILE_URL"
done
The solution is not perfect, but it works. The first wget command takes a really long time to execute because it visits every single folder, and there are more than 10 000 folders for the Bacteria genomes. The second wget is executed once per symlinked folder that contains the latest assembly version, and since there are quite a lot of those folders, quite a lot of wget commands get issued. That second part can be batched by writing the file URLs that need to be downloaded into a file; combining that file with GNU Parallel and wget makes the download much faster. The only problem is that I don't know how to speed up the first wget, because it takes about one second per visited folder and that is too much.
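A sketch of that batched variant (the file name genome_urls.txt and the number of jobs are arbitrary choices): inside the loop, replace the final wget call with an echo that collects the URLs, then let GNU Parallel run several wget processes at once:
# In the loop: collect the URLs instead of downloading them one by one
echo "$REAL_FILE_URL" >> genome_urls.txt
# Afterwards: download with 8 wget processes in parallel
parallel -j 8 wget --no-verbose {} < genome_urls.txt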