The OpenIMAJ FlickrCrawler tools enable you to download large collections of images from Flickr, via the Flickr API, for experimentation purposes. The FlickrCrawler tools are implemented as simple Groovy scripts, and as such require that you have Groovy version 1.7 or later installed on your system.
The FlickrCrawler.groovy script is the main tool for downloading images using the flickr.photos.search API. It has a number of useful features, most of which are controlled through the options of its configuration file, described below.
The FlickrCrawler.groovy script is invoked from the command-line as follows:
groovy FlickrCrawler.groovy config_file
where config_file is the path to the configuration file that describes the parameters of your crawl, as described below.
The FlickrCrawler.groovy configuration file is a simple text file that contains the information the crawler needs to find the relevant images to download. A complete configuration file will look like the following:
crawler { apikey="ENTER_YOUR_FLICKR_API_KEY_HERE" //your flickr api key secret="ENTER_YOUR_FLICKR_API_SECRET_HERE" //your flickr api secret apihitfreq=1000 //number of milliseconds between api calls hitfreq=1000 //number of milliseconds between retries of failed downloads outputdir="crawl-data" //name of directory to save images and data to maximages=-1 //limit the number of images to be downloaded; -1 is unlimited maxRetries=3000 //maximum number of retries after failed api calls force=false //force re-download of duplicate images perpage=500 //number of results to request from the api per call queryparams { //the parameters describing the query } concurrentDownloads=16 //max number of concurrent image downloads pagingLimit=20 //max number of pages to look through maxretrytime=300000 //maximum amout of time between retries data { info=true //download all the information about each image exif=true //download all the exif information about each image } images { targetSize=["large","original"] //preferred image sizes in order smallSquare=false //should small square images be downloaded thumbnail=false //should thumbnail images be downloaded small=false //should small images be downloaded medium=false //should medium images be downloaded large=false //should large images be downloaded original=false //should original size images be downloaded } }
In practice, however, the crawler has sensible defaults for most of the configuration options and many of them can be omitted. For most crawls, the important parts of the configuration are the crawler.apikey, crawler.secret, crawler.outputdir and crawler.queryparams settings.
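As an illustration, a minimal configuration that relies on the built-in defaults for everything else might look something like the following sketch (the outputdir name and the tag query are just placeholders, not part of the standard configuration):

crawler {
    apikey="ENTER_YOUR_FLICKR_API_KEY_HERE"
    secret="ENTER_YOUR_FLICKR_API_SECRET_HERE"
    outputdir="my-crawl"          //placeholder output directory
    queryparams {
        tags=["sunset"]           //placeholder query: images tagged "sunset"
    }
}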
The following examples demonstrate practical usage of FlickrCrawler.groovy.
The following configuration can be used to download all of the geo-tagged images from Southampton, UK that are licensed under the Creative Commons Attribution-NonCommercial license:
crawler { apikey="..." secret="..." outputdir="southampton-cc" queryparams { woeId="35356" //from flickr.places.find license="2" //from flickr.photos.licenses.getInfo } data { info=false exif=false } images { targetSize=["large", "original", "medium"] } }
The important parts of the configuration are crawler.queryparams.woeId, which tells the crawler to find images with the specified Flickr where-on-earth identifier, and crawler.queryparams.license, which specifies the license requirements for the downloaded images. Specific woeIds can be looked up using the flickr.places.find explorer page, and the mapping between actual licenses and license identifiers can be found on the flickr.photos.licenses.getInfo explorer page.
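If you would rather look up woeIds from the command line than through the explorer page, a short Groovy script along the following lines can query the flickr.places.find REST method directly. This is only a sketch and is not part of the FlickrCrawler tools; the place-name query string is a placeholder.

// Sketch: look up Flickr where-on-earth ids (woeIds) for a place name
// by calling the flickr.places.find REST method directly.
def apikey = "ENTER_YOUR_FLICKR_API_KEY_HERE"
def query  = URLEncoder.encode("Southampton, UK", "UTF-8")   //placeholder place name
def url    = "https://api.flickr.com/services/rest/" +
             "?method=flickr.places.find&api_key=${apikey}&query=${query}"

// The default response format is XML; print the woeId, place type and name of each match
def rsp = new XmlSlurper().parseText(new URL(url).text)
rsp.places.place.each { place ->
    println "${place.@woeid}\t${place.@place_type}\t${place}"
}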
The following configuration illustrates how the FlickrCrawler.groovy script can be made to download 100 images tagged with “city” but not “night”:
crawler { apikey="..." secret="..." outputdir="city-not-night" maximages=100 queryparams { tags=["city", "-night"] tagMode="bool" } data { info=false exif=false } images { targetSize=["large", "original", "medium"] } }
The crawler.queryparams part is self-explanatory. It should be noted, however, that the Flickr API will not allow you to search with only negative terms, so it isn't possible to search for just "not night".
As the crawler runs it will download images into a directory structure inside the outputdir specified in the configuration. In addition to the images, the directory contains a number of other files relating to the crawl, including the images.csv file used by the DownloadMissingImages.groovy script described below.
Sometimes the FlickrCrawler will fail to download some images (for example, because of network issues). The DownloadMissingImages.groovy script will parse the images.csv file from a crawl and automatically attempt to download any missing images. Usage is simple; just run the script with the path to the crawl output directory (the outputdir specified in your original configuration):
groovy DownloadMissingImages.groovy crawldir
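For example, to retry any images that failed to download during the Southampton crawl above:

groovy DownloadMissingImages.groovy southampton-cc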