Files referenced in the article:
Store Data from App to CSV File
Download URLs from CSV File

Hi there! If you're one of the many folks who have used our Custom Training or Visual Search tools, then at some point you've uploaded at least a few images to our cloud to train on or search through. These images, or inputs as we like to call them, are certainly safe and snug there, but what if you want to retrieve all of them and download them back to your local computer? Wouldn't that be nifty?

Often we answer this question by pointing to our GET Inputs endpoint, which is great and all, but it's paginated with a 1,000-image-per-page limit, so if you have 125k images that's kind of a pain. It also returns a lot of extra data in JSON format that you may or may not need.
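
For context, here's roughly what paging through that endpoint by hand looks like. This is a minimal sketch using the requests library; the URL, header format and response shape here follow the v2 REST API, and YOUR_API_KEY is a placeholder for your app's key:

import requests

API_KEY = 'YOUR_API_KEY'  # placeholder: the API key tied to your application
BASE_URL = 'https://api.clarifai.com/v2/inputs'  # the GET Inputs endpoint

urls = []
page = 1
while True:
  # each request returns at most 1,000 inputs, so a 125k-image app means 125+ calls
  response = requests.get(
      BASE_URL,
      headers={'Authorization': 'Key ' + API_KEY},
      params={'page': page, 'per_page': 1000})
  inputs = response.json().get('inputs', [])
  if not inputs:
    break
  urls.extend(inp['data']['image']['url'] for inp in inputs)
  page += 1

print('collected %d input urls' % len(urls))

The script below does essentially this for you through our Python client, and pulls out concepts and metadata along the way.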

For the sake of this article, let's examine how we can get all of the image URLs and concepts/metadata into a single CSV file, and then, if desired, actually download all of the images back to your local machine!


Step 1: Retrieve the Data

The first thing that we want to do here is pull all of our data from a specific Clarifai application, and we can do that with the following Python file (the first one linked above if you want to download it directly):

Requirements: the clarifai, pandas and tqdm modules (argparse and json are part of the standard library)
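
If you don't already have those installed, a pip line along these lines should cover it (note that this script was written against the older 2.x version of our Python client, which is where clarifai.rest lives, so you may need to pin that major version):

pip install clarifai pandas tqdm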

import argparse
import json
import pandas as pd

from clarifai.rest import ClarifaiApp
from tqdm import tqdm

def app2csv(api_key, output_name, url_only, desired_concepts):
  app = ClarifaiApp(api_key=api_key)

  # figure out how many pages we need to request (1,000 inputs per page)
  status = app.inputs.check_status()
  total_count = status.errors + status.processed + status.to_process
  max_page = total_count // 1000 + 1

  parsed_inputs = []

  for n in tqdm(range(1, max_page + 1)):
    page_results = app.inputs.get_by_page(page=n, per_page=1000)

    for page_result in page_results:
      url = page_result.url

      concepts = page_result.concepts
      if concepts is not None:
        concepts = ':'.join(concepts)

      not_concepts = page_result.not_concepts
      if not_concepts is not None:
        not_concepts = ':'.join(not_concepts)

      metadata = page_result.metadata
      if metadata is not None:
        metadata = json.dumps(metadata)

      temp_dict = {
          'url': url,
          'concepts': concepts,
          'not concepts': not_concepts,
          'metadata': metadata
      }
      parsed_inputs.append(temp_dict)

  df = pd.DataFrame(parsed_inputs)
  df = df[['url', 'concepts', 'not concepts', 'metadata']]

  if desired_concepts != 'ALL':
    # build a regex like 'concept1|concept2' and keep rows tagged with any of them
    desired_list = desired_concepts.replace(',', '|')
    df = df[df['concepts'].str.contains(desired_list, na=False) | df['not concepts'].str.contains(
        desired_list, na=False)]

  # drop down to just the url column last, so the concept filter above still works
  if url_only is True:
    df = df[['url']]

  df = df.dropna(axis=1, how='all')
  df.to_csv(output_name, index=False, encoding='utf8')


def main():
  parser = argparse.ArgumentParser(
      description="Read an app's inputs and spit out a dst upload ready csv file.")
  parser.add_argument('--api_key', required=True, help='the api key for the application', type=str)
  parser.add_argument('--output_name', required=True, help='name of the output csv', type=str)
  parser.add_argument('--url_only', action='store_true', help='call if you only want the urls')
  # note: below flag will grab inputs with the desired concept(s) + all other tags attached to that input
  parser.add_argument(
      '--desired_concepts',
      help='comma separated list of concepts to grab',
      type=str,
      default='ALL')

  args = parser.parse_args()

  app2csv(args.api_key, args.output_name, args.url_only, args.desired_concepts)


if __name__ == "__main__":
  main()


This code may look a little intimidating at first, but it's actually very easy to call!

python app_to_csv_upload_ready.py --api_key='YOUR_API_KEY' --output_name='OUTPUT_FILE.csv' --url_only --desired_concepts='concept1,concept2,concept3'


The parameters here:

  • api_key is the API key tied to your application
  • output_name is what you want the resulting CSV file to be called
  • url_only is a flag: pass it to receive only the image URLs, or leave it out to also get concepts and metadata where present
  • desired_concepts filters the export down to inputs tagged with the listed concepts; matching inputs keep all of their other tags as well (there's a quick read-back sketch after this list)
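
Once you've run the export, a quick way to sanity-check it is to read the CSV back with pandas. Here's a minimal sketch assuming you did not pass --url_only and used a hypothetical output name of my_app_inputs.csv; concepts come back as colon-separated strings and metadata as a JSON string, exactly as the script above writes them:

import json

import pandas as pd

df = pd.read_csv('my_app_inputs.csv')  # hypothetical --output_name from above

# concepts were joined with ':' by the export, so split them back into lists
if 'concepts' in df.columns:
  df['concepts'] = df['concepts'].fillna('').apply(lambda s: s.split(':') if s else [])

# metadata was written with json.dumps, so parse it back into dicts
if 'metadata' in df.columns:
  df['metadata'] = df['metadata'].apply(lambda s: json.loads(s) if isinstance(s, str) else {})

print(df.head())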

And that's pretty much it! If you'd then like to download the images themselves, read on to Step 2.


Step 2: Download the Data!

Once we have all of our URLs in a CSV file, it's pretty easy to download the images to your local machine. This next code snippet is the second file linked at the top.

Requirements: the os, sys, urllib and csv modules (all part of the standard library; note that this script is written for Python 2)

import os
import sys
import urllib
import csv

try:
  filename = sys.argv[1]
  url_name = sys.argv[2]
except:
  print "\nERROR: Please specify filename and url column name to download\n"
  print "Usage:"
  print " $ bulk_download_urls_from_csv.py data.csv image_url\n"
  print "- First param should be the csv file path"
  print "- Second param should be the column name that has image urls to download\n"
  sys.exit(0)

# open csv file to read
with open(filename, 'r') as csvfile:
  csv_reader = csv.reader(csvfile)
  # iterate on all rows in csv
  for row_index, row in enumerate(csv_reader):
    # find the url column name to download in first row
    if row_index == 0:
      IMAGE_URL_COL_NUM = None
      for col_index, col in enumerate(row):
        # find the index of column that has urls to download
        if col == url_name:
          IMAGE_URL_COL_NUM = col_index
      if IMAGE_URL_COL_NUM is None:
        print "\nERROR: url column name '"+url_name+"' not found, available options:"
        for col_index, col in enumerate(row):
          print " " + col
        print "\nUsage:"
        print " $ bulk_download_urls_from_csv.py data.csv image_url\n"
        sys.exit(0)
      continue
    # find image urls in rows 1+
    image_url = row[IMAGE_URL_COL_NUM]
    # check if we have an image URL and download
    if image_url != '' and image_url != "\n":
      image_filename = image_url.split('/')[-1].split('?')[0]
      directory = filename.split('.csv')[0] + "-" + url_name
      if not os.path.exists(directory):
        os.makedirs(directory)
      try:
        urllib.urlretrieve(image_url, directory+'/'+image_filename)
        print "["+str(row_index)+"] Image saved: " + image_filename
      except:
        # second attempt to download if failed
        try:
          urllib.urlretrieve(image_url, directory+'/'+image_filename)
          print "["+str(row_index)+"] Image saved: " + image_filename
        except:
          print "["+str(row_index)+"] Could not download url: " + image_url
    else:
      print "["+str(row_index)+"] No " + url_name


And calling this is pretty easy as well:

python bulk_download_urls_from_csv.py data.csv url_column_name


The parameters here:

  • A filename that you want to download from (the CSV file we created in Step 1), and
  • The name of the column header that contains the URLs. Whether or not you limited the Step 1 export to URLs only doesn't matter; you just need to point at the column that holds them (it's named url in the exported file).

And that's it! Once this second file is run, it'll create a folder named after your CSV file and URL column, right alongside the CSV, and all of your images will be saved in there.
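
One heads-up: the downloader above is written for Python 2 (print statements, urllib.urlretrieve). If you're on Python 3, here's a minimal equivalent sketch; it assumes the CSV from Step 1 and skips the retry logic and friendlier error messages of the full script:

import csv
import os
import sys
from urllib.request import urlretrieve

filename, url_name = sys.argv[1], sys.argv[2]
directory = filename.split('.csv')[0] + '-' + url_name
os.makedirs(directory, exist_ok=True)

with open(filename, newline='') as csvfile:
  for row in csv.DictReader(csvfile):
    image_url = (row.get(url_name) or '').strip()
    if not image_url:
      continue
    image_filename = image_url.split('/')[-1].split('?')[0]
    try:
      urlretrieve(image_url, os.path.join(directory, image_filename))
      print('Saved: ' + image_filename)
    except Exception as e:
      print('Could not download ' + image_url + ': ' + str(e))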

Give this a try and let us know what you think!
