Howdy there! If you're tired of uploading your dataset to Clarifai one image or file at a time, there's an easier way: the programmatic way! This article shows just the code you need for batching, and assumes you have already set up authorization and have an application ready.
A few notes about batching:
- The optimal batch size is 32
- Calls are asynchronous, so multiple batches can be in flight at the same time.
- This is much faster and more efficient than uploading one file at a time. The Server Gods will be smiling down upon you for it
- If your files are fairly large (perhaps in the 1+ MB range) you may need to lower the batch size to avoid broken-connection errors
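The batching pattern itself is simple: slice your inputs into chunks of at most 32 and send each chunk as one call. Here is a plain-Python sketch of that chunking (the names are illustrative, not part of the Clarifai client):

```python
def batches(items, batch_size=32):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 70 inputs split into batches of 32, 32, and 6.
sizes = [len(b) for b in batches(list(range(70)))]
print(sizes)  # [32, 32, 6]
```

Note that the last batch is usually smaller than 32 — don't forget to send it too.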
In this section, we'll show how to batch when adding inputs with tags and/or custom metadata, or when requesting predictions, if that's what floats your boat.
You can see in detail which clients our API supports here.
Python gRPC client - Image URLs and metadata from a CSV file
Please see the documentation on our Python gRPC client and how to set it up here.
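The script below reads its rows with `csv.DictReader`. As a quick, self-contained illustration of the CSV shape it expects (the column names `url_column`, `field1`, etc. are placeholders you would rename to match your file):

```python
import csv
import io

# A two-row stand-in for PATH_TO_CSV_FILE.
sample = io.StringIO(
    "url_column,field1,field2,field3\n"
    "https://example.com/cat.jpg,cats,indoor,2021\n"
    "https://example.com/dog.jpg,dogs,outdoor,2022\n"
)

# Each row comes back as a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(rows[0]["url_column"])  # https://example.com/cat.jpg
print(rows[1]["field1"])      # dogs
```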
import csv
from google.protobuf.struct_pb2 import Struct
import time
from clarifai_grpc.grpc.api import service_pb2_grpc, service_pb2, resources_pb2
from clarifai_grpc.grpc.api.status import status_code_pb2
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
channel = ClarifaiChannel.get_insecure_grpc_channel()
stub = service_pb2_grpc.V2Stub(channel)
tic = time.time()
file_address = 'PATH_TO_CSV_FILE'
metadata = (('authorization', 'Key YOUR_API_KEY'),)
with open(file_address, mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)  # the header row is consumed automatically
    inputs = []
    count = 0
    for row in csv_reader:
        input_metadata = Struct()
        # You can add as many metadata fields as you like; rename
        # "metadata_fieldx" and "fieldx" to match your CSV.
        input_metadata.update({
            "metadata_field1": row["field1"],
            "metadata_field2": row["field2"],
            "metadata_field3": row["field3"],
        })
        inputs.append(
            resources_pb2.Input(
                # Optional: set id from a column in your CSV file,
                # or delete this line to let Clarifai generate one.
                id=row["id_column"],
                data=resources_pb2.Data(
                    image=resources_pb2.Image(
                        # Change "url_column" to the name of the column
                        # that holds the URL in your CSV file.
                        url=row["url_column"]
                    ),
                    metadata=input_metadata
                )
            )
        )
        if len(inputs) == 32:
            post_input_results_response = stub.PostInputs(
                service_pb2.PostInputsRequest(inputs=inputs),
                metadata=metadata
            )
            if post_input_results_response.status.code != status_code_pb2.SUCCESS:
                print('PostInputs failed:', post_input_results_response.status.description)
            inputs = []
            count += 1
            if count % 10000 == 0:
                time.sleep(1)
            print(count, ' batches of 32 processed')
            # print(32 * count, 'images processed\n\n')
    # Don't forget the final, partial batch.
    if inputs:
        post_input_results_response = stub.PostInputs(
            service_pb2.PostInputsRequest(inputs=inputs),
            metadata=metadata
        )
toc = time.time()
hours, rem = divmod(toc - tic, 3600)
minutes, seconds = divmod(rem, 60)
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
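If you do hit broken-connection errors on large files, wrapping each batch upload in a retry with exponential backoff usually helps. This is a generic sketch, not a Clarifai API — `post_batch` and the use of `ConnectionError` are stand-ins for your own upload call and whatever exception your client raises:

```python
import time

def post_with_retry(post_batch, batch, max_attempts=3, base_delay=1.0):
    """Call post_batch(batch), retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return post_batch(batch)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Wait 1s, 2s, 4s, ... between attempts.
            time.sleep(base_delay * (2 ** attempt))

# Example with a fake uploader that fails twice, then succeeds.
calls = {"n": 0}
def flaky_post(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("dropped")
    return "ok"

result = post_with_retry(flaky_post, [1, 2, 3], base_delay=0.01)
print(result)  # ok
```

Combined with a smaller batch size, this makes long-running uploads far more resilient.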