datasets
An ML problem typically involves multiple datasets: at a bare minimum, a training set and a test set. For this reason, you can define multiple datasets per Efemarai project. Each dataset can then be used for baseline evaluation or stress testing.
Example
Here is a quick example defining two datasets: one remote and one to be uploaded to Efemarai.
datasets:
  - name: GTSDB Test
    format: COCO
    stage: test
    data_url: "s3://GTSDB/images"
    annotations_url: "s3://GTSDB/instances.json"
    credentials: "${AWS_KEY}:${AWS_SECRET}" # environment variables
    num_datapoints: 1000
  - name: Road Signs Dataset
    format: COCO
    stage: train
    data_url: "road-signs-dataset/images"
    annotations_url: "road-signs-dataset/images/instances.json"
    upload: True
Properties
datasets contains an array where each element defines a dataset and has the following properties:
- name: unique name of the dataset.
- format: format of the dataset. Supported formats:
  - COCO - standard COCO format
  - ImageNet - ImageNet format
  - tfrecord - a subset of the TFRecords format
- stage: specifies the stage this dataset is used for. Supported stages:
  - train
  - validation
  - test
- data_url: URL where the image files can be found. Depending on the URL scheme and credentials (see below), the data files can be fetched from various locations, including:
  - s3://bucket/key - AWS bucket
  - gs://bucket/key - GCP bucket
  - azure://bucket/key - Azure bucket
  - hdfs://path/file - Hadoop file system
  - ssh://path/file - via SSH
  - scp://path/file - via SSH
  - sftp://path/file - via SSH
  - local/path/file - requires upload=True (see below)
- annotations_url: URL where the annotations JSON file is located. It can be fetched from the same kinds of locations as data_url (see above).
- credentials (optional): a string of the form username:password or access_token that is used for authentication when accessing the dataset remotely (e.g. from an S3 bucket). To avoid keeping credentials in plain text, you can provide them as environment variables, as shown in the example above.
- upload (optional): uploads the dataset to Efemarai. Typically set to True when either data_url or annotations_url points to a local file on the user's machine.
- num_datapoints (optional): truncates the dataset to the first num_datapoints items. This is useful when you work with large datasets but want to run quick tests.
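
The properties above can also be combined in ways the earlier example does not show. The following is a minimal sketch, not taken from the Efemarai documentation, of a validation dataset fetched from a GCP bucket and authenticated with an access token instead of a username:password pair. The dataset name, bucket paths and the GCP_ACCESS_TOKEN variable name are assumptions made for illustration only.

```yaml
datasets:
  # Hypothetical dataset: name, bucket and paths are placeholders.
  - name: Road Signs Validation
    format: COCO
    stage: validation
    data_url: "gs://example-bucket/road-signs/images"
    annotations_url: "gs://example-bucket/road-signs/instances.json"
    # Access-token form of credentials, read from an environment
    # variable (the variable name is an assumption).
    credentials: "${GCP_ACCESS_TOKEN}"
    # Keep only the first 200 items for a quick run.
    num_datapoints: 200
```

Because both URLs point to a remote bucket, upload is left out here. It would typically be set to True only if one of the URLs pointed to a local path.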