datasets

When you work on an ML problem you typically use multiple datasets – a train set and a test set at the very least. That’s why you can define multiple datasets per Efemarai project. Each dataset can then be used for baseline evaluation or stress testing.

Example

Here is a quick example defining two datasets – one fetched from a remote bucket and one uploaded to Efemarai from the local machine.

datasets:
  - name: GTSDB Test
    format: COCO
    stage: test
    data_url: "s3://GTSDB/images"
    annotations_url: "s3://GTSDB/instances.json"
    credentials: "${AWS_KEY}:${AWS_SECRET}"  # environment variables
    num_datapoints: 1000

  - name: Road Signs Dataset
    format: COCO
    stage: train
    data_url: "road-signs-dataset/images"
    annotations_url: "road-signs-dataset/images/instances.json"
    upload: True

Properties

datasets contains an array where each element defines a dataset and has the following properties:

  • name: unique name of the dataset

  • format: format of the dataset (e.g. COCO, as in the example above)

  • stage: specifies the stage this dataset is used for. Supported stages:

    • train

    • validation

    • test

  • data_url: URL where the image files can be found. Depending on the URL scheme and credentials (see below), the data files can be fetched from various locations including:

    • s3://bucket/key - AWS bucket

    • gs://bucket/key - GCP bucket

    • azure://bucket/key - Azure bucket

    • hdfs://path/file - Hadoop file system

    • ssh://path/file - via SSH

    • scp://path/file - via SSH

    • sftp://path/file - via SSH

    • local/path/file - a local path; requires upload: True (see below)

  • annotations_url: URL where the annotations JSON file is located. It can be fetched from the same kinds of locations as data_url (see above).

  • credentials (optional): a string of the form username:password or access_token used for authentication when accessing the dataset remotely (e.g. from an S3 bucket). To avoid keeping credentials in plain text, you can provide them through environment variables, as shown in the example above and in the sketch after this list.

  • upload (optional): upload the dataset to Efemarai - typically set to True when either data_url or annotations_url points to a local file on the user’s machine.

  • num_datapoints (optional): truncates the dataset to the first num_datapoints items. This is useful when you work with large datasets but want to run quick tests.
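
To illustrate how the optional properties fit together, here is a minimal sketch of one more dataset entry. The dataset name, the GCS bucket path, and the GCP_ACCESS_TOKEN environment variable are hypothetical and only show how a validation-stage dataset with an access-token credential and num_datapoints truncation could be declared:

datasets:
  - name: GTSDB Validation                              # hypothetical dataset name
    format: COCO
    stage: validation
    data_url: "gs://example-bucket/gtsdb/images"        # hypothetical GCS bucket
    annotations_url: "gs://example-bucket/gtsdb/instances.json"
    credentials: "${GCP_ACCESS_TOKEN}"                  # access token read from an environment variable
    num_datapoints: 200                                 # only the first 200 items are used

Keeping the access token in an environment variable rather than in the YAML file means the configuration can be committed to version control without exposing secrets.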