S3 Storage

Load data directly from Amazon S3 and compatible object stores.

Basic Usage

Text Only
data:
  source = "s3://my-bucket/data/prices.parquet"
  format = parquet

Authentication

Credentials can be supplied in any of the standard AWS ways.

Environment Variables

Bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

AWS Profile

Bash
export AWS_PROFILE=my-profile

IAM Role (EC2/ECS)

When running on AWS infrastructure, the attached IAM role is picked up automatically; no explicit credentials are required.
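
To confirm which identity the credential chain resolves to, you can query STS from the instance (assuming the AWS CLI is installed):

Bash
# Prints the account and role ARN in use
aws sts get-caller-identity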

Explicit Credentials

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    aws_access_key_id = "${AWS_ACCESS_KEY_ID}"
    aws_secret_access_key = "${AWS_SECRET_ACCESS_KEY}"
    region = "us-east-1"

S3 Options

Region

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    region = "us-west-2"

Endpoint (S3-Compatible)

For MinIO, DigitalOcean Spaces, and other S3-compatible services, set a custom endpoint:

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    endpoint = "https://nyc3.digitaloceanspaces.com"
    region = "nyc3"

Path Style

Some S3-compatible services require path-style addressing (https://endpoint/bucket/key) rather than the virtual-hosted style (https://bucket.endpoint/key):

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    path_style = true  # Use path-style URLs (bucket in path)

Loading Multiple Files

Glob Pattern

Text Only
data:
  source = "s3://bucket/data/*.parquet"
  format = parquet

Prefix

Text Only
data:
  source = "s3://bucket/data/prices/"
  format = parquet
  options:
    recursive = true  # Include subdirectories

Partitioned Data

For datasets stored in date-partitioned layouts (see the directory structure under Best Practices), enable partitioned and optionally restrict the date range:

Text Only
data:
  source = "s3://bucket/data/prices/"
  format = parquet
  options:
    partitioned = true
    start_date = "2023-01-01"

Caching

sigc caches S3 data locally for faster subsequent loads:

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    cache = true                    # Enable caching (default)
    cache_dir = "/tmp/sigc_cache"   # Cache location
    cache_ttl = 3600                # TTL in seconds (1 hour)
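
Assuming the cache is stored as plain files under cache_dir as configured above, forcing a refresh is an ordinary shell operation:

Bash
# Drop the local cache; the next load fetches fresh data from S3
rm -rf /tmp/sigc_cache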

Disable Caching

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    cache = false  # Always fetch from S3

Performance

Parallel Downloads

Text Only
data:
  source = "s3://bucket/data/*.parquet"
  format = parquet
  options:
    max_connections = 8  # Parallel downloads

Range Requests

Because Parquet is columnar, sigc can fetch only the columns you declare, using HTTP range requests rather than downloading the whole file:

Text Only
data:
  source = "s3://bucket/large_file.parquet"
  format = parquet
  columns:
    date: Date
    ticker: Symbol
    close: Numeric as prices
  # Only downloads the 'date', 'ticker', and 'close' columns

Streaming

For very large files:

Text Only
data:
  source = "s3://bucket/huge_file.parquet"
  format = parquet
  options:
    streaming = true  # Stream instead of download

S3-Compatible Services

MinIO

Text Only
data:
  source = "s3://my-bucket/data.parquet"
  format = parquet
  options:
    endpoint = "http://localhost:9000"
    path_style = true
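
MinIO is accessed with the same AWS-style credentials. For a stock local development instance (the default minioadmin credentials; change them for anything real):

Bash
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin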

DigitalOcean Spaces

Text Only
data:
  source = "s3://my-space/data.parquet"
  format = parquet
  options:
    endpoint = "https://nyc3.digitaloceanspaces.com"
    region = "nyc3"

Backblaze B2

Text Only
data:
  source = "s3://my-bucket/data.parquet"
  format = parquet
  options:
    endpoint = "https://s3.us-west-002.backblazeb2.com"
    region = "us-west-002"

Cloudflare R2

Text Only
data:
  source = "s3://my-bucket/data.parquet"
  format = parquet
  options:
    endpoint = "https://account-id.r2.cloudflarestorage.com"

Error Handling

Access Denied

Text Only
Error: Access Denied to s3://bucket/file.parquet

Check (see the verification commands below):

  1. Credentials are set correctly
  2. Bucket policy allows access
  3. IAM permissions include s3:GetObject
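
A quick way to test these outside sigc is the AWS CLI (hypothetical bucket and key names):

Bash
# Which identity are the credentials resolving to?
aws sts get-caller-identity
# Can that identity read the object? Fails with a 403 if not.
aws s3api head-object --bucket my-bucket --key data/prices.parquet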

Bucket Not Found

Text Only
Error: Bucket 'bucket-name' does not exist

Check (see the command below):

  1. Bucket name is correct
  2. Region matches bucket region
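
The bucket's actual region can be checked with the AWS CLI (hypothetical bucket name):

Bash
# Returns the LocationConstraint, e.g. "us-west-2" (null means us-east-1)
aws s3api get-bucket-location --bucket bucket-name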

Timeout

Text Only
Error: Connection timeout

Solutions:

  1. Check network connectivity
  2. Increase timeout: options: timeout = 60
  3. Use a VPC endpoint when running inside AWS

Best Practices

1. Use Environment Variables

Bash
# .env or shell profile
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_DEFAULT_REGION=us-east-1

2. Enable Caching for Development

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    cache = true
    cache_ttl = 86400  # 24 hours

3. Use Parquet Format

Parquet's columnar format enables efficient partial reads from S3 (see Range Requests above); row-oriented formats such as CSV must be downloaded in full before filtering.

4. Partition Large Datasets

Text Only
s3://bucket/prices/
├── year=2023/
│   ├── month=01/
│   └── month=02/
└── year=2024/
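
A loader for that layout might look like the following sketch, reusing the partitioned and start_date options from above (assuming start_date prunes earlier partitions):

Text Only
data:
  source = "s3://bucket/prices/"
  format = parquet
  options:
    partitioned = true
    start_date = "2023-06-01"  # Skip partitions before this date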

5. Set Appropriate Timeouts

Text Only
data:
  source = "s3://bucket/data.parquet"
  format = parquet
  options:
    timeout = 120      # 2 minutes
    max_retries = 3

Example: Production Setup

Text Only
data:
  source = "s3://prod-data/market/prices/"
  format = parquet
  options:
    partitioned = true
    region = "us-east-1"
    cache = true
    cache_dir = "/var/cache/sigc"
    cache_ttl = 3600
    max_connections = 4
  columns:
    date: Date
    ticker: Symbol
    close: Numeric as prices
    volume: Numeric
    high: Numeric
    low: Numeric

signal momentum:
  emit zscore(ret(prices, 60))

portfolio main:
  weights = rank(momentum).long_short(top=0.2, bottom=0.2)
  backtest from 2020-01-01 to 2024-12-31

Security Considerations

  1. Never hardcode credentials - Use environment variables or IAM roles
  2. Use least privilege - Grant only s3:GetObject on specific buckets (see the policy sketch below)
  3. Enable encryption - Use server-side encryption (SSE-S3 or SSE-KMS)
  4. Use VPC endpoints - Keep traffic within AWS network
  5. Enable access logging - Audit who accesses your data
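
As an illustration of point 2, a minimal read-only policy might look like this sketch (hypothetical role, policy, and bucket names; adjust to your setup):

Bash
# s3:ListBucket applies to the bucket itself, s3:GetObject to its objects
cat > sigc-read-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::prod-data",
      "arn:aws:s3:::prod-data/*"
    ]
  }]
}
EOF
aws iam put-role-policy --role-name sigc-reader \
  --policy-name sigc-s3-read \
  --policy-document file://sigc-read-policy.json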
