S3 Storage¶
Load data directly from Amazon S3 and compatible object stores.
Basic Usage¶
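A minimal data block reading a Parquet file from S3; credentials are resolved as described under Authentication below:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet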
Authentication¶
Environment Variables (Recommended)¶
Bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
AWS Profile¶
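This page does not document the exact option for named profiles from ~/.aws/credentials; a plausible sketch, assuming a profile option exists:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        profile = "my-profile"  # assumed option name, not confirmed by this page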
IAM Role (EC2/ECS)¶
When running on AWS infrastructure (EC2, ECS), credentials from the instance or task IAM role are picked up automatically.
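In that case the data block needs no credential options; at most, set the region:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        region = "us-east-1"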
Explicit Credentials¶
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        aws_access_key_id = "${AWS_ACCESS_KEY_ID}"
        aws_secret_access_key = "${AWS_SECRET_ACCESS_KEY}"
        region = "us-east-1"
S3 Options¶
Region¶
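Set the bucket's region explicitly when it differs from AWS_DEFAULT_REGION:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        region = "eu-west-1"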
Endpoint (S3-Compatible)¶
For S3-compatible services such as MinIO or DigitalOcean Spaces, set a custom endpoint:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        endpoint = "https://nyc3.digitaloceanspaces.com"
        region = "nyc3"
Path Style¶
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        path_style = true  # Use path-style URLs (bucket in path)
Loading Multiple Files¶
Glob Pattern¶
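Use a wildcard in the key to match multiple objects:
Text Only
data:
    source = "s3://bucket/data/*.parquet"
    format = parquet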
Prefix¶
Text Only
data:
    source = "s3://bucket/data/prices/"
    format = parquet
    options:
        recursive = true  # Include subdirectories
Partitioned Data¶
Text Only
data:
    source = "s3://bucket/data/prices/"
    format = parquet
    options:
        partitioned = true
        start_date = "2023-01-01"
Caching¶
sigc caches S3 data locally for faster subsequent loads:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        cache = true                   # Enable caching (default)
        cache_dir = "/tmp/sigc_cache"  # Cache location
        cache_ttl = 3600               # TTL in seconds (1 hour)
Disable Caching¶
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        cache = false  # Always fetch from S3
Performance¶
Parallel Downloads¶
Text Only
data:
    source = "s3://bucket/data/*.parquet"
    format = parquet
    options:
        max_connections = 8  # Parallel downloads
Range Requests¶
Parquet's columnar layout means only the columns you select are read from S3, via HTTP range requests:
Text Only
data:
    source = "s3://bucket/large_file.parquet"
    format = parquet
    columns:
        date: Date
        ticker: Symbol
        close: Numeric as prices
        # Only downloads the 'date', 'ticker', and 'close' columns
Streaming¶
For very large files, stream the data instead of downloading the whole object first:
Text Only
data:
    source = "s3://bucket/huge_file.parquet"
    format = parquet
    options:
        streaming = true  # Stream instead of download
S3-Compatible Services¶
MinIO¶
Text Only
data:
    source = "s3://my-bucket/data.parquet"
    format = parquet
    options:
        endpoint = "http://localhost:9000"
        path_style = true
DigitalOcean Spaces¶
Text Only
data:
    source = "s3://my-space/data.parquet"
    format = parquet
    options:
        endpoint = "https://nyc3.digitaloceanspaces.com"
        region = "nyc3"
Backblaze B2¶
Text Only
data:
    source = "s3://my-bucket/data.parquet"
    format = parquet
    options:
        endpoint = "https://s3.us-west-002.backblazeb2.com"
        region = "us-west-002"
Cloudflare R2¶
Text Only
data:
    source = "s3://my-bucket/data.parquet"
    format = parquet
    options:
        endpoint = "https://account-id.r2.cloudflarestorage.com"
Error Handling¶
Access Denied¶
Check:
- Credentials are set correctly
- The bucket policy allows access
- IAM permissions include s3:GetObject
Bucket Not Found¶
Check:
- Bucket name is correct
- Region matches bucket region
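A region mismatch can surface as a missing bucket; pinning the region explicitly is a quick check:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        region = "us-east-1"  # must match the bucket's region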
Timeout¶
Solutions:
- Check network connectivity
- Increase the timeout: options: timeout = 60 (see the sketch below)
- Use a VPC endpoint for AWS
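For example, a sketch combining the timeout and retry options shown under Best Practices:
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        timeout = 60     # seconds
        max_retries = 3  # retry transient failures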
Best Practices¶
1. Use Environment Variables¶
Bash
# .env or shell profile
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_DEFAULT_REGION=us-east-1
2. Enable Caching for Development¶
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        cache = true
        cache_ttl = 86400  # 24 hours
3. Use Parquet Format¶
Parquet's columnar format enables efficient partial reads from S3 (see Range Requests above).
4. Partition Large Datasets¶
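Date-partitioned layouts let loads skip irrelevant slices entirely. A sketch reusing the options from Partitioned Data above (start_date presumably prunes earlier partitions):
Text Only
data:
    source = "s3://bucket/data/prices/"
    format = parquet
    options:
        partitioned = true
        start_date = "2023-01-01"  # load from this date onward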
5. Set Appropriate Timeouts¶
Text Only
data:
    source = "s3://bucket/data.parquet"
    format = parquet
    options:
        timeout = 120    # 2 minutes
        max_retries = 3
Example: Production Setup¶
Text Only
data:
    source = "s3://prod-data/market/prices/"
    format = parquet
    options:
        partitioned = true
        region = "us-east-1"
        cache = true
        cache_dir = "/var/cache/sigc"
        cache_ttl = 3600
        max_connections = 4
    columns:
        date: Date
        ticker: Symbol
        close: Numeric as prices
        volume: Numeric
        high: Numeric
        low: Numeric

signal momentum:
    emit zscore(ret(prices, 60))

portfolio main:
    weights = rank(momentum).long_short(top=0.2, bottom=0.2)

backtest from 2020-01-01 to 2024-12-31
Security Considerations¶
- Never hardcode credentials - Use environment variables or IAM roles
- Use least privilege - Grant only s3:GetObject on specific buckets
- Enable encryption - Use server-side encryption (SSE-S3 or SSE-KMS)
- Use VPC endpoints - Keep traffic within the AWS network
- Enable access logging - Audit who accesses your data
Next Steps¶
- PostgreSQL - Database integration
- Parquet Format - Parquet file details
- Data Quality - Validating your data