Garage S3 Storage
Configuration and usage of high-performance Garage S3 storage.
Garage is a self-hosted implementation of the Amazon S3 protocol (see Glossary). This guide covers configuration, performance optimization, and troubleshooting.
To get started quickly, see Quick Start Guide.
Overview
Connection details:
HTTP: http://s3fast.lafrance.io (port 3910) - Preferred in code
HTTPS: https://s3fast.lafrance.io (port 443, via reverse proxy)
Bucket: mangetamain
Performance: 500-917 MB/s (DNAT bypass)
Region: garage-fast
HTTP vs HTTPS choice:
The code uses the HTTP endpoint for performance reasons:
Speed gain: No TLS/SSL overhead during data transfers
Secured DMZ network: Communication remains within the isolated local network (192.168.80.0/24)
HTTPS available: Accessible via reverse proxy for external access if needed
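To confirm that both endpoints answer from a given host, a quick check such as the following can be used (a sketch assuming the requests package is installed; listing the bucket still requires credentials, so a 403 or S3 XML error response also proves the endpoint is reachable):
import requests

for url in ("http://s3fast.lafrance.io/mangetamain/",
            "https://s3fast.lafrance.io/mangetamain/"):
    # Any HTTP answer (200, 403, or an S3 XML error) means the endpoint is up
    response = requests.get(url, timeout=5)
    print(f"{url} -> HTTP {response.status_code}")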
Installation
Local DNS Configuration
echo "192.168.80.202 s3fast.lafrance.io" | sudo tee -a /etc/hosts
iptables-persistent Installation
sudo apt update
sudo apt install iptables-persistent -y
iptables DNAT Rule
Bypass reverse proxy for maximum performance:
sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 -j DNAT --to-destination 192.168.80.202:3910
Permanent Save
sudo netfilter-persistent save
Installation Verification
# DNS
getent hosts s3fast.lafrance.io
# Should display: 192.168.80.202 s3fast.lafrance.io
# iptables
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Should display the DNAT rule
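The DNS part of these checks can also be scripted in Python if needed (a sketch; it does not verify the iptables rule, which still requires the commands above):
import socket

# s3fast.lafrance.io should resolve to the Garage host declared in /etc/hosts
resolved = socket.gethostbyname("s3fast.lafrance.io")
assert resolved == "192.168.80.202", f"Unexpected IP: {resolved}"
print(f"s3fast.lafrance.io -> {resolved}")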
Credentials Configuration
96_keys/ Structure
96_keys/
├── credentials # s3fast profile
├── aws_config # AWS CLI config
└── garage_s3.duckdb # DuckDB database with S3 secret
credentials File
ConfigParser format:
[s3fast]
aws_access_key_id = GK4feb...
aws_secret_access_key = 50e63b...
endpoint_url = http://s3fast.lafrance.io
region = garage-fast
bucket = mangetamain
aws_config File
AWS CLI format:
[profile s3fast]
region = garage-fast
s3 =
    endpoint_url = http://s3fast.lafrance.io
DuckDB Database with Secret
Create once:
cd ~/mangetamain/96_keys
duckdb garage_s3.duckdb
In DuckDB:
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET s3fast (
TYPE s3,
KEY_ID 'your_access_key_id',
SECRET 'your_secret_access_key',
ENDPOINT 's3fast.lafrance.io',
REGION 'garage-fast',
URL_STYLE 'path',
USE_SSL false
);
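To inspect which secrets are visible from this database, DuckDB's duckdb_secrets() table function can be queried (a sketch; the exact columns available depend on the DuckDB version):
import os
import duckdb

# Re-open the database and list registered secrets (names only, no key material)
conn = duckdb.connect(os.path.expanduser("~/mangetamain/96_keys/garage_s3.duckdb"))
print(conn.execute("SELECT name, type FROM duckdb_secrets()").fetchall())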
AWS CLI Usage
List Files
aws s3 ls s3://mangetamain/ \
--endpoint-url http://s3fast.lafrance.io \
--region garage-fast
Download
aws s3 cp s3://mangetamain/PP_recipes.csv /tmp/recipes.csv \
--endpoint-url http://s3fast.lafrance.io \
--region garage-fast
Upload
aws s3 cp /tmp/results.csv s3://mangetamain/results/ \
--endpoint-url http://s3fast.lafrance.io \
--region garage-fast
Python boto3 Usage
Loading Credentials
import boto3
from configparser import ConfigParser
# Load credentials from 96_keys/
config = ConfigParser()
config.read('../96_keys/credentials')
s3 = boto3.client(
's3',
endpoint_url=config['s3fast']['endpoint_url'],
aws_access_key_id=config['s3fast']['aws_access_key_id'],
aws_secret_access_key=config['s3fast']['aws_secret_access_key'],
region_name=config['s3fast']['region']
)
List Objects
# List files with sizes
response = s3.list_objects_v2(Bucket='mangetamain')
for obj in response.get('Contents', []):
    print(f"{obj['Key']} - {obj['Size']/1e6:.1f} MB")
Download File
s3.download_file('mangetamain', 'PP_recipes.csv', '/tmp/recipes.csv')
Upload File
s3.upload_file('/tmp/results.csv', 'mangetamain', 'results/analysis.csv')
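To check that an object exists and read its size without downloading it, head_object returns metadata only (sketch reusing the s3 client created above):
# Metadata only: no object data is transferred
meta = s3.head_object(Bucket='mangetamain', Key='PP_recipes.csv')
print(f"PP_recipes.csv: {meta['ContentLength']/1e6:.1f} MB, last modified {meta['LastModified']}")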
DuckDB Usage
SQL Queries on S3
In CLI:
# Simple query
duckdb ~/mangetamain/96_keys/garage_s3.duckdb \
-c "SELECT COUNT(*) FROM 's3://mangetamain/PP_recipes.csv'"
# Analysis with GROUP BY
duckdb ~/mangetamain/96_keys/garage_s3.duckdb -c "
SELECT calorie_level, COUNT(*) as total
FROM 's3://mangetamain/PP_recipes.csv'
GROUP BY calorie_level
ORDER BY total DESC"
In Python:
import os
import duckdb
# Connect to the database that holds the S3 secret (DuckDB does not expand ~ itself)
conn = duckdb.connect(os.path.expanduser('~/mangetamain/96_keys/garage_s3.duckdb'))
# Direct SQL query on S3
df = conn.execute("""
SELECT *
FROM 's3://mangetamain/PP_recipes.csv'
LIMIT 1000
""").fetchdf()
Parquet on S3
DuckDB is optimized for Parquet:
# Read Parquet from S3 (zero-copy)
conn.execute("""
SELECT AVG(calories) as mean_calories
FROM 's3://mangetamain/RAW_recipes_clean.parquet'
WHERE year >= 2010
""")
Polars Usage
Direct S3 Reading
import polars as pl
from configparser import ConfigParser
# Load credentials
config = ConfigParser()
config.read('../96_keys/credentials')
# Configure storage options
storage_options = {
    'aws_endpoint_url': config['s3fast']['endpoint_url'],
    'aws_access_key_id': config['s3fast']['aws_access_key_id'],
    'aws_secret_access_key': config['s3fast']['aws_secret_access_key'],
    'aws_region': config['s3fast']['region'],
    # Depending on the Polars version, a plain-HTTP endpoint may also require
    # 'aws_allow_http': 'true' for the underlying object_store reader
}
# Read CSV from S3
df = pl.read_csv(
's3://mangetamain/PP_recipes.csv',
storage_options=storage_options
)
# Read Parquet from S3
df = pl.read_parquet(
's3://mangetamain/RAW_recipes_clean.parquet',
storage_options=storage_options
)
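For larger workloads, Polars can also scan Parquet lazily so that column selection and filters are pushed down to the S3 read (a sketch; column names are assumed from the other examples in this guide):
# Lazy scan: nothing is downloaded until .collect(), and only the needed
# columns and row groups are fetched (projection and predicate pushdown)
df = (
    pl.scan_parquet(
        's3://mangetamain/RAW_recipes_clean.parquet',
        storage_options=storage_options
    )
    .filter(pl.col('calories') < 500)
    .select(['name', 'calories'])
    .collect()
)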
Performance Tests
Download Benchmark
# Test with large file
time aws s3 cp s3://mangetamain/large_file.parquet /tmp/ \
--endpoint-url http://s3fast.lafrance.io \
--region garage-fast
Expected results:
With DNAT bypass: 500-917 MB/s
Without bypass (reverse proxy): 50-100 MB/s
Gain: 5-10x faster
DNAT Active Verification
# Check iptables rule
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Test direct connection port 3910
curl -I http://192.168.80.202:3910/mangetamain/
# Should return HTTP 200 or S3 XML error
Bucket Structure
File Organization
s3://mangetamain/
├── RAW_recipes.csv
├── RAW_recipes_clean.parquet
├── RAW_interactions.csv
├── RAW_interactions_clean.parquet
├── PP_recipes.csv
├── PP_users.csv
├── PP_ratings.parquet
├── interactions_train.csv
├── interactions_test.csv
└── interactions_validation.csv
File Sizes
| File | Size |
|---|---|
| RAW_recipes.csv | ~50 MB |
| RAW_recipes_clean.parquet | ~25 MB |
| RAW_interactions.csv | ~200 MB |
| RAW_interactions_clean.parquet | ~80 MB |
| PP_recipes.csv | ~30 MB |
| PP_ratings.parquet | ~60 MB |
Infrastructure Tests
Automated Tests (50_test/)
S3_duckdb_test.py (14 tests):
System environment (AWS CLI, credentials)
S3 connection with boto3
Download performance (>5 MB/s)
DuckDB + S3 integration
Docker tests (optional)
test_s3_parquet_files.py (5 tests; the scanning idea is sketched after this list):
Automatically scans code
Finds parquet file references
Tests S3 accessibility
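A minimal sketch of that scanning idea (illustrative only; the real test file in 50_test/ may be organized differently):
import re
from pathlib import Path

# Collect every s3://mangetamain/*.parquet reference found in the code base
pattern = re.compile(r"s3://mangetamain/\S+?\.parquet")
refs = set()
for py_file in Path("~/mangetamain").expanduser().rglob("*.py"):
    refs.update(pattern.findall(py_file.read_text(errors="ignore")))

# Each reference can then be checked with s3.head_object (see the boto3
# section above), which raises ClientError if the key is missing
print(sorted(refs))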
Run S3 Tests
cd ~/mangetamain/50_test
pytest S3_duckdb_test.py -v
Troubleshooting
Error: Cannot connect to S3
Possible causes:
DNS not configured
Missing iptables rule
Invalid credentials
Solution:
# Check DNS
getent hosts s3fast.lafrance.io
# Check iptables
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Test credentials
aws s3 ls s3://mangetamain/ \
--endpoint-url http://s3fast.lafrance.io \
--region garage-fast
Error: Slow Download Speed
Cause: DNAT bypass not active, traffic goes through reverse proxy
Solution: Check iptables rule
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# If absent, recreate rule
sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 -j DNAT --to-destination 192.168.80.202:3910
sudo netfilter-persistent save
Error: DuckDB Secret Not Found
Cause: S3 secret not created in DuckDB database
Solution: Recreate the secret
duckdb ~/mangetamain/96_keys/garage_s3.duckdb
DROP SECRET IF EXISTS s3fast;
CREATE SECRET s3fast (
TYPE s3,
KEY_ID 'your_key_id',
SECRET 'your_secret',
ENDPOINT 's3fast.lafrance.io',
REGION 'garage-fast',
URL_STYLE 'path',
USE_SSL false
);
Best Practices
Credentials Security
NEVER commit 96_keys/ (in .gitignore)
Share credentials via secure channel only
Regular key rotation
Performance
Prefer Parquet over CSV (2-3x faster)
Use DuckDB for SQL queries (zero-copy)
Enable DNAT bypass (10x faster)
Local cache for frequently accessed files (see the sketch below)
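A minimal local-cache sketch (path and naming are illustrative; the project may organize its cache differently):
from pathlib import Path

CACHE_DIR = Path("/tmp/s3_cache")

def cached_download(s3_client, key: str) -> Path:
    """Download a bucket object once, then reuse the local copy on later calls."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / key.replace("/", "_")
    if not local_path.exists():
        s3_client.download_file("mangetamain", key, str(local_path))
    return local_path

# Usage with the boto3 client built earlier:
# path = cached_download(s3, "RAW_recipes_clean.parquet")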
Streamlit Cache
import streamlit as st
import polars as pl

@st.cache_data(ttl=3600)  # 1-hour cache
def load_data_from_s3() -> pl.DataFrame:
    """Load S3 data with cache: the expensive S3 read runs only once per hour."""
    # Example file; storage_options as built in the Polars section above
    return pl.read_parquet("s3://mangetamain/RAW_recipes_clean.parquet",
                           storage_options=storage_options)
Performance Benchmarks
Configuration Comparison
Tests performed with recipes_clean.parquet (250 MB):
| Configuration | Speed | Time (250 MB) | Gain |
|---|---|---|---|
| Without DNAT (via reverse proxy) | 50-100 MB/s | 2.5-5 s | Baseline |
| DNAT bypass (direct port 3910) | 500-917 MB/s | 0.27-0.5 s | 10x |
| DNAT + local SSD read | 2-3 GB/s | 0.08-0.12 s | 40x |
Recommendation: DNAT bypass mandatory for acceptable performance.
Performance Test
Benchmark script:
#!/bin/bash
# test_s3_speed.sh
echo "=== Test without DNAT ==="
# Temporarily disable DNAT
sudo iptables -t nat -D OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
-j DNAT --to-destination 192.168.80.202:3910 2>/dev/null
time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test1.parquet --profile s3fast
rm /tmp/test1.parquet
echo "=== Test with DNAT ==="
# Re-enable DNAT
sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
-j DNAT --to-destination 192.168.80.202:3910
time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test2.parquet --profile s3fast
rm /tmp/test2.parquet
Expected results:
Without DNAT: real 0m4.520s (55 MB/s)
With DNAT: real 0m0.380s (658 MB/s)
Gain: 11.9x faster
Parquet Reading Optimization
Format comparison:
| Format | Size | Read time | Speed |
|---|---|---|---|
| CSV (uncompressed) | 1.2 GB | 12-15 s | 80-100 MB/s |
| CSV (gzip) | 320 MB | 8-10 s | 32-40 MB/s |
| Parquet (Snappy) | 250 MB | 0.3-0.5 s | 500-833 MB/s |
Why Parquet is optimal:
Integrated Snappy compression (ratio ~5:1)
Columnar format (selective reading)
Integrated metadata (no parsing needed)
Zero-copy with DuckDB/Polars
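The compression point can be reproduced locally when writing Parquet with Polars (a sketch with a toy DataFrame; Polars defaults to zstd, so Snappy is requested explicitly):
import polars as pl

# Toy frame; in practice this would be a cleaned dataset
df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df.write_parquet("/tmp/demo_snappy.parquet", compression="snappy")
The resulting file can then be uploaded with aws s3 cp or s3.upload_file as shown earlier.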
Optimal reading with Polars:
import polars as pl
# Optimized Parquet reading: fetch only the needed columns and rows.
# n_rows is only supported by Polars' native reader, so use_pyarrow is
# left at its default (False).
df = pl.read_parquet(
    "s3://mangetamain/recipes_clean.parquet",
    columns=['id', 'name'],           # Selective reading (columnar format)
    n_rows=1000,                      # Limit for preview
    storage_options=storage_options,  # Credentials, see the Polars section above
)
Performance Monitoring
Measure loading time:
import time
import polars as pl
import streamlit as st
from loguru import logger

@st.cache_data(ttl=3600)
def load_with_timing():
    start = time.time()
    # storage_options as built in the Polars section above
    df = pl.read_parquet("s3://mangetamain/recipes_clean.parquet",
                         storage_options=storage_options)
    elapsed = time.time() - start
    logger.info(f"S3 load: {len(df)} rows in {elapsed:.2f}s ({len(df)/elapsed:.0f} rows/s)")
    return df
Expected logs:
2025-10-27 15:23:45 | INFO | S3 load: 178265 rows in 0.42s (424441 rows/s)
Performance Troubleshooting
Speed < 100 MB/s:
Check DNAT active:
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Should display DNAT rule
Test direct connection:
curl -o /dev/null http://192.168.80.202:3910/mangetamain/recipes_clean.parquet
Check network latency:
ping -c 10 192.168.80.202
# RTT should be < 1ms (local network)
Fluctuating speed:
Cause: Garage server load
Solution: Repeat measurements over 5-10 attempts
Normal variance: ±20%
First slow load:
Cause: Garage cold start (server cache)
Normal: 2-3x slower than subsequent loads
Solution: Pre-warm with aws s3 ls (a Python equivalent is sketched below)
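The same pre-warm can be done from Python at application start-up (sketch reusing the boto3 client from the boto3 section):
# A cheap listing call wakes up the Garage server before the first large read
s3.list_objects_v2(Bucket="mangetamain", MaxKeys=1)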
Limits and Quotas
Garage S3 (current installation):
Bandwidth: ~1 Gbps (125 MB/s theoretical)
IOPS: Unlimited (server SSD)
Simultaneous connections: 100+ (sufficient)
Bucket size: ~5 GB used / 1 TB available
No AWS quotas: Self-hosted installation, no AWS limits.
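The used-size figure above can be re-measured at any time by summing object sizes (a sketch reusing the boto3 client from the boto3 section):
# Paginate through the bucket and sum object sizes
paginator = s3.get_paginator("list_objects_v2")
total_bytes = sum(
    obj["Size"]
    for page in paginator.paginate(Bucket="mangetamain")
    for obj in page.get("Contents", [])
)
print(f"Bucket usage: {total_bytes/1e9:.2f} GB")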
See Also
Installation - Complete project installation
Tests and Coverage - Infrastructure S3 tests (50_test/)
api/data - data.cached_loaders module with schemas
api/infrastructure - Automated S3 tests
Quick Start Guide - Essential S3 commands
S3_INSTALL.md (root) - Detailed installation documentation
S3_USAGE.md (root) - Complete usage guide