Garage S3 Storage

Configuration and usage of high-performance Garage S3 storage.

Garage is a self-hosted implementation of the Amazon S3 protocol (see Glossary). This guide covers configuration, performance optimization, and troubleshooting.

To get started quickly, see Quick Start Guide.

Overview

Available endpoints:

Bucket: mangetamain

Performance: 500-917 MB/s (DNAT bypass)

Region: garage-fast

HTTP vs HTTPS choice:

The code uses the HTTP endpoint for performance reasons:

  • Speed gain: No TLS/SSL overhead during data transfers

  • Secured DMZ network: Communication remains within the isolated local network (192.168.80.0/24)

  • HTTPS available: Accessible via reverse proxy for external access if needed

Installation

Local DNS Configuration

echo "192.168.80.202  s3fast.lafrance.io" | sudo tee -a /etc/hosts

iptables-persistent Installation

sudo apt update
sudo apt install iptables-persistent -y

iptables DNAT Rule

Bypass the reverse proxy for maximum performance:

sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 -j DNAT --to-destination 192.168.80.202:3910

Permanent Save

sudo netfilter-persistent save

Installation Verification

# DNS
getent hosts s3fast.lafrance.io
# Should display: 192.168.80.202  s3fast.lafrance.io

# iptables
sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Should display the DNAT rule

Credentials Configuration

96_keys/ Structure

96_keys/
├── credentials          # s3fast profile
├── aws_config           # AWS CLI config
└── garage_s3.duckdb     # DuckDB database with S3 secret

credentials File

ConfigParser format:

[s3fast]
aws_access_key_id = GK4feb...
aws_secret_access_key = 50e63b...
endpoint_url = http://s3fast.lafrance.io
region = garage-fast
bucket = mangetamain

aws_config File

AWS CLI format:

[profile s3fast]
region = garage-fast
s3 =
    endpoint_url = http://s3fast.lafrance.io

DuckDB Database with Secret

Create once:

cd ~/mangetamain/96_keys
duckdb garage_s3.duckdb

In DuckDB:

INSTALL httpfs;
LOAD httpfs;

CREATE SECRET s3fast (
    TYPE s3,
    KEY_ID 'your_access_key_id',
    SECRET 'your_secret_access_key',
    ENDPOINT 's3fast.lafrance.io',
    REGION 'garage-fast',
    URL_STYLE 'path',
    USE_SSL false
);
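
To confirm the secret is registered, you can list secrets from the same session with DuckDB's built-in duckdb_secrets() function (quick sanity check):

-- Still inside the DuckDB CLI
SELECT name, type FROM duckdb_secrets();
-- Expected: one row named 's3fast' with type 's3'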

AWS CLI Usage

List Files

aws s3 ls s3://mangetamain/ \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast

Download

aws s3 cp s3://mangetamain/PP_recipes.csv /tmp/recipes.csv \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast

Upload

aws s3 cp /tmp/results.csv s3://mangetamain/results/ \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast
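
Repeating --endpoint-url and --region on every call gets tedious. One possible shortcut, assuming a recent AWS CLI v2 that reads endpoint_url from the config file, is to point the CLI at the 96_keys/ files and use the s3fast profile:

# Point the AWS CLI at the project's key files
export AWS_SHARED_CREDENTIALS_FILE=~/mangetamain/96_keys/credentials
export AWS_CONFIG_FILE=~/mangetamain/96_keys/aws_config

# Endpoint and region now come from the [profile s3fast] section
aws s3 ls s3://mangetamain/ --profile s3fast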

Python boto3 Usage

Loading Credentials

import boto3
from configparser import ConfigParser

# Load credentials from 96_keys/
config = ConfigParser()
config.read('../96_keys/credentials')

s3 = boto3.client(
    's3',
    endpoint_url=config['s3fast']['endpoint_url'],
    aws_access_key_id=config['s3fast']['aws_access_key_id'],
    aws_secret_access_key=config['s3fast']['aws_secret_access_key'],
    region_name=config['s3fast']['region']
)

List Objects

# List files with sizes
response = s3.list_objects_v2(Bucket='mangetamain')
for obj in response.get('Contents', []):
    print(f"{obj['Key']} - {obj['Size']/1e6:.1f} MB")

Download File

s3.download_file('mangetamain', 'PP_recipes.csv', '/tmp/recipes.csv')

Upload File

s3.upload_file('/tmp/results.csv', 'mangetamain', 'results/analysis.csv')
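
To verify the upload from the same client, a head_object call is a lightweight check (sketch):

# Confirm the object exists and check its size
head = s3.head_object(Bucket='mangetamain', Key='results/analysis.csv')
print(f"Uploaded {head['ContentLength'] / 1e6:.1f} MB")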

DuckDB Usage

SQL Queries on S3

In CLI:

# Simple query
duckdb ~/mangetamain/96_keys/garage_s3.duckdb \
  -c "SELECT COUNT(*) FROM 's3://mangetamain/PP_recipes.csv'"

# Analysis with GROUP BY
duckdb ~/mangetamain/96_keys/garage_s3.duckdb -c "
SELECT calorie_level, COUNT(*) as total
FROM 's3://mangetamain/PP_recipes.csv'
GROUP BY calorie_level
ORDER BY total DESC"

In Python:

import duckdb
from pathlib import Path

# Connect to the database that holds the S3 secret
# (expand ~ explicitly: duckdb.connect() does not do it)
conn = duckdb.connect(str(Path('~/mangetamain/96_keys/garage_s3.duckdb').expanduser()))

# Direct SQL query on S3
df = conn.execute("""
    SELECT *
    FROM 's3://mangetamain/PP_recipes.csv'
    LIMIT 1000
""").fetchdf()

Parquet on S3

DuckDB is optimized for Parquet:

# Read Parquet from S3 (zero-copy)
conn.execute("""
    SELECT AVG(calories) as mean_calories
    FROM 's3://mangetamain/RAW_recipes_clean.parquet'
    WHERE year >= 2010
""")

Polars Usage

Direct S3 Reading

import polars as pl
from configparser import ConfigParser

# Load credentials
config = ConfigParser()
config.read('../96_keys/credentials')

# Configure storage options
storage_options = {
    'aws_endpoint_url': config['s3fast']['endpoint_url'],
    'aws_access_key_id': config['s3fast']['aws_access_key_id'],
    'aws_secret_access_key': config['s3fast']['aws_secret_access_key'],
    'aws_region': config['s3fast']['region']
}

# Read CSV from S3
df = pl.read_csv(
    's3://mangetamain/PP_recipes.csv',
    storage_options=storage_options
)

# Read Parquet from S3
df = pl.read_parquet(
    's3://mangetamain/RAW_recipes_clean.parquet',
    storage_options=storage_options
)
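
For larger files, a lazy scan lets Polars push filters and column selections down to the Parquet reader, so only the needed data is fetched from S3 (a sketch reusing the storage_options above; the column names are the ones used elsewhere in this guide):

# Lazy scan: filters and column selection are pushed down to the reader
df_recent = (
    pl.scan_parquet(
        's3://mangetamain/RAW_recipes_clean.parquet',
        storage_options=storage_options
    )
    .filter(pl.col('year') >= 2010)   # predicate pushdown
    .select(['name', 'calories'])     # projection pushdown
    .collect()
)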

Performance Tests

Download Benchmark

# Test with large file
time aws s3 cp s3://mangetamain/large_file.parquet /tmp/ \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast

Expected results:

  • With DNAT bypass: 500-917 MB/s

  • Without bypass (reverse proxy): 50-100 MB/s

  • Gain: 5-10x faster

DNAT Active Verification

# Check iptables rule
sudo iptables -t nat -L OUTPUT -n -v | grep 3910

# Test direct connection port 3910
curl -I http://192.168.80.202:3910/mangetamain/

# Should return HTTP 200 or S3 XML error

Bucket Structure

File Organization

s3://mangetamain/
├── RAW_recipes.csv
├── RAW_recipes_clean.parquet
├── RAW_interactions.csv
├── RAW_interactions_clean.parquet
├── PP_recipes.csv
├── PP_users.csv
├── PP_ratings.parquet
├── interactions_train.csv
├── interactions_test.csv
└── interactions_validation.csv

File Sizes

File                             Size
RAW_recipes.csv                  ~50 MB
RAW_recipes_clean.parquet        ~25 MB
RAW_interactions.csv             ~200 MB
RAW_interactions_clean.parquet   ~80 MB
PP_recipes.csv                   ~30 MB
PP_ratings.parquet               ~60 MB

Infrastructure Tests

Automated Tests (50_test/)

S3_duckdb_test.py (14 tests):

  • System environment (AWS CLI, credentials)

  • S3 connection with boto3

  • Download performance (>5 MB/s)

  • DuckDB + S3 integration

  • Docker tests (optional)

test_s3_parquet_files.py (5 tests):

  • Automatically scans code

  • Finds parquet file references

  • Tests S3 accessibility

Run S3 Tests

cd ~/mangetamain/50_test
pytest S3_duckdb_test.py -v
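
For reference, a minimal connectivity test in the same spirit could look like this (an illustrative sketch, not the actual contents of S3_duckdb_test.py):

import boto3
import pytest
from configparser import ConfigParser

@pytest.fixture(scope="module")
def s3_client():
    """boto3 client built from the 96_keys/ credentials file."""
    config = ConfigParser()
    config.read('../96_keys/credentials')
    return boto3.client(
        's3',
        endpoint_url=config['s3fast']['endpoint_url'],
        aws_access_key_id=config['s3fast']['aws_access_key_id'],
        aws_secret_access_key=config['s3fast']['aws_secret_access_key'],
        region_name=config['s3fast']['region']
    )

def test_bucket_reachable(s3_client):
    """The mangetamain bucket should answer a simple listing request."""
    response = s3_client.list_objects_v2(Bucket='mangetamain', MaxKeys=1)
    assert response['ResponseMetadata']['HTTPStatusCode'] == 200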

Troubleshooting

Error: Cannot connect to S3

Possible causes:

  1. DNS not configured

  2. Missing iptables rule

  3. Invalid credentials

Solution:

# Check DNS
getent hosts s3fast.lafrance.io

# Check iptables
sudo iptables -t nat -L OUTPUT -n -v | grep 3910

# Test credentials
aws s3 ls s3://mangetamain/ \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast

Error: Slow Download Speed

Cause: DNAT bypass not active, traffic goes through reverse proxy

Solution: Check iptables rule

sudo iptables -t nat -L OUTPUT -n -v | grep 3910

# If absent, recreate rule
sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 -j DNAT --to-destination 192.168.80.202:3910
sudo netfilter-persistent save

Error: DuckDB Secret Not Found

Cause: S3 secret not created in DuckDB database

Solution: Recreate the secret

duckdb ~/mangetamain/96_keys/garage_s3.duckdb
DROP SECRET IF EXISTS s3fast;

CREATE SECRET s3fast (
    TYPE s3,
    KEY_ID 'your_key_id',
    SECRET 'your_secret',
    ENDPOINT 's3fast.lafrance.io',
    REGION 'garage-fast',
    URL_STYLE 'path',
    USE_SSL false
);

Best Practices

Credentials Security

  • NEVER commit 96_keys/ (in .gitignore)

  • Share credentials via secure channel only

  • Regular key rotation

Performance

  • Prefer Parquet over CSV (2-3x faster)

  • Use DuckDB for SQL queries (zero-copy)

  • Enable DNAT bypass (10x faster)

  • Local cache for frequently accessed files (see the sketch below)
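
A minimal sketch of such a local file cache, assuming the boto3 client from the Python section above and a hypothetical /tmp cache directory:

from pathlib import Path

# Hypothetical cache location; adapt to the project's conventions
CACHE_DIR = Path('/tmp/mangetamain_cache')

def cached_download(s3, key: str) -> Path:
    """Download an S3 object once, then reuse the local copy."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / key.replace('/', '_')
    if not local_path.exists():
        s3.download_file('mangetamain', key, str(local_path))
    return local_path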

Streamlit Cache

import polars as pl
import streamlit as st

@st.cache_data(ttl=3600)  # 1h cache
def load_data_from_s3():
    """Load S3 data with cache."""
    # The expensive S3 read runs only once per TTL; later calls hit the cache
    df = pl.read_parquet("s3://mangetamain/RAW_recipes_clean.parquet")
    return df

Performance Benchmarks

Configuration Comparison

Tests performed with recipes_clean.parquet (250 MB):

Configuration                      Speed          Time (250 MB)   Gain
Without DNAT (via reverse proxy)   50-100 MB/s    2.5-5 s         Baseline
DNAT bypass (direct port 3910)     500-917 MB/s   0.27-0.5 s      10x
DNAT + local SSD read              2-3 GB/s       0.08-0.12 s     40x

Recommendation: the DNAT bypass is mandatory for acceptable performance.

Performance Test

Benchmark script:

#!/bin/bash
# test_s3_speed.sh

echo "=== Test without DNAT ==="
# Temporarily disable DNAT
sudo iptables -t nat -D OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910 2>/dev/null

time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test1.parquet --profile s3fast
rm /tmp/test1.parquet

echo "=== Test with DNAT ==="
# Re-enable DNAT
sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910

time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test2.parquet --profile s3fast
rm /tmp/test2.parquet

Expected results:

Without DNAT: real 0m4.520s (55 MB/s)
With DNAT: real 0m0.380s (658 MB/s)

Gain: 11.9x faster

Parquet Reading Optimization

Format comparison:

Format               Size     Read time   Speed
CSV (uncompressed)   1.2 GB   12-15 s     80-100 MB/s
CSV (gzip)           320 MB   8-10 s      32-40 MB/s
Parquet (Snappy)     250 MB   0.3-0.5 s   500-833 MB/s

Why Parquet is optimal:

  • Integrated Snappy compression (ratio ~5:1)

  • Columnar format (selective reading)

  • Integrated metadata (no parsing needed)

  • Zero-copy with DuckDB/Polars

Optimal reading with Polars:

import polars as pl

# Optimized Parquet reading (Polars' native reader)
df = pl.read_parquet(
    "s3://mangetamain/recipes_clean.parquet",
    columns=['id', 'name'],   # Selective reading (columnar)
    n_rows=1000               # Limit for preview
)

Performance Monitoring

Measure loading time:

import time

import polars as pl
import streamlit as st
from loguru import logger

@st.cache_data(ttl=3600)
def load_with_timing():
    start = time.time()

    df = pl.read_parquet("s3://mangetamain/recipes_clean.parquet")

    elapsed = time.time() - start
    logger.info(f"S3 load: {len(df)} rows in {elapsed:.2f}s ({len(df)/elapsed:.0f} rows/s)")

    return df

Expected logs:

2025-10-27 15:23:45 | INFO | S3 load: 178265 rows in 0.42s (424441 rows/s)

Performance Troubleshooting

Speed < 100 MB/s:

  1. Check that DNAT is active:

sudo iptables -t nat -L OUTPUT -n -v | grep 3910
# Should display the DNAT rule

  2. Test the direct connection:

curl -o /dev/null http://192.168.80.202:3910/mangetamain/recipes_clean.parquet

  3. Check network latency:

ping -c 10 192.168.80.202
# RTT should be < 1ms (local network)

Fluctuating speed:

  • Cause: Garage server load

  • Solution: Repeat measurements over 5-10 attempts

  • Normal variance: ±20%

First slow load:

  • Cause: Garage cold start (server cache)

  • Normal: 2-3x slower than subsequent loads

  • Solution: Pre-warm with aws s3 ls
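
The pre-warm can be a throwaway listing before the first heavy read, for example:

# Warm up the Garage server before the first large download
aws s3 ls s3://mangetamain/ \
  --endpoint-url http://s3fast.lafrance.io \
  --region garage-fast > /dev/null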

Limits and Quotas

Garage S3 (current installation):

  • Bandwidth: ~1 Gbps (125 MB/s theoretical)

  • IOPS: Unlimited (server SSD)

  • Simultaneous connections: 100+ (sufficient)

  • Bucket size: ~5 GB used / 1 TB available

No AWS quotas: Self-hosted installation, no AWS limits.

See Also

  • Installation - Complete project installation

  • Tests and Coverage - Infrastructure S3 tests (50_test/)

  • api/data - data.cached_loaders module with schemas

  • api/infrastructure - Automated S3 tests

  • Quick Start Guide - Essential S3 commands

  • S3_INSTALL.md (root) - Detailed installation documentation

  • S3_USAGE.md (root) - Complete usage guide