Garage S3 Storage
=================

Configuration and usage of high-performance Garage S3 storage.

**Garage S3** is a self-hosted implementation of the Amazon S3 protocol (:doc:`glossaire`). This guide covers configuration, performance optimization and troubleshooting.

**To get started quickly**, see :doc:`quickstart`.

Overview
--------

**Available endpoints**:

* **HTTP**: http://s3fast.lafrance.io (port 3910) - **preferred in code**
* **HTTPS**: https://s3fast.lafrance.io (port 443, via reverse proxy)

**Bucket**: mangetamain

**Performance**: 500-917 MB/s (DNAT bypass)

**Region**: garage-fast

**HTTP vs HTTPS choice**: the code uses the HTTP endpoint for performance reasons:

* **Speed gain**: no TLS/SSL overhead during data transfers
* **Secured DMZ network**: communication stays within the isolated local network (192.168.80.0/24)
* **HTTPS available**: accessible via the reverse proxy for external access if needed

Installation
------------

Local DNS Configuration
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   echo "192.168.80.202 s3fast.lafrance.io" | sudo tee -a /etc/hosts

iptables-persistent Installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   sudo apt update
   sudo apt install iptables-persistent -y

iptables DNAT Rule
^^^^^^^^^^^^^^^^^^

Bypass the reverse proxy for maximum performance:

.. code-block:: bash

   sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910

Permanent Save
^^^^^^^^^^^^^^

.. code-block:: bash

   sudo netfilter-persistent save

Installation Verification
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # DNS
   getent hosts s3fast.lafrance.io
   # Should display: 192.168.80.202 s3fast.lafrance.io

   # iptables
   sudo iptables -t nat -L OUTPUT -n -v | grep 3910
   # Should display the DNAT rule

Credentials Configuration
-------------------------

96_keys/ Structure
^^^^^^^^^^^^^^^^^^

.. code-block:: text

   96_keys/
   ├── credentials        # s3fast profile
   ├── aws_config         # AWS CLI config
   └── garage_s3.duckdb   # DuckDB database with S3 secret

credentials File
^^^^^^^^^^^^^^^^

ConfigParser format:

.. code-block:: ini

   [s3fast]
   aws_access_key_id = GK4feb...
   aws_secret_access_key = 50e63b...
   endpoint_url = http://s3fast.lafrance.io
   region = garage-fast
   bucket = mangetamain

aws_config File
^^^^^^^^^^^^^^^

AWS CLI format:

.. code-block:: ini

   [profile s3fast]
   region = garage-fast
   s3 =
       endpoint_url = http://s3fast.lafrance.io

DuckDB Database with Secret
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create once:

.. code-block:: bash

   cd ~/mangetamain/96_keys
   duckdb garage_s3.duckdb

In DuckDB:

.. code-block:: sql

   INSTALL httpfs;
   LOAD httpfs;

   CREATE SECRET s3fast (
       TYPE s3,
       KEY_ID 'your_access_key_id',
       SECRET 'your_secret_access_key',
       ENDPOINT 's3fast.lafrance.io',
       REGION 'garage-fast',
       URL_STYLE 'path',
       USE_SSL false
   );
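Before moving on to the client tools below, it can be worth a quick sanity check that the credentials file parses and the endpoint answers at all. A minimal sketch, assuming the ``~/mangetamain/96_keys`` layout above; everything else is standard library:

.. code-block:: python

   from configparser import ConfigParser
   from pathlib import Path
   import urllib.error
   import urllib.request

   # Parse the ConfigParser-format credentials file described above
   creds = ConfigParser()
   creds.read(Path.home() / "mangetamain" / "96_keys" / "credentials")

   profile = creds["s3fast"]
   print(f"endpoint: {profile['endpoint_url']}, bucket: {profile['bucket']}")

   # Even unauthenticated, a reachable Garage endpoint answers with an
   # HTTP status or an S3 XML error rather than a connection failure.
   try:
       urllib.request.urlopen(profile["endpoint_url"], timeout=5)
       print("Endpoint reachable (HTTP 200)")
   except urllib.error.HTTPError as exc:
       print(f"Endpoint reachable (HTTP {exc.code})")
   except urllib.error.URLError as exc:
       print(f"Endpoint unreachable: {exc.reason}")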
AWS CLI Usage
-------------

List Files
^^^^^^^^^^

.. code-block:: bash

   aws s3 ls s3://mangetamain/ \
     --endpoint-url http://s3fast.lafrance.io \
     --region garage-fast

Download
^^^^^^^^

.. code-block:: bash

   aws s3 cp s3://mangetamain/PP_recipes.csv /tmp/recipes.csv \
     --endpoint-url http://s3fast.lafrance.io \
     --region garage-fast

Upload
^^^^^^

.. code-block:: bash

   aws s3 cp /tmp/results.csv s3://mangetamain/results/ \
     --endpoint-url http://s3fast.lafrance.io \
     --region garage-fast

Python boto3 Usage
------------------

Loading Credentials
^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   import boto3
   from configparser import ConfigParser

   # Load credentials from 96_keys/
   config = ConfigParser()
   config.read('../96_keys/credentials')

   s3 = boto3.client(
       's3',
       endpoint_url=config['s3fast']['endpoint_url'],
       aws_access_key_id=config['s3fast']['aws_access_key_id'],
       aws_secret_access_key=config['s3fast']['aws_secret_access_key'],
       region_name=config['s3fast']['region']
   )

List Objects
^^^^^^^^^^^^

.. code-block:: python

   # List files with sizes
   response = s3.list_objects_v2(Bucket='mangetamain')
   for obj in response.get('Contents', []):
       print(f"{obj['Key']} - {obj['Size']/1e6:.1f} MB")

Download File
^^^^^^^^^^^^^

.. code-block:: python

   s3.download_file('mangetamain', 'PP_recipes.csv', '/tmp/recipes.csv')

Upload File
^^^^^^^^^^^

.. code-block:: python

   s3.upload_file('/tmp/results.csv', 'mangetamain', 'results/analysis.csv')

DuckDB Usage
------------

SQL Queries on S3
^^^^^^^^^^^^^^^^^

In the CLI:

.. code-block:: bash

   # Simple query
   duckdb ~/mangetamain/96_keys/garage_s3.duckdb \
     -c "SELECT COUNT(*) FROM 's3://mangetamain/PP_recipes.csv'"

   # Analysis with GROUP BY
   duckdb ~/mangetamain/96_keys/garage_s3.duckdb -c "
     SELECT calorie_level, COUNT(*) as total
     FROM 's3://mangetamain/PP_recipes.csv'
     GROUP BY calorie_level
     ORDER BY total DESC"

In Python:

.. code-block:: python

   import os
   import duckdb

   # Connect to the database holding the S3 secret
   # (duckdb.connect does not expand ~ itself)
   conn = duckdb.connect(os.path.expanduser('~/mangetamain/96_keys/garage_s3.duckdb'))

   # Direct SQL query on S3
   df = conn.execute("""
       SELECT * FROM 's3://mangetamain/PP_recipes.csv'
       LIMIT 1000
   """).fetchdf()

Parquet on S3
^^^^^^^^^^^^^

DuckDB is optimized for Parquet:

.. code-block:: python

   # Read Parquet from S3 (zero-copy)
   conn.execute("""
       SELECT AVG(calories) as mean_calories
       FROM 's3://mangetamain/RAW_recipes_clean.parquet'
       WHERE year >= 2010
   """)

Polars Usage
------------

Direct S3 Reading
^^^^^^^^^^^^^^^^^

.. code-block:: python

   import polars as pl
   from configparser import ConfigParser

   # Load credentials
   config = ConfigParser()
   config.read('../96_keys/credentials')

   # Configure storage options
   storage_options = {
       'aws_endpoint_url': config['s3fast']['endpoint_url'],
       'aws_access_key_id': config['s3fast']['aws_access_key_id'],
       'aws_secret_access_key': config['s3fast']['aws_secret_access_key'],
       'aws_region': config['s3fast']['region']
   }

   # Read CSV from S3
   df = pl.read_csv(
       's3://mangetamain/PP_recipes.csv',
       storage_options=storage_options
   )

   # Read Parquet from S3
   df = pl.read_parquet(
       's3://mangetamain/RAW_recipes_clean.parquet',
       storage_options=storage_options
   )

Performance Tests
-----------------

Download Benchmark
^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Test with a large file
   time aws s3 cp s3://mangetamain/large_file.parquet /tmp/ \
     --endpoint-url http://s3fast.lafrance.io \
     --region garage-fast

**Expected results**:

* **With DNAT bypass**: 500-917 MB/s
* **Without bypass** (reverse proxy): 50-100 MB/s
* **Gain**: 5-10x faster

DNAT Active Verification
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Check iptables rule
   sudo iptables -t nat -L OUTPUT -n -v | grep 3910

   # Test direct connection on port 3910
   curl -I http://192.168.80.202:3910/mangetamain/
   # Should return HTTP 200 or an S3 XML error
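The shell benchmark above can also be scripted from Python, which makes it easy to fold into the automated tests. A minimal sketch reusing the boto3 setup from the section above; the object key is only an example, any large file in the bucket works:

.. code-block:: python

   import time
   from configparser import ConfigParser

   import boto3

   config = ConfigParser()
   config.read("../96_keys/credentials")
   s3 = boto3.client(
       "s3",
       endpoint_url=config["s3fast"]["endpoint_url"],
       aws_access_key_id=config["s3fast"]["aws_access_key_id"],
       aws_secret_access_key=config["s3fast"]["aws_secret_access_key"],
       region_name=config["s3fast"]["region"],
   )

   key = "RAW_interactions.csv"  # example key; any large object works
   size_mb = s3.head_object(Bucket="mangetamain", Key=key)["ContentLength"] / 1e6

   start = time.time()
   s3.download_file("mangetamain", key, "/tmp/bench.bin")
   elapsed = time.time() - start

   # With the DNAT bypass active this should land well above 100 MB/s.
   print(f"{size_mb:.0f} MB in {elapsed:.2f}s -> {size_mb / elapsed:.0f} MB/s")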
Bucket Structure
----------------

File Organization
^^^^^^^^^^^^^^^^^

.. code-block:: text

   s3://mangetamain/
   ├── RAW_recipes.csv
   ├── RAW_recipes_clean.parquet
   ├── RAW_interactions.csv
   ├── RAW_interactions_clean.parquet
   ├── PP_recipes.csv
   ├── PP_users.csv
   ├── PP_ratings.parquet
   ├── interactions_train.csv
   ├── interactions_test.csv
   └── interactions_validation.csv

File Sizes
^^^^^^^^^^

============================== =======
File                           Size
============================== =======
RAW_recipes.csv                ~50 MB
RAW_recipes_clean.parquet      ~25 MB
RAW_interactions.csv           ~200 MB
RAW_interactions_clean.parquet ~80 MB
PP_recipes.csv                 ~30 MB
PP_ratings.parquet             ~60 MB
============================== =======

Infrastructure Tests
--------------------

Automated Tests (50_test/)
^^^^^^^^^^^^^^^^^^^^^^^^^^

**S3_duckdb_test.py** (14 tests):

* System environment (AWS CLI, credentials)
* S3 connection with boto3
* Download performance (>5 MB/s)
* DuckDB + S3 integration
* Docker tests (optional)

**test_s3_parquet_files.py** (5 tests):

* Automatically scans the code
* Finds Parquet file references
* Tests S3 accessibility

Run S3 Tests
^^^^^^^^^^^^

.. code-block:: bash

   cd ~/mangetamain/50_test
   pytest S3_duckdb_test.py -v

Troubleshooting
---------------

Error: Cannot connect to S3
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Possible causes**:

1. DNS not configured
2. Missing iptables rule
3. Invalid credentials

**Solution**:

.. code-block:: bash

   # Check DNS
   getent hosts s3fast.lafrance.io

   # Check iptables
   sudo iptables -t nat -L OUTPUT -n -v | grep 3910

   # Test credentials
   aws s3 ls s3://mangetamain/ \
     --endpoint-url http://s3fast.lafrance.io \
     --region garage-fast

Error: Slow Download Speed
^^^^^^^^^^^^^^^^^^^^^^^^^^

**Cause**: DNAT bypass not active, traffic goes through the reverse proxy.

**Solution**: check the iptables rule.

.. code-block:: bash

   sudo iptables -t nat -L OUTPUT -n -v | grep 3910

   # If absent, recreate the rule
   sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910
   sudo netfilter-persistent save

Error: DuckDB Secret Not Found
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Cause**: the S3 secret was not created in the DuckDB database.

**Solution**: recreate the secret.

.. code-block:: bash

   duckdb ~/mangetamain/96_keys/garage_s3.duckdb

.. code-block:: sql

   DROP SECRET IF EXISTS s3fast;
   CREATE SECRET s3fast (
       TYPE s3,
       KEY_ID 'your_key_id',
       SECRET 'your_secret',
       ENDPOINT 's3fast.lafrance.io',
       REGION 'garage-fast',
       URL_STYLE 'path',
       USE_SSL false
   );

Best Practices
--------------

Credentials Security
^^^^^^^^^^^^^^^^^^^^

* **NEVER** commit 96_keys/ (it is in .gitignore)
* Share credentials via a secure channel only
* Rotate keys regularly

Performance
^^^^^^^^^^^

* Prefer Parquet over CSV (2-3x faster) - see the conversion sketch below
* Use DuckDB for SQL queries (zero-copy)
* Enable the DNAT bypass (10x faster)
* Cache frequently accessed files locally - see the helper sketch below
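For the Parquet-over-CSV bullet, the conversion itself is a short Polars script. A minimal sketch, reusing the ``storage_options`` layout from the Polars section; the ``/tmp`` path and the target key are illustrative only:

.. code-block:: python

   from configparser import ConfigParser

   import polars as pl

   config = ConfigParser()
   config.read("../96_keys/credentials")
   storage_options = {
       "aws_endpoint_url": config["s3fast"]["endpoint_url"],
       "aws_access_key_id": config["s3fast"]["aws_access_key_id"],
       "aws_secret_access_key": config["s3fast"]["aws_secret_access_key"],
       "aws_region": config["s3fast"]["region"],
   }

   # Read the CSV once, write a Snappy-compressed Parquet copy locally...
   df = pl.read_csv("s3://mangetamain/PP_recipes.csv", storage_options=storage_options)
   df.write_parquet("/tmp/PP_recipes.parquet", compression="snappy")
   # ...then upload it next to the original, e.g. with the boto3 client above:
   # s3.upload_file("/tmp/PP_recipes.parquet", "mangetamain", "PP_recipes.parquet")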
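For the local-cache bullet, a small helper is enough: download an object once, then serve the local copy. A sketch; the cache directory is an arbitrary choice, and ``s3`` is a boto3 client configured as in the boto3 section:

.. code-block:: python

   from pathlib import Path

   CACHE_DIR = Path("/tmp/s3_cache")  # arbitrary location, adjust as needed

   def cached_download(s3, bucket: str, key: str) -> Path:
       """Download an object once; later calls return the local copy."""
       CACHE_DIR.mkdir(parents=True, exist_ok=True)
       local = CACHE_DIR / key.replace("/", "_")
       if not local.exists():
           s3.download_file(bucket, key, str(local))
       return local

   # Usage with the boto3 client from the section above:
   # path = cached_download(s3, "mangetamain", "RAW_recipes_clean.parquet")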
Streamlit Cache
^^^^^^^^^^^^^^^

.. code-block:: python

   import polars as pl
   import streamlit as st

   @st.cache_data(ttl=3600)  # 1h cache
   def load_data_from_s3():
       """Load S3 data with cache."""
       # Expensive S3 read happens only once per TTL;
       # storage_options as built in the Polars section above.
       df = pl.read_parquet(
           's3://mangetamain/RAW_recipes_clean.parquet',
           storage_options=storage_options,
       )
       return df

Performance Benchmarks
----------------------

Configuration Comparison
^^^^^^^^^^^^^^^^^^^^^^^^

Tests performed with ``recipes_clean.parquet`` (250 MB):

================================ ============ ============= ========
Configuration                    Speed        Time (250 MB) Gain
================================ ============ ============= ========
Without DNAT (via reverse proxy) 50-100 MB/s  2.5-5 seconds Baseline
DNAT bypass (direct port 3910)   500-917 MB/s 0.27-0.5 sec  **10x**
DNAT + local SSD read            2-3 GB/s     0.08-0.12 sec 40x
================================ ============ ============= ========

**Recommendation**: the DNAT bypass is mandatory for acceptable performance.

Performance Test
^^^^^^^^^^^^^^^^

**Benchmark script**:

.. code-block:: bash

   #!/bin/bash
   # test_s3_speed.sh

   echo "=== Test without DNAT ==="
   # Temporarily disable DNAT
   sudo iptables -t nat -D OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910 2>/dev/null

   time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test1.parquet --profile s3fast
   rm /tmp/test1.parquet

   echo "=== Test with DNAT ==="
   # Re-enable DNAT
   sudo iptables -t nat -A OUTPUT -p tcp -d 192.168.80.202 --dport 80 \
     -j DNAT --to-destination 192.168.80.202:3910

   time aws s3 cp s3://mangetamain/recipes_clean.parquet /tmp/test2.parquet --profile s3fast
   rm /tmp/test2.parquet

**Expected results**:

.. code-block:: text

   Without DNAT: real 0m4.520s (55 MB/s)
   With DNAT:    real 0m0.380s (658 MB/s)
   Gain: 11.9x faster

Parquet Reading Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Format comparison**:

=================== ======= ============= ================
Format              Size    Read time     Speed
=================== ======= ============= ================
CSV (uncompressed)  1.2 GB  12-15 seconds 80-100 MB/s
CSV (gzip)          320 MB  8-10 seconds  32-40 MB/s
Parquet (Snappy)    250 MB  0.3-0.5 sec   **500-833 MB/s**
=================== ======= ============= ================

**Why Parquet is optimal**:

* Built-in Snappy compression (ratio ~5:1)
* Columnar format (selective reads)
* Embedded metadata (no parsing needed)
* Zero-copy with DuckDB/Polars

**Optimal reading with Polars**:

.. code-block:: python

   import polars as pl

   # Optimized Parquet reading
   df = pl.read_parquet(
       "s3://mangetamain/recipes_clean.parquet",
       use_pyarrow=True,        # Arrow engine (faster)
       columns=['id', 'name'],  # Selective reading (columnar)
       n_rows=1000              # Limit for preview
   )
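For larger pipelines, the same selective reading can be expressed lazily: ``pl.scan_parquet`` builds a query plan and only fetches data when ``collect()`` runs, pushing the projection and filter down to the Parquet reader. A sketch under the same assumptions as above; the column names are examples taken from the queries on this page:

.. code-block:: python

   from configparser import ConfigParser

   import polars as pl

   config = ConfigParser()
   config.read("../96_keys/credentials")
   storage_options = {
       "aws_endpoint_url": config["s3fast"]["endpoint_url"],
       "aws_access_key_id": config["s3fast"]["aws_access_key_id"],
       "aws_secret_access_key": config["s3fast"]["aws_secret_access_key"],
       "aws_region": config["s3fast"]["region"],
   }

   # Lazy scan: nothing is read until collect() runs.
   lf = pl.scan_parquet(
       "s3://mangetamain/recipes_clean.parquet",
       storage_options=storage_options,
   )

   # Projection and predicate are pushed down, so only the selected
   # columns and matching row groups travel over the network.
   df = (
       lf.select(["id", "name", "calories"])
         .filter(pl.col("calories") > 500)
         .collect()
   )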
Performance Monitoring
^^^^^^^^^^^^^^^^^^^^^^

**Measure loading time**:

.. code-block:: python

   import time

   import polars as pl
   import streamlit as st
   from loguru import logger

   @st.cache_data(ttl=3600)
   def load_with_timing():
       start = time.time()
       df = pl.read_parquet("s3://mangetamain/recipes_clean.parquet")
       elapsed = time.time() - start
       logger.info(f"S3 load: {len(df)} rows in {elapsed:.2f}s ({len(df)/elapsed:.0f} rows/s)")
       return df

**Expected logs**:

.. code-block:: text

   2025-10-27 15:23:45 | INFO | S3 load: 178265 rows in 0.42s (424441 rows/s)

Performance Troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Speed < 100 MB/s**:

1. **Check that DNAT is active**:

   .. code-block:: bash

      sudo iptables -t nat -L OUTPUT -n -v | grep 3910
      # Should display the DNAT rule

2. **Test the direct connection**:

   .. code-block:: bash

      curl -o /dev/null http://192.168.80.202:3910/mangetamain/recipes_clean.parquet

3. **Check network latency**:

   .. code-block:: bash

      ping -c 10 192.168.80.202
      # RTT should be < 1 ms (local network)

**Fluctuating speed**:

* **Cause**: Garage server load
* **Solution**: repeat measurements over 5-10 attempts
* **Normal variance**: ±20%

**First slow load**:

* **Cause**: Garage cold start (server cache)
* **Normal**: 2-3x slower than subsequent loads
* **Solution**: pre-warm with ``aws s3 ls``

Limits and Quotas
^^^^^^^^^^^^^^^^^

**Garage S3 (current installation)**:

* **Bandwidth**: ~1 Gbps (125 MB/s theoretical)
* **IOPS**: unlimited (server SSD)
* **Simultaneous connections**: 100+ (sufficient)
* **Bucket size**: ~5 GB used / 1 TB available

**No AWS quotas**: self-hosted installation, no AWS limits.

See Also
--------

* :doc:`installation` - Complete project installation
* :doc:`tests` - S3 infrastructure tests (50_test/)
* :doc:`api/data` - data.cached_loaders module with schemas
* :doc:`api/infrastructure` - Automated S3 tests
* :doc:`quickstart` - Essential S3 commands
* S3_INSTALL.md (root) - Detailed installation documentation
* S3_USAGE.md (root) - Complete usage guide