Module data
===========

Data loading and caching from S3 Parquet with Streamlit ``@st.cache_data`` (TTL 3600 s).

**Data**: 178K recipes + 1.1M ratings (~450 MB Parquet) → see :doc:`../glossaire` for details.

data.loaders
------------

``DataLoader`` class for data loading with robust error handling.

.. automodule:: mangetamain_analytics.data.loaders
   :members:
   :undoc-members:
   :show-inheritance:

DataLoader Class
^^^^^^^^^^^^^^^^

Encapsulates the loading logic from ``mangetamain_data_utils`` with custom exceptions.

**Methods**:

* ``load_recipes()``: Load recipes from S3 Parquet
* ``load_ratings(min_interactions, return_metadata, verbose)``: Load ratings for long-term analysis

**Raised Exceptions**:

* ``DataLoadError``: Raised if the utility module cannot be imported or S3 loading fails

**Example**:

.. code-block:: python

   from mangetamain_analytics.data.loaders import DataLoader
   from mangetamain_analytics.exceptions import DataLoadError

   loader = DataLoader()
   try:
       recipes = loader.load_recipes()
       print(f"Loaded {len(recipes)} recipes")
   except DataLoadError as e:
       print(f"Error: {e.source} - {e.detail}")

Relationship with cached_loaders
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``cached_loaders`` uses ``DataLoader`` internally:

1. ``DataLoader``: business logic + error handling (testable)
2. ``cached_loaders``: wrapping with ``@st.cache_data`` (Streamlit cache)

This separation allows testing ``DataLoader`` without mocking Streamlit.

data.cached_loaders
-------------------

Data loading functions with Streamlit caching.

.. automodule:: mangetamain_analytics.data.cached_loaders
   :members:
   :undoc-members:
   :show-inheritance:

Main Functions
^^^^^^^^^^^^^^

* ``get_recipes_clean()``: Load recipes from S3 Parquet
* ``get_ratings_longterm()``: Load ratings for long-term analysis

Data Schema
^^^^^^^^^^^

**get_recipes_clean() returns**:

* ``id``: Unique recipe identifier (int)
* ``name``: Recipe name (str)
* ``minutes``: Preparation time in minutes (int)
* ``submitted``: Submission date (date)
* ``year``: Submission year (int)
* ``n_ingredients``: Number of ingredients (int)
* ``complexity_score``: Complexity score 0-10 (float)
* ``calories``, ``protein``, ``fat``, ``sodium``: Nutritional information (float)
* ``tags``: List of recipe tags (list[str])
* ``day_of_week``: Day of week (0=Monday, 6=Sunday)
* ``season``: Season (Autumn, Winter, Spring, Summer)

**Size**: 178,265 recipes, ~250 MB compressed Parquet

**get_ratings_longterm() returns**:

* ``user_id``: User identifier (int)
* ``recipe_id``: Recipe identifier (int)
* ``date``: Rating date (date)
* ``rating``: 0-5 star rating (int)
* ``review``: Optional comment text (str)

**Size**: 1.1M+ ratings, ~180 MB compressed Parquet

Advanced Options
^^^^^^^^^^^^^^^^

**get_ratings_longterm() accepts**:

* ``min_interactions`` (int, default 100): Keep only recipes with at least this many interactions
* ``return_metadata`` (bool, default False): Return a tuple ``(data, metadata)``
* ``verbose`` (bool, default False): Display detailed loading logs

The metadata contains:

* ``total_ratings``: Total number of ratings
* ``total_users``: Number of unique users
* ``total_recipes``: Number of rated recipes
* ``date_range``: Temporal range (min, max)
* ``load_time_ms``: Loading time in milliseconds

Cache Mechanism
^^^^^^^^^^^^^^^

The functions use the ``@st.cache_data`` decorator with:

* **TTL**: 3600 seconds (1 hour)
* **Spinner**: visible loading message
* **Lazy imports**: local test compatibility
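A minimal sketch of such a cached wrapper is shown below. The actual ``cached_loaders`` code may differ in details, but it follows this pattern: the decorator parameters mirror the documented TTL and spinner settings, and the lazy ``DataLoader`` import keeps the module importable in tests without a Streamlit session.

.. code-block:: python

   import streamlit as st

   @st.cache_data(ttl=3600, show_spinner="Loading recipes from S3...")
   def get_recipes_clean():
       """Load the cleaned recipes Parquet, cached for one hour."""
       # Lazy import: business logic stays testable without Streamlit
       from mangetamain_analytics.data.loaders import DataLoader

       return DataLoader().load_recipes()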
Usage Examples
^^^^^^^^^^^^^^

**Basic loading:**

.. code-block:: python

   from data.cached_loaders import get_recipes_clean, get_ratings_longterm

   # Loaded once per hour from S3
   recipes = get_recipes_clean()     # DataFrame, 178K recipes
   ratings = get_ratings_longterm()  # DataFrame, 1.1M+ ratings

   print(f"Loaded {len(recipes)} recipes, {len(ratings)} ratings")

**With advanced options:**

.. code-block:: python

   # Keep only popular recipes and request metadata
   ratings, metadata = get_ratings_longterm(
       min_interactions=100,   # Minimum 100 ratings per recipe
       return_metadata=True,
       verbose=True,
   )

   print(f"Total users: {metadata['total_users']:,}")
   print(f"Date range: {metadata['date_range']}")
   print(f"Load time: {metadata['load_time_ms']} ms")

**Analyzing data:**

.. code-block:: python

   import polars as pl

   # Filter recipes by year
   recipes_2018 = recipes.filter(pl.col('year') == 2018)

   # Quick recipes (< 30 min)
   quick_recipes = recipes.filter(pl.col('minutes') < 30)

   # Recipes by season
   winter_recipes = recipes.filter(pl.col('season') == 'Winter')

   # Top-rated interactions
   top_ratings = ratings.filter(pl.col('rating') == 5)

**Joining recipes and ratings:**

.. code-block:: python

   # Join for combined analysis
   recipes_with_ratings = recipes.join(
       ratings,
       left_on='id',
       right_on='recipe_id',
       how='inner',
   )

   # Average rating and rating count per recipe
   avg_ratings = recipes_with_ratings.group_by('id').agg([
       pl.col('rating').mean().alias('avg_rating'),
       pl.col('rating').count().alias('num_ratings'),
   ])

**Cache management:**

.. code-block:: python

   from datetime import datetime

   import streamlit as st

   # Force a programmatic reload
   st.cache_data.clear()

   # Reload fresh data
   recipes = get_recipes_clean()

   # Display cache info
   st.info(f"Cache TTL: 1 hour. Last update: {datetime.now()}")

Performance
^^^^^^^^^^^

* First load: 5-10 seconds (from S3 Parquet)
* Subsequent loads: < 0.1 second (Streamlit in-memory cache)
* Gain: 50-100x on repeated navigation

To force a reload:

1. Streamlit menu → "Clear cache"
2. Reload the page

Memory Optimization
^^^^^^^^^^^^^^^^^^^

Data is loaded with **Polars** (columnar format) for:

* Reduced memory footprint vs. Pandas
* 5-10x faster filtering and aggregation
* Lazy evaluation for complex transformations

Conversion to Pandas if needed:

.. code-block:: python

   recipes_pd = recipes.to_pandas()  # Polars → Pandas

Troubleshooting
^^^^^^^^^^^^^^^

**Error: "No S3 credentials"**

Solution: verify that the ``96_keys/credentials`` file exists and is in valid INI format.

**See**: :doc:`/s3` for the complete S3 configuration.

**Error: "Cache data too large"**

If the app consumes too much memory:

1. Reduce the cache TTL in code: ``@st.cache_data(ttl=1800)`` (30 min)
2. Filter data before caching
3. Increase server RAM (currently 32 GB on dataia)

**Slow loading (> 30 seconds)**

Possible causes:

1. Slow S3 connection → check the DNAT bypass (:doc:`/s3`)
2. First load (cache creation) is expected to be slow
3. Expired cache → data is reloaded every hour

**Missing columns in the DataFrame**

Some columns are computed during preprocessing:

* ``season``: derived from ``submitted`` (month → season)
* ``day_of_week``: derived from ``submitted`` (0-6)
* ``complexity_score``: computed from ``n_steps`` and ``n_ingredients``

If they are missing, verify that the S3 Parquet version is up to date.

Data Source
^^^^^^^^^^^

* **Original dataset**: Food.com (Kaggle)
* **Period**: 1999-2018 (20 years)
* **Preprocessing**: cleaning, enrichment, feature engineering
* **Format**: Snappy-compressed Parquet
* **Storage**: S3 Garage (s3fast.lafrance.io)
* **Total**: ~450 MB compressed, ~2.5 GB uncompressed

**See**: the EDA project documentation (``00_eda/``) for preprocessing details.
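As a reference for the feature engineering mentioned above, the calendar columns ``day_of_week`` and ``season`` can be recomputed from ``submitted`` with Polars. The snippet below is an illustrative sketch only; the actual preprocessing lives in ``00_eda/`` and its exact season mapping may differ.

.. code-block:: python

   import polars as pl

   def add_calendar_features(recipes: pl.DataFrame) -> pl.DataFrame:
       """Sketch: recompute day_of_week and season from `submitted`."""
       month = pl.col("submitted").dt.month()
       return recipes.with_columns(
           # Polars weekday() returns 1 (Monday) .. 7 (Sunday); shift to 0-6
           (pl.col("submitted").dt.weekday() - 1).alias("day_of_week"),
           # Assumed meteorological season mapping (Northern hemisphere)
           pl.when(month.is_in([12, 1, 2])).then(pl.lit("Winter"))
             .when(month.is_in([3, 4, 5])).then(pl.lit("Spring"))
             .when(month.is_in([6, 7, 8])).then(pl.lit("Summer"))
             .otherwise(pl.lit("Autumn"))
             .alias("season"),
       )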