🌍 Simple Disaster COG Processing

This simplified notebook converts disaster satellite imagery to Cloud Optimized GeoTIFFs (COGs) with just a few cells.

✨ Features

  • See files first - List S3 files before configuring
  • Smart configuration - Define filename functions after seeing actual files
  • Auto-discovery - Automatically categorizes your files
  • Simple processing - Just run the cells in order

🚀 Launch in Disasters-Hub JupyterHub (requires access)

To obtain credentials for the VEDA Hub, follow this link for more information.

Disclaimer: It is highly recommended to run this tutorial within the NASA VEDA JupyterHub, which already includes functions for processing and visualizing data specific to VEDA stories. Running the tutorial outside the VEDA JupyterHub may lead to errors, particularly with EarthData authentication. It is also recommended to use the Pangeo workspace within the VEDA JupyterHub, since several packages relevant to this tutorial are already installed there.

If you do not have a VEDA JupyterHub account, you can launch this notebook in your local environment using MyBinder by clicking the icon below.


Binder

🔧 Step 0: Setup - Clone Required Repository

Run this cell first! This notebook requires the disasters-aws-conversion repository for processing functions.

import os
import subprocess

# Check if disasters-aws-conversion exists, if not clone it
repo_name = "disasters-aws-conversion"
repo_url = "https://github.com/Disasters-Learning-Portal/disasters-aws-conversion.git"

if not os.path.exists(repo_name):
    print(f"πŸ“₯ Cloning {repo_name} repository...")
    try:
        result = subprocess.run(
            ["git", "clone", repo_url, f"{repo_name}"],
            capture_output=True,
            text=True,
            check=True
        )
        print(f"βœ… Successfully cloned {repo_name}")
    except subprocess.CalledProcessError as e:
        print(f"❌ Error cloning repository: {e.stderr}")
else:
    print(f"βœ… {repo_name} repository already exists")

📋 Step 1: Basic Configuration

Set your event details and S3 paths:

# ========================================
# INPUTS
# ========================================

# S3 Paths
BUCKET = 'nasa-disasters'    # S3 bucket (DO NOT CHANGE)
DESTINATION_BASE = 'drcs_activations_new'  # Where to save COGs in S3 bucket (DO NOT CHANGE)
GEOTIFF_DIR = 'drcs_activations' # This is where all non-converted files should be placed


# Event Details
EVENT_NAME = '202510_Flood_AK'  # Event name: YYYYMM_Hazard_Location (e.g., 202510_Flood_AK)
SUB_PRODUCT_NAME = 'aria'         # Sub-directory within the event (RGB, trueColor, SWIR, etc.). Can be left blank to read from the event directory itself.
SOURCE_PATH = f'{GEOTIFF_DIR}/{EVENT_NAME}/{SUB_PRODUCT_NAME}'      # Where your files are


# Processing Options
OVERWRITE = False      # Set to True to replace existing files
VERIFY = True          # Set to True to verify results after processing
SAVE_RESULTS = True    # Set to False to skip saving results CSV to /output directory

print(f"Event: {GEOTIFF_DIR}")
print(f"Source: s3://{BUCKET}/{SOURCE_PATH}")
print(f"Destination: s3://{BUCKET}/{DESTINATION_BASE}/")
Event: 202510_Flood_AK
Source: s3://nasa-disasters/drcs_activations/202510_Flood_AK/aria
Destination: s3://nasa-disasters/drcs_activations_new/
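
Before connecting to S3, an optional sanity check can catch configuration slips early. This is only a sketch using the variables defined above; note that SOURCE_PATH ends with a slash when SUB_PRODUCT_NAME is left blank, which may be intentional.

# Optional: flag empty or trailing-slash path components before touching S3
for name, value in [('BUCKET', BUCKET), ('EVENT_NAME', EVENT_NAME), ('SOURCE_PATH', SOURCE_PATH)]:
    if not value or value.endswith('/'):
        print(f"⚠️ Check {name}: {value!r}")
    else:
        print(f"✅ {name}: {value}")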

πŸ” Step 2: Connect to S3 and List Files

Let’s see what files are available before configuring filename transformations:

# Import necessary modules
import sys
import os
from pathlib import Path

# Add parent directory to path for importing functions
sys.path.insert(0, str(Path('..').resolve()))

# Import S3 operations
from core.s3_operations import (
    initialize_s3_client,
    list_s3_files,
    get_file_size_from_s3
)

# Initialize S3 client
print("🌐 Connecting to S3...")
s3_client, _ = initialize_s3_client(bucket_name=BUCKET, verbose=False)

if s3_client:
    print("βœ… Connected to S3\n")
    
    # List all TIF files
    print(f"πŸ“‚ Files in s3://{BUCKET}/{SOURCE_PATH}:")
    print("="*60)
    
    files = list_s3_files(s3_client, BUCKET, SOURCE_PATH, suffix='.tif')
    
    if files:
        print(f"Found {len(files)} .tif files:\n")
        for i, file_path in enumerate(files[:10], 1):  # Show first 10
            filename = os.path.basename(file_path)
            try:
                size_gb = get_file_size_from_s3(s3_client, BUCKET, file_path)
                print(f"{i:2}. {filename:<60} ({size_gb:.2f} GB)")
            except Exception:
                print(f"{i:2}. {filename}")
        
        if len(files) > 10:
            print(f"\n... and {len(files) - 10} more files")
        
        print("\n" + "="*60)
        print("\nπŸ’‘ Use this information to create filename functions in Step 3")
    else:
        print("⚠️ No .tif files found in the specified path.")
        print("   Check your SOURCE_PATH configuration.")
else:
    print("❌ Could not connect to S3. Check your AWS credentials.")
    files = []
🌐 Connecting to S3...
✅ Connected to S3

📂 Files in s3://nasa-disasters/drcs_activations/202510_Flood_AK/aria:
============================================================
Found 6 .tif files:

 1. OPERA_L3_DSWX-S1_V1_WTR_2025-10-08_mosaic.tif                (0.01 GB)
 2. OPERA_L3_DSWX-S1_V1_WTR_2025-10-10_mosaic.tif                (0.03 GB)
 3. OPERA_L3_DSWX-S1_V1_WTR_2025-10-12_mosaic.tif                (0.03 GB)
 4. OPERA_L3_DSWX-S1_V1_WTR_2025-10-15_mosaic.tif                (0.02 GB)
 5. OPERA_L3_DSWX-HLS_V1_WTR_2025-10-08_mosaic.tif               (0.01 GB)
 6. OPERA_L3_DSWX-HLS_V1_WTR_2025-10-13_mosaic.tif               (0.01 GB)

============================================================

💡 Use this information to create filename functions in Step 3
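
Before writing regexes in Step 3, it can help to tally the recurring tokens in the filenames you just listed. A small sketch (reusing the `files` list from the cell above) that counts underscore-separated tokens, skipping date-like ones, so product keywords stand out:

from collections import Counter

# Tally underscore-separated filename tokens; frequent non-date tokens
# are good candidates for CATEGORIZATION_PATTERNS in Step 3.
token_counts = Counter()
for file_path in files:
    stem = os.path.splitext(os.path.basename(file_path))[0]
    token_counts.update(t for t in stem.split('_') if not t[:1].isdigit())

for token, count in token_counts.most_common(10):
    print(f"{token:<20} {count}")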

🏷️ Step 3a: Define Categorization and Filename Transformations

Based on the files you see above, configure:

  1. Categorization patterns - regex patterns that identify file types
  2. Filename functions - how to transform filenames
  3. Output directories - where each category should be saved

# ========================================
# CATEGORIZATION AND OUTPUT CONFIGURATION
# ========================================

import re

# STEP 1: Define how to extract dates from filenames
def extract_date_from_filename(filename):
    """Extract date from filename in YYYYMMDD format."""
    dates = re.findall(r'\d{8}', filename)
    if dates:
        date_str = dates[0]
        return f"{date_str[0:4]}-{date_str[4:6]}-{date_str[6:8]}"
    return None

# STEP 2: Define filename transformation functions for each category
def create_truecolor_filename(original_path, event_name):
    """Create filename for trueColor products."""
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    date = extract_date_from_filename(stem)
    
    if date:
        stem_clean = re.sub(r'_\d{8}', '', stem)
        return f"{event_name}_{stem_clean}_{date}_day.tif"
    return f"{event_name}_{stem}_day.tif"

def create_colorinfrared_filename(original_path, event_name):
    """Create filename for colorInfrared products."""
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    date = extract_date_from_filename(stem)
    
    if date:
        stem_clean = re.sub(r'_\d{8}', '', stem)
        return f"{event_name}_{stem_clean}_{date}_day.tif"
    return f"{event_name}_{stem}_day.tif"

def create_naturalcolor_filename(original_path, event_name):
    """Create filename for naturalColor products."""
    filename = os.path.basename(original_path)
    stem = os.path.splitext(filename)[0]
    date = extract_date_from_filename(stem)
    
    if date:
        stem_clean = re.sub(r'_\d{8}', '', stem)
        return f"{event_name}_{stem_clean}_{date}_day.tif"
    return f"{event_name}_{stem}_day.tif"

# STEP 3: Configure categorization patterns (REQUIRED)
# These regex patterns determine which files belong to which category
CATEGORIZATION_PATTERNS = {
    'trueColor': r'trueColor|truecolor|true_color',
    'colorInfrared': r'colorInfrared|colorIR|color_infrared',
    'naturalColor': r'naturalColor|natural_color',
    # Add patterns for ALL file types you want to process
    # Files not matching any pattern will be skipped with a warning
}

# STEP 4: Map categories to filename transformation functions
FILENAME_CREATORS = {
    'trueColor': create_truecolor_filename,
    'colorInfrared': create_colorinfrared_filename,
    'naturalColor': create_naturalcolor_filename,
    # Must have an entry for each category in CATEGORIZATION_PATTERNS
}

# STEP 5: Specify output directories for each category
OUTPUT_DIRS = {
    'trueColor': 'Landsat/trueColor',
    'colorInfrared': 'Landsat/colorIR',
    'naturalColor': 'Landsat/naturalColor',
    # Must have an entry for each category in CATEGORIZATION_PATTERNS
}

# OPTIONAL: Specify no-data values (None = auto-detect)
NODATA_VALUES = {
    'trueColor': 0,
    'colorInfrared': 0,
    'naturalColor': 0
    # Leave empty or set to None for auto-detection
}
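
Note that `extract_date_from_filename` above only matches compact YYYYMMDD dates, while the OPERA files listed in Step 2 embed dashed dates (e.g., 2025-10-08). A hedged variant you could swap in if your files use either form:

def extract_date_from_filename_flexible(filename):
    """Extract a date as YYYY-MM-DD from YYYY-MM-DD or YYYYMMDD filenames."""
    # Dashed form first (matches the OPERA_L3_DSWX examples in Step 2)
    match = re.search(r'\d{4}-\d{2}-\d{2}', filename)
    if match:
        return match.group(0)
    # Fall back to the compact YYYYMMDD form
    match = re.search(r'\d{8}', filename)
    if match:
        d = match.group(0)
        return f"{d[0:4]}-{d[4:6]}-{d[6:8]}"
    return None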

🏷️ Step 3b: Test the new functions to preview their inputs and outputs.

print("βœ… Configuration defined")
print(f"\nπŸ“‚ Categories and output paths:")
for category, path in OUTPUT_DIRS.items():
    pattern = CATEGORIZATION_PATTERNS.get(category, 'No pattern defined')
    print(f"   β€’ {category}:")
    print(f"     Pattern: {pattern}")
    print(f"     Output:  {DESTINATION_BASE}/{path}")

# Test with sample filename if files exist
if files:
    sample_file = files[0]
    sample_name = os.path.basename(sample_file)
    
    # Check which category it would match
    matched_category = None
    for cat, pattern in CATEGORIZATION_PATTERNS.items():
        if re.search(pattern, sample_name, re.IGNORECASE):
            matched_category = cat
            break
    
    if matched_category:
        new_name = FILENAME_CREATORS[matched_category](sample_file, EVENT_NAME)
        print(f"\n📝 Example transformation:")
        print(f"   Original: {sample_name}")
        print(f"   Category: {matched_category}")
        print(f"   → New:    {new_name}")
        print(f"   → Output: {DESTINATION_BASE}/{OUTPUT_DIRS[matched_category]}/{new_name}")
    else:
        print(f"\n⚠️ Warning: Sample file doesn't match any category pattern:")
        print(f"   File: {sample_name}")
        print(f"   Add a pattern to CATEGORIZATION_PATTERNS to process this file type")
✅ Configuration defined

📂 Categories and output paths:
   • trueColor:
     Pattern: trueColor|truecolor|true_color
     Output:  drcs_activations_new/Landsat/trueColor
   • colorInfrared:
     Pattern: colorInfrared|colorIR|color_infrared
     Output:  drcs_activations_new/Landsat/colorIR
   • naturalColor:
     Pattern: naturalColor|natural_color
     Output:  drcs_activations_new/Landsat/naturalColor
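
The cell above only tests the first file. A short follow-up sketch (reusing `files` and `CATEGORIZATION_PATTERNS` from earlier cells) that checks every listed file and flags anything that would be skipped as uncategorized:

# Check every listed file against the categorization patterns
uncategorized = []
for file_path in files:
    name = os.path.basename(file_path)
    if not any(re.search(p, name, re.IGNORECASE) for p in CATEGORIZATION_PATTERNS.values()):
        uncategorized.append(name)

if uncategorized:
    print(f"⚠️ {len(uncategorized)} file(s) match no pattern and would be skipped:")
    for name in uncategorized:
        print(f"   - {name}")
else:
    print("✅ Every listed file matches a category pattern")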

🚀 Step 4: Initialize Processor and Preview

Now let’s set up the processor and preview all transformations:

# Import our simplified helper
from notebooks.notebook_helpers import SimpleProcessor

# Create full configuration with categorization patterns
config = {
    'event_name': EVENT_NAME,
    'bucket': BUCKET,
    'source_path': SOURCE_PATH,
    'destination_base': DESTINATION_BASE,
    'overwrite': OVERWRITE,
    'verify': VERIFY,
    'save_results': SAVE_RESULTS,  # Add save results flag
    'categorization_patterns': CATEGORIZATION_PATTERNS,  # IMPORTANT: Include patterns
    'filename_creators': FILENAME_CREATORS,
    'output_dirs': OUTPUT_DIRS,
    'nodata_values': NODATA_VALUES
}

# Initialize processor
processor = SimpleProcessor(config)

# Connect to S3 (already connected, but needed for processor)
if processor.connect_to_s3():
    print("βœ… Processor ready\n")
    
    # Discover and categorize files
    num_files = processor.discover_files()
    
    if num_files > 0:
        # Show preview of transformations
        processor.preview_processing()
        
        print("\nπŸ“Œ Review the transformations above.")
        print("   β€’ Files will be saved to the directories specified in OUTPUT_DIRS")
        print("   β€’ If files appear as 'uncategorized', add patterns to CATEGORIZATION_PATTERNS")
        print("   β€’ When ready, proceed to Step 5 to process the files.")
    else:
        print("⚠️ No files found to process.")
else:
    print("❌ Could not initialize processor.")
✅ All modules loaded successfully

🌐 Connecting to S3...
✅ Connected to S3 successfully
✅ Processor ready


🔍 Searching for files in: geotiffs_to_convert/202408_TropicalStorm_Debby/landsat8
⚠️ No .tif files found
⚠️ No files found to process.

βš™οΈ Step 5: Process Files

Run this cell to start processing all files:

# Process all files
if 'num_files' in locals() and num_files > 0:
    print("🚀 Starting processing...")
    print("This may take several minutes depending on file sizes.\n")
    
    # Process everything
    results = processor.process_all()
    
    # Display results
    if not results.empty:
        print("\n📊 Processing Complete!")
        try:
            display(results)  # Rich table when running in Jupyter
        except NameError:
            print(results)
else:
    print("⚠️ No files to process. Complete Steps 1-4 first.")
🚀 Starting processing...
This may take several minutes depending on file sizes.


🚀 Starting processing...

📦 Processing colorInfrared (3 files)
  ⏭️ Skipped: LC08_colorInfrared_20240715_155319_016036.tif (exists)
  ⏭️ Skipped: LC08_colorInfrared_20240715_155343_016037.tif (exists)
  ⏭️ Skipped: LC08_colorInfrared_20240715_15547_016038.tif (exists)

📦 Processing naturalColor (3 files)
  ⏭️ Skipped: LC08_naturalColor_20240715_155319_016036.tif (exists)
  ⏭️ Skipped: LC08_naturalColor_20240715_155343_016037.tif (exists)
  ⏭️ Skipped: LC08_naturalColor_20240715_15547_016038.tif (exists)

📦 Processing trueColor (3 files)
  ⏭️ Skipped: LC08_trueColor_20240715_155319_016036.tif (exists)
  ⏭️ Skipped: LC08_trueColor_20240715_155343_016037.tif (exists)
  ⏭️ Skipped: LC08_trueColor_20240715_15547_016038.tif (exists)

============================================================
✅ PROCESSING COMPLETE
============================================================

Results:
  ⏭️ Skipped: 9

Processing time: 0.0 minutes

πŸ“ Results saved to: output/202408_TropicalStorm_Debby/results_20250929_191143.csv
============================================================

📊 Processing Complete!
                                     source_file       category   status  \
0  LC08_colorInfrared_20240715_155319_016036.tif  colorInfrared  skipped   
1  LC08_colorInfrared_20240715_155343_016037.tif  colorInfrared  skipped   
2   LC08_colorInfrared_20240715_15547_016038.tif  colorInfrared  skipped   
3   LC08_naturalColor_20240715_155319_016036.tif   naturalColor  skipped   
4   LC08_naturalColor_20240715_155343_016037.tif   naturalColor  skipped   
5    LC08_naturalColor_20240715_15547_016038.tif   naturalColor  skipped   
6      LC08_trueColor_20240715_155319_016036.tif      trueColor  skipped   
7      LC08_trueColor_20240715_155343_016037.tif      trueColor  skipped   
8       LC08_trueColor_20240715_15547_016038.tif      trueColor  skipped   

           reason                                        output_path  \
0  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
1  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
2  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
3  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
4  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
5  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
6  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
7  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   
8  already exists  s3://nasa-disasters/drcs_activations_new/Lands...   

   time_seconds  
0             0  
1             0  
2             0  
3             0  
4             0  
5             0  
6             0  
7             0  
8             0  
📊 Step 6: Review Results

Run this cell to summarize the processing results:

# Analyze results
if 'results' in locals() and not results.empty:
    print("πŸ“Š PROCESSING STATISTICS")
    print("="*40)
    
    # Success rate
    total = len(results)
    success = len(results[results['status'] == 'success'])
    failed = len(results[results['status'] == 'failed'])
    skipped = len(results[results['status'] == 'skipped'])
    
    print(f"Total files: {total}")
    print(f"βœ… Success: {success}")
    print(f"❌ Failed: {failed}")
    print(f"⏭️ Skipped: {skipped}")
    print(f"\nSuccess rate: {(success/total*100):.1f}%")
    
    # Failed files
    if failed > 0:
        print("\n❌ Failed files:")
        failed_df = results[results['status'] == 'failed']
        for idx, row in failed_df.iterrows():
            print(f"  - {row['source_file']}: {row.get('error', 'Unknown error')}")
    
    # Processing times
    if 'time_seconds' in results.columns:
        success_df = results[results['status'] == 'success']
        if not success_df.empty:
            avg_time = success_df['time_seconds'].mean()
            max_time = success_df['time_seconds'].max()
            print(f"\n⏱️ Timing:")
            print(f"Average: {avg_time:.1f} seconds per file")
            print(f"Slowest: {max_time:.1f} seconds")
else:
    print("No results to analyze. Run Step 5 first.")

💡 Tips & Troubleshooting

Workflow Summary:

  1. Setup - Clone disasters-aws-conversion repository (Step 0)
  2. Configure basic settings (Step 1)
  3. List files from S3 to see naming patterns (Step 2)
  4. Define functions to transform filenames (Step 3)
  5. Preview transformations (Step 4)
  6. Process all files (Step 5)
  7. Review results (Step 6)

Common Issues:

  1. "ModuleNotFoundError: No module named 'core'" or import errors
    • Run Step 0 first to clone the disasters-aws-conversion repository
    • Restart the kernel and run all cells from the beginning
  2. "No files found"
    • Check SOURCE_PATH in Step 1
    • Verify bucket permissions
    • Ensure files have a .tif extension
  3. Wrong filename transformations
    • Review the actual filenames in Step 2
    • Adjust the functions in Step 3
    • Re-run Step 4 to preview
  4. Files being skipped
    • Files already exist in the destination
    • Set OVERWRITE = True in Step 1 and re-run (see the sketch after this list)
  5. Processing errors
    • Check AWS credentials
    • Verify S3 write permissions
    • Check available disk space for temp files

Need More Control?

Use the full template at disaster_processing_template.ipynb for:

  • Manual chunk configuration
  • Custom compression settings
  • Detailed memory management
  • Advanced processing options
