Dependency Management
Building complex computational graphs with bead's input system
Understanding Dependencies
In bead, dependencies are explicit connections between computational units. When bead B depends on bead A, it means B uses A’s outputs as inputs.
The Input System
Basic Dependency Addition
# Add a dependency
$ bead input add processed-data
# What happens:
# 1. Searches all bead boxes for 'processed-data'
# 2. Finds latest version
# 3. Extracts outputs to input/processed-data/
# 4. Records dependency in .bead-meta/
Input Directory Structure
my-analysis/
├── input/
│   ├── processed-data/        # From one bead
│   │   ├── clean.csv
│   │   └── README.md
│   └── model-parameters/      # From another bead
│       └── config.json
├── output/
├── src/
└── temp/
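A script in src/ reads these inputs with paths relative to the workspace root. A minimal sketch, using the file names from the tree above (the output file name is illustrative):
import json

import pandas as pd

# Paths mirror the input/ tree shown above; run from the workspace root.
clean = pd.read_csv('input/processed-data/clean.csv')

with open('input/model-parameters/config.json') as f:
    params = json.load(f)

# Results always go to output/, which downstream beads will see as their input.
clean.to_csv('output/result.csv', index=False)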
Complete Input Commands
Adding Dependencies
# Basic add (finds latest version)
$ bead input add survey-responses
# Add with custom name
$ bead input add responses survey-2024
# Add specific version
$ bead input add survey-responses --time 20250730T120000+0200
# Add from specific file
$ bead input add model-output /path/to/model_20250730.zip
Loading and Unloading
Save disk space by loading/unloading large inputs:
# Unload to free space (keeps dependency definition)
$ bead input unload large-dataset
$ ls input/
# large-dataset folder gone
# Load when needed again
$ bead input load large-dataset
# Data restored from bead box
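To decide which inputs are worth unloading, a short helper can report how much disk space each loaded input takes. A minimal sketch that assumes only the standard input/ layout:
import os

def dir_size(path):
    # Total size in bytes of all files under path.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )

# Largest inputs first -- candidates for `bead input unload`.
sizes = {d: dir_size(os.path.join('input', d)) for d in os.listdir('input')}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size / 1e6:10.1f} MB  {name}")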
Updating Dependencies
Keep inputs current as upstream beads evolve:
# Update single input to latest
$ bead input update processed-data
# Update all inputs
$ bead input update
# See what would update without changing
$ bead input update --dry-run
Advanced Operations
# Remap to different source
$ bead input map old-data new-cleaned-data
# Delete dependency entirely
$ bead input delete test-data
# List all current inputs
$ bead input list
Version Management
Understanding Versions
Each bead save creates a new version with a timestamp:
survey-data_20250729T100000+0200.zip # Version 1
survey-data_20250730T100000+0200.zip # Version 2
survey-data_20250730T150000+0200.zip # Version 3
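The suffix is the save timestamp, so archive names can be ordered chronologically once parsed. A small helper, sketched under the assumption that names follow the pattern shown above:
from datetime import datetime

def version_of(archive_name):
    # 'survey-data_20250730T150000+0200.zip' -> timezone-aware datetime
    stamp = archive_name.rsplit('_', 1)[1].removesuffix('.zip')
    return datetime.strptime(stamp, '%Y%m%dT%H%M%S%z')

archives = ['survey-data_20250729T100000+0200.zip',
            'survey-data_20250730T150000+0200.zip']
latest = max(archives, key=version_of)   # -> the 20250730T150000 archive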
Pinning Versions
# Always use latest (default)
$ bead input add survey-data
# Pin to specific time
$ bead input add survey-data --time 20250730T100000+0200
# Update to previous version
$ bead input update --prev survey-data
Version Conflicts
When your input is outdated:
$ bead input update
Updating processed-data:
Current: processed-data_20250729T100000+0200.zip
Latest: processed-data_20250730T150000+0200.zip
Update? [y/N]: y
Complex Dependency Patterns
Multiple Dependencies
$ bead new multi-source-analysis
$ cd multi-source-analysis
# Add multiple data sources
$ bead input add customer-data
$ bead input add transaction-logs
$ bead input add product-catalog
Use them in your analysis script (analyze.py):
import pandas as pd
customers = pd.read_csv('input/customer-data/customers.csv')
transactions = pd.read_csv('input/transaction-logs/logs.csv')
products = pd.read_csv('input/product-catalog/products.csv')
# Merge and analyze...
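Continuing the script above, the merge step might look like this (customer_id and product_id are assumed join keys, not guaranteed column names):
# Hypothetical join keys -- adjust to the real schemas.
merged = (transactions
          .merge(customers, on='customer_id', how='left')
          .merge(products, on='product_id', how='left'))
merged.to_csv('output/merged.csv', index=False)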
Dependency Chains
Build pipelines where each step depends on the previous one:
# Step 1: Raw data
$ bead new raw-sensor-readings
# ... download data ...
$ bead save data-lake
# Step 2: Cleaning
$ bead new clean-sensor-data
$ bead input add raw-sensor-readings
# ... clean data ...
$ bead save processed
# Step 3: Analysis
$ bead new sensor-analysis
$ bead input add clean-sensor-data
# ... analyze ...
$ bead save results
# Step 4: Visualization
$ bead new dashboard
$ bead input add sensor-analysis
# ... create plots ...
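Each step is an ordinary script that reads from input/ and writes to output/. A sketch of the cleaning step (step 2), with illustrative file names:
import pandas as pd

# File names are illustrative; use whatever the raw bead actually outputs.
raw = pd.read_csv('input/raw-sensor-readings/readings.csv')
clean = raw.dropna().drop_duplicates()
clean.to_csv('output/readings_clean.csv', index=False)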
Branching Dependencies
One bead can be input to many:
            ┌─→ regional-analysis
            │
base-data ──┼─→ temporal-analysis
            │
            └─→ cohort-analysis
Implementation:
# Each analysis starts with same base
$ bead new regional-analysis
$ bead input add base-data
$ bead new temporal-analysis
$ bead input add base-data
$ bead new cohort-analysis
$ bead input add base-data
Managing Large Dependencies
Selective Loading with -x Flag
For beads with large outputs:
# Development: don't load outputs
$ bead develop large-model-results
# When you need to inspect outputs
$ bead develop -x large-model-results
Partial Dependencies
When you only need some files:
import os

import pandas as pd

# In your code, check what's available
if os.path.exists('input/large-data/subset.csv'):
    # Use subset for development
    data = pd.read_csv('input/large-data/subset.csv')
else:
    # Full data in production
    data = pd.read_csv('input/large-data/full.csv')
Troubleshooting Dependencies
Missing Dependencies
$ python analyze.py
FileNotFoundError: input/model-output/predictions.csv
# Solution 1: Load the input
$ bead input load model-output
# Solution 2: Check if input is defined
$ cat .bead-meta/bead | grep model-output
Wrong Version Loaded
# Check current version
$ ls -la input/processed-data/
# Check timestamp in filename
# Update to latest
$ bead input update processed-data
# Or pin to specific version
$ bead input map processed-data processed-data_20250730T120000+0200.zip
Circular Dependencies
bead prevents circular dependencies:
$ bead input add analysis-b
ERROR: Circular dependency detected:
analysis-a → analysis-b → analysis-a
Solution: Refactor into a proper DAG:
# Extract common elements
$ bead new shared-preprocessing
# Both analyses can depend on this
Best Practices
1. Descriptive Input Names
# Match input name to source bead
$ bead input add customer-demographics
# Clear names in code
demographics = pd.read_csv('input/customer-demographics/data.csv')
2. Document Dependencies
In your README:
## Dependencies
This bead requires:
- `survey-responses`: Raw survey data (v2024-07-30 or later)
- `census-data`: Population statistics for weighting
3. Test with Different Versions
# Test with latest
$ bead input update
$ make test
# Test with production version
$ bead input map survey-data survey-data_20250701T000000+0200.zip
$ make test
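Tests survive version switches best when they check properties of the input rather than exact values. A sketch, assuming pytest and a hypothetical data.csv with a respondent_id column:
# test_inputs.py -- the same checks should pass for any acceptable version
import pandas as pd

def test_survey_data_schema():
    df = pd.read_csv('input/survey-data/data.csv')
    assert 'respondent_id' in df.columns
    assert df['respondent_id'].is_unique
    assert len(df) > 0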
4. Handle Missing Inputs Gracefully
import os
import sys

required_inputs = ['survey-data', 'model-parameters']

for inp in required_inputs:
    if not os.path.exists(f'input/{inp}'):
        print(f"ERROR: Required input '{inp}' not loaded")
        print(f"Run: bead input load {inp}")
        sys.exit(1)
Advanced Patterns
Conditional Dependencies
import json
import os

import pandas as pd

# config.json specifies which inputs to use
with open('config.json') as f:
    config = json.load(f)

if config['use_external_data']:
    if not os.path.exists('input/external-source'):
        raise ValueError("External data not loaded")
    external = pd.read_csv('input/external-source/data.csv')
Dependency Versioning in Analysis
Track which versions produced results:
# Record input versions in output
import json
import os

versions = {}
for input_dir in os.listdir('input'):
    # Placeholder: look the version up in the workspace metadata
    # (e.g. via `bead input list`) or in your own records.
    versions[input_dir] = get_input_version(input_dir)

with open('output/input-versions.json', 'w') as f:
    json.dump(versions, f, indent=2)
Multi-Stage Processing
# Stage 1: Quick prototype with subset
$ bead input add data-subset
$ python prototype.py
$ bead save prototype
# Stage 2: Full analysis with complete data
$ bead input delete data-subset
$ bead input add data-complete
$ python full_analysis.py
$ bead save final
Ready to collaborate? Continue to Team Collaboration to learn how teams work together with bead.