Core Concepts

Understanding bead's fundamental principles and design

The bead Philosophy

Research workflows are complex and hard to reproduce. Data files get moved, code gets updated, team members leave, and suddenly results can’t be recreated. bead solves this with a simple approach: package everything needed to recreate a result into one self-contained unit.

This creates reproducible computational workflows without forcing you to change how you work.

The Fundamental Pattern

Output is created by applying code to inputs:

output = code(*inputs)

Every bead follows this pattern:

Inputs: The data you need (bead tracks exactly which version)
Code: Your scripts, notebooks, whatever (bead saves all of it)
Output: The results you created (bead packages it up nicely)
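The pattern can be read quite literally as a function call. The sketch below is only an illustration: `make_output` and `combine` are hypothetical names, not part of bead, and `combine` stands in for whatever analysis code you actually run.

```python
from typing import Callable, Sequence


def make_output(code: Callable[..., str], inputs: Sequence[str]) -> str:
    """Apply the code to the inputs, mirroring output = code(*inputs)."""
    return code(*inputs)


def combine(*inputs: str) -> str:
    """Hypothetical stand-in for an analysis script: joins its inputs."""
    return "+".join(inputs)


result = make_output(combine, ["raw-a", "raw-b"])
rerun = make_output(combine, ["raw-a", "raw-b"])
print(result)  # raw-a+raw-b
```

Because the same code applied to the same inputs gives the same result, recording the code and the exact input versions is enough to recreate the output later.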

Key Concepts

1. Immutable Computational Snapshots

When you save a bead, it becomes an immutable archive containing:

All code needed to run the computation
References to exact input data versions
Generated output files
Metadata about creation time and dependencies

Think of it as a computational snapshot: you can always return to it and recreate exactly the same results.

2. Workspace vs Archive

Workspace (Active Development)

Directory where you actively work on analysis
You can modify files, run code, test ideas
Temporary state during development

Archive (Saved bead)

Immutable .zip file stored in a bead box
Timestamped and content-verified
The permanent, shareable record of your computation

# Workspace: active development
$ cd my-analysis/
$ python analyze.py  # Modify and test

# Archive: frozen snapshot  
$ bead save results
# Creates: my-analysis_20250730T120000+0200.zip

3. Content-Based Verification

Every file in a bead has a cryptographic hash. When you load dependencies, bead verifies you have the exact same files that produced the original results.

$ bead input add processed-data
# bead verifies:
# - Correct version exists
# - Content matches hash
# - No corruption occurred
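The idea behind content-based verification can be sketched in a few lines of Python. This is an illustration under assumptions, not bead's implementation: the page does not say which hash algorithm bead uses or how it stores hashes, so SHA-256 and the `expected` mapping of relative path to digest are choices made for this example, and `file_digest` and `verify_files` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(root: Path, expected: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content differs.

    `expected` maps relative paths to digests (an assumed layout).
    """
    mismatches = []
    for relative_path, expected_digest in expected.items():
        target = root / relative_path
        if not target.is_file() or file_digest(target) != expected_digest:
            mismatches.append(relative_path)
    return mismatches
```

An empty list from `verify_files` means every listed file is present and byte-for-byte identical to the recorded version.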

4. Directed Acyclic Graphs (DAGs)

beads form dependency graphs:

raw-data
    ↓
cleaning-step
    ↓
analysis → paper
    ↓
figures

Rules (not enforced by bead, but good practice):

Dependencies flow in one direction
No circular dependencies
Each node is independently reproducible
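Since bead does not enforce these rules, you may want to check them yourself. The sketch below uses Python's standard `graphlib` module on the graph drawn above; the `dependencies` mapping and the `build_order` helper are written for this example and are not something bead provides.

```python
from graphlib import CycleError, TopologicalSorter

# Each bead maps to the set of beads it takes as inputs.
dependencies = {
    "raw-data": set(),
    "cleaning-step": {"raw-data"},
    "analysis": {"cleaning-step"},
    "paper": {"analysis"},
    "figures": {"analysis"},
}


def build_order(graph: dict[str, set[str]]) -> list[str]:
    """Return an order in which every bead comes after its inputs.

    Raises ValueError if the graph contains a circular dependency.
    """
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {error.args[1]}") from error


order = build_order(dependencies)
print(order)
```

The returned order is one valid sequence for rebuilding the whole graph from scratch; a `ValueError` signals that two beads depend on each other, directly or indirectly.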

Design Principles

1. Immutability

Once saved, beads never change. New work creates new versions:

$ bead save results  # Creates v1: analysis_20250730T120000.zip
# Make changes...
$ bead save results  # Creates v2: analysis_20250730T130000.zip
# Both versions preserved forever
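Because every saved version carries its timestamp in the file name, versions can be told apart and ordered without opening them. The sketch below assumes the `<name>_<timestamp>.zip` naming seen in the examples on this page, with optional microseconds and UTC offset; the exact format bead uses may differ, and `parse_archive_name` and `newest` are hypothetical helpers.

```python
import re
from datetime import datetime

# Assumed naming scheme, inferred from the examples on this page:
# <name>_<YYYYMMDD>T<HHMMSS>[<microseconds>][<UTC offset>].zip
ARCHIVE_NAME = re.compile(
    r"^(?P<name>.+)_(?P<base>\d{8}T\d{6})"
    r"(?P<micro>\d{6})?(?P<offset>[+-]\d{4})?\.zip$"
)


def parse_archive_name(file_name: str) -> tuple[str, datetime]:
    """Split an archive file name into bead name and save time."""
    match = ARCHIVE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized archive name: {file_name}")
    stamp, fmt = match["base"], "%Y%m%dT%H%M%S"
    if match["micro"]:
        stamp, fmt = stamp + match["micro"], fmt + "%f"
    if match["offset"]:
        stamp, fmt = stamp + match["offset"], fmt + "%z"
    return match["name"], datetime.strptime(stamp, fmt)


def newest(file_names: list[str]) -> str:
    """Return the most recently saved archive among names of one format."""
    return max(file_names, key=lambda name: parse_archive_name(name)[1])
```

Names with and without a UTC offset should not be mixed in one call to `newest`, since Python cannot compare timezone-aware and naive timestamps.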

2. Explicit Dependencies

No hidden data sources or magic file paths:

# ❌ Bad: Hidden dependency
data = pd.read_csv("/shared/data/important.csv")

# ✅ Good: Explicit bead input
data = pd.read_csv("input/validated-data/important.csv")
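One way to keep every read explicit is to route file access through a small helper that only looks inside the workspace's `input/` directory. The helper below is a hypothetical convenience, not part of bead; it assumes the `input/<input-name>/<file>` layout used in the example above and that scripts run from the workspace root.

```python
from pathlib import Path

INPUT_ROOT = Path("input")  # relative to the workspace root (assumed)


def input_file(input_name: str, file_name: str, root: Path = INPUT_ROOT) -> Path:
    """Return the path of a file inside a loaded input, or fail loudly."""
    path = root / input_name / file_name
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; is the input '{input_name}' loaded?"
        )
    return path
```

With this in place, the good example becomes `pd.read_csv(input_file("validated-data", "important.csv"))`, and a missing input fails with a message that names it.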

3. Tool Agnosticism

bead doesn’t care about your tools:

Use any programming language
Use any data format
Use any execution method

bead only manages files and dependencies.

4. Human-Readable Archives

Even without bead installed, archives are usable:

$ unzip analysis_20250730T120000.zip
$ ls
code/       # Your source code files
data/       # Output data from your analysis
meta/       # bead metadata (bead, manifest)
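Since an archive is an ordinary zip file, it can also be inspected programmatically with standard tools. A minimal sketch using Python's `zipfile` module follows; `summarize_archive` is a hypothetical helper, and the directory names it reports depend entirely on what the archive contains.

```python
import zipfile
from collections import Counter


def summarize_archive(archive_path: str) -> dict[str, int]:
    """Count the files under each top-level entry of a zip archive."""
    counts: Counter[str] = Counter()
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries hold no content
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)
```

Calling `summarize_archive("analysis_20250730T120000.zip")` on an archive laid out as above would return a count for each of `code`, `data`, and `meta`.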

The bead Lifecycle

1. Creation

$ bead new my-analysis
# Empty workspace with standard structure

2. Development

$ cd my-analysis
# Add code, load inputs, generate outputs
$ bead input add upstream-data
$ python process.py

3. Preservation

$ bead save results
# Immutable snapshot created

4. Sharing

# Copy .zip file to collaborator
# They can reproduce exactly
$ bead edit my-analysis_20250730T120000.zip

5. Building Upon

$ bead new follow-up
$ bead input add my-analysis
# Previous outputs become new inputs

What bead Is NOT

Understanding what bead doesn’t do is as important as what it does:

Not a Version Control System

bead tracks computational snapshots, not code evolution
Feel free to use Git for code versioning within beads
bead complements, doesn’t replace, traditional VCS

Not a Workflow Engine

bead doesn’t execute your code
No job scheduling or parallelization
You control execution, bead manages artifacts

Not a Data Store

bead manages references, not data hosting
No cloud storage or synchronization
You manage where bead boxes live

Not a Package Manager

bead doesn’t install software dependencies
Use conda, pip, or system packages
Document environment in your bead
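Documenting the environment can be as simple as writing the interpreter and package versions to a text file that is saved with the bead. The sketch below is one possible approach using only the standard library; `environment_report` is a hypothetical helper, and the list of package names you pass in is up to you.

```python
import platform
from importlib import metadata


def environment_report(packages: list[str]) -> str:
    """Describe the Python version, platform, and selected package versions."""
    lines = [
        f"python {platform.python_version()}",
        f"platform {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name} {version}")
    return "\n".join(lines) + "\n"
```

For example, writing `environment_report(["pandas", "numpy"])` to a file next to your code or outputs keeps the record inside the saved snapshot.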

Common Patterns

Source beads

No inputs, only outputs, because you have to start somewhere:

$ bead new survey-data
$ curl -o output/responses.csv https://survey.com/data
$ bead save my-beads
Successfully stored bead at /Users/you/bead-storage/survey-data_20250909T142438714494+0100.zip.

Processing beads

Transform inputs to outputs:

$ bead new clean-survey
$ bead input add survey-data
$ python clean.py input/survey-data/responses.csv output/clean.csv
$ bead save my-beads
Successfully stored bead at /Users/you/bead-storage/clean-survey_20250909T143500123456+0100.zip.
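The `clean.py` script in this example is whatever cleaning code your project needs. As an illustration only, a minimal version might look like the sketch below; the cleaning rules (strip whitespace, drop blank rows) are assumptions made for this example and are not prescribed by bead.

```python
import csv
import sys


def clean(input_path: str, output_path: str) -> int:
    """Copy a CSV file, stripping whitespace and dropping blank rows.

    Returns the number of data rows written (header excluded).
    """
    written = 0
    with open(input_path, newline="", encoding="utf-8") as source, \
         open(output_path, "w", newline="", encoding="utf-8") as target:
        reader = csv.reader(source)
        writer = csv.writer(target)
        header = next(reader, None)
        if header is None:
            return 0  # empty input produces an empty output file
        writer.writerow([name.strip() for name in header])
        for row in reader:
            cells = [cell.strip() for cell in row]
            if any(cells):  # skip rows in which every cell is empty
                writer.writerow(cells)
                written += 1
    return written


# Run as: python clean.py <input.csv> <output.csv>
if __name__ == "__main__" and len(sys.argv) == 3:
    print(f"wrote {clean(sys.argv[1], sys.argv[2])} rows")
```

The script reads only from a path under `input/` and writes only to a path under `output/`, both given on the command line, which keeps the bead's dependencies explicit.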

Analysis beads

Final computations, often with nothing downstream depending on them:

$ bead new paper-figures
$ bead input add clean-survey
$ R --file=analyze.R
# May never 'bead save' - just share outputs directly

Best Practices

1. One Concept, One bead

Don’t pack unrelated computations together
Split complex pipelines into logical steps
Each bead should have a clear purpose

2. Document Everything

README in every output folder
Explain what the bead does
List any manual steps required

3. Save Frequently

After completing meaningful work
Before making major changes
When sharing with others

4. Use Descriptive Names

# ❌ Bad
$ bead new analysis
$ bead new data

# ✅ Good
$ bead new customer-churn-model
$ bead new survey-2024-responses

Ready to dive deeper? Continue to Dependency Management to learn about building complex computational graphs.