Organizing scientific repositories for reproducibility and rigor

I’ve been writing recently about best practices — or at least, the practices that I keep — for reproducible, rigorous scientific software development during research. In this post, I’m going to share how I organize my scientific repositories.

You know what it’s like to inherit a science repository from someone else… Notebooks called analysis.ipynb with a thousand cells run out of order; scripts called actual_FINAL_analysis_FINAL_DO_NOT_USE.py with no documentation and a mysterious import that cannot be found; a data folder with a lone README that says “data goes here”.

This is a nightmare for reproducibility and rigor.

If you’re just here for a Skill.md for your favorite LLM, click through to my GitHub.


In the past I’ve been a strong advocate for notebook-based science, using well-formed Jupyter notebooks to narrate the story of the research. Increasingly, I do not believe this is the right way to share science (though I still think it’s a fine way to do exploratory analyses; don’t yell at me).

Instead, here’s what I recommend.

Hyperspeed Overview

  • Use a Directed Acyclic Graph (DAG) of small, single-purpose scripts to organize your analysis.
    • Use a workflow engine like Snakemake to orchestrate the execution of these scripts and to track dependencies between them.
    • (Snakemake says, “to make figure 1, you need data.csv, and to make data.csv, you need raw-data.txt” and handles re-computing things when you change code or update data.)
  • Organize everything else!
    • Keep scripts (you can mix languages if necessary!) in analysis/ and reusable code in src/.
    • Organize your files into data/ and results/ so you never mix inputs with results.
  • Assume everything in results/ will be seen by reviewers and readers.
  • Assume everything besides analysis/ and data/provided/ can be deleted at any time and must be regenerable from the code and data in the repo.
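
To make the parenthetical above concrete, here’s the shape of a Snakefile for that toy example (all filenames here are placeholders, not from the tutorial below):

```
# Snakefile (sketch -- hypothetical filenames)
rule make_figure:
    input: "data.csv"
    output: "figure1.png"
    shell: "python analysis/make-figure/run.py"

rule clean_data:
    input: "raw-data.txt"
    output: "data.csv"
    shell: "python analysis/clean-data/run.py"

# `snakemake figure1.png` sees that figure1.png needs data.csv,
# data.csv needs raw-data.txt, and runs both steps in order.
```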

Getting Started: Tutorial Zone

A complete repository for this example is available here.

We’re going to work out a simple scientific repository together with the Gedanken-goal of finding the curvature of the Earth using Eratosthenes’ method. This is a classic experiment that involves measuring the angle of the sun’s rays at two different locations with similar longitude but different latitude, and using the different angles to the sun to calculate the Earth’s curvature.

Note: This demo gets in the technical weeds! If you want a lighter intro, you can skip ahead to the overview of the layout and conventions in the next section.

1. Create the repo

uv init earth-curvature
cd earth-curvature

2. Create the layout

uv add snakemake
mkdir -p data/raw
mkdir -p analysis/download-data

3. Download data

First we will need to get some data. For this experiment, we need the angle of the sun’s rays at two different locations. We can use the NOAA Solar Calculator to get this information.

Let’s add httpx to our dependencies, along with pandas and lxml for parsing the HTML tables:

uv add httpx pandas lxml

Now let’s create a script to download the data and save it in our data/raw folder.

# analysis/download-data/run.py
from io import StringIO

import httpx
import pandas as pd

def download_solar_data(location: str) -> pd.DataFrame:
    if location not in ["salt-lake-city", "phoenix"]:
        raise ValueError("Location must be either 'salt-lake-city' or 'phoenix'")
    if location == "salt-lake-city":
        lat, lon = 40.77, -111.89
    else:
        lat, lon = 33.45, -112.07
    # e.g. https://gml.noaa.gov/grad/solcalc/table.php?lat=40.77&lon=-111.89&year=2026
    url = f"https://gml.noaa.gov/grad/solcalc/table.php?lat={lat}&lon={lon}&year=2026"
    response = httpx.get(url)
    response.raise_for_status()
    # The page contains several tables; the third holds solar noon times.
    # (Wrapping in StringIO avoids pandas' deprecation of literal-HTML input.)
    df = pd.read_html(StringIO(response.text))[2]
    return df

if __name__ == "__main__":
    for location in ["salt-lake-city", "phoenix"]:
        df = download_solar_data(location)
        df.to_csv(f"data/raw/{location}_solar_data.csv", index=False)

Where do we keep track of the fact that we’re creating data/raw/salt-lake-city_solar_data.csv and data/raw/phoenix_solar_data.csv? In the Snakefile!

# Snakefile
rule download_data:
    output:
        "data/raw/salt-lake-city_solar_data.csv",
        "data/raw/phoenix_solar_data.csv"
    shell:
        "python analysis/download-data/run.py"

Now we can run this with:

uv run snakemake --cores 1

You should see output like this:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job              count
-------------  -------
download_data        1
total                1

Select jobs to execute...
Execute 1 jobs...

[Wed Mar 11 13:57:37 2026]
localrule download_data:
    output: data/raw/salt-lake-city_solar_data.csv, data/raw/phoenix_solar_data.csv
    jobid: 0
    reason: Missing output files: data/raw/phoenix_solar_data.csv, data/raw/salt-lake-city_solar_data.csv
    resources: tmpdir=/var/folders/3s/z2md30v922b129bpyjmrt0xw0000gn/T
[Wed Mar 11 13:57:42 2026]
Finished jobid: 0 (Rule: download_data)
1 of 1 steps (100%) done
Complete log(s): ./.snakemake/log/2026-03-11T135737.313363.snakemake.log

Here’s the important stuff:

reason: Missing output files: data/raw/phoenix_solar_data.csv, data/raw/salt-lake-city_solar_data.csv

That means that Snakemake is tracking the fact that these files are generated by this rule, and if we delete them, Snakemake will know to regenerate them.

You should now have two CSV files in your data/raw folder with the solar data for Salt Lake City and Phoenix!

It’ll look like this:

Day Jan Feb Mar Apr May
1 12:31:46 12:41:50 12:40:38 12:32:10 12:25:24
2 12:32:14 12:41:58 12:40:26 12:31:52 12:25:17
3 12:32:41 12:42:05 12:40:13 12:31:35 12:25:11

3.5: Oh crap, we only have solar noon data

Oh no! We only have the time of solar noon, but we need the angle of the sun’s rays at solar noon to calculate the curvature of the Earth. This is a problem. Alright, let’s change tack.

We know the planet rotates 360° in 24 hours, so if we can pick two cities with similar latitudes but different longitudes, we can use the difference in solar noon times to calculate the radius of the Earth…
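
Before touching the pipeline, here’s that arithmetic as a quick sanity check (the numbers below are hypothetical, just to exercise the formula):

```python
import math

def radius_from_noon_offset(time_diff_s: float, distance_km: float) -> float:
    # The Earth rotates 360 degrees in 24 hours = 86400 seconds
    angle_deg = (time_diff_s / 86400) * 360
    # Treat the distance between the cities as an arc length: r = s / theta
    return distance_km / math.radians(angle_deg)

# Hypothetical: two cities 1400 km apart whose solar noons differ by one hour
# (one hour = 15 degrees of rotation)
print(radius_from_noon_offset(3600, 1400))  # ≈ 5347.6 km
```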

Let’s just change Salt Lake City to Dallas:

...
    if location == "dallas":
        lat, lon = 32.77, -96.79
...

Then we update our Snakefile:

rule download_data:
    output:
        "data/raw/dallas_solar_data.csv",
        "data/raw/phoenix_solar_data.csv"
    shell:
        "python analysis/download-data/run.py"

And because Snakemake is tracking our outputs, it will know to regenerate the Dallas data but not the Phoenix data when we run:

uv run snakemake --cores 1

But you don’t have to run it now… Keep reading :)

4. Estimate curvature

Now that we have our data, we can create a new script to estimate the radius of the earth.

# analysis/get-radius/run.py
import pandas as pd
import numpy as np

def calculate_radius(solar_noon_1, solar_noon_2, distance: float, timezone_difference: int = 0) -> float:
    # Strings like "12:31:46"
    time_1 = pd.to_datetime(solar_noon_1, format="%H:%M:%S")
    time_2 = pd.to_datetime(solar_noon_2, format="%H:%M:%S")
    time_diff = abs((time_1 - time_2).total_seconds())
    # Adjust for timezone difference
    time_diff += timezone_difference * 3600
    # The Earth rotates 360° in 24 hours, so we can calculate the angle
    angle = (time_diff / (24 * 3600)) * 360
    # The distance between the two cities is the arc length, so we can calculate the radius
    radius = distance / np.radians(angle)
    return radius

if __name__ == "__main__":
    dallas_data = pd.read_csv("data/raw/dallas_solar_data.csv")
    phoenix_data = pd.read_csv("data/raw/phoenix_solar_data.csv")
    # Get the solar noon time for the 1st of January
    dallas_solar_noon = dallas_data.loc[dallas_data["Day"] == 1, "Jan"].values[0]
    phoenix_solar_noon = phoenix_data.loc[phoenix_data["Day"] == 1, "Jan"].values[0]
    # The distance between Dallas and Phoenix is approximately 1400 km
    distance = 1400
    # And Phoenix is 1 hour behind Dallas on Jan 1
    timezone_difference = 1
    radius = calculate_radius(dallas_solar_noon, phoenix_solar_noon, distance, timezone_difference)
    # Put it into a radius-{city1}-{city2}.txt file
    with open("results/radius-dallas-phoenix.txt", "w") as f:
        f.write(f"{radius:.2f}\n")

And we add this to our Snakefile:

rule get_radius:
    input:
        "data/raw/dallas_solar_data.csv",
        "data/raw/phoenix_solar_data.csv"
    output:
        "results/radius-dallas-phoenix.txt"
    shell:
        "python analysis/get-radius/run.py"

rule all:
    input:
        "results/radius-dallas-phoenix.txt"

Remember when I said “you don’t have to run this now”? This is why. Because Snakemake tracks our inputs and outputs, it knows that get_radius needs the raw data files (our specified inputs), so if they’re missing it will run download_data first, then get_radius.

uv run snakemake --cores 1 all

5. A little cleverness to make it more reusable

Now let’s say we want to calculate the radius of the Earth using different pairs of cities. We can make our get_radius rule more reusable by parameterizing the city names and distances.

# analysis/get-radius/run.py
...
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--city1", type=str, required=True)
    parser.add_argument("--city2", type=str, required=True)
    parser.add_argument("--distance", type=float, required=True)
    parser.add_argument("--timezone-difference", type=int, default=0)
    args = parser.parse_args()
    city1_data = pd.read_csv(f"data/raw/{args.city1}_solar_data.csv")
    city2_data = pd.read_csv(f"data/raw/{args.city2}_solar_data.csv")
    city1_solar_noon = city1_data.loc[city1_data["Day"] == 1, "Jan"].values[0]
    city2_solar_noon = city2_data.loc[city2_data["Day"] == 1, "Jan"].values[0]
    radius = calculate_radius(city1_solar_noon, city2_solar_noon, args.distance, args.timezone_difference)
    with open(f"results/radius-{args.city1}-{args.city2}.txt", "w") as f:
        f.write(f"{radius:.2f}\n")

And we can update our Snakefile to use this new parameterization:

rule get_radius:
    input:
        "data/raw/dallas_solar_data.csv",
        "data/raw/phoenix_solar_data.csv"
    output:
        "results/radius-dallas-phoenix.txt"
    params:
        city1="dallas",
        city2="phoenix",
        distance=1400,
        timezone_difference=1
    shell:
        "python analysis/get-radius/run.py --city1 {params.city1} --city2 {params.city2} --distance {params.distance} --timezone-difference {params.timezone_difference}"

How did we do?

My result says 5247.04. That’s far from the actual radius of the Earth (about 6371 km), but it’s in the right ballpark! The error could be due to a number of factors, including inaccuracies in the solar noon times, the distance between the cities (which I guessed, oops!), and the fact that the Earth is not a perfect sphere.
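
A hedged aside on where much of the remaining gap likely comes from (this is my own analysis, assuming the textbook mean radius of ~6371 km): the solar-noon offset depends only on the longitude difference, so the distance-over-angle calculation actually recovers the radius of the circle of latitude the cities sit on, not the Earth’s radius itself:

```python
import math

EARTH_RADIUS_KM = 6371  # textbook mean radius
LATITUDE_DEG = 33.1     # roughly the latitude of both Dallas and Phoenix

# The east-west distance between the cities is (approximately) an arc of
# their shared latitude circle, so distance / angle recovers that circle's
# radius, R * cos(latitude), rather than R itself.
latitude_circle_radius = EARTH_RADIUS_KM * math.cos(math.radians(LATITUDE_DEG))
print(f"{latitude_circle_radius:.0f} km")  # ≈ 5337 km
```

At ~33° N that works out to roughly 5337 km, which is much closer to the 5247 km the pipeline produced.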

I say we can declare success…after ONE MORE THING

6. Let’s make a figure

Now let’s say we want to make a figure that shows our estimated radius of the Earth and the points we used to calculate it. We can create a new script to do this:

uv add numpy matplotlib

# analysis/plot-radius/run.py
import numpy as np
import matplotlib.pyplot as plt

locations = {
    "dallas": (32.77, -96.79),
    "phoenix": (33.45, -112.07)
}

if __name__ == "__main__":
    radius = float(open("results/radius-dallas-phoenix.txt").read().strip())
    city1, city2 = "dallas", "phoenix"
    lat1, lon1 = locations[city1]
    lat2, lon2 = locations[city2]
    plt.figure(figsize=(6, 6))
    # A circle with city1 at the top and city2 at the approximately correct angle
    circle = plt.Circle((0, 0), radius, fill=False)
    plt.gca().add_artist(circle)
    plt.plot(0, radius, "ro", label=city1)
    angle = np.radians(abs(lon1 - lon2))
    plt.plot(radius * np.sin(angle), radius * np.cos(angle), "bo", label=city2)
    plt.text(0, radius + 100, city1, ha="center")
    plt.text(radius * np.sin(angle), radius * np.cos(angle) + 100, city2, ha="center")
    plt.xlim(-radius - 200, radius + 200)
    plt.ylim(-radius - 200, radius + 200)
    plt.gca().set_aspect("equal")
    plt.title(f"Estimated Radius of the Earth: {radius:.2f} km")
    plt.legend()
    # Match the output path declared in the Snakefile
    plt.savefig("results/figures/earth-radius-dallas-phoenix.png")

And we add this to our Snakefile:

rule plot_radius:
    input:
        "results/radius-dallas-phoenix.txt"
    output:
        "results/figures/earth-radius-dallas-phoenix.png"
    shell:
        "python analysis/plot-radius/run.py"

Of course, we now duplicate the coordinates of the cities in both the download-data and plot-radius scripts, which is not ideal. We could hoist this out into a shared src/ file to avoid this duplication:

# src/earth_curvature/locations.py
locations = {
    "dallas": (32.77, -96.79),
    "phoenix": (33.45, -112.07)
}

And then we can import this in both scripts:

# analysis/download-data/run.py
from earth_curvature.locations import locations
...
    if location not in locations:
        raise ValueError(f"Location must be one of {list(locations.keys())}")
    lat, lon = locations[location]
...
# analysis/plot-radius/run.py
from earth_curvature.locations import locations
...

(Note you may need to add an __init__.py file in src/earth_curvature/ to make it a package, and run a quick uv pip install -e . so it’s available for import.)
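
If the editable install complains, the piece that’s usually missing is telling the build backend where the package lives. With hatchling (uv’s default backend for packaged projects), the pyproject.toml addition looks something like this (a sketch, not pulled from the example repo):

```toml
# pyproject.toml (assuming the default hatchling backend)
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/earth_curvature"]
```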

[Figure: the estimated radius of the Earth, with Dallas and Phoenix marked on the circle]


A complete repository for this example is available here.

Manifesto Zone

A scientific repository should make three things obvious:

  1. where the inputs came from;
  2. what code generated a given output; and
  3. whether a file is an intermediate artifact or something you’d actually pull into a paper.

I organize analyses into a dependency graph of small, single-purpose scripts. Each script should have a clear purpose and a clear set of inputs and outputs. The workflow should be orchestrated by a workflow engine like Snakemake, which makes it easy to see the dependencies between steps and to reproduce the entire analysis from raw data to final results.

My layout looks like this:

.
├── data/
│   ├── raw/
│   ├── provided/
│   └── generated/
├── analysis/
│   └── some_step/
│       └── run.py
├── src/
├── results/
│   ├── data/
│   └── figures/
└── Snakefile

Data Files

One of my biggest pain points when inheriting a codebase — especially one that deals with lots of data with lots of preprocessing steps — is provenance.

This is not a perfect solution, but it’s like… mostly right?

  • data/raw/: original external inputs; do not hand-edit these
  • data/provided/: stable project-supplied inputs like metadata tables or curated annotations. check these into version control, or have early stages of your pipeline generate or download them.
  • data/generated/: machine-generated intermediate artifacts. these should be deletable at any time without breaking your repo (since they should always be regenerable from the raw/provided data and code).

Analysis Code

The analysis/ folder contains the code that transforms data into results. I like to organize this into subfolders for each step of the analysis, with a single run.py script that serves as the entry point for that step. (In Snakemake, these are called rules — so each rule gets its own folder.)

  • analysis/<rule_name>/run.py: one entrypoint per workflow step
  • src/: shared reusable Python code; hoist stuff out of scripts and into here to avoid duplication

I like each workflow step to have a small entrypoint, usually something like:

analysis/<rule_name>/run.py

That script should parse arguments, read explicit inputs, call shared code from src/, and write explicit outputs. Reusable logic belongs in src/, not copied across scripts.
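A minimal skeleton of that shape (the step, flag names, and transform are hypothetical, not from the example repo):

```python
# analysis/<rule_name>/run.py -- minimal entrypoint template
import argparse
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="One workflow step.")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    text = args.input.read_text()              # read explicit inputs
    result = text.upper()                      # call shared code from src/ here
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(result)             # write explicit outputs

if __name__ == "__main__":
    main()
```

Because inputs and outputs arrive as flags, the matching Snakemake rule can pass its own input/output paths straight through, and the script stays runnable by hand while debugging.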

Results

  • results/data/: final tables
  • results/figures/: final figures
  • Snakefile: the canonical execution path

To regenerate results, use Snakemake:

uv run snakemake --cores 1

(You can still run individual scripts while debugging, but I usually DELETE EVERYTHING in results and rerun from scratch before rendering out a manuscript to make sure things are fresh.)

If an output is expensive to regenerate, protected() is useful:

output:
    protected("data/generated/expensive_result.parquet")
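
The complementary marker (also built into Snakemake) is temp(), which deletes an intermediate as soon as every rule that consumes it has run; the filename here is a placeholder:

```
output:
    temp("data/generated/huge_intermediate.parquet")
```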

For figures, generate them programmatically and size them for their final use. Do not rely on manual cleanup downstream.

That’s the short version. The full conventions, examples, and LLM-friendly repo instructions are here:

github.com/j6k4m8/science-repos

The point is not folder aesthetics. The point is that six months later, you should still be able to answer: where did this science come from?

Written on March 10, 2026
Comments? Let's chat on bsky or mastodon!