Sam GCP infra (Terraform)

One-time GCP setup for Sam’s Cloud Run CI/CD. A single hand-edited config file (infra/config.yaml) drives both Terraform and the CI workflow.

Mental model

Everything Sam needs lives in 3 places:

Where	What	Edited by
`infra/config.yaml`	All non-secret deployment config (project, regions, Slack IDs, runtime knobs, secret name map)	You, by hand
`infra/config.generated.yaml`	TF-derived values (WIF provider path, SA emails, AR repo URL)	Terraform writes it; you commit it once
GCP Secret Manager	The application secrets (Slack tokens, GitHub PAT, Linear API, Exa key, GitHub webhook HMAC)	`bash infra/scripts/upload-secrets.sh`

There are no GitHub Actions secrets or variables to set. The workflow reads both YAMLs directly.

What this provisions

APIs enabled — Vertex AI, Cloud Run, Artifact Registry, Secret Manager, IAM, IAM Credentials, STS, Cloud Resource Manager, Cloud Functions, Cloud Build, Eventarc (the last three for the edge function)
Edge function github-webhook-proxy (gen2, EU): the public door for GitHub webhooks; it forwards to Sam’s private /github/webhook with an IAM token. See “GitHub webhooks” below. To add another edge function, create a functions/<name>/ folder and add one entry to local.functions in functions.tf.
Artifact Registry repo sam (Docker, regional)
Secret Manager secrets (resources only — values populated separately, EU-only replication pinned to europe-west1 + europe-west4)
Cloud Storage bucket <project_id>-sam-data — EU multi-region, versioned, public-access prevention enforced. Mounted at /data in Cloud Run via gcsfuse so Sam’s journal survives restarts.
Two service accounts:
- sam-deploy@… — assumed by GitHub Actions via WIF
- sam-runtime@… — what Sam runs as inside Cloud Run
Workload Identity Federation — GitHub OIDC → GCP token, scoped to this repo only
IAM bindings — least-privilege per SA

What this does not provision (kept manual on purpose):

The Cloud Run service itself — first deploy creates it; the workflow keeps it updated
Secret values — populated via gcloud secrets versions add so secret material never enters Terraform state

Persistence model

Sam writes to /data/journal/*.md and /data/sam.lock. Cloud Run is stateless by default — /data would be wiped on every restart. To keep journal state across restarts/redeploys, the workflow mounts the GCS bucket as a Cloud Run volume:

--add-volume=name=sam-data,type=cloud-storage,bucket=<bucket>
--add-volume-mount=volume=sam-data,mount-path=/data
--execution-environment=gen2

Notes:

gen2 is required for cloud-storage volumes (gcsfuse).
The lock file at /data/sam.lock is PID-based, not fcntl-based, so gcsfuse’s weak POSIX semantics don’t break it. Sam’s stale-lock cleanup handles deploy overlaps gracefully (new container’s os.kill(old_pid, 0) fails → lock cleared).
Object versioning is enabled on the bucket, so accidentally rm-ing a journal file from inside the container is recoverable for 30 days.
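
The PID-based lock described above can be sketched as follows. This is a minimal illustration, not Sam's actual code: the function names `pid_is_alive` and `acquire_lock` are hypothetical, and the sketch assumes the lock file holds a single PID as text. It relies only on plain file reads and writes, which is why gcsfuse's weak POSIX semantics are not a problem. Note that the check-then-write sequence is not atomic.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists in the current container."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the lock holder is gone
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(lock_path: str) -> bool:
    """Take the lock unless a live process other than us already holds it."""
    try:
        with open(lock_path) as f:
            holder = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        holder = None  # no lock file, or unreadable contents: treat as free

    if holder is not None and holder != os.getpid() and pid_is_alive(holder):
        return False  # a live holder exists; back off

    # Either the lock is free or its holder is dead (stale lock): claim it.
    with open(lock_path, "w") as f:
        f.write(str(os.getpid()))
    return True
```

After a redeploy, the PID recorded by the old container does not exist in the new one, so the liveness check fails and the new container overwrites the stale lock.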

CI security checks

The ci-checks job runs on PRs targeting main, not on pushes to main. The assumption: every commit reaching main got there via a PR that already passed checks. This halves CI minutes and avoids re-running expensive scans (trivy, gitleaks history) on the same code twice.

Required: branch protection on main. Without it, someone could push directly to main and skip all checks. Set this once in GitHub repo settings:

Settings → Branches → Add rule → main
☑ Require a pull request before merging
☑ Require status checks to pass before merging → select ci-checks
☑ Do not allow bypassing the above settings

What runs:

Check	Stack	What it catches
`ruff check src/`	Python	Code quality
`ruff check src/ --select=S`	Python	Security antipatterns (bandit subset — subprocess shell=True, eval, pickle, weak crypto, etc.)
`pip-audit -r src/runtime/requirements.txt`	Python	Known CVEs in pinned deps
`docker build`	Docker	Dockerfile + deps resolve cleanly
`trivy-action@0.28.0` (HIGH/CRITICAL, ignore-unfixed)	Docker	OS/package CVEs in the built container image

Secret scanning is delegated to GitHub. Since the repo is public, GitHub’s native secret scanning runs on every push automatically, surfaces findings in the Security tab, and burns zero CI minutes. We removed the in-CI gitleaks step in favor of it.

To suppress a specific finding:

ruff S rule: add to pyproject.toml [tool.ruff.lint] ignore = ["Sxxx"]
pip-audit CVE: add --ignore-vuln GHSA-xxxx-xxxx-xxxx to the step
trivy CVE: add the CVE ID to .trivyignore at repo root

First-time bootstrap

# Auth — uses your gcloud ADC
gcloud auth application-default login

# Edit config.yaml first if you need to change defaults (project, region, etc.)
$EDITOR infra/config.yaml

# Apply
cd infra/
terraform init
terraform plan
terraform apply

# Commit the generated config (workflow needs it).
# The path is relative to infra/, since we are still in that directory.
git add config.generated.yaml
git commit -m "infra: capture WIF provider + SA emails from terraform apply"
git push

# Upload your local .env secrets into GCP Secret Manager (one time)
bash scripts/upload-secrets.sh

After that, any push to main triggers a deploy.

Changing config later

Change a runtime knob (memory, channel, project, etc.) → edit config.yaml, commit, push. The workflow picks it up on next deploy.
Rotate a secret value → re-run bash scripts/upload-secrets.sh from infra/ (it reads your current .env).
Add a new secret → add it to config.yaml > secrets, terraform apply to create the resource, then re-run upload-secrets.

GitHub webhooks (SAM-5)

Sam’s Cloud Run service stays --no-allow-unauthenticated — it never accepts public traffic. GitHub can’t present a GCP IAM token, so it can’t call Sam directly. The public edge function github-webhook-proxy is the only door: GitHub → proxy (public, HMAC-signed body) → forwards with an IAM token → Sam’s private /github/webhook → Sam validates the HMAC and acts.
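
The forwarding hop can be sketched roughly as below. This is an illustration under stated assumptions, not the proxy's real source: `fetch_id_token`, `build_forward_request`, and the list of forwarded headers are hypothetical. The sketch assumes the function runs on GCP, where the metadata server can mint an ID token, and that the token audience is Sam's Cloud Run URL. The body is passed through byte-for-byte, because any re-serialization would invalidate GitHub's HMAC signature.

```python
import urllib.parse
import urllib.request

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)

# Headers Sam is assumed to need in order to validate and route the event.
FORWARDED_HEADERS = (
    "X-Hub-Signature-256",
    "X-GitHub-Event",
    "X-GitHub-Delivery",
    "Content-Type",
)


def fetch_id_token(audience: str) -> str:
    """Ask the GCP metadata server for an ID token minted for `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    req = urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()


def build_forward_request(
    sam_base_url: str, body: bytes, incoming_headers: dict, id_token: str
) -> urllib.request.Request:
    """Build the POST to Sam's private endpoint, leaving the body untouched."""
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    headers = {
        name: lowered[name.lower()]
        for name in FORWARDED_HEADERS
        if name.lower() in lowered
    }
    headers["Authorization"] = f"Bearer {id_token}"
    return urllib.request.Request(
        sam_base_url.rstrip("/") + "/github/webhook",
        data=body,  # raw bytes exactly as GitHub sent them
        headers=headers,
        method="POST",
    )
```

A caller would obtain the token with `fetch_id_token(sam_base_url)` and send the request with `urllib.request.urlopen`. Only an allowlist of headers is forwarded, so nothing else from the public request reaches Sam.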

One org-level webhook covers every repo in the org — current and future — so there’s no per-repo setup, ever. It pairs with Sam’s contributor filter (the daemon ignores events on PRs the bot didn’t author), so the org firehose only wakes Sam for repos it actually works in.
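
A contributor filter of this kind might look like the sketch below. The function names and the bot login are hypothetical. The payload fields follow GitHub's webhook format: `pull_request.user.login` on pull request events, and `issue.user.login` on `issue_comment` events, where a PR appears as an issue carrying a `pull_request` key. Treating events that are not attached to a PR as ignorable is an assumption of this sketch, not something the text above specifies.

```python
from typing import Optional


def pr_author(payload: dict) -> Optional[str]:
    """Return the login of the PR author, or None if the event is not on a PR."""
    pr = payload.get("pull_request")
    if pr is None:
        issue = payload.get("issue") or {}
        # Comments on a PR arrive as issue events with a "pull_request" key.
        pr = issue if "pull_request" in issue else None
    if pr is None:
        return None
    return (pr.get("user") or {}).get("login")


def should_wake(payload: dict, bot_login: str) -> bool:
    """Wake the daemon only for events on PRs authored by the bot."""
    author = pr_author(payload)
    # GitHub logins are case-insensitive, so compare in lower case.
    return author is not None and author.lower() == bot_login.lower()
```

For example, `should_wake(payload, "sam-bot")` is true for a review comment on a PR opened by a hypothetical `sam-bot` account and false for the same event on a PR opened by anyone else.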

Setup, once:

# 1. Provision the proxy function, its IAM, AND the HMAC secret. The secret is
#    auto-generated by Terraform (random_password → Secret Manager) — no human
#    picks or types it, and a version exists before Sam's deploy mounts it.
terraform apply

# 2. Deploy Sam so it picks up the secret: push to main, or
#    `gh workflow run ci-deploy.yml` (SAM-19).

# 3. Register the ONE org webhook. Needs YOUR org-admin gh creds — the bot has
#    only write, so it can't self-register. Idempotent. Reads the
#    Terraform-generated secret from Secret Manager and sets it on the hook.
bash scripts/register-webhooks.sh            # org from config.yaml (Dembrane)
# or: bash scripts/register-webhooks.sh SomeOtherOrg

The proxy is a thin forwarder — it does not hold the secret or validate the signature. Sam is the single HMAC validator. Junk traffic is forwarded once and HMAC-rejected by Sam (fast 401, no session). If the secret is unset, Sam’s daemon doesn’t expose /github/webhook at all and the loop is simply off.
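
For reference, the validation Sam performs can be sketched as below. The function name `signature_is_valid` is hypothetical and Sam's real handler may differ. The scheme itself is GitHub's: the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret.

```python
import hashlib
import hmac
from typing import Optional


def signature_is_valid(
    secret: bytes, body: bytes, signature_header: Optional[str]
) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False  # missing or malformed header: reject
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Compare as bytes in constant time to avoid leaking timing information.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

A handler would return 401 whenever this function returns False. The comparison must use the body bytes exactly as received, which is why the proxy forwards them unmodified.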

Secret handling is now fully hands-off: Terraform generates it, Sam reads it to validate, and the script reads it to register. The one irreducible manual step is the org-admin-gated registration in step 3, because the bot can’t have admin.

State notes

State is local (terraform.tfstate in this directory, gitignored). For a single-operator setup this is fine; migrate to a GCS backend if more than one person needs to apply.

terraform destroy deletes the service accounts and the Artifact Registry repo. Secret Manager has 30-day soft-delete by default; APIs stay enabled (deliberate: turning them off project-wide breaks anything else using them).
