Recruitment Pipeline

GitHub-sourced inbound recruitment. Six orthogonal discovery channels feed a single LangGraph (backend/leadgen_agent/gh_patterns_graph.py) that hydrates candidates via the GitHub GraphQL API, extracts features from the contribution calendar, embeds bios with a local Candle BERT 384-dim pass (Rust HTTP server, Metal/CPU — no cloud egress), and persists the result into the same contacts table the outbound B2B pipeline uses. Every candidate carries an auditable tag array: the recruiter sees exactly which channels surfaced them and why. This pipeline is one of three siblings on the same schema — see the handoff stage below.

6 discovery channels · 50+ fields / candidate · 9-component score · 0 cloud LLM calls
Pipeline Metrics
6 · Discovery channels: bio search, stargazers, contributors, org members, follower graph, seed domains (backend/leadgen_agent/gh_patterns_graph.py)
15 · Bio-search passes targeting distinct AI/ML angles: RAG, DSPy, MLOps, vector DBs, agentic… (search_queries() in gh_patterns_graph.py)
50+ · Fields hydrated per candidate via a single GraphQL batch query (USER_GQL_FIELDS in backend/leadgen_agent/_gh_graphql.py)
9 · Components in the rising-score composite: density, novelty, breadth, activity, skill, engagement, obscurity, recency, quality (compute_rising_score() in gh_patterns_graph.py)
0 · Cloud LLM calls per candidate embedding; Candle BERT runs locally on Metal/CPU (Candle BERT 384-dim local pass)

1. discovery: Six-channel fan-in

Six orthogonal channels fan in to a single dedup gate keyed on numeric user.id. Each channel has a different false-positive profile — the intersection is higher-confidence than any one source.

candidate_logins → hydration
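
The dedup gate above can be sketched as a dictionary keyed on the numeric user id, with every surfacing channel appending its src:* tag. The helper name (merge_channels) and record shape are illustrative, not the repo's:

```python
# Hypothetical sketch of the fan-in dedup gate. Each channel yields
# (src_tag, users); candidates are keyed on the numeric user id so the
# same login surfaced twice collapses into one record carrying both tags.
def merge_channels(channel_results: list[tuple[str, list[dict]]]) -> dict:
    merged: dict[int, dict] = {}          # numeric user.id -> candidate
    for src_tag, users in channel_results:
        for u in users:
            rec = merged.setdefault(u["id"], {"login": u["login"], "tags": []})
            rec["tags"].append(src_tag)   # provenance survives the dedup
    return merged
```

A candidate surfaced by both bio search and contributor mining ends up with two independent src:* tags; that intersection evidence is exactly what the later scoring and shortlist steps reward.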
2. hydration: Batch GraphQL hydration

A single GraphQL query hydrates 50+ fields per candidate — bio, location, contribution calendar, pinned repos, contributed repos, org memberships — collapsing what would be 10+ REST round-trips into one batch call.

hydrated_profiles → signal_extraction
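
One way to collapse per-login lookups into a single GraphQL request is field aliasing; a sketch only, with an invented helper name and a deliberately tiny field list (the real field set is USER_GQL_FIELDS in backend/leadgen_agent/_gh_graphql.py):

```python
# Sketch: batch N user lookups into one GraphQL call via aliases.
# Field list here is illustrative, not the pipeline's 50+ fields.
def build_user_batch_query(logins: list[str]) -> str:
    fields = "login bio isHireable followers { totalCount }"
    parts = [f'u{i}: user(login: "{login}") {{ {fields} }}'
             for i, login in enumerate(logins)]
    return "query {\n" + "\n".join(parts) + "\n}"
```

GitHub resolves every alias in one round-trip, which is how 10+ REST calls per candidate collapse into a single request.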
3. signal_extraction: Deterministic feature extraction

Four deterministic extractors turn the hydrated profile into scoring features: a rolling activity profile, 16-term AI topic detection across pinned + top + contributed repos, multi-source skill tagging, and a keyword-based seniority classifier.

feature_vectors → scoring
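
One of the four extractors, the AI topic detector, reduces to a substring scan over repo metadata. A minimal sketch; this 16-term list is invented for illustration and need not match the repo's actual taxonomy:

```python
# Sketch of the AI topic detector: count distinct taxonomy hits across
# a candidate's repo names, descriptions, and topics. Term list invented.
AI_TERMS = {
    "rag", "llm", "agent", "langchain", "langgraph", "dspy", "mlops",
    "transformer", "embedding", "vector", "fine-tuning", "inference",
    "pytorch", "diffusion", "prompt", "retrieval",
}

def ai_topic_matches(repos: list[dict]) -> int:
    """Distinct AI-taxonomy hits across pinned + top + contributed repos."""
    hits: set[str] = set()
    for repo in repos:
        text = " ".join([repo.get("name", ""),
                         repo.get("description") or "",
                         " ".join(repo.get("topics", []))]).lower()
        hits.update(term for term in AI_TERMS if term in text)
    return len(hits)
```

Counting distinct terms (a set, not a sum) keeps one keyword-stuffed repo from dominating the breadth signal.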
4. scoring: Two-prior composite scoring

Two independent composite scores with opposite priors — rising_score favors undiscovered talent via an obscurity component; strength_score favors established track records. Opportunity skill match runs in parallel. A disjunctive gate (rising OR strength ≥ threshold) decides persistence.

scored_candidates → persistence
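
The disjunctive gate is a one-liner; the threshold values below are invented for illustration:

```python
# Minimal sketch of the disjunctive persistence gate. Thresholds are
# assumptions, not the repo's values.
RISING_THRESHOLD = 0.55
STRENGTH_THRESHOLD = 0.60

def should_persist(rising_score: float, strength_score: float) -> bool:
    # Opposite priors, OR'd: either composite clearing its bar is enough.
    return (rising_score >= RISING_THRESHOLD
            or strength_score >= STRENGTH_THRESHOLD)
```

The OR (rather than AND) is what lets an obscure-but-rising engineer and an established senior both survive the same gate.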
5. persistence: Dual-store with auditable tags

LanceDB stores vector embeddings for semantic similarity search; Neon stores the canonical contact row with profile jsonb. Every candidate carries a prefix-tagged provenance array (github:* · skill:* · seniority:* · opp:* · src:*) — recruiters see exactly why each candidate surfaced.

contact_rows → handoff
6. handoff: Shared-schema handoff

Three sibling pipelines converge on the same contacts schema: (1) this page — inbound recruitment of GitHub engineers; (2) /how-it-works — outbound B2B lead-gen, where we discover companies and email decision-makers; (3) inbound auto-apply — HN Algolia → Sonnet 4.6 batch_fit_scorer → Playwright submission (see Deep Dive). Email discovery, deliverability verification, AI composition, and reply-aware follow-up all apply uniformly across all three.

Deep Dive

1. Why local Candle BERT for embeddings?

The discovery loop runs on the developer's laptop or a small worker pod, not in a managed ML environment. Embedding is the one ML hop in the pipeline, and it stays on-device: a Rust-built Candle BERT server (loopback, port 7799) serves 384-dim vectors over HTTP to the Python graph. Candle gives us BERT inference on Metal (Apple silicon) or CPU with no Python runtime in the hot path, no CUDA setup, no cloud egress. The practical consequence: a full candidate scan costs roughly $0 in inference and never leaks the sourcing target to a third party. Scoring itself is pure-Python arithmetic over the hydrated GraphQL fields — fast enough at 50k candidates that it doesn't justify a non-Python implementation.

# backend/leadgen_agent/embeddings.py — Candle BERT 384-dim local pass
async def embed_texts(texts: list[str]) -> list[list[float]]:
    # Local Rust/Candle HTTP server on 127.0.0.1:7799 — Metal on Apple silicon,
    # CPU elsewhere. No external network, no per-call cost.
    async with httpx.AsyncClient() as c:
        r = await c.post("http://127.0.0.1:7799/embed", json={"texts": texts})
    return r.json()["embeddings"]   # 384-dim vectors

# backend/leadgen_agent/gh_patterns_graph.py — rising_score composite
def compute_rising_score(c: Candidate) -> float:
    density    = contribs_per_follower(c)    # undiscovered talent
    novelty    = account_age_score(c)        # newer accounts
    breadth    = distinct_ai_repo_score(c)   # # of AI repos
    activity   = real_activity_score(c)      # commits/PRs/reviews
    relevance  = ai_skill_score(c)           # skill tag count
    engagement = profile_completeness(c)     # email/hireable/etc
    obscurity  = 1.0 / (1.0 + c.followers / 500.0)
    recency    = last_90d_signal(c)
    quality    = external_star_quality(c)

    composite = clamp(linear_combo(...), 0.0, 1.0)
    return composite * hireable_multiplier(c) * recency_multiplier(c)

2. Why six channels?

Every discovery channel has a characteristic false-positive mode. Bio search surfaces people who describe themselves as AI engineers but don't ship AI code. Stargazer mining surfaces hobbyists who starred a framework once. Contributor mining surfaces real capability but misses anyone who hasn't contributed to the 12 specific repos we track. Org-member mining is biased toward visible AI labs and misses senior engineers at quiet companies. Follower-graph expansion surfaces community-embedded engineers but has a 1-hop limit before noise dominates. Seed-domain extraction only fires for companies already in our enriched set. Running all six and intersecting the results — each candidate carries tags for every channel that surfaced them — produces a candidate pool where every shortlisted profile has multiple independent pieces of evidence.

# backend/leadgen_agent/gh_patterns_graph.py — 6-channel fan-out
# Channel 1: Bio/keyword search (passes A–O)
for label, query in search_queries():
    await gh.search_users(query, sort="followers", order="desc", per_page=100)
    # tag: src:bio/{label}

# Channel 2: Stargazer mining (13 AI repos)
for repo in stargazer_repos():
    await gh.list_stargazers(repo, per_page=100)
    # tag: src:star/{repo}

# Channel 3: Contributor mining (12 core libs)
for repo in contributor_repos():
    await gh.list_contributors(repo, per_page=100)
    # tag: src:contrib/{repo}

# Channel 4: AI org public members
for org in ai_orgs():
    await gh.list_org_members(org)
    # tag: src:org/{org}

# Channel 5: Follower expansion from top-k seeds
# Channel 6: Seed domains (company team pages)

3. Auditable provenance at the contact level

Every signal that surfaced a candidate is preserved as a tag on their contact row — no black-box scoring. A recruiter looking at a shortlisted engineer sees github:rising-star, github:score:A, github:trend:rising, skill:pytorch, skill:rag, seniority:senior, opp:opp_20260415_principal_ai_eng_ob, opp:skill-match:85pct, src:bio/C, src:contrib/langchain-ai/langgraph, src:org/huggingface — which answers 'why this person?' without requiring access to scoring internals. Tags are stored as a JSONB array with a GIN index, so containment queries ('show me everyone with src:contrib/langchain AND seniority:senior') hit the index instead of scanning the table.

// src/db/schema.ts — contacts table (GitHub-relevant columns)
export const contacts = pgTable("contacts", {
  id: serial("id").primaryKey(),
  github_handle: text("github_handle"),              // unique partial index
  profile: jsonb("profile"),                   // ContactProfile
  authority_score: real("authority_score"),          // ← strength_score
  tags: jsonb("tags").default([]),                   // GIN-indexed
  // ...
});

// Example tag set on a single contact row
[
  "github:rising-star",
  "github:score:A",
  "github:trend:rising",
  "github:active-this-week",
  "skill:pytorch", "skill:langchain", "skill:rag",
  "seniority:senior",
  "opp:opp_20260415_principal_ai_eng_ob",
  "opp:skill-match:85pct",
  "src:bio/C", "src:contrib/langchain-ai/langgraph",
  "src:org/huggingface"
]
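
A containment lookup against the GIN-indexed tags column can be built like this. The jsonb @> operator tests "left contains right", so the planner can answer "every contact carrying ALL of these tags" from the index; the helper name and the asyncpg usage are illustrative:

```python
# Illustrative tag-containment query builder for the contacts table.
import json

def build_tag_query(required_tags: list[str]) -> tuple[str, str]:
    sql = ("SELECT id, github_handle FROM contacts "
           "WHERE tags @> $1::jsonb")
    return sql, json.dumps(required_tags)

# e.g. with an asyncpg pool (illustrative):
#   sql, param = build_tag_query(["src:contrib/langchain-ai/langgraph",
#                                 "seniority:senior"])
#   rows = await pool.fetch(sql, param)
```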

4. Sibling pipeline: inbound auto-apply (HN Algolia → Sonnet 4.6 → Playwright)

The third sibling alongside outbound B2B (/how-it-works) and inbound recruitment (this page) is auto-apply — a self-directed pipeline that ingests live job postings and submits applications. Source: the Hacker News 'Who is Hiring?' threads via the Algolia API at https://hn.algolia.com/api/v1/search (paginated by date, AI/remote-filtered). Triage: a batch_fit_scorer node that chunks 8 jobs per Claude Sonnet 4.6 (claude-sonnet-4-6) call, with cache_control=ephemeral on the SYSTEM_PROMPT + resume prefix — every chunk after the first pays only the cache-read rate (10% of input tokens), collapsing what was N sequential per-job triage calls into one prefix-cached batch. Cover-letter generation reuses the same cached prefix on the same model. Submission is a Playwright browser session with cookie-banner dismissal via page.evaluate(); end-to-end submission success rate sits at ~77%. Entry point: pnpm jobs:apply → backend/scripts/auto_apply_cli.py → backend/leadgen_agent/auto_apply_graph.py.

# backend/leadgen_agent/auto_apply/batch_fit_scorer.py
PROMPT_VERSION = "batch-fit-v1-2026-05"
DEFAULT_BATCH_SIZE = 8

# One Sonnet 4.6 call per chunk — system prompt + resume cached
# (cache_control=ephemeral). Chunk #2+ hits cache_read_input_tokens.
async def _score_one_chunk(jobs):
    payload = [{"idx": i, "role": j["role"], "company": j["company"],
                "jd_slice": j["jd"][:2000]} for i, j in enumerate(jobs)]
    client = make_sonnet(temperature=0.0)
    result, tel = await ainvoke_anthropic_json_with_telemetry(
        client,
        system_blocks=_system_blocks(),   # ← cached prefix
        user_content=BATCH_USER_TEMPLATE.format(jobs_json=json.dumps(payload)),
        max_tokens=200 + 80 * len(jobs),
    )
    return result["scores"], tel  # [{"idx": i, "score": 0..100, "reasoning": "..."}]

# Pipeline shape:
# discover (HN Algolia) → prioritize → batch_score (Sonnet, chunked) →
# pop_next → fetch_jd → cover_letter (Sonnet, same cache prefix) →
# submit (Playwright) → persist

5. Trend over fame — the obscurity component

The rising_score deliberately inverts follower count through an obscurity component: 1 / (1 + followers / 500). An engineer with 10,000 followers gets an obscurity multiplier of ~0.05; an engineer with 50 followers gets ~0.91. The intent is explicit: the product is a discovery tool, not a re-ranking of already-famous accounts. A 20k-follower AI influencer is not the target hire — a 200-follower engineer who ships three high-star repos is. This is paired with a recency multiplier that boosts candidates active in the last 7/30/90 days, so obscurity alone doesn't surface dormant accounts.

# backend/leadgen_agent/gh_patterns_graph.py
def obscurity(c: Candidate) -> float:
    return 1.0 / (1.0 + c.followers / 500.0)

def recency_multiplier(c: Candidate) -> float:
    days = c.days_since_last_active
    if days <= 7:   return 1.15   # active this week
    if days <= 30:  return 1.10   # active this month
    if days <= 90:  return 1.05   # active this quarter
    return 1.0

def hireable_multiplier(c: Candidate) -> float:
    return 1.15 if c.is_hireable else 1.0

Technical Details

Six discovery channels

Each channel has a different false-positive profile — intersecting evidence across channels produces higher confidence than any single source.

bio/keyword search: GitHub REST /search/users with 15 targeted passes (RAG, LLM, DSPy, MLOps, vector DBs, principal/staff titles). Signal: self-declared · confidence: low-medium

stargazer mining: Users who starred 13 high-signal AI framework repos (langgraph, crewAI, transformers, dspy, instructor…). Signal: interest · confidence: low

contributor mining: Actual code contributors to 12 foundational AI/ML libraries (langchain, llama_index, autogen, vllm, litellm…). Signal: capability · confidence: highest

org public members: Public members of 8 AI labs (deepmind, huggingface, cohere-ai, stability-ai, faculty-ai…). Signal: affiliation · confidence: high

follower graph: 1-hop follower expansion from top-scoring seeds — exploits homophily in the AI-engineer community graph. Signal: homophily · confidence: medium

seed domains: GitHub handles extracted from the team pages of already-enriched AI-native companies. Signal: pre-qualified · confidence: high

Signals extracted per candidate

Every field is recomputed on each pipeline run from live GraphQL data — no stale caches.

Field (type) · Description
account_age_days (integer): Days since createdAt; novelty component of rising_score.
contributions_30/90/365d (integer): Rolling windows derived from the contribution calendar.
trend (enum): rising | stable | declining | dormant | new — computed from 90d-over-90d delta.
current_streak_days (integer): Consecutive days with at least one contribution.
pinned_repos_json (jsonb): Top 6 repos the user pinned on their profile.
contributed_repos_json (jsonb): Top 10 external repos by stars the user committed to.
organizations_json (jsonb): Up to 5 org memberships from GraphQL organizations.
ai_topic_matches (integer): Count of 16-term AI taxonomy hits across pinned + top + contributed repos.
skills (tag array): skill:* tags from bio + languages + topics.
seniority (enum): junior | mid | senior | staff-plus — deterministic keyword classifier.
is_hireable (boolean): GitHub profile hireable flag; 1.15× multiplier in rising_score.
rising_score (float): 9-component composite in [0, 1]; emerging-talent prior.
strength_score (float): Experience-weighted alternative; senior-hire prior.
src:* provenance (tag array): Every discovery channel that surfaced this candidate.

Batch hydration: one GraphQL query per candidate

What would be 10+ REST round-trips collapses into a single batched GraphQL query. Abbreviated below — actual query is in backend/leadgen_agent/_gh_graphql.py.

query UserBatch($ids: [ID!]!) {
  users: nodes(ids: $ids) {
    ... on User {
      login bio company location email isHireable
      createdAt updatedAt
      followers { totalCount }
      contributionsCollection {
        totalCommitContributions
        totalPullRequestContributions
        contributionCalendar {
          weeks { contributionDays { contributionCount date } }
        }
      }
      pinnedItems(first: 6) { nodes { ... on Repository { name stargazerCount } } }
      repositoriesContributedTo(first: 10, orderBy: { field: STARGAZERS, direction: DESC }) {
        nodes { nameWithOwner stargazerCount repositoryTopics(first: 5) { nodes { topic { name } } } }
      }
      organizations(first: 5) { nodes { login name } }
    }
  }
}

Technical Foundations

Python · 2026

gh_patterns_graph (LangGraph)

This repo · backend/leadgen_agent

3,000+ line LangGraph that supersedes the prior Rust crates/gh — same six-channel discovery, GraphQL hydration, and 9-component composite scoring, now executed under the same langgraph dev runtime as every other graph in this app.

Single source of truth for GitHub discovery, enrichment, and scoring; sibling graphs (gh_ai_repos_graph, gh_quick_brief_graph, score_recruiter_fit_graph, classify_recruitment_bulk_graph) compose against the same hydrated user shape.

API · 2024

GitHub REST API v3

GitHub

Endpoints for /search/users, /repos/{}/stargazers, /repos/{}/contributors, /orgs/{}/members, /users/{}/followers

Used for all six discovery channels — fan-out across keyword search, stargazers, contributors, org members, and follower graphs.

API · 2024

GitHub GraphQL API v4

GitHub

Batch hydration of 50+ fields per user: contribution calendar, pinned repos, org memberships, top repos, contributed repos

Used after REST discovery to enrich candidates with signals that REST would require N separate calls to fetch.

Python · 2024

httpx + asyncio

encode

Async HTTP client on the asyncio runtime — concurrent GitHub API calls with backoff on 403/429

All GitHub traffic from gh_patterns_graph flows through a single shared httpx.AsyncClient. Rate-limit headers are parsed on every response.
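
The 403/429 backoff described here reduces to a small policy function; a sketch under the assumption that GitHub's x-ratelimit-reset header (epoch seconds) is honoured when present, with an invented 60-second cap:

```python
# Illustrative backoff policy for rate-limited GitHub responses:
# sleep until the advertised reset when the header exists, otherwise
# back off exponentially. Name and cap are assumptions.
import time

def compute_backoff(reset_header, attempt: int, now=None) -> float:
    """Seconds to wait before retrying a 403/429 GitHub call."""
    now = time.time() if now is None else now
    if reset_header is not None:
        # Cap the wait so a skewed clock cannot stall the whole scan.
        return min(max(0.0, float(reset_header) - now), 60.0)
    return min(2.0 ** attempt, 60.0)   # 1s, 2s, 4s, 8s, ... capped
```

The caller would sleep for the returned duration before re-issuing the request through the shared httpx.AsyncClient.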

AI/ML · 2024

Candle (HuggingFace)

HuggingFace

Minimalist ML framework for Rust with Metal/CUDA/CPU backends — local BERT inference, no cloud calls

Embeds contributor bios + repo descriptions into a vector space for semantic candidate search. Zero egress per candidate.

Database · 2024

LanceDB

LanceDB

Embedded columnar vector database — ANN search over millions of vectors without a server

Stores contributor vectors alongside raw fields. Powers offline similarity search across the whole candidate corpus.

Database · 2024

Neon PostgreSQL

Neon

Serverless PostgreSQL — the canonical contacts store shared with the company pipeline

Scored candidates land in contacts with github_handle unique key, profile jsonb, authority_score, and a searchable tag array.

Python · 2024

asyncpg

MagicStack

High-performance async PostgreSQL driver for Python — connection-pooled writes from gh_patterns_graph nodes

Used by gh_patterns_graph to upsert candidates and org-pattern rows into Neon. Upserts keyed on contacts.github_handle.
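
The upsert keyed on contacts.github_handle can be expressed as a single statement; the column list below is an assumption based on the Drizzle schema shown earlier on this page:

```python
# Sketch of the contact upsert, keyed on the unique github_handle column.
# Column list is illustrative.
UPSERT_CONTACT_SQL = """
INSERT INTO contacts (github_handle, profile, authority_score, tags)
VALUES ($1, $2::jsonb, $3, $4::jsonb)
ON CONFLICT (github_handle) DO UPDATE SET
    profile = EXCLUDED.profile,
    authority_score = EXCLUDED.authority_score,
    tags = EXCLUDED.tags
"""

# e.g. inside a gh_patterns_graph persistence node (illustrative):
#   await pool.execute(UPSERT_CONTACT_SQL, handle,
#                      json.dumps(profile), strength_score, json.dumps(tags))
```

ON CONFLICT ... DO UPDATE makes re-runs idempotent: a candidate re-surfaced on the next scan refreshes their row instead of duplicating it.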

Database · 2024

Drizzle ORM

Drizzle Team

TypeScript ORM with typed queries — the contacts schema definition shared by Rust writes and TS reads

contacts.github_handle (unique), contacts.profile (jsonb ContactProfile), contacts.authority_score — all surfaced through Drizzle types.

Frontend · 2024

React Flow

xyflow

Node-and-edge visualization library with custom nodes, dark-mode theming, and smooth-step routing

Renders the interactive diagram on this page — same custom node components as /how-it-works.