Recruitment Pipeline
GitHub-sourced inbound recruitment. Six orthogonal discovery channels feed a single LangGraph (backend/leadgen_agent/gh_patterns_graph.py) that hydrates candidates via the GitHub GraphQL API, extracts features from the contribution calendar, embeds bios with a local Candle BERT 384-dim pass (Rust HTTP server on Metal/CPU, no cloud egress), and persists the result into the same contacts table the outbound B2B pipeline uses. Every candidate carries an auditable tag array, so the recruiter sees exactly which channels surfaced them and why. This pipeline is one of three siblings on the same schema; see the handoff stage below.
6 discovery channels: bio search, stargazers, contributors, org members, follower graph, seed domains
backend/leadgen_agent/gh_patterns_graph.py
15 bio-search passes targeting distinct AI/ML angles (RAG, DSPy, MLOps, vector DBs, agentic…)
search_queries() in gh_patterns_graph.py
50+ fields hydrated per candidate via a single GraphQL batch query
USER_GQL_FIELDS in backend/leadgen_agent/_gh_graphql.py
9 components in the rising-score composite (density, novelty, breadth, activity, skill, engagement, obscurity, recency, quality)
compute_rising_score() in gh_patterns_graph.py
0 cloud LLM calls per candidate embedding; Candle BERT runs locally on Metal/CPU
Candle BERT 384-dim local pass
discovery
Six-channel fan-in
Six orthogonal channels fan in to a single dedup gate keyed on numeric user.id. Each channel has a different false-positive profile, so the intersection is higher-confidence than any one source.
candidate_logins → hydration
hydration
Batch GraphQL hydration
A single GraphQL query hydrates 50+ fields per candidate (bio, location, contribution calendar, pinned repos, contributed repos, org memberships), collapsing what would be 10+ REST round-trips into one batch call.
hydrated_profiles → signal_extraction
signal_extraction
Deterministic feature extraction
Four deterministic extractors turn the hydrated profile into scoring features: a rolling activity profile, 16-term AI topic detection across pinned + top + contributed repos, multi-source skill tagging, and a keyword-based seniority classifier.
feature_vectors → scoring
scoring
Two-prior composite scoring
Two independent composite scores with opposite priors: rising_score favors undiscovered talent via an obscurity component; strength_score favors established track records. Opportunity skill match runs in parallel. A disjunctive gate (rising OR strength ≥ threshold) decides persistence.
scored_candidates → persistence
persistence
Dual-store with auditable tags
LanceDB stores vector embeddings for semantic similarity search; Neon stores the canonical contact row with profile jsonb. Every candidate carries a prefix-tagged provenance array (github:* · skill:* · seniority:* · opp:* · src:*), so recruiters see exactly why each candidate surfaced.
contact_rows → handoff
handoff
Shared-schema handoff
Three sibling pipelines converge on the same contacts schema: (1) this page, inbound recruitment of GitHub engineers; (2) /how-it-works, outbound B2B lead-gen, where we discover companies and email decision-makers; (3) inbound auto-apply, HN Algolia → Sonnet 4.6 batch_fit_scorer → Playwright submission (see Deep Dive). Email discovery, deliverability verification, AI composition, and reply-aware follow-up all apply uniformly across all three.
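The fan-in at the discovery stage can be sketched as a dictionary merge keyed on the numeric user.id, with one src:* tag added per channel. This is a minimal illustration, not the code in gh_patterns_graph.py: the `Candidate` class, the `fan_in` helper, and the channel names used in the tags are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    user_id: int  # numeric GitHub user.id, used as the dedup key
    login: str
    tags: set[str] = field(default_factory=set)


def fan_in(channel_hits: dict[str, list[tuple[int, str]]]) -> dict[int, Candidate]:
    """Merge per-channel (user_id, login) hits into one candidate per user.id."""
    merged: dict[int, Candidate] = {}
    for channel, hits in channel_hits.items():
        for user_id, login in hits:
            cand = merged.setdefault(user_id, Candidate(user_id, login))
            cand.login = login  # logins can change; keep the most recent one seen
            cand.tags.add(f"src:{channel}")  # hypothetical tag spelling
    return merged


# Toy input: three channels, one user surfaced by all of them.
hits = {
    "bio_search": [(101, "alice"), (102, "bob")],
    "stargazers": [(101, "alice")],
    "contributors": [(101, "alice-renamed"), (103, "carol")],
}
candidates = fan_in(hits)
# Candidates surfaced by more than one channel carry stronger evidence.
multi = [c.login for c in candidates.values() if len(c.tags) > 1]
```

Keying on the numeric id rather than the login keeps a renamed account from appearing twice, and the accumulated tag set is what makes the multi-channel intersection visible to a recruiter.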
Deep Dive
Technical Details
Six discovery channels
Each channel has a different false-positive profile — intersecting evidence across channels produces higher confidence than any single source.
| Channel | Description | Signal | Confidence |
|---|---|---|---|
| Bio search | GitHub REST /search/users with 15 targeted passes (RAG, LLM, DSPy, MLOps, vector DBs, principal/staff titles). | self-declared | low-medium |
| Stargazers | Users who starred 13 high-signal AI framework repos (langgraph, crewAI, transformers, dspy, instructor…). | interest | low |
| Contributors | Actual code contributors to 12 foundational AI/ML libraries (langchain, llama_index, autogen, vllm, litellm…). | capability | highest |
| Org members | Public members of 8 AI labs (deepmind, huggingface, cohere-ai, stability-ai, faculty-ai…). | affiliation | high |
| Follower graph | 1-hop follower expansion from top-scoring seeds; exploits homophily in the AI-engineer community graph. | homophily | medium |
| Seed domains | GitHub handles extracted from the team pages of already-enriched AI-native companies. | pre-qualified | high |

Signals extracted per candidate
Every field is recomputed on each pipeline run from live GraphQL data — no stale caches.
| Field | Description | Type |
|---|---|---|
| account_age_days | Days since createdAt; novelty component of rising_score. | integer |
| contributions_30/90/365d | Rolling windows derived from the contribution calendar. | integer |
| trend | One of rising, stable, declining, dormant, new; computed from the 90d-over-90d delta. | enum |
| current_streak_days | Consecutive days with at least one contribution. | integer |
| pinned_repos_json | Top 6 repos the user pinned on their profile. | jsonb |
| contributed_repos_json | Top 10 external repos by stars the user committed to. | jsonb |
| organizations_json | Up to 5 org memberships from GraphQL organizations. | jsonb |
| ai_topic_matches | Count of 16-term AI taxonomy hits across pinned + top + contributed repos. | integer |
| skills | skill:* tags from bio + languages + topics. | tag array |
| seniority | One of junior, mid, senior, staff-plus; deterministic keyword classifier. | enum |
| is_hireable | GitHub profile hireable flag; 1.15× multiplier in rising_score. | boolean |
| rising_score | 9-component composite in [0, 1]; emerging-talent prior. | float |
| strength_score | Experience-weighted alternative; senior-hire prior. | float |
| src:* provenance | Every discovery channel that surfaced this candidate. | tag array |
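
The calendar-derived fields in the table (contributions_30/90/365d, trend, current_streak_days) can be sketched from a day-to-count mapping. This is an illustration under assumptions: the function names are hypothetical, and the 1.25/0.75 trend thresholds, the 90-day cutoff for "new", and the definition of "dormant" as an empty recent window are invented here, since the page only says the trend comes from a 90d-over-90d delta.

```python
from datetime import date, timedelta


def window_sum(days: dict[date, int], end: date, length: int) -> int:
    """Total contributions in the `length` days ending at `end`, inclusive."""
    return sum(days.get(end - timedelta(days=i), 0) for i in range(length))


def current_streak(days: dict[date, int], today: date) -> int:
    """Consecutive days, counting back from `today`, with at least one contribution."""
    streak = 0
    while days.get(today - timedelta(days=streak), 0) > 0:
        streak += 1
    return streak


def trend(days: dict[date, int], today: date, account_age_days: int) -> str:
    """Classify activity from the latest 90 days versus the 90 days before."""
    if account_age_days < 90:  # assumed cutoff: too young to compare windows
        return "new"
    recent = window_sum(days, today, 90)
    prior = window_sum(days, today - timedelta(days=90), 90)
    if recent == 0:  # assumed definition of dormant
        return "dormant"
    if recent > prior * 1.25:  # assumed threshold
        return "rising"
    if recent < prior * 0.75:  # assumed threshold
        return "declining"
    return "stable"


# Toy calendar: 2 contributions a day for the last 10 days, 1 contribution 100 days ago.
today = date(2024, 6, 30)
days = {today - timedelta(days=i): 2 for i in range(10)}
days[today - timedelta(days=100)] = 1

contributions_30d = window_sum(days, today, 30)
contributions_365d = window_sum(days, today, 365)
streak_days = current_streak(days, today)
label = trend(days, today, account_age_days=400)
```

Because every function is a pure computation over the calendar, re-running the pipeline on fresh GraphQL data recomputes the same fields without any cached state.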
Batch hydration: one GraphQL query per candidate
What would be 10+ REST round-trips collapses into a single batched GraphQL query. Abbreviated below — actual query is in backend/leadgen_agent/_gh_graphql.py.
```graphql
# nodes(ids:) resolves global node IDs (the REST node_id field), not logins,
# so the variable is typed [ID!]!.
query UserBatch($ids: [ID!]!) {
  users: nodes(ids: $ids) {
    ... on User {
      login bio company location email isHireable
      createdAt updatedAt
      followers { totalCount }
      contributionsCollection {
        totalCommitContributions
        totalPullRequestContributions
        contributionCalendar {
          weeks { contributionDays { contributionCount date } }
        }
      }
      pinnedItems(first: 6) { nodes { ... on Repository { name stargazerCount } } }
      repositoriesContributedTo(first: 10, orderBy: { field: STARGAZERS, direction: DESC }) {
        nodes { nameWithOwner stargazerCount repositoryTopics(first: 5) { nodes { topic { name } } } }
      }
      organizations(first: 5) { nodes { login name } }
    }
  }
}
```
Technical Foundations
gh_patterns_graph (LangGraph)
This repo · backend/leadgen_agent
3,000+ line LangGraph that supersedes the prior Rust crates/gh: same six-channel discovery, GraphQL hydration, and 9-component composite scoring, now executed under the same langgraph dev runtime as every other graph in this app.
Single source of truth for GitHub discovery, enrichment, and scoring; sibling graphs (gh_ai_repos_graph, gh_quick_brief_graph, score_recruiter_fit_graph, classify_recruitment_bulk_graph) compose against the same hydrated user shape.
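The scoring step can be sketched as a weighted mean of the nine components, the 1.15× hireable multiplier, and the disjunctive persistence gate. This is not compute_rising_score() itself: the equal weights, the clamp to 1.0 after the multiplier, the 0.6 threshold, and the function names are all assumptions made for illustration.

```python
COMPONENTS = (
    "density", "novelty", "breadth", "activity", "skill",
    "engagement", "obscurity", "recency", "quality",
)


def rising_score(components: dict[str, float], is_hireable: bool) -> float:
    """Equal-weight mean of the nine components, each clamped to [0, 1]."""
    clamped = [min(max(components.get(name, 0.0), 0.0), 1.0) for name in COMPONENTS]
    score = sum(clamped) / len(COMPONENTS)  # equal weights are an assumption
    if is_hireable:
        score *= 1.15  # hireable multiplier from the signals table
    return min(score, 1.0)  # assumed clamp so the result stays in [0, 1]


def should_persist(rising: float, strength: float, threshold: float = 0.6) -> bool:
    """Disjunctive gate: either score clearing the (assumed) threshold is enough."""
    return rising >= threshold or strength >= threshold


mid_profile = {name: 0.5 for name in COMPONENTS}
score_hireable = rising_score(mid_profile, is_hireable=True)
score_plain = rising_score(mid_profile, is_hireable=False)
keep = should_persist(rising=0.3, strength=0.7)
```

The OR in the gate is what lets the two opposite priors coexist: an obscure newcomer and an established senior engineer can both pass, each on a different score.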
GitHub REST API v3
GitHub
Endpoints for /search/users, /repos/{}/stargazers, /repos/{}/contributors, /orgs/{}/members, /users/{}/followers
Used for all six discovery channels — fan-out across keyword search, stargazers, contributors, org members, and follower graphs.
GitHub GraphQL API v4
GitHub
Batch hydration of 50+ fields per user: contribution calendar, pinned repos, org memberships, top repos, contributed repos
Used after REST discovery to enrich candidates with signals that REST would require N separate calls to fetch.
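Once a user node comes back from the batch query, the nested response has to be flattened before the extractors can use it. The sketch below assumes the response shape implied by the abbreviated query on this page; the helper names are hypothetical, the `AI_TOPICS` set is a made-up subset standing in for the real 16-term taxonomy, and counting distinct terms rather than total hits is an assumption.

```python
from datetime import date

# Hypothetical subset; the real taxonomy has 16 terms that are not listed here.
AI_TOPICS = frozenset({"llm", "rag", "machine-learning", "vector-database"})


def flatten_calendar(user: dict) -> dict[date, int]:
    """Turn weeks -> contributionDays into a flat {date: count} mapping."""
    weeks = user["contributionsCollection"]["contributionCalendar"]["weeks"]
    return {
        date.fromisoformat(day["date"]): day["contributionCount"]
        for week in weeks
        for day in week["contributionDays"]
    }


def ai_topic_matches(user: dict) -> int:
    """Count distinct taxonomy terms across the contributed repos' topics."""
    hits = set()
    for repo in user["repositoriesContributedTo"]["nodes"]:
        for node in repo["repositoryTopics"]["nodes"]:
            if node["topic"]["name"] in AI_TOPICS:
                hits.add(node["topic"]["name"])
    return len(hits)


# Toy node in the shape of one entry of the GraphQL response.
user = {
    "contributionsCollection": {"contributionCalendar": {"weeks": [
        {"contributionDays": [
            {"contributionCount": 3, "date": "2024-06-29"},
            {"contributionCount": 0, "date": "2024-06-30"},
        ]},
    ]}},
    "repositoriesContributedTo": {"nodes": [
        {"nameWithOwner": "org/a", "repositoryTopics": {"nodes": [
            {"topic": {"name": "rag"}}, {"topic": {"name": "python"}}]}},
        {"nameWithOwner": "org/b", "repositoryTopics": {"nodes": [
            {"topic": {"name": "rag"}}, {"topic": {"name": "llm"}}]}},
    ]},
}
calendar = flatten_calendar(user)
topic_hits = ai_topic_matches(user)
```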
httpx + asyncio
encode
Async HTTP client on the asyncio runtime; concurrent GitHub API calls with backoff on 403/429
All GitHub traffic from gh_patterns_graph flows through a single shared httpx.AsyncClient. Rate-limit headers are parsed on every response.
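The backoff decision can be sketched as a pure function over the status code and response headers, which keeps it testable without a network call. The header names (retry-after, x-ratelimit-remaining, x-ratelimit-reset) are GitHub's documented rate-limit headers; the function name, the exponential fallback, and its 60-second cap are assumptions rather than the policy gh_patterns_graph actually uses.

```python
import time
from typing import Optional


def backoff_seconds(
    status: int,
    headers: dict[str, str],
    attempt: int,
    now: Optional[float] = None,
) -> float:
    """Seconds to wait before retrying; 0.0 means the response needs no retry."""
    if status not in (403, 429):
        return 0.0
    now = time.time() if now is None else now
    lowered = {key.lower(): value for key, value in headers.items()}
    # Secondary rate limits tell the client how long to wait.
    if "retry-after" in lowered:
        return float(lowered["retry-after"])
    # Primary limit exhausted: wait until the reset time (epoch seconds).
    if lowered.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in lowered:
        return max(float(lowered["x-ratelimit-reset"]) - now, 0.0)
    # Assumed fallback: exponential backoff capped at 60 seconds.
    return min(2.0 ** attempt, 60.0)


wait_retry_after = backoff_seconds(429, {"Retry-After": "7"}, attempt=0)
wait_until_reset = backoff_seconds(
    403,
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1030"},
    attempt=0,
    now=1000.0,
)
wait_fallback = backoff_seconds(403, {}, attempt=3)
```

A caller sharing one httpx.AsyncClient would typically pass `response.status_code` and `dict(response.headers)` to such a function and `await asyncio.sleep(...)` on the result before retrying.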
Candle (HuggingFace)
HuggingFace
Minimalist ML framework for Rust with Metal/CUDA/CPU backends; local BERT inference, no cloud calls
Embeds contributor bios + repo descriptions into a vector space for semantic candidate search. Zero egress per candidate.
LanceDB
LanceDB
Embedded columnar vector database; ANN search over millions of vectors without a server
Stores contributor vectors alongside raw fields. Powers offline similarity search across the whole candidate corpus.
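To make the similarity search concrete, here is a brute-force cosine ranking over stored vectors. It is a stand-in for illustration only: LanceDB answers the same kind of query with an approximate nearest-neighbour index, and this sketch does not use the LanceDB API. The 3-dimensional toy vectors stand in for the 384-dim embeddings, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 3) -> list[str]:
    """Logins of the k stored vectors most similar to the query, best first."""
    scored = sorted(
        ((cosine(query, vec), login) for login, vec in corpus.items()),
        reverse=True,
    )
    return [login for _, login in scored[:k]]


# Toy corpus: real bio embeddings would be 384-dimensional.
corpus = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.9, 0.1, 0.0],
    "carol": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

The exhaustive scan is linear in the corpus size, which is the cost an ANN index avoids once the corpus reaches millions of vectors.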
Neon PostgreSQL
Neon
Serverless PostgreSQL; the canonical contacts store shared with the company pipeline
Scored candidates land in contacts with github_handle unique key, profile jsonb, authority_score, and a searchable tag array.
asyncpg
MagicStack
High-performance async PostgreSQL driver for Python; connection-pooled writes from gh_patterns_graph nodes
Used by gh_patterns_graph to upsert candidates and org-pattern rows into Neon. Upserts keyed on contacts.github_handle.
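An upsert keyed on contacts.github_handle might look like the statement below, using PostgreSQL's ON CONFLICT clause and asyncpg's `$n` positional placeholders. The github_handle, profile, and authority_score columns come from this page; the `tags` column name, the `upsert_args` helper, and the choice to overwrite every field on conflict are assumptions.

```python
import json

# Hypothetical statement; the `tags` column name is an assumption.
UPSERT_SQL = """
INSERT INTO contacts (github_handle, profile, authority_score, tags)
VALUES ($1, $2::jsonb, $3, $4::text[])
ON CONFLICT (github_handle) DO UPDATE SET
    profile = EXCLUDED.profile,
    authority_score = EXCLUDED.authority_score,
    tags = EXCLUDED.tags
"""


def upsert_args(login: str, profile: dict, authority_score: float, tags: list[str]) -> tuple:
    """Positional arguments for UPSERT_SQL, in placeholder order."""
    return (
        login,
        json.dumps(profile, sort_keys=True),  # jsonb is sent as JSON text
        authority_score,
        sorted(set(tags)),  # deduplicate and keep a stable tag order
    )


args = upsert_args(
    "alice",
    {"bio": "RAG engineer"},
    0.72,
    ["src:stargazers", "skill:rag", "src:stargazers"],
)
# With an open asyncpg connection this would run as:
#     await conn.execute(UPSERT_SQL, *args)
```

Relying on the unique key means a candidate rediscovered on a later run updates the existing row instead of creating a duplicate.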
Drizzle ORM
Drizzle Team
TypeScript ORM with typed queries; the contacts schema definition shared by Python writes and TS reads
contacts.github_handle (unique), contacts.profile (jsonb ContactProfile), contacts.authority_score — all surfaced through Drizzle types.
React Flow
xyflow
Node-and-edge visualization library with custom nodes, dark-mode theming, and smooth-step routing
Renders the interactive diagram on this page — same custom node components as /how-it-works.