Open to Opportunities

Alex Knill

Cloud Operations & Support Engineering

Software engineer with 15+ years across startups and hyperscalers. Built incident management from scratch at CoreWeave. Specializes in support engineering, data operations, and being the person who makes things work when they break. Fixer by nature, firefighter by necessity, translator by instinct.

CoreWeave, Urbint, Complion, Movable, Findaway World
Years in Tech: 15+
Companies: 5
From Startup to Hyperscaler: CoreWeave, Complion, Urbint
Community: Co-Chair, PyOhio

Experience

CoreWeave

Cloud Operations Engineer

2024 - 2025

Remote

Founding member of the Cloud Operations team at a GPU-focused hyperscaler during explosive growth (employee #609, company grew from ~600 to 1,300+ during tenure). Built incident management processes from scratch. Served as Incident Manager on Call (IMOC), coordinating cross-team responses across a 24/7 operation supporting AI/ML infrastructure at massive scale.

  • Built the Cloud Ops team's entire incident management process from scratch — severity levels, triage workflows, reporting, and organizational accountability
  • Served as IMOC during production incidents, coordinating war rooms across engineering teams via Slack and Zoom while maintaining team morale during extended outages
  • Led development of a Grafana dashboard migration script (Python) during observability platform changes, managing scope and teaching the team software development workflows
  • Investigated and triaged alerts across Kubernetes clusters, Slurm HPC nodes, and core networking infrastructure
  • Contributed features to a Go-based Slack bot for incident automation (Jira integration, channel creation, on-call paging)
  • Acted as an organizational shield, triaging low-level alerts with runbooks so that engineers were not woken unnecessarily during off-hours
  • Documented recurring Slurm node failures into runbooks, turning unknown problems into known procedures with quick fixes
Kubernetes, Grafana, Prometheus, Loki, PagerDuty, Slack, Jira, Confluence, Go, Python, Linux, Slurm, Backstage
Urbint

Incident Management Engineer

2022 - 2024

Remote

Support engineering and incident management for a utility-sector SaaS platform handling mission-critical 811 dig ticket data. Managed complex data pipelines (Airflow, Kafka, custom streaming) across legacy and modern systems. Built monitoring improvements and automation tools that eliminated hours of manual work.

  • Delivered an 'impossible' SQL report in first weeks — a complex Postgres query using JSON extraction, string manipulation, and pivots that had been declared unfeasible by the previous team
  • Replaced threshold-based Datadog alerts with anomaly detection, eliminating false positives from natural business-hour data patterns
  • Built a Python/Selenium automation tool (on personal time) that reduced a 5-30 minute manual process to ~30 seconds
  • Traced and documented legacy Airflow pipeline data flows line-by-line through three generations of pipeline architecture
  • Built a Make.com pipeline to buffer oversized data files, preventing single-customer batch uploads from bringing down the entire system
  • Managed on-call rotation for data pipeline alerts across Airflow, Kafka, and a custom streaming platform
PostgreSQL, Airflow, Kafka, Datadog, Python, Selenium, Apache Superset, Make.com, Elasticsearch, Django, GCP
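
The shift from threshold-based alerts to anomaly detection mentioned above can be illustrated with a minimal sketch. This is not the Datadog configuration used at Urbint and not Datadog's algorithm; it is a simplified per-hour z-score check, and every name in it (`hourly_baseline`, `is_anomalous`, `breaches_threshold`, the sample volumes, the floor of 50) is hypothetical. It only shows why a fixed floor misfires on data that naturally rises and falls with business hours.

```python
from statistics import mean, stdev


def hourly_baseline(history):
    """Group past (hour, value) samples by hour of day.

    Returns {hour: (mean, sample standard deviation)} for every hour
    that has at least two samples.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(hour, value, baseline, z_limit=3.0):
    """Flag a sample only if it is far from what is normal for that hour."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


def breaches_threshold(value, minimum=50):
    """Fixed-floor check for comparison: fires whenever volume is below `minimum`."""
    return value < minimum


# Hypothetical ticket volumes: quiet at 03:00, busy at 14:00.
history = [(3, 10), (3, 12), (3, 11), (3, 9),
           (14, 500), (14, 520), (14, 480), (14, 510)]
baseline = hourly_baseline(history)

# Normal overnight lull: the fixed floor fires, the per-hour check does not.
night_threshold_alert = breaches_threshold(11)        # True (false positive)
night_anomaly_alert = is_anomalous(3, 11, baseline)   # False

# Real afternoon collapse: the fixed floor stays silent, the per-hour check fires.
day_threshold_alert = breaches_threshold(60)          # False (missed)
day_anomaly_alert = is_anomalous(14, 60, baseline)    # True
```

In this toy data the fixed floor both pages on an ordinary overnight lull and misses a genuine daytime drop, which is the pattern of false positives the bullet describes. A production anomaly monitor would also account for trend and day-of-week seasonality.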
Complion

Customer Support Associate → Support Engineer → Software Engineer

2016 - 2022

Cleveland, OH

Employee #9 at a clinical trials SaaS startup. Grew from sole support person to support engineer to full-stack software engineer over 6 years. Built the support function from zero, created internal tools that automated himself out of support work, served as product owner for the legacy product, and did everything from Salesforce administration to furniture assembly.

  • Built a Flask app that automated Tier 1 database operations and monthly client reports (cron + Azure Blob Storage + SendGrid), freeing himself from the support queue entirely
  • Managed production Postgres database for 6 years — built a library of SQL queries used across support, sales, and engineering teams
  • Served as product owner for the legacy product, implementing sprint workflows with a single engineer to prioritize customer impact
  • Set up the Zendesk-to-Salesforce integration, taught himself Salesforce administration via Trailhead, and built custom workflows for sales and marketing
  • Navigated 21 CFR Part 11 compliance requirements — produced regulatory training videos, managed change processes for clinical trials software
  • Performed full-stack engineering work: React frontend, Node.js backend, Python API, plus Selenium test automation
  • Mentored team members transitioning into technical roles, using the same Flask app that enabled his own growth
PostgreSQL, Python, Flask, React, Node.js, JavaScript, Selenium, Zendesk, Salesforce, Azure, Rackspace, MongoDB, Git
Movable

Technical Support

2014 - 2016

Cleveland, OH

Technical support for a wearable fitness tracker startup (~25 employees). Supported the MoveBand device sold to schools and corporations for health initiatives.

  • Handled high-volume phone and email support for a consumer hardware product
  • Developed customer communication skills for guiding non-technical users through hardware troubleshooting
  • Worked with the Zendesk platform, experience that carried directly into the next role at Complion
Zendesk, USB hardware, Windows, SQL
Findaway World

Sales Administrator

2009 - 2014

Cleveland, OH

First tech job, held while completing college full-time with evening classes. Supported Playaway pre-loaded audiobook devices sold to libraries. Got the job by entering a mock Shark Tank at business school, taking runner-up, and networking at a business conference.

  • Worked full-time while completing a 5-year evening degree program
  • Entered a company of ~100 employees through entrepreneurial initiative (mock Shark Tank → conference networking → direct hire)
  • Gained foundational support skills in a hardware product environment
Hardware support, Library systems

Skills

Proficient
  • SQL / PostgreSQL
  • Incident Management
  • Zendesk
Comfortable
  • Kubernetes
  • Python
  • Grafana / Prometheus
  • Datadog
  • Linux Administration
  • Docker
  • Airflow
  • Git
  • JavaScript / React / Node.js
  • Terraform / IaC
  • SAML / SSO
  • AI Tools & Agents
  • Bash / Shell Scripting
  • REST APIs
  • Selenium
  • Flask
  • Salesforce
  • MongoDB
  • ArgoCD
Foundational
  • Go
  • GCP
  • CI/CD Pipelines
  • Networking
  • Kafka

Education & Community

Education

DeVry University

Computer Information Systems, System Security

Completed while working full-time at Findaway World with evening classes over 5 years.

Community

PyOhio, Co-Chair

Part of the core organizer group for ~3 years.

Homelab

Personal Infrastructure

Runs a homelab with 3 Proxmox nodes, 3 Intel NUCs running Kubernetes (K3s), and a Raspberry Pi. Services include Immich (photos), Navidrome (music), Joplin (notes), Pi-hole (DNS), Grafana/Prometheus (monitoring), and Oura Ring health data collection. Uses Tailscale for remote access and Cloudflare Tunnel for selective public exposure. The homelab directly extends professional skills in Linux, Docker, Kubernetes, networking, and observability.
