Open to Opportunities

Alex Knill

Cloud Operations & Support Engineering

Software engineer with 15+ years of experience, from startups to a hyperscaler. Built incident management from scratch at CoreWeave. Specializes in support engineering, data operations, and being the person who makes things work when they break. Fixer by nature, firefighter by necessity, translator by instinct.

CoreWeave, Urbint, Complion, Movable, Findaway World

Years in Tech

15+

Companies

5

From Startup to Hyperscaler

CoreWeave, Complion, Urbint

Community

Co-Chair, PyOhio

Experience

CoreWeave

Cloud Operations Engineer

2024 - 2025

Remote

Founding member of the Cloud Operations team at a GPU-focused hyperscaler during explosive growth (employee #609, company grew from ~600 to 1,300+ during tenure). Built incident management processes from scratch. Served as Incident Manager on Call (IMOC), coordinating cross-team responses across a 24/7 operation supporting AI/ML infrastructure at massive scale.

Built the Cloud Ops team's entire incident management process from scratch — severity levels, triage workflows, reporting, and organizational accountability
Served as IMOC during production incidents, coordinating war rooms across engineering teams via Slack and Zoom while maintaining team morale during extended outages
Led development of a Grafana dashboard migration script (Python) during observability platform changes, managing scope and teaching the team software development workflows
Investigated and triaged alerts across Kubernetes clusters, Slurm HPC nodes, and core networking infrastructure
Contributed features to a Go-based Slack bot for incident automation (Jira integration, channel creation, on-call paging)
Acted as an organizational shield, triaging low-level alerts with runbooks to avoid waking engineers unnecessarily during off-hours
Documented recurring Slurm node failures into runbooks, turning unknown problems into known procedures with quick fixes

Kubernetes, Grafana, Prometheus, Loki, PagerDuty, Slack, Jira, Confluence, Go, Python, Linux, Slurm, Backstage

Urbint

Incident Management Engineer

2022 - 2024

Remote

Support engineering and incident management for a utility-sector SaaS platform handling mission-critical 811 dig ticket data. Managed complex data pipelines (Airflow, Kafka, custom streaming) across legacy and modern systems. Built monitoring improvements and automation tools that eliminated hours of manual work.

Delivered an 'impossible' SQL report in first weeks — a complex Postgres query using JSON extraction, string manipulation, and pivots that had been declared unfeasible by the previous team
Replaced threshold-based Datadog alerts with anomaly detection, eliminating false positives from natural business-hour data patterns
Built a Python/Selenium automation tool (on personal time) that reduced a 5-30 minute manual process to ~30 seconds
Traced and documented legacy Airflow pipeline data flows line-by-line through three generations of pipeline architecture
Built a Make.com pipeline to buffer oversized data files, preventing single-customer batch uploads from bringing down the entire system
Managed on-call rotation for data pipeline alerts across Airflow, Kafka, and a custom streaming platform

PostgreSQL, Airflow, Kafka, Datadog, Python, Selenium, Apache Superset, Make.com, Elasticsearch, Django, GCP

Complion

Customer Support Associate → Support Engineer → Software Engineer

2016 - 2022

Cleveland, OH

Employee #9 at a clinical trials SaaS startup. Grew from sole support person to support engineer to full-stack software engineer over 6 years. Built the support function from zero, created internal tools that automated away his own support work, served as product owner for the legacy product, and did everything from Salesforce administration to furniture assembly.

Built a Flask app that automated Tier 1 database operations and monthly client reports (cron + Azure Blob Storage + SendGrid), freeing himself from the support queue entirely
Managed production Postgres database for 6 years — built a library of SQL queries used across support, sales, and engineering teams
Served as product owner for the legacy product, implementing sprint workflows with a single engineer to prioritize customer impact
Set up a Zendesk-to-Salesforce integration, taught himself Salesforce administration via Trailhead, and built custom workflows for sales and marketing
Navigated 21 CFR Part 11 compliance requirements — produced regulatory training videos, managed change processes for clinical trials software
Performed full-stack engineering work: React frontend, Node.js backend, Python API, plus Selenium test automation
Mentored team members transitioning into technical roles, using the same Flask app that enabled his own growth

PostgreSQL, Python, Flask, React, Node.js, JavaScript, Selenium, Zendesk, Salesforce, Azure, Rackspace, MongoDB, Git

Movable

Technical Support

2014 - 2016

Cleveland, OH

Technical support for a wearable fitness tracker startup (~25 employees). Supported the MoveBand device sold to schools and corporations for health initiatives. High-volume phone and email support.

Handled high-volume phone and email support for a consumer hardware product
Developed customer communication skills for guiding non-technical users through hardware troubleshooting
Worked with the Zendesk platform (experience that carried directly into the next role at Complion)

Zendesk, USB hardware, Windows, SQL

Findaway World

Sales Administrator

2009 - 2014

Cleveland, OH

First tech job, held full-time while completing college through evening classes. Supported Playaway pre-loaded audiobook devices sold to libraries. Got the job by entering a mock Shark Tank at business school, taking runner-up, and networking at a business conference.

Worked full-time while completing a 5-year evening degree program
Entered a company of ~100 employees through entrepreneurial initiative (mock Shark Tank → conference networking → direct hire)
Gained foundational support skills in a hardware product environment

Hardware support, Library systems

Skills

Proficient

SQL / PostgreSQL
Incident Management
Zendesk

Comfortable

Kubernetes
Python
Grafana / Prometheus
Datadog
Linux Administration
Docker
Airflow
Git
JavaScript / React / Node.js
Terraform / IaC
SAML / SSO
AI Tools & Agents
Bash / Shell Scripting
REST APIs
Selenium
Flask
Salesforce
MongoDB
ArgoCD

Foundational

Go
GCP
CI/CD Pipelines
Networking
Kafka

Education & Community

Education

DeVry University

Computer Information Systems — System Security

Completed while working full-time at Findaway World with evening classes over 5 years.

Community

PyOhio — Co-Chair

Part of the core organizer group for ~3 years.

Homelab

Personal Infrastructure

Runs a homelab with 3 Proxmox nodes, 3 Intel NUCs running Kubernetes (K3s), and a Raspberry Pi. Services include Immich (photos), Navidrome (music), Joplin (notes), Pi-hole (DNS), Grafana/Prometheus (monitoring), and Oura Ring health data collection. Uses Tailscale for remote access and Cloudflare Tunnel for selective public exposure. The homelab directly extends professional skills in Linux, Docker, Kubernetes, networking, and observability.