Alex Knill
Cloud Operations & Support Engineering
Software engineer with 15+ years across startups and hyperscalers. Built incident management from scratch at CoreWeave. Specializes in support engineering, data operations, and being the person who makes things work when they break. Fixer by nature, firefighter by necessity, translator by instinct.
Experience
Cloud Operations Engineer
2024 - 2025
Remote
Founding member of the Cloud Operations team at a GPU-focused hyperscaler during explosive growth (employee #609, company grew from ~600 to 1,300+ during tenure). Built incident management processes from scratch. Served as Incident Manager on Call (IMOC), coordinating cross-team responses across a 24/7 operation supporting AI/ML infrastructure at massive scale.
- Built the Cloud Ops team's entire incident management process from scratch — severity levels, triage workflows, reporting, and organizational accountability
- Served as IMOC during production incidents, coordinating war rooms across engineering teams via Slack and Zoom while maintaining team morale during extended outages
- Led development of a Grafana dashboard migration script (Python) during observability platform changes, managing scope and teaching the team software development workflows
- Investigated and triaged alerts across Kubernetes clusters, Slurm HPC nodes, and core networking infrastructure
- Contributed features to a Go-based Slack bot for incident automation (Jira integration, channel creation, on-call paging)
- Acted as an organizational shield, triaging low-level alerts with runbooks to avoid waking engineers unnecessarily during off-hours
- Documented recurring Slurm node failures into runbooks, turning unknown problems into known procedures with quick fixes
Alex joined CoreWeave as employee #609 during explosive growth — the entire Cloud Ops team started on the same day with no existing processes, no documentation, and no institutional knowledge. His proudest contribution wasn't a technical artifact — it was building the team's incident management framework from scratch: severity levels, triage workflows, escalation protocols, and reporting. As Incident Manager on Call, he became known for keeping morale high during extended outages. Colleagues told him directly: "It's always great when you're in an incident, Alex, because you make things seem not so bad." He also led a Python scripting effort to migrate hundreds of Grafana dashboards and contributed features to a Go-based Slack bot for incident automation — learning Go on the job in the process.
Incident Management Engineer
2022 - 2024
Remote
Support engineering and incident management for a utility-sector SaaS platform handling mission-critical 811 dig ticket data. Managed complex data pipelines (Airflow, Kafka, custom streaming) across legacy and modern systems. Built monitoring improvements and automation tools that eliminated hours of manual work.
- Delivered an 'impossible' SQL report in his first few weeks: a complex Postgres query using JSON extraction, string manipulation, and pivots that the previous team had declared unfeasible
- Replaced threshold-based Datadog alerts with anomaly detection, eliminating false positives from natural business-hour data patterns
- Built a Python/Selenium automation tool (on personal time) that reduced a 5-30 minute manual process to ~30 seconds
- Traced and documented legacy Airflow pipeline data flows line-by-line through three generations of pipeline architecture
- Built a Make.com pipeline to buffer oversized data files, preventing single-customer batch uploads from bringing down the entire system
- Managed on-call rotation for data pipeline alerts across Airflow, Kafka, and a custom streaming platform
Alex inherited what he calls "a spaghetti monster of data pipelines" — three layers of architecture (Airflow, Kafka, and a custom streaming platform) built by three successive engineering teams, all of whom had left. The data engineering team was dissolved about 60 days after he started, and pipeline ownership landed on his desk with no documentation and no one to page. He became the person who traced data flows line-by-line through legacy systems. In his first few weeks, he delivered a complex SQL report that the previous team had declared impossible — extracting JSON from Postgres and pivoting the data in ways the team didn't know were possible. He also built a Python/Selenium automation tool on his own time after his manager turned down the idea, cutting a 5-30 minute manual process down to about 30 seconds.
Customer Support Associate → Support Engineer → Software Engineer
2016 - 2022
Cleveland, OH
Employee #9 at a clinical trials SaaS startup. Grew from sole support person to support engineer to full-stack software engineer over 6 years. Built the support function from zero, created internal tools that automated himself out of support work, served as product owner for the legacy product, and did everything from Salesforce administration to furniture assembly.
- Built a Flask app that automated Tier 1 database operations and monthly client reports (cron + Azure Blob Storage + SendGrid), freeing himself from the support queue entirely
- Managed production Postgres database for 6 years — built a library of SQL queries used across support, sales, and engineering teams
- Served as product owner for the legacy product, implementing sprint workflows with a single engineer to prioritize customer impact
- Set up a Zendesk-to-Salesforce integration, taught himself Salesforce administration via Trailhead, and built custom workflows for sales and marketing
- Navigated 21 CFR Part 11 compliance requirements: produced regulatory training videos and managed change processes for clinical trials software
- Performed full-stack engineering work: React frontend, Node.js backend, Python API, plus Selenium test automation
- Mentored team members transitioning into technical roles, using the same Flask app that enabled his own growth
Alex was employee #9 at Complion — a clinical trials SaaS startup where his early responsibilities included Zendesk support, Salesforce administration, IT troubleshooting, and literally assembling furniture for new hires. That's what employee #9 at a startup looks like. Over six years, he grew from sole support person to support engineer to software engineer. The turning point was a Flask app he built from scratch: it automated Tier 1 database operations and monthly client report generation, effectively automating himself out of the support workflow. That freed him to transition into engineering, where he contributed to a React/Node.js/Python stack and built Selenium test suites. He also served as product owner for the legacy product — simultaneously the support guy, video producer, IT person, and product owner. He calls it the hardest job he's ever done.
Technical Support
2014 - 2016
Cleveland, OH
Technical support for a wearable fitness tracker startup (~25 employees). Supported the MoveBand device sold to schools and corporations for health initiatives. High-volume phone and email support.
- Handled high-volume phone and email support for a consumer hardware product
- Developed customer communication skills for guiding non-technical users through hardware troubleshooting
- Worked with the Zendesk platform (experience that carried directly into the next role at Complion)
This was high-volume phone and email support for a wearable fitness tracker sold to schools and corporations. Alex developed the customer communication skills that became foundational to his career — particularly the ability to guide non-technical users through troubleshooting without making them feel foolish. The number one support issue was a flat USB connector being inserted backwards. Convincing people to try flipping it after they'd already claimed to have tried multiple times was a daily exercise in patience and diplomacy. The Zendesk experience he built here carried directly into his next role at Complion.
Sales Administrator
2009 - 2014
Cleveland, OH
First tech job, held while completing college with evening classes and working full-time. Supported Playaway pre-loaded audiobook devices sold to libraries. Got the job by entering a mock Shark Tank at business school, taking runner-up, and networking at a business conference.
- Worked full-time while completing a 5-year evening degree program
- Entered a company of ~100 employees through entrepreneurial initiative (mock Shark Tank → conference networking → direct hire)
- Gained foundational support skills in a hardware product environment
This was Alex's first tech job, held while completing his degree at DeVry University with evening classes over five years. He got the job through pure initiative: entering a mock Shark Tank competition at business school, taking runner-up, networking at a business conference, and convincing one of Findaway's co-founders to hire him directly. It's a pattern that shows up throughout his career: doors open through initiative, not just applications. The discipline of working full-time while studying built the time management skills that have carried through every role since.
Skills
- SQL / PostgreSQL
- Incident Management
- Zendesk
- Kubernetes
- Python
- Grafana / Prometheus
- Datadog
- Linux Administration
- Docker
- Airflow
- Git
- JavaScript / React / Node.js
- Terraform / IaC
- SAML / SSO
- AI Tools & Agents
- Bash / Shell Scripting
- REST APIs
- Selenium
- Flask
- Salesforce
- MongoDB
- ArgoCD
- Go
- GCP
- CI/CD Pipelines
- Networking
- Kafka
Education & Community
DeVry University
Computer Information Systems — System Security
Completed over 5 years of evening classes while working full-time at Findaway World.
PyOhio — Co-Chair
Part of the core organizer group for ~3 years.
Homelab
Runs a homelab with 3 Proxmox nodes, 3 Intel NUCs running Kubernetes (K3s), and a Raspberry Pi. Services include Immich (photos), Navidrome (music), Joplin (notes), Pi-hole (DNS), Grafana/Prometheus (monitoring), and Oura Ring health data collection. Uses Tailscale for remote access and Cloudflare Tunnel for selective public exposure. The homelab directly extends professional skills in Linux, Docker, Kubernetes, networking, and observability.