Andy Quangvan
Principal Engineer  ·  Portfolio SRE  ·  Agentic AI
I build things to understand them. I scale what works to drive organizational impact. I teach what I've learned to solidify my understanding, and to raise the floor for everyone around me.
Lead Site Reliability Engineer
SLO Adoption Program — Designed and lead a portfolio-wide SLO program that connects reliability metrics to client experience outcomes. The program spans discovery sessions with product and engineering leads, executive alignment, and a governance model defining shared ownership between engineering and product. Produced executive communications that the director adopted for VP and C-level audiences.
Margarita Hour — Identified and resolved a critical silent production failure: users experienced daily session interruptions with no alert firing and no incident declared. Traced the root cause to token expiry with no graceful retry, resolved the issue, and used the finding to build the organizational case for proactive reliability monitoring.
Agentic Workflow Platform (FlowFoundry) — Built an agentic workflow platform on Azure Functions and Azure OpenAI during innovation week to establish an organizational subject-matter-expert position in agentic AI. The platform informed the company's AI architecture direction. Led the adoption evaluation of GitHub Agentic Workflows and Datadog MCP Server as organizational standards. Designed a hybrid deterministic/agentic execution model and JSON-driven workflow definitions.
Portfolio Modernization Program — Independently identified the need for portfolio-wide modernization and now lead the effort, beginning with Azure Function Apps. Assess services for fragility, observability gaps, and standards drift without external direction. Decompose findings into epics and user stories sequenced to support the SLO program's observability requirements.
Engineering Standards — Established DB resiliency standards (Polly, command timeouts, connection pool sizing) with CI fitness function enforcement. Led Application Insights decommission and Datadog migration. Implemented secret scanning standards across ADO and GitHub.
C# / .NET 8 · Go · Azure · Datadog · Azure OpenAI · MCP Protocol · Terraform · Docker · Kubernetes · React / TypeScript
Information Security Cloud Architect
  • Developed cloud security strategy for public cloud presence and acquisition onboarding
  • Led CSPM selection and implementation
  • Defined IAM and RBAC framework for cloud workloads
Azure Architecture CSPM IAM / RBAC
IT Operations Manager
  • Led cloud-first transformation across engineering and operations teams
  • Introduced architectural patterns — connection pooling, retry logic, infrastructure as code — that became organizational standards
  • Built monitoring as a platform capability; defined engineering culture principles: fail fast, blameless postmortems, cloud first
  • Created platforms engineers could interact with directly through their pipelines
Azure DevOps SRE Terraform Leadership
Senior Engineering Manager
  • Managed full AWS infrastructure and built CI/CD pipeline from scratch
  • Converted system engineering team to SRE practice
  • Implemented security policies and compliance tooling including DLP and EDR
AWS · Docker · SRE · Leadership
System Architect
  • Automated deployment workflows and infrastructure operations
  • Managed cloud and big data infrastructure (Cloudera Hadoop)
  • Partnered with development teams to improve operational practices
Linux · Chef · Ruby · AWS
Platform & Reliability
SRE · SLO Design · Observability
Datadog · Error Budget Governance
Incident Management · Portfolio SRE
Agentic AI & Automation
MCP Protocol · Agent Skills
GitHub Copilot CLI · Azure OpenAI
Agentic Workflow Design · FlowFoundry
Engineering Leadership
Technical Standards · Mentoring
Cross-team Delivery · POC to Production
Executive Communication
Master's Degree, Information Technology · Southern New Hampshire University
2019
Bachelor's Degree, Computer Network Management · Westwood College
2011
Contributor — Materials Science Library
  • PR #4626 — Migrated Sphinx documentation build to CI and removed generated artifacts from source control
  • PR #4631 — Add preparation step for Jekyll deployment and include static files
Python · CI/CD · Open Source
Contributor — Materials Science Library
  • PR #17 — Fix AttributeError in VaspInput.as_dict()
  • PR #18 — Implement attribute setting for PeriodicSite
  • PR #19 — Remove 'unsafe' loader type from YAML in DftSet
  • PR #21 — Add _filter_kwargs method to IStructure for better kwargs handling
Python · CI/CD · Open Source