Site Reliability Engineer working with AWS, GCP & Azure at Snowplow Analytics Ltd (London, UK) (allows remote)
At Snowplow, we are on a mission to empower people to use data to differentiate. We provide technology that enables customers not only to control their data, but also to do amazing things with that control.
As part of that effort, we’re changing the way that people do digital analytics: instead of letting one-size-fits-all vendors such as Google Analytics and Adobe dictate what should be done with their data, companies can collect and own that data themselves.
Our Managed Service offering has grown significantly over the last year, and we now orchestrate and monitor the Snowplow event pipeline across more than 100 customer-owned AWS accounts, with individual accounts processing many billions of events per month.
We are looking for our second Site Reliability Engineer to help us grow to managing 1,000 and then 10,000 AWS, GCP and Azure accounts. You’ll work closely with our Tech Ops Lead on all aspects of our proprietary deployment, orchestration and monitoring stack.
The team and mission:
Technical Operations at Snowplow is responsible for two distinct domains:
- Snowplow’s internal infrastructure, which powers Snowplow Insights, CI/CD, the Snowplow website, and our support tooling, all running on AWS
- Our customers’ Snowplow-related infrastructure, running in their own AWS accounts
Within both domains, Tech Ops at Snowplow is striving to increase service reliability, fulfil customer requests in a timely fashion, and automate recurring tasks. Task automation is essential as our customer base grows because, unlike at most software businesses, our “infrastructure estate” scales linearly with our customer numbers.
Our roadmap includes:
- Deploying, orchestrating and monitoring Snowplow on GCP, Azure and on-premises, not just AWS
- “One click” infrastructure deployment and maintenance
- Building self-healing and self-upgrading infrastructure, which learns how to optimize itself for cost, performance and reliability
This is an enormously ambitious undertaking but also, we hope, a hugely exciting infrastructure automation challenge!
Today, our in-house stack uses pragmatic technologies including Docker, Ansible, Consul, CloudFormation, bash and Golang to manage our internal and customer infrastructure.
For our next level of automation, we are now exploring tools such as Terraform, Kubernetes and Vault.
What you’ll be doing:
- Developing software to automate, monitor and maintain client-deployed and Snowplow-internal infrastructure and services
- Providing deep technical support to internal and client teams
- Performing planned upgrades and modifications to customer infrastructure
- Handling high-severity internal or customer incidents, ensuring we meet all SLAs
On the software engineering side, you will be responsible for the implementation, deployment and stability of your systems and services. You will own your software end to end, including anything that is deployed.
On the operational side, you will join our on-call process for incident resolution and take your turn in the assignment rotation for regular client infrastructure work, with a strong mandate to continue automating it.
What we are looking for:
This role will be a great fit for somebody who:
- Has deep knowledge of Linux, networking, containers and similar, and is able to troubleshoot complex problems on individual servers and distributed systems
- Has worked with at least one of: Amazon Web Services, GCP or Azure
- Has been part of an on-call rotation
- Has interacted directly with customers to solve their specific technical issues
- Is comfortable scripting in one or more of: Bash, Python, Ruby or Perl
- Is comfortable programming in one or more of: Java, Scala, Golang or Python
This role would be a great fit for a software engineer or systems administrator who wants to transition into a full SRE role.
The integrity of our customers’ systems and data underpins everything we do at Snowplow. As part of their probation, candidates will be put through a full background security check.
An important part of this role relates to out-of-hours work, particularly around:
- Performing planned upgrades and modifications to customer infrastructure outside of their working hours
- Being on-call to handle high-severity internal or customer incidents, ensuring we meet all SLAs
The on-call process for the Tech Ops team is still evolving; we will discuss these requirements with short-listed candidates.
What you’ll get in return:
- Competitive package based on experience
- 25 days holiday a year plus bank holidays
- The freedom to work wherever suits you best
- Two fantastic company away-weeks a year
- Working alongside a strong and talented team
- Convenient central Shoreditch location
- Continuous supply of Pact coffee
- Regular mystery events