Would you be happy in a role spending up to 50% of your time on development tasks such as automation, automation, automation, new features and scaling, with the remaining 50% of your time on "ops"-related work such as issues, on-call and manual intervention?
My clients are creating a scalable global platform that will likely be as popular as the likes of Spotify, Uber and Airbnb. Established in 2014, they have a leadership team with a track record of taking technology-based solutions from conception through to delivery with organisations that are typically either floated or acquired in deals that make headline news.
They are looking for a talented Site Reliability Engineer who has a passion for highly scalable, highly reliable platforms. And automation, of course! You will play a key role in optimising the stability and availability of their global platform.
What you will be doing
● Automate, automate, automate
● Solve problems relating to mission-critical production services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions
● Influence and create new designs, architectures, standards and methods for distributed systems, with a strong bias towards scalability & reliability
● Embrace the ‘feedback loop’ by engaging with various engineering teams to transform production optimisation into deliverable backlog items
● Engage in capacity planning, software performance analysis and system tuning, ensuring production is as good as it gets!
● Ensure we deliver services that degrade gracefully
● Work collaboratively with all participants in software development activities and be supportive of developers and testers as they deliver services into production.
● Work with the wider development community to improve the software engineering processes and practices associated with continuously building, deploying, and updating software and environments.
● Evaluate both open source and third-party solutions to determine how we may integrate these components into our environment
● On-call duties to provide application support, incident management, and troubleshooting
● Simplify complex systems
● Optimise MTTD/MTTR/MTTF
● Automate SOPs
● Actually, automate everything
● SLOs, SLIs & reliability budgets
● Smart alerting & monitoring
● Dashboards - visualise production
● Educate SDE teams on what ‘production ready’ means
● Embrace Chaos Engineering
● Employ ‘best practices’ for platform operations
● Review and highlight any potential security risks or fragility within the existing platforms and ongoing developments
● ChatOps enhancement & optimisation
● DR testing
● Prod performance feedback
● Escalation management
● RCA & problem management
● R&D
● Capacity planning
● Ensure systems degrade gracefully
Experience & Skills
You must have
● 3+ years' experience in a similar role, running scaled, distributed, cloud-based services
● Development experience - be comfortable with code
● Awesome operational knowledge of running applications in Kubernetes
● Solid working knowledge of Helm
● Understanding of PostgreSQL and MySQL
● Terraform experience
● Network experience
● Ability to work in a fast-paced startup scene
● Knowledge of security best practices for cloud products
● Astounding troubleshooting skills
● A passion for learning and adopting new technologies that will save time and ease your day-to-day job
Nice if you also have
● Jenkins scripted pipelines experience
● GCP experience
● Kubernetes CRD controller experience
To be considered for this career-changing opportunity, please apply ASAP with your CV, including examples of your great achievements and coding skills.
Edison Hill Limited are operating and advertising as an Employment Agency for permanent positions and as an Employment Business for interim / contract / temporary positions. Edison Hill Limited are an Equal Opportunities employer and we encourage applicants from all backgrounds.
PLEASE ENSURE THAT THE JOB REFERENCE NUMBER APPEARS IN THE SUBJECT BOX.