About the company
We are a team of world-class builders and researchers with expertise across several domains: Ethereum Protocol Engineering, Layer-2, Decentralized Finance (DeFi), Miner Extractable Value (MEV), Smart Contract Development, Security Auditing and Formal Verification. Working to solve some of the most challenging problems in the blockchain space, we frequently collaborate with renowned organizations such as the Ethereum Foundation, StarkWare, Gnosis Chain, Aave, Flashbots, xDai, OpenZeppelin, Forta Protocol, Energy Web, POA Network and many more. We actively contribute to Ethereum core development, EIPs and network upgrades together with the Ethereum Foundation and other client teams.
Job Summary
Responsibilities:
• Lead the implementation and refinement of SRE practices across the organization, including SLOs, error budgets, and blameless postmortems
• Design and implement automation to eliminate toil and improve system reliability and efficiency
• Lead initiatives and architect scalable hybrid cloud solutions for Web3 infrastructure
• Manage error budgets and make data-driven decisions about when to prioritize reliability vs. new features
• Drive SRE practices to ensure high availability, performance, and reliability under varying load conditions
• Collaborate closely with the Platform Engineering team to build reliability into services from the ground up
• Collaborate closely with Nethermind's Infrastructure Leadership department to align SRE strategies with the overall technical vision
• Drive the adoption of observability best practices and implement comprehensive monitoring systems
• Develop and maintain service level indicators (SLIs) and objectives (SLOs), working with product owners to define appropriate reliability targets
• Mentor team members in SRE practices and foster a culture of continuous learning
• Lead capacity planning efforts, using quantitative analysis to predict and address future scaling challenges
• Contribute to long-term technical roadmaps, balancing reliability concerns with product innovation
Skills:
• 5+ years of experience in Site Reliability Engineering or DevOps
• Expert knowledge of cloud platforms (AWS, GCP)
• Expert knowledge of Kubernetes
• Proven experience in designing and implementing scalable, efficient, resilient systems
• Deep understanding of Linux/Unix systems and networking protocols
• Strong programming skills in Python or Go
• Strong background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
• Expertise in CI/CD tools (e.g., GitHub Actions, ArgoCD)
• Excellent communication skills, both written and verbal, with the ability to explain complex technical concepts to various audiences
• Experience in producing technical documentation, runbooks, presentations, and post-mortem reports
• Experience and passion for mentoring and upskilling team members