About the company
Blockdaemon offers a multi-chain, multi-cloud network management tool that can deploy nodes and connect them to blockchains within minutes. Blockdaemon also runs its own infrastructure for select projects, providing faster deployment times and lower costs. The end goal is to give all blockchain projects a global, decentralized network management tool connected to multiple infrastructure providers.
Job Summary
Blockdaemon is looking for a Site Reliability Engineer (SRE) to join our rapidly growing team and support our mission to connect institutions to blockchains through a single integration. The Site Reliability Engineer will work with all facets of the business to help streamline and scale our infrastructure. In this role, you will serve as a subject matter expert in network architecture and design implementation, working closely with engineers from all parts of the business to grow Blockdaemon to meet the requirements of the Web3 ecosystem.
Position Overview:
Become an internal support system and leader for operational health and incident response
Partner with and support the overall engineering organization and elevate incident management
Review and operationalize SLO/SLI/SLA for maximum efficiency
Design, implement, and troubleshoot services supporting the cloud infrastructure that manages and supports our nodes
Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability
Utilize continuous integration/continuous delivery (CI/CD) using the latest DevOps tools and innovative methods
Build strong and highly functional partnerships with product and other technology teams
Support senior engineers through outages and incidents for a business requiring 24x7 coverage
Build automations and self-service tooling with a security-conscious mindset
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating processes to continually improve
Troubleshoot various issues around reliability, resiliency, scalability, and availability
Assist with the on-call and triage rotation
Remove barriers to building and shipping products across bare metal and cloud service providers
Required Experience:
5+ years of background in DevOps, Site Reliability Engineering, or Production Engineering
You have experience running a mission-critical service at scale
Prior experience running critical production systems in a Linux environment
Passion for ensuring everything is observed and monitored end to end
Deep knowledge of distributed system design and operation
Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc.)
Experience writing automation tools and eagerness to "automate all the things"
Experience building large applications from scratch, complete with CI/CD infrastructure
Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud, Microsoft Azure)
Experience managing Kubernetes clusters or some other container orchestration infrastructure
You have worked with common infrastructure tools like Kubernetes, Docker, Terraform, Ansible, Consul, Packer, Puppet, and Helm
Strong sense of ownership, entrepreneurial spirit, and/or startup-like experience, capable of driving towards solutions independently while seeking feedback when appropriate
Knowledge of at least one (1) scripting language