Staff Site Reliability Engineer (hybrid)
At vArmour, we’re disrupting traditional security and IT players through the understanding of relationships - the complex interdependencies that drive IT. vArmour is the leading venture-backed provider of Application Relationship Management software, enabling complex, global enterprises to discover and protect the relationships between users, digital assets and data. Based in Silicon Valley, the company's investors include NightDragon and AllegisCyber.
The largest companies in the world use vArmour to visualize, map, monitor, and control the relationships that power their organizations. Only vArmour enables organizations to auto-discover relationships between applications and users across hybrid clouds without adding new agents or infrastructure. As businesses continue to tackle the challenges of digital transformation, vArmour makes it easy to achieve Zero Trust, securely migrate to cloud, visualize and control user access, and discover the blast radius of incidents.
At vArmour, “Relationships Matter” is the cornerstone philosophy that drives the company, both in our technology and product design and in how the company engages our people, customers, and partners. We live this value every day.
About the Role
We’re looking for someone passionate about ensuring the reliability and performance of complex software systems to join our Site Reliability Engineering team. In our dynamic and innovative team, you will play a crucial role in ensuring the reliability, performance, and scalability of our systems and services across multiple regions. You will work closely with cross-functional teams, including software engineers, operations, and QA teams, to design, build, and maintain robust and efficient systems. Your expertise in automation, monitoring, and incident response will contribute to the overall stability and uptime of our production environment.
What you’ll do
- Design, develop, and implement reliable and scalable systems, tools, and processes to ensure high availability, performance, and fault tolerance of our infrastructure and applications.
- Build out our CI/CD pipeline to deploy all layers of the architecture to production in a robust, easy-to-use, and automated way.
- Create and maintain robust monitoring and alerting systems to proactively identify performance bottlenecks, system failures, and potential issues. Implement appropriate remediation actions to minimize downtime and impact on users.
- Participate in incident response, perform root cause analysis, and develop strategies to prevent recurrence of critical incidents. Contribute to the continuous improvement of incident management processes and practices.
- Work in a multidisciplinary, DevOps-focused team and build close relationships with other developers and production support teams.
- Identify performance bottlenecks and areas for optimization in the infrastructure and application stack. Collaborate with development teams to optimize code, database queries, and system configurations.
- Stay up to date with industry trends, emerging technologies, and best practices related to site reliability engineering. Identify opportunities for process improvements and contribute to the evolution of our infrastructure and operational practices.
What you’ll bring
- Bachelor's degree in Computer Science, Information Systems, or a related field (or equivalent work experience).
- 7+ years of experience as a Site Reliability Engineer or in a similar role, with a strong background in designing and maintaining highly available, scalable, distributed systems.
- Proficiency in programming and scripting languages such as Python, Go, Bash, or similar.
- Advanced understanding of Kubernetes, microservices architecture, and design patterns for running enterprise-class SaaS at scale.
- Expertise in implementing and managing monitoring and logging tools (e.g., Prometheus, Grafana, Fluentd, ELK stack).
- Deep understanding of Linux/Unix operating systems, networking protocols, and web technologies.
- Strong knowledge of cloud computing platforms (e.g., AWS, Azure) and infrastructure-as-code frameworks (e.g., Terraform, Ansible, CloudFormation).
- Good networking knowledge, including VPCs, subnets, DNS, routers, and firewalls.
- In-depth knowledge of DevOps tools such as Git, Make, Jenkins, GitHub, Helm, and JFrog Artifactory.
What we offer
The opportunity to be a part of something special - an exciting work environment filled with motivated people on a mission to disrupt markets through excellent software. We strive to have parity of benefits across regions and while regulations differ from place to place, we believe taking care of our people is the right thing to do.
- Competitive salary and stock options
- Medical, dental, vision coverage for you and your family in many locations
- Ability to craft your schedule with flexible work arrangements and locations for many roles
- Generous holidays and vacation days each year
- Virtual weekly happy hours and catered brown bag lunches
- Monthly team programs and social events
- Annual innovation summits, hackathons, and top-performer recognition events
Don’t check all of the boxes? Don’t sweat it. We’re passionate about building a team of ambitious, diverse humans and as such, if you think you’ve got what it takes to succeed in our chaotic-but-fun, decentralized, remote-friendly, start-up environment—apply anyway. While we have a pretty good idea of what we need, we’re ready for you to challenge our thinking on who needs to be in this role.
vArmour is an equal opportunity employer. We do not discriminate on the basis of race, color, ancestry, religion, national origin, sexual orientation, age, citizenship, marital or family status, disability, gender identity or expression, veteran status, or any other legally protected status.