Cogent Labs is looking for a Site Reliability Engineer (3+ years of relevant experience) to help create innovative AI-based services. Successful candidates will join a highly skilled, growing team and should be able to help plan high-quality backend solutions, maintain service SLOs and cloud infrastructure, and set up effective tooling, monitoring, and alerting.
Required experience and competencies
- Understands large-scale complex systems from a reliability perspective
- Experience working with Kubernetes and container-based applications
- Deep network understanding and troubleshooting ability
- Experience with cloud computing platforms (GCP in particular) is a plus
Responsibilities
- Setting up and maintaining service SLOs
- Specifying and developing scalable and performant cloud infrastructure
- Developing and maintaining a comprehensive continuous integration/deployment system
- Maintaining monitoring/alerting and measuring availability, latency, and overall system health
- System design consulting, developing software platforms and frameworks, capacity planning and launch reviews
The Cogent Labs engineering department is continuously working to build a culture that fosters and rewards the following qualities:
- Team effort: A cohesive team can be more effective than an isolated prodigy. Engineers are expected to work well in groups and look for opportunities to empower their colleagues.
- Responsibility: Take responsibility for your own tasks and hold others responsible for theirs.
- Self-improvement: Create an environment where engineers can focus on their engineering tasks and self-improvement without excessive outside disturbances.
- Experimentation: Engineers should have some freedom to experiment with new ideas and technologies, as this could ultimately translate into better products or valuable new IP.
- Quality: Maintain a mindset of developing high-quality features and code.