The Container-as-a-Service (CaaS) platform team is looking for a Site Reliability Engineer to join our global team. You will be part of a dynamic, international environment with the rare opportunity to influence the overall DevOps journey for a global investment bank, one of the strategic initiatives in IT. In this role, you will provide services around Red Hat OpenShift, the enterprise container platform, and partner with many application development teams across the business divisions globally.
We offer you an impactful role as Site Reliability Engineer, working in a challenging and innovative transformation project with end-to-end responsibility. You'll design and operate automation software to provision, upgrade, monitor, and heal a large global fleet of OpenShift clusters.
You'll participate in release cycles of our offerings by deploying code, integrating with continuous integration (CI) and continuous delivery (CD) tools, and handling change management.
You'll participate in a follow-the-sun on-call rotation and help with incident management, root cause analysis, and continuous improvement activities, leading engineering efforts against a service-level agreement (SLA) and error budget. This will include a weekend working schedule. You'll perform software updates, peer code reviews, testing, and Common Vulnerabilities and Exposures (CVE) analysis, and respond to security threats. You'll also serve as a goodwill ambassador to developer, operations, and partner communities.
- Excellent troubleshooting skills and a strong can-do attitude; known by your partners for your sense of ownership and proactive approach
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
- Scale systems through mechanisms such as automation; evolve systems by pushing for changes that improve reliability and velocity
- Practice sustainable incident response and blameless postmortems
- Familiarity with Agile and Scrum
Experience with the following technologies:
- Java, Python, C, C++, Go, Perl, Ruby (one or more)
- Container Platform: OpenShift or Kubernetes
- Source Code Management, Build Management and Configuration Management
- Monitoring Tools
- Unix/Linux operating systems internals and administration (e.g., filesystems, inodes, system calls) or networking (e.g., TCP/IP, routing, network topologies and hardware, SDN)
Nice to have
- Extensive experience delivering and supporting production services and massive-scale distributed systems
- Effective communication skills and experience working directly with and presenting to a global team
English: B2 Upper Intermediate
If needed, we can help you with the relocation process.