Job Title | Location | Description | Posted |
---|---|---|---|
Site Reliability Engineer- Remote
Paramo Technologies |
|
To apply for this position you must be based in the Americas, preferably Latin America (the United States of America is not applicable). Applications from other locations will be disqualified from this selection process. We are a cutting-edge e-commerce company developing products for our own technological platform. Our creative, smart and dedicated teams pool their knowledge and experience to find the best solutions to meet project needs while maintaining sustainable and long-lasting results. How? By making sure that our teams thrive and develop professionally. Strong advocates of hiring top talent and letting them do what they do best, we strive to create a workplace that allows for an open, collaborative and respectful culture. What You Will Be Doing: Improving reliability through the construction of systems and software, your primary role will be that of a software engineer. You won't be writing loads of code, but you will be able to see the bigger picture and you'll really understand how development decisions impact wider systems. As an integral part of the company, you will collaborate closely with our various development teams to ensure that they are developing for reliability and resilience. Analyzing development decisions to understand how they will impact key reliability metrics measured by Service Level Objectives and error budgets. These metrics will be the foundation for all coding and architecture configurations. Some Of Your Responsibilities Will Include: Interacting with other engineering teams to help them improve the availability, reliability and resilience of our infrastructure and systems. Using your analytical skills to help engineering teams debug and fix issues. Helping teams identify, troubleshoot and resolve high-impact issues. Practicing sustainable incident response, facilitating incident resolution and performing blameless postmortems. 
Creating and keeping up to date the required documentation related to all systems/solutions in their area of responsibility. Building knowledge in incident & problem management, change management and security. On-call availability. Knowledge And Skills You Need To Have: B.S. in Computer Science, Computer Engineering or a related field with 5 years of relevant experience, or M.S. in Computer Science, Computer Engineering or a related field (if you don't meet this requirement, an equivalent combination of experience and/or education will be taken into consideration). 5+ years troubleshooting systems and infrastructure. Software development background with the ability to analyze and understand existing code. Familiar with microservice-based architecture. Proven experience with any monitoring system (Prometheus, Nagios, Zabbix, New Relic or any other). Understanding the fundamental principles of continuous integration, testing and deployment. Experience with Linux and Windows-based containers and container orchestration such as Docker, Kubernetes, Docker Swarm, etc. Knowledge of Infrastructure as Code software (Ansible, Terraform). Experience with log management tools like Graylog, ELK or similar technologies. Basic understanding of TCP/IP (routing, subnets, ports, etc.). Working knowledge of HTTP layer infrastructure, including load balancers and web servers. Business analysis experience. Flexibility to work with departments in different time zones. English & Spanish fluency is a must. Why choose us? We provide the opportunity to be the best version of yourself, develop professionally and create strong working relationships, whether working remotely or on-site. While offering a competitive salary, we also invest in our people's professional development and want to see you grow and love what you do. We are dedicated to listening to our team's needs and are constantly working on creating an environment in which you can feel at home. 
We offer a range of benefits to support your personal and professional development: 22 days of annual leave. 10 days of national holidays. Health Insurance options. Access to e-learning platforms. Possibility of on-site English classes in some countries and more. Join our team and enjoy an environment that values and supports your well-being. If this sounds like the place for you contact us now!
|
|
AWS Site Reliability Engineer/ 100% Remote
Motion Recruitment |
Los Angeles, CA
|
AWS Site Reliability Engineer. 6-12 Month Contract (Possible Extension). Location: 100% Remote (Client located in Los Angeles, CA). The AWS Site Reliability Engineer, under the supervision of the Global IT Infrastructure Manager, will focus on the day-to-day tasks of monitoring and maintaining our cloud environments. The AWS Site Reliability Engineer will monitor all cloud environments and implement monitoring and alerting systems and processes to help manage and ensure client Service Level Agreements are met. Job Description: It is an exciting time to be part of the company's CICD and Cloud Site Reliability Engineering (SRE) team. SREs operate right at the intersection of Software Engineering and Infrastructure Engineering. The SRE team strives to make this company highly reliable, scalable, operable and secure throughout the entire platform. As a member of the CICD and Cloud Reliability team, you’ll work at the heart of the company's Network to make sure we have a high-performing platform that is also highly available and highly reliable. You will be part of a team that is execution-oriented and results-driven, and which enables service development by designing, building, deploying and operating cloud infrastructure and CICD services at scale. You’ll also be able to exercise your troubleshooting skills, with the opportunity to zoom in on anything from code issues to packet loss in the network. Your primary responsibilities will include contributing to the implementation and delivery of the end-to-end automation platform to support continuous integration and continuous delivery (CI/CD), with a focus on developer self-service capabilities. This position requires extensive technical expertise and deep domain knowledge of continuous integration and continuous delivery platforms, especially in cloud-based service environments. Broad industry knowledge, strong customer focus and excellent communication skills are a must. 
Contributes to a team of Engineers to deliver and support highly available self-service CI/CD capabilities. Showcases uncompromising ownership of outcomes and deliverables Adheres to software development best practices Role Model for customer focused delivery for both internal and external customers AWS Experience REQUIRED Energetically and effectively works across organizational boundaries collaborating to deliver awesome developer and platform capabilities. Experienced Engineer that drives Operational Excellence within the team Builds and fosters agile engineering capabilities and quality engineering practices Results driven person with great energy Forward looking Engineer with execution know-how to take SIE to the next level of CI/CD Skill Requirements 3+ years professional Site Reliability experience operating microservices at scale 2+ years hands-on AWS experience deploying supporting managing applications Experience with Docker Kubernetes and in particular EKS Extensive use of automation and configuration management tools such as Ansible or Chef with obsessive desire to automate Strong development experience in one of these languages – Java Python or Go Experienced user of one or more source code management tools preferably Git Should have experience with continuous integration continuous delivery/deployment tools like Jenkins Spinnaker or similar Education BS in Computer Science or equivalent experience You will receive the following benefits: Medical Insurance & Health Savings Account (HSA) 401(k) Paid Sick Time Leave Pre-tax Commuter Benefit Motion Recruitment provides IT Staffing Solutions (Contract Contract-to-Hire and Direct Hire) in major North American markets. Our unique expertise in today’s highest-demand tech skill sets paired with our deep networks and knowledge of our local technology markets results in an exemplary track record with candidates and clients. Applicants must be currently authorized to work in the U.S. 
on a full-time basis now and in the future.
|
|
DevOps/Site Reliability Engineer (Lisbon-Remote)
ExecutivePlacements.com - The JOB Portal |
Lisbon, IA
|
Overview Token Metrics is seeking a results-oriented IT administrator to manage our company's IT infrastructure. You will be upgrading and installing hardware and software troubleshooting to resolve IT issues and maintaining our networks and servers. Candidate should possess extensive experience in administration including system administration for cloud infrastructure (AWS primarily and knowledge of multi-cloud infrastructure) process automation site reliability and the ability to optimize the performance of our IT infrastructure. Responsibilities Act as a cloud system admin (AWS and Google Cloud and knowledge of multi-cloud infrastructure) Monitoring and maintaining networks and servers. Creating and automating alerting and monitoring system logs. Building tools to mitigate weaknesses in incident management or software delivery. Troubleshooting Support Escalation requests. Upgrading installing and configuring new hardware and software to meet company objectives. Implementing security protocols and procedures to prevent potential threats. Creating user accounts and performing access control. Performing diagnostic tests and debugging procedures to optimize computer systems. Documenting processes as well as backing up and archiving data. Developing data retrieval and recovery procedures. Designing and implementing efficient end-user feedback and error reporting systems. Supervising and mentoring IT department employees as well as providing IT support. Keeping up to date with advancements and best practices in IT administration. Requirements Bachelor's degree in Computer Science Information Technology Information Systems or similar. Applicable professional qualification such as Microsoft Oracle or Cisco certification. At least two years' experience in a similar role. Extensive experience with IT systems networks and related technologies. Solid knowledge of best practices in IT administration and system security. 
Exceptional leadership organizational and time management skills. Strong analytical and problem-solving skills. Excellent interpersonal and communication skills. Token Metrics helps crypto investors build profitable portfolios using artificial intelligence based crypto indices rankings and price predictions. Token Metrics has a diverse set of customers from retail investors and traders to crypto fund managers in more than 50 countries.
|
|
DevOps/Site Reliability Engineer (Lisbon-Remote)
ExecutivePlacements.com - The JOB Portal |
Beaverton, OR
|
Overview Token Metrics is seeking a results-oriented IT administrator to manage our company's IT infrastructure. You will be upgrading and installing hardware and software troubleshooting to resolve IT issues and maintaining our networks and servers. Candidate should possess extensive experience in administration including system administration for cloud infrastructure (AWS primarily and knowledge of multi-cloud infrastructure) process automation site reliability and the ability to optimize the performance of our IT infrastructure. Responsibilities Act as a cloud system admin (AWS and Google Cloud and knowledge of multi-cloud infrastructure) Monitoring and maintaining networks and servers. Creating and automating alerting and monitoring system logs. Building tools to mitigate weaknesses in incident management or software delivery. Troubleshooting Support Escalation requests. Upgrading installing and configuring new hardware and software to meet company objectives. Implementing security protocols and procedures to prevent potential threats. Creating user accounts and performing access control. Performing diagnostic tests and debugging procedures to optimize computer systems. Documenting processes as well as backing up and archiving data. Developing data retrieval and recovery procedures. Designing and implementing efficient end-user feedback and error reporting systems. Supervising and mentoring IT department employees as well as providing IT support. Keeping up to date with advancements and best practices in IT administration. Requirements Bachelor's degree in Computer Science Information Technology Information Systems or similar. Applicable professional qualification such as Microsoft Oracle or Cisco certification. At least two years' experience in a similar role. Extensive experience with IT systems networks and related technologies. Solid knowledge of best practices in IT administration and system security. 
Exceptional leadership organizational and time management skills. Strong analytical and problem-solving skills. Excellent interpersonal and communication skills. Token Metrics helps crypto investors build profitable portfolios using artificial intelligence based crypto indices rankings and price predictions. Token Metrics has a diverse set of customers from retail investors and traders to crypto fund managers in more than 50 countries.
|
|
Site Reliability Engineer (Remote)
Frontdoor, Inc. |
|
Overview: Frontdoor is reimagining how homeowners maintain and repair their most valuable asset – their home. As the parent company of two leading brands, we bring over 50 years of experience in providing our members with comprehensive options to protect their homes from costly and unexpected breakdowns through our extensive network of pre-qualified professional contractors. American Home Shield, the category leader in home service plans with approximately two million members, gives homeowners budget protection and convenience, covering up to 23 essential home systems and appliances. Frontdoor is a cutting-edge, one-stop app for home repair and maintenance. Enabled by our Streem technology, the app empowers homeowners by connecting them in real time through video chat with pre-qualified experts to diagnose and solve their problems. The Frontdoor app also offers homeowners a range of other benefits, including DIY tips, discounts and more. For more information about American Home Shield and Frontdoor, please visit frontdoorhome.com. Responsibilities Summary: Site Reliability Engineers (SREs) are responsible for maintaining the availability and uptime of infrastructure. SREs use software engineering principles to solve operational challenges to create reliable infrastructure. This position will reduce the toil from our everyday work using as much automation as possible. Responsibilities: Researches and implements solutions to build always-up, always-available, resilient services. Builds and maintains automation tooling for infrastructure, CI/CD and observability (monitoring, alerting, logging, tracing) pipelines. Builds and maintains cloud and container orchestration infrastructure. Collaborates with software engineering, security and systems teams to help automate and streamline operations and processes. Implements best DevOps practices across the organization to improve performance and efficiency. 
Performs research and implements solutions to build always-up, always-available, resilient services. Integrates and automates existing manual solutions and processes. Participates in an on-call rotation for production issue escalations. Troubleshoots and supports production issues. Assists with the planning for growth and capacity of the infrastructure. Participates on cross-functional company project teams responsible for implementing technology. Investigates anomalies/outages and determines steps to reproduce, root cause and solution options. Monitors environment performance and provides all necessary reporting analysis. Assists with the integration and automation of existing manual solutions and processes. Attends relevant conferences/seminars to remain current on new and upcoming technology. Self-directed, with the ability to coordinate the work of others both inside and external to the team. May include other duties as assigned. Qualifications Required Skills: Good understanding of Unix/Linux operating systems and their internals. Good understanding of core concepts of computer networking (TCP/UDP, IP Routing, DNS). Well-versed with the Linux CLI. In addition to shell scripting (sh/bash), proficient with one other programming language (Python/Go). Hands-on experience with cloud service providers (at least one of GCP, AWS and Azure). Hands-on experience with at least one configuration management software (Terraform/Ansible/Chef/Puppet). Working knowledge of containers and any one container orchestration platform (Kubernetes/Nomad/Mesos/Swarm). Experience with Palo Alto, F5, cloud firewalls, load balancers and security groups, WAF, Akamai and related products and technologies. Understanding and experience in at least one CI/CD pipeline (Jenkins/Travis/CircleCI/Gitlab etc.) 
Working knowledge of any one distributed version control system (git/bzr/hg). Ability to write good technical user documents. Exposure to managing Infrastructure as Code with tools like Terraform/CloudFormation or using Cloud Provider SDKs. Experience with a CDN (e.g. Akamai). Preferred Skills: AWS & GCP, Terraform, Kafka, Git, GitLab, Kubernetes, Docker. Good working knowledge of Istio service mesh. Good working knowledge of Akamai. Experience working with AWS & GCP for VPC configuration, NAT, load balancing, monitoring. Understanding of Kubernetes and networking in a microservice architecture. PaloAlto networks PanOS and Panorama devices, physical and virtual. Infoblox Grid Manager. Minimum Education, Licensure and Professional Certification requirements: BA/BS required, Computer Science or Computer Engineering preferred. Minimum Experience required (number of years necessary to perform role): 5+ years of hands-on DevOps experience required. 2+ years of managing production infrastructure on any cloud. 2+ years of experience developing code, either maintaining scripts or applications. Other/State Specific: This role pays between $123k and $150k, and your actual base pay will depend on your skills, qualifications, responsibilities, experience and location. At Frontdoor, certain roles are eligible for additional rewards and incentives. Speak directly to your recruiter to learn more. Our approach to benefits is holistic and includes health, wellbeing and financial components, including: insurance for medical/pharmacy, dental, vision, life and disability; weight loss and smoking cessation programs; matching 401(k); and ability to participate in our employee stock purchase plan. Job Locations: US. ID: 2025-3879. Category: Engineering. Type: Full Time. Company: AHS American Home Shield Corp
|
|
Site Reliability Engineering Manager
shippo |
United States
|
Here at Shippo, we are the shipping layer of the internet, and we consider ourselves to be one of the core building blocks of e-commerce. Our mission is to make merchants successful through world-class shipping. With our products and solutions, we level the playing field by providing our customers with best-in-class solutions that otherwise wouldn’t be available to them. Through Shippo, e-commerce businesses, marketplaces, platforms and a variety of logistics infrastructure providers are able to connect to shipping carriers around the world from one API and dashboard. We provide our customers with the most competitive shipping rates, print labels, automated international documents, shipment tracking, facilitate the returns process and more. How we’ll deliver success: As the SRE Manager at Shippo, you will lead a team of engineers responsible for building platforms, tooling and infrastructure that enable product teams to operate reliable, performant and scalable services. You will establish frameworks for observability, deployment automation and infrastructure management that allow product teams to own their service reliability. You will maintain a strong support-oriented team while building automation and enabling engineering productivity and operational excellence across the organization. 
➡ Responsibilities ➡ Lead and develop a team of platform-focused SRE engineers, providing technical mentorship, career development and performance management while fostering a culture of automation, self-service and continuous improvement. Build and maintain internal platforms and tooling that enable product teams to deploy, monitor and operate their services reliably. Manage observability platforms (metrics, logs, traces, dashboards) that provide product teams visibility into their services. Own the infrastructure and Kubernetes platform that all Shippo services run on, ensuring it scales ahead of business needs through capacity planning and performance optimization. Establish frameworks and tooling for SLO/SLI definition, error budget tracking and reliability measurement that product teams can adopt. Design and maintain CI/CD pipelines, deployment automation and release tooling that enable safe, frequent deployments. Build infrastructure-as-code foundations and self-service capabilities that allow product teams to provision and manage their infrastructure. Create automation to eliminate toil and prevent infrastructure problems before they impact product teams. Drive infrastructure cost optimization initiatives through analysis, rightsizing recommendations, reserved capacity planning and waste elimination across the cloud platform. Participate in leadership rotation for Sev1 incidents affecting services or the platform itself. Manage the SRE team’s on-call rotation. Design, implement and test disaster recovery capabilities and ensure infrastructure security and compliance. Partner with Engineering Managers and TPMs to understand product team needs, prioritize platform investments and communicate platform roadmap and capabilities. Establish platform SLOs for infrastructure reliability, deployment success rates, build times and other developer experience metrics. 
Requirements ➡ 3+ years of hands-on engineering management experience. 9+ years as a software or systems engineer with deep experience building platforms, tooling or infrastructure. BS or MS degree in Computer Science or equivalent experience. Expert-level experience designing and operating platforms that enable other engineering teams (internal platform-as-a-product experience). Strong operational experience with Kubernetes in production environments, including experience building Kubernetes platforms for application teams. Deep expertise with at least one public cloud provider (AWS, GCP), including networking, compute, storage and managed services. Experience building or maintaining CI/CD systems and deployment automation (GitHub Actions, GitLab CI, ArgoCD, Flux, etc.). Strong background in infrastructure-as-code tools and patterns (Terraform, Pulumi, CloudFormation, etc.). Experience designing and implementing observability platforms (Prometheus, Grafana, ELK stack, Datadog, New Relic, etc.). Proficiency in at least one programming language for tooling and automation (Python, Go or similar). Experience establishing reliability frameworks (SLO/SLI/error budgets) that other teams can adopt. Understanding of developer experience and ability to build self-service tooling that reduces friction. Track record of designing disaster recovery solutions and implementing security and compliance best practices for infrastructure. Exceptional verbal, written and interpersonal communication skills with ability to influence product teams and engineering leadership. Deep understanding of enabling product team success through platform capabilities. What's in the Shippo package? 
➡ Healthcare coverage for medical, dental and vision. Take-as-much-as-you-need vacation policy & flexible working. One week-long company-wide winter shutdown. 3 Volunteer Days Off (VTOs). WFH stipend to set up your home office. Charity donation match up to $100. Dedicated programs, coaching, tools and resources for your professional and career growth, as well as an individual learning stipend for your personal and focused growth. Fun team in-person time through our Shippos Everywhere program, which includes regular team and company off-sites throughout the year as well as local Shippos gatherings. ➡ Our Compensation Shippolicy: We believe compensation is a custom experience and are committed to fair and equitable compensation practices. The standard base pay range for this role is a minimum of $192k to a maximum of $261k annual salary. Since we are focused on hiring Shippos Everywhere, we have 2 US pay ranges: a standard compensation range for the majority of the US and a standard +1 compensation range for those who live in areas where the cost of labor is higher, such as NYC and California. The actual base pay is dependent upon many factors, such as: financial budgets, work experience, training, transferable skills, business needs and market value. The base pay salary ranges are subject to change and may be modified in the future. Total compensation for this role will include equity, medical, dental, vision and other benefits noted in our Shippos “package” section. Sail through the process: Here at Shippo, we celebrate inclusivity and are committed to creating equal access to opportunities for people from all backgrounds, perspectives and geographies. These values define who we are and everything we do. All qualified individuals are encouraged to apply. If you need assistance or a reasonable accommodation during the application and recruiting process, please contact us at accommodations@goshippo.com. Shippos in the wild: Our people, much like the packages we help ship, are all over the world. 
This means through our remote-first program “Shippos Everywhere” our roles can be based anywhere in the US, with the exception of Delaware, Nevada, Ohio, Oregon, Hawaii, New Mexico and West Virginia, and many roles can be based internationally. For locations outside of the US and Ireland, the employment contracts are powered by Remote.com (all Shippo perks still apply - including equity!). What we want to emphasize is that you can be successful at Shippo regardless of location. We leverage AI to review all resumes during the application phase to ensure fairness, comprehensively evaluate each submission and mitigate bias. However, all decisions at every stage of the process are made by a real person.
|
|
Site Reliability & Observability Engineer (Remote)
360training |
|
Why 360training? At 360training, we’re more than just a leader in online training; we’re helping people unlock their potential and shape their futures. For over two decades, we’ve empowered millions of learners with regulatory-approved training across industries, making it possible for individuals to get the jobs they want and keep the careers they love. Our success is built on two simple but powerful values: Deliver Results and Do the Right Thing. They’re not just words on a wall; they guide how we work, collaborate and grow together. At 360training, you’ll join a passionate team that invests in your development, rewards your results and supports you personally and professionally. If you’re looking for a career where you can make an impact, grow quickly and be valued every step of the way, this is your chance. Site Reliability & Observability Engineer: 360training is seeking a Site Reliability & Observability Engineer to build and scale our observability and reliability practices across cloud, container and application environments. This role will be responsible for developing the systems, tools and processes that ensure application performance, reliability and visibility across multiple platforms and brands. The SRE will partner closely with DevOps and Development teams to define service-level objectives (SLOs), establish automated monitoring and alerting, and drive performance optimization across infrastructure and applications. This individual will also play a critical role in incident response, postmortems and the ongoing evolution toward a data-driven reliability culture. Our ideal candidate is a hands-on engineer with experience in application performance monitoring (APM), metrics, tracing and logging, and a strong background in automation and cloud-native observability tooling. Key Responsibilities. Observability Platform Development: Design, implement and manage the enterprise-wide observability stack (APM, metrics, logs and traces) across Azure and containerized workloads. 
Deploy and maintain monitoring tools to ensure full-stack visibility. Build standardized dashboards alerts and KPIs for key services and business applications. Develop and maintain automation for telemetry data collection alert configuration and dashboard provisioning. Ensure coverage for application infrastructure and end-user experience monitoring across all environments. Reliability Engineering Define and maintain Service-Level Objectives (SLOs) Service-Level Indicators (SLIs) and Error Budgets in partnership with DevOps and Development teams. Implement automated incident detection alerting and response playbooks to reduce MTTR. Analyze recurring incidents and drive permanent fixes and reliability improvements. Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages. Performance & Cost Optimization Establish performance baselines and track resource utilization across cloud and container infrastructure. Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations. Monitor and optimize monitoring metrics ingestion Azure Log Analytics and storage costs to balance visibility with efficiency. Incident Management & Postmortems Serve as a key responder during major incidents providing data-driven insights and remediation coordination. Lead root cause analysis (RCA) and ensure postmortem action items are implemented. Build dashboards and analytics to identify leading indicators of failure and performance degradation. Improve operational playbooks to accelerate detection and recovery. Automation & Continuous Improvement Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring. Continuously evaluate emerging observability tools and practices for adoption. Advocate for reliability and monitoring best practices across engineering teams. 
Required Skills: 5+ years of experience in Site Reliability, Observability or DevOps Engineering roles. Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK or equivalent. Deep understanding of metrics, tracing and logging concepts and their correlation across distributed systems. Experience implementing Synthetics and RUM monitoring for frontend performance. Experience defining and managing SLOs, SLIs and Error Budgets. Solid grasp of Azure infrastructure, Kubernetes (AKS) and container monitoring. Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows. Excellent analytical and communication skills, able to translate complex data into actionable insights. Preferred Skills: Understanding of distributed tracing in microservice architectures. Experience with frontend website performance tuning/optimization based on Core Web Vitals. Strong scripting and automation skills (Python, PowerShell or Bash). Experience with incident management and RCA processes.
|
|
Site Reliability Engineer (Remote)
NeonLabs |
Remote United States
|
About the Role NeonLabs is hiring a Site Reliability Engineer (SRE) to build and automate the systems that keep our platform reliable scalable and high-performing. You’ll work across the full stack—from infrastructure to application—designing monitoring and maintaining core systems that support next-generation AI and cloud-driven products. This is a hands-on engineering position where you’ll directly influence reliability and performance at scale. Responsibilities Develop and maintain automation observability and alerting systems Mentor engineers on reliability instrumentation and incident response best practices Lead incident response from triage through post-mortem and remediation Design and run load-testing disaster-recovery and chaos-engineering programs Automate service-level monitoring and capacity planning Partner with product and platform teams to ensure scalability and fault tolerance Required Skills Proven experience as a Site Reliability Engineer or similar role Strong proficiency in Terraform Python and Go Deep understanding of AWS infrastructure and services Experience improving uptime scalability and observability in distributed systems Excellent analytical and problem-solving abilities Nice to Have Familiarity with MySQL MongoDB and Redis Experience with Snowflake or other data-warehouse systems Previous work in a high-growth startup or AI-infrastructure environment Compensation & Terms Salary Range: $160000 – $300000 USD per year Type: Full-time remote independent role Payments: Regular payments via Stripe Connect How to Apply Apply through NeonLabs JobHub at: neonlabshub.com/site-reliability-engineer After applying qualified candidates will be invited to complete a short technical interview on our secure AI-partner platform as part of the hiring process. Job Type: Full-time Pay: $160000.00 - $300000.00 per year Work Location: Remote
|
|
Site Reliability Engineer, SaaS (Remote - US)
Jobgether |
|
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer, SaaS in the United States. As a Site Reliability Engineer, you will be responsible for ensuring the performance, scalability, and reliability of a modern SaaS platform. Working with a global engineering team, you will design, implement, and maintain cloud infrastructure, automate deployments, and improve system observability. This role offers the chance to work with cutting-edge technologies, including containers, serverless frameworks, and public cloud services, while participating in incident response and on-call rotations. You will proactively improve system reliability, collaborate across teams, and contribute to best practices for compliance, security, and operational excellence. Accountabilities: Design, deploy, and maintain scalable and reliable infrastructure solutions on cloud platforms. Automate deployment processes and maintain a resilient, secure SaaS application environment. Support delivery and release pipelines, ensuring smooth operations. Continuously monitor and improve system reliability, performance, and scalability. Develop comprehensive monitoring and alerting solutions for distributed applications. Participate in incident response and on-call rotations for production environments. Ensure compliance with information security and industry standards (ISO, SOX, SSAE, etc.). Define, document, and improve internal standards for system maintainability and style. Requirements: 3+ years of experience in 24x7 production operations for SaaS or cloud service environments. Experience managing cloud infrastructure, including IaaS and PaaS solutions (Microsoft Azure preferred). Strong problem-solving and troubleshooting skills for complex, distributed, multi-tenant systems. Experience with container orchestration and management platforms. System programming skills in languages such as Python, PowerShell, Bash, or Go. Familiarity with CI/CD practices and tools (e.g., Azure DevOps). Knowledge of distributed event-based messaging architectures (Azure Event Hub, Service Bus, Kafka). Proficiency in English for communication with international teams. Bonus: Industry-recognized certifications (AZ-400, AWS DevOps, DCA), experience migrating on-premises products to cloud, AWS experience (ECS, Lambda, S3, RDS), and C#/.NET knowledge. Benefits: Competitive salary: $136,500-$195,000 USD OTE, inclusive of base and variable pay. Unlimited PTO and 3 company-wide closure days for rest and self-care. Paid holidays and Veeam Care Days (volunteering). Medical, dental, and vision coverage starting day one, with multiple plan options. Flexible Spending Accounts (FSA) and Health Savings Account (HSA) options with employer contributions. Life, AD&D, and disability insurance options, plus supplemental voluntary plans. Family planning support, paid parental leave, and employee assistance programs. Professional training, education, and mentoring opportunities, including on-demand learning libraries. Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching. When you apply, your profile goes through our AI-powered screening process, designed to identify top talent efficiently and fairly. 🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements. 📊 It compares your profile to the job's core requirements and past success factors to determine your match score. 🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role. 🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed. The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team. Thank you for your interest!
|
|
Sr. Site Reliability Engineer (SRE) (Remote - Europe)
Jobgether |
|
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer (SRE) in Europe. In this role, you will be responsible for maintaining and improving the reliability, performance, and scalability of critical infrastructure and services. You will work closely with engineering teams, providing technical leadership and operational expertise to ensure seamless deployments for both cloud and on-premises environments. This position requires proactive monitoring, automation, and problem-solving to minimize downtime and optimize system efficiency. You will also support customers directly, troubleshoot complex issues, and contribute to the development of robust operational processes. The environment is fast-paced, collaborative, and focused on delivering high-quality solutions across global teams. Accountabilities: Deploy, monitor, and maintain software for cloud and on-premises customers, ensuring high availability and performance. Respond to incidents promptly, diagnose root causes, and implement effective resolutions to minimize user impact. Develop, document, and improve operational procedures, run-books, and incident response processes. Implement automation scripts and tools to streamline recurring operational tasks and reduce manual workload. Collaborate with cross-functional teams to resolve complex deployment, configuration, and integration challenges. Provide tier 2/3 technical support and guidance to internal teams and external customers. Participate in on-call rotations and ensure smooth handoff between shifts. Track, analyze, and report metrics to continuously improve system reliability and customer experience. Requirements: Bachelor's degree in Computer Science or a related field. 3+ years of experience in Site Reliability Engineering. 2+ years of experience with cloud platforms and automation tools, particularly AWS. Strong knowledge of Kubernetes, Linux, AWS networking (VPC), and Terraform. Experience with GitOps deployment models and distributed version control systems. Familiarity with monitoring and alerting tools (e.g., Prometheus, Grafana). Bazel and Helm experience is a plus. Strong problem-solving skills with the ability to manage multiple tasks in a fast-paced environment. Excellent communication skills to convey technical concepts to both technical and non-technical stakeholders. Customer-oriented mindset with patience, empathy, and professionalism when handling complex issues. Comfortable working across multiple time zones to support a global customer base. Benefits: Competitive salary and total compensation package, including equity options. Fully remote work with flexible hours across Europe. Opportunities for professional growth and technical leadership development. Participation in cross-functional, high-impact projects shaping next-generation infrastructure. Supportive and collaborative work culture emphasizing ownership, trust, and continuous improvement. Access to training, development resources, and team events to foster learning and connection.
|
** job listings updated in real time 🔥