Site Reliability Engineer (SRE) Lead
MindBeacon
- Markham, ON
- Permanent
- Full-time
- Azure Infrastructure Design: Analyze and optimize the design and architecture of the Azure-based Enterprise Data and Analytics platform to meet the organization's performance, scalability, security, and cost-efficiency requirements.
- Reliability Engineering: Implement SRE principles to improve the reliability and availability of services by designing automated monitoring, alerting, and incident response systems.
- Infrastructure as Code (IaC): Use IaC tools (e.g., Terraform, ARM templates, YAML, Shell) to automate the provisioning and management of Azure resources.
- Performance Optimization: Identify and address performance bottlenecks within the Azure environment through monitoring, analysis, and tuning of infrastructure components.
- High Availability and Disaster Recovery: Design and implement solutions for high availability and disaster recovery across Azure regions and availability zones.
- Automation: Develop and maintain automation scripts and tools to streamline deployment, scaling, and management of Azure resources. Build and manage DevSecOps pipeline automation using Azure DevOps, GitHub, etc.
- Collaboration: Collaborate with development, operations, and security teams to ensure smooth deployment and operation of applications on Azure infrastructure.
- Incident Response: Participate in on-call rotations, respond to incidents, diagnose and resolve issues promptly, and conduct post-incident reviews.
- Bachelor's degree in Computer Science, Information Technology, or related field. Master's degree is an asset.
- 5+ years of Cloud SRE experience, preferably in Microsoft Azure.
- Professional certifications such as Microsoft Certified: Azure Solutions Architect Expert, Microsoft Certified: Azure DevOps Engineer Expert, or relevant SRE certifications.
- Extensive experience designing, implementing, and managing Azure-based solutions in a production environment.
- Strong background in Infrastructure as Code (IaC) practices and tools, such as ARM and Terraform.
- Proficiency in scripting and automation using languages like PowerShell, Python, or Bash.
- Hands-on experience with Databricks clusters, Azure Data Factory (ADF), Azure Data Lake, and Apache Spark (UI and command line), including the ability to analyze, debug, and deliver insights from driver logs and executor logs.
- Deep understanding of SRE methodologies, including monitoring, alerting, incident management, and capacity planning.
- Knowledge of Cloud Security capabilities and frameworks.
- Knowledge of network architecture, security best practices, and compliance standards within the Azure ecosystem.
- Excellent problem-solving skills and the ability to troubleshoot complex technical issues efficiently.
- Strong communication skills and the ability to work collaboratively in cross-functional teams.
- Prior experience in mentoring or leading junior SREs or engineers is a plus.
- Experience collaborating with other developers, product managers, and stakeholders.
- Application development experience and background.
- Familiarity with common stacks, such as MEAN, MERN, LAMP, etc.
- Knowledge of multiple programming languages and front-end libraries, such as Python, Go, HTML, CSS, JavaScript, jQuery, React, Angular, etc.
- Knowledge of multiple database technologies, such as MySQL, MongoDB, PostgreSQL, etc.
- Knowledge of web servers, such as Apache, Nginx, etc.
- Knowledge of web development tools, such as Git, Webpack, Babel, etc.
- Knowledge of web development best practices, such as Agile methodologies, RESTful APIs, etc.
- Knowledge of Docker, Kubernetes, and Helm
- Ability to work independently and with a team
- Ability to learn new technologies quickly
- Ability to solve complex problems creatively
- Attention to detail and quality