Site Reliability Engineer
Lawrence Berkeley National Lab’s (LBNL) NERSC Division has an opening for a Site Reliability Engineer to join the team.
The National Energy Research Scientific Computing Center (NERSC) is inviting applications for the position of Site Reliability Engineer. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the DOE Office of Science programs. NERSC provides critical HPC and data systems and support for NERSC’s 10,000 users researching alternative energy sources, climate science, energy efficiency, environmental science, and other DOE mission areas. As a Site Reliability Engineer in the Operations Group, you will be a member of a 24x7 team that uses our state-of-the-art OMNI data collection and monitoring system to help ensure that NERSC is accessible, reliable, secure, and available to our scientific users.
What You Will Do at Level 2:
- Work 5 shifts per week to monitor the NERSC HPC Facility, which includes 2 - 3 OWL (midnight - 8am) shifts. Some days may be onsite, some may be offsite. The schedule will be determined by staffing needs.
- Review and respond to alerts from computer systems, storage, network, and other data center/facility related systems by triaging or calling appropriate on-call staff.
- Create solutions that improve processes, prevent issue recurrence, and automate the response to all routine service conditions.
- Identify issues and propose solutions that will improve the ability to monitor or provide better automation for monitoring or triage.
- Respond to alerts from OMNI to ensure that the system continues to collect data 24x7 to provide real time information for diagnoses.
- Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team.
- Create new software programs to provide alerts and notifications from the HPC system APIs and into the monitoring pipeline.
- Create new software configurations and solve technical issues so that programs can scale to denser data and deliver reliably at scale.
- Collaborate with other groups at NERSC to ensure that communication and workflows are clearly understood. Assign technical tasks to other Operations monitoring team members to ensure that the system is being monitored according to agreed upon standards.
- Work closely with other NERSC groups to coordinate center-wide maintenance activities and to manage diagnostic and notification software during maintenance.
- Provide accurate information in the trouble ticketing system for outages, maintenance updates, and other incidents such that the workflow and protocols can be appropriately tracked by others.
- Work on and resolve problems of diverse scope where analysis of data requires evaluation of identifiable factors.
In Addition to Above, What You Will Do at Level 3:
- Provide leadership in developing OMNI monitoring and alerting pipelines for all aspects of the data center, documentation, and software development.
- Contribute to the design and deployment of the OMNI cluster.
- Work closely with other groups and OMNI to help build a better monitoring experience.
- Work on and resolve complex issues where analysis of situations or data requires an in-depth evaluation of variable factors.
- Determine methods and procedures on new assignments; may coordinate activities of other personnel.
What is Required at Level 2:
- Typically requires a minimum of 5 years of related experience with a Bachelor’s degree; or 3 years and a Master’s degree; or equivalent work experience.
- Strong hands-on knowledge of the Linux shell and of working in a command-line (e.g., SSH) environment.
- Experience developing tools in a programming language such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.
- Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications.
- Motivated self-starter who can learn technologies that improve data center management in areas like Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.
- Strong communication skills and ability to work effectively across multiple technical teams.
- Experience working in a 24/7 onsite team managing large data centers or other large installations.
- Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
- Understanding of networks and network protocols.
- A certification in a system administration area (platforms or software), or other advanced education in computing science.
In Addition to Above, What is Required at Level 3:
- Typically requires a minimum of 8 years of related experience with a Bachelor’s degree; or 6 years and a Master’s degree; or equivalent experience.
- Expertise in a programming language such as C, C++, Perl, Java, or Python.
- Demonstrated excellence in any of the tools mentioned in this listing.
- Experience leading technical projects.
- The ability to respond proactively to problems and issues.
Notes:
- This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
- Shift: Owl shift 12AM to 8AM (on-site).
- Level 2: The full salary range of this position is $109,152 to $184,200 per year, and it is expected to pay within a targeted range of $122,784 to $150,096 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.
- Level 3: The full salary range of this position is $129,948 to $219,276 per year, and it is expected to pay within a targeted range of $146,184 to $178,668 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.
- This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
- This position requires substantial on-site presence, but hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.
Want to learn more about working at Berkeley Lab? Please visit: careers.lbl.gov
Equal Employment Opportunity Employer: The foundation of Berkeley Lab is our Stewardship Values: Team Science, Service, Trust, Innovation, and Respect; and we strive to build community with these shared values and commitments. Berkeley Lab is an Equal Opportunity and Affirmative Action Employer. We heartily welcome applications from all who could contribute to the Lab's mission of leading scientific discovery, inclusion, and professionalism. In support of our rich global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status.
Misconduct Disclosure Requirement: As a condition of employment, the finalist will be required to disclose if they are subject to any final administrative or judicial decisions within the last seven years determining that they committed any misconduct, are currently being investigated for misconduct, left a position during an investigation for alleged misconduct, or have filed an appeal with a previous employer.