Jaspreet Singh

Verified Expert in Engineering

System Administrator and Site Reliability Developer

Location

Toronto, ON, Canada

Toptal Member Since

April 2, 2024

Jaspreet is a high-performance computing (HPC) architect, passionate about designing and optimizing cutting-edge computing systems. With nearly 13 years of industry experience, including roles as a site reliability engineer and lead system administrator, he has honed his skills in developing innovative solutions to complex computational challenges. Jaspreet's most remarkable work has been overseeing one of the largest HPC cluster environments worldwide, with 60+ clusters and 1.6+ million cores.

System Administration, Linux, Data Centers, VMware, GitLab, Ansible, Storage, Troubleshooting, Clustering, User Groups, DNS, Data Migration, Ubuntu, Unix, Automation, Control Systems, Computer Hardware, Digital Electronics

Portfolio

Tenstorrent

Slurm Workload Manager, Ansible, Weka, DevOps, Load Sharing Facility (LSF)...

Qualcomm

High-performance Computing, Slurm Workload Manager, GitLab, Splunk, Licensing...

Xilinx

DNS, NFS, Linux, Load Sharing Facility (LSF), Cisco UCS, HP BladeSystem...

Experience

System Administration - 12 years
Monitoring - 12 years
Data Centers - 12 years
Linux - 12 years
Load Sharing Facility (LSF) - 12 years
Operating Systems - 12 years
Slurm Workload Manager - 7 years
System Architecture - 7 years

Availability

Part-time

Preferred Environment

Linux, Slurm Workload Manager, GitLab, Ansible, Data Centers, VMware, CentOS, Ubuntu, Monitoring, Unix

The most amazing...

...work I've done involved managing one of the biggest HPC cluster environments in the world, with 60+ global clusters and 1.6+ million cores.

Work Experience

Site Reliability Engineer

2023 - PRESENT

Tenstorrent

Administered Load Sharing Facility (LSF) and Slurm clusters. Installed and configured HPC clusters and handled resource management, cluster monitoring and maintenance, user support, troubleshooting, scaling, and expansion.
Managed Weka storage, covering installation, configuration, and cluster management. Optimized this high-performance, distributed file system for large-scale data analytics and storage workloads.
Automated server and HPC cluster deployments using Ansible. Created, modified, and tested Ansible playbooks and roles to set up configurations and automate large-scale deployments.

Technologies: Slurm Workload Manager, Ansible, Weka, DevOps, Load Sharing Facility (LSF), GitLab, Systems Monitoring, System Administration, Monitoring, Shell Scripting, Bash Script, System Architecture, Operating Systems, VMware

HPC Architect

2017 - 2023

Qualcomm

Oversaw availability, latency, performance, efficiency, change management, monitoring, and emergency response for 60+ global HPC clusters with 1.6+ million cores.
Served as technical lead for an AI/ML project that predicted and adjusted each job's memory, CPU, and runtime resource limits before dispatch, resulting in effective resource utilization.
Contributed to a consolidation project, decommissioning 15+ obsolete or less relevant clusters to gain operational efficiencies, cost control, and better alignment with business requirements.

Technologies: High-performance Computing, Slurm Workload Manager, GitLab, Splunk, Licensing, Architecture, Linux, Ansible, Load Sharing Facility (LSF), Systems Monitoring, System Administration, Monitoring, Shell Scripting, Bash Script, System Architecture, Operating Systems

Lead Systems Administrator

2017 - 2017

Xilinx

Managed large (1,000+ nodes) compute clusters using LSF or similar job schedulers.
Contributed as technical lead for Cisco UCS infrastructure deployment and operations.
Handled system installation and configuration, HPC, and security, including installing third-party software in large-scale environments.

Technologies: DNS, NFS, Linux, Load Sharing Facility (LSF), Cisco UCS, HP BladeSystem, Dell PowerEdge Servers, Nagios, System Administration, Shell Scripting, Bash Script, Operating Systems, VMware

Cloud Data Center Lead | Hardware Engineer

2016 - 2017

Amazon Web Services (AWS)

Helped build the world's largest cloud infrastructure. Served as the escalation point and technical troubleshooter for all systems and network hardware problems.
Installed and configured racks of hosts in line with internal service-level agreements.
Triaged and resolved trouble tickets for all devices in the region. Served as the data center point of contact for all high-severity issues.

Technologies: Data Center Management, Data Center Infrastructure, Amazon Web Services (AWS), Servers, Hardware, Linux, Systems Monitoring, System Administration, Monitoring, Operating Systems

Senior Systems Administrator

2011 - 2016

Xilinx

Provided advanced Linux (Red Hat Enterprise Linux, CentOS, Ubuntu, SUSE) troubleshooting and technical support to the engineering organization.
Served as the data center operations lead, minimizing server downtime and managing data center power, cooling, and rack space.
Installed, configured, and managed the Cisco UCS, Dell, and HP server hardware for the organization.

Technologies: Linux, Data Centers, Load Sharing Facility (LSF), Systems Monitoring, System Administration, Monitoring, Operating Systems, VMware

Experience

HPC Cluster Patching and Upgrade

The global HPC cluster patching and upgrade initiative aimed to enhance the security, performance, and reliability of 50+ HPC clusters deployed across multiple geographic regions.

The project involved systematically patching and upgrading the software stack, including the operating system, middleware, and HPC-specific applications, to ensure optimal performance and compliance with industry standards.

AI/ML Job Prediction

As the technical lead on this AI/ML job prediction project, I enabled the prediction and adjustment of each job's memory, CPU, and runtime resource limits before dispatch, which resulted in effective resource utilization.

Cluster Consolidation

The cluster consolidation project decommissioned 15+ obsolete or less relevant clusters, gaining operational efficiencies, controlling costs, and aligning the environment better with business requirements.

I designed the architecture of the consolidated clusters based on the assessed requirements, considering factors such as workload distribution, fault tolerance, scalability, and resource allocation. Next, I deployed the new cluster infrastructure according to the designed architecture, covering hardware setup, software installation, configuration, and testing.

Skills

Platforms

Linux, CentOS, Ubuntu, Unix, Amazon Web Services (AWS)

Storage

Data Centers, HP BladeSystem

Other

Slurm Workload Manager, System Administration, Load Sharing Facility (LSF), Operating Systems, Monitoring, Systems Monitoring, System Architecture, Power Management Systems, Communication, Control Systems, Microelectronics, Digital Circuits, Installation, Server Configuration, Storage, Networking, IT Security, Scripting, Virtualization, Containers, Troubleshooting, Clustering, Performance Tuning, Server Management, File Systems, User Groups, Process Management, Backup & Recovery, Shell Scripting, Licensing, Architecture, Data Center Management, Data Center Infrastructure, Servers, Hardware, Project Planning, DNS, NFS, Cisco UCS, Dell PowerEdge Servers, IT Project Management, Data Migration, Backups, Risk Management, Contingency Plans, Digital Electronics

Tools

GitLab, Ansible, VMware, Splunk, Weka, Nagios

Languages

Bash Script

Libraries/APIs

Microsoft HPC

Paradigms

Automation, High-performance Computing, Scrum, Agile, DevOps

Industry Expertise

Network Security

Education

2007 - 2011

Bachelor's Degree in Electrical Engineering

Maharshi Dayanand University - Haryana, India

Certifications

MAY 2013 - PRESENT

Red Hat Certified System Administrator (RHCSA)

Red Hat

MAY 2013 - PRESENT

Red Hat Certified Engineer (RHCE)

Red Hat
