Jaspreet Singh
Verified Expert in Engineering
System Administrator and Site Reliability Developer
Toronto, ON, Canada
Toptal member since April 2, 2024
Jaspreet is a high-performance computing (HPC) architect, passionate about designing and optimizing cutting-edge computing systems. With nearly 13 years of industry experience, including roles as a site reliability engineer and lead system administrator, he has honed his skills in developing innovative solutions to complex computational challenges. Jaspreet's most remarkable work has been overseeing one of the largest HPC cluster environments worldwide, with 60+ clusters and 1.6+ million cores.
Preferred Environment
Linux, Slurm Workload Manager, GitLab, Ansible, Data Centers, VMware, CentOS, Ubuntu, Monitoring, Unix
The most amazing work I've done involved managing one of the biggest HPC cluster environments in the world, with 60+ global clusters and 1.6+ million cores.
Work Experience
Site Reliability Engineer
Tenstorrent
- Administered Load Sharing Facility (LSF) and Slurm clusters. Installed and configured HPC clusters and handled resource management, cluster monitoring and maintenance, user support, troubleshooting, scaling, and expansion.
- Managed Weka storage, covering installation, configuration, and cluster management. Optimized the high-performance distributed file system for large-scale data analytics and storage workloads.
- Automated server and HPC cluster deployments using Ansible. Created, modified, and tested Ansible playbooks and roles for setting up configurations and automating large-scale deployments.
HPC Architect
Qualcomm
- Oversaw availability, latency, performance, efficiency, change management, monitoring, and emergency response for 60+ global HPC clusters with 1.6+ million cores.
- Served as technical lead for an AI/ML project that predicted and adjusted each job's memory, CPU, and runtime limits before dispatch, resulting in more effective resource utilization.
- Contributed to a consolidation project, decommissioning 15+ obsolete or less relevant clusters to gain operational efficiencies, cost control, and better alignment with business requirements.
Lead Systems Administrator
Xilinx
- Managed large (1,000+ nodes) compute clusters using LSF or similar job schedulers.
- Contributed as technical lead for Cisco UCS infrastructure deployment and operations.
- Handled system installation, configuration, and security for HPC environments, including installing third-party software at large scale.
Cloud Data Center Lead | Hardware Engineer
Amazon Web Services (AWS)
- Helped build the world's largest cloud infrastructure. Served as the escalation point and technical troubleshooter for all systems and network hardware problems.
- Installed and configured racks of hosts in line with internal service-level agreements.
- Triaged and resolved trouble tickets for all devices in the region. Served as the data center point of contact for all high-severity issues.
Senior Systems Administrator
Xilinx
- Provided advanced Linux troubleshooting and technical support (Red Hat Enterprise Linux, CentOS, Ubuntu, SUSE) to the engineering organization.
- Served as the data center operations lead, minimizing server downtime and managing data center power, cooling, and rack space.
- Installed, configured, and managed the Cisco UCS, Dell, and HP server hardware for the organization.
Experience
HPC Cluster Patching and Upgrade
The project involved systematically patching and upgrading the software stack, including the operating system, middleware, and HPC-specific applications, to ensure optimal performance and compliance with industry standards.
AI/ML Job Prediction
Cluster Consolidation
I designed the architecture of the consolidated clusters based on the assessed requirements, considering factors such as workload distribution, fault tolerance, scalability, and resource allocation. Next, I deployed the new cluster infrastructure according to the designed architecture, covering hardware setup, software installation, configuration, and testing.
Education
Bachelor's Degree in Electrical Engineering
Maharshi Dayanand University - Haryana, India
Certifications
Red Hat Certified System Administrator (RHCSA)
Red Hat
Red Hat Certified Engineer (RHCE)
Red Hat
Skills
Libraries/APIs
Microsoft HPC
Tools
GitLab, Ansible, VMware, Splunk, Weka, Nagios
Platforms
Linux, CentOS, Ubuntu, Unix, Amazon Web Services (AWS)
Storage
Data Centers, HP BladeSystem
Languages
Bash
Paradigms
Automation, High-performance Computing (HPC), Scrum, Agile, DevOps
Industry Expertise
Network Security
Other
Slurm Workload Manager, System Administration, Load Sharing Facility (LSF), Operating Systems, Monitoring, Systems Monitoring, System Architecture, Power Management Systems, Communication, Control Systems, Microelectronics, Digital Circuits, Installation, Server Configuration, Storage, Networking, IT Security, Scripting, Virtualization, Containers, Troubleshooting, Clustering, Performance Tuning, Server Management, File Systems, User Groups, Process Management, Backup & Recovery, Shell Scripting, Licensing, Architecture, Data Center Management, Data Center Infrastructure, Servers, Hardware, Project Planning, DNS, NFS, Cisco UCS, Dell PowerEdge Servers, IT Project Management, Data Migration, Backups, Risk Management, Contingency Plans, Digital Electronics