Senior ML Infrastructure Engineer
Toptal is a global network of top freelance talent in business, design, and technology that enables companies to scale their teams on demand. With $200+ million in annual revenue and over 40% year-over-year growth, Toptal is the world’s largest fully remote company.
We take the best elements of virtual teams and combine them with a support structure that encourages innovation, social interaction, and fun. We see no borders, move at a fast pace, and are never afraid to break the mold.
The Data Science team builds unique Machine Learning solutions that give Toptal the decision-making capabilities a data-driven organization needs. Currently, we are looking for a Senior ML Infrastructure Engineer to join this growing team.
By joining our team, you will get the opportunity to use your skills and experience to help us build and monitor our state-of-the-art Data Processing and Machine Learning services. You will be working with a team of highly skilled Data Scientists and Data Engineers from around the world. You will get to use cutting-edge technologies every day, and you will play a major role in choosing the right solutions and in the development of new tools.
We don’t cut corners, and we don’t make compromises; we focus on high-quality solutions that bring value. We are remote-only, have no offices, and fully embrace a flexible work-life balance.
This is a remote position that can be done from anywhere. Due to the remote nature of this role, we are unable to provide visa sponsorship. Resumes and communication must be submitted in English.
You will design, build and maintain infrastructure solutions for all our ML applications. This means you will have the liberty to implement the best solution for our current needs while proactively seeking ways to improve our development and operations processes. You will also identify and evaluate new technologies to improve the performance, maintainability, and reliability of our machine-learning-based systems. As part of this, you will assist in architecting, planning, deploying, configuring and maintaining GCP-based solutions.
We strive to ensure high uptime and reliability for our services, so part of your responsibilities will be to implement and maintain monitoring and alerting, identify and resolve workflow and production issues, troubleshoot incidents, identify root causes, fix and document problems, and implement preventive measures.
Since time to delivery is critical for the team, you will also help design, build and maintain CI/CD pipelines and ensure integration validation for our services.
You will do all this while collaborating with data scientists, data engineers and other engineering teams. You can expect many challenges that will allow you to use your critical thinking skills and rely on the experience of your colleagues to solve problems.
In the first week, expect to:
- Meet the entire Data Science team.
- Get familiar with our best practices guidelines.
- Get access to all our services.
In the first month, expect to:
- Understand our team structure and workflow.
- Set up your working environment.
- Meet our friends from the Data Engineering team.
In the first three months, expect to:
- Understand our ML microservices architecture.
- Start working on one or more projects.
- Propose technical and non-technical solutions and improvements.
In the first six months, expect to:
- Be completely familiar with the workflow and the team.
- Begin taking part in the on-call rotation.
- Be a fully integrated, participating member of all workflows, including planning sessions, reviews, and retrospectives.
- Deliver your first major end-to-end solution.
In the first year, expect to:
- Master our workflows and architecture.
- Act as a representative of the team and an ambassador for our work.
Requirements:
- 3+ years of experience in Python.
- Experience with Cloud Platforms (GCP, AWS, Azure).
- Experience in building and maintaining CI/CD pipelines.
- Experience with Unix-based systems, including bash programming.
- Working knowledge of tools, methods, and concepts of Quality Assurance.
- Experience with SQL scripting.
- Experience with monitoring and alerting.
- Excellent verbal, written and interpersonal communication skills.
- Experience with Pact is a plus.
- Experience with frameworks such as FastAPI, Flask or Django would be a great differentiator.
- Experience building and deploying Web Services at scale is a plus.
- Experience with containerisation (Docker, Kubernetes) is a plus.