
Faisal Malik Widya Prasetya
Verified Expert in Engineering
Data Engineer and Developer
Sleman Sub-District, Sleman Regency, Special Region of Yogyakarta, Indonesia
Toptal member since April 25, 2022
Faisal is a data engineer specializing in cloud data technologies like Google and AWS and end-to-end data engineering processes. From designing the architecture and building the infrastructure to developing pipeline operations, he is highly adaptable to new cloud-based, open source, or SaaS technologies. Faisal has solid experience contributing to early-stage startups by directly building end-to-end data pipelines or providing consulting services in his fields of expertise.
Portfolio
Experience
- Python - 7 years
- Amazon Web Services (AWS) - 6 years
- Google Cloud Platform (GCP) - 6 years
- PySpark - 5 years
- Apache Airflow - 4 years
- BigQuery - 4 years
- AWS Lambda - 3 years
- Data Warehousing - 3 years
Preferred Environment
Visual Studio Code (VS Code), Conda, Linux, Docker, Docker Compose, Google Cloud Platform (GCP), Amazon Web Services (AWS), Jira, OpenAI, Mentorship
The most amazing...
...project I've ever done was implementing a cost optimization strategy on the client data warehouse, reducing BI usage costs up to 100 times.
Work Experience
Senior AI Automation Engineer
Aurion Holdings Ltd
- Designed and built a production-grade algorithmic trading platform with over 50 specialized agents, 28 subsystems, and 20 microservices. The system trades live on a £140,000 FTMO prop account with full autonomy.
- Built a fail-closed governance framework with 10 canonical laws by implementing 10 enforceable laws, 12 sequential validation gates, cryptographic envelope signing (HMAC-SHA256), and separation of duties.
- Designed an immutable forensic audit pipeline by building a hash-chain verified, append-only audit trail across 500+ sealed files with tamper detection, 6-hourly automated audits, and a bypass detector.
- Achieved enterprise-grade security hardening by implementing AES-256-GCM at-rest encryption with Windows DPAPI key binding, loopback-only network binding, file integrity monitoring, and automated secret scanning in CI/CD.
Senior MLOps Engineer
Walleye Capital
- Migrated all async calls implementation to the OpenAI API to the OpenAI Batch API in the Kubeflow Pipeline. This migration reduced OpenAI token consumption costs by 50% and eliminated all parallel invocation instances from Kubeflow components.
- Standardize the pipeline using multiple reusable Kubeflow components, allowing all team members to productionize their Pipelines seamlessly.
- Set up CI/CD Pipeline to test and deploy from GitHub to the GCP ecosystem.
- Set up GitHub Actions Workflow Dispatch to submit and schedule Vertex AI Pipelines.
- Provisioned and managed GCP infrastructure using IaC tools like Pulumi.
- Refactored experimental codes from data scientists to meet production standards and deployment.
- Integrated different services and components built by other team members so the system can run smoothly and efficiently.
Back-end Data Engineer
Hivello Operations B.V.
- Designed and implemented log centralization from end-user applications to Cloud Logging, which was routed for analytics use case using Kubeflow and Spark to handle the orchestration and the transformation.
- Developed a blockchain indexing framework using a subgraph to index multiple DePINs' on-chain earnings data. This allows the data to be accessed using GraphQL and analytics purposes.
- Enabled self-serve analytics throughout the company by provisioning Metabase and connecting BigQuery as a data source with business-friendly data models. This makes the company data-driven.
- Designed and implemented a Go-based event-driven microservice back-end system deployed on Cloud Run to handle streaming events from Pub/Sub containing real-time device logs.
Lead Data Engineer
Quadrant
- Led a cross-functional data engineering team of five to design, build, and maintain robust, high-availability data pipelines.
- Integrated and scaled the Databricks platform to orchestrate data preparation and delivery operations, seamlessly processing hundreds of terabytes of data daily.
- Optimized overarching data pipelines by meticulously reconfiguring Apache Spark jobs and cluster environments, achieving significant cost reductions while meeting stringent performance requirements.
- Instituted comprehensive data quality validation checks using Great Expectations, deep-integrated with Spark DataFrames, enabling automated anomaly detection on massive datasets post-processing.
- Architected and operated real-time event ingestion pipelines using Amazon Kinesis and Amazon MSK (Kafka) to process consented mobile location events directly from the company's mobile SDK.
Senior Software Engineer
Pathbox AI Inc.
- Developed a serverless REST API on AWS Lambda, API Gateway, Cognito Aurora Serverless, and DynamoDB using the Express.js web framework.
- Built an optimized machine learning inference system using ECS tasks on Fargate. Allowed parallelization by utilizing SQS and concurrent ECS tasks.
- Developed GPU-enabled machine learning inference using AWS Batch.
- Standardized the machine learning workflow from dataset collection, data preprocessing, model setup, training, and validation to inference deployment.
- Implemented analytics on workflow operations to monitor and optimize the process further.
Web Scraping Expert
Burak Karakaya
- Developed a real-time web scraper to scrape data from various sources, such as Twitter, Binance Futures Leaderboard, etc., to feed data to the client's trading bot. The scraper can ingest tweets within 200 ms after it is published.
- Provided the infrastructure on AWS to enable a high-performance network to enable the scraper to work in real time. I set up the IP rotation so that the scraper didn't get blocked by bypassing the IP rate limit from the news sources.
- Provided an interface for non-technical clients to administer and operate the scraper conveniently. I use Streamlit and FastAPI to develop these interfaces.
- Utilized Redis and high-performance Python extensions like C to improve the storage and runtime performance of the scraper.
Data Engineer
XpressLane, Inc.
- Developed scraping tools to scrape data from various websites and push it to BigQuery.
- Created development and operations documentation so that the client could maintain the solution and can develop more features on it in the future.
- Delivered reports and dashboards to clients from the scraped data to help clients better make decisions for M&A use cases.
Senior Data Engineer
Toptal
- Designed and implemented a robust data pipeline that extracted data from multiple marketing tools and APIs like Google Ads, Facebook Ads, and Twitter Ads, and transferred it to BigQuery using in-house data pipeline tools based on Luigi.
- Created a data pipeline solution that efficiently extracted data from various learning platforms such as Polly, Udemy, and Lessonly and consolidated it with BigQuery utilizing Composer, a managed Apache Airflow service provided by GCP.
- Participated in the data engineering team split brainstorming session and came up with the idea of breaking the team into the data platform and analytics engineering teams. The analytics engineering team focuses on ETL logic, while the data platform team maintains the infrastructure.
Data Engineer
QuantumBlack
- Developed internal data analytics tools that can simplify deployment on the client site. The feature I built is to ingest data from various sources and store them incrementally on Snowflake.
- Handled a client request to build a data analytics pipeline and APIs.
- Worked closely with clients' analytics teams and leadership to gather analytics requirements and carefully plan from the architecture design, to implementation and delivery.
Data Engineering Course Mentor
MentorCruise
- Motivate mentees to engage with the Data Engineering field by showing the job market supply and demand conditions, including the prospects.
- Sharing industrial knowledge and insights about the field.
- Guide the mentee to focus on which subject to prioritize and focus on.
Senior Data Engineer
Flip
- Built a data analytics ecosystem using native Google Cloud Platform technologies, such as Datastream, Google Cloud Storage, Pub/Sub, Dataflow, and BigQuery.
- Improved the analytics waiting time from a 3-hour worst-case scenario to 30 seconds for one big report.
- Maintained the legacy technologies for data analytics on MySQL and on-server cron jobs by creating scheduled jobs on a heavy but frequently used query. The heavy query was accessible in less than 30 minutes with daily data freshness.
- Built the data engineering team and onboarded team members on the legacy, current, and future implementation.
Data Engineer
Pintu
- Developed an ELT data pipeline on Amazon EC2. It is turned on and off by AWS Lambda, triggered by using CloudWatch scheduler from various data sources (MySQL, PostgreSQL, MongoDB, Google Sheets, crypto exchange APIs) to the BigQuery data warehouse.
- Implemented partition, clustering, and materialized views on BigQuery and reduced the cost of analytics by up to 100 times.
- Collaborated with the financial expert to generate the optimum market-making strategy. Implemented and improved the model on the published paper, increasing the liquidity and market activity of the owned asset by 67%.
- Developed a fraud detection system to alert fraudulent activity in case of a security breach on the system. This alert notifies the executive team and captures the fraudster within four hours. It secured $2 million worth of assets.
- Trained the business users to develop their own BI reporting using Metabase and Google Data Studio. It led to 70% of Metabase reports being created by the business team, while the other 30% required complex queries.
- Led the data analytics team and implemented an agile culture by running sprint planning, standup, and sprint retrospective meetings. It allowed tracking business user requests, data pipeline issues, and improvements.
Data Engineer
Kulina
- Developed ELT processes from application databases, third-party marketing tools, and Google Sheets to BigQuery using Stitch data, which reduced the number of query conflicts on the production database, indirectly improving application performance.
- Developed the Snowflake schema on the data warehouse, increasing data visibility among the business team.
- Deployed, maintained, and administered several BI tools, such as Redash, Data Studio, and Metabase, to gain data governance at the business unit level and answer data-related questions with proper tools.
Experience
NASA API Python Wrapper
https://pypi.org/project/python-nasa/Scalable Web Scraper
Then for the transformation, we use PySpark deployed on Dataproc. We manifest Serverless Spark Dataproc to make our transformation pipeline cost-effective. We use GCS as the data lake, so all data ingested from the website will reside in GCS and the transformation output. The clean data will then be stored in BigQuery using the BigQuery load job, also orchestrated on Airflow. When the data arrives on BigQuery, the stakeholder dashboard will automatically be updated with the recent data. We also set up a rotating proxy to avoid getting caught as a bot.
Data Pipeline on GCP
Serverless Chi Boilerplate
https://github.com/serverless-boilerplate/serverless-chiEducation
Bachelor's Degree in Computer Science
Gadjah Mada University - Yogyakarta, Indonesia
Certifications
Infrastructure Automation with Terraform Cloud
Udemy
Google Cloud Professional Data Engineer
Udemy
Skills
Libraries/APIs
PySpark, Pandas, Asyncio, Python API, REST APIs, NumPy, Shapely, Scikit-learn, Node.js, OpenAPI, Amazon API, SQLAlchemy, Telegram Bot API, PyTorch, API Development, Spark ML, OpenCV, X (formerly Twitter) API, SciPy, TensorFlow, Interactive Brokers API, Stripe API, Pydantic, Playwright, Stripe, Luigi, Snowpark, OpenAI API, WinAPI
Tools
BigQuery, Apache Airflow, GitHub, Terraform, AWS Glue, Microsoft Power BI, Tableau, Amazon Elastic MapReduce (EMR), Amazon QuickSight, AWS Step Functions, MySQL Performance Tuning, Amazon ElastiCache, Amazon Simple Notification Service (SNS), Git, Jupyter, Pytest, Kibana, Cloud Dataflow, Apache Beam, Celery, RabbitMQ, Amazon Simple Queue Service (SQS), Amazon Elastic Container Service (ECS), AWS CloudFormation, Logging, AWS IAM, Docker Compose, Redash, Amazon CloudWatch, Amazon Athena, Amazon Redshift Spectrum, Looker, Amazon EKS, Google Analytics, Amazon Cognito, GIS, GRASS GIS, PhpStorm, Navicat, MongoDB Atlas, Amazon SageMaker, Observability Tools, Grafana, AWS Fargate, Claude, Amazon Kinesis Data Firehose, Claude Code, Stitch Data, Jira, Domo, Google Cloud Dataproc, AWS Batch, Retool
Languages
Python, SQL, Snowflake, JavaScript, HTML, Python 3, Transact-SQL (T-SQL), Stored Procedure, Go, TypeScript, Python Script, Rust, GraphQL, CSS, PHP, R, Scala
Frameworks
Django, Swagger, Flask, Hadoop, Scrapy, Apache Spark, Data Lakehouse, Jinja, Fastify, Serverless Framework, Spark, Flutter, CodeIgniter, NestJS, Express.js, OAuth 2, Selenium, Streamlit, Kedro
Paradigms
Business Intelligence (BI), ETL, MapReduce, Stress Testing, REST, Data-driven Design, Design Patterns, Microservices, Microservices Architecture, Database Design, Domain-driven Development, Back-end Architecture, Serverless Architecture, Unit Testing, Event-driven Architecture, API Architecture, Real-time Systems, Kanban, Agile Project Management, DevOps, Agile, Load Testing, API Observability, High-performance Computing (HPC), Object-oriented Design (OOD), Object-oriented Programming (OOP), Distributed Computing, Dimensional Modeling, HIPAA Compliance
Platforms
Visual Studio Code (VS Code), Linux, Docker, Google Cloud Platform (GCP), Amazon Web Services (AWS), AWS Lambda, AWS Elastic Beanstalk, SharePoint, Jupyter Notebook, Kubernetes, Amazon EC2, Oracle Database, Azure, Apache Kafka, Oracle, Databricks, Firebase, Azure Synapse, Azure SQL Data Warehouse, Dedicated SQL Pool (formerly SQL DW), MetaTrader, MetaTrader 5, Blockchain, Kubeflow, Vertex AI, Cloud Run
Storage
Amazon S3 (AWS S3), MySQL, PostgreSQL, MongoDB, Google Cloud Storage, Microsoft SQL Server, NoSQL, Data Lakes, Database Migration, Amazon Aurora, Data Pipelines, Redis, Elasticsearch, Databases, Amazon DynamoDB, Database Modeling, Data Integration, PL/SQL, Data Lake Design, Database Architecture, ClickHouse, DB, RDBMS, PostGIS, Relational Databases, Database Administration (DBA), Redshift, Neo4j, Dynamic SQL, Alibaba Cloud, Google Cloud, IIS SQL Server, Cloud Firestore
Industry Expertise
Bioinformatics
Other
Conda, Machine Learning, Google BigQuery, Data Engineering, Data Modeling, Data Migration, ETL Tools, Data Analytics, Data Analysis, Data Architecture, Data Management, Amazon RDS, CDC, Data Build Tool (dbt), Cloud Migration, ELT, Big Data Architecture, Architecture, Big Data, Project Planning, Web Scraping, Scraping, Data Wrangling, Azure Databricks, APIs, Excel 365, Dashboards, Data Manipulation, Shell Scripting, Benchmarking, Performance, Performance Testing, Caching, Data Reporting, Software Architecture, Back-end, Artificial Intelligence (AI), Data Scraping, PDF Scraping, Scalability, Algorithms, Data Structures, Software Development, Optimization, Cloud, eCommerce, Excel Macros, Automated Trading Software, SaaS, GeoPandas, API Integration, Natural Language Processing (NLP), Serverless, Lint, Consumer Packaged Goods (CPG), Back-end Development, FastAPI, Extensions, Data, Streaming Data, Data Governance, Orchestration, Solution Architecture, Technical Architecture, Monitoring, Multithreading, Entity Relationships, Software Design, Workflow, API Design, AWS Cloud Architecture, Performance Tuning, Amazon API Gateway, SSH, AWS DevOps, OpenTelemetry, OAuth, Data Encoding, Distributed Systems, Fintech, Time Series Databases, Query Optimization, Containerization, Production Support, Star Schema, Crypto, WebSockets, Cloud Infrastructure, Infrastructure as Code (IaC), Machine Learning Operations (MLOps), Computer Vision, Technical Leadership, RESTFul APIs, Directed Acrylic Graphs (DAG), Distributed Architecture, Data Cleaning, ETL Pipelines, Data Cleansing, Generative Artificial Intelligence (GenAI), Scripting, System Design, Delta Lake, Medallion Architecture, Integration, Quantitative Finance, GeoJSON, Geospatial Data, Spatial Data Infrastructure, Amazon Redshift, Latency & Throughput Analysis, Agentic Coding, DataOps, Financial Data, Observability, Third-party APIs, Finance, Leadership, Financial APIs, Cryptography, Research, Data Warehousing, Data Visualization, Metabase, Google Data Studio, CI/CD Pipelines, GitHub Actions, Scripting Languages, Data-driven Dashboards, Azure Data Factory (ADF), Technical Project Management, Azure Data Lake, Data Science, Business Analysis, Tesseract, QGIS, OpenAI GPT-3 API, Neural Networks, eCommerce APIs, Generative Pre-trained Transformers (GPT), LangChain, SharePoint Online, Data Auditing, Business Architecture, Enterprise Architecture, Mathematics, Web Scalability, Prometheus, SDKs, Data Stewardship, TypeORM, Nonlinear Optimization, Linear Optimization, TimescaleDB, AI Chatbots, POS, CAPTCHA, Reverse Engineering, Amazon Glacier, Message Queues, Large Language Models (LLMs), AI Agents, GraphDB, Recommendation Systems, Retrieval-augmented Generation (RAG), Pinecone, Vector Search, HIPAA, Security, AI Tools, Compliance, Mentorship, Stock Trading, Stock Price Analysis, Trading, Amazon EventBridge, Agentic AI, Amazon Kinesis, Real-time Processing, API Gateways, Load Balancers, Proxy Servers, Telemetry, HAProxy, AI-assisted Development, Bots, Quantitative Modeling, Finance APIs, Healthcare Services, Amazon Neptune, Dataproc, Credit Modeling, OpenAI, Pulumi, Word Embedding, UI Automation, Robotic Process Automation (RPA), Cloudflare, AI Automation, Mentorship & Coaching, Team Mentoring, Amazon MSK, Parquet, Geohash
How to Work with Toptal
Toptal matches you directly with global industry experts from our network in hours—not weeks or months.
Share your needs
Choose your talent
Start your risk-free talent trial
Top talent is in high demand.
Start hiring