Benjamin Li, Developer in Oakville, ON, Canada

Verified Expert in Engineering

Bio

Benjamin has over two decades of software and big data development experience, including data modeling and data warehouse design. His active toolset includes Spark, Python, Scala, AWS, Azure, SQL, Hive, Linux, Microsoft BI solutions, C#.NET, and Java. His attention to detail and strong analytical and problem-solving skills make him an excellent addition to any team. A kind and intentional communicator, Benjamin always produces high-quality work.

Portfolio

CIBC (Contractor via MetiSign)
Scala, Python 3, Azure, Azure Databricks, Spark, SQL
Yahoo! - Search
Scala, Apache Spark, Apache Maven, Amazon Web Services (AWS), Java, Spark...
Twitter (Contract via Avenue Code)
Scala, Scalding, HDFS, BigQuery, Apache Hive, Bash Script, Git, Phabricator...

Experience

  • SQL - 20 years
  • Data Warehouse Design - 15 years
  • Azure - 8 years
  • Big Data - 5 years
  • Spark - 4 years
  • Amazon Web Services (AWS) - 3 years
  • Scala - 3 years
  • Python 3 - 3 years

Availability

Full-time

Preferred Environment

Linux, PyCharm, IntelliJ IDEA, Apache Hive, Spark, Amazon Web Services (AWS), Azure, Visual Studio, Windows, SQL Server BI

The most amazing...

...thing I've done was reduce operating costs by 80% by rearchitecting a project and enhancing its code.

Work Experience

Senior Data Engineer and Business Consultant – Expert

2024 - PRESENT
CIBC (Contractor via MetiSign)
  • Developed a structured notes ingestor that downloads structured notes data from on-premises RESTful web services to the CMDA landing zone in Azure Data Lake Storage Gen2 and then ingests it into Databricks Delta tables for downstream ETL, reports, and dashboards.
  • Designed the app to download data from general RESTful APIs while targeting the structured notes API. Developed Scala code to call the RESTful API with OData pagination for large datasets (see the sketch below) and to download dependent entities.
  • Developed Scala code to save the downloaded JSON data into the landing zone and ingest it into Databricks Delta tables (Spark).
  • Improved performance by avoiding small data files. Developed a flexible configuration so the job can load data from new APIs without rebuilding the package.
Technologies: Scala, Python 3, Azure, Azure Databricks, Spark, SQL
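
A minimal Python sketch of the OData pagination pattern described above (the production code was Scala; the endpoint, field names, and token handling here are hypothetical):

    # Follow OData's @odata.nextLink convention to page through large
    # result sets; endpoint and auth are placeholders.
    import requests

    def download_all_pages(base_url, token):
        records, url = [], base_url
        headers = {"Authorization": f"Bearer {token}"}
        while url:
            resp = requests.get(url, headers=headers, timeout=60)
            resp.raise_for_status()
            page = resp.json()
            records.extend(page.get("value", []))
            url = page.get("@odata.nextLink")  # absent on the last page
        return records

    notes = download_all_pages("https://api.example.com/StructuredNotes", "TOKEN")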

Data Engineer

2024 - 2024
Yahoo! - Search
  • Developed a solution to offload heavy-duty work from Airflow orchestration to Amazon ECS.
  • Architected an Airflow operator for Amazon ECS in Python so that long-running ECS tasks do not block the Airflow workflow (see the sketch below).
  • Built example heavy-duty workloads in Python on Amazon ECS to download large data files from the internet and upload them to Amazon S3.
Technologies: Scala, Apache Spark, Apache Maven, Amazon Web Services (AWS), Java, Spark, Hadoop, Jetty, Amazon Elastic MapReduce (EMR), MapReduce, Amazon Neptune, Amazon Elastic Container Registry (ECR), Amazon Elastic Container Service (ECS), ECS
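
One way to keep ECS tasks from blocking Airflow workers is to split submission and polling, as in this hedged boto3 sketch (cluster, task definition, and region are placeholders; the operator's actual internals may differ):

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    def submit_ecs_task(cluster, task_def):
        """Fire and forget: start the heavy job on ECS and return its ARN."""
        resp = ecs.run_task(cluster=cluster, taskDefinition=task_def,
                            launchType="EC2", count=1)
        return resp["tasks"][0]["taskArn"]

    def ecs_task_is_done(cluster, task_arn):
        """Cheap poll suitable for a reschedule-mode Airflow sensor."""
        resp = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        return resp["tasks"][0]["lastStatus"] == "STOPPED"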

Senior Data Engineer

2022 - 2023
Twitter (Contract via Avenue Code)
  • Developed a Scala class to aggregate Twitter user events from a Scalding TypedPipe into metrics for data science (DS) and machine learning (ML) teams to mine for insights.
  • Created Dataflow jobs using Scala and the Apache Beam API to extract, transform, and load (ETL) datasets for bots that detect harmful tweets.
  • Redesigned the Appen UI template for agent questionnaires, reducing the complexity of the Python code that collects agents' responses from the Appen RESTful API and stores the data in BigQuery.
  • Developed Apache Airflow DAGs, tasks, and operators to purge historical data from Appen via its RESTful API and ensure PII compliance (a DAG sketch follows this entry).
  • Built the back end with Scala and the front end with TypeScript, JSON, and YAML for a product that addresses harassment under the trust and safety policy.
  • Created PySpark ETL pipelines to extract, transform, and load datasets in the Parquet format.
  • Updated a Looker-based dashboard that queries multiple datasets.
Technologies: Scala, Scalding, HDFS, BigQuery, Apache Hive, Bash Script, Git, Phabricator, Confluence, Jira, Big Data, Google Cloud Platform (GCP), IntelliJ IDEA, Bazel, Cloud Dataflow, Apache Beam, Python 3, Apache Airflow, REST APIs, Visual Studio Code (VS Code), JavaScript, TypeScript, Apache Thrift, YAML, JSON, HTML, Google Analytics, Spark, Spark SQL, Jupyter Notebook, ETL, Looker, Data Visualization, Parquet, Data, Terraform, Docker, NoSQL, Data Governance, Streaming Data, ETL Tools, Monitoring, Google BigQuery, Google Cloud, Data Science, Machine Learning, Database Migration
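
A hedged sketch of a purge DAG like the Appen one described above (the endpoint, auth, and retention window are hypothetical; the Airflow 2.4+ API is assumed):

    from datetime import datetime, timedelta
    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def purge_history(**_):
        # Delete records older than the retention window via the REST API.
        cutoff = (datetime.utcnow() - timedelta(days=30)).date().isoformat()
        resp = requests.delete("https://api.example.com/v1/history",
                               params={"before": cutoff},
                               headers={"Authorization": "Bearer TOKEN"},
                               timeout=60)
        resp.raise_for_status()

    with DAG(dag_id="history_purge", start_date=datetime(2023, 1, 1),
             schedule="@daily", catchup=False) as dag:
        PythonOperator(task_id="purge", python_callable=purge_history)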

Data Specialist

2021 - 2022
TD Bank (Contract via Procom)
  • Guided the project team as an enterprise data foundation (EDF) consultant in designing Azure Data Factory (ADF) pipelines to process data for 30+ MAL codes (500+ tables) used by enterprise client risk rating (ECRR) and transaction monitoring (TM).
  • Designed ETL using Azure Databricks and Spark DataFrames to load source data, such as CSV, XML, or copybook files, from the raw zone, then cleanse, transform, and persist it into the curated zone as Parquet in a Type-4 SCD layout (see the sketch below).
  • Outlined ADF pipelines to prepare parameters and call Databricks notebooks. Integrated the pipelines into the Rahona orchestration framework for triggering or scheduling to meet SLAs.
  • Integrated QA tests into the CI/CD enterprise delivery pipeline (EDP) on Digital.ai. Coordinated integration testing efforts across multiple teams. Monitored the pipelines with Datadog.
Technologies: Azure, Azure Data Factory, Azure Databricks, Azure Synapse, Azure SQL Databases, Azure Data Lake, SQL, Spark, PySpark, Python 3, Scala, SQL Server Management Studio (SSMS), Data Management, Data Engineering, Big Data, Data Pipelines, Orchestration, Data Analytics, Git, Bitbucket, Confluence, Jira, Visual Studio Code (VS Code), Datadog, Leadership, Data, Solution Architecture, Database Migration, Cloud, Data Architecture, Architecture, Big Data Architecture
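
A minimal PySpark sketch of the Type-4 SCD pattern mentioned above, in which the current table keeps only the latest row per key while superseded rows are appended to a separate history table (paths and column names are hypothetical):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    incoming = spark.read.parquet("/raw/customers_incoming")
    current = spark.read.parquet("/curated/customers_current")

    # Superseded rows move to the history table with an archive timestamp.
    changed = (current.join(incoming.select("customer_id"), "customer_id")
                      .withColumn("archived_at", F.current_timestamp()))
    changed.write.mode("append").parquet("/curated/customers_history")

    # Rebuild the current table as unchanged rows plus the new arrivals.
    unchanged = current.join(incoming.select("customer_id"),
                             "customer_id", "left_anti")
    (unchanged.unionByName(incoming)
              .write.mode("overwrite").parquet("/curated/customers_current_new"))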

Big Data Consultant

2019 - 2021
Sun Life (via a Contractor)
  • Acted as tech lead in the project's second phase and provided technical guidance to the team. Hosted the daily scrum meeting and facilitated team activities.
  • Rearchitected the project and redesigned the code to reduce the number of AWS Glue jobs from 150 to 30, cutting the operating cost by 80%.
  • Developed Python and PySpark code that handles the historical data bulk load and the daily CDC load and builds daily snapshots.
  • Created Hive SQL and Spark SQL to handle complex business transformation logic.
  • Developed the CI/CD pipeline to build, package, and deploy the project to the development, system integration testing, and production environments.
  • Tuned system performance and isolated a data skew issue (a salting sketch follows this entry). Suggested data model adjustments to the business team to prevent the problem from recurring.
  • Tested the solution on Amazon EMR and AWS Glue and deployed the AWS Glue job solution to production.
Technologies: Big Data, Amazon Web Services (AWS), Apache Hive, Amazon S3 (AWS S3), AWS Glue, Zeppelin, SQL, Python 3, PySpark, Spark SQL, Linux, Git, Confluence, Scala, PyCharm, IntelliJ IDEA, Jenkins Pipeline, CI/CD Pipelines, Scrum, Bash, Data Lakes, Data Warehouse Design, Apache Spark, Amazon Neptune, Amazon Elastic MapReduce (EMR), Shell Scripting, Amazon EC2, Amazon RDS, Data Warehousing, Data Management, Data Engineering, Data Architecture, Data, Amazon Virtual Private Cloud (VPC), AWS IAM, AWS CloudFormation, Amazon CloudWatch, Data Integration, Kubernetes, NoSQL, Solution Architecture, Technical Architecture, Data Auditing, Finance, Architecture, Big Data Architecture
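
Data skew of the kind isolated above is commonly broken up by salting the hot join key; a hedged PySpark illustration (paths and column names are hypothetical, and this may not be the exact fix that was applied):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    N = 16  # number of salt buckets

    facts = spark.read.parquet("/lake/facts")    # skewed on account_id
    dims = spark.read.parquet("/lake/accounts")

    # Scatter the hot key across N sub-keys, replicating the dimension rows.
    salted_facts = facts.withColumn("salt", (F.rand() * N).cast("long"))
    salted_dims = dims.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

    joined = salted_facts.join(salted_dims, ["account_id", "salt"]).drop("salt")
    joined.write.mode("overwrite").parquet("/lake/joined")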

Big Data Solution Designer | Architect IV

2016 - 2019
TD Bank Group (via a Contractor)
  • Led a team of three solution developers and successfully delivered several projects for different lines of business (LOB).
  • Worked with business analysts from different lines of business to clarify functional requirements.
  • Designed solutions for projects, documented design specifications, and shared the development work with team members.
  • Developed Apache Hive queries for complex business logic over various source data and delivered ETL solutions (a query sketch follows this entry).
  • Created an Oozie workflow and scheduler to orchestrate and schedule jobs.
  • Built Java solutions to handle mainframe data files in a copybook format.
  • Mentored solution developers, shared design intentions, best practices, and guidelines, and reviewed their code.
Technologies: Big Data, Cloudera, Apache Hive, Oozie, Linux, ETL, SQL, Java, HDFS, TIBCO, Bash Script, MapReduce, IntelliJ IDEA, VirtualBox, Git, Confluence, Jenkins, Bash, Data Lakes, Data Warehouse Design, Apache Maven, Leadership, Team Mentoring, Data Architecture, Data, Solution Architecture, Entity Relationships, ELT, Risk Management, Banking & Finance, Finance, Financial Risk Management, Architecture
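
The Hive queries above were authored in HiveQL and orchestrated by Oozie; as a rough illustration, a query of that shape can be issued from Python via PyHive (the host, database, table, and columns here are hypothetical):

    from pyhive import hive

    conn = hive.connect(host="hive-gateway.example.com", port=10000,
                        username="etl_user", database="lob_curated")
    cur = conn.cursor()
    cur.execute("""
        SELECT account_id, SUM(amount) AS total_amount
        FROM transactions
        WHERE ds = '2019-01-01'
        GROUP BY account_id
    """)
    for account_id, total in cur.fetchall():
        print(account_id, total)
    conn.close()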

Senior Software Developer

2016 - 2016
Creditron
  • Developed SSRS reports according to business needs and deployed them to Azure SSRS.
  • Fixed bugs in existing features and developed new features for an electronic check processing (ECP) payment application using ASP.NET, C#.NET, the .NET Framework, and SQL Server.
  • Created SQL scripts to populate data and showcase a typical ECP system's use cases and scenarios through SSRS reports.
  • Designed a .NET application to automatically deploy SSRS reports using the SSRS web services (see the sketch below).
Technologies: SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server 2015, C#.NET, ASP.NET, Visual Studio, Azure, Azure SQL Databases, Data Warehouse Design, SQL, Microsoft SQL Server, Data Management, Data Engineering, Stored Procedure
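
The deployment tool above used the SSRS SOAP web services from .NET; a hedged Python approximation of the same call via zeep (the server URL, folder, and report name are placeholders):

    from zeep import Client

    # ReportService2010 exposes CreateCatalogItem for uploading report
    # definitions; production use would also configure NTLM credentials.
    client = Client("http://reportserver/ReportService2010.asmx?wsdl")
    with open("Sales.rdl", "rb") as f:
        definition = f.read()
    client.service.CreateCatalogItem(ItemType="Report", Name="Sales",
                                     Parent="/Finance", Overwrite=True,
                                     Definition=definition, Properties=None)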

Senior Software Developer | Scrum Master

2008 - 2016
Hatch
  • Developed SSIS packages to load data from various sources like databases, CSV files, XML files, SOAP web service, RESTful API, and FTP. Applied the data hygiene logic and developed transformations using C# script tasks. Loaded data into databases.
  • Created a data access layer and a business logic layer for applications using C#.NET and the .NET Framework to work with data in SQL Server databases (a data access sketch follows this entry).
  • Architected a RESTful API for applications to access data in SQL Server databases.
  • Used ASP.NET to develop a presentation layer of web applications.
  • Played a scrum master role, facilitated teamwork, and led daily scrum meetings, sprint planning, sprint review, and retrospective meetings.
  • Built a Windows service to replicate employee data from the on-premises SAP system and Active Directory server to the Azure SQL Server database.
  • Created SSRS reports showing the project's progress to project managers.
  • Assembled an interactive dashboard for project managers using Power BI.
Technologies: SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server 2015, C#.NET, ASP.NET, T-SQL (Transact-SQL), TFS, .NET, Data Modeling, Azure, Azure Active Directory, Scrum Master, SQL, Data Warehouse Design, Design Patterns, Service-oriented Architecture (SOA), SOAP, REST APIs, UML, Web Services, Microsoft SQL Server, Databases, Database Structure, Database Transactions, Transactions, SAP, Data Engineering, Data Management, Scrum, Certified ScrumMaster (CSM), Leadership, Communication, Data Visualization, Microsoft Power BI, Microservices, APIs, Business Intelligence (BI), Database Modeling, Entity Relationships, Stored Procedure, API Design, Architecture
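
A minimal Python rendering of the data access layer idea above (the original was C#.NET; the connection string and schema are hypothetical):

    import pyodbc

    class EmployeeDal:
        """Thin wrapper that keeps SQL out of the business logic layer."""
        def __init__(self, conn_str):
            self._conn = pyodbc.connect(conn_str)

        def get_by_department(self, dept):
            cur = self._conn.cursor()
            cur.execute("SELECT id, name, email FROM employees "
                        "WHERE dept = ?", dept)
            return cur.fetchall()

    dal = EmployeeDal("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=sql.example.com;DATABASE=hr;Trusted_Connection=yes")
    print(dal.get_by_department("Engineering"))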

Senior Software Engineer | Team Leader

2004 - 2008
Epsilon
  • Led an engineering team of seven and designed a BI solution for the digital marketing business.
  • Designed and developed ETL packages using SSIS to extract and cleanse data, apply business transformation logic, and load data into a data warehouse.
  • Built the data model, defined the dimensions and facts of SSAS cubes, and developed a strategy to refresh the cubes to keep up with data changes in the warehouse.
  • Developed a set of SSRS reports visualizing business insights of campaigns.
  • Created a tool to deploy SSRS reports into different projects and farms automatically.
  • Enabled viewing data by different categories and granularities by developing a web application with a dashboard and drill-down feature.
Technologies: SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), C#.NET, SQL Server BI, SQL, C++, ASP.NET, Data Modeling, Scrum Master, Data Warehouse Design, T-SQL (Transact-SQL), UML, Design Patterns, Service-oriented Architecture (SOA), SOAP, Web Services, Microsoft SQL Server, Data Management, Data Engineering, Leadership, Team Leadership, Data Visualization, Business Intelligence (BI), DAX, Architecture, Data Architecture

Software Developer

2004 - 2004
Redknee
  • Implemented Unicode short message service (SMS) to support multiple languages.
  • Designed a thread pool to serve concurrent tag-length-value (TLV) records from sockets and files (see the sketch below).
  • Implemented CORBA interfaces for communications across distributed components.
Technologies: Java, Oracle, Linux, StarTeam, CORBA, Design Patterns, Jakarta Server Pages (JSP), SQL
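
An illustrative Python take on the TLV thread-pool design above (the original was Java, and the record layout of a 1-byte tag plus a 2-byte big-endian length is an assumption):

    import struct
    from concurrent.futures import ThreadPoolExecutor

    def parse_tlv(buf):
        """Decode consecutive tag-length-value records from a byte buffer."""
        records, i = [], 0
        while i < len(buf):
            tag, length = struct.unpack_from(">BH", buf, i)
            records.append((tag, buf[i + 3:i + 3 + length]))
            i += 3 + length
        return records

    payloads = [bytes([1, 0, 5]) + b"hello", bytes([2, 0, 3]) + b"sms"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for recs in pool.map(parse_tlv, payloads):
            print(recs)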

Software Developer

2001 - 2004
Invatron
  • Contributed to Periscope, a decision support system designed to optimize perishable food operations for chain stores; it is a widely distributed real-time system with several subsystems, such as Server, MB, Proxy, hhMQ, TSP, TSP-PE, Scheduler, and HITS.
  • Analyzed and designed markdown components using object-oriented methods. Developed a data model and implemented SQL scripts on multiple database systems.
  • Developed spot-checks to review and update real-time inventory. Developed new barcode markdowns for random-weight (type-2 UPC) products (see the sketch below) and coupon discounts for non-type-2 UPC products. Implemented label printing over serial and WiFi.
  • Built a set of generic algorithms in C++ templates to handle various perishable food operations, using Visual C++ on Windows and GCC on Linux and Unix to deploy the application to different operating systems.
  • Created a data access layer via Open Database Connectivity (ODBC) to access multiple database systems, including SQL Server, Oracle, DB2, Informix, and Sybase, so the applications can be deployed against various database systems.
  • Built a messaging framework for communication across the components of the decision support system.
  • Delivered a set of embedded applications to check and adjust inventory, check and mark down the price, and print barcode labels for various devices like hand-held scanners and wall-mounted price checkers.
  • Developed an installation daemon to automatically check and install new application versions for devices like hand-held scanners, wall-mounted price checkers, and point-of-sale (POS) machines in distributed chain stores.
Technologies: C++, Windows, Linux, SQL, SQL Server 2015, Oracle, IBM Informix, IBM Db2, Sybase, Visual Studio, GCC, Bash, Unix, Message Bus, ODBC, Data Modeling, Entity-relationships Model (ERM), T-SQL (Transact-SQL), Microsoft SQL Server, Consumer Packaged Goods (CPG), Food, Point of Sale, POS, Access Points, Entity Relationships, PL/SQL
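
Random-weight (type-2) UPC layouts vary by retailer; a hedged Python sketch assuming one common convention ('2', a 5-digit item code, a price check digit, a 4-digit price in cents, and the UPC check digit):

    def parse_type2_upc(upc):
        assert len(upc) == 12 and upc[0] == "2", "not a random-weight UPC-A"
        return {
            "item_code": upc[1:6],
            "price_cents": int(upc[7:11]),  # digit 6 is the price check digit
            "check_digit": int(upc[11]),
        }

    # Item 12345 priced at $67.89 under the assumed layout.
    print(parse_type2_upc("212345067891"))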

Senior Software Engineer | Team Leader

1995 - 2000
China Construction Bank | Guangdong Branch
  • Led the team that developed a client-server system employing C, C++, Pro*C, and SQL on various Unix and Linux platforms using the Informix database system.
  • Gathered requirements from lines of business, designed the database and ER diagram, and implemented the data model in Informix SQL scripts.
  • Troubleshot production issues, investigated root causes, and found resolutions.
Technologies: C, C++, Pro*C, SQL, IBM Informix, Unix, Linux, HP-UX, Sco Unix, Bash, C Shell, Bourne Shell, KornShell, Entity-relationships Model (ERM), Data Modeling, Entity Relationships

Projects

AWS EMR/Glue ETL Project for an Insurance Business

This is an ETL project on AWS that extracts data from multiple lines of business in the enterprise data lake. The data is transformed according to business logic and loaded into a consumption zone for Tableau reports, so reports can be built quickly on the integrated data model regardless of the varying data models across lines of business.
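
A skeleton of a Glue ETL job in this shape, hedged: the catalog database, table, transformation, and S3 path are placeholders, and the script assumes the AWS Glue runtime:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue = GlueContext(SparkContext.getOrCreate())
    job = Job(glue)
    job.init(args["JOB_NAME"], args)

    # Read a lake table from the Glue Data Catalog, apply business logic,
    # and land the result in the consumption zone for Tableau.
    src = glue.create_dynamic_frame.from_catalog(
        database="lob_lake", table_name="policies").toDF()
    out = (src.filter(F.col("status") == "active")
              .withColumn("load_ts", F.current_timestamp()))
    out.write.mode("overwrite").parquet("s3://bucket/consumption/policies/")

    job.commit()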

Common Reporting Standard (CRS)

The Common Reporting Standard (CRS) is a regulatory initiative among tax authorities for exchanging bank account information at a global level. I developed complex Hive queries, an Oozie workflow, and a scheduler to extract data from the master data management (MDM) system and consolidate accounts from the wealth management system. I discovered data discrepancies, identified their root causes, and enhanced the enterprise data model so that the data provided by this application was accurate and accountable.

Data Lake Ingestion Data Flow

This is an add-on component for a bank moving its enterprise data ingestion to a data lake. I designed the solution and implemented Java classes to parse ingestion logs, extract bad records, convert mainframe copybook data into Unicode, and persist the data into a Hive table for business users to view and fix. I improved application performance 20-fold.
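
An illustrative Python version of the copybook conversion (the original was Java; the fixed-width record layout is hypothetical). EBCDIC bytes decode to Unicode with the cp037 codec:

    RECORD_LAYOUT = [("account_id", 10), ("name", 25), ("balance", 9)]

    def decode_record(raw):
        """Slice fixed-width EBCDIC fields and decode them to Unicode."""
        out, pos = {}, 0
        for field, width in RECORD_LAYOUT:
            out[field] = raw[pos:pos + width].decode("cp037").strip()
            pos += width
        return out

    sample = "0000123456John Smith               000010050".encode("cp037")
    print(decode_record(sample))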

Global Procurement Intelligence (GPI)

I designed the data model, a C#.NET and ASP.NET web application, SSIS packages with complex logic, and SSRS reports for a global procurement intelligence (GPI) system that helps significantly optimize sourcing decisions.

Enterprise Data Foundation (EDF)

EDF is a conformed layer in the enterprise curated zone of the Akora cloud (Azure), bringing together enterprise-relevant data for analytics, reporting, and consumption by downstream applications. I developed the Azure Data Factory (ADF) data pipelines, Azure Databricks notebooks, and the Azure Synapse database.

Periscope Server

Periscope is a decision support system (DSS) designed to optimize perishable food operations for chain stores. It is a widely distributed real-time system that includes a central Periscope server and message broker, as well as several subsystems distributed across the stores: Proxy, hhMQ, TPS, TSP, PE, Scheduler, HITS, and others.

Python and PySpark Job for False Discovery Rate (FDR)

The false discovery rate (FDR) project evaluates the effectiveness of bots that assess tweets and users according to defined rules (with special rules for ad users), applying labels when violations are found. The project samples a portion of the bot-labeled data, sends it to human agents for evaluation, collects the results, and analyzes the rate of false labels.
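
A hedged PySpark sketch of the sampling step (paths, label names, and rates are hypothetical): stratified sampling keeps rare violation labels represented in the human-review batch.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    labeled = spark.read.parquet("/lake/bot_labels")

    # Different sampling rates per label; sampleBy draws a stratified sample.
    fractions = {"spam": 0.01, "abuse": 0.05, "ad_policy": 0.10}
    sample = labeled.sampleBy("label", fractions=fractions, seed=42)
    sample.write.mode("overwrite").parquet("/lake/fdr_review_batch")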

BI Solution for Adtech

This project uses SSIS to extract, transform, and load ad campaign and response data, such as views, clicks, or purchases, from the transaction system into the data warehouse. It then uses SSRS reports to help clients visualize campaign insights.
Education

1992 - 1995

Master's Degree in Computer Science

Fudan University - Shanghai, China

1988 - 1992

Bachelor's Degree in Computer Science

National University of Defense Technology - Changsha, China

Certifications

MAY 2015 - MAY 2019

Certified Scrum Master

Scrum Alliance

Libraries/APIs

PySpark, Jenkins Pipeline, REST APIs, ODBC, JDBC, Standard Template Library (STL), Scalding

Tools

PyCharm, IntelliJ IDEA, AWS Glue, Spark SQL, Git, Confluence, Jenkins, Cloudera, Oozie, Visual Studio, TFS, SQL Server BI, Apache Airflow, VirtualBox, GCC, Hue, Eclipse IDE, BigQuery, Apache Maven, Phabricator, Jira, Bazel, Cloud Dataflow, Apache Beam, Amazon Elastic MapReduce (EMR), Google Analytics, Bitbucket, Looker, Microsoft Power BI, Terraform, Amazon Virtual Private Cloud (VPC), AWS IAM, AWS CloudFormation, Amazon CloudWatch, Jetty, Amazon Elastic Container Registry (ECR), Amazon Elastic Container Service (ECS)

Languages

SQL, Bash, C#.NET, C++, Java, Python 3, Scala, Python, T-SQL (Transact-SQL), UML, C, Pro*C, C Shell, Bourne Shell, Snowflake, Bash Script, JavaScript, TypeScript, YAML, HTML, Stored Procedure

Frameworks

Spark, ASP.NET, .NET, Jakarta Server Pages (JSP), Hadoop, Yarn, Apache Spark, Apache Thrift

Paradigms

Database Design, Business Intelligence (BI), ETL, Scrum, Agile, MapReduce, Design Patterns, Service-oriented Architecture (SOA), Microservices

Storage

SQL Server 2016, Apache Hive, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Microsoft SQL Server, Database Architecture, Amazon S3 (AWS S3), HDFS, Azure SQL Databases, SQL Server Analysis Services (SSAS), Azure Active Directory, MySQL, PostgreSQL, Data Lakes, IBM Informix, IBM Db2, Sybase, Redshift, Data Pipelines, JSON, Databases, Database Structure, Database Transactions, SQL Server Management Studio (SSMS), Datadog, Azure SQL, Data Integration, NoSQL, Database Modeling, PL/SQL, Google Cloud, Database Migration, Master Data Management (MDM)

Platforms

Linux, Windows, Amazon Web Services (AWS), Zeppelin, Azure, Apache Kafka, Databricks, Oracle, Unix, HP-UX, KornShell, Google Cloud Platform (GCP), Visual Studio Code (VS Code), Amazon EC2, Azure Synapse, Jupyter Notebook, Docker, Kubernetes

Industry Expertise

Banking & Finance

Other

Data Modeling, Big Data, Data Warehouse Design, Data Engineering, Data Analysis, Data Analytics, Reverse Engineering, Software Engineering, Software, TIBCO, SQL Server 2015, Azure Data Factory, CI/CD Pipelines, Scrum Master, Data Warehousing, StarTeam, CORBA, SOAP, Web Services, Message Bus, Sco Unix, Entity-relationships Model (ERM), Enterprise Architecture, MSMQ, Azure Data Lake, Amazon Neptune, Shell Scripting, Amazon RDS, Transactions, SAP, Azure Databricks, Data Management, Orchestration, Consumer Packaged Goods (CPG), Food, Point of Sale, POS, Access Points, Leadership, Team Leadership, Certified ScrumMaster (CSM), Communication, Team Mentoring, Consulting, Data Visualization, Data Architecture, Parquet, Data, Data Governance, Streaming Data, Solution Architecture, APIs, ETL Tools, Monitoring, Technical Architecture, Data Auditing, DAX, Entity Relationships, Data Build Tool (dbt), ELT, API Design, Risk Management, Finance, Financial Risk Management, Google BigQuery, Data Science, Machine Learning, Cloud, Architecture, Big Data Architecture, ECS
