Benjamin Li
Verified Expert in Engineering
Software Developer
Oakville, ON, Canada
Toptal member since November 3, 2021
Benjamin has over two decades of software and big data development experience, including data modeling and data warehouse design. His active toolset includes Spark, Python, Scala, AWS, Azure, SQL, Hive, Linux, Microsoft BI solutions, C#.NET, and Java. His attention to detail and strong analytical and problem-solving skills make him an excellent addition to any team. A kind and intentional communicator, Benjamin always produces high-quality work.
Preferred Environment
Linux, PyCharm, IntelliJ IDEA, Apache Hive, Spark, Amazon Web Services (AWS), Azure, Visual Studio, Windows, SQL Server BI
The most amazing...
...thing I've done was to reduce operation costs by 80% by rearchitecting a project and enhancing the code.
Work Experience
Senior Data Engineer & Business Consultant - Expert
CIBC (Contractor via MetiSign)
- Developed a Structured Notes Ingestor that downloads structured notes data from on-premises RESTful web services to the CMDA landing zone in Azure Gen2 storage and then ingests it into Databricks Delta tables for later ETL, reports, or dashboards.
- Designed the app to download data from general RESTful APIs while targeting the structured notes data API. Developed Scala code to call RESTful APIs with OData pagination for large datasets, and developed dependent-entity downloads.
- Developed Scala code to save downloaded JSON data into the landing zone and ingest the data into Databricks Delta tables (Spark).
- Improved performance by avoiding small data files. Developed flexible configuration so that the job can load data from new APIs without rebuilding the package.
Data Engineer with Spark & Scala expertise for Global Platform
Yahoo! - Search
- Developed a solution to offload heavy-duty work from Airflow orchestration using AWS ECS.
- Developed an Airflow operator for AWS ECS in Python so that ECS tasks do not block the Airflow workflow while they perform heavy-duty work.
- Developed an example heavy-duty workload in Python on AWS ECS that downloads large data files from the internet and uploads them to AWS S3.
Senior Data Engineer
Twitter (Contract via Avenue Code)
- Developed a Scala class to aggregate Twitter user events from Scalding TypedPipe into metrics for data science (DS) and machine learning (ML), making it possible to use them and find insights.
- Created Dataflow jobs using Scala and Apache Beam API to extract, transform, and load (ETL) datasets for bots to detect harmful tweets.
- Redesigned the Appen UI template for agent questionnaires, reducing the complexity of the Python code used for collecting agents' responses from the Appen RESTful API and storing data in BigQuery.
- Developed Apache Airflow DAGs, tasks, and operators to purge history data from Appen via RESTful API and ensure PII compliance.
- Built the back end with Scala and the front end with TypeScript, JSON, and YAML for a product that addresses harassment for trust and safety policy.
- Created PySpark ETL pipelines in Python to process datasets in Parquet format.
- Updated a Looker-based dashboard that queries multiple datasets.
Data Specialist
TD Bank (Contract via Procom)
- Guided the project team as an enterprise data foundation (EDF) consultant in designing Azure Data Factory (ADF) pipelines to process data for 30+ MAL codes (500+ tables) used by enterprise client risk rating (ECRR) and transaction monitoring (TM).
- Designed ETL using Azure Databricks and Spark DataFrame to load source data in formats such as CSV, XML, or CopyBook from the raw zone, and then cleanse, transform, and persist it into the curated zone in Parquet format as Type-4 SCD.
- Outlined ADF pipelines to prepare parameters and call Databricks notebooks. Integrated the pipelines into the Rahona orchestration framework for triggering or scheduling to meet SLAs.
- Integrated QA tests into the CI/CD enterprise delivery pipeline (EDP) on digital.ai. Coordinated the efforts of integration tests across multiple teams. Monitored the pipelines on Datadog.
Big Data Consultant
Sun Life (via a Contractor)
- Acted as a tech lead during the project's second phase and provided technical guidance to the team. Hosted daily scrum meetings and facilitated team activities.
- Rearchitected the project and redesigned the code to reduce the number of AWS Glue jobs from 150 to 30. This reduced the operation cost by 80%.
- Developed Python and PySpark code that handles the history data bulk load and the daily CDC load and builds daily snapshots.
- Created Hive SQL and Spark SQL to handle complex business transformation logic.
- Developed the CI/CD pipeline to build, package, and deploy the project to development, system integration, and production testing.
- Tuned system performance and located a data skew issue. Provided suggestions to the business team on adjusting the data model to prevent the problem from recurring.
- Tested the solution in Amazon EMR and AWS Glue and deployed the AWS Glue job solution to production.
Big Data Solution Designer | Architect IV
TD Bank Group (via a Contractor)
- Led a team of three solution developers and successfully delivered several projects for different lines of business (LOB).
- Worked with business analysts from different lines of business to clarify functional requirements.
- Designed solutions for projects, documented design specifications, and shared the development work with team members.
- Developed Apache Hive queries for complex business logic with various source data and delivered ETL solutions.
- Created an Oozie workflow and scheduler to orchestrate and schedule jobs.
- Built Java solutions to handle mainframe data files in a copybook format.
- Mentored solution developers, shared design intentions, best practices, and guidelines, and reviewed solution developers' code.
Senior Software Developer
Creditron
- Developed SSRS reports according to business needs and deployed them to Azure SSRS.
- Fixed bugs in existing features and developed new features for an electronic check processing (ECP) payment application using ASP.NET, C#.NET, .NET Framework, and SQL Server.
- Created SQL scripts to populate data and showcase the ECP system's typical use cases and scenarios through SSRS reports.
- Designed a .NET application to automatically deploy SSRS reports using SSRS web services.
Senior Software Developer | Scrum Master
Hatch
- Developed SSIS packages to load data from various sources, such as databases, CSV files, XML files, SOAP web services, RESTful APIs, and FTP. Applied data hygiene logic, developed transformations using C# script tasks, and loaded the data into databases.
- Created a data access layer and a business logic layer of applications using C#.NET and .NET Framework to work with data in SQL Server databases.
- Architected a RESTful API for applications to access data in SQL Server databases.
- Used ASP.NET to develop a presentation layer of web applications.
- Served as scrum master, facilitated teamwork, and led daily scrum meetings, sprint planning, sprint reviews, and retrospective meetings.
- Built a Windows service to replicate employee data from the on-premises SAP system and Active Directory server to the Azure SQL Server database.
- Created SSRS reports showing the project's progress to project managers.
- Assembled an interactive dashboard for project managers using Power BI.
Senior Software Engineer | Team Leader
Epsilon
- Led an engineering team of seven and designed a BI solution for the digital marketing business.
- Designed and developed ETL packages using SSIS to extract and cleanse data, apply business transformation logic, and load data into a data warehouse.
- Built the data model. Defined the dimensions and facts of SSAS cubes. Developed a strategy to refresh the cubes to catch up with data changes in a warehouse.
- Developed a set of SSRS reports visualizing business insights of campaigns.
- Created a tool to deploy SSRS reports into different projects and farms automatically.
- Enabled viewing data by different categories and granularities by developing a web application with a dashboard and drill-down feature.
Software Developer
Redknee
- Implemented Unicode short message service (SMS) to support multiple languages.
- Designed a thread pool to serve concurrent tag-length-value (TLV) records from sockets and files.
- Implemented CORBA interfaces for communications across distributed components.
Software Developer
Invatron
- Contributed to Periscope, a decision support system designed to optimize perishable food operations for chain stores. Periscope is a widely distributed real-time system with several subsystems, such as Server, MB, Proxy, hhMQ, TSP, TSP-PE, Scheduler, and HITS.
- Analyzed and designed markdown components using object-oriented (OO) methods. Developed a data model and implemented SQL scripts in multiple database systems.
- Developed spot-checks to review and update a real-time inventory. Developed new barcode markdowns for random-weighted (type-2 UPC) products and coupon discounts for non-type-2 UPC products. Implemented label printing over serial and WiFi.
- Built a set of generic algorithms in C++ templates to handle various perishable food operations using Visual C++ on Windows and GCC on Linux and Unix to deploy the application to different operating systems.
- Created a data access layer via the Open Database Connectivity (ODBC) to access multiple database systems, including SQL Server, Oracle, DB2, Informix, and Sybase. The applications can be deployed with various database systems.
- Built a messaging framework for communication across the components of the decision support system.
- Delivered a set of embedded applications to check and adjust inventory, check and mark down the price, and print barcode labels for various devices like hand-held scanners and wall-mounted price checkers.
- Developed an installation daemon to automatically check and install new application versions for devices like hand-held scanners, wall-mounted price checkers, and point-of-sale (POS) machines in distributed chain stores.
Senior Software Engineer | Team Leader
China Construction Bank | Guangdong Branch
- Led the team that developed a client-server system employing C, C++, Pro*C, and SQL on various Unix and Linux platforms using the Informix database system.
- Gathered requirements from lines of business, designed the database and ER diagram, and implemented the data model in Informix SQL scripts.
- Troubleshot production issues, investigated root causes, and found resolutions.
Experience
AWS EMR/Glue ETL Project for an Insurance Business
Common Reporting Standard (CRS)
Data Lake Ingestion Data Flow
Global Procurement Intelligence (GPI)
Enterprise Data Foundation (EDF)
Periscope Server
Python and PySpark Job for False Discovery Rate (FDR)
BI Solution for Adtech
Education
Master's Degree in Computer Science
Fudan University - Shanghai, China
Bachelor's Degree in Computer Science
National University of Defense Technology - Changsha, China
Certifications
Certified Scrum Master
Scrum Alliance
Skills
Libraries/APIs
PySpark, Jenkins Pipeline, REST APIs, ODBC, JDBC, Standard Template Library (STL), Scalding
Tools
PyCharm, IntelliJ IDEA, AWS Glue, Spark SQL, Git, Confluence, Jenkins, Cloudera, Oozie, Visual Studio, TFS, SQL Server BI, Apache Airflow, VirtualBox, GCC, Hue, Eclipse IDE, BigQuery, Apache Maven, Phabricator, Jira, Bazel, Cloud Dataflow, Apache Beam, Amazon Elastic MapReduce (EMR), Google Analytics, Bitbucket, Looker, Microsoft Power BI, Terraform, Amazon Virtual Private Cloud (VPC), AWS IAM, AWS CloudFormation, Amazon CloudWatch, Jetty
Languages
SQL, Bash, C#.NET, C++, Java, Python 3, Scala, Python, T-SQL (Transact-SQL), UML, C, Pro*C, C Shell, Bourne Shell, Snowflake, Bash Script, JavaScript, TypeScript, YAML, HTML, Stored Procedure
Frameworks
Spark, ASP.NET, .NET, Jakarta Server Pages (JSP), Hadoop, Yarn, Apache Spark, Apache Thrift
Paradigms
Database Design, Business Intelligence (BI), ETL, Scrum, Agile, MapReduce, Design Patterns, Service-oriented Architecture (SOA), Microservices
Storage
SQL Server 2016, Apache Hive, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Microsoft SQL Server, Database Architecture, Amazon S3 (AWS S3), HDFS, Azure SQL Databases, SQL Server Analysis Services (SSAS), Azure Active Directory, MySQL, PostgreSQL, Data Lakes, IBM Informix, IBM Db2, Sybase, Redshift, Data Pipelines, JSON, Databases, Database Structure, Database Transactions, SQL Server Management Studio (SSMS), Datadog, Azure SQL, Data Integration, NoSQL, Database Modeling, PL/SQL, Google Cloud, Database Migration, Master Data Management (MDM)
Platforms
Linux, Windows, Amazon Web Services (AWS), Zeppelin, Azure, Apache Kafka, Databricks, Oracle, Unix, HP-UX, KornShell, Google Cloud Platform (GCP), Visual Studio Code (VS Code), Amazon EC2, Azure Synapse, Jupyter Notebook, Docker, Kubernetes
Industry Expertise
Banking & Finance
Other
Data Modeling, Big Data Architecture, Data Warehouse Design, Data Engineering, Data Analysis, Data Analytics, Reverse Engineering, Software Engineering, Software, TIBCO, SQL Server 2015, Azure Data Factory, CI/CD Pipelines, Scrum Master, Data Warehousing, StarTeam, CORBA, SOAP, Web Services, Message Bus, SCO Unix, Entity-relationship Model (ERM), Enterprise Architecture, MSMQ, Azure Data Lake, Amazon Neptune, Shell Scripting, Amazon RDS, Transactions, SAP, Azure Databricks, Data Management, Orchestration, Consumer Packaged Goods (CPG), Food, Point of Sale, POS, Access Points, Leadership, Team Leadership, Certified ScrumMaster (CSM), Communication, Team Mentoring, Consulting, Data Visualization, Data Architecture, Parquet, Data, Data Governance, Streaming Data, Solution Architecture, APIs, ETL Tools, Monitoring, Technical Architecture, Data Auditing, DAX, Entity Relationships, Data Build Tool (dbt), ELT, API Design, Risk Management, Finance, Financial Risk Management, Google BigQuery, Data Science, Machine Learning, Cloud, Architecture