J Chan, Developer in Bangkok, Thailand

J Chan

Verified Expert in Engineering

Data Engineer and Software Developer

Bangkok, Thailand

Toptal member since November 20, 2022

Bio

J is a data engineer with extensive experience in master data management and data governance best practices. She has 6+ years of experience using SQL and Python for data mining, extracting master attributes, and engineering ETL pipelines. J mostly relies on Spark SQL and AWS cloud services for data transformation and migration in reliable and seamless workflows.

Portfolio

Freelance Client
PySpark, T-SQL (Transact-SQL), AWS Glue, AWS CodeBuild, Terraform, Talend...
Early Data
Python 3, SQL, Git, Amazon Web Services (AWS), Linux, PyTorch, Pandas...
PureLiving
Customer Service, Customer Service Strategy, Public Speaking...

Experience

  • Python 3 - 6 years
  • Data Engineering - 6 years
  • SQL - 6 years
  • eCommerce - 5 years
  • Linux - 4 years
  • Amazon Web Services (AWS) - 4 years
  • Machine Learning - 3 years
  • Spark SQL - 2 years

Availability

Full-time

Preferred Environment

Python 3, SQL, Amazon Web Services (AWS)

The most amazing...

...pieces of code I've written were a handful of lines using PyTorch to categorize eCommerce products by image to meet client demand.

Work Experience

ETL Developer (via Toptal)

2023 - 2025
Freelance Client
  • Helped execute a lift-and-shift of data migration and transformation jobs to AWS Glue to achieve significant cost savings.
  • Designed and developed feature enhancements and user stories via stored procedures and Python orchestration scripts as executed in AWS Glue.
  • Supported active client migrations simultaneously with hotfixes and custom client data change requests, even reverse engineering inherited scripts as necessary.
  • Defined and refined the QA process and managed the product release process.
  • Served as one of the solution architects for AWS workflow automation, incorporating multiple AWS services like Glue, Step Functions, and Lambda to integrate 3rd-party APIs and execute the main ETL job in under one hour, which previously took weeks.
Technologies: PySpark, T-SQL (Transact-SQL), AWS Glue, AWS CodeBuild, Terraform, Talend, AWS Step Functions, AWS Lambda, Infrastructure as Code (IaC), Bitbucket

Head of Master Data Management

2017 - 2022
Early Data
  • Helped manage the transfer of inherited Talend master data management workflows to modern technologies such as Python, Spark SQL, and AWS, mapping a further 20% of product master brands.
  • Applied text and image classification neural nets to increase the accuracy of master data management attribute tagging to over 90%, beyond what was feasible with simple regex patterns.
  • Developed an overarching master data management program to improve downstream business intelligence efforts.
  • Restructured the team and improved overall work culture, increasing team productivity by 50%.
  • Refined product name clustering and SKU identification with scikit-learn's linear kernel and proactively arranged human resources to support the master data management SKU tagging workflow, reducing the corresponding workload by 80%.
  • Developed automated Spark QC scripts, reducing the workload of quality control workflow by 40%.
  • Managed testing document sign-off for a data management platform, conducting software unit testing, user acceptance testing, and operational qualification protocol for a Fortune 500 medical device company.
  • Maintained an ELT solution spanning Microsoft SQL Server, SSIS, and stored procedures for a Fortune 500 personal care company.
Technologies: Python 3, SQL, Git, Amazon Web Services (AWS), Linux, PyTorch, Pandas, TensorFlow, Talend, Spark SQL, SQL Server Integration Services (SSIS), ETL, Data Mining, Data Engineering, Machine Learning, IT Project Management, People Management, Data Warehouse Design, eCommerce, Master Data Management (MDM), Business Intelligence (BI)

Account Manager

2016 - 2017
PureLiving
  • Created custom bilingual (Chinese) proposals for 60+ existing and prospective clients for a wide range of professional indoor environmental quality programs.
  • Educated the public about the importance of environmental air quality strategy at local community events.
  • Ensured the proper maintenance of air filtration systems through regular client follow-ups.
Technologies: Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations

Researcher

2011 - 2015
Harvard University
  • Collected tree crown property data for 235 trees to establish the relationship between functional properties and demographic rates to determine climate change impact on carbon sequestration rates (Bukit Timah Nature Reserve, Singapore).
  • Constructed capacitance curves for 100 leaves and projected how climate change will alter forest structure (Pasoh Forest Reserve, Malaysia).
  • Presented results of study at the 9th Annual Harvard Plant Biology Symposium.
Technologies: Statistics, Field Research

Experience

Master Data Management Project | A Text and Image Classifier Duo for Intelligent Category Mapping

Much of the data crawled directly from the major Chinese eCommerce platforms has dirty business-relevant master attributes, such as category or brand. Because these are the primary segments clients rely on to conduct effective business intelligence, these master attributes must be intelligently mined and standardized to support downstream analyses and deliver the benefits of proper master data management.

We found that relying on regex patterns to tag crawled products to the nutritional health category was not accurate, so I collaborated with a data science colleague to deploy text classification in the form of MXNet's Transformer model to perform more intelligent product name tagging.

To further improve the accuracy of category mapping, I managed the implementation of a VGG-16 image classifier with fast.ai and PyTorch to make use of product images as well, because sometimes the product name sounded like a nutritional health item while the image indicated otherwise. Consolidating the two results into an enriched tagging workflow increased our average category mapping accuracy to 93.5%, which met the client's quality requirement for publishing to the production environment and launching the final digital product.
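The write-up above does not say how the two classifier outputs were consolidated. As a minimal sketch of one possible rule, assume each model returns a probability that a product belongs to the nutritional health category, and the product is tagged only when the average of the two probabilities clears a threshold. The function name, the averaging rule, and the 0.5 threshold are all hypothetical and may differ from the actual workflow.

```python
def consolidate_tags(text_prob: float, image_prob: float, threshold: float = 0.5) -> bool:
    """Return True when the combined evidence supports the category tag.

    Hypothetical rule: average the two model probabilities and compare
    the result with a threshold. The real workflow may have differed.
    """
    for p in (text_prob, image_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    combined = (text_prob + image_prob) / 2
    return combined >= threshold


# A name that sounds like a supplement, but an image that does not look like one.
print(consolidate_tags(text_prob=0.9, image_prob=0.05))
# Both models agree on the category.
print(consolidate_tags(text_prob=0.8, image_prob=0.7))
```

Under this assumed rule, the first product is rejected because the image evidence pulls the average below the threshold, while the second is tagged.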

ETL Project | Transfer of Talend ETL Workflows to Spark SQL/AWS

Our ETL data pipelines, which processed millions of rows of crawled eCommerce product and category-level data, were originally written as Talend jobs. Given the instability of the weekly loads, I was tasked with transferring all of the ETL logic to another technology. After careful review, I decided Spark SQL was the best choice, given the large amount of data and the relative ease with which those pipelines could be run on AWS. Other teams were increasingly reliant on AWS technology, which greatly reduced the learning curve for such an endeavor. Finally, I was already skilled in SQL and its flexible functionality, which made Spark SQL an even more logical choice.

After painstakingly reviewing and documenting the logic and monitoring topline statistics of the existing load, I gradually rewrote all nine workflows, comparing each test load with the production load to make sure no column was missing any crucial logic. After several months, the last workflow was written, and I rounded out the endeavor by packaging the scripts to S3 with a set of shell scripts that allowed the entire ETL workflow to run on Amazon EMR. The new workflow not only ran more stably but also ran approximately 40% faster and was easier to QC.
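The comparison of a test load against the production load could look something like the following sketch. It uses plain Python lists of dictionaries rather than Spark, and the choice of statistics (non-null count and distinct count per column), the function names, and the sample rows are assumptions made for illustration, not the original scripts.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def topline_stats(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    """Compute the non-null count and distinct count for every column."""
    stats: Dict[str, Dict[str, int]] = {}
    columns = {col for row in rows for col in row}
    for col in sorted(columns):
        non_null = [row[col] for row in rows if row.get(col) is not None]
        stats[col] = {"non_null": len(non_null), "distinct": len(set(non_null))}
    return stats


def compare_loads(prod: List[Row], test: List[Row]) -> List[str]:
    """List the columns whose topline statistics differ between the loads."""
    prod_stats, test_stats = topline_stats(prod), topline_stats(test)
    all_columns = sorted(set(prod_stats) | set(test_stats))
    return [c for c in all_columns if prod_stats.get(c) != test_stats.get(c)]


# Hypothetical sample rows.
prod_load = [
    {"sku": "A1", "brand": "Acme"},
    {"sku": "B2", "brand": "Bolt"},
]
test_load = [
    {"sku": "A1", "brand": "Acme"},
    {"sku": "B2", "brand": None},  # the rewritten logic failed to fill a brand
]
print(compare_loads(prod_load, test_load))
```

Here only the `brand` column is flagged, which points at the piece of logic that still needs to be carried over.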

Master Data Management Project | Optimizing Master Brand Mapping in Spark SQL

Due to an overhaul of the existing infrastructure prompted by optimization roadblocks, I was tasked with transferring the Talend logic that mapped crawled eCommerce data to master brands to another technology. Because I had previous experience with Spark SQL and recognized its great data mining capacity given its distributed nature, I started documenting the existing logic by poring over the Talend workflows, with the intention of using Spark SQL for the final code.

The biggest challenge by far was using the existing keyword look-up table for keyword mapping: merging the look-up table was effectively a cross join with no way to partition efficiently, which greatly reduced the distributed computing potential of Spark SQL. After days of research, I came across the concept of using another column to force certain partitions to travel together so that they were processed together, thereby allowing Spark's distributed power to take over as expected. From this experience, I gained more confidence not only in my Spark code but also in myself as a developer.
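The description does not name the extra column that was used. As a sketch of the general idea in plain Python rather than Spark, assume that both the products and the keyword look-up table carry a shared `category` column (a hypothetical choice). Grouping keywords by that column means each product is compared only with the keywords in its own group, instead of with every keyword as in a cross join. In Spark SQL this corresponds roughly to adding an equality condition on the extra column to the join, which gives the engine a key to partition on.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical look-up rows: (category, keyword, master_brand)
KEYWORDS: List[Tuple[str, str, str]] = [
    ("health", "vita", "VitaCo"),
    ("health", "omega", "OmegaLabs"),
    ("beauty", "glow", "GlowInc"),
]

Blocks = Dict[str, List[Tuple[str, str]]]


def index_by_block(keywords: List[Tuple[str, str, str]]) -> Blocks:
    """Group look-up rows by the blocking column (category in this sketch)."""
    blocks: Blocks = defaultdict(list)
    for category, keyword, brand in keywords:
        blocks[category].append((keyword, brand))
    return blocks


def map_brand(name: str, category: str, blocks: Blocks) -> Optional[str]:
    """Match a product only against keywords that share its block."""
    lowered = name.lower()
    for keyword, brand in blocks.get(category, []):
        if keyword in lowered:
            return brand
    return None


blocks = index_by_block(KEYWORDS)
print(map_brand("Vita C 500mg tablets", "health", blocks))
print(map_brand("Glow serum", "health", blocks))
```

The second call finds no brand because the "glow" keyword lives in a different block, which is the trade-off of this approach: the blocking column has to be reliable, or matches across blocks are missed.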

Education

2011 - 2015

Master's Degree in Organismic and Evolutionary Biology

Harvard University - Cambridge, MA, USA

2008 - 2011

Bachelor's Degree in Ecology, Behavior, and Evolution

University of California, Los Angeles - Los Angeles, CA, USA

Certifications

OCTOBER 2023 - PRESENT

Google Cybersecurity

Google

APRIL 2023 - PRESENT

Snowflake Decoded - Fundamentals and Hands On Training

Udemy

APRIL 2023 - PRESENT

Microsoft Power BI Desktop for Business Intelligence

Udemy

MARCH 2023 - FEBRUARY 2026

AWS Certified Cloud Practitioner

Amazon Web Services

AUGUST 2021 - JULY 2024

AWS Partner: Accreditation (Technical)

Amazon Web Services

Skills

Libraries/APIs

Pandas, Amazon EC2 API, PyTorch, TensorFlow, PySpark

Tools

Amazon Elastic MapReduce (EMR), Git, Spark SQL, AWS Glue, MATLAB, Microsoft Power BI, AWS CodeBuild, Jira, Bitbucket, AWS Step Functions, Amazon Virtual Private Cloud (VPC), AWS IAM, Terraform

Languages

Python 3, SQL, T-SQL (Transact-SQL), Snowflake

Paradigms

ETL, Business Intelligence (BI)

Platforms

Amazon Web Services (AWS), Talend, Linux, AWS Lambda, Amazon EC2

Storage

Master Data Management (MDM), Amazon S3 (AWS S3), Redshift, SQL Server Integration Services (SSIS), Microsoft SQL Server

Frameworks

MXNet

Other

Data Engineering, IT Project Management, eCommerce, Data Mining, Business Analytics, Business Services, Machine Learning, People Management, Data Warehouse Design, Cloud Computing, Optical Character Recognition (OCR), Text Classification, Statistics, Research, Field Research, Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations, SIEM, Intrusion Detection Systems (IDS), NIST, Threat Modeling, Amazon AppFlow, Infrastructure as Code (IaC)
