J Chan

Verified Expert in Engineering

Data Engineer and Software Developer

Location

El Paso, TX, United States

Toptal Member Since

November 20, 2022

J is a data engineer with extensive experience in master data management and data governance best practices. She has five years of experience using SQL and Python for data mining, extracting master attributes, and performing general big data quality control. For ETL, J mostly relies on Spark SQL and AWS cloud services to build reliable and seamless workflows.

Data Engineering, Data Mining, IT Project Management, Python 3, SQL, Pandas, Business Intelligence (BI), Master Data Management (MDM), Amazon Web Services (AWS), Talend, ETL, Amazon S3 (AWS S3), Amazon EC2 API, Redshift, Machine Learning

Portfolio

Early Data

Python 3, SQL, Git, Amazon Web Services (AWS), Linux, PyTorch, Pandas...

PureLiving

Customer Service, Customer Service Strategy, Public Speaking...

Harvard University

Statistics, Field Research

Experience

Python 3 - 5 years, eCommerce - 5 years, Data Engineering - 5 years, SQL - 5 years, Linux - 4 years, Machine Learning - 3 years, Amazon Web Services (AWS) - 3 years, Spark SQL - 2 years

Availability

Part-time

Preferred Environment

Python 3, SQL, Linux

The most amazing...

...(or one of the most amazing) pieces of code I've written was a handful of lines using PyTorch to categorize eCommerce products by image, in response to client demand.

Work Experience

Head of Master Data Management

2017 - 2022

Early Data

Helped manage the transfer of inherited Talend master data management workflows to modern technologies such as Python, Spark SQL, and AWS, mapping a further 20% of product master brands.
Applied text and image classification neural nets to raise the accuracy of master data management attribute tagging to over 90%, beyond what was feasible with simple regex patterns.
Developed an overarching master data management program to improve downstream business intelligence efforts.
Restructured the team and improved overall work culture, increasing team productivity by 50%.
Refined product name clustering and SKU identification with scikit-learn's linear kernel and proactively allocated staff to support the master data management SKU tagging workflow, reducing the corresponding workload by 80%.
Developed automated Spark QC scripts, reducing the quality control workload by 40%.
Managed testing document sign-off for a data management platform, conducting software unit testing, user acceptance testing, and the operational qualification protocol for a Fortune 500 medical device company.
Maintained an ELT solution spanning Microsoft Server, SSIS, and stored procedures for a Fortune 500 personal care company.

Technologies: Python 3, SQL, Git, Amazon Web Services (AWS), Linux, PyTorch, Pandas, TensorFlow, Talend, Spark SQL, SQL Server Integration Services (SSIS), ETL, Data Mining, Data Engineering, Machine Learning, IT Project Management, People Management, Data Warehouse Design, eCommerce, Master Data Management (MDM), Business Intelligence (BI)

Account Manager

2016 - 2017

PureLiving

Ensured the proper maintenance of air filtration systems through regular client follow-ups.
Created custom bilingual proposals for 60+ existing and prospective clients for a wide range of professional indoor environmental quality programs.
Educated the public about the importance of environmental air quality strategy at a variety of local community events.

Technologies: Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations

Researcher

2011 - 2015

Harvard University

Collected tree crown property data for 235 trees to establish the relationship between functional properties and demographic rates to determine climate change impact on carbon sequestration rates (Bukit Timah Nature Reserve, Singapore).
Constructed capacitance curves for 100 leaves and projected how climate change will alter forest structure (Pasoh Forest Reserve, Malaysia). Presented results of the study at the 9th Annual Harvard Plant Biology Symposium.
Moderated discussion sections and provided constructive feedback for undergraduate students in a course on Classical Chinese Ethical and Political Theory.

Technologies: Statistics, Field Research

Experience

Master Data Management Project | A Text and Image Classifier Duo for Intelligent Category Mapping

Much of the data crawled directly from the major Chinese eCommerce platforms has dirty business-relevant master attributes, such as category or brand. Because these are the primary segments clients rely on for business intelligence, the master attributes must be mined and standardized intelligently to support downstream analyses and deliver the benefits of proper master data management.

We found that regex patterns did not tag crawled products to the nutritional health category accurately, so I collaborated with a data science colleague to deploy a text classifier based on MXNet's Transformer model for more intelligent product name tagging.

To further improve the accuracy of category mapping, I managed the implementation of a VGG-16 image classifier with fast.ai and PyTorch so that product images could be used as well, because a product name sometimes sounded like a nutritional health item while the image indicated otherwise. Consolidating the two results into an enriched tagging workflow increased our average category mapping accuracy to 93.5%, which met the client's quality requirement for publishing to the production environment and launching the final digital product.
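The consolidation of the two classifier results could be sketched as follows. This is a minimal illustration, not the production workflow: the `Prediction` class, the `consolidate` function, the averaging rule, the 0.5 threshold, and the sample probabilities are all assumptions, since the description above does not say how the two results were combined.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Probability that a product belongs to the target category."""
    text_prob: float   # from the product-name classifier
    image_prob: float  # from the product-image classifier


def consolidate(pred: Prediction, threshold: float = 0.5) -> bool:
    """Average the two probabilities and compare with a threshold.

    Averaging is an assumed rule chosen for illustration only.
    """
    combined = (pred.text_prob + pred.image_prob) / 2
    return combined >= threshold


# Hypothetical case: the name sounds like a supplement, the image does not.
misleading_name = Prediction(text_prob=0.80, image_prob=0.10)
# Hypothetical case: both signals agree.
clear_match = Prediction(text_prob=0.90, image_prob=0.85)

print(consolidate(misleading_name))  # False
print(consolidate(clear_match))      # True
```

A weighted average or a rule that requires both classifiers to agree would fit the same interface; which rule works best depends on how each classifier tends to fail.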

ETL Project | Transfer of Talend ETL Workflows to Spark SQL/AWS

Originally, our ETL data pipelines, which handled millions of rows of crawled eCommerce product and category-level data, were written as Talend jobs. Given the instability of the weekly loads, I was tasked with transferring all of the ETL logic to another technology. After careful review, I decided Spark SQL was the best choice, given the large amount of data and the relative ease with which those pipelines could be run on AWS. Other teams were increasingly relying on AWS, which greatly reduced the learning curve for such an endeavor. Finally, I was already skilled in SQL and its flexible functionality, which made Spark SQL an even more logical choice.

After painstakingly reviewing and documenting the logic and monitoring topline statistics of the existing load, I gradually rewrote all nine workflows, comparing my test load with the production load to make sure none of the columns were missing any crucial logic. After several months, the last workflow was written, and I rounded out the endeavor by packaging the scripts to S3 with a set of shell scripts that allowed the entire ETL workflow to run on Amazon EMR. The new workflow not only ran more stably but also ran approximately 40% faster and was easier to QC.
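The comparison of a test load against the production load can be sketched roughly as below. The helpers `topline_stats` and `compare_loads`, the chosen statistics (row count and non-null count per column), and the sample rows are hypothetical; the description above does not list which statistics were monitored.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def topline_stats(rows: List[Row]) -> Dict[str, int]:
    """Return the row count plus the non-null count for each column."""
    stats = {"__rows__": len(rows)}
    for row in rows:
        for column, value in row.items():
            key = f"non_null:{column}"
            stats[key] = stats.get(key, 0) + (value is not None)
    return stats


def compare_loads(production: List[Row], test: List[Row]) -> Dict[str, tuple]:
    """List every statistic whose value differs between the two loads."""
    prod_stats = topline_stats(production)
    test_stats = topline_stats(test)
    return {
        name: (prod_stats.get(name), test_stats.get(name))
        for name in sorted(set(prod_stats) | set(test_stats))
        if prod_stats.get(name) != test_stats.get(name)
    }


# Hypothetical sample loads.
production_load = [
    {"sku": "A1", "brand": "Acme"},
    {"sku": "B2", "brand": "Bolt"},
]
test_load = [
    {"sku": "A1", "brand": "Acme"},
    {"sku": "B2", "brand": None},  # brand logic missing in the rewrite
]

print(compare_loads(production_load, test_load))
# {'non_null:brand': (2, 1)}
```

An empty result means the two loads agree on every tracked statistic, which is a necessary but not sufficient check that the rewritten logic matches the original.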

Master Data Management Project | Optimizing Master Brand Mapping in Spark SQL

Due to an overhaul of the existing infrastructure prompted by optimization roadblocks, I was tasked with transferring the Talend logic that mapped crawled eCommerce data to master brands to another technology. Because I had previous experience with Spark SQL and recognized its data mining capacity given its distributed nature, I started documenting the existing logic by poring over the Talend workflows, with the intention of using Spark SQL for the final code.

The biggest challenge by far was using the existing keyword look-up table for keyword mapping: merging the look-up table was effectively a cross join with no efficient way to partition, which greatly reduced Spark SQL's distributed computing potential. After days of research, I came across the technique of using an additional column to force related data to travel together in the same partitions so that it was processed together, allowing Spark's distributed processing to work as expected. From this experience, I gained more confidence not only in my Spark code but also in myself as a developer.
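The idea of grouping on an additional column can be illustrated in plain Python, without Spark. This is only a sketch under stated assumptions: the shared column is taken to be `category`, the function `map_brands` and all sample data are hypothetical, and substring matching stands in for whatever keyword logic the original workflow used. In Spark SQL, the analogous step would presumably be adding an equality condition on the shared column to the join so that it is no longer a pure cross join.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Product = Tuple[str, str]       # (category, product name)
Keyword = Tuple[str, str, str]  # (category, keyword, master brand)


def map_brands(products: List[Product], keywords: List[Keyword]) -> Dict[str, str]:
    """Match product names to master brands, comparing only within a category."""
    # Group the look-up table by the shared column so each product is
    # checked against its own category's keywords, not the whole table.
    keywords_by_category: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for category, keyword, brand in keywords:
        keywords_by_category[category].append((keyword.lower(), brand))

    matches: Dict[str, str] = {}
    for category, name in products:
        for keyword, brand in keywords_by_category.get(category, []):
            if keyword in name.lower():
                matches[name] = brand
                break  # first matching keyword wins (an assumed rule)
    return matches


# Hypothetical sample data.
products = [
    ("vitamins", "Acme Vitamin C 500mg"),
    ("skincare", "Bolt Daily Moisturizer"),
    ("skincare", "Unbranded Face Towel"),
]
keywords = [
    ("vitamins", "acme", "Acme Health"),
    ("skincare", "bolt", "Bolt Beauty"),
]

print(map_brands(products, keywords))
# {'Acme Vitamin C 500mg': 'Acme Health', 'Bolt Daily Moisturizer': 'Bolt Beauty'}
```

Each product is compared only with the keywords in its own group, so the number of comparisons scales with the group sizes rather than with the full product-by-keyword grid.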

Education

2011 - 2015

Master's Degree in Organismic and Evolutionary Biology

Harvard University - Cambridge, MA, USA

2008 - 2011

Bachelor's Degree in Ecology, Behavior, and Evolution

University of California, Los Angeles - Los Angeles, CA, USA

Certifications

APRIL 2023 - PRESENT

Snowflake Decoded - Fundamentals and Hands On Training

Udemy

APRIL 2023 - PRESENT

Microsoft Power BI Desktop for Business Intelligence

Udemy

MARCH 2023 - FEBRUARY 2026

AWS Certified Cloud Practitioner

Amazon Web Services

AUGUST 2021 - JULY 2024

AWS Partner: Accreditation (Technical)

Amazon Web Services

Skills

Libraries/APIs

Pandas, Amazon EC2 API, PyTorch, TensorFlow

Tools

Amazon Elastic MapReduce (EMR), Git, Spark SQL, MATLAB, Microsoft Power BI

Languages

Python 3, SQL, Snowflake

Paradigms

ETL, Business Intelligence (BI)

Platforms

Amazon Web Services (AWS), Talend, Linux

Storage

Master Data Management (MDM), Amazon S3 (AWS S3), Redshift, SQL Server Integration Services (SSIS)

Frameworks

MXNet

Other

Data Engineering, IT Project Management, eCommerce, Data Mining, Business Analytics, Business Services, Machine Learning, People Management, Data Warehouse Design, Cloud Computing, OCR, Text Classification, Statistics, Research, Field Research, Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations
