
J Chan
Verified Expert in Engineering
Data Engineer and Software Developer
Bangkok, Thailand
Toptal member since November 20, 2022
J is a data engineer with extensive experience in master data management and data governance best practices. She has 6+ years of experience using SQL and Python for data mining, extracting master attributes, and engineering ETL pipelines. J mostly relies on Spark SQL and AWS cloud services for data transformation and migration in reliable and seamless workflows.
Experience
- Python 3 - 6 years
- Data Engineering - 6 years
- SQL - 6 years
- eCommerce - 5 years
- Linux - 4 years
- Amazon Web Services (AWS) - 4 years
- Machine Learning - 3 years
- Spark SQL - 2 years
Preferred Environment
Python 3, SQL, Amazon Web Services (AWS)
The most amazing...
...pieces of code I've written were a handful of lines using PyTorch to categorize eCommerce products by image to meet client demand.
Work Experience
ETL Developer (via Toptal)
Freelance Client
- Helped execute a lift-and-shift of data migration and transformation jobs to AWS Glue, achieving significant cost savings.
- Designed and developed feature enhancements and user stories via stored procedures and Python orchestration scripts as executed in AWS Glue.
- Supported active client migrations simultaneously with hotfixes and custom client data change requests, even reverse engineering inherited scripts as necessary.
- Defined and refined the QA process and managed the product release process.
- Served as one of the solution architects for AWS workflow automation, incorporating multiple AWS services like Glue, Step Functions, and Lambda to integrate third-party APIs and cut the main ETL job from weeks to under one hour.
Head of Master Data Management
Early Data
- Helped manage the transfer of inherited Talend master data management workflows to modern technologies such as Python, Spark SQL, and AWS, mapping a further 20% of product master brands.
- Applied text and image classification neural nets to raise the accuracy of master data management attribute tagging to over 90%, beyond what was feasible with simple regex patterns.
- Developed an overarching master data management program to improve downstream business intelligence efforts.
- Restructured the team and improved overall work culture, increasing team productivity by 50%.
- Refined the product name clustering and SKU identification with scikit-learn's linear_kernel function and proactively arranged human resources to support the master data management SKU tagging workflow, reducing the corresponding workload by 80%.
- Developed automated Spark QC scripts, reducing the workload of quality control workflow by 40%.
- Managed testing document sign-off for a data management platform, conducting software unit testing, user acceptance testing, and operational qualification protocol for a Fortune 500 medical device company.
- Maintained an ELT solution spanning Microsoft Server, SSIS, and stored procedures for a Fortune 500 personal care company.
Account Manager
PureLiving
- Created custom bilingual (Chinese) proposals for 60+ existing and prospective clients for a wide range of professional indoor environmental quality programs.
- Educated the public about the importance of environmental air quality strategy at local community events.
- Ensured the proper maintenance of air filtration systems through regular client follow-ups.
Researcher
Harvard University
- Collected tree crown property data for 235 trees to establish the relationship between functional properties and demographic rates to determine climate change impact on carbon sequestration rates (Bukit Timah Nature Reserve, Singapore).
- Constructed capacitance curves for 100 leaves and projected how climate change will alter forest structure (Pasoh Forest Reserve, Malaysia).
- Presented results of study at the 9th Annual Harvard Plant Biology Symposium.
Experience
Master Data Management Project | A Text and Image Classifier Duo for Intelligent Category Mapping
We found that relying on regex patterns to tag crawled products to the nutritional health category was not accurate, so I collaborated with a data science colleague to deploy text classification in the form of MXNet's Transformer model to perform more intelligent product name tagging.
To further improve the accuracy of category mapping, I managed the implementation of a VGG-16 image classifier with fast.ai and PyTorch to use product images as well, since a product name sometimes sounded like a nutritional health item while the image indicated otherwise. Consolidating the two results into an enriched tagging workflow increased our average category mapping accuracy to 93.5%, which met the client's quality requirement for publishing to the production environment and launching the final digital product.
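The profile does not say how the two classifier outputs were consolidated. As a minimal sketch of one possible rule, assuming each model returns a probability that a product belongs to the nutritional health category, a product could be tagged only when both models agree. The function name, the probability inputs, and the 0.5 threshold are all hypothetical and are not taken from the project.

```python
def is_nutritional_health(text_prob: float, image_prob: float,
                          threshold: float = 0.5) -> bool:
    """Tag a product only when both classifiers clear the threshold.

    text_prob and image_prob are assumed to be the probabilities
    produced by the text and image classifiers for the same product.
    The agreement rule and the threshold are illustrative assumptions.
    """
    return text_prob >= threshold and image_prob >= threshold


# A name that sounds like a health product, but an image that disagrees.
misleading_name = is_nutritional_health(text_prob=0.9, image_prob=0.2)
# Both signals agree.
consistent_item = is_nutritional_health(text_prob=0.9, image_prob=0.8)
```

Requiring agreement trades some recall for precision, which suits the case described above where a product name alone is misleading.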
ETL Project | Transfer of Talend ETL Workflows to Spark SQL/AWS
After painstakingly reviewing and documenting the logic and monitoring topline statistics of the existing load, I gradually rewrote all nine workflows, comparing my test load with the production load to make sure no column was missing crucial logic. After several months, the last workflow was written, and I rounded out the endeavor by packaging the scripts to S3 with a set of shell scripts that allowed the entire ETL workflow to run on Amazon EMR. The new workflow not only ran more stably but also ran approximately 40% faster and was easier to QC.
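The comparison of a test load against the production load can be sketched with simple per-column topline statistics. This is an illustrative, plain-Python version that assumes each load is available as a list of row dictionaries. The choice of statistics (non-null count and distinct count) and the function names are assumptions, not the checks actually used in the project.

```python
def column_stats(rows):
    """Return {column: (non_null_count, distinct_count)} for a list of row dicts."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            entry = stats.setdefault(col, {"non_null": 0, "distinct": set()})
            if val is not None:
                entry["non_null"] += 1
                entry["distinct"].add(val)
    return {col: (e["non_null"], len(e["distinct"])) for col, e in stats.items()}


def compare_loads(prod_rows, test_rows):
    """Return the columns whose topline statistics differ between the two loads."""
    prod, test = column_stats(prod_rows), column_stats(test_rows)
    return {
        col: (prod.get(col), test.get(col))
        for col in sorted(set(prod) | set(test))
        if prod.get(col) != test.get(col)
    }


# Hypothetical sample data: the rewritten workflow loses a brand value.
prod_load = [{"sku": "A", "brand": "x"}, {"sku": "B", "brand": "y"}]
test_load = [{"sku": "A", "brand": "x"}, {"sku": "B", "brand": None}]
diff = compare_loads(prod_load, test_load)
```

A column that appears in `diff` points to logic that the rewritten workflow may have dropped or changed, which is the kind of gap the comparison is meant to surface.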
Master Data Management Project | Optimizing Master Brand Mapping in Spark SQL
The biggest challenge by far was using the existing keyword look-up table for keyword mapping: merging the look-up table was effectively a cross join with no efficient way to partition, which greatly reduced the distributed computing potential of Spark SQL. After days of research, I came across the concept of using another column to force certain partitions to travel together so that they were processed together, thereby allowing Spark's distributed power to take over as expected. From this experience, I gained more confidence not only in my Spark code but also in myself as a developer.
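The idea of grouping on an extra column can be illustrated outside Spark. The sketch below is plain Python rather than Spark SQL, and it assumes a shared `category` column exists on both the products and the look-up table. That column, the sample data, and the function name are hypothetical; the profile does not say which column was used. Each product is compared only against keywords in its own group instead of against the whole table.

```python
from collections import defaultdict


def map_brands(products, lookup):
    """Map product names to brands by keyword, comparing only within a shared group.

    products: list of (name, category) tuples
    lookup:   list of (keyword, category, brand) tuples
    """
    # Group look-up rows by the shared column, loosely mimicking co-partitioning.
    by_category = defaultdict(list)
    for keyword, category, brand in lookup:
        by_category[category].append((keyword.lower(), brand))

    results = []
    for name, category in products:
        lowered = name.lower()
        # Only keywords from the same group are checked, not the full table.
        brand = next(
            (b for kw, b in by_category.get(category, []) if kw in lowered),
            None,
        )
        results.append((name, brand))
    return results


products = [
    ("Acme Whey Protein 1kg", "nutrition"),
    ("Zenith Running Shoes", "apparel"),
    ("Unknown Item", "toys"),
]
lookup = [
    ("acme", "nutrition", "Acme Foods"),
    ("zenith", "apparel", "Zenith Sports"),
    ("acme", "apparel", "Acme Wear"),
]
mapped = map_brands(products, lookup)
```

In Spark, the analogous step would be to join or repartition on the extra column so that matching rows land in the same partition, rather than comparing every product against every keyword.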
Education
Master's Degree in Organismic and Evolutionary Biology
Harvard University - Cambridge, MA, USA
Bachelor's Degree in Ecology, Behavior, and Evolution
University of California, Los Angeles - Los Angeles, CA, USA
Certifications
Google Cybersecurity
Snowflake Decoded - Fundamentals and Hands On Training
Udemy
Microsoft Power BI Desktop for Business Intelligence
Udemy
AWS Certified Cloud Practitioner
Amazon Web Services
AWS Partner: Accreditation (Technical)
Amazon Web Services
Skills
Libraries/APIs
Pandas, Amazon EC2 API, PyTorch, TensorFlow, PySpark
Tools
Amazon Elastic MapReduce (EMR), Git, Spark SQL, AWS Glue, MATLAB, Microsoft Power BI, AWS CodeBuild, Jira, Bitbucket, AWS Step Functions, Amazon Virtual Private Cloud (VPC), AWS IAM, Terraform
Languages
Python 3, SQL, T-SQL (Transact-SQL), Snowflake
Paradigms
ETL, Business Intelligence (BI)
Platforms
Amazon Web Services (AWS), Talend, Linux, AWS Lambda, Amazon EC2
Storage
Master Data Management (MDM), Amazon S3 (AWS S3), Redshift, SQL Server Integration Services (SSIS), Microsoft SQL Server
Frameworks
MXNet
Other
Data Engineering, IT Project Management, eCommerce, Data Mining, Business Analytics, Business Services, Machine Learning, People Management, Data Warehouse Design, Cloud Computing, Optical Character Recognition (OCR), Text Classification, Statistics, Research, Field Research, Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations, SIEM, Intrusion Detection Systems (IDS), NIST, Threat Modeling, Amazon AppFlow, Infrastructure as Code (IaC)