
Wenjie Xu

Verified Expert in Engineering

Software Developer

Location
New York, NY, United States
Toptal Member Since
June 24, 2020

Wenjie is a senior data engineer with a proven track record of designing and implementing robust ETL/ELT pipelines, proficient in AWS technologies (Athena, Glue, and Lambda) as well as Apache Airflow. Wenjie is also skilled in consolidating large amounts of data from multiple sources and transforming it into usable information for analytics and data science initiatives.

Portfolio

Carbon Arc
Amazon S3 (AWS S3), AWS Lambda, Amazon Athena, AWS Glue, Spark, Apache Airflow...
BASF
Python 3, SQL, Elasticsearch, Apache Airflow
PACO Technologies, Inc.
Amazon Web Services (AWS), Spark, SQL, Python

Experience

Availability

Part-time

Preferred Environment

Zeppelin, Jupyter Notebook, PyCharm, Visual Studio Code (VS Code)

The most amazing...

...projects I've developed, deployed, and managed are robust, fault-tolerant data pipelines in AWS and Airflow.

Work Experience

Senior Data Engineer

2022 - PRESENT
Carbon Arc
  • Developed robust ETL/ELT pipelines with AWS Athena, Glue, Lambda, and Airflow to ingest large amounts of data (1,000+ object files daily) from multiple data sources into a data lake and multiple warehouse endpoints.
  • Aggregated, tuned, partitioned, and indexed data in support of large-scale analytics and data science initiatives.
  • Performed data modeling design and leveraged an advanced understanding of data and analytics concepts to build tables and views autonomously.
  • Handled large relational data replication as well as streaming and unstructured data inputs.
  • Designed and implemented an ontology platform for data ingestion, cleaning, tagging, and linkage to internal databases; deployed the platform on EC2 with Docker.
Technologies: Amazon S3 (AWS S3), AWS Lambda, Amazon Athena, AWS Glue, Spark, Apache Airflow, CircleCI, PySpark, Redshift
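A pipeline like the one above typically has a Lambda step that maps each arriving raw object to a Hive-style partitioned key in the lake, so Athena and Glue can prune partitions at query time. A minimal standard-library sketch of that step; the prefix layout (`lake/events/...`) and the function name are illustrative assumptions, not the production code:

```python
from datetime import datetime

def partitioned_key(source_key: str, event_time: str) -> str:
    """Map an incoming raw object key to a Hive-style partitioned
    data-lake key (year=/month=/day=), so query engines such as
    Athena can skip partitions that fall outside a query's range."""
    ts = datetime.fromisoformat(event_time)
    filename = source_key.rsplit("/", 1)[-1]  # keep only the object's base name
    return (
        f"lake/events/year={ts.year:04d}/month={ts.month:02d}/"
        f"day={ts.day:02d}/{filename}"
    )

# Example: an object landing on June 1 is routed into that day's partition.
print(partitioned_key("raw/vendor_a/orders.json", "2022-06-01T12:30:00"))
```

In a real Lambda this function would be called from the handler for each S3 event record, with the result used as the destination key of a copy into the lake bucket.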

Data Architect

2020 - 2022
BASF
  • Integrated and maintained unstructured data (metadata and raw data) from various lab equipment, ensuring they adhered to data quality and accessibility standards. Designed data models and data schemas to standardize the data acquisition process.
  • Built and automated data collection, wrangling, processing, and visualization using Plotly on internal Kubernetes clusters.
  • Created and designed an ETL data pipeline using Python to consolidate data from a variety of sources into cloud storage. Utilized Airflow to schedule and maintain data parsers when integrating data in Elasticsearch.
  • Evaluated data frameworks and conducted a POC to determine harmonized data format standards for heterogeneous lab processes and analytical data across internal LIMS platforms.
  • Led team meetings and drove technical discussions to track project progress and achieve OKRs.
  • Developed internal Python packages to simplify data processing workflows, reducing data processing and calculation time by about 80%.
Technologies: Python 3, SQL, Elasticsearch, Apache Airflow
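Enforcing data quality standards on heterogeneous lab metadata, as described above, usually starts with a schema check before records are indexed. A minimal sketch of such a check; the required field names are hypothetical, chosen only to illustrate the shape of the validation:

```python
REQUIRED_FIELDS = {"instrument_id", "sample_id", "timestamp"}

def validate_record(record: dict) -> list[str]:
    """Return a list of data-quality problems for one lab metadata
    record; an empty list means the record passes and may be indexed
    (e.g., into Elasticsearch)."""
    problems = [
        f"missing field: {f}"
        for f in sorted(REQUIRED_FIELDS - record.keys())
    ]
    if "timestamp" in record and not isinstance(record["timestamp"], str):
        problems.append("timestamp must be an ISO-8601 string")
    return problems

# A complete record passes; an incomplete one reports what is missing.
print(validate_record({"sample_id": "s1"}))
```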

Data Analytics Engineer

2017 - 2020
PACO Technologies, Inc.
  • Built an internal data entry system with Python and Flask to improve data quality and eliminate data noise.
  • Automated the data acquisition process to reduce human errors significantly. Configured the server environment on AWS EC2 and RDS with reliable security groups.
  • Designed and developed complex SQL queries, Python scripts, and triggers for ETL jobs. Integrated and maintained data from a variety of sources, ensuring it adhered to data quality and accessibility standards.
  • Generated biweekly ad hoc data reports with CloudWatch, Lambda, and SQL/Excel, eliminating the need for manual queries.
  • Developed a KPI dashboard in Power BI to track company recruiting performance internally and facilitate the decision-making process.
  • Developed, deployed, and managed the data pipeline (DocumentDB, Athena, Redshift, S3, Lambda) that cleans, transforms, and aggregates unorganized and messy data into databases, allowing for seamless collection, storage, and management of big data.
  • Developed data classifiers, mining algorithms, and models for engineering documents sentiment analysis, topic mining, and data visualization.
Technologies: Amazon Web Services (AWS), Spark, SQL, Python

ETL Project: Data Integration from CSV and XML to Relational Database

https://git.toptal.com/Ivan-Ilijasic/wenjie-xu
This was a PySpark-based ETL project developed using Spark SQL and Python scripts to transform CSV/XML data into a relational database. The project integrated data from various formats and sources into the data warehouse to ensure that target data adhered to data quality and accessibility standards. On top of the transformed data, I developed a dashboard using Power BI to provide actionable insights.
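The production pipeline used PySpark, but the shape of the transformation can be sketched with only the standard library: read rows from a CSV source and an XML source, then consolidate them into one relational table. The table and column names here are toy assumptions, and in-memory SQLite stands in for the warehouse:

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

def load_sources(csv_text: str, xml_text: str) -> sqlite3.Connection:
    """Consolidate rows from a CSV source and an XML source into one
    relational table (in-memory SQLite here; a warehouse in practice)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
    # CSV rows: header names become dict keys.
    for row in csv.DictReader(io.StringIO(csv_text)):
        conn.execute("INSERT INTO orders VALUES (?, ?)",
                     (row["order_id"], float(row["amount"])))
    # XML rows: one <order> element per record.
    for el in ET.fromstring(xml_text).iter("order"):
        conn.execute("INSERT INTO orders VALUES (?, ?)",
                     (el.get("id"), float(el.findtext("amount"))))
    conn.commit()
    return conn
```

In PySpark the same steps map to `spark.read.csv`, an XML source, and a `UNION` of the resulting DataFrames before writing to the target tables.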

Web Scraping Using Scrapy

https://github.com/xwjsarah/scraping/blob/master/homedepot.py
I used the Scrapy framework to capture construction projects' bidding information on agency websites (MTA, Port Authority) and deliver clean, reliable bidding data to the marketing department for further decision-making. The project captured over 500 records per day from various websites, replacing manual searching and significantly improving the marketing team's work efficiency.
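Delivering "clean and reliable" records from several agency sites requires a normalization step downstream of the spiders. A minimal sketch of that step in plain Python; the field names and the date format are assumptions for illustration, not the actual site schemas:

```python
from datetime import datetime

def clean_bid(raw: dict) -> dict:
    """Normalize one scraped bid record: trim whitespace, collapse
    internal spacing in the title, unify the due-date format, and
    uppercase the issuing agency so records from different sites
    (e.g., MTA, Port Authority) can be merged consistently."""
    return {
        "agency": raw["agency"].strip().upper(),
        "title": " ".join(raw["title"].split()),
        "due_date": datetime.strptime(
            raw["due_date"].strip(), "%m/%d/%Y"
        ).date().isoformat(),
    }

# Messy scraped input becomes a uniform record.
print(clean_bid({"agency": " mta ", "title": "Bridge  repair", "due_date": "06/01/2020"}))
```

In Scrapy, this kind of function naturally lives in an item pipeline so every spider's output passes through the same cleaning logic.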

Python-Flask Data Entry System Development

I built a Python Flask-based data entry system independently to enable data quality checks and eliminate data noise, and automated the data acquisition process to significantly reduce human error. A dynamic Power BI dashboard was also deployed on this system, tracking data trends and insights as new data fed in. The system ran on AWS EC2 and RDS in a server environment with securely configured security groups.
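The server-side quality check in such a data entry system boils down to rejecting bad submissions before they reach the database. A minimal sketch of that gatekeeping logic; the field names (`record_id`, `quantity`) and rules are hypothetical examples, not the system's actual schema:

```python
def accept_entry(entry: dict, existing_ids: set) -> tuple[bool, str]:
    """Quality check for one submitted row: reject duplicate IDs and
    out-of-range values before insertion. Returns (accepted, reason)."""
    if entry.get("record_id") in existing_ids:
        return False, "duplicate record_id"
    qty = entry.get("quantity")
    if not isinstance(qty, (int, float)) or qty < 0:
        return False, "quantity must be a non-negative number"
    return True, "ok"

# A valid row passes; a resubmission of the same ID is flagged as noise.
print(accept_entry({"record_id": "r1", "quantity": 3}, {"r1"}))
```

In the Flask app, a route handler would call a function like this on each POST and return an error response instead of writing to RDS when the check fails.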

Languages

Python 3, SQL

Tools

Microsoft Power BI, Amazon Elastic MapReduce (EMR), AWS Glue, Spark SQL, CircleCI, PyCharm, Apache Airflow, Amazon Athena

Frameworks

Flask, Scrapy, Spark

Libraries/APIs

Pandas, PySpark, Spark ML

Platforms

AWS Lambda, Jupyter Notebook, Zeppelin, Amazon Web Services (AWS), Visual Studio Code (VS Code)

Storage

Amazon S3 (AWS S3), Redshift, Elasticsearch

2016 - 2018

Master's Degree in Computer Science

Montclair State University - Montclair, New Jersey, USA

JANUARY 2017 - PRESENT

AWS Certified Developer

Amazon Web Services (AWS)
