Hafiz Hamid, Developer in San Francisco, CA, United States

Hafiz Hamid

Verified Expert in Engineering

Web Scraping Developer

Location
San Francisco, CA, United States
Toptal Member Since
June 18, 2020

Hafiz is a seasoned software architect who has led complex software projects for the last 12 years in full-time roles at organizations like Bing (Microsoft), Lyft, and Salesforce.com; he is now pursuing a freelancing career. His areas of expertise are back-end/server development, databases, big data, cloud computing, DevOps, web crawling, and search engines.

Portfolio

Lyft, Inc.
Hadoop, Apache Flink, Apache Kafka, AWS Cloud Architecture, Amazon DynamoDB...
Salesforce.com
Apache Lucene, Apache Solr, Java
Microsoft (Bing Search)
Machine Learning, Apache Hive, Hadoop, Microsoft SQL Server, C#, .NET

Experience

Availability

Part-time

Preferred Environment

Git, Linux, MacOS

The most amazing...

...thing I've built was a real-time streaming data pipeline at Lyft. I also built a web crawler at Bing.com that scraped 1 billion pages every day.

Work Experience

Staff Software Engineer (Full-time)

2015 - 2018
Lyft, Inc.
  • Worked as the tech lead and architect on the streaming platform team and drove its vision and strategy.
  • Built Lyft's real-time event ingestion and pub/sub infrastructure, which ingests and moves more than 200 billion events every day.
  • Developed Lyft's highly scalable and reliable message bus, which hundreds of internal microservices use to communicate with each other asynchronously (a simplified sketch follows this entry).
  • Maintained multiple tier-0 services with five-nines reliability SLAs.
  • Trained and mentored dozens of other engineers.
Technologies: Hadoop, Apache Flink, Apache Kafka, AWS Cloud Architecture, Amazon DynamoDB, Amazon Kinesis, Amazon CloudWatch, Redshift, Amazon S3 (AWS S3), Amazon Simple Queue Service (SQS), AWS Lambda, Amazon EC2, Python
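
The snippet below is a minimal, hypothetical illustration of the pub/sub pattern described above, written in Python with the kafka-python client; the broker address, topic name, and event schema are assumptions for the example, not Lyft's actual system.

# Minimal pub/sub sketch (pip install kafka-python). Broker address, topic
# name, and event schema are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "ride_events"  # hypothetical topic name

def publish_event(producer: KafkaProducer, event: dict) -> None:
    """Publish an event as JSON, keyed by ride_id so events for the same ride
    land in the same partition and keep their ordering."""
    producer.send(
        TOPIC,
        key=str(event["ride_id"]).encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
    )

if __name__ == "__main__":
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    publish_event(producer, {"ride_id": 42, "type": "ride_requested"})
    producer.flush()  # block until the broker acknowledges the event

    # A downstream microservice consumes asynchronously via its own group:
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        group_id="pricing-service",    # each consumer group keeps its own offset
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.key, json.loads(message.value))
        break  # demo: read one event and stop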

Principal Member of Technical Staff (Full-time)

2014 - 2015
Salesforce.com
  • Developed several relevance features, which involved customizing Apache Lucene's scoring framework for Salesforce's needs.
  • Built the infrastructure to enable runtime feature extraction for training an ML-based ranker and integrated it into Apache Solr's query processing pipeline.
  • Designed the search infrastructure to scale Salesforce search's static rank feature out to 100% of documents (currently only partially enabled due to infrastructure limitations); a conceptual sketch follows this entry.
Technologies: Apache Lucene, Apache Solr, Java
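
The sketch below is a conceptual Python illustration of the static-rank idea mentioned above: blending a query-dependent text score with a query-independent document quality signal. The real work was done inside Lucene/Solr in Java; the formula and weights here are invented for the example.

# Conceptual sketch (not Lucene/Java): combine a text-match score with a
# per-document static rank signal. Weights and formula are assumptions.
import math

def blended_score(text_score: float, static_rank: float, alpha: float = 0.8) -> float:
    """Blend the query-dependent text score with a query-independent
    document quality signal (static rank) on a log scale."""
    return alpha * text_score + (1 - alpha) * math.log1p(static_rank)

# Rank a few hypothetical candidate documents for one query.
candidates = [
    {"doc": "acct-001", "text_score": 2.1, "static_rank": 15.0},
    {"doc": "acct-002", "text_score": 2.0, "static_rank": 400.0},
]
ranked = sorted(
    candidates,
    key=lambda d: blended_score(d["text_score"], d["static_rank"]),
    reverse=True,
)
for doc in ranked:
    print(doc["doc"], round(blended_score(doc["text_score"], doc["static_rank"]), 3))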

Senior Software Engineer (Full-time)

2005 - 2014
Microsoft (Bing Search)
  • Led a team of engineers to develop scalable infrastructure for a distributed web crawler and content extraction platform, enabling it to crawl hundreds of millions of web documents every day from hundreds of websites (such as Amazon.com, IMDb.com, and Walmart.com) and parse them to extract structured content for enriching Bing's search index.
  • Received a Microsoft Gold Star Award for the above project.
  • Developed a log mining platform to enrich the local search index; it algorithmically discovers URLs and search keywords associated with local businesses (restaurants, hotels, banks, etc.) by mining search result click logs (petabytes of data). The platform is used in more than 20 Bing markets to enrich the local search index and cut down the URL coverage gap with Google.
  • Worked both as the technical lead and in an IC capacity to enhance and evolve a machine learning-based text classification framework (originally conceived by Microsoft Research) into a classification platform and integrate it with the local data pipeline.
  • For the above project, developed a process to train, evaluate, and consume statistical models that classify hundreds of millions of local businesses around the world into a taxonomy of more than 1,000 categories (a simplified sketch follows this entry).
  • Managed (from a tech-lead standpoint) the day-to-day maintenance and operations of a local data ingestion/processing pipeline that feeds into the index of Bing local search engine.
  • Worked on back-end data acquisition/processing pipeline for Bing Entertainment search (music, movies, TV shows, and more).
Technologies: Machine Learning, Apache Hive, Hadoop, Microsoft SQL Server, C#, .NET
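
As referenced in the classification bullets above, the following is a small, hypothetical Python/scikit-learn sketch of training a taxonomy text classifier; the actual Bing platform was built on C#/.NET and internal tooling, and the categories and training samples below are invented for illustration.

# Hypothetical sketch: TF-IDF features + logistic regression over a business
# category taxonomy. Training data and category names are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "Joe's Pizza - wood-fired pies and calzones",
    "First National Bank - checking and savings accounts",
    "Sunrise Hotel - rooms, suites, and free breakfast",
]
labels = ["restaurant", "bank", "hotel"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # word and bigram features
    ("clf", LogisticRegression(max_iter=1000)),      # multiclass over categories
])
model.fit(texts, labels)

# Predict a category for a new, unseen business description.
print(model.predict(["Blue Lagoon Pizza and Grill"]))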

Professional Services Consultant (Full-time)

2005 - 2006
Teradata Corporation
  • Developed an automated ETL framework for DHL (a Teradata customer) to ingest data from multiple heterogeneous sources and integrate it into an enterprise data warehouse.
  • Led a team of four developers on the Eircom metadata-driven ETL tool project, which built generic parsing and transformation engines to extract data from more than 50 different semi-structured CDR formats (a simplified sketch follows this entry). Eircom is Ireland's leading telecommunications operator.
  • Conducted Teradata training sessions and data warehouse workshops for new hires.
Technologies: Teradata, Java, SQL
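
The Eircom bullet above refers to metadata-driven parsing; the snippet below is a small, hypothetical Python sketch of that idea, where each CDR format is described as data (field name, offset, length) so new formats are onboarded by adding metadata rather than code. The real tool was built in Java/SQL, and the format definitions here are invented.

# Hypothetical metadata-driven parser for fixed-width CDR records.
CDR_FORMATS = {
    "switch_a": [("caller", 0, 10), ("callee", 10, 10), ("duration_s", 20, 6)],
    "switch_b": [("caller", 0, 12), ("callee", 12, 12), ("duration_s", 24, 4)],
}

def parse_cdr(line: str, format_name: str) -> dict:
    """Slice one fixed-width CDR record according to its format metadata."""
    spec = CDR_FORMATS[format_name]
    return {name: line[start:start + length].strip() for name, start, length in spec}

print(parse_cdr("00155512340015559876000042", "switch_a"))
# -> {'caller': '0015551234', 'callee': '0015559876', 'duration_s': '000042'}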

OpenSecrets.org Scraper

This Python program uses the open-source Scrapy framework to scrape OpenSecrets.org for campaign contribution data. Given a company or entity name as input (e.g., "Disney"), the scraper downloads information about the contributions that entity has made to national and local election campaigns in the United States over the past 25 years.
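
A minimal Scrapy spider in the spirit of the project described above is sketched below; the start URL, CSS selectors, and pagination link are placeholders, since OpenSecrets.org's real page structure (and the project's actual spider) may differ.

# Hypothetical Scrapy spider sketch; URLs and selectors are placeholders.
import scrapy

class ContributionsSpider(scrapy.Spider):
    name = "opensecrets_contributions"

    def __init__(self, org="Disney", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical search URL built from the organization name passed in.
        self.start_urls = [f"https://www.opensecrets.org/search?q={org}"]

    def parse(self, response):
        # Yield one item per row of contribution data (table layout assumed).
        for row in response.css("table tr"):
            yield {
                "cycle": row.css("td:nth-child(1)::text").get(),
                "recipient": row.css("td:nth-child(2)::text").get(),
                "amount": row.css("td:nth-child(3)::text").get(),
            }
        # Follow a "next page" link if one exists (selector assumed).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# Example run: scrapy runspider opensecrets_spider.py -a org="Disney" -O contributions.json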

Lyft, Inc.

Technologies: Python, AWS Cloud (EC2, Lambda, Kinesis, DynamoDB, SQS, S3, Redshift, CloudWatch), Apache Kafka, Apache Flink, Hadoop/Hive

Salesforce.com

Technologies: Java, Apache Solr/Lucene, Search Relevancy

Microsoft (Bing Search)

I worked on this web crawling and extraction framework.

Technologies: C#/.NET, Microsoft SQL Server, Hadoop/Hive, Machine Learning

Teradata Corporation

Technologies: SQL, Java, Teradata

Languages

Python, SQL, C#.NET, Java, HTML, XQuery, XML, XPath, C#, JavaScript

Frameworks

Hadoop, Scrapy, Flask, .NET, Django

Tools

Amazon Simple Queue Service (SQS), Amazon CloudWatch, Zapier, Apache Solr, Git, Flink

Paradigms

DevOps, ETL, Agile Software Development

Platforms

AWS Lambda, Amazon EC2, Amazon Web Services (AWS), Apache Kafka, Apache Flink, MacOS, Linux

Storage

Amazon DynamoDB, PostgreSQL, Redshift, Amazon S3 (AWS S3), Databases, Teradata, SQL Server 2010, Apache Hive, Elasticsearch, Microsoft SQL Server

Other

Data Warehouse Design, Web Scraping, Data Warehousing, Amazon Kinesis Data Firehose, Big Data, Amazon Kinesis, Big Data Architecture, Stream Processing, Large Scale Distributed Systems, Pub/Sub, Machine Learning, Search Engine Development, Information Retrieval, Data Modeling, Text Classification, AWS Cloud Architecture

Libraries/APIs

Apache Lucene

2009 - 2011

Master's Degree in Computer Science and Engineering

University of Washington - Seattle, WA, USA

2001 - 2005

Bachelor's Degree in Computer Science

FAST | National University of Computer and Emerging Sciences - Islamabad, Pakistan

JANUARY 2006 - PRESENT

Teradata Certified Master

Teradata Corporation
