Hafiz Hamid

Web Scraping Developer in San Francisco, CA, United States

Member since June 18, 2020
Hafiz is a seasoned software architect who has led complex software projects for the last 12 years at organizations like Bing (Microsoft), Lyft, and Salesforce.com in full-time roles; he is now pursuing a freelancing career. His areas of expertise are back-end/server development, databases, big data, cloud computing, DevOps, web crawling, and search engines.

Location

San Francisco, CA, United States

Availability

Part-time

Preferred Environment

Git, Linux, macOS

The most amazing...

...thing I've built was a real-time streaming data pipeline at Lyft. While at Bing, I also built a web crawler that scraped 1 billion pages every day.
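
Since web crawling is the headline skill on this profile, here is a minimal Scrapy spider sketch of the kind of crawl-and-extract loop described above. It is illustrative only: the start URL, selectors, and item fields are placeholder assumptions, not code from the Bing crawler.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        """Minimal crawl-and-extract sketch; not production crawler code."""
        name = "example"
        # Placeholder seed URL; a large-scale crawler would feed URLs from a frontier/queue.
        start_urls = ["https://example.com/"]
        custom_settings = {
            "ROBOTSTXT_OBEY": True,   # respect robots.txt
            "DOWNLOAD_DELAY": 1.0,    # stay polite; real crawlers rate-limit per host
        }

        def parse(self, response):
            # Emit one structured item per fetched page.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            # Follow links to keep crawling.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it, for example, with scrapy runspider example_spider.py -o items.json to write the extracted items to a JSON file.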

Employment

  • Staff Software Engineer (Full-time)

    2015 - 2018
    Lyft, Inc.
    • Worked as the tech lead and architect on the streaming platform team and drove its vision and strategy.
    • Built Lyft's real-time event ingestion and pub/sub infrastructure, which ingests and moves more than 200 billion events every day (a minimal pub/sub sketch appears at the end of this Employment section).
    • Developed Lyft's highly scalable and reliable message bus, which hundreds of internal microservices use to communicate with each other asynchronously.
    • Maintained multiple tier-0 services with five-nines reliability SLAs.
    • Trained and mentored dozens of other engineers.
    Technologies: Hadoop, Apache Flink, Apache Kafka, AWS Cloud Architecture, Amazon DynamoDB, AWS Kinesis, Amazon CloudWatch, Redshift, Amazon S3 (AWS S3), Amazon Simple Queue Service (SQS), AWS Lambda, Amazon EC2, Python
  • Principal Member of Technical Staff (Full-time)

    2014 - 2015
    Salesforce.com
    • Developed several relevance features, which involved customizing Apache Lucene's scoring framework for Salesforce's needs.
    • Implemented infrastructure work to enable runtime feature extraction for training an ML-based ranker and to integrate the ranker into Apache Solr's query processing pipeline.
    • Designed the search infrastructure to scale out Salesforce search's static rank feature to 100% of documents (currently only partially enabled due to infrastructure limitations).
    Technologies: Apache Lucene, Apache Solr, Java
  • Senior Software Engineer (Full-time)

    2005 - 2014
    Microsoft (Bing Search)
    • Led a team of engineers to develop scalable infrastructure for a distributed web crawler and content extraction platform, enabling it to crawl hundreds of millions of web documents every day from hundreds of websites (such as Amazon.com, IMDb.com, and Walmart.com) and parse them to extract structured content for enriching Bing's search index.
    • Received a Microsoft Gold Star Award for the above project.
    • Developed a log-mining platform to enrich the local search index; it algorithmically discovers and mines URLs and search keywords associated with local businesses (restaurants, hotels, banks, etc.) from petabytes of search-result click logs. The platform is used in more than 20 Bing markets to enrich the local search index and close the URL coverage gap with Google.
    • Worked as both the technical lead and an individual contributor to enhance and evolve a machine learning-based text classification framework (originally conceived by Microsoft Research) into a classification platform and integrate it with the local data pipeline.
    • For the above project, developed a process to train, evaluate, and consume statistical models that classify hundreds of millions of local businesses around the world into a taxonomy of more than 1,000 categories.
    • Managed, as the tech lead, the day-to-day maintenance and operations of the local data ingestion and processing pipeline that feeds into the index of Bing's local search engine.
    • Worked on back-end data acquisition/processing pipeline for Bing Entertainment search (music, movies, TV shows, and more).
    Technologies: Machine Learning, Apache Hive, Hadoop, Microsoft SQL Server, C#, .NET
  • Professional Services Consultant (Full-time)

    2005 - 2006
    Teradata Corporation
    • Developed an automated ETL framework for DHL (a Teradata customer) to ingest data from multiple heterogeneous sources and integrate it into an enterprise data warehouse.
    • Led a team of four developers on the Eircom metadata-driven ETL tool project, which built generic parsing and transformation engines to extract data from more than 50 different semi-structured CDR formats. (Eircom is Ireland's leading telecommunications operator.)
    • Conducted Teradata training sessions and data warehouse workshops for new hires.
    Technologies: Teradata, Java, SQL
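
The real-time event ingestion and pub/sub work described in the Lyft entry above can be illustrated with a minimal Kafka producer/consumer sketch in Python (using the kafka-python package). The broker address, topic name, and event payload are placeholder assumptions for illustration, not details of Lyft's actual message bus.

    import json

    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = ["localhost:9092"]   # placeholder broker list
    TOPIC = "ride_events"          # hypothetical topic name

    # Producer side: a service publishes events to the bus asynchronously.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"event": "ride_requested", "ride_id": 123})
    producer.flush()

    # Consumer side: downstream services subscribe independently via consumer groups.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="analytics",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)       # each subscriber processes events at its own pace

Because producers and consumers share only a topic name, new subscribers can be added without touching the publishing services, which is the property that lets hundreds of microservices communicate over one bus.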

Experience

  • Lyft, Inc.

    Technologies: Python, AWS Cloud (EC2, Lambda, Kinesis, DynamoDB, SQS, S3, Redshift, CloudWatch), Apache Kafka, Apache Flink, Hadoop/Hive

  • Salesforce.com

    Technologies: Java, Apache Solr/Lucene, Search Relevancy

  • Microsoft (Bing Search)

    I worked on Bing's web crawling and content extraction framework (a minimal extraction sketch follows this Experience section).

    Technologies: C#/.NET, Microsoft SQL Server, Hadoop/Hive, Machine Learning

  • Teradata Corporation

    Technologies: SQL, Java, Teradata
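
To illustrate the content-extraction side of the Bing work referenced above, here is a minimal sketch that fetches a page and pulls a few structured fields out of its HTML using requests and BeautifulSoup. The URL and selectors are placeholder assumptions; real extraction rules would be per-site templates rather than generic tags.

    import requests
    from bs4 import BeautifulSoup

    def extract_page(url):
        """Fetch a page and extract a few structured fields from its HTML."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else None,
            # Placeholder rule: collect top-level headings as candidate entity names.
            "headings": [h.get_text(strip=True) for h in soup.find_all("h1")],
        }

    if __name__ == "__main__":
        print(extract_page("https://example.com/"))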

Skills

  • Languages

    Python, SQL, C#.NET, Java, HTML, XQuery, XML, XPath, JavaScript
  • Frameworks

    Hadoop, Scrapy, Flask, .NET, Django
  • Tools

    Amazon Simple Queue Service (SQS), Amazon CloudWatch, Zapier, Apache Solr, Git, Flink
  • Paradigms

    DevOps, ETL, Agile Software Development
  • Platforms

    AWS Lambda, Amazon EC2, Amazon Web Services (AWS), AWS Kinesis, Apache Kafka, Apache Flink, macOS, Linux
  • Storage

    Amazon DynamoDB, PostgreSQL, Redshift, Amazon S3 (AWS S3), Databases, Teradata, SQL Server 2010, Apache Hive, Elasticsearch, Microsoft SQL Server
  • Other

    Data Warehouse Design, Web Scraping, Data Warehousing, Amazon Kinesis Data Firehose, Big Data, Big Data Architecture, Stream Processing, Large Scale Distributed Systems, Pub/Sub, Machine Learning, Search Engine Development, Information Retrieval, Data Modeling, Text Classification, AWS Cloud Architecture
  • Libraries/APIs

    Apache Lucene

Education

  • Master's Degree in Computer Science and Engineering
    2009 - 2011
    University of Washington - Seattle, WA, USA
  • Bachelor's Degree in Computer Science
    2001 - 2005
    FAST | National University of Computer and Emerging Sciences - Islamabad, Pakistan

Certifications

  • Teradata Certified Master
    January 2006 - Present
    Teradata Corporation
