
Praveen Raju

Verified Expert in Engineering

Software Developer

Toronto, ON, Canada

Toptal member since July 22, 2024

Bio

Praveen is a data engineer with eight years of experience designing scalable data solutions across various industries. He's proficient in big data technologies such as Apache Spark, Hadoop, and AWS and specializes in data workflow development with Scala and Java. Skilled in Amazon RDS, Amazon S3 (AWS S3), EMR, Neptune, and Microsoft Power BI, Praveen reduces processing times and costs while improving business outcomes through Agile methodologies and DevOps practices.

Portfolio

Lasik MD Vision
Apache Spark, Hadoop, Amazon Elastic MapReduce (EMR), Amazon RDS...
BSNL
Java, Apache Spark, Hadoop, Amazon Elastic MapReduce (EMR), Apache Pig...
Paramount Airways
Scala, Hadoop, Spark, Apache Airflow, Java, Amazon Elastic MapReduce (EMR)...

Experience

  • Hadoop - 7 years
  • Apache Spark - 7 years
  • Python - 7 years
  • Apache Pig - 6 years
  • Apache Airflow - 6 years
  • Java - 6 years
  • Scala - 6 years

Availability

Full-time

Preferred Environment

Apache Spark, Apache Airflow, Java, Scala, Hadoop, Amazon Web Services (AWS), Azure, Apache Pig, MapReduce

The most amazing...

...outcome analysis tool I've developed using Apache Spark and Hadoop boosted data processing speed significantly.

Work Experience

Data Engineer

2022 - 2024
Lasik MD Vision
  • Developed custom Apache Spark applications to manage real-time data streams, cutting data processing latency by 37% and markedly improving performance and responsiveness (a minimal streaming sketch follows this role).
  • Built high-performance data processing applications in Scala, increasing processing speed by 40%.
  • Implemented scalable data processing workflows with Hadoop, increasing data processing throughput by 44%.
  • Designed and implemented data orchestration workflows in Apache Airflow, improving pipeline automation and reducing manual oversight.
  • Integrated EMR with other AWS services such as Amazon S3 and Redshift, improving data flow and accessibility by 20%.
  • Increased system reliability by implementing fault-tolerant data processing pipelines with Apache Spark.
  • Optimized Apache Pig scripts for complex data transformations, improving processing efficiency by 30%.
  • Designed and implemented robust data integration solutions in Java, improving system reliability.
  • Tuned Hadoop and Spark jobs running on EMR, improving job execution times.
  • Created robust, maintainable ETL pipelines in Scala, reducing data transformation errors.
Technologies: Apache Spark, Hadoop, Amazon Elastic MapReduce (EMR), Amazon RDS, Apache Airflow, Amazon SageMaker, Microsoft Power BI, SQL, Data Engineering, PySpark, Databricks, Kubernetes, T-SQL (Transact-SQL), AWS CodeBuild, AWS Glue, AWS Lambda, AWS Step Functions, Linux
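
As context for the streaming work in this role, here is a minimal Spark Structured Streaming sketch in PySpark of the pattern described. The Kafka broker, topic, event schema, and S3 paths are hypothetical stand-ins, not details from the actual project.

# Minimal Spark Structured Streaming sketch (hypothetical topic, schema, paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("realtime-events").getOrCreate()

# Assumed event schema -- an illustration, not the project's real schema.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw stream from Kafka (hypothetical broker and topic names).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the JSON payload into typed columns.
events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Write micro-batches to Parquet with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-bucket/events/")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
         .start())

query.awaitTermination()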

BI Reporting Analyst

2020 - 2022
BSNL
  • Migrated legacy data processing systems to Apache Spark, reducing maintenance costs by 30%.
  • Optimized existing Scala codebases, cutting execution time and resource consumption by 25%.
  • Designed and implemented efficient MapReduce jobs for large-scale data processing, increasing processing speed by 45%.
  • Implemented and managed Amazon Neptune instances to support graph-based queries, speeding up data retrieval and improving complex data relationship analysis.
  • Architected and managed data solutions on EMR, reducing infrastructure costs.
  • Configured and managed Jetty servers hosting large-scale web applications, improving responsiveness and uptime.
  • Optimized Apache Airflow configurations to improve the scheduling and execution of complex data tasks, increasing workflow efficiency (a minimal DAG sketch follows this role).
  • Developed and maintained robust data processing applications in Python, enhancing data analysis capabilities and reducing processing time.
  • Leveraged Scala to build scalable microservices for data integration, improving system scalability.
Technologies: Java, Apache Spark, Hadoop, Amazon Elastic MapReduce (EMR), Apache Pig, Apache Airflow, Oozie, Apache Maven, Amazon Athena, Microsoft Power BI, Amazon Neptune, Jetty, Scala, SQL, Data Engineering, PySpark, Databricks, AWS CodeBuild, AWS Glue, AWS Lambda, AWS Step Functions
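
To make the Airflow scheduling work above concrete, here is a minimal DAG sketch. The DAG ID, task names, and the extract/load callables are hypothetical illustrations, not the actual pipeline.

# Minimal Airflow DAG sketch (hypothetical DAG ID, tasks, and callables).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull a day's records from the source system.
    print("extracting records for", context["ds"])


def load(**context):
    # Placeholder: write transformed records to the warehouse.
    print("loading records for", context["ds"])


with DAG(
    dag_id="daily_records_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extraction before loading.
    extract_task >> load_task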

BI Engineer

2018 - 2020
Paramount Airways
  • Developed custom operators and directed acyclic graphs (DAGs) in Apache Airflow, improving pipeline customization and extending functionality by 30% (a custom-operator sketch follows this role).
  • Optimized Java code for data-intensive applications, cutting execution time by 50%.
  • Implemented data analytics and processing workflows on EMR, reducing costs and optimizing resource allocation.
  • Designed and managed distributed storage solutions with Hadoop, improving data storage efficiency.
  • Introduced advanced data analytics algorithms in Scala, improving predictive model accuracy.
  • Tuned performance and resource usage of Apache Spark clusters, improving cluster utilization.
  • Integrated Apache Spark with other big data tools to increase data processing efficiency.
Technologies: Scala, Hadoop, Spark, Apache Airflow, Java, Amazon Elastic MapReduce (EMR), Jetty, Amazon Neptune, Python, MapReduce, Data Engineering, PySpark, Databricks, T-SQL (Transact-SQL), AWS Glue, AWS Lambda, Linux
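
As a sketch of the custom-operator pattern mentioned in this role, the following example subclasses Airflow's BaseOperator. The operator name and its row-count check are hypothetical, not the project's actual code.

# Minimal custom Airflow operator sketch (hypothetical name and logic).
from airflow.models import BaseOperator


class RowCountCheckOperator(BaseOperator):
    """Fail the task if a table reports fewer rows than expected."""

    def __init__(self, table: str, min_rows: int, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.min_rows = min_rows

    def execute(self, context):
        # A real operator would query the table here; stubbed for illustration.
        row_count = self._fetch_row_count()
        if row_count < self.min_rows:
            raise ValueError(
                f"{self.table} has {row_count} rows, expected >= {self.min_rows}"
            )
        self.log.info("%s passed with %d rows", self.table, row_count)

    def _fetch_row_count(self) -> int:
        return 0  # stub for illustration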

BI Junior Data Engineer

2016 - 2018
Accenture
  • Implemented data caching strategies to improve the performance of BI reports and dashboards.
  • Developed and maintained data governance frameworks to ensure compliance with industry regulations such as SOX and PCI DSS.
  • Automated data validation and cleansing with Apache Spark, reducing data errors (a minimal PySpark sketch follows this role).
  • Migrated legacy ETL processes to Scala, reducing processing time by 45%.
  • Created custom MapReduce jobs in Hadoop to increase data processing speed.
  • Built custom data validation and cleansing tools in Java to reduce data errors.
Technologies: Spark, Apache Airflow, Hadoop, Java, Scala, Amazon Elastic MapReduce (EMR), SOX, PCI DSS, AWS CodeBuild
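
Here is a minimal sketch of the kind of Spark-based validation and cleansing step described above; the column names, rules, and S3 paths are hypothetical.

# Minimal PySpark validation/cleansing sketch (hypothetical columns and paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.appName("validate-cleanse").getOrCreate()

df = spark.read.csv("s3://example-bucket/raw/orders.csv",
                    header=True, inferSchema=True)

valid = (
    df.dropDuplicates(["order_id"])           # drop duplicate orders
      .filter(col("order_id").isNotNull())    # require a primary key
      .filter(col("amount") >= 0)             # reject negative amounts
)

# Normalize whitespace after validation so records stay comparable.
cleaned = valid.withColumn("customer_name", trim(col("customer_name")))

# Quarantine rejected rows for review instead of silently dropping them.
rejected = df.subtract(valid)
rejected.write.mode("overwrite").parquet("s3://example-bucket/quarantine/orders/")
cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/orders/")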

Projects

Outcome Analysis Tool

I developed a patient outcome analysis tool using Apache Spark and Hadoop for scalable data processing, leveraging Scala for robust workflow development. I also integrated the tool with EMR for efficient large dataset management and Apache Airflow for data orchestration. In addition, I used Amazon RDS for secure data storage and Amazon SageMaker for advanced treatment effectiveness analytics, complemented by Microsoft Power BI visualizations. The tool also offers data anonymization and encryption to ensure patient privacy and regulatory compliance.
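
As an illustration of the anonymization step this tool performs, here is a minimal PySpark sketch that hashes patient identifiers before analysis. The column names are hypothetical, and the inline salt is for brevity only; a real deployment would load it from a secrets manager.

# Minimal PySpark anonymization sketch (hypothetical columns; inline salt
# shown for brevity -- in practice it would come from a secrets manager).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, sha2

spark = SparkSession.builder.appName("anonymize-outcomes").getOrCreate()

outcomes = spark.read.parquet("s3://example-bucket/outcomes/")

anonymized = (
    outcomes
    # Replace the identifier with a salted SHA-256 digest.
    .withColumn("patient_key",
                sha2(concat(col("patient_id"), lit("static-salt")), 256))
    # Drop direct identifiers before the data reaches analysts.
    .drop("patient_id", "patient_name")
)

anonymized.write.mode("overwrite").parquet("s3://example-bucket/outcomes_anon/")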

Call Center Performance Tool

This project involved developing a call center performance tool using Apache Spark and Hadoop for data ingestion and processing. Advanced workflows were crafted in Java and managed via EMR. I employed Apache Airflow for data orchestration and Oozie for workflow scheduling. Apache Pig was used for data transformations and Apache Maven for dependency management. In addition, Microsoft Power BI was used to visualize KPIs, agent metrics, and customer satisfaction scores.
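
A minimal sketch of the kind of KPI aggregation such a tool could feed into Power BI; the call-record schema and metric definitions are hypothetical.

# Minimal PySpark KPI aggregation sketch (hypothetical schema and metrics).
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, count, sum as spark_sum

spark = SparkSession.builder.appName("callcenter-kpis").getOrCreate()

calls = spark.read.parquet("s3://example-bucket/calls/")

kpis = (
    calls.groupBy("agent_id")
    .agg(
        count("*").alias("calls_handled"),
        avg("handle_time_sec").alias("avg_handle_time_sec"),
        # First-call resolution rate: share of calls flagged as resolved.
        (spark_sum(col("resolved").cast("int")) / count("*")).alias("fcr_rate"),
    )
)

# Persist for the Power BI dataset to pick up.
kpis.write.mode("overwrite").parquet("s3://example-bucket/kpis/agents/")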

Analytics Visualization Solution

This project entailed creating a BI analytics solution to visualize sales data, inventory levels, and customer trends. I leveraged Apache Spark and Hadoop for data processing, with workflows crafted in Scala and Java. I orchestrated data integration from Amazon RDS and AWS S3 using Apache Airflow, Jetty for real-time data feeds, and Amazon Neptune for complex data relationship management and query enhancement. In addition, I utilized Python for data manipulation and MapReduce for data aggregation, providing actionable insights for business stakeholders to drive strategic decisions.
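
To show the integration pattern in miniature, here is a sketch that joins an Amazon RDS table (read over JDBC) with Parquet files in S3 using PySpark. Connection details, table names, and columns are all hypothetical, and credentials would come from cluster configuration in practice.

# Minimal PySpark integration sketch (hypothetical connections and schemas).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-integration").getOrCreate()

# Dimension data from Amazon RDS over JDBC.
products = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://example-rds:5432/sales")
            .option("dbtable", "products")
            .option("user", "reporter")
            .option("password", "example-password")
            .load())

# Fact data landed in S3 as Parquet.
orders = spark.read.parquet("s3://example-bucket/orders/")

# Join and aggregate for the BI layer.
sales_by_category = (
    orders.join(products, "product_id")
          .groupBy("category")
          .agg({"amount": "sum"})
          .withColumnRenamed("sum(amount)", "total_sales")
)

sales_by_category.write.mode("overwrite").parquet(
    "s3://example-bucket/bi/sales_by_category/")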

Analytics on Call Detail Records

I analyzed call detail records (CDRs) for a telecommunications company using Apache Spark and Hadoop to handle massive datasets efficiently. I developed complex data extraction and transformation workflows using Scala and Java, integrated with EMR for scalable data processing. I also employed Apache Airflow to orchestrate and automate these workflows, ensuring seamless data operations, precise querying, and efficient data analysis to identify key trends in call volume.
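
Here is a minimal sketch of the kind of call-volume trend query described; the CDR schema is a hypothetical illustration.

# Minimal PySpark CDR trend sketch (hypothetical schema).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("cdr-trends").getOrCreate()

cdrs = spark.read.parquet("s3://example-bucket/cdrs/")

# Hourly call volume per region, bucketed with a tumbling time window.
hourly_volume = (
    cdrs.groupBy(window(col("start_time"), "1 hour"), col("region"))
        .agg(count("*").alias("calls"))
        .orderBy("window")
)

hourly_volume.write.mode("overwrite").parquet(
    "s3://example-bucket/analytics/hourly_volume/")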

Education

2011 - 2015

Bachelor's Degree in Electronics and Communication Engineering

Anna University - India

Libraries/APIs

PySpark

Tools

Apache Airflow, AWS Glue, AWS Step Functions, Amazon Elastic MapReduce (EMR), Oozie, AWS CodeBuild, Amazon SageMaker, Microsoft Power BI, Apache Maven, Amazon Athena, Jetty

Languages

Scala, Python, SQL, T-SQL (Transact-SQL), Java

Frameworks

Apache Spark, Hadoop

Paradigms

MapReduce

Platforms

Amazon Web Services (AWS), Databricks, Kubernetes, Linux, Apache Pig, AWS Lambda, Azure

Other

Data Engineering, Amazon Neptune, Software, Electronics, Data Communication, Amazon RDS, Electronic Medical Records (EMR), SOX, PCI DSS
