Data Engineer
Experienced Data Engineer proficient in AWS infrastructure (Glue, EMR, Redshift), Azure (Databricks, ADF, Microsoft Fabric), Spark, PySpark, Data Lake, Airflow, and PL/SQL. Skilled in Python scripting, query optimization, automation, and pipeline development. Recognized for optimizing Redshift report generation and for data governance and Promising Debut awards.
Experience
Data Engineer
ADF Data Science Pvt. Ltd.
JUNE 2021
Centralized Reporting System with RAG LLM Integration
- Designed and implemented a reporting platform to consolidate data across business teams, using automation and AI (RAG, Pinecone, fine-tuning) for enhanced efficiency.
- Built an ETL pipeline to extract email reports, storing HTML content and attachments in S3 under a structured key layout (sketched after this list).
- Integrated an LLM agent to summarize reports and provide historical insights within a single web application.
- Optimized data processing workflows to generate faster reports and improve accessibility for end-users.
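A minimal sketch of the S3 storage step, assuming boto3 and hypothetical bucket and key names; the actual pipeline also covered email extraction and metadata tracking:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "team-reports-bucket"  # hypothetical bucket name

def store_email_report(team: str, report_date: str, html_body: str,
                       attachments: dict) -> None:
    """Store a report's HTML body and attachments under a structured
    prefix such as <team>/<date>/..."""
    prefix = f"{team}/{report_date}"
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}/report.html",
                  Body=html_body.encode("utf-8"), ContentType="text/html")
    for filename, payload in attachments.items():  # payload: raw bytes
        s3.put_object(Bucket=BUCKET,
                      Key=f"{prefix}/attachments/{filename}", Body=payload)
```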
Created data pipelines with varied activities and data flows across multiple sources and destinations to satisfy given business needs:
- Developed data pipelines with diverse activities and flow patterns using AWS DMS
- Created a data quality alert system covering a range of failure possibilities (see the sketch after this list)
- Met MLE (machine learning engineering) business requirements with these pipelines
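An illustrative data quality check of the kind such an alert system runs; the table path, key column, and alerting channel are assumptions:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-alerts").getOrCreate()

def check_table(path: str, key_col: str) -> list:
    """Return a list of data quality issues found in the table."""
    df = spark.read.parquet(path)
    issues = []
    if df.count() == 0:
        issues.append("table is empty")
    null_keys = df.filter(F.col(key_col).isNull()).count()
    if null_keys:
        issues.append(f"{null_keys} null values in key column '{key_col}'")
    dupes = df.count() - df.dropDuplicates([key_col]).count()
    if dupes:
        issues.append(f"{dupes} duplicate keys")
    return issues

issues = check_table("s3://lake/curated/orders/", "order_id")  # hypothetical
if issues:
    print("DQ ALERT:", "; ".join(issues))  # real system would notify a channel
```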
Extracted data from various third-party sources using a modern technology stack:
- Extracted data via REST APIs, web scraping, and the Microsoft Graph API
- Used MongoDB to extract tags feeding machine learning model tables (see the sketch below)
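A sketch of the MongoDB tag extraction, with hypothetical connection details and field names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
coll = client["ml_meta"]["documents"]              # hypothetical db/collection

# Project only the tags field and flatten to (doc_id, tag) rows that can
# be joined against the ML model tables.
rows = [
    (doc["_id"], tag)
    for doc in coll.find({"tags": {"$exists": True}}, {"tags": 1})
    for tag in doc["tags"]
]
```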
Scheduled and monitored tasks via Jenkins (CI/CD):
- Used Jenkins to schedule and monitor pipeline tasks
- Leveraged Jenkins' Continuous Integration and Continuous Deployment (CI/CD) capabilities
Implemented an open-source data catalog tool:
- Deployed Apache Atlas as an open-source data catalog
- Facilitated data discovery and management with this tool (a minimal API sketch follows)
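A hedged sketch of registering an RDS table in Apache Atlas through its v2 REST API; the host, credentials, type name, and attribute values are assumptions based on Atlas's standard RDBMS type model:

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # hypothetical Atlas host
AUTH = ("admin", "admin")                      # replace with real credentials

entity = {
    "entity": {
        "typeName": "rdbms_table",  # from Atlas's RDBMS type model
        "attributes": {
            "name": "orders",
            "qualifiedName": "prod_rds.public.orders@cluster1",
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
resp.raise_for_status()
```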
ETL with Pentaho, Python, and PySpark:
- Used Pentaho for Extract, Transform, Load (ETL) operations (see the invocation sketch below)
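One plausible way to drive a Pentaho Data Integration job from Python is to shell out to the kitchen.sh CLI; the install path and job file here are hypothetical:

```python
import subprocess

# Run a PDI job and surface its log output on failure.
result = subprocess.run(
    ["/opt/pentaho/data-integration/kitchen.sh",
     "-file=/etl/jobs/daily_load.kjb",  # hypothetical PDI job file
     "-level=Basic"],
    capture_output=True, text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"PDI job failed:\n{result.stdout}")
```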
Education
| Degree/Grade | Institution | Duration | Performance |
|---|---|---|---|
| MCA | SRM Institute of Science and Technology, Chennai | JUNE 2019 - MAY 2021 | 82% |
| BCA | Aadhiparasakthi College of Arts and Science, Vellore | JUNE 2016 - MAY 2019 | 76% |
| 12th Grade | Anderson Hr.Sec. School, Kanchipuram | Year of Passing - 2016 | 69% |
| 10th Grade | Anderson Hr.Sec. School, Kanchipuram | Year of Passing - 2014 | 85% |
Skills
- Python, PySpark, Spark
- Data Engineering: AWS, preprocessing, data modelling, data governance, data stewardship, database administration, DevOps (basics), API integration, automation, optimization techniques
- Data Quality: data profiling, data validation, data enrichment, data monitoring
- Database Architecture: SQL, PL/SQL, MongoDB, BigQuery
- AWS: Redshift, DMS, RDS, Lambda, CloudWatch, VPC, EMR
- Azure: Azure Databricks, Azure Data Factory, ADLS Gen2
- CI/CD & Orchestration: Jira, Bitbucket, Terraform, Jenkins, Airflow
- Web: HTML, CSS, Bootstrap
- Machine Learning Basics: linear/logistic regression, SVM, random forest, decision tree, k-nearest neighbors, survival analysis
Projects
Azure Data Factory Promise Table Migration with PySpark Processing - Led the migration of promise tables into Azure Data Lake Storage Gen2 (ADLS Gen2) through Azure Data Factory, ensuring seamless data transfer and storage. Leveraging PySpark on Azure Databricks, orchestrated efficient data processing pipelines for in-depth analysis. Optimized ETL orchestration and implemented data quality checks to ensure the reliability and accuracy of promise table data. The project focused on scalability and performance, enabling the processing of large datasets with ease. Monitoring and logging mechanisms supported continuous improvement of the pipelines, facilitating informed decision-making. A condensed sketch of the processing step follows.
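This sketch uses hypothetical storage account, container, and column names:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths (abfss://<container>@<account>.dfs.core.windows.net/)
raw = "abfss://raw@account.dfs.core.windows.net/promise_tables/"
curated = "abfss://curated@account.dfs.core.windows.net/promise_tables/"

df = spark.read.parquet(raw)
clean = (df.dropDuplicates(["promise_id"])              # hypothetical key column
           .filter(F.col("promise_date").isNotNull()))  # basic quality check
clean.write.mode("overwrite").partitionBy("promise_date").parquet(curated)
```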
Seamless Third-Party Data Integration - Orchestrated the loading of third-party data into Redshift through API integration, web scraping, and the Microsoft Graph API for Outlook, ensuring a continuous influx of relevant data for analysis (see the Graph API sketch below).
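A sketch of pulling Outlook report mail through Microsoft Graph; token acquisition (e.g., via MSAL) is omitted and the filter value is hypothetical:

```python
import requests

token = "..."  # acquired via an OAuth client-credentials flow
url = "https://graph.microsoft.com/v1.0/me/messages"
params = {"$filter": "startswith(subject, 'Daily Report')", "$top": 50}

while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"},
                        params=params)
    resp.raise_for_status()
    page = resp.json()
    for msg in page["value"]:
        pass  # hand subject/body/attachment references to the loader
    url = page.get("@odata.nextLink")  # Graph's pagination cursor
    params = None                      # nextLink already embeds the query
```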
Creating Data Mart Tables Using CDC - CDC files produced by DMS in S3 were extracted into Redshift with a proper naming convention. PySpark was used for transformation and variable mapping. Daily jobs fetched the most recent historical CDC files and wrote them to the target table in Redshift. Five reports were generated using the EDW tables as the data source. A sketch of the latest-record selection follows.
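This sketch collapses DMS CDC files to the latest row per key before the Redshift load; the key, timestamp, and operation columns are assumptions modeled on typical DMS output:

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
cdc = spark.read.parquet("s3://dms-bucket/cdc/orders/")  # hypothetical path

w = Window.partitionBy("order_id").orderBy(F.col("commit_ts").desc())
latest = (cdc.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")             # newest change per key
             .filter(F.col("op") != "D")   # drop keys whose last change was a delete
             .drop("rn"))
# `latest` is then staged to S3 and COPYed (or written via a Redshift connector).
```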
Streamlined Machine Learning QA Processes - Implemented a novel methodology for QA of MLE model tables, cutting manual QA time by 70% and accelerating the process by 65% compared to previous methods. A comparison sketch follows.
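An illustrative automated QA comparison of the kind that replaces manual diffing; the table paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
actual = spark.read.parquet("s3://lake/mle/model_features/")      # hypothetical
expected = spark.read.parquet("s3://lake/qa/expected_features/")  # hypothetical

# Rows present on one side but not the other; both empty means the tables
# match and the manual review step can be skipped.
missing = expected.exceptAll(actual)
extra = actual.exceptAll(expected)
assert missing.count() == 0 and extra.count() == 0, "QA mismatch detected"
```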
Implementation of Data Governance Tool Using Apache Atlas - Apache Atlas was deployed using Docker. RDS metadata was imported into Atlas via its REST API, and an automated script was created to ingest RDS entities into Atlas. To ensure secure data governance, role-based policies were implemented and assigned to the relevant teams. The data dictionary is uploaded via Jenkins for each release, and Atlas automatically populates it with the necessary data.
Nush Shopping - Fashion e-commerce shopping website built using Bootstrap, HTML5, and CSS.
Certifications
Google BigQuery & PostgreSQL: BigQuery for Data Analysis
Data Engineering Essentials: SQL, Python, and Spark
MongoDB Basics (M001)
Core Java Certification
Achievement
Received the Promising Debut award for the year 2022.
