Data Engineer
Experienced Data Engineer proficient in AWS infrastructure (Glue, EMR, Redshift), Azure (Databricks, ADF, Microsoft Fabric), Spark, PySpark, Data Lake, Airflow, and PL/SQL. Skilled in Python scripting, query optimization, automation, and pipeline development. Recognized for optimizing Redshift report generation and for data governance and Promising Debut awards.
Experience
Data Engineer
ADF Data Science Pvt. Ltd.
JUNE 2021
Centralized Reporting System with RAG LLM Integration
- Designed and implemented a reporting platform to consolidate data across business teams, using automation and AI (RAG, Pinecone, fine-tuning) for enhanced efficiency.
- Built an ETL pipeline to extract email reports, storing HTML content and attachments in S3 under a structured key layout (sketched after this list).
- Integrated an LLM agent to summarize reports and provide historical insights within a single web application.
- Optimized data processing workflows to generate faster reports and improve accessibility for end-users.
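A minimal sketch of the S3 storage step, assuming boto3 and hypothetical bucket and key names; the actual pipeline also covered email extraction and metadata tracking:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "team-reports-bucket"  # hypothetical bucket name

def store_email_report(team: str, report_date: str, html_body: str,
                       attachments: dict) -> None:
    """Store a report's HTML body and attachments under a structured
    prefix such as <team>/<date>/..."""
    prefix = f"{team}/{report_date}"
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}/report.html",
                  Body=html_body.encode("utf-8"), ContentType="text/html")
    for filename, payload in attachments.items():  # payload: raw bytes
        s3.put_object(Bucket=BUCKET,
                      Key=f"{prefix}/attachments/{filename}", Body=payload)
```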
Created data pipelines with varied activities and data flows across multiple sources and destinations to satisfy given business needs:
- Developed data pipelines with diverse activities and flow patterns using AWS DMS
- Created a data quality alert system covering a range of failure possibilities (see the sketch after this list)
- Met MLE (machine learning engineering) business requirements with these pipelines
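An illustrative data quality check of the kind such an alert system runs; the table path, key column, and alerting channel are assumptions:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-alerts").getOrCreate()

def check_table(path: str, key_col: str) -> list:
    """Return a list of data quality issues found in the table."""
    df = spark.read.parquet(path)
    issues = []
    if df.count() == 0:
        issues.append("table is empty")
    null_keys = df.filter(F.col(key_col).isNull()).count()
    if null_keys:
        issues.append(f"{null_keys} null values in key column '{key_col}'")
    dupes = df.count() - df.dropDuplicates([key_col]).count()
    if dupes:
        issues.append(f"{dupes} duplicate keys")
    return issues

issues = check_table("s3://lake/curated/orders/", "order_id")  # hypothetical
if issues:
    print("DQ ALERT:", "; ".join(issues))  # real system would notify a channel
```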
Extracted data from various third-party sources using a modern technology stack:
- Extracted data via REST APIs, web scraping, and the Microsoft Graph API
- Used MongoDB to extract tags feeding machine learning model tables (see the sketch below)
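A sketch of the MongoDB tag extraction, with hypothetical connection details and field names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
coll = client["ml_meta"]["documents"]              # hypothetical db/collection

# Project only the tags field and flatten to (doc_id, tag) rows that can
# be joined against the ML model tables.
rows = [
    (doc["_id"], tag)
    for doc in coll.find({"tags": {"$exists": True}}, {"tags": 1})
    for tag in doc["tags"]
]
```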
Scheduled and monitored tasks via Jenkins (CI/CD):
- Used Jenkins to schedule and monitor pipeline tasks
- Leveraged Jenkins' Continuous Integration and Continuous Deployment (CI/CD) capabilities
Implemented an open-source data catalog tool:
- Deployed Apache Atlas as an open-source data catalog
- Facilitated data discovery and management with this tool (a minimal API sketch follows)
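A hedged sketch of registering an RDS table in Apache Atlas through its v2 REST API; the host, credentials, type name, and attribute values are assumptions based on Atlas's standard RDBMS type model:

```python
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"  # hypothetical Atlas host
AUTH = ("admin", "admin")                      # replace with real credentials

entity = {
    "entity": {
        "typeName": "rdbms_table",  # from Atlas's RDBMS type model
        "attributes": {
            "name": "orders",
            "qualifiedName": "prod_rds.public.orders@cluster1",
        },
    }
}
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
resp.raise_for_status()
```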
ETL with Pentaho, Python, and PySpark:
- Used Pentaho for Extract, Transform, Load (ETL) operations (see the invocation sketch below)
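One plausible way to drive a Pentaho Data Integration job from Python is to shell out to the kitchen.sh CLI; the install path and job file here are hypothetical:

```python
import subprocess

# Run a PDI job and surface its log output on failure.
result = subprocess.run(
    ["/opt/pentaho/data-integration/kitchen.sh",
     "-file=/etl/jobs/daily_load.kjb",  # hypothetical PDI job file
     "-level=Basic"],
    capture_output=True, text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"PDI job failed:\n{result.stdout}")
```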
Education
| Degree/Grade | Institution | Duration | Performance |
|---|---|---|---|
| MCA | SRM Institute of Science and Technology, Chennai | JUNE 2019 - MAY 2021 | 82% |
| BCA | Aadhiparasakthi College of Arts and Science, Vellore | JUNE 2016 - MAY 2019 | 76% |
| 12th Grade | Anderson Hr.Sec. School, Kanchipuram | Year of Passing - 2016 | 69% |
| 10th Grade | Anderson Hr.Sec. School, Kanchipuram | Year of Passing - 2014 | 85% |
Skills
- Python, PySpark, Spark
- Data Engineering: AWS, preprocessing, data modelling, data governance, data stewardship, database administration, DevOps (basics), API integration, automation, optimization techniques
- Data Quality: data profiling, data validation, data enrichment, data monitoring
- Database Architecture: SQL, PL/SQL, MongoDB, BigQuery
- AWS: Redshift, DMS, RDS, Lambda, CloudWatch, VPC, EMR
- Azure: Azure Databricks, Azure Data Factory, ADLS Gen2
- CI/CD & Orchestration: Jira, Bitbucket, Terraform, Jenkins, Airflow
- Web: HTML, CSS, Bootstrap
- Machine Learning Basics: linear/logistic regression, SVM, random forest, decision tree, k-nearest neighbors, survival analysis
Projects
Azure Data Factory Promise Table Migration with PySpark Processing - Led the migration of promise tables into Azure Data Lake Storage Gen2 (ADLS Gen2) through Azure Data Factory, ensuring seamless data transfer and storage. Leveraging PySpark on Azure Databricks, orchestrated efficient data processing pipelines for in-depth analysis. Optimized ETL orchestration and implemented data quality checks to ensure the reliability and accuracy of promise table data. The project focused on scalability and performance, enabling the processing of large datasets with ease. Monitoring and logging mechanisms supported continuous improvement of the pipelines, facilitating informed decision-making. A condensed sketch of the processing step follows.
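This sketch uses hypothetical storage account, container, and column names:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths (abfss://<container>@<account>.dfs.core.windows.net/)
raw = "abfss://raw@account.dfs.core.windows.net/promise_tables/"
curated = "abfss://curated@account.dfs.core.windows.net/promise_tables/"

df = spark.read.parquet(raw)
clean = (df.dropDuplicates(["promise_id"])              # hypothetical key column
           .filter(F.col("promise_date").isNotNull()))  # basic quality check
clean.write.mode("overwrite").partitionBy("promise_date").parquet(curated)
```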
Seamless Third-Party Data Integration - Orchestrated the loading of third-party data into Redshift through API integration, web scraping, and the Microsoft Graph API for Outlook, ensuring a continuous influx of relevant data for analysis (see the Graph API sketch below).
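A sketch of pulling Outlook report mail through Microsoft Graph; token acquisition (e.g., via MSAL) is omitted and the filter value is hypothetical:

```python
import requests

token = "..."  # acquired via an OAuth client-credentials flow
url = "https://graph.microsoft.com/v1.0/me/messages"
params = {"$filter": "startswith(subject, 'Daily Report')", "$top": 50}

while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"},
                        params=params)
    resp.raise_for_status()
    page = resp.json()
    for msg in page["value"]:
        pass  # hand subject/body/attachment references to the loader
    url = page.get("@odata.nextLink")  # Graph's pagination cursor
    params = None                      # nextLink already embeds the query
```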
Creating Data Mart Tables Using CDC - CDC files produced by DMS in S3 were extracted into Redshift with a proper naming convention. PySpark was used for transformation and variable mapping. Daily jobs fetched the most recent historical CDC files and wrote them to the target table in Redshift. Five reports were generated using the EDW tables as the data source. A sketch of the latest-record selection follows.
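This sketch collapses DMS CDC files to the latest row per key before the Redshift load; the key, timestamp, and operation columns are assumptions modeled on typical DMS output:

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
cdc = spark.read.parquet("s3://dms-bucket/cdc/orders/")  # hypothetical path

w = Window.partitionBy("order_id").orderBy(F.col("commit_ts").desc())
latest = (cdc.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")             # newest change per key
             .filter(F.col("op") != "D")   # drop keys whose last change was a delete
             .drop("rn"))
# `latest` is then staged to S3 and COPYed (or written via a Redshift connector).
```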
Streamlined Machine Learning QA Processes - Implemented a novel methodology for QA of MLE model tables, cutting manual QA time by 70% and accelerating the process by 65% compared to previous methods. A comparison sketch follows.
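An illustrative automated QA comparison of the kind that replaces manual diffing; the table paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
actual = spark.read.parquet("s3://lake/mle/model_features/")      # hypothetical
expected = spark.read.parquet("s3://lake/qa/expected_features/")  # hypothetical

# Rows present on one side but not the other; both empty means the tables
# match and the manual review step can be skipped.
missing = expected.exceptAll(actual)
extra = actual.exceptAll(expected)
assert missing.count() == 0 and extra.count() == 0, "QA mismatch detected"
```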
Implementation of Data Governance Tool Using Apache Atlas - Apache Atlas was deployed using Docker. RDS metadata was imported into Atlas via its REST API, and an automated script was created to ingest RDS entities into Atlas. To ensure secure data governance, role-based policies were implemented and assigned to the relevant teams. The data dictionary is uploaded via Jenkins for each release, and Atlas automatically populates it with the necessary data.
Nush Shopping - Fashion e-commerce shopping website built using Bootstrap, HTML5, and CSS.
Certifications
Google BigQuery & PostgreSQL: BigQuery for Data Analysis
Data Engineering Essentials: SQL, Python, and Spark
MongoDB Basics (M001)
Core Java Certification
Achievement
Received the Promising Debut award for the year 2022.
