Data Engineering Intern [PFE]

Tunisia

About us:



At Cognira, we strongly believe that people are the biggest asset of our company. Our hand-picked team consists of passionate, collaborative, and forward-thinking individuals from all over the globe. We are passionate about making science easy and accessible to retailers, helping them get more value from people, data, and systems. We bring together expertise in retail, science, and scalable technologies to automate and enhance the quality of decision-making through software and consulting services.



For three years in a row, Cognira has been recognized as one of the fastest-growing companies in North America. We are proud to have a growing team of domain experts and data scientists, as well as a culture that fosters strong and long-lasting relationships with our clients.


Our values:




  • Stand up for what’s right

  • Customers are always first

  • Think like an Entrepreneur. Act like a CEO.

  • Learn, Unlearn, Relearn

  • No brilliant jerks allowed

  • All work and no play is no fun at all


Important: Please submit your resume in English only.


About this internship:



You will be part of a high-growth software company. Our program is designed so interns can grow their skill sets, do meaningful work, and have a lot of fun along the way!




  • Over the course of the internship, you will be exposed to a wide range of Cognira’s tools, techniques, and technologies, with the opportunity to gain valuable, hands-on experience and learning.

  • This internship will be entirely in-person so that you can get an in-depth experience of the company's culture and be more involved throughout your tenure.

  • Duration: 4-6 months.


We're looking for highly talented & motivated interns to join our Data Engineering team and nail one of the following projects: 


Project 1: Event-Driven ETL Pipeline



In many data workflows, certain data is ingested sporadically, with new files or updates appearing irregularly rather than on a predictable schedule. Setting up a daily processing pipeline for such data can be costly and inefficient, as it often sits idle. An event-driven pipeline addresses this by triggering data processing only when new data arrives, ensuring resources are used effectively and data processing is initiated exactly when needed.




  • Goal: Create an event-driven pipeline to process data based on events, mainly file uploads to the data lake.

  • Steps:
    - Event Listening: Set up an event-driven architecture using a message broker like Kafka or a cloud webhook such as the Prefect API.
    - Event Configuration: Configure event sources to trigger events for file drops to the Azure Data Lake (ADLS).
    - Data Processing: Load the data from the ADLS and process it using Spark.
    - Data Validation: Implement data quality validation checks and generate reports to flag inconsistent data.
    - Data Load: Load the processed data to PostgreSQL.
    - Pipeline Automation: Automate the ETL pipeline using an Airflow DAG.
    - DAG Triggering: Configure a service to trigger the ETL pipeline in Airflow with every new detected event.

  • Tech stack: Scala / Python / Spark / Prefect or Kafka / Airflow / Docker / Kubernetes / Azure Data Lake / PostgreSQL
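
For illustration only, here is a minimal Scala/Spark sketch of the Data Processing, Data Validation, and Data Load steps above. The file path, the column names (store_id, sku), and the PostgreSQL connection details are placeholder assumptions, and the triggering service (e.g., the Airflow DAG) is assumed to pass the dropped file's path as a job argument.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object EventDrivenEtl {
      def main(args: Array[String]): Unit = {
        // Path of the newly dropped file, passed by the triggering service
        // (e.g. the Airflow DAG run configuration). Placeholder example:
        // abfss://raw@<storageaccount>.dfs.core.windows.net/sales/2024-11-13.csv
        val inputPath = args(0)

        val spark = SparkSession.builder().appName("event-driven-etl").getOrCreate()

        // Data Processing: load the dropped file from the data lake.
        val raw = spark.read.option("header", "true").csv(inputPath)

        // Data Validation: flag rows with missing keys (hypothetical columns) and set them aside.
        val invalid = raw.filter(col("store_id").isNull || col("sku").isNull)
        if (invalid.count() > 0)
          invalid.write.mode("append").parquet(inputPath + "_rejected")

        // Data Load: write the valid rows to PostgreSQL over JDBC (placeholder host/db/table).
        raw.exceptAll(invalid).write
          .format("jdbc")
          .option("url", "jdbc:postgresql://postgres:5432/analytics")
          .option("dbtable", "staging.sales")
          .option("user", sys.env("PG_USER"))
          .option("password", sys.env("PG_PASSWORD"))
          .mode("append")
          .save()

        spark.stop()
      }
    }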

Project 2: CRSP Metadata Catalog



CRSP, short for Cognira's Retail Science Platform, is an internal platform that provides the tools and infrastructure needed to manage big data and run complex transformation pipelines.




  • Goal: Build a metadata management system for the datasets ingested and processed by CRSP to improve data lineage and discoverability.

  • Steps:
    - Data Ingestion: Use Spark to load the CRSP datasets' metadata (creation date, format, schema, etc.) and store it in PostgreSQL.
    - Metadata Catalog: Implement a metadata store using OpenMetadata to centralize all the information and enable users to search for datasets by schema, date, etc.
    - Data Lineage: Store the schema versions to track the schema evolution over time. Keep track of each dataset's transformations and update the metadata with every CRSP transformation that's created or deleted.
    - Data Validation: Add data quality validation checks and flag missing or inconsistent metadata. Generate periodic reports on CRSP usage, dataset growth trends, and metadata quality.
    - Pipeline Automation: Automate the workflow using an Airflow DAG.

  • Tech stack: Scala / Spark / Azure Data Lake / OpenMetadata / PostgreSQL / Docker / Kubernetes / Airflow
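
As a rough sketch of the Data Ingestion step (assuming a Parquet dataset and placeholder table and connection names), the snippet below captures a dataset's schema and basic metadata with Spark and persists it to PostgreSQL; a real implementation would also register the record in OpenMetadata.

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession

    object MetadataIngest {
      // One metadata record per dataset; the schema snapshot is reused later for lineage/versioning.
      case class DatasetMeta(name: String, path: String, format: String,
                             schemaJson: String, capturedAt: Timestamp)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("crsp-metadata-ingest").getOrCreate()
        import spark.implicits._

        val path = args(0)                  // dataset location, e.g. an ADLS folder
        val name = args(1)                  // logical dataset name
        val df   = spark.read.parquet(path) // assuming Parquet; adapt per dataset format

        val meta = DatasetMeta(
          name       = name,
          path       = path,
          format     = "parquet",
          schemaJson = df.schema.json,
          capturedAt = new Timestamp(System.currentTimeMillis()))

        // Persist the metadata record to PostgreSQL (placeholder connection and table).
        Seq(meta).toDF().write
          .format("jdbc")
          .option("url", "jdbc:postgresql://postgres:5432/crsp_meta")
          .option("dbtable", "catalog.dataset_metadata")
          .option("user", sys.env("PG_USER"))
          .option("password", sys.env("PG_PASSWORD"))
          .mode("append")
          .save()

        spark.stop()
      }
    }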

Project 3: LLM-Powered Spark Job Tuner



As Spark jobs scale on Kubernetes, performance bottlenecks can lead to inefficient resource usage, high costs, and delayed processing times. Identifying and resolving these bottlenecks often requires deep expertise in Spark configurations and code optimizations. A tool that leverages a language model (LLM) to analyze job performance, pinpoint bottlenecks, and suggest targeted code or configuration adjustments could empower engineers to optimize their Spark workloads more effectively, reducing latency and improving resource efficiency.




  • Goal: Develop a tool that helps identify bottlenecks in Spark jobs running on Kubernetes and suggests code and configuration improvements using an LLM.

  • Steps:
    - Metrics Collection: Deploy Prometheus to scrape Spark performance metrics from jobs running on Kubernetes (e.g., run time, CPU usage, memory usage). Write a Scala client to retrieve these metrics from the Prometheus API and extract the physical plan from the Spark UI API.
    - Performance Analysis and Evaluation: Send the collected data and physical plan summary to the LLM with prompts to detect faulty patterns and identify potential bottlenecks and expensive operations.
    - Generating Recommendations: Write code that generates tuning suggestions, with explanations, based on the collected data, e.g., recommending caching for datasets that are used multiple times in the transformation, or suggesting a broadcast join when the data size is below a certain threshold.
    - Dashboard: Build an interactive UI that allows users to monitor the lifecycle of every Spark job, visualize performance metrics and optimization recommendations, and receive LLM-powered explanations.
    - LLM Feedback Loop: Store the historical job data and patterns in order to fine-tune the LLM periodically and improve its recommendations over time.

  • Tech stack: Scala / Spark / Kubernetes / Prometheus / Llama / React
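
The Metrics Collection step could start with a small Scala client against the Prometheus HTTP API, sketched below; the Prometheus URL, the namespace label, and the CPU metric queried are assumptions that depend on how the cluster and the Spark metrics sink are configured.

    import java.net.{URI, URLEncoder}
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    object PrometheusClient {
      private val client = HttpClient.newHttpClient()

      // Runs an instant PromQL query against Prometheus' /api/v1/query endpoint
      // and returns the raw JSON response.
      def instantQuery(promUrl: String, promQl: String): String = {
        val encoded = URLEncoder.encode(promQl, "UTF-8")
        val request = HttpRequest.newBuilder()
          .uri(URI.create(s"$promUrl/api/v1/query?query=$encoded"))
          .GET()
          .build()
        client.send(request, HttpResponse.BodyHandlers.ofString()).body()
      }

      def main(args: Array[String]): Unit = {
        // Hypothetical query: CPU usage per pod in a "spark" namespace. The JSON result would be
        // decoded and summarised before being folded into the prompt sent to the LLM.
        val json = instantQuery("http://prometheus:9090",
          """sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="spark"}[5m]))""")
        println(json)
      }
    }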

Project 4: API-Driven Metrics Calculation Tool



Description: Build an API-driven tool that allows users to request custom metrics, triggering a Spark job to calculate these metrics and return the results in real time or near-real time, depending on processing complexity. This tool will be capable of handling both complex and simple metric computations, making it a versatile addition to data analytics for retailers.




  • Steps:
    1) Develop a RESTful API that serves as the interface for users to request specific metrics. Each request will contain a payload specifying the metrics to calculate.
    2) Implement an event-driven mechanism where each API request triggers a Spark job. Use a message broker to queue requests and allow Spark to process them sequentially or in parallel.
    3) Create an Airflow DAG to orchestrate these Spark jobs. Each API request will trigger an Airflow task that runs the appropriate Spark job for the requested metrics.
    4) Set up logic in the API to determine which specific Spark jobs to run based on the requested metrics, including any custom filters or aggregations.

  • Spark Job Design for Metric Calculations
    - Reusable Metric Calculations: Develop modular Spark jobs that can calculate different metrics based on parameters received from the API payload.
    - Performance Optimization: Use techniques like partitioning, caching, and broadcast joins in Spark to optimize processing, especially for complex or high-volume requests.

  • Data Storage and Access
    - Source Data Access: Connect the Spark jobs to relevant data sources (PostgreSQL, Cassandra/DuckDB).
    - Intermediate and Historical Metrics Storage: Use caching for intermediate results and persist historical metrics.

  • Stretch goals : Error Handling, Validation, and Logging
    - Input Validation: Validate incoming API payloads to ensure they contain valid metric names and properly structured filter parameters.
    - Error Handling and Logging: Implement error handling to catch and log issues like missing data, invalid metric requests, or Spark job failures.
    - Alerting Mechanism: Set up alerts for failed or delayed jobs, with notifications sent to a monitoring system or Slack.

  • Monitoring and Scaling
    - Metrics Dashboard: Use Prometheus and Grafana to monitor the API performance, Spark job completion rates, and error rates.
    - Auto-Scaling on Kubernetes: Deploy the API and Spark cluster on Kubernetes, using auto-scaling to dynamically allocate resources based on API load and job volume.

  • Create a UI


Tech stack: Scala / Spark / Airflow / Kubernetes / Kafka / Cassandra/DuckDB/PostgreSQL / Prometheus/Grafana
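
As one possible shape for the reusable metric jobs, the sketch below maps a metric name and filter received from the API payload to a parameterised Spark aggregation; the metric names, column names, source table, and results location are all hypothetical.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object MetricJob {
      // Maps a requested metric name to a parameterised aggregation (hypothetical metrics/columns).
      def compute(df: DataFrame, metric: String): DataFrame = metric match {
        case "total_sales" => df.agg(sum("sales_amount").as("value"))
        case "avg_basket"  => df.groupBy("store_id").agg(avg("basket_size").as("value"))
        case other         => throw new IllegalArgumentException(s"Unknown metric: $other")
      }

      def main(args: Array[String]): Unit = {
        val metric     = args(0) // e.g. "total_sales", taken from the API payload
        val filterExpr = args(1) // e.g. "region = 'EU'", a SQL filter from the payload

        val spark = SparkSession.builder().appName(s"metric-$metric").getOrCreate()

        // Assumed PostgreSQL source; swap in Cassandra or DuckDB connectors as needed.
        val source = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://postgres:5432/retail")
          .option("dbtable", "public.transactions")
          .option("user", sys.env("PG_USER"))
          .option("password", sys.env("PG_PASSWORD"))
          .load()

        // Results are written to a location the API layer can poll and return to the caller.
        compute(source.where(filterExpr), metric)
          .write.mode("overwrite").json(s"/results/$metric")

        spark.stop()
      }
    }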


Project 5: Data Lake Health Monitor



The project involves designing and developing a solution to create a centralized repository of operational metrics for monitoring your Lakehouse (data ingress).




  • Steps:
    1) Build a Spark listener that triggers each time data is updated.
    2) Collect various metadata from the Delta log.
    3) Define KPI metrics to assess data quality and table status.
    4) Develop an ETL pipeline based on the Medallion Architecture to process the collected data and calculate metrics.
    5) Create a dashboard to visualize these metrics for end users, using different graphs. Include a role-based recommendation system that suggests optimization techniques in certain cases (e.g., addressing small file issues, data skew, or optimizing reorganization).


Tech stack: Scala / Spark / Delta Lake / Airflow / Docker / Kubernetes / Azure / Dashboarding tool (...)
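
A minimal sketch of step 2 (collecting table metadata from the Delta log) is shown below; the table path is a placeholder, and the average-file-size figure is only one example of a small-file KPI.

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DeltaHealthSnapshot {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lakehouse-health-monitor")
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
          .getOrCreate()

        // Placeholder table path, e.g. abfss://lake@<account>.dfs.core.windows.net/bronze/sales
        val tablePath = args(0)
        val table = DeltaTable.forPath(spark, tablePath)

        // Recent commits from the Delta log, with their operation metrics (rows written, files added, ...).
        val history = table.history(20)
          .select("version", "timestamp", "operation", "operationMetrics")

        // Example KPI: average file size, useful for flagging small-file issues.
        val detail = spark.sql(s"DESCRIBE DETAIL delta.`$tablePath`")
          .select(col("numFiles"), (col("sizeInBytes") / col("numFiles")).as("avgFileBytes"))

        history.show(truncate = false)
        detail.show()
        spark.stop()
      }
    }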


About you:




  • Excellent academics in Computer Science, Engineering, or a related field.

  • Problem-solving is your jam, and you're all about critical thinking.

  • You're not afraid to roll up your sleeves and get stuff done, even when working independently with minimal supervision.

  • You can juggle multiple projects like a pro.

  • Challenges don't scare you; in fact, you love diving into them.

  • You can communicate like a champ, whether it's writing reports or presenting in a room full of people.

  • You're curious, and you love picking up new skills & technologies.

  • You're a team player, always up for sharing your ideas and best practices.

Important: Please submit your resume in English only.


What you'll enjoy here: It's not just an internship; we've got some great added value for you too:




  • Great company culture.

  • "Learn and Share" sessions.

  • You'll get support from your mentors.

  • Social events and after-work activities.

  • A flexible and fun work environment.

  • Casual dress code.

  • You'll work with a cool team! We respect your ideas, and we're all about trying new things.

  • Work/life balance.

[ Important: Please send us your resume in English only ]


Date posted: November 13, 2024
Publisher: Bayt