International Journal on Science and Technology

E-ISSN: 2229-7677     Impact Factor: 9.88

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

Incremental Processing For Handling Late-Arriving Data in Batch Processing

Author(s) Arjun Reddy Lingala
Country United States
Abstract Batch processing systems often struggle to handle late-arriving data [5], which leads to inconsistencies in analytical results and unnecessary computational overhead. This paper introduces an incremental processing approach [10] that efficiently incorporates late data into batch datasets while reducing the maintenance burden on users, avoiding user mistakes, and saving compute. Data arriving in a data warehouse is often delayed by upstream issues such as network outages, upstream delays, and fixes from data reconciliation. Late-arriving data presents significant challenges in both batch and real-time data processing environments, affecting data accuracy [3], system efficiency, and overall analytics reliability. Unaccounted late-arriving data can lead to incomplete and inaccurate analytical results, and the dashboards generated from those results report incorrect metrics. Typically, late-arriving data has to be caught or raised through an alert, after which the user who owns the ETL has to re-run the batch pipelines that fall within the time or date range of the arrived data. The proposed approach significantly reduces the computational overhead of batch reprocessing while ensuring data consistency and completeness. The framework introduces a real-time detection mechanism that continuously monitors incoming data and identifies late records by comparing their timestamps with pre-existing batch data. Once late data is detected, a targeted backfill strategy is applied. For a dataset that depends on order and historical information, only the affected time partitions, from the hour of late data arrival until the current processing period, are recomputed sequentially. For a dataset that does not depend on historical data, only the delta from the time of the late-arriving data is acted upon. This selective reprocessing minimizes redundant computation and optimizes system performance, with fault tolerance and scalability for handling large volumes of late-arriving data in distributed environments.
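The detection and targeted backfill steps described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`hour_floor`, `find_late_partitions`, `plan_backfill`), the hourly partitioning, and the reading of "hour of late data arrival" as the hourly partition the late record belongs to are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def hour_floor(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly partition."""
    return ts.replace(minute=0, second=0, microsecond=0)


def find_late_partitions(event_times, processed_through):
    """Return the hourly partitions that received records after being processed.

    Assumption: a record is late when its event time falls in a partition
    at or before `processed_through`, the last partition already computed.
    """
    late = {hour_floor(t) for t in event_times if hour_floor(t) <= processed_through}
    return sorted(late)


def plan_backfill(late_partitions, current_partition, order_dependent):
    """List the partitions to recompute, oldest first."""
    if not late_partitions:
        return []
    if not order_dependent:
        # Independent partitions: touch only the ones that received late data.
        return list(late_partitions)
    # Order-dependent dataset: replay every hour from the earliest late
    # partition up to the current processing period, in sequence.
    plan = []
    hour = late_partitions[0]
    while hour <= current_partition:
        plan.append(hour)
        hour += timedelta(hours=1)
    return plan


# Hypothetical scenario: hours through 10:00 are processed, 12:00 is current,
# and two records for the 08:00 partition arrive late.
processed_through = datetime(2023, 7, 5, 10)
current_partition = datetime(2023, 7, 5, 12)
incoming = [
    datetime(2023, 7, 5, 8, 15),
    datetime(2023, 7, 5, 8, 40),
    datetime(2023, 7, 5, 12, 5),  # on time: belongs to the current partition
]

late = find_late_partitions(incoming, processed_through)
sequential_plan = plan_backfill(late, current_partition, order_dependent=True)
delta_plan = plan_backfill(late, current_partition, order_dependent=False)
```

In this scenario the order-dependent plan covers the five hourly partitions from 08:00 through 12:00, while the independent plan covers only the 08:00 partition, which is where the savings over a full re-run would come from.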
Keywords Batch Processing, Late Data, Incremental Processing, Distributed Systems, ETL, Real-time, Intra-day Pipelines, Scalability, Fault Tolerance, Backfilling, Dashboards, Metrics
Field Engineering
Published In Volume 14, Issue 3, July-September 2023
Published On 2023-07-05
Cite This Incremental Processing For Handling Late-Arriving Data in Batch Processing - Arjun Reddy Lingala - IJSAT Volume 14, Issue 3, July-September 2023. DOI 10.71097/IJSAT.v14.i3.2266
DOI https://doi.org/10.71097/IJSAT.v14.i3.2266
Short DOI https://doi.org/g869wv
