Abstract
Data stream management has attracted the attention of many researchers in recent years. Numerous devices generate huge amounts of data, which demand an efficient processing scheme for delivering high-quality applications. Data are reported through streams and stored in a number of partitions. Separation techniques facilitate the parallel management of data, while intelligent methods are necessary to manage these multiple instances of data. Progressive analytics over huge amounts of data can be adopted to deliver partial responses and, possibly, to save time in the execution of applications. An interesting research domain is the efficient management of queries over multiple partitions. Usually, such queries demand responses in the form of ordered sets of objects (e.g., top-k queries). These ordered sets contain objects in ranked order and require novel mechanisms for deriving responses from partial results. In this paper, we study a setting of multiple data partitions and propose an intelligent, uncertainty-driven decision-making mechanism that responds to streams of queries. Our mechanism delivers an ordered set of objects built from the partial ordered subsets retrieved from each data partition. We envision that query processors are placed in front of the partitions and report progressive analytics to a Query Controller (QC). The QC receives queries, assigns them to the underlying processors and decides the right time to deliver the final ordered set to the application. We propose an aggregation model for deriving the final ordered set of objects and a Fuzzy Logic (FL) inference process. We present a Type-2 FL system that decides when the QC should stop aggregating partial subsets and return the final response to the application. We evaluate the proposed mechanism through a large set of experiments. Our results cover the throughput of the QC, the quality of the final ordered set of objects and the time required to deliver the final response.
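To make the aggregation step more concrete, the following is a minimal sketch of how partial ranked lists reported by several partitions could be merged into one final top-k list. The abstract does not specify the aggregation model, so this sketch swaps in a simple Borda-style positional score as a stand-in; the function name `aggregate_top_k`, the scoring rule and the sample data are all assumptions made for illustration, not the paper's method.

```python
from collections import defaultdict


def aggregate_top_k(partial_lists, k):
    """Merge ranked partial lists into one top-k list.

    Assumption: Borda-style positional scoring is used here as a
    stand-in for the paper's aggregation model, which the abstract
    does not describe.
    """
    scores = defaultdict(float)
    for ranked in partial_lists:
        n = len(ranked)
        for position, obj in enumerate(ranked):
            # The first object of a list of length n earns n points,
            # the last earns 1 point; objects absent from a list earn nothing.
            scores[obj] += n - position
    # Sort by descending score; break ties by object id for determinism.
    ordered = sorted(scores, key=lambda obj: (-scores[obj], obj))
    return ordered[:k]


# Hypothetical partial results from three partitions.
partials = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
result = aggregate_top_k(partials, k=3)
print(result)
```

In the setting described above, a controller would call such a function each time the processors report new partial subsets, and a separate decision component (the Type-2 FL system in the paper) would decide when the aggregated list is good enough to return.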
Cite this article
Kolomvatsos, K. An intelligent, uncertainty driven aggregation scheme for streams of ordered sets. Appl Intell 45, 713–735 (2016). https://doi.org/10.1007/s10489-016-0789-8
DOI: https://doi.org/10.1007/s10489-016-0789-8