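Special Topic Paper

Big Data Quality Management: Problems and Progress

  • WANG Hongzhi
  • School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China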

Received date: 2014-09-25

  Revised date: 2014-11-06

  Online published: 2014-12-17

Funding

National Basic Research Program of China (973 Program) (2012CB316200); National Natural Science Foundation of China (61472099)


Abstract

Big data are now pervasive across many domains, and their quality plays a crucial role in data-centric applications, so quality management techniques for big data are in demand. Although a body of theory and techniques for data quality management already exists, the volume, velocity and variety of big data make current methods hard to apply to big data quality management. Focusing on error detection, error repair and query processing over dirty data, this paper surveys the problems and challenges of big data quality management and identifies three main challenges: computational intractability, mixed errors and the lack of knowledge. It reviews the current progress of big data quality management according to the approaches taken to these three challenges, and proposes open problems for future research.

Cite this article

WANG Hongzhi. Big data quality management: Problems and progress[J]. Science & Technology Review, 2014, 32(34): 78-84. DOI: 10.3981/j.issn.1000-7857.2014.34.011
