Apache Hive Performance Improvement Techniques for Relational Data


GÜNAY M., İNCE M. M., CETINKAYA A.

2019 International Conference on Artificial Intelligence and Data Processing Symposium, IDAP 2019, Malatya, Türkiye, 21-22 September 2019

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.1109/idap.2019.8875898
  • City of Publication: Malatya
  • Country of Publication: Türkiye
  • Keywords: Big data, Hadoop, Hive Optimization
  • Akdeniz University Affiliated: Yes

Abstract

© 2019 IEEE. Hadoop is a widely adopted open-source MapReduce implementation for storing and processing extremely large data sets. However, using Hadoop is not easy for end users, especially those who are not familiar with the MapReduce approach. Even for simple tasks, such as getting raw counts or averages, users have to write MapReduce programs. Apache Hive, a data warehouse infrastructure tool for processing structured data in Hadoop, helps users easily query, summarize, and analyze Big Data with SQL-like expressions written in HiveQL. It supports various file formats for importing data from and exporting data to the storage file system. Hive's objective is to make processing petabyte-scale data easy and efficient. Unlike an RDBMS, Hive stores data in a document-based structure, so queries with JOINs degrade performance and consume substantial resources. However, there are ways to improve performance for relational data by correctly configuring Hive. In this research, we implement several optimization techniques to improve query performance and evaluate the results to compare their impact on the outcome. For this study, we used a TPC-generated dataset. TPC is a nonprofit corporation founded to define benchmarks for transaction processing in databases. We also discuss different techniques in the context of aggregation and observe their variable states.