
Data Warehousing and Analytics: An Examination of Performance through Cloud and RDBMS Technologies


Preliminary Experiments
Advances in technology have led to the production of large volumes of data, and the amount of data produced across several domains has increased significantly. This growth has made it difficult for analysts to query large data repositories. Choosing a suitable database system is therefore vital for data handlers who need a large-scale data analysis system that meets their needs and expectations. The results of experimental analyses help inform users why one large-scale data access system is recommended over another. In particular, the analysis of DBMS-X, HadoopDB, and Hive plays a key role in reporting how reliably each system supports large-scale declarative queries. The choice of a database is driven by its performance, chiefly the time required to return all the answers to a query.
Understanding the primary features of Hive, DBMS-X, and Hadoop helps experts select a suitable database. Hadoop provides a framework with which large data sets can be processed across several computers. It is touted to offer efficient performance and a platform for high-throughput data access and streaming.


Moreover, among the distributed options, Hadoop has proved to be relatively inexpensive while offering a higher degree of scalability in data processing and management.
Hive was developed on top of Hadoop’s open source framework. It is considered essential for querying and managing large distributed datasets stored on Hadoop. Because Hive provides an SQL-like interface to Hadoop, it is simpler to use and lets users express queries over Hadoop data in familiar SQL terms. Thus, Hive offers users with a background in relational databases a suitable interface, along with the ability to plug in custom code.
DBMS-X is an IMDB system that supports features such as column-based queries and storage, parallel processing, and data compression. It is a parallel SQL database management system and has been found to perform better than other parallel-processing IMDB systems. A majority of users prefer DBMS-X because of its ability to compress table data using well-known dictionary-based schemes. As a result, DBMS-X’s features play a vital role in ensuring that data held in column-based databases is managed efficiently.
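The dictionary-based compression mentioned above can be illustrated with a minimal Python sketch. It is not DBMS-X's actual implementation, which the source does not describe; the function names and the sample column are hypothetical and chosen only for illustration.

```python
def dictionary_encode(column):
    """Replace each value in a column with a small integer code.

    Returns (dictionary, codes), where dictionary[code] is the original value.
    Hypothetical helper for illustration; not taken from DBMS-X.
    """
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical column with many repeated values.
column = ["galaxy", "star", "galaxy", "galaxy", "star", "quasar"]
dictionary, codes = dictionary_encode(column)
restored = dictionary_decode(dictionary, codes)
```

The scheme pays off when a column holds few distinct values, because each repeated string is stored once and referenced by a short integer.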
The sample queries used in the experiment were suitable because they comply with the schemas of the PT1.1 and PT1.2 datasets. These query sets perform join operations and help identify where common clauses and data attributes have to be grouped. For instance, the “SELECT” clause query found on the column project performs a join, while Query #8 performs a “GROUP BY” (Mesmoudi and Hacid 166). These queries are essential to the experiment because they help assess each database’s performance.
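To make the two query shapes concrete, the following Python sketch runs a join and a “GROUP BY” against an in-memory SQLite database. The tables, columns, and rows are hypothetical stand-ins, not the PT1.1 or PT1.2 schemas, and the queries are not the ones used by Mesmoudi and Hacid.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object (object_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE source (source_id INTEGER PRIMARY KEY, object_id INTEGER, flux REAL);
INSERT INTO object VALUES (1, 'north'), (2, 'south');
INSERT INTO source VALUES (10, 1, 2.0), (11, 1, 4.0), (12, 2, 6.0);
""")

# Join: attach each source row to the region of its parent object.
join_rows = conn.execute("""
    SELECT s.source_id, o.region
    FROM source AS s
    JOIN object AS o ON s.object_id = o.object_id
    ORDER BY s.source_id
""").fetchall()

# GROUP BY: count sources and average their flux per object.
group_rows = conn.execute("""
    SELECT object_id, COUNT(*), AVG(flux)
    FROM source
    GROUP BY object_id
    ORDER BY object_id
""").fetchall()

conn.close()
```

Joins and aggregations stress a system differently: a join has to bring matching rows from two tables together, while a “GROUP BY” has to collect all rows that share a key, which in a distributed system usually means moving data between nodes.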
The experimental protocol considered various configurations to analyze the performance of the three databases. It took place in three phases: an analysis of distributed systems, an analysis of optimization within the distributed systems, and an analysis of functionality on centralized versus distributed systems. Using Hive on the distributed systems, the analysis of the PT1.1 dataset took 25 minutes, while the PT1.2 dataset took one hour and thirty minutes. Mesmoudi and Hacid commented that the task must have been constrained by the fact that the speed of the local disk could only reflect at the same time for the three clusters (173). Conversely, the Hadoop system required that the tasks be partitioned into 3 big chunks of about 28 GB each. The processing time was 37-54% of the total loading time for the first chunk, 6-16% for the second, and 10-21% for the final chunk (Mesmoudi and Hacid 172). When all three databases were compared, DBMS-X was found to be much slower than Hadoop and Hive, and Hadoop emerged as the most powerful in terms of load and execution times. The parallel database in DBMS-X requires that the data first be transformed into its native format and that indexes then be built. The two MapReduce-based systems do not work in the same way because their query and load times are separated.
The outcomes of the experiment indicate that Hadoop is the more powerful database system across all its data clusters for queries #1-#7. However, Hive posts much more impressive outcomes for query #10. Nonetheless, when asked to order the results in a particular fashion, Hadoop generates fewer write operations and is thus considered more efficient and reliable. Conversely, the efficiencies experienced when using DBMS-X may be attributed to its architectural features, which include data compression, pipelining, scheduling, and column-oriented storage, and hence lead to better performance. The process takes advantage of the relevant indexes, which increases the time that must be allowed for data processing. Finally, for centralized and distributed systems, the costs of the systems were found to affect the efficiency of the databases. HadoopDB was found to induce high communication costs of up to $95 million. Moreover, a poor choice of strategy makes it impossible to perform projections in the Map phase, which are known to reduce communication costs significantly.
DBMS-X, HadoopDB, and Hive are all reliable for large-scale data. However, a comparison of the three reveals that DBMS-X is more efficient. Nonetheless, the configuration, architectural attributes, and queries in use may also have had a significant impact on the outcomes of the experiment. DBMS-X databases are known to offer indexing, partitioning, and materialized views, all of which may make it easier to attain better performance than MapReduce technologies. Hive has the lowest performance efficiency, as it depends on HadoopDB open source code. It is likely that Hive lacks the proficiency of the other two databases when used as a stand-alone system.
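The point about projections in the Map phase can be sketched in Python. This is a toy model under stated assumptions, not HadoopDB code: the record layout, the function names, and the use of JSON length as a proxy for shuffled bytes are all hypothetical.

```python
import json


def map_full(record):
    """Map step without projection: emit the whole record under its key."""
    return record["key"], record


def map_projected(record, needed_fields):
    """Map step with projection: emit only the fields the query needs."""
    return record["key"], {field: record[field] for field in needed_fields}


def shuffled_bytes(pairs):
    """Rough proxy for communication cost: serialized size of emitted values."""
    return sum(len(json.dumps(value)) for _, value in pairs)


# Hypothetical records with one small field and one bulky, unneeded field.
records = [
    {"key": i % 2, "flux": float(i), "comment": "x" * 50}
    for i in range(4)
]

full = [map_full(r) for r in records]
projected = [map_projected(r, ["flux"]) for r in records]

saving = shuffled_bytes(full) - shuffled_bytes(projected)
```

Dropping unneeded columns before the shuffle means less data crosses the network between the Map and Reduce phases, which is why a strategy that rules out map-side projection raises communication costs.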

Works Cited
Mesmoudi, Amin, and Mohand-Saïd Hacid. “A Comparison of Systems to Large-Scale Data Access.” International Conference on Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2014.
