-------------------------------------------
I. Preface
II. Environment
III. Configuration
1. Embedded mode
2. Local mode
3. Remote mode
IV. Testing
-------------------------------------------
I. Preface
Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables and offers simple SQL-style queries, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style aggregations can be expressed quickly in SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it a good fit for statistical analysis in a data warehouse.
Hive is an open-source project that Facebook contributed to Apache. It could fairly be called a tool made for DBAs: its goal is to let engineers who are fluent in SQL but not in Java programming stay relevant in the Hadoop wave, so that even without any Java they can keep adding value in HDFS data analysis. What does Hive actually do? My own way of looking at it is as an interpreter for SQL: it takes the SQL statements a DBA submits and converts them into MapReduce jobs that can run on Hadoop. DBAs and front-end users therefore no longer have to spend effort writing MapReduce applications and can lean on the ease of SQL to query and analyze data at scale.
Like Hadoop, Hive has three run modes:
1. Embedded mode: metadata is stored in a local embedded Derby database. This is the simplest way to run Hive, but the drawbacks are obvious: the embedded Derby database can open its data files for only one session at a time, so multi-session connections are not supported. That can be limiting even for local testing, so this mode is mainly for beginners getting familiar with Hive;
2. Local mode: the metastore is kept in a separate local database (MySQL, for example), which allows multi-session and multi-user connections.
3. Remote mode: when there are many Hive clients, installing a MySQL service on every client is redundant and wasteful. In that case we can go one step further and make MySQL independent as well, storing the metadata in a standalone MySQL service on a remote host. The metastore settings that distinguish the modes are sketched right after this list.
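In practice the three modes differ mainly in where the metastore database lives, which comes down to a handful of hive-site.xml properties. A minimal sketch for orientation (the first value is the stock Hive 0.8.1 default for the embedded Derby metastore; the second is the MySQL-style URL configured in the local and remote sections below, with localhost replaced by the remote host's address in remote mode):

<!-- embedded mode: Derby database files are created in the directory hive is started from (default) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<!-- local / remote mode: point the metastore at a MySQL instance instead -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive</value>
</property>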
II. Environment
System: CentOS 6.4, 32-bit
Packages:
hive-0.8.1.tar.gz
mysql-connector-java-5.1.18-bin.jar
All of the experiments below are carried out on top of the existing Hadoop installation.
All three modes are configured on the master host (192.168.2.101), on which the Hadoop cluster is already installed.
III. Configuration
1. Embedded mode
# tar -zxvf hive-0.8.1.tar.gz -C /usr/
# mv /usr/hive-0.8.1/ /usr/hive
# chown -R hadoop:hadoop /usr/hive/      // change ownership; the hadoop user was already created for the Hadoop cluster
# vim /etc/profile                       // add the environment variables
HIVE_HOME=/usr/hive
PATH=$PATH:$HIVE_HOME/bin
export HIVE_HOME
# . /etc/profile                         // take effect immediately, without having to log out and back in
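A quick sanity check that the variables are in effect for the current shell (assuming the paths above):

# echo $HIVE_HOME
/usr/hive
# which hive
/usr/hive/bin/hive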
# vim /usr/hive/bin/hive-config.sh       // edit the Hive configuration script
export JAVA_HOME=/usr/java/jdk1.7.0_45
export HIVE_HOME=/usr/hive
export HADOOP_HOME=/usr/hadoop
Next come hive-default.xml and hive-site.xml: hive-default.xml keeps the default configuration, while hive-site.xml holds site-specific settings that override the defaults. Embedded mode runs fine on the defaults, so nothing needs to be changed here; an example of what an override would look like is shown after the commands below.
# cd /usr/hive/conf/
# ll
-rw-rw-r--. 1 hadoop hadoop 46817 Jan 25  2012 hive-default.xml.template
-rw-rw-r--. 1 hadoop hadoop  1593 Jan 25  2012 hive-env.sh.template
-rw-rw-r--. 1 hadoop hadoop  1637 Jan 25  2012 hive-exec-log4j.properties.template
-rw-rw-r--. 1 hadoop hadoop  2043 Jan 25  2012 hive-log4j.properties.template
# cp -p hive-default.xml.template hive-default.xml
# cp -p hive-default.xml.template hive-site.xml
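For reference, any property placed in hive-site.xml overrides the value from hive-default.xml. In this mode nothing needs to change, but an override would look like the following (the value shown is simply the stock default for the warehouse directory, so adding it is a no-op):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>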
Create the /tmp and /user/hive/warehouse directories on HDFS and grant the group write permission. These are Hive's default data directories, as set by the default values in hive-site.xml.
# su - hadoop
$ hadoop dfs -mkdir /tmp
$ hadoop dfs -mkdir /user/hive/warehouse
$ hadoop dfs -chmod g+w /tmp
$ hadoop dfs -chmod g+w /user/hive/warehouse
$ hadoop dfs -ls /
drwxrwxr-x   - hadoop supergroup          0 2014-06-17 18:57 /tmp
drwxr-xr-x   - hadoop supergroup          0 2014-06-17 19:02 /user
drwxr-xr-x   - hadoop supergroup          0 2014-06-15 19:31 /usr
$ hadoop dfs -ls /user/hive/
drwxrwxr-x   - hadoop supergroup          0 2014-06-17 19:02 /user/hive/warehouse
$ hive                                   // start Hive
Logging initialized using configuration in file:/usr/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201406171916_734435947.txt
hive> show tables;
OK
Time taken: 7.157 seconds
hive> quit;
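With the embedded metastore, Derby creates its database files in whatever directory hive was launched from, so after the session above you would expect to find something like the following in the hadoop user's home directory (assuming hive was started from there):

$ ls ~
derby.log  metastore_db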
2. Local standalone mode (configured on top of the embedded-mode setup)
# yum install mysql mysql-server         // install MySQL
# service mysqld start
# mysql -u root                          // create the database and the user
mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
mysql> grant all on hive.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql> \q
Bye
# mysql -u hive -p                       // local connection test: works
Enter password:
mysql> \q
Bye
# mv mysql-connector-java-5.1.18-bin.jar /usr/hive/lib/           // put the JDBC driver in place; it lets Java programs talk to MySQL
# chown hadoop:hadoop /usr/hive/lib/mysql-connector-java-5.1.18-bin.jar
# su - hadoop
$ cp /usr/hive/conf/hive-site.xml /usr/hadoop/conf/               // copy hive-site.xml from the Hive conf directory into the Hadoop conf directory
Edit hive-site.xml so that Hive connects to MySQL:
$ vim /usr/hadoop/conf/hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>                     <!-- the MySQL instance to connect to -->
  <value>jdbc:mysql://localhost:3306/hive</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>              <!-- the MySQL JDBC driver class -->
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>                <!-- the MySQL user name -->
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>                <!-- the MySQL password -->
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
$ hive                                   // start Hive
Logging initialized using configuration in jar:file:/usr/hive/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201406172021_1374786590.txt
hive> show tables;
OK
Time taken: 5.527 seconds
hive> quit;
3. Remote mode (using a separate CentOS 6.4 host with MySQL installed, IP 192.168.2.10/24)
# yum install mysql mysql-server         // install MySQL on the standalone host
# service mysqld start
# mysql -u root                          // create the database and the user
mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
mysql> grant all on hive.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql> \q
Bye
# yum install mysql                      // install the MySQL client on the Hadoop cluster's master host and test: works
# mysql -h 192.168.2.10 -u hive -p
Enter password:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| test               |
+--------------------+
3 rows in set (0.08 sec)
mysql> \q
Bye
# tar -zxvf hive-0.8.1.tar.gz -C /usr/   // install Hive on the master host
# mv /usr/hive-0.8.1/ /usr/hive
# chown -R hadoop:hadoop /usr/hive/
# vim /etc/profile                       // add the Hive environment variables
HIVE_HOME=/usr/hive
PATH=$PATH:$HIVE_HOME/bin
export HIVE_HOME
# . /etc/profile
# mv mysql-connector-java-5.1.18-bin.jar /usr/hive/lib/           // put the JDBC driver in place
# chown hadoop:hadoop /usr/hive/lib/mysql-connector-java-5.1.18-bin.jar
# cd /usr/hive/conf/
# cp -p hive-default.xml.template hive-site.xml
# vim /usr/hive/conf/hive-site.xml       // set the following properties (an XML sketch follows this list)
hive.metastore.warehouse.dir: the data directory; the default is /user/hive/warehouse;
hive.exec.scratchdir: the scratch (temporary file) directory; the default is /tmp/hive-${user.name};
hive.metastore.local: whether the metastore is local; set it to false here so that metadata is stored in the remote MySQL database;
javax.jdo.option.ConnectionURL: the database connection string; change it to jdbc:mysql://192.168.2.10:3306/hive?useUnicode=true&characterEncoding=utf8;
javax.jdo.option.ConnectionDriverName: the database driver; change it to com.mysql.jdbc.Driver;
javax.jdo.option.ConnectionUserName: the MySQL user name; set it to match your environment;
javax.jdo.option.ConnectionPassword: the MySQL password; set it to match your environment;
hive.stats.dbclass: the database type used for statistics; change it to jdbc:mysql;
hive.stats.jdbcdriver: the driver for the statistics database; set it to com.mysql.jdbc.Driver;
hive.stats.dbconnectionstring: the connection string for Hive's temporary statistics database; set it to jdbc:mysql://192.168.2.10:3306/hivestat?useUnicode=true&characterEncoding=utf8&user=hive&password=hive&createDatabaseIfNotExist=true;
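As a sketch, the metastore-related entries above translate into hive-site.xml properties roughly like the following (user name and password are shown as hive/hive to match the MySQL setup above; note that & must be written as &amp; inside XML values):

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.2.10:3306/hive?useUnicode=true&amp;characterEncoding=utf8</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>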
Create Hive's log output configuration file from the template:
# cp -p /usr/hive/conf/hive-log4j.properties.template /usr/hive/conf/hive-log4j.properties
$ su - hadoop
$ hive                                   // Hive starts normally, but queries fail; this has not been resolved
Logging initialized using configuration in file:/usr/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201406172136_2107577584.txt
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
NestedThrowables:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>
The "Communications link failure" means the JDBC connection from the master to 192.168.2.10 never gets through; likely things to check are whether the MySQL grant (created above only for 'hive'@'localhost') also allows connections from the master's address, and whether the MySQL host's firewall or bind-address setting blocks remote connections.
IV. Testing (local standalone mode)
1. Start Hadoop
[hadoop@master ~]$ jps                   // on the master host
3749 NameNode
3962 JobTracker
4269 Jps
3898 SecondaryNameNode
[hadoop@slave1 ~]$ jps                   // on slave1; the other two slaves look the same
3033 Jps
2755 DataNode
2844 TaskTracker
2. Start Hive
$ hive                                   // run on the master host
3. Create a table
hive> create table test(id INT,str STRING)
    > row format delimited
    > fields terminated by ','
    > stored as textfile;
Time taken: 0.15 seconds
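Since the table is declared with fields terminated by ',', the test file loaded in the next step is presumably plain text with one comma-separated record per line, for example (illustrative only, not the actual contents of data_test.txt):

1,a
2,b
3,c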
4. View the newly created table
hive> show tables;
OK
test
Time taken: 1.15 seconds
5. Load the local test data
hive> load data local inpath '/home/hadoop/data_test.txt'
    > overwrite into table test;
Copying data from file:/home/hadoop/data_test.txt
Copying file: file:/home/hadoop/data_test.txt
Loading data to table default.test
OK
Time taken: 4.322 seconds
6. Query the first 10 rows
hive> select * from test limit 10;
OK
1       a
2       b
3       c
4       d
5       e
6       f
7       g
8       h
9       i
10      j
Time taken: 0.869 seconds
7. Count how many rows the file contains; for this query Hive runs a MapReduce job to compute the result
hive> select count(1) from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201406180238_0001, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201406180238_0001
Kill Command = /usr/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=http://192.168.2.101:9001 -kill job_201406180238_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-06-18 02:39:43,858 Stage-1 map = 0%, reduce = 0%
2014-06-18 02:39:54,964 Stage-1 map = 100%, reduce = 0%
2014-06-18 02:40:04,078 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201406180238_0001
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1  HDFS Read: 33586560 HDFS Write: 8 SUCESS
Total MapReduce CPU Time Spent: 0 msec
OK
4798080
Time taken: 35.687 seconds
Error encountered:
hive> select * from test limit 10;
FAILED: Hive Internal Error: java.lang.RuntimeException(Error while making MR scratch directory - check filesystem config (null))
java.lang.RuntimeException: Error while making MR scratch directory - check filesystem config (null)
	at org.apache.hadoop.hive.ql.Context.getMRScratchDir(Context.java:216)
	at org.apache.hadoop.hive.ql.Context.getMRTmpFileURI(Context.java:267)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1063)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7275)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://192.168.2.101:9000/tmp/hive-hadoop/hive_2014-06-18_02-31-49_300_6928349557820866780, expected: hdfs://master:9000
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
	at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
	at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:116)
	at org.apache.hadoop.hive.ql.Context.getScratchDir(Context.java:164)
	at org.apache.hadoop.hive.ql.Context.getMRScratchDir(Context.java:210)
	... 17 more
Fix (must be applied on every node in the cluster):
$ vim /usr/hadoop/conf/core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.2.101:9000</value>     <!-- change this IP address to the hostname and the problem goes away -->
</property>
Look up the hostnames:
$ vim /etc/hosts
192.168.2.101   master
192.168.2.102   slave1
192.168.2.103   slave2
192.168.2.104   slave3
After the change:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>            <!-- the IP address replaced with the hostname -->
</property>
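For the new fs.default.name to be picked up, the Hadoop daemons should be restarted on all nodes after the change; on a Hadoop 1.x-style cluster like this one, that would typically be something like:

$ /usr/hadoop/bin/stop-all.sh            // run on the master as the hadoop user
$ /usr/hadoop/bin/start-all.sh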