Hadoop LZO

1. Install LZO

sudo apt-get install liblzo2-dev
Or download lzo from http://www.oberhumer.com/opensource/lzo/download/:

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar xzf lzo-2.06.tar.gz && cd lzo-2.06
./configure --enable-shared
make
make install

2. Install hadoop-lzo

wget https://github.com/kevinweil/hadoop-lzo/archive/master.zip
or: git clone https://github.com/kevinweil/hadoop-lzo.git

On 64-bit machines:
export CFLAGS=-m64
export CXXFLAGS=-m64

On 32-bit machines:

export CFLAGS=-m32
export CXXFLAGS=-m32

Build and package: ant compile-native tar

Error encountered during the build:
compile-native:
[mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/lib
[mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/src/com/hadoop/compression/lzo
[javah] Error: class org.apache.hadoop.conf.Configuration not found.

BUILD FAILED
/home/caodaoxi/soft/hadoop-lzo/build.xml:269: compilation failed


Fix:
 Add <classpath refid="classpath"/> inside the javah task in build.xml:

 <javah classpath="${build.classes}" destdir="${build.native}/src/com/hadoop/compression/lzo" force="yes" verbose="yes">

   <class name="com.hadoop.compression.lzo.LzoCompressor" />

   <class name="com.hadoop.compression.lzo.LzoDecompressor" />
    <classpath refid="classpath"/>

 </javah>

3. Copy the native folder from the hadoop-lzo build directory into Hadoop's lib directory

cp -r /home/hadoop/soft/hadoop-lzo/build/native /home/hadoop/soft/hadoop/lib/

4. Copy the hadoop-lzo jar into Hadoop's lib directory

cp /home/hadoop/soft/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar /home/hadoop/soft/hadoop/share/hadoop/lib

5. Update Hadoop's configuration files

Edit $HADOOP_HOME/conf/core-site.xml and add the configuration below. (Later testing showed this configuration can be omitted; moreover, adding it caused frameworks such as Sqoop to fail to load LzoCodec.class.)

<property>
 <name>hadoop.tmp.dir</name>
 <value>/home/hadoop/soft/hadoop/tmp</value>
</property>
<property>
 <name>fs.trash.interval</name>
 <value>1440</value>
 <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
<property>
 <name>io.compression.codecs</name>
 <value>
org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
 </value>
</property>
<property>
 <name>io.compression.codec.lzo.class</name>
 <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Edit $HADOOP_HOME/conf/mapred-site.xml:

<property>
 <name>mapreduce.map.output.compress</name>
 <value>true</value>
</property>

<property>
 <name>mapred.child.java.opts</name>
 <value>-Djava.library.path=/home/hadoop/soft/hadoop/lib/native/Linux-i386-32/</value>
</property>

<property>
 <name>mapred.map.output.compression.codec</name>
 <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

6. Restart the Hadoop cluster

 cd /home/hadoop/soft/hadoop/bin

 ./stop-all.sh

 ./start-all.sh

7. Test the cluster

a. Testing in the test environment

1. Install lzop

wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar xzf lzop-1.03.tar.gz && cd lzop-1.03
./configure && make && sudo make install

2. Compress a log file with lzop
Fetch the original log: hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log

Original log file:

-rw-r--r--  1 hadoop hadoop 497060688 Jul  1 10:36 pv.log

Compress with lzop: lzop pv.log
Compressed result:

-rw-r--r--  1 hadoop hadoop 497060688 Jul  1 10:36 pv.log
-rw-r--r--  1 hadoop hadoop 163517168 Jul  1 10:36 pv.log.lzo

Compression ratio: 163517168/497060688 ≈ 33%
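The ratio quoted above can be recomputed directly from the two sizes in the listing; this is plain arithmetic and needs no Hadoop:

```shell
# Recompute the compression ratio from the ls sizes above.
original=497060688     # pv.log
compressed=163517168   # pv.log.lzo

# awk handles the floating-point division.
awk -v c="$compressed" -v o="$original" \
  'BEGIN { printf "compressed to %.1f%% of the original\n", c / o * 100 }'
# → compressed to 32.9% of the original
```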

hadoop fs -put pv.log.lzo  /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/

Verify the installation: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

It fails with:

           13/07/01 15:01:35 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
           java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
           at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
           at java.lang.Runtime.loadLibrary0(Runtime.java:845)
           at java.lang.System.loadLibrary(System.java:1084)
           at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
           at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
           at com.hadoop.compression.lzo.LzoIndexer.<init>(LzoIndexer.java:36)
           at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:134)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:601)
           at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
           13/07/01 15:01:35 ERROR lzo.LzoCodec: Cannot load native-lzo without native-hadoop
           13/07/01 15:01:36 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-14/pv.log.lzo, size 0.05 GB...
           Exception in thread "main" java.lang.RuntimeException: native-lzo library not available
           at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:104)
           at com.hadoop.compression.lzo.LzoIndex.createIndex(LzoIndex.java:229)
           at com.hadoop.compression.lzo.LzoIndexer.indexSingleFile(LzoIndexer.java:117)
           at com.hadoop.compression.lzo.LzoIndexer.indexInternal(LzoIndexer.java:98)
           at com.hadoop.compression.lzo.LzoIndexer.index(LzoIndexer.java:52)
           at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:137)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:601)
           at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

       This shows the native LZO library was not installed correctly.

       Fix:

       1. After much troubleshooting, it turned out the JDK was 32-bit while the machine was 64-bit, so Hadoop's native build came out 32-bit. The fix was to switch to a 64-bit JDK.

       2. Reading the hadoop-lzo and Hadoop source turned up a few relevant code fragments.

          com.hadoop.compression.lzo.GPLNativeCodeLoader:

          try {
                 //try to load the lib
                System.loadLibrary("gplcompression");
                nativeLibraryLoaded = true;
                LOG.info("Loaded native gpl library");
         } catch (Throwable t) {
                LOG.error("Could not load native gpl library", t);
                nativeLibraryLoaded = false;
         }

          /home/hadoop/soft/hadoop/bin/hadoop:

          HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"

          Printing JAVA_LIBRARY_PATH just before this line showed that it did not include the directory holding the LZO shared library; for LZO to work, that directory must be on JAVA_LIBRARY_PATH.

          So add at line 365: JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
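Before restarting, it may be worth confirming that libgplcompression.so is actually reachable from the path being built. A small diagnostic sketch; the default directory below is just an example from this setup, so substitute the value your bin/hadoop script actually constructs:

```shell
#!/bin/bash
# Walk a colon-separated library path and report where the
# hadoop-lzo native library (libgplcompression.so) is found.
# The default path here is an example; override JAVA_LIBRARY_PATH
# with the value from your own bin/hadoop script.
JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH:-/home/hadoop/soft/hadoop/lib/native/Linux-i386-32}"

found=no
IFS=':'
for dir in $JAVA_LIBRARY_PATH; do
  if [ -e "$dir/libgplcompression.so" ]; then
    echo "found libgplcompression.so in $dir"
    found=yes
  fi
done
unset IFS

if [ "$found" = no ]; then
  echo "libgplcompression.so not found on JAVA_LIBRARY_PATH" >&2
fi
```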

          Restart the Hadoop cluster.

          Run it again: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

          It now prints:

          13/07/01 17:40:53 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
          13/07/01 17:40:53 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
          13/07/01 17:40:54 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://kooxoo1-154.kuxun.cn:9000/user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo to indexing list (no index currently exists)
          13/07/01 17:40:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
          13/07/01 17:40:54 INFO input.FileInputFormat: Total input paths to process : 1
          13/07/01 17:40:54 INFO mapred.JobClient: Running job: job_201307011738_0001
          13/07/01 17:40:55 INFO mapred.JobClient:  map 0% reduce 0%
          13/07/01 17:41:11 INFO mapred.JobClient:  map 100% reduce 0%
          13/07/01 17:41:16 INFO mapred.JobClient: Job complete: job_201307011738_0001
          13/07/01 17:41:16 INFO mapred.JobClient: Counters: 19
          13/07/01 17:41:16 INFO mapred.JobClient:   Job Counters
          13/07/01 17:41:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15320
          13/07/01 17:41:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
          13/07/01 17:41:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
          13/07/01 17:41:16 INFO mapred.JobClient:     Launched map tasks=1
          13/07/01 17:41:16 INFO mapred.JobClient:     Data-local map tasks=1
          13/07/01 17:41:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
          13/07/01 17:41:16 INFO mapred.JobClient:   File Output Format Counters
          13/07/01 17:41:16 INFO mapred.JobClient:     Bytes Written=0
          13/07/01 17:41:16 INFO mapred.JobClient:   FileSystemCounters
          13/07/01 17:41:16 INFO mapred.JobClient:     HDFS_BYTES_READ=15388
          13/07/01 17:41:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21849
          13/07/01 17:41:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=15176
          13/07/01 17:41:16 INFO mapred.JobClient:   File Input Format Counters
          13/07/01 17:41:16 INFO mapred.JobClient:     Bytes Read=15220
          13/07/01 17:41:16 INFO mapred.JobClient:   Map-Reduce Framework
          13/07/01 17:41:16 INFO mapred.JobClient:     Map input records=1897
          13/07/01 17:41:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=100438016
          13/07/01 17:41:16 INFO mapred.JobClient:     Spilled Records=0
          13/07/01 17:41:16 INFO mapred.JobClient:     CPU time spent (ms)=3770
          13/07/01 17:41:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=189202432
          13/07/01 17:41:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3543986176
          13/07/01 17:41:16 INFO mapred.JobClient:     Map output records=1897
          13/07/01 17:41:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=164

          This confirms hadoop-lzo is installed. The index file it created:

          -rw-r--r--   3 hadoop caodx  163517168 2013-07-01 10:55 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo
          -rw-r--r--   3 hadoop caodx      15176 2013-07-01 17:41 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo.index
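Incidentally, the index size squares with the job counters above. To my understanding the LZO index file is a flat list of 8-byte block offsets (an assumption about the LzoIndex format, not something stated in the logs), and the indexer job reported "Map output records=1897":

```shell
# Cross-check the index file size against the job counters:
# 1897 index entries at an assumed 8 bytes per entry should
# equal the 15176-byte .index file (and HDFS_BYTES_WRITTEN=15176).
entries=1897
bytes_per_entry=8
echo "$((entries * bytes_per_entry)) bytes"   # → 15176 bytes
```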

    3. MapReduce test:

        WordCount core code fragment:

        TextOutputFormat.setCompressOutput(job, true);
        TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

        Run WordCount:

        hadoop fs -put soft/hadoop/README.txt /user/hadoop

        hadoop jar lzotest.jar org.apache.hadoop.examples.WordCount /user/hadoop/README.txt /user/hadoop/lzo1

        13/07/01 18:12:40 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
        13/07/01 18:12:40 INFO input.FileInputFormat: Total input paths to process : 1
        13/07/01 18:12:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
        13/07/01 18:12:40 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
        13/07/01 18:12:40 INFO mapred.JobClient: Running job: job_201307011738_0004
        13/07/01 18:12:41 INFO mapred.JobClient:  map 0% reduce 0%
        13/07/01 18:12:55 INFO mapred.JobClient:  map 100% reduce 0%
        13/07/01 18:13:07 INFO mapred.JobClient:  map 100% reduce 100%
        13/07/01 18:13:12 INFO mapred.JobClient: Job complete: job_201307011738_0004

        Inspect the output:

        hadoop fs -ls /user/hadoop/lzo1

        -rw-r--r--   3 hadoop supergroup       1037 2013-07-01 18:13 /user/hadoop/lzo1/part-r-00000.lzo

        As expected, the output is compressed.

 

    4. Hive test

       Enable compression of intermediate map output:

       hive (labrador)> set mapred.compress.map.output=true;

       hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

       hive (labrador)> select count(*) from pvlog where ptdate='2013-06-04';

    5. Performance test (on a 9.1 GB log)

      The example job computes pv, uv, and ip statistics over the pv log.

      hadoop@kooxoo1-155:~$ ll -h

      -rw-r--r-- 1 hadoop hadoop 9.1G Jul  2 14:55 pvlog2013-06-04.txt

      a. Without compression:

      hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';

    On http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001:

      Mapper and reducer counts reported at runtime:

      Hadoop job information for Stage-1: number of mappers: 37; number of reducers:

      Result:

          pv          uv        ip
      14569944    946643    685518
      Time taken: 204.92 seconds

      b. With intermediate map-output compression:

      Recreate the table (the input and output formats must be specified when the table is created, not just before running the HQL; otherwise it errors out):

      hive (labrador)> drop table pvlog;  (dropping an external table does not delete its data)

      hive (labrador)> CREATE EXTERNAL TABLE pvlog(ip string, current_date string, current_time string, entry_time string,
                     > visitor_id string, url string, first_refer string, last_refer string, fromid string, ifid string, external_source string, internal_source string, pagetype string,
                     > global_landing string, channel_landing string, visits_count string, pv_count string, kuxun_id string, utm_source string, utm_medium string, utm_term string,
                     > utm_id string, utm_campaign string, pool string, reserve_a string, reserve_b string, reserve_c string, reserve_d string, city string, pvid string,
                     > lastpvid string, visitsid string, maxpvcount string, channelpv string, channelleads string)
                     > PARTITIONED BY (ptdate string)
                     > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                     > LINES TERMINATED BY '\n'
                     > STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
                     > OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
                     > LOCATION '/user/hive/warehouse/labrador.db/pvlog/';

      hive (labrador)> ALTER TABLE pvlog ADD PARTITION (ptdate='2013-06-04');

      hive (labrador)> set mapred.compress.map.output=true;

      hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

      hive (labrador)> set hive.exec.compress.intermediate=true;

      hive (labrador)> set io.compression.codecs=com.hadoop.compression.lzo.LzopCodec;

      hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';

      Mapper and reducer counts reported at runtime:

      Hadoop job information for Stage-1: number of mappers: 37; number of reducers:

      On http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001:

       Result:
          pv          uv        ip
      14569944    946643    685518
      Time taken: 184.92 seconds

      That is about 20 seconds faster. The improvement is modest, which may be down to the test environment: 4 nodes, all virtual machines, with 6 map slots and 6 reduce slots in total.
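Expressed as a percentage, the two "Time taken" figures above amount to roughly a 10% speedup:

```shell
# Relative speedup from the two "Time taken" figures above
# (204.92 s uncompressed vs 184.92 s with compressed map output).
awk 'BEGIN {
  before = 204.92; after = 184.92
  printf "%.1f%% faster\n", (before - after) / before * 100
}'
# → 9.8% faster
```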

      c. Test whether indexes are created automatically:

        Drop the pvlog table and recreate it.

        Load the data: hive (labrador)> LOAD DATA local INPATH '/home/hadoop/pvlog2013-06-04.txt' INTO TABLE pvlog PARTITION(ptdate='2013-06-04');

        Inspect the loaded data:

        hadoop@kooxoo1-155:~$ hadoop fs -ls /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

        -rw-r--r--   3 hadoop supergroup 9674697618 2013-07-04 10:12 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt

        So LOAD DATA neither compresses the data nor creates an index automatically; to get compressed, indexed data you must compress and index it by hand.

        Script to compress and index manually:

       #!/bin/bash
       hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/ /home/caodx/workspace/hadoopscript/lzo-test/
       cd /home/caodx/workspace/hadoopscript/lzo-test/ptdate=2013-06-04

       # Create the lzo-compressed file
       /usr/local/bin/lzop pvlog2013-06-04.txt
       hadoop fs -moveFromLocal /home/caodx/workspace/hadoopscript/lzo-test/pvlog2013-06-04.txt.lzo /home/caodx/lzo-test
       hadoop fs -rmr /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt
       cd /home/caodx/workspace/hadoopscript/lzo-test/

       # Create the index file for the compressed file
       hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/
       rm -rf ptdate=2013-06-04
