MapReduce API Basic Concepts

2016-04-20

1. Serialization

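Serialization is the process of converting a structured object into a byte stream so that it can be transmitted over a network or written to persistent storage. Deserialization is the reverse process of turning a byte stream back into a structured object. In Hadoop MapReduce, serialization serves two main purposes: persistent storage and inter-process communication. To be able to read and store Java objects, the MapReduce programming model requires that the keys and values in user input and output data be serializable. In Hadoop MapReduce, a Java object is made serializable by having its class implement the Writable interface. A key, however, is also the field on which data is sorted, so it must additionally provide a way to compare two key objects. For this reason, the class of a key must implement the WritableComparable interface, whose class relationships are shown in the figure: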

It is defined in the file WritableComparable.java in the package org.apache.hadoop.io:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Now look at the definition of the Writable interface:

public interface Writable {
      /** 
       * Serialize the fields of this object to <code>out</code>.
       * 
       * @param out <code>DataOuput</code> to serialize this object into.
       * @throws IOException
       */
      void write(DataOutput out) throws IOException;

      /** 
       * Deserialize the fields of this object from <code>in</code>.  
       * 
       * <p>For efficiency, implementations should attempt to re-use storage in the 
       * existing object where possible.</p>
       * 
       * @param in <code>DataInput</code> to deseriablize this object from.
       * @throws IOException
       */
      void readFields(DataInput in) throws IOException;
    }

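As the definition shows, write(DataOutput out) serializes the fields of the object into out, and readFields(DataInput in) deserializes fields from in into the object. For efficiency, implementations should reuse the storage of the existing object where possible.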
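To make this contract concrete, here is a minimal sketch of a round trip through write() and readFields(). It assumes a local stand-in interface, SimpleWritable, that mirrors the two methods of Hadoop's Writable so that the example runs with only the JDK. The PageView class, its fields, and the WritableRoundTrip helper are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for org.apache.hadoop.io.Writable (assumption: same two methods),
// declared locally so the sketch compiles without Hadoop on the classpath.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type with two fields.
class PageView implements SimpleWritable {
    int count;
    long timestamp;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Overwrite the fields of this existing object rather than allocating a new one.
        count = in.readInt();
        timestamp = in.readLong();
    }
}

class WritableRoundTrip {
    // Serialize any SimpleWritable into a byte array.
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Fill an existing object from previously serialized bytes.
    static void fromBytes(SimpleWritable target, byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because readFields() fills an object that already exists, a caller can keep one instance and refill it for every record, which is the reuse the Javadoc above recommends.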

The source code of the DataInput interface is as follows:

public
interface DataInput {
   void readFully(byte b[]) throws IOException;

   void readFully(byte b[], int off, int len) throws IOException;

   int skipBytes(int n) throws IOException;

   boolean readBoolean() throws IOException;

   byte readByte() throws IOException;

   int readUnsignedByte() throws IOException;

   short readShort() throws IOException;
   
   int readUnsignedShort() throws IOException;
 
   char readChar() throws IOException;

   int readInt() throws IOException;
   
   long readLong() throws IOException;
   
   float readFloat() throws IOException;
   
   double readDouble() throws IOException;
   
   String readLine() throws IOException;

   String readUTF() throws IOException;
}

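The methods all follow the same pattern: each reads one value of the type named in the method. See the JDK source code for details.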

The source code of the DataOutput interface is as follows:

public
interface DataOutput {
  
    void write(int b) throws IOException;
  
    void write(byte b[]) throws IOException;

    void write(byte b[], int off, int len) throws IOException;
   
    void writeBoolean(boolean v) throws IOException;

    void writeByte(int v) throws IOException;

    void writeShort(int v) throws IOException;

    void writeChar(int v) throws IOException;

    void writeInt(int v) throws IOException;

    void writeLong(long v) throws IOException;

    void writeFloat(float v) throws IOException;

    void writeDouble(double v) throws IOException;

    void writeBytes(String s) throws IOException;
    
    void writeChars(String s) throws IOException;

    void writeUTF(String s) throws IOException;
}
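The JDK classes DataOutputStream and DataInputStream implement these two interfaces. The following sketch shows how the paired methods fit together: values have to be read back in exactly the order in which they were written. The DataStreamDemo class and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataStreamDemo {
    // Writes an int, a boolean and a string, and returns the encoded bytes.
    static byte[] encode(int id, boolean active, String name) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(id);         // 4 bytes, big-endian
            out.writeBoolean(active); // 1 byte
            out.writeUTF(name);       // 2-byte length prefix + modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the three values back in the same order and joins them with commas.
    static String decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int id = in.readInt();
            boolean active = in.readBoolean();
            String name = in.readUTF();
            return id + "," + active + "," + name;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The byte stream carries no type or field information, so the reader has to know the layout in advance. This is why a Writable implementation reads its fields in the same order as it writes them.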

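WritableComparable objects can be compared with one another, usually through a Comparator. In the Hadoop MapReduce framework, any type that is used as a key must implement this interface.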

Here is an example:

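import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
       // Some data
       private int counter;
       private long timestamp;

       // Serialize the fields in a fixed order.
       @Override
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }

       // Deserialize the fields in the same order they were written.
       @Override
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }

       // Order keys by counter, breaking ties by timestamp.
       @Override
       public int compareTo(MyWritableComparable w) {
         int byCounter = Integer.compare(this.counter, w.counter);
         return byCounter != 0 ? byCounter : Long.compare(this.timestamp, w.timestamp);
       }
}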
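The ordering defined by compareTo() is what key sorting relies on. As a rough sketch of that idea, the code below sorts a list of keys with Collections.sort. It assumes a local stand-in interface, SortableKey, with the same shape as WritableComparable so that it runs without Hadoop. EventKey and KeySortDemo are hypothetical names, and the real framework sorts keys inside its shuffle phase rather than with a plain list sort.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for org.apache.hadoop.io.WritableComparable (assumption: same shape).
interface SortableKey<T> extends Comparable<T> {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical key ordered by counter, then by timestamp.
class EventKey implements SortableKey<EventKey> {
    private int counter;
    private long timestamp;

    EventKey(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EventKey other) {
        int byCounter = Integer.compare(counter, other.counter);
        return byCounter != 0 ? byCounter : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public String toString() {
        return counter + "@" + timestamp;
    }
}

class KeySortDemo {
    // Returns the string form of the keys in their natural (compareTo) order.
    static List<String> sortedLabels(List<EventKey> keys) {
        List<EventKey> copy = new ArrayList<>(keys);
        Collections.sort(copy); // uses EventKey.compareTo
        List<String> labels = new ArrayList<>();
        for (EventKey key : copy) {
            labels.add(key.toString());
        }
        return labels;
    }
}
```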

2. The Reporter Parameter

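Reporter is a tool that MapReduce provides to applications. As shown in the figure, an application can use the methods of Reporter to report progress (progress), set a status message (setStatus), and update counters (incrCounter).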

Reporter is a basic parameter. Most of the components that MapReduce exposes, including InputFormat, Mapper, and Reducer, take it as a parameter of their main methods.
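The following sketch illustrates how a map-style method might use such a parameter. It does not use the real Hadoop Reporter. MiniReporter is a local stand-in that is assumed to expose only the three operations named above, and RecordingReporter, LineChecker, and the counter names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the Reporter parameter (assumption: only these three operations).
interface MiniReporter {
    void progress();
    void setStatus(String status);
    void incrCounter(String group, String counter, long amount);
}

// Simple implementation that records every call in memory.
class RecordingReporter implements MiniReporter {
    int progressCalls;
    String status = "";
    final Map<String, Long> counters = new HashMap<>();

    @Override
    public void progress() {
        progressCalls++;
    }

    @Override
    public void setStatus(String status) {
        this.status = status;
    }

    @Override
    public void incrCounter(String group, String counter, long amount) {
        counters.merge(group + "." + counter, amount, Long::sum);
    }
}

class LineChecker {
    // Hypothetical map-style method: classifies one input line and reports on it.
    static void map(long offset, String line, MiniReporter reporter) {
        reporter.progress(); // signal that the task is still making progress
        if (line.trim().isEmpty()) {
            reporter.incrCounter("LineChecker", "EMPTY_LINES", 1);
        } else {
            reporter.incrCounter("LineChecker", "VALID_LINES", 1);
        }
        reporter.setStatus("processed offset " + offset);
    }
}
```

Passing the reporter as a method parameter means the user code never needs to know where the progress, status, and counter information ends up.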

3. The Callback Mechanism

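The callback mechanism is a common design pattern. It exposes a function inside a workflow to external users through an agreed interface, either to supply data to those users or to require them to supply data.
The five components that Hadoop MapReduce exposes (InputFormat, Mapper, Partitioner, Reducer, and OutputFormat) are in fact all callback interfaces. Once the user implements these interfaces as agreed, the MapReduce runtime calls them automatically. As shown in the figure, MapReduce exposes the Mapper interface to the user. After the user implements a MyMapper class according to the logic of the application, the Hadoop MapReduce runtime parses the input data into key/value pairs and calls the map() function on each pair in turn.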
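The callback relationship can be sketched in a few lines of plain Java. This is a toy model, not Hadoop code. LineMapper and MiniRuntime are hypothetical stand-ins for the Mapper interface and the runtime, the key is assumed to be the character offset of each line, and MyMapper is a simple word-emitting implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical callback interface that the "framework" exposes to users.
interface LineMapper {
    void map(long offset, String line, BiConsumer<String, Integer> output);
}

// Plays the role of the runtime: it parses the input into key/value pairs
// (character offset, line) and invokes the user's callback once per pair.
class MiniRuntime {
    static List<String> run(String input, LineMapper mapper) {
        List<String> emitted = new ArrayList<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            mapper.map(offset, line, (key, value) -> emitted.add(key + "=" + value));
            offset += line.length() + 1; // +1 for the newline separator
        }
        return emitted;
    }
}

// User code: contains only application logic and never calls map() itself.
class MyMapper implements LineMapper {
    @Override
    public void map(long offset, String line, BiConsumer<String, Integer> output) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                output.accept(word, 1); // emit (word, 1)
            }
        }
    }
}
```

The user supplies MyMapper, and the runtime decides when and how often map() is called.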

Original article: https://yq.aliyun.com/articles/32243
