How to read multiple image files from HDFS as input in MapReduce?

2022-01-13 00:00:00 hadoop mapreduce java
public class image_mapreduce extends Configured implements Tool {

    private static String[] testFiles = new String[] {
        "img01.JPG", "img02.JPG", "img03.JPG", "img04.JPG", "img06.JPG", "img07.JPG", "img05.JPG"
    };
    // private static String testFilespath = "/home/student/Desktop/images";
    private static String testFilespath = "hdfs://localhost:54310/user/root/images";
    // private static String indexpath = "/home/student/Desktop/indexDemo";
    private static String testExtensive = "/home/student/Desktop/images";

    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        private Text input_image = new Text();
        private Text input_vector = new Text();

        @Override
        public void map(Text key, Text value, OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            System.out.println("CorrelogramIndex Method:");
            String featureString;
            int MAXIMUM_DISTANCE = 16;
            AutoColorCorrelogram.Mode mode = AutoColorCorrelogram.Mode.FullNeighbourhood;
            for (String identifier : testFiles) {
                try (FileInputStream fis = new FileInputStream(testFilespath + "/" + identifier)) {
                    // Document doc = builder.createDocument(fis, identifier);
                    // FileInputStream imageStream = new FileInputStream(testFilespath + "/" + identifier);
                    BufferedImage bimg = ImageIO.read(fis);
                    AutoColorCorrelogram vd = new AutoColorCorrelogram(MAXIMUM_DISTANCE, mode);
                    vd.extract(bimg);
                    featureString = vd.getStringRepresentation();
                    double[] bytearray = vd.getDoubleHistogram();
                    System.out.println("image: " + identifier + " " + featureString);
                }
                System.out.println(" ------------- ");
                input_image.set(identifier);
                input_vector.set(featureString);
                output.collect(input_image, input_vector);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            // String.concat returns a new string and leaves the original unchanged,
            // so the values are accumulated in a StringBuilder instead.
            StringBuilder out_vector = new StringBuilder();
            while (values.hasNext()) {
                out_vector.append(values.next().toString());
            }
            output.collect(key, new Text(out_vector.toString()));
        }
    }

static int printUsage() {
System.out.println("image_mapreduce [-m <maps>] [-r <reduces>] <input> <output>");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}


@Override
  public int run(String[] args) throws Exception {
JobConf conf = new JobConf(getConf(), image_mapreduce.class);
conf.setJobName("image_mapreduce");

// the output keys are image file names (Text)
conf.setOutputKeyClass(Text.class);
// the output values are feature-vector strings (Text)
conf.setOutputValueClass(Text.class);

conf.setMapperClass(MapClass.class);        
//  conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

List<String> other_args = new ArrayList<String>();
for(int i=0; i < args.length; ++i) {
  try {
    if ("-m".equals(args[i])) {
      conf.setNumMapTasks(Integer.parseInt(args[++i]));
    } else if ("-r".equals(args[i])) {
      conf.setNumReduceTasks(Integer.parseInt(args[++i]));
    } else {
      other_args.add(args[i]);
    }
  } catch (NumberFormatException except) {
    System.out.println("ERROR: Integer expected instead of " + args[i]);
    return printUsage();
  } catch (ArrayIndexOutOfBoundsException except) {
    System.out.println("ERROR: Required parameter missing from " +
                       args[i-1]);
    return printUsage();
  }
}



   FileInputFormat.setInputPaths(conf, other_args.get(0));
    //FileInputFormat.setInputPaths(conf,new    Path("hdfs://localhost:54310/user/root/images"));
FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

JobClient.runJob(conf);
return 0;
}


 public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new image_mapreduce(), args);
System.exit(res);
 }

}

I am writing a program that takes multiple image files stored in HDFS as input and extracts their features in the map function. How can I specify the path of the image to read in FileInputStream(some parameters)? Or is there another way to read multiple image files?

What I want to do is: take multiple image files in HDFS as input, extract features in the map function, and reduce iteratively. Please help me with the code or suggest a better way to do it.

Recommended answer

Look into the HIPI library. It stores a collection of images in an ImageBundle, which is more efficient than storing the individual image files in HDFS. It also comes with a couple of examples.

As for your code, you need to specify which input and output formats you plan to use. No built-in input format hands the entire file over to the mapper, but you can extend FileInputFormat and create a RecordReader that emits <Text, BytesWritable> pairs, where the key is the file name and the value is the bytes of the image file.

In fact, Hadoop: The Definitive Guide has an example of this exact input format.
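
Once the mapper receives each image as raw bytes, as the answer suggests, it no longer needs a FileInputStream: the bytes can be decoded in memory with ImageIO. The following is a minimal sketch of that decoding step using only the Java standard library. The class name ImageBytesDemo and the helpers encodePng and decode are hypothetical, and the PNG round trip merely stands in for file contents read from HDFS. In a real mapper the bytes would presumably come from the BytesWritable value, where only the first getLength() bytes of getBytes() are valid, which is why decode takes an explicit offset and length.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import javax.imageio.ImageIO;

// Hypothetical helper class for illustration; not part of Hadoop, HIPI, or LIRE.
final class ImageBytesDemo {

    // Encodes an image as PNG bytes. Here it only stands in for the raw
    // contents of an image file that a RecordReader would read from HDFS.
    static byte[] encodePng(BufferedImage image) throws IOException {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }

    // Decodes `length` bytes starting at `offset` into a BufferedImage.
    // ImageIO.read returns null when no registered reader understands the data.
    static BufferedImage decode(byte[] data, int offset, int length) throws IOException {
        ImageIO.setUseCache(false);
        try (ByteArrayInputStream in = new ByteArrayInputStream(data, offset, length)) {
            BufferedImage image = ImageIO.read(in);
            if (image == null) {
                throw new IOException("Unsupported or corrupt image data");
            }
            return image;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = encodePng(new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB));
        // Simulate a backing array that is larger than the valid data.
        byte[] padded = Arrays.copyOf(bytes, bytes.length + 16);
        BufferedImage decoded = decode(padded, 0, bytes.length);
        System.out.println(decoded.getWidth() + "x" + decoded.getHeight()); // prints 4x3
    }
}
```

The decoded BufferedImage can then be handed to the feature extractor in the same way the question's mapper calls vd.extract(bimg).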
