Working with the Hadoop Distributed File System

$ sudo apt install python python-pip virtualenv
$ virtualenv .snakebite
$ source .snakebite/bin/activate
$ pip install snakebite

This client is not a drop-in replacement for the JVM-based CLI, but it shouldn't have a steep learning curve if you're already familiar with the GNU Core Utilities file system commands.

$ snakebite

snakebite [general options] cmd [arguments]

general options:
  -D --debug                   Show debug information
  -V --version                 Hadoop protocol version (default: 9)
  -h --help                    show help
  -j --json                    JSON output
  -n --namenode                namenode host
  -p --port                    namenode RPC port (default: 8020)
  -v --ver                     Display snakebite version

commands:
  cat [paths]                  copy source paths to stdout
  chgrp <grp> [paths]          change group
  chmod <mode> [paths]         change file mode (octal)
  chown <owner:grp> [paths]    change owner
  copyToLocal [paths] dst      copy paths to local file system destination
  count [paths]                display stats for paths
  df                           display fs stats
  du [paths]                   display disk usage statistics
  get file dst                 copy files to local file system destination
  getmerge dir dst             concatenates files in source dir into destination local file
  ls [paths]                   list a path
  mkdir [paths]                create directories
  mkdirp [paths]               create directories and their parents
  mv [paths] dst               move paths to destination
  rm [paths]                   remove paths
  rmdir [dirs]                 delete a directory
  serverdefaults               show server information
  setrep <rep> [paths]         set replication factor
  stat [paths]                 stat information
  tail path                    display last kilobyte of the file to stdout
  test path                    test a path
  text path [paths]            output file in text format
  touchz [paths]               creates a file of zero length
  usage <cmd>                  show cmd usage

The client is missing certain verbs that can be found in both the JVM-based client and the Golang-based client described above, one of which is the ability to copy files and streams onto HDFS. That being said, I do appreciate how easy it is to pull statistics for a given file.

$ snakebite stat /one_gig

access_time          1539530885694
block_replication    1
blocksize            134217728
file_type            f
group                supergroup
length               1073741824
modification_time    1539530962824
owner                mark
path                 /one_gig
permission           0644

Collecting the same information with the JVM client would involve several commands, and their output would be harder to parse than the key-value pairs above.

As well as being a CLI tool, Snakebite is also a Python library.

$ python

from snakebite.client import Client
client = Client("localhost", 9000, use_trash=False)
[x for x in client.ls(['/'])][:2]

[{'access_time': 1539530885694L,
  'block_replication': 1,
  'blocksize': 134217728L,
  'file_type': 'f',
  'group': u'supergroup',
  'length': 1073741824L,
  'modification_time': 1539530962824L,
  'owner': u'mark',
  'path': '/one_gig',
  'permission': 420},
 {'access_time': 1539531288719L,
  'block_replication': 1,
  'blocksize': 134217728L,
  'file_type': 'f',
  'group': u'supergroup',
  'length': 1073741824L,
  'modification_time': 1539531307264L,
  'owner': u'mark',
  'path': '/one_gig_2',
  'permission': 420}]

Note that I've asked to connect to localhost on TCP port 9000. Out of the box, Hadoop uses TCP port 8020 for the NameNode RPC endpoint; I've often changed this to TCP port 9000 in many of my Hadoop guides. You can find the hostname and port number configured for this endpoint on the master HDFS node. Also note that, for various reasons, HDFS, and Hadoop in general, needs to use hostnames rather than IP addresses.
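Notice that the library reports `permission` as a decimal integer (420 here, where the CLI's stat verb shows octal 0644) and the `*_time` fields as milliseconds since the Unix epoch. A minimal sketch of decoding both into readable values; the `human_entry` helper is my own name for illustration, not part of Snakebite:

```python
from datetime import datetime, timezone

def human_entry(entry):
    """Turn raw Snakebite ls/stat fields into readable values.

    `permission` arrives as a decimal int (420 == 0o644) and the
    time fields as milliseconds since the Unix epoch.
    """
    return {
        'path': entry['path'],
        'permission': oct(entry['permission']),
        'modified': datetime.fromtimestamp(
            entry['modification_time'] / 1000, tz=timezone.utc),
    }

# Fields taken from the /one_gig listing above.
readable = human_entry({'path': '/one_gig',
                        'permission': 420,
                        'modification_time': 1539530962824})
print(readable)
```

This shows the two entries above are ordinary rw-r--r-- files modified in October 2018.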
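The NameNode host and port mentioned above are configured in Hadoop's core-site.xml as the fs.defaultFS property (fs.default.name in older releases). A hedged sketch of pulling them out so they can be handed to a `Client`; the file path varies by distribution, so a sample fragment stands in here:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def namenode_endpoint(core_site_xml):
    """Return (host, port) from fs.defaultFS in a core-site.xml document."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter('property'):
        if prop.findtext('name') in ('fs.defaultFS', 'fs.default.name'):
            url = urlparse(prop.findtext('value'))
            return url.hostname, url.port
    raise KeyError('fs.defaultFS not set')

# Example fragment; on a real cluster read the actual core-site.xml
# (often under /etc/hadoop/, but the location depends on the install).
sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

host, port = namenode_endpoint(sample)
print(host, port)  # localhost 9000
```

The resulting pair can be passed straight to `Client(host, port)` rather than hard-coding the endpoint.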