Install Single Node HDFS to Linux System Like Windows' WSL, for Remote Access

by Trevor Lee in Circuits > Software



60-0_browse_hdfs.png

In this post, I hope to show you the essential steps to install HDFS on a Linux system; here, Windows' WSL will be the target.

There are a few things that must be in place for a successful HDFS installation that allows remote access by IP address, not just localhost:

  • You will need to find out the host IP address of your Linux system. For Windows WSL, the IP address is kind of tricky, since the actual WSL IP address is assigned by Windows, and can change, say, once Windows is rebooted.
  • Make sure your Linux system has SSH support. Not just the SSH client; it needs to run the SSH service as well. And importantly, an SSH key must be set up so that it can SSH into itself without needing to enter a password.
  • Java is needed, since HDFS is written in Java.

Find Out WSL IP Address

00-0_ifconfig.png

Finding the WSL IP address is easy; just run the command ifconfig:

ifconfig

As shown above, you can easily see WSL's IP address listed. Jot it down. You will need this IP address later in the installation, as well as for accessing HDFS.
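
If you prefer a one-liner, the following command (a small sketch, assuming the first address reported is the WSL address) prints just the IP address:

hostname -I | awk '{print $1}'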

Check You Have SSH

00-1_ssh-failure.png

Most likely, WSL will have SSH pre-installed. However, you will not be able to SSH into WSL itself, as shown above, since the SSH service will not be installed.

Without going too much into the details, here are the steps:

  • install openssh
sudo apt install openssh-server
  • edit /etc/ssh/sshd_config, making sure of the following (see the check after this list)
Port 22
ListenAddress 0.0.0.0
PubkeyAuthentication yes
PasswordAuthentication yes
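
After editing, you can optionally sanity-check the configuration before starting the service (assuming sshd is on the sudo path; it prints nothing if the file is valid):

sudo sshd -t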

Now, you should be ready to start up the SSH service.

Start Up SSH Service

00-2_start-ssh-service.png

To start up the SSH service, just enter the following command:

sudo service ssh start
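
To confirm the service is actually running, you can check its status:

sudo service ssh status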

Generate SSH Key

Normally, it is not necessary to use a [client-side] SSH key for SSH login, since you can always enter the password interactively. Nevertheless, for a successful HDFS installation, an SSH key is a must.

Generating an SSH key is easy, since the SSH installation comes with a tool to do exactly this:

ssh-keygen

During the process, several questions will be asked. Just press enter to accept the defaults.

Now, copy the generated SSH key to the [server-side] SSH service

ssh-copy-id 172.22.30.107

Note that the IP address shown above is for my WSL. Yours will be different; use the one you jotted down in the previous step.
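
In case ssh-copy-id is not available, a manual alternative (a sketch assuming the default key file name id_rsa.pub generated above) is to append the public key to your own authorized_keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys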

Check the SSH Setup

00-3_ssh-no-password-successful.png

Not only must WSL be able to SSH into itself, it must be able to do so without entering a password. In other words, SSH key access needs to be set up appropriately.

As shown in the above screenshot, you should now be able to successfully SSH into WSL itself, without being asked for a password.
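
One way to verify this from the command line (a sketch using my WSL IP address; BatchMode makes ssh fail instead of falling back to a password prompt):

ssh -o BatchMode=yes 172.22.30.107 whoami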

Install Java

10-0_check-java-installed.png

You may want to enter the command

java --version

to see if Java is installed. If not, installing it is easy:

sudo apt-get install openjdk-17-jdk

Find Out Where Java Is

10-1_where_java.png

Normally, you will not care where a program is installed. However, for setting up HDFS, you will need to know where Java is installed, and set JAVA_HOME appropriately.

To find out where Java is, you can enter the following command:

readlink -f $(which java)

The location shown will be the location of the Java executable.

e.g. /usr/lib/jvm/java-17-openjdk-amd64/bin/java

JAVA_HOME is just one level up.

e.g. /usr/lib/jvm/java-17-openjdk-amd64

Jot down the JAVA_HOME location.
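
If you would rather let the shell work this out, the following one-liner (a sketch combining the two commands above) prints the JAVA_HOME location directly:

dirname $(dirname $(readlink -f $(which java)))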

Download HDFS

To download HDFS and extract it, follow these steps:

  • download the GZ file from the official Hadoop web site; here, Hadoop version 3.3.4 will be downloaded
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
  • extract the downloaded GZ file
tar zxf hadoop-3.3.4.tar.gz

Afterward, a directory hadoop-3.3.4 will be created with all the Hadoop files. This hadoop-3.3.4 directory is the home for Hadoop / HDFS.

And from now on, the steps will be relative to this directory.
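
For example (the HADOOP_HOME line is just an optional suggestion, not required by the steps below):

cd hadoop-3.3.4
echo "export HADOOP_HOME=$PWD" >> ~/.bashrc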

Setup Hadoop Env

20-0_modify-env.png
20-1_modify-env.png

You will need to modify 2 config files. The first one is

etc/hadoop/hadoop-env.sh

Add a line at the top of it to "define" where JAVA_HOME is:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

Setup Core Site

30-0_modify-core-site.png
30-1_modify-core-site.png

The second one, and also the last one, is

etc/hadoop/core-site.xml

Add the following section inside the <configuration> block

 <property>
   <name>fs.defaultFS</name>
   <value>hdfs://172.22.30.107:9000</value>
 </property>

Again, the IP address above is for my WSL. Yours will be different.

Optionally, Specify the User for the Hadoop Web UI

Like the Linux file system, folders / files are associated with a creator user. By default, for folders / files created via the Hadoop Web UI (a later section will mention it), the creator is "Dr.Who".

To change this default user to match your real identity, you can add to

etc/hadoop/core-site.xml

the following section inside the <configuration> block

 <property>
   <name>hadoop.http.staticuser.user</name>
   <value>trevorlee</value>
 </property>

Of course, you will have a different user name.
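
Putting the two properties together, the whole <configuration> block of etc/hadoop/core-site.xml would look something like this (with my IP address and user name; substitute yours):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.22.30.107:9000</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>trevorlee</value>
  </property>
</configuration>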

Format HDFS

40-0_format_hdfs.png

Before starting HDFS, you will need to format it first (just like you need to format an HDD before use). Enter the command:

bin/hdfs namenode -format

After successful formatting, you should be ready to actually start up HDFS.

Start Up HDFS

50-0_start-dfs.png

Starting HDFS is also just a single command:

sbin/start-dfs.sh
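
To quickly confirm the HDFS daemons are up, you can list the running Java processes with the JDK's jps tool; you should see entries like NameNode, DataNode and SecondaryNameNode:

jps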


Check HDFS Using Browser

60-0_browse_hdfs.png

To check that HDFS is started and running, use your browser to visit the HDFS admin site:

http://172.22.30.107:9870/

And from this admin site, you can get to the Hadoop Web UI for browsing HDFS.
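
If you prefer the command line, a quick way to check that the admin site is responding (a sketch assuming curl is installed; it should print 200):

curl -s -o /dev/null -w "%{http_code}\n" http://172.22.30.107:9870/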

Check HDFS Using Commands

70-0_ls_empty_hdfs.png
70-1_make_test_dir.png

Here are some commands that you can use to check whether HDFS is up and running:

bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
bin/hdfs dfs -mkdir hdfs://172.22.30.107:9000/test_dir
bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
bin/hdfs dfs -rmdir hdfs://172.22.30.107:9000/test_dir
bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
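
Beyond listing and creating directories, you can also try a quick round trip with a small local file (test.txt here is just a throwaway name for illustration):

echo "hello hdfs" > test.txt
bin/hdfs dfs -put test.txt hdfs://172.22.30.107:9000/test.txt
bin/hdfs dfs -cat hdfs://172.22.30.107:9000/test.txt
bin/hdfs dfs -rm hdfs://172.22.30.107:9000/test.txt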


Stopping HDFS

To stop HDFS, it is also just one command:

sbin/stop-dfs.sh

Enjoy!

Of course, there can still be many questions about various aspects of the setup, like

  • How to make the IP address stay the same between Windows reboots?
  • In fact, how to run HDFS as a Windows service?

Sorry, I don't have immediate answers to all of the above. But I do hope that you can successfully set up HDFS on a Linux system for your future experiments / development, especially in Windows' WSL. Enjoy!


Peace be with you! Jesus loves you! May God bless you!

P.S. Accessing WSL From Another Machine

Normally, it is not easy to set up Windows so that what is running in WSL can be accessed from an outside machine. It is not easy, but it is not impossible.

In fact, here I am going to suggest an easy way to do exactly this -- WSL-Port-Forwarding, a Python program that automates it.

Actually, as far as my understanding of it goes:

  • When it starts, it modifies Windows settings to tell Windows to do the port forwarding to the WSL it is running in (see the sketch after this list).
  • When it stops normally, it reverts the settings.
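
For reference, the Windows setting involved is roughly of this kind (a hedged sketch using the built-in netsh port proxy, shown for port 12345; the actual tool may manage it differently):

netsh interface portproxy add v4tov4 listenport=12345 listenaddress=0.0.0.0 connectport=12345 connectaddress=172.22.30.107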

Here is one way to run the Python program:

  • Open Windows Command Prompt (cmd) with Admin privilege.
  • Run wsl in it. Unless you have multiple WSL distributions, there will only be a single instance of it.
  • Run the "port forwarding" Python program -- port_forwarding.
  • That is it. As long as it is running, [any same-port] TCP communication will be forwarded. In other words, if you access port 12345 of your Windows machine, it will be forwarded to port 12345 of your WSL.
  • Keep it running until you no longer want the port forwarding.