Install Single Node HDFS to Linux System Like Windows' WSL, for Remote Access
by Trevor Lee in Circuits > Software
In this post, I hope to show you the essential steps to install HDFS on a Linux system; here, Windows' WSL will be the target.
There are a few things that must be in place for a successful HDFS installation that allows remote access via an IP address, not just localhost:
- You will need to find out the host IP address of your Linux system. For Windows' WSL, the IP address is kind of tricky, since the actual WSL IP address is assigned by Windows, and can change, say, once Windows is rebooted.
- Make sure your Linux system has SSH support. Not just an SSH client; it needs to run an SSH service as well. And importantly, an SSH key must be set up so that it can SSH into itself without needing to enter a password.
- Java is needed, since HDFS is written in Java.
Find Out WSL IP Address
Finding the WSL IP address is easy; just run the command ifconfig:
ifconfig
As shown above, you can easily see WSL's IP address listed. Jot it down. You will need this IP address later in the installation, as well as when accessing HDFS.
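If your WSL image happens not to have ifconfig installed, an alternative (assuming a typical Ubuntu-based WSL distribution) is the hostname command, which prints the assigned IP address(es):
hostname -I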
Check You Have SSH
Most likely, WSL will have SSH pre-installed. However, you will not be able to SSH into WSL itself, as shown above, since the SSH service will not be installed.
Without going too much into the details, here I list the steps:
- install openssh
sudo apt install openssh-server
- edit /etc/ssh/sshd_config, making sure of the following
Port 22
ListenAddress 0.0.0.0
PubkeyAuthentication yes
PasswordAuthentication yes
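Optionally, you can ask the SSH daemon to validate the edited config before starting the service; sshd's test mode reports any syntax errors (assuming sshd is on the PATH when run with sudo):
sudo sshd -t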
Now, you should be ready to start up the SSH service.
Startup SSH Service
To start up the SSH service, just enter the command as follows:
sudo service ssh start
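To confirm the service is actually up, you can check its status (this assumes the service-based init that WSL distributions typically use):
sudo service ssh status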
Generate SSH Key
Normally, it is not necessary to use a [client-side] SSH key for SSH login, since you can always interactively enter a password. Nevertheless, for a successful HDFS installation, an SSH key is a must.
Generating an SSH key is easy, since the SSH installation comes with a tool that does exactly this:
ssh-keygen
During the process, several questions will be asked. Just press enter to accept the defaults.
Now, copy the generated SSH key to the [server-side] SSH service:
ssh-copy-id 172.22.30.107
Note that the IP address shown above is for my WSL. Yours will be different; it should be the one you jotted down in a previous step.
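In case ssh-copy-id is not available, appending the public key manually achieves the same thing (the key file name below is an assumption; depending on the key type you generated, it may be id_ed25519.pub instead of id_rsa.pub):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys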
Check the SSH Setup
Not only must WSL be able to SSH into itself, it has to be able to do so without entering a password. In other words, SSH key access needs to be set up appropriately.
As shown in the above screenshot, you should now be able to successfully SSH into WSL itself, without being asked for a password.
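For example, the following should drop you straight into a shell with no password prompt (again, the IP address is my WSL's; substitute yours):
ssh 172.22.30.107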
Install Java
You may want to enter the command
java --version
to see if Java is installed. If not, installing it is easy:
sudo apt-get install openjdk-17-jdk
Find Out Where Java Is
Normally, you will not care where a program is installed. However, for setting up HDFS, you will need to know where Java is installed, and set up JAVA_HOME appropriately.
To find out where Java is, you can enter the following command:
readlink -f $(which java)
The location shown will be the location of the Java executable.
e.g. /usr/lib/jvm/java-17-openjdk-amd64/bin/java
JAVA_HOME is just the directory above bin, i.e. the path with the trailing /bin/java stripped off.
e.g. /usr/lib/jvm/java-17-openjdk-amd64
Jot down the JAVA_HOME location.
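If you prefer a one-liner, the following sketch (just a combination of the commands above) prints the JAVA_HOME path directly:
dirname $(dirname $(readlink -f $(which java)))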
Download HDFS
To download HDFS and extract it out, the following steps can be followed
- download the GZ file from the official Hadoop web site; here, Hadoop version 3.3.4 will be downloaded
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
- extract the downloaded GZ file
tar zxf hadoop-3.3.4.tar.gz
Afterward, a directory hadoop-3.3.4 will be created with all the Hadoop files. This hadoop-3.3.4 directory is the home for Hadoop / HDFS.
And from now on, the steps will be relative to this directory.
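For example, assuming the tarball was downloaded and extracted in your home directory, change into it first:
cd ~/hadoop-3.3.4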
Setup Hadoop Env
You will need to modify 2 config files. The first one is
etc/hadoop/hadoop-env.sh
Add a line at the top of it to "define" where JAVA_HOME is
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Setup Core Site
The second one, and also the last one, is
etc/hadoop/core-site.xml
Add the following section inside the <configuration> block
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.22.30.107:9000</value>
</property>
Again, the IP address above is for my WSL. Yours will be different.
Optionally, Specify User of the Hadoop Web UI
Like the Linux file system, HDFS folders / files are associated with a creator user. By default, for folders / files created via the Hadoop Web UI (a later section will mention it), the creator is "dr.who".
To change this default user to match your real identity, you can add to
etc/hadoop/core-site.xml
the following section inside the <configuration> block
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>trevorlee</value>
</property>
Of course, you will have a different user name.
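For reference, here is a minimal sketch of what the <configuration> block in etc/hadoop/core-site.xml ends up looking like with both properties (my IP address and user name; substitute yours):
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.22.30.107:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>trevorlee</value>
    </property>
</configuration>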
Format HDFS
Before starting HDFS, you will need to format it first (just like you need to format an HDD before use). Enter the command
bin/hdfs namenode -format
After successful formatting, you should be ready to actually start up HDFS.
Startup HDFS
Starting HDFS is also just a single command:
sbin/start-dfs.sh
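To quickly confirm the HDFS daemons are up, you can list the running Java processes with jps (it ships with the JDK); for a single-node setup you should typically see a NameNode, a DataNode and a SecondaryNameNode:
jps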
Check HDFS Using Browser
To check that HDFS has started and is running, use your browser to visit the HDFS admin site
http://172.22.30.107:9870/
And from this admin site, you can get to the Hadoop Web UI for browsing HDFS.
Check HDFS Using Commands
Here are some commands that you can use to check whether HDFS is up and running:
bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
bin/hdfs dfs -mkdir hdfs://172.22.30.107:9000/test_dir
bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
bin/hdfs dfs -rmdir hdfs://172.22.30.107:9000/test_dir
bin/hdfs dfs -ls hdfs://172.22.30.107:9000/
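Beyond listing and creating directories, you can also try writing and reading back a small file (the file name and path here are just examples):
echo "hello HDFS" > /tmp/hello.txt
bin/hdfs dfs -put /tmp/hello.txt hdfs://172.22.30.107:9000/hello.txt
bin/hdfs dfs -cat hdfs://172.22.30.107:9000/hello.txt
bin/hdfs dfs -rm hdfs://172.22.30.107:9000/hello.txt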
Stopping HDFS
To stop HDFS, it is also just one command:
sbin/stop-dfs.sh
Enjoy!
Of course, there can still be many questions about various aspects of the setup, such as
- How to make the IP address stay the same between Windows reboots?
- In fact, how to run HDFS as a Windows service?
Sorry, I don't have immediate answers to all of the above. But I hope that you can successfully set up HDFS on a Linux system for your future experiments / development, especially in Windows' WSL. Enjoy!
Peace be with you! Jesus loves you! May God bless you!
P.S. Accessing WSL From Other Machine
Normally, it is not easy to set up Windows so that what is running in WSL can be accessed from an outside machine. Not easy, but not impossible.
In fact, here I am going to suggest an easy way to do exactly this: WSL-Port-Forwarding, a Python program that can do it automatically.
Actually, as far as my understanding of it goes:
- When it starts, it modifies Windows settings to tell Windows to do the port forwarding to the WSL it is running in.
- When it stops normally, it reverts the settings.
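As far as I understand, this kind of forwarding can also be set up manually with Windows' built-in netsh portproxy; a rough sketch (run in an Admin Command Prompt; the WSL IP address and port 9870 are just examples) might look like:
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=9870 connectaddress=172.22.30.107 connectport=9870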
Here is one way to run the Python program:
- Open Windows Command Prompt (cmd) with Admin privilege.
- Run wsl in it. Unless you have multiple WSL distributions, there will only be a single instance of it.
- Run the "port forwarding" Python program, port_forwarding.
- That is it. As long as it is running, TCP communication will be forwarded on the same port. In other words, if you access port 12345 of your Windows machine, it will be forwarded to port 12345 of your WSL.
- Keep it running until you no longer want the port forwarding.