Posts

Featured post

Make an Executable Bash Script

Since I am using Windows Subsystem for Linux (WSL), I have to run the same set of commands every time Ubuntu starts, and I thought of automating them with a bash script. Let's see how to create a bash script that can be executed from the command line with a few keystrokes.

1. Create an empty file with a .sh extension.
2. Add #!/bin/bash at the start of the script, followed by the commands you want to execute (see the sample script after this list).
3. Save the file by pressing CTRL+X, then Y, then Enter. The file is not yet executable, which you can verify with the ls -lrt command.
4. Make the script executable by changing its permissions with the chmod command.
5. Verify that the file is now executable by running ls -lrt again; the file name's color in the listing also changes from white to green.
6. Execute the script with ./script.sh
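As a minimal sketch (the commands inside are placeholders, since the original sample file is not shown in this excerpt), script.sh might look like this:

#!/bin/bash
# Commands to run at every Ubuntu/WSL start - placeholders, replace with your own
sudo service ssh start      # example: start the SSH daemon
sudo service cron start     # example: start the cron scheduler

Then make it executable and run it:

chmod +x script.sh   # add the execute permission
./script.sh          # run the script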

Download Learning Spark V2 from O'Reilly for FREE!!! (2021)

Apache Spark™ is a unified analytics engine for large-scale data processing. Spark can run up to 100x faster than Hadoop's MapReduce in memory and up to 10x faster when processing data on disk.

Learning Spark V2 is the best book for learning Apache Spark. It includes the latest updates on new features from the Apache Spark 3.0 release, to help you:

- Learn the Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets
- Inspect, tune, and debug your Spark operations with Spark configurations and the Spark UI
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow
- Use Koalas, the open source pandas framework, and Spark for data transformation and feature engineering

You can get it absolutely for FREE from Databricks!! Go to the link below and fill in your details to download it.

Download Learning Spark V2

Get AWS Certified: Solutions Architect Challenge - 50% Off (Valid until December 4, 2021) (Updated)

Join this challenge and set a goal of earning AWS Certified Solutions Architect - Associate. Follow a recommended preparation path to earn your certification before AWS re:Invent 2021. Get exam ready with free training, including our new series on Twitch, AWS Power Hour: Architecting. By signing up, you'll also be eligible for a 50% discount voucher for the exam. Terms and conditions apply.

Solutions Architect Challenge

This challenge ends December 4, 2021.

Why should you take the challenge? Set and achieve your goal with peers from around the world and support from AWS Training and Certification. You'll get free recommended training, suggested resources, Q&A opportunities, and encouragement along the way. You'll also be eligible for a 50% discount voucher for the exam (terms and conditions apply).

Why should you earn this AWS Certification? Embrace new skills, show your potential, and plan a career trajectory.

Install Apache Sqoop on Ubuntu 20.04 LTS (2021)

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. In this post, let's see how to install Apache Sqoop on Ubuntu.

Step 1: Download Apache Sqoop

Download the Apache Sqoop binary file from the official download page: Apache Sqoop

Note: As of June 2021, the Apache Sqoop project has been retired, since there has been no development after Sqoop version 1.4.7.

Step 2: Untar the Sqoop binary file

Untar and rename the Sqoop binary file using the commands below:

tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop

Step 3: Configure Sqoop

Configure Sqoop by executing the commands below:

cd sqoop/conf
mv sqoop-env-template.sh sqoop-env.sh
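The excerpt stops here; as a rough sketch, the next step would be to edit sqoop-env.sh so Sqoop can find your Hadoop installation. The paths below are assumptions (Hadoop under /usr/local/hadoop); adjust them to match your environment:

# sqoop-env.sh - assumed paths, change to match your install
export HADOOP_COMMON_HOME=/usr/local/hadoop   # Hadoop common libraries
export HADOOP_MAPRED_HOME=/usr/local/hadoop   # MapReduce libraries
# export HIVE_HOME=/usr/local/hive            # optional: only needed for Hive imports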

Install Apache Spark on Ubuntu 20.04 LTS (2021)

Apache Spark™ is a unified analytics engine for large-scale data processing. Spark can run up to 100x faster than Hadoop's MapReduce in memory and up to 10x faster when processing data on disk.

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and scale up to incredibly large data processing workloads.

In this post we will see how to install Apache Spark on Ubuntu 20.04. Before installing Spark, make sure you have Java and Python installed in your system.
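The excerpt is cut off here; as a rough sketch of the usual approach (the Spark version, download URL, and install location below are assumptions; check spark.apache.org/downloads for the current release):

# Download and extract a Spark release into your home directory (assumed version 3.1.2)
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 spark

# Point SPARK_HOME at the install and put its binaries on PATH (e.g., in ~/.bashrc)
export SPARK_HOME=$HOME/spark
export PATH=$PATH:$SPARK_HOME/bin

spark-submit --version   # quick sanity check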

Install Hive on Ubuntu 20.04 with MySQL Integration (2021)

Apache Hive™ is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

Hive internally converts SQL queries into MapReduce jobs, so developers don't have to learn Java programming and can use SQL to query the data in HDFS. All Hive implementations require a metastore where the metadata of all the Hive objects is stored. The metastore must be a JDBC-compliant relational database; by default Hive uses the Apache Derby database as its metastore. Apache Derby is suitable only where an application needs a small embedded RDBMS. In this post, we will see how to install Hive with MySQL as its metastore.

MySQL Installation

Step 1: Installing MySQL server

Update your Ubuntu package list by executing the command below.
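On Ubuntu, the usual commands for this step are:

sudo apt update                  # refresh the package list
sudo apt install mysql-server    # install the MySQL server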

Install Hadoop on Ubuntu 20.04 (2021)

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Step 1: Installing Java

Hadoop is written in Java, so we need to install Java before installing Hadoop, as all the Hadoop daemons run as JVM processes. You can install OpenJDK 8 from the default apt repositories:

sudo apt-get update
sudo apt install openjdk-8-jdk

Once the installation is completed, you can verify it by checking the Java version.
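For example, a quick check of the installed version (the exact output varies with your build):

java -version   # prints the installed OpenJDK version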