
Golang, or Go, is a programming language known for its simplicity, built-in concurrency support, and efficient memory management. Concurrency in Go lets developers write programs that perform multiple tasks at the same time, which makes for faster and more efficient execution.
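As a quick illustration of that concurrency model (the crawler below is sequential, but the same pattern applies if you later want to fetch several pages in parallel), here is a minimal sketch using goroutines and a WaitGroup; the URLs are placeholders:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder URLs, used only to illustrate goroutines.
	urls := []string{"https://example.com/a", "https://example.com/b"}

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()

			// Fetch each URL in its own goroutine.
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println(err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.Status)
		}(u)
	}

	// Wait for all goroutines to finish.
	wg.Wait()
}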

Crawling Amazon.com and storing the data in Elasticsearch using GoLang is a common use case for web crawlers. Elasticsearch is a popular search engine that can be used to store and search large amounts of data, making it an ideal choice for storing the results of a web crawl. In this blog post, we will learn how to build a web crawler for Amazon.com using GoLang and store the data in Elasticsearch.

To get started, we will need to install and set up Elasticsearch on our machine. You can find instructions for doing this in the Elasticsearch documentation.
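By default, Elasticsearch listens on http://localhost:9200; if you are running it elsewhere, adjust the URL accordingly. A quick sanity check from Go might look like this (a rough sketch, not part of the crawler itself):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Ask the local Elasticsearch node for its info endpoint.
	resp, err := http.Get("http://localhost:9200")
	if err != nil {
		fmt.Println("Elasticsearch is not reachable:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("Elasticsearch responded with status:", resp.Status)
}

With Elasticsearch running, we can start writing the crawler. First, we import the packages we will need: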

import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"

	"golang.org/x/net/html"
	"github.com/olivere/elastic"
)

The “fmt” and “net/http” packages are used for printing messages and making HTTP requests, the “io/ioutil” package is used for reading the response from the server, the “strings” and “context” packages are needed for wrapping the response body for the parser and for the Elasticsearch calls, the “html” package is used for parsing the HTML content, and the “elastic” package is used for interacting with Elasticsearch.
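The rest of the code also assumes a simple “Product” struct to hold the extracted fields. A minimal definition might look like this (the field names and JSON tags are illustrative):

// Product holds the data we extract for each listing.
type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
}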

Next, we will define a function to connect to Elasticsearch:

func connectToElasticsearch() *elastic.Client {
	// Set the Elasticsearch URL
	url := "http://localhost:9200"

	// Create a new Elasticsearch client
	client, err := elastic.NewClient(elastic.SetURL(url))
	if err != nil {
		fmt.Println(err)
		return nil
	}

	// Return the client
	return client
}

In this function, we set the URL of our Elasticsearch instance and create a new Elasticsearch client using the “elastic” package. We then return the client to be used in other functions.
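Depending on your Elasticsearch setup, you may also want to make sure the “products” index exists before indexing any documents. A rough sketch using the same “elastic” client (the index name matches the one used later; settings and mappings are left at their defaults):

// ensureIndex creates the "products" index if it does not already exist.
func ensureIndex(client *elastic.Client) error {
	ctx := context.Background()

	// Check whether the index is already there.
	exists, err := client.IndexExists("products").Do(ctx)
	if err != nil {
		return err
	}

	if !exists {
		// Create the index with default settings and mappings.
		_, err = client.CreateIndex("products").Do(ctx)
		if err != nil {
			return err
		}
	}
	return nil
}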

Next, let’s define the main function of our web crawler:

func main() {
	// Set the starting URL
	url := "https://www.amazon.com/products"

	// Make an HTTP GET request to the URL
	response, err := http.Get(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer response.Body.Close()

	// Read the response body
	body, err := ioutil.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err)
		return
	}

	// Parse the HTML content
	doc, err := html.Parse(strings.NewReader(string(body)))
	if err != nil {
		fmt.Println(err)
		return
	}

	// Connect to Elasticsearch
	client := connectToElasticsearch()
	if client == nil {
		fmt.Println("Unable to connect to Elasticsearch")
		return
	}

	// Call the function to extract product data from the HTML
	extractProductData(doc, client)
}

In this function, we set the starting URL of our crawl, make an HTTP GET request to the URL, read the response body, and parse the HTML content using the “html” package. We then connect to Elasticsearch using the “connectToElasticsearch” function and call the “extractProductData” function to extract the product data from the HTML.
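Keep in mind that Amazon.com aggressively blocks automated requests, so a plain http.Get without browser-like headers will often not return usable HTML. If you need more control over the request, you can build it explicitly; the sketch below is one way to do that (the User-Agent string is just an example, and any real crawl should respect robots.txt and rate limits):

// fetchHTML performs a GET request with a custom User-Agent and returns the body.
func fetchHTML(url string) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}

	// Example User-Agent header; adjust to suit your needs.
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; my-crawler/1.0)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	return ioutil.ReadAll(resp.Body)
}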

Next, let’s define the “extractProductData” function:

func extractProductData(n *html.Node, client *elastic.Client) {
	// Check if the node is an element node
	if n.Type == html.ElementNode {
		// Check if the element is a "div" with the class "product"
		if n.Data == "div" {
			for _, attr := range n.Attr {
				if attr.Key == "class" && attr.Val == "product" {
					// Extract the product data from the element
					product := extractProductDataFromElement(n)

					// Index the product data in Elasticsearch
					indexProduct(product, client)
				}
			}
		}
	}

	// Recursively call the function for each child node
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractProductData(c, client)
	}
}

In this function, we use a recursive approach to traverse the HTML tree and search for “div” elements with the class “product”. Whenever we find such an element, we call the “extractProductDataFromElement” function to extract the product data from the element, and then call the “indexProduct” function to index the data in Elasticsearch.
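One detail worth noting: the class attribute can hold several space-separated class names, so an exact comparison like attr.Val == "product" will miss elements such as <div class="product featured">. If you run into that, a small helper along these lines (using the standard “strings” package) makes the check more forgiving:

// hasClass reports whether the node's class attribute contains the given
// class name, treating the attribute value as a space-separated list.
func hasClass(n *html.Node, class string) bool {
	for _, attr := range n.Attr {
		if attr.Key == "class" {
			for _, c := range strings.Fields(attr.Val) {
				if c == class {
					return true
				}
			}
		}
	}
	return false
}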

Here is the “extractProductDataFromElement” function:

func extractProductDataFromElement(n *html.Node) *Product {
	// Initialize a product struct
	product := &Product{}

	// Traverse the children of the element
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		// Check if the child is an element node
		if c.Type == html.ElementNode {
			// Extract the product name
			if c.Data == "h3" {
				for _, attr := range c.Attr {
					if attr.Key == "class" && attr.Val == "name" && c.FirstChild != nil {
						product.Name = c.FirstChild.Data
					}
				}
			}
			// Extract the product price
			if c.Data == "span" {
				for _, attr := range c.Attr {
					if attr.Key == "class" && attr.Val == "price" && c.FirstChild != nil {
						product.Price = c.FirstChild.Data
					}
				}
			}
		}
	}

	// Return the product struct
	return product
}

In this function, we traverse the children of the element and extract the product name and price using the HTML tag names and class attributes. We then store the data in a “Product” struct and return it to the caller.
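In real pages the first child of an element is not always the text you want; it may be whitespace or a nested element. If the simple c.FirstChild.Data approach proves too fragile, a helper that gathers all text beneath a node and trims it is a common alternative (sketch only):

// nodeText concatenates every text node beneath n and trims surrounding whitespace.
func nodeText(n *html.Node) string {
	var sb strings.Builder
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.TextNode {
			sb.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(n)
	return strings.TrimSpace(sb.String())
}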

Finally, let’s define the “indexProduct” function:

func indexProduct(product *Product, client *elastic.Client) {
	// Index the product in Elasticsearch
	_, err := client.Index().
		Index("products").
		Type("product").
		Id(product.Name).
		BodyJson(product).
		Do(context.Background())
	if err != nil {
		fmt.Println(err)
		return
	}

	// Print a message to the console
	fmt.Println("Product indexed:", product.Name)
}

In this function, we use the Elasticsearch client to index the product data in the “products” index and “product” type. We set the ID of the document to be the product name, and the body of the document to be the product struct.
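For larger crawls, sending one indexing request per product is slow. The “elastic” package also exposes Elasticsearch's bulk API, so a batch of products can be indexed in a single request; a rough sketch (the index, type, and ID choices mirror the function above):

// indexProducts indexes a batch of products with a single bulk request.
func indexProducts(products []*Product, client *elastic.Client) {
	bulk := client.Bulk()
	for _, p := range products {
		bulk = bulk.Add(elastic.NewBulkIndexRequest().
			Index("products").
			Type("product").
			Id(p.Name).
			Doc(p))
	}

	// Execute the bulk request.
	_, err := bulk.Do(context.Background())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Indexed", len(products), "products")
}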

Now that we have all of the functions defined, we can run our web crawler. The program will make an HTTP GET request to the starting URL, parse the HTML content, extract the product data from it, and index that data in Elasticsearch.
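Once the data is indexed, you can query it back out with the same client. For example, a minimal match query against the “name” field (the field name follows the Product struct defined earlier) might look like this:

// searchProducts runs a simple match query against the "products" index.
func searchProducts(client *elastic.Client, term string) {
	result, err := client.Search().
		Index("products").
		Query(elastic.NewMatchQuery("name", term)).
		Do(context.Background())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Found", result.TotalHits(), "matching products")
}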

This is just a basic example of how to build a web crawler for Amazon.com using GoLang and store the data in Elasticsearch. With a little modification and customization, you can use this as a starting point to build a more sophisticated web crawler that meets your specific needs.