
Use Go Selenium To Crawl Data

Crawling data

Crawling is a common task in software development. News articles, discount listings, and movie tickets are typical things to crawl. Put simply, crawling means parsing the HTML, reading the markup, and extracting the data. The Go library I usually use for this is goquery.
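For a static page, that whole flow is only a few lines with goquery. A minimal sketch, where the URL and the .headline selector are placeholders rather than a real site:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Download the page (placeholder URL).
    resp, err := http.Get("https://example.com/news")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the returned HTML.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Extract the text of every element matching the selector.
    doc.Find(".headline").Each(func(i int, s *goquery.Selection) {
        fmt.Println(s.Text())
    })
}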

However, crawling the raw HTML does not work in some cases: when the data is loaded by AJAX (reading the HTML, we only see the wrapper, not the data), or when we must log in before we can reach the page we want to crawl.

In this article, let's take crawling Amazon's deals page as the example. On this page, JavaScript makes AJAX calls to fetch the data and then pours it into the DOM. When goquery reads the HTML, we will not see the deal divs that show up when inspecting elements in the browser.

For these cases, I use Selenium to load the page in a real browser, so that I have the fully loaded HTML before extracting data.

Selenium, which runs on the JVM, is well known in automation testing: it lets us script tests that run in a real browser. My approach will be: use Selenium to open the Amazon page, wait for the JavaScript to load, and then crawl the data as usual.

How to set up

First, go to the SeleniumHQ site to download the Selenium standalone server and set it up. Selenium acts as a server, receiving requests sent from our Go code.
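Under the hood this is just HTTP: go-selenium translates every call into a request against the WebDriver wire protocol the server exposes under /wd/hub. Purely as an illustration (you would never write this by hand, the library does it for us), opening a Firefox session boils down to a request like this:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Ask the Selenium server for a new Firefox session, roughly as
    // go-selenium does internally over the JSON wire protocol.
    body := `{"desiredCapabilities": {"browserName": "firefox"}}`
    resp, err := http.Post("http://localhost:8081/wd/hub/session",
        "application/json", strings.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The response carries a session ID that later commands refer to.
    raw, _ := io.ReadAll(resp.Body)
    fmt.Println(string(raw))
}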

To run it, go to the folder containing the jar file and run the command:

java -jar selenium-server-standalone-2.50.1.jar -port 8081

=> We now have a Selenium server running on port 8081. Next, pull in go-selenium with go get:

go get sourcegraph.com/sourcegraph/go-selenium

After that, we need to set up a browser; I choose Firefox. When running locally, we only need Firefox installed on the machine. In contrast, when running on a remote host, we need to install Firefox with a shell script; you can refer to guides on setting up Selenium on Ubuntu 14.04. Done! Now let's code.

Here is the code we need. I define URL_AMAZON_DEAL as the address of the Gold Box deals page:

package main

import (
    "fmt"
    "time"

    selenium "sourcegraph.com/sourcegraph/go-selenium"
)

// URL_AMAZON_DEAL points at Amazon's Gold Box deals page.
const URL_AMAZON_DEAL = "https://www.amazon.com/gp/goldbox"

func main() {
    var webDriver selenium.WebDriver
    var err error
    // Set the browser to Firefox.
    caps := selenium.Capabilities(map[string]interface{}{"browserName": "firefox"})
    // Connect to the remote Selenium server.
    if webDriver, err = selenium.NewRemote(caps, "http://localhost:8081/wd/hub"); err != nil {
        fmt.Printf("Failed to open session: %s\n", err)
        return
    }
    defer webDriver.Quit()

    err = webDriver.Get(URL_AMAZON_DEAL)
    if err != nil {
        fmt.Printf("Failed to load page: %s\n", err)
        return
    }
    // Sleep for a while so the JavaScript can finish loading.
    time.Sleep(4 * time.Second)
    // Get the page title.
    if title, err := webDriver.Title(); err == nil {
        fmt.Printf("Page title: %s\n", title)
    } else {
        fmt.Printf("Failed to get page title: %s\n", err)
        return
    }

    // Find the container that holds the deal cards.
    var elem selenium.WebElement
    elem, err = webDriver.FindElement(selenium.ByCSSSelector, "#widgetContent")
    if err != nil {
        fmt.Printf("Failed to find element: %s\n", err)
        return
    }

    // Find the first deal card inside the container.
    var firstElem selenium.WebElement
    firstElem, err = elem.FindElement(selenium.ByCSSSelector, ".a-section .dealContainer")
    if err != nil {
        fmt.Printf("Failed to find element: %s\n", err)
        return
    }
    // Get the deal image.
    image, err := firstElem.FindElement(selenium.ByCSSSelector, "img")
    if err == nil {
        img, _ := image.GetAttribute("src")
        fmt.Println(img)
    }
}
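One thing worth noting: the fixed four-second sleep is fragile. If the network is slow, the deals may not have rendered yet; if it is fast, we wait longer than necessary. A more robust approach is to poll for the element until it appears. Here is a sketch of such a helper (waitForElement is my own name; go-selenium has no built-in explicit wait):

// waitForElement polls for a CSS selector until it matches an element
// or the timeout elapses.
func waitForElement(wd selenium.WebDriver, cssSelector string, timeout time.Duration) (selenium.WebElement, error) {
    deadline := time.Now().Add(timeout)
    for {
        elem, err := wd.FindElement(selenium.ByCSSSelector, cssSelector)
        if err == nil {
            return elem, nil
        }
        if time.Now().After(deadline) {
            return nil, fmt.Errorf("timed out waiting for %q: %s", cssSelector, err)
        }
        time.Sleep(250 * time.Millisecond)
    }
}

With this helper, the time.Sleep and FindElement pair above could become a single waitForElement(webDriver, "#widgetContent", 10*time.Second). For now, the fixed sleep keeps the example simple.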

Run the code, and we get:

Page title: Gold Box Deals | Today's Deals - Amazon.com
https://images-na.ssl-images-amazon.com/images/I/51eU5JrGAXL._AA210_.jpg

Well, we got all the information we needed.
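The example above only reads the first deal card. To collect every deal on the page, FindElements returns all matches and we can loop over them. A sketch reusing the same selectors (which, of course, may break whenever Amazon changes its markup):

    // Collect the image URL of every deal card, not just the first.
    deals, err := elem.FindElements(selenium.ByCSSSelector, ".a-section .dealContainer")
    if err != nil {
        fmt.Printf("Failed to find deal containers: %s\n", err)
        return
    }
    for _, deal := range deals {
        image, err := deal.FindElement(selenium.ByCSSSelector, "img")
        if err != nil {
            continue // some cards may have no image
        }
        if src, err := image.GetAttribute("src"); err == nil {
            fmt.Println(src)
        }
    }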

Conclusion

Above is what I have learned from running into crawling problems while developing software, here with the Go programming language. Selenium also helps in other cases, such as pages that require a login or pages that present a captcha. If anyone has other experiences, I hope to hear from you.