TL;DR: selenium-java (the tool I learned most recently) vs. beautifulsoup4 (the one I have the most experience with) vs. scrapy (some experience, mediocre ability). Which should I use for a web scraping assignment if I'm allowed to use any of them?

We were explicitly told we can use anything we know from class or self-study for the group project (with a slight bonus for self-study implementations). Would it be OK/fair for me to use beautifulsoup4 or scrapy to pull the data from the assigned site, rather than the selenium-java we were taught in class?

If I used bs4 or scrapy, my group wouldn't be able to edit the script if needed. But data collection is only a small (if immensely important) part of the assignment, and I'd have the bs4 script done a lot quicker than a Selenium one, since I learned Selenium more recently (for class) and have less experience with it.

Comments
  • 3
    I almost always go to java-selenium for this kind of thing, unless I need something quick and dirty; then I'll just abuse file_get_contents() in PHP, or cURL if I need to carry sessions around.

    Selenium isn't that hard to work out, as long as you remember that elements must be in the browser's view or it usually has a heart attack.
  • 5
    @C0D4 I would always go for Python and bs4 because I don't want the weight of spinning up a headless browser just to pull some text from a page.
  • 2
    @vorticalbox I get that. But for a large site being tested, or if the scraping will run often and involve other people, I'll happily spend the time doing it.

    As I said, for a quick and dirty scrape I wouldn't bother; I'd just use something I can write in 10 min and let it rip.
  • 1
    @C0D4 Makes sense, though there are things like TestCafe.
  • 1
    @vorticalbox Other than being Node-based, at first glance it looks like it works basically the same way as Selenium. Maybe I'm missing something, but how does it differ from Selenium?
  • 0
    @C0D4 It doesn't, as far as I know. We used it once at work for testing in code pipelines, and the new studio looks nice.
  • 2
    @C0D4 ++ for file_get_contents()

    The most underrated function in PHP.
  • 1
    Python or Perl are my go-tos.
  • 1
    bs4. You really don't need a webdriver for scraping.
  • 1
    @yellow-dog Without webdriver, your solution will fail on sites using a bunch of JS.
  • 2
    @hitko Oh WOW, my generic advice for most websites fails on a specific type of website?? Unbelievable!!! How could that happen???? What's next, you gonna tell me you don't like my left foot shoved all the way up your rectum??? Imagine my shock!
  • 0
    @yellow-dog Seems like you greatly underestimate the percentage of those websites.
  • 0
    @hitko Maybe, but I've been scraping for years and have only needed a headless browser for one site.
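
For anyone weighing the two sides of the webdriver argument above, here is a rough sketch of what a parser-only scrape can and cannot see. It uses Python's standard-library html.parser as a stand-in so it runs with nothing installed; bs4 would give the same result with less code. The page content, the `price` class, the `app` id, and the `TextByKey` helper are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: one value is present in the HTML itself,
# the other is only filled in by JavaScript after the page loads.
PAGE = """
<html><body>
  <p class="price">19.99</p>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "loaded by JS";</script>
</body></html>
"""


class TextByKey(HTMLParser):
    """Collect the text directly inside elements, keyed by their id or class."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per open tag: its id/class, or None
        self.found = {}   # key -> text seen directly inside that element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        key = attributes.get("id") or attributes.get("class")
        self.stack.append(key)
        if key:
            self.found.setdefault(key, "")

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only record text whose innermost open element has an id or class.
        if self.stack and self.stack[-1]:
            self.found[self.stack[-1]] += data.strip()


parser = TextByKey()
parser.feed(PAGE)
found = parser.found

print(repr(found["price"]))  # static text, visible to a plain parser
print(repr(found["app"]))    # stays empty: the script is never executed
```

The static value comes back fine, while the JS-filled element is empty because nothing ever runs the script. That is the case where a headless browser (Selenium or similar) earns its weight; for pages that ship their data in the HTML, a parser alone is enough.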