19
AleCx04
20d

Writing the new software dev test for our incoming interview process.

Me: And here is where we ask them to parse HTML with regex.

Lead developer: You are fucked up and the villain of this movie, multiverse and everything in between, fk u.

CMS Admin: And I thought Palpatine was evil. That is legit fucked up, fk u.

Comments
  • 8
    If you do that routinely in your job, it's good to have on the test.

    That way people know to decline a second stage interview. 😋
  • 3
    You evil son of a bitch!

    But I have a valid production use case for it 😆
  • 1
    @C0D4
    Consider Puppeteer instead if it's at all dynamic. It's a little slower than Perl, but a lot more reliable and correctable.
  • 1
    @SortOfTested a mix of csv and xml files with HTML + all kinds of shit encodings inside it.

    The joys of "enriched" data 🥲
  • 1
    Nothing wrong with extracting information embedded in HTML using regular expressions.
    It works for almost all cases where structured data is to be extracted from automatically generated pages.

    But XPath obviously is the better tool for that - so use it if available.
  • 1
    @Oktokolo XPath works on XML, but HTML is not necessarily XML.

    Regex can be used with HTML if your goal is not a full parse but only to find some known sub-patterns. For any serious parsing, though, I always use a dedicated library; those often include XPath-like queries or similar.
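
To make the "known sub-patterns" case concrete, here is a minimal sketch in Python. The markup, the `price` class name and the `PRICE_RE` pattern are all made up for illustration; it assumes an auto-generated page whose markup around the value never varies.

```python
import re

# Hypothetical snippet of an auto-generated page (an assumption for this sketch).
page = (
    '<li class="item"><span class="sku">AB-1042</span>'
    '<span class="price">$19.99</span></li>\n'
    '<li class="item"><span class="sku">CD-7</span>'
    '<span class="price">$4.50</span></li>\n'
)

# Match only the exact, known text around the value; this is not an HTML parse.
PRICE_RE = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')

prices = PRICE_RE.findall(page)
print(prices)
```

This only holds while the generator keeps emitting exactly that markup. An extra attribute or a reordered class list would make the pattern silently match nothing.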
  • 0
    @Oktokolo Neither of those things works with HTML.
  • 1
    @C0D4
    Existence is pain, amirite?
  • 0
    @junon
    They both worked fine when I used them some years ago.

    Of course, regexps work just fine for data extraction if the data is surrounded by known text - whether that text is also HTML doesn't matter.

    And of course do XPaths work on HTML after you feed that HTML to an XML parser which isn't a bitch about parsing HTML (which in practice is almost just less strict XML without a DTD anyways).
    PHP's DOMDocument has a loadHTML method for parsing HTML. And it supports XPaths.
    For Python, there is lxml and its etree - also parsing HTML and supporting XPaths.

    Especially when your HTML has already been preprocessed by a browser (like when you bot with Selenium and store DOM snapshots in a DB for later processing), both regexps and XPaths work very well on HTML.
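
A rough sketch of the "lenient HTML parser first, XPath second" idea, deliberately limited to Python's standard library rather than the lxml or DOMDocument APIs mentioned above. The `HtmlTreeBuilder` helper, the `parse_html` function and the sample snapshot are hypothetical. It assumes well-nested markup (as in a browser-serialized DOM snapshot) and relies on the limited XPath subset that `xml.etree.ElementTree` supports.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Void elements never get a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Hypothetical helper: turns well-nested HTML into an ElementTree."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._builder.start("root", {})  # synthetic wrapper element

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive with value None.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def tree(self):
        self.close()  # flush any buffered input
        self._builder.end("root")
        return self._builder.close()


def parse_html(markup):
    parser = HtmlTreeBuilder()
    parser.feed(markup)
    return parser.tree()


# Unquoted attributes and unclosed <img>/<br> would be rejected by an XML parser.
snapshot = ('<div id=items><img src=a.png>'
            '<span class="price">9.99</span><br>'
            '<span class="price">4.50</span></div>')

root = parse_html(snapshot)
prices = [el.text for el in root.findall(".//span[@class='price']")]
print(prices)
```

This sketch does none of the error recovery a real HTML parser performs (implied end tags, misnested elements), so it is only meant for input that is already well nested.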
  • 0
    @Oktokolo

    > if the data is surrounded by known text

    That's not the original thesis.

    > And of course do XPaths work on HTML after you feed that HTML to an XML parser

    HTML is not XML. They look similar but are governed by entirely different standards. Valid HTML doesn't always parse correctly with an XML parser.

    https://stackoverflow.com/a/...

    Just because it worked fine for you anecdotally doesn't mean it's the correct solution.
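
A small illustration of that point using only Python's standard library. The sample markup and the helper names `parses_as_xml` and `TagCollector` are made up for this sketch. An unclosed `<br>` is valid HTML, yet a strict XML parser rejects it, while an HTML-aware parser reads it without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VALID_HTML = "<p>First line<br>second line</p>"   # <br> needs no closing tag in HTML
XHTML_STYLE = "<p>First line<br/>second line</p>"  # self-closed, so also well-formed XML


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records the start tags seen by the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(VALID_HTML)
collector.close()

html_as_xml_ok = parses_as_xml(VALID_HTML)
xhtml_as_xml_ok = parses_as_xml(XHTML_STYLE)
print(html_as_xml_ok, xhtml_as_xml_ok, collector.tags)
```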
  • 0
    @junon
    >> if the data is surrounded by known text
    >That's not the original thesis.

    It is in my only previous post in this thread:
    "Nothing wrong with extracting information embedded in HTML using regular expressions."
    You obviously need to know which information to extract and where to find it. But then, regexps work fine most of the time.

    > HTML is not XML.
    Doesn't matter if the parser explicitly offers HTML parsing too (which my examples both do).
    It isn't hard to make an XML parser also accept canonical HTML. So non-bitchy parsers go that extra mile and are then actually useful in a world which now mostly prefers JSON over XML for data transfer and storage.
    HTML and XML are that similar - we even had websites delivering XHTML for some years...
  • 0
    @Oktokolo You're moving the goalposts.

    Your original comment stated that you can extract HTML data (assumedly the stuff inside tags) with regex. You cannot; not reliably.

    Also, the thesis was XPath, which by itself is an XML specification. If an XML parsing library also includes an HTML parsing library, that's fine - but that's still using a different parser and they are NOT the same thing and NOT the same standards.

    Munging words and getting things half-right isn't the same as being accurate.
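
One way to see the "not reliably" part, sketched in Python with made-up markup: as soon as the element you want contains another element of the same name, a non-greedy pattern stops at the first closing tag it finds.

```python
import re

# Hypothetical markup: the outer <div> contains a nested <div>.
html = '<div class="outer"><div class="inner">a</div>b</div>'

# Looks reasonable, but regexes cannot count nesting depth.
match = re.search(r'<div class="outer">(.*?)</div>', html)
captured = match.group(1)
print(captured)
```

The capture ends at the inner `</div>`, so the trailing `b` is lost and the result is not even balanced markup. Switching to a greedy `.*` only moves the problem: with two sibling outer blocks it would swallow both.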
  • 0
    Damn, all I was trying to do was make fun of an interaction at work :P y'all need to dev down for a bit homies
  • 0
    @junon
    > Your original comment stated that you can extract HTML data (assumedly the stuff inside tags) with regex.

    My post is still there. It literally starts with "Nothing wrong with extracting information embedded in HTML using regular expressions."
    Information embedded in HTML clearly is extractable using regexps most of the time.
    No goalpost moving required - so stop trying!

    > Also, the thesis was XPath, which by itself is an XML specification.

    So you aren't okay with me using it on HTML DOMs?
    Well, I could make DOM trees from YAML or even dirty PHP arrays and use XPath on them and you couldn't do anything about it.

    XPath literally is the only good thing that emerged from that XML hell and I will continue using it where and for what I want - especially when it is as well suited for the task as it is when extracting information from HTML.
  • 1
    @Oktokolo The problem with using an XML parser on HTML is that you might end up with elements in the wrong place in the hierarchy, since XML does not have the rules for how to fix broken elements.

    Forms, for example, can result in very strange trees, with some of the input elements outside of the form element.

    Depending on what you are trying to do this might not matter, but your claim will break in many cases, which means that without context for exactly what you are trying to do, your statement is false. It could fool someone into trying the same thing and ending up with a solution that fails intermittently, and that's why we disagree.

    As I explained in my comment, for certain well-known HTML sources it can work, since you might know that they will not contain any invalid elements, or that any errors that might occur do not affect your requirements.

    But unless your source is XHTML, you cannot use an XML parser for HTML without a full understanding of exactly how it might break.
  • 0
    I have a valid use case for parsing JS with regex in production though. Should consider that in the next interview 😈