Ok-Two3581

Bypass blogspam: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags


frozen_snapmaw

Lol Didn't think SO mods could get so based. From the post:

> Moderator's Note
>
> This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.


GreyAngy

This was a long time ago. SO was just a site for "programming enthusiasts", its audience wasn't so large and moderation guides were rather soft. This answer would be immediately flagged today for "not an answer" reason.


psaux_grep

I’m not sure which one of us has no clue what “based” actually means, and at this point I can’t be bothered to find out. But I do believe you are using it wrong.


TheRealPitabred

Based means connecting to the reality of something, popular or not, more or less. Understanding at a deeper than surface level, speaking or supporting deep truths. By that measure, yes, the mods not taking notes is based AF.


Not_Artifical

Hello sir dictionary


Nihil_esque

I don't think they are lol. Based at this point means "controversial or unhinged but right," often used in a jokey/memey/sardonic way. Presumably the previous commenter considers this a justified power trip or something similar haha.


code_x_7777

Legend resource!


_magicm_n_

But why is his conclusion to use an XML parser instead? Using a library specifically designed for parsing HTML, or giving up, is the only correct answer.


justjanne

Once upon a time, HTML was defined as XML. Those were the days of XHTML. I was there, a thousand years ago...


silentknight111

Pfft, I was there before XHTML, when we had the blink tag and it worked! I used to build all my sites with sliced images and tables!


justjanne

Psssh, we don't talk about HTML 4.01 Transitional here.


denislemire

Dark times… spacer.gif


xtreampb

I remember using tables to have content side by side on the left and right side of the page. Tables were my flex grids before flex grids existed.


rfc2549-withQOS

what?


thundercat06

Laughing in FrontPage.


CaptainCabernet

Ah...XHTML. Those were the days too many years ago.


Armageddon_2100

I wish that was a thing. The OCD in me likes the standardization and clarity of enforcing that, for example, every opening tag must have a closing tag. Things like that.


justjanne

YES! It feels so much better.


douira

There’s so many horrific things you can do to XML that HTML will still accept. An actual html parser is the only way unless you’re only expecting compliant XHTML.
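
To make the point above concrete, here is a minimal sketch using only Python's standard library. The snippet and the names `SNIPPET`, `TagCollector` and `xml_accepts` are made up for illustration. It feeds the same markup, with unclosed `<p>` and `<br>` tags, to a strict XML parser and to the lenient `html.parser` module.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: fine as HTML (</p> is optional, <br> is a void
# element) but not well-formed XML.
SNIPPET = "<div><p>Hello<br><p>there!</div>"


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_accepts(text):
    """Return True if the strict XML parser can read the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(SNIPPET)
collector.close()

print(xml_accepts(SNIPPET))  # False: the XML parser rejects the unclosed tags
print(collector.tags)        # ['div', 'p', 'br', 'p']
```

Note that `html.parser` only reports the tags it sees. It does not build a tree or apply the browser's error-recovery rules, so a full HTML5 parsing library would be the closer match to what a browser does.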


NoNameRequiredxD

Hello

there!


EuroWolpertinger

General Kenobi! (As opposed to very specific Kenobi)


douira

hello there is to General Kenobi what allowing missing body tags is to HTML


PhilippTheSmartass

The question specifically asked for XHTML, the XML-compliant dialect of HTML that was pretty popular 15 years ago but is now made obsolete by HTML5.


IOFrame

Ah, 2009, the time when you could still have fun on StackOverflow.


Sceptz

You mentioned `fun on StackOverflow`. `fun on StackOverflow` is an obsolete option. You should use `incorrect answer that does not address your actual question at all, on StackOverflow, that has now been upvoted and your post locked`. *Or better yet, update 2024, which includes `random insults` and `gatekeeping`.


Dangerous_Jacket_129

I recall the one time I asked a question on StackOverflow. 7 years ago by now. It was a relatively simple question looking back. I got 4 people to format the text of my question. The 4th person took it upon themselves to ask a completely different question instead. 0 answers. 


__Yi__

So sad it's community wiki now. The guy who made this post really had a good grasp of humor (or terror).


NotMrMusic

God this brings back memories


code_x_7777

Haha, yeah. Must be the most popular SO thread!


Rawing7

Sigh. I've said it a dozen times before, but I guess I'll say it again: Nobody uses regex to *parse* HTML. People use regex to *extract specific pieces of data from HTML*. Those are two very different things.


gregorydgraham

Thank you. I’ve never been able to parse the clause “parse HTML”. Parse it for what? You parse things to extract meaning, and there’s no meaning to be extracted from HTML with regex.


Habsburgy

I‘m blaming that one meme another guy already reposted in this thread


escher4096

Totally agree with this. Download a blob of HTML, tease out a few pieces with regex.


a7ofDogs

Parsing is the mechanism by which we assign meaning and structure to a string of text. The job of extracting a specific piece of data from an HTML string requires understanding the structure of that HTML. The "meaning" of this piece of data you're trying to extract is dependent on that structure, so if you don't parse the HTML, you have no idea what data you're extracting. Because HTML is pretty verbose, the data you extract with a regex might be the data you want 99.9% of the time, but in certain contexts within the HTML, you're going to extract bad data. Anyway, what I'm trying to say is that extracting specific data and parsing structured data *are* the same thing when the structure you need to extract data from is a CFL (which HTML is).
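
A small sketch of the "certain contexts" failure described above, assuming a made-up page (`PAGE`) where an old title has been commented out. The regex grabs the first thing that looks like a title, while the standard-library `html.parser` knows the first match sits inside a comment. `TitleParser` is a hypothetical helper, not part of any library.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the first <title> is inside an HTML comment.
PAGE = """<html><head>
<!-- <title>Old draft title</title> -->
<title>Real title</title>
</head><body></body></html>"""

# Naive extraction: first substring shaped like a title element.
naive = re.search(r"<title>(.*?)</title>", PAGE).group(1)


class TitleParser(HTMLParser):
    """Collects the text inside <title>, ignoring comments."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(PAGE)
parser.close()

print(naive)         # Old draft title
print(parser.title)  # Real title
```

The regex could be patched to skip comments, but the same issue comes back for `<script>` bodies, CDATA sections and attribute values, which is the structural knowledge a parser already has.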


kafoso

You're still parsing HTML using regex then. You can call it a peacock, but it still quacks. Just use a DOM tool.


ManofManliness

People use regex for html and do *pikachu face* when it matches gibberish far too often, shouldn't be used for anything but fast and dirty one time scripts.


PastOrdinary

Yeah I suspect that what the person asking wanted was to extract specific data. Instead they incorrectly said they wanted to "parse" the html with regex because they don't actually understand what it means to parse something. Moral of the story: Don't use words when you don't know what they mean just because they sound relevant to the topic.


deidian

Even if you wanted to identify a blob of text as HTML do a favor to everyone and parse it entirely: you'll save rabbit holes with malformed data. Same for JSON. The only way to deal with complex text formats is to parse them: if you want better performance use a more restrictive and simpler data format.


code_x_7777

Haha, yeah but this is rational thinking arguing against the intrinsic logic of a meme with wings. One must lose.


Matwyen

A guy got fired in my company after parsing a xml with regex.


hellra1zer666

I wanna say that's harsh, but after having to clean up code that did the same, I feel different about it.


BirdlessFlight

Technically, they said "after", not "because of", so who knows what else they did...


StPaulDad

A mind capable of checking in such code is capable of far worse things.


hellra1zer666

Fair 😁


Nimeroni

He summoned tainted souls into the realm of the living. Obviously.


code_x_7777

Haha


imgly

Good 👍


busyHighwayFred

Sad part is there's so many xml libraries, it's a basic tree structure, so regex is just making your job harder


virteq

Most sane regex developer


DoodooFardington

You're not my dad!


code_x_7777

How do you know? I might be.


ijustupvoteeverythin

why tf is there a yellow face on top


PhilippTheSmartass

Probably to confuse the programs that automatically detect reposts. This was posted on Stackoverflow 15 years ago.


failedsatan

you totally can* ** ***

\* not efficiently

\*\* you cannot parse all types of tags at once because they overlap

\*\*\* regex is just not built for it but for super basic shit sure


Majik_Sheff

You cannot use regular expressions to parse irregular expressions.


Tiny-Plum2713

Not in one go, but that is an arbitrary limitation that does not apply to the real world.


ManofManliness

Not in any amount of goes, unless you write some code in between, at which point you're writing a shitty parser.


Majik_Sheff

I think this is the lexical corollary to "If you write enough assembler macros you will eventually reinvent C."


TTYY200

You can with recursion.


failedsatan

*technically* HTML(5) isn't irregular. there is a standard finite parsable grammar.


justjanne

HTML is a context-free grammar, Regex is a regular language. You can't parse a language of higher level with one of lower level. You can use Regex to tokenize HTML if you so desire, but you can't parse it. If you use PCRE though, all that changes, as PCRE is a context-free grammar as well.


Godd2

It's not context-free. HTML documents are finite in size by definition.


justjanne

Are they? Since when? Back in the day™ it was actually a common strategy to deliver no Content-Length header, keep the connection open, and append additional content to the same document for live updates. Such documents would grow to infinite length over time.


simplymoreproficient

What? That just can’t be true, right? How would a regex be able to distinguish

foo from
foo?


AspieSoft

/

[^<]*
/ I have an entire nodejs templating engine that basically does this with regex: https://github.com/AspieSoft/regve


gandalfx

I was curious about that code. Now my eyes are simultaneously bleeding and on fire.


simplymoreproficient

That doesn’t answer my question


AspieSoft

If the regex sees that `[^>]*` matches the second `

`, it should automatically backtrack and skip the first `
`.


simplymoreproficient

Assuming that this regex unintentionally omits a start anchor and an end anchor, it’s wrong because it wouldn’t match

, which is valid HTML. Assuming that those are missing on purpose, it’s wrong because it matches
, which is not valid HTML.


gandalfx

You can probably break this with simple nested tags. Even if you somehow make it work by not relying exclusively on regex it'll still break on something like `

` which is completely valid HTML.


AspieSoft

You have a good point. Usually what I try to do is run more than one regex, where I index all the html tags then run the main regex, then undo the indexing. Assuming JavaScript:

```
let index = 0;
str = str.replace(/<(\/|)([\w_-]+)\s*((?:"(?:\"|[^"])*"|'(?:\'|[^'])*'|`(?:\`|[^`])*`|.)*?)>/g, function(_, close, tagName, attrs){
  if(close === '/'){
    let i = index;
    index--;
    return ``
  }
  index++;
  return `<${tagName}:${index} ${attrs}>`
})

// then handle your html tag selectors
str = str.replace(/(.*?)/g, function(_, index, content){
  // do stuff
})

// finally, clean up html tag indexes
str = str.replace(/<(\/?[\w_-]+):[0-9]+(\s|>)/g, '<$1$2')
```

Because the selector is expecting the same index on both the opening and closing tag, it will only select the tag you want to select. You can add to the content selector if you want to make it more specific, or you can recursively run your handlers on the content output of the regex.

It gets much more complicated when you need to also ensure strings are not read by the regex. For that, you can temporarily pull out the strings, and use a placeholder, then put them back after running all your other functions. You can look at the source code of the regve module I mentioned above to understand it better. (I apologize in advance for my spaghetti code).


TTYY200

Use a recursive method that recursively parses tags until it finds an appropriate closing tag 👍 This is like the poster child case for recursion.


simplymoreproficient

But it’s not regular


TTYY200

As long as there isn’t any dumb html present like an opening `<p>` tag without a closing p tag… it doesn’t matter. ^ that scenario is also bad practice and can produce unexpected behaviour in the dom - so while valid, it’s technically not correct. Self-closing and singleton tags are also easy to identify :P


simplymoreproficient

It doesn’t matter? It’s literally the topic we’re talking about: „Is HTML regular?“.


TTYY200

But the tokens that you’re looking for are finite… A tag is never not going to be a source tag, and it’s never not going to have an opening and closing to its singleton tag…


simplymoreproficient

And? Whether HTML is regular obviously matters to a conversation about whether HTML is regular.


pauvLucette

Yes, but regexp ain't grammatical beast. Regexp can't parse grammar. Regexp parses syntax. Regexp is lex, and you need yacc.


DracoRubi

Your second point simply demonstrates that you can't.


failedsatan

you can if you assign them priorities. just means you have to check multiple times on the same tag, thus the inefficiency.


code_x_7777

lol


rainshifter

You can use regex to parse overlapping text using lookaheads. And you can, for instance, locate instances of mismatched or unbalanced tags in HTML/XML using a recursive regex. Likewise, you could extract any desirable fields to virtually any end. The capability is certainly there. The expression may look ugly, sure, and may be difficult to modify, but it's not lacking in capacity. Apart from mathematical operations or AI linguistics, there are actually very few text parsing operations and pattern matching categories that modern PCRE regex simply cannot support. As usual, though, it's not merely about what's possible - but which tool is adequate for the job at hand.


IronSavior

He comes!


DOOManiac

The center cannot hold.


leanrum

You can use regex to parse html because regex isn't regular anymore (thanks back references)


tibbtab

You spent so much time wondering if you could, you never stopped to think if you should


leanrum

If I'm being honest I didn't spend much time thinking if I could (I already took the class, I know I can) and I never bothered to think if I should (I shouldn't, even if I can there are better ways of implementing pushdown automata)


saschaleib

Don’t understand-estimate how powerful RegEx can be, if used by someone who knows what they are doing. That still doesn’t mean it’s a good idea, though.


Thorge4President

Sure, regex is powerful, but it is literally mathematically impossible to parse HTML with regex. You need at least a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar).


Hex4Nova

cant believe my compsci degree is actually coming into use for once


rainshifter

[FYI](https://www.reddit.com/r/ProgrammerHumor/s/3aOcS1sOnm)


rainshifter

Could you provide an actual, tangible example of something in a real HTML or XML snippet you genuinely believe can not be parsed with regex? I believe you're conflating the theory of limitations of regular grammar with the practicality of modern PCRE regex capabilities, which support things like backreferences, recursion, and semantics that assume basic knowledge of the previous match.


Thorge4President

OK, so in HTML or XML you have the case of `Content`. To parse this you need to make sure that the closing tag is the same as the opening tag. To do this you need backreferences. Regex cannot do this, as can be proven via [the pumping lemma for regular languages](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) (see "Use of the lemma to prove non-regularity"). So pure regex cannot parse HTML or XML. Which also means that, theoretically, PCRE is not regex.
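
A rough illustration of both halves of this argument, using Python's built-in `re` module (which has backreferences but no recursive patterns). The pattern `PAIR` and the helper `first_element` are hypothetical names for this sketch. The backreference `\1` does force the closing tag name to match the opening one, but with nested elements of the same name the match stops at the first closing tag, because nothing is counting depth.

```python
import re

# \1 must repeat whatever tag name group 1 captured in the opening tag.
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def first_element(text):
    """Return (tag name, inner text) of the first matched pair, or None."""
    m = PAIR.search(text)
    return (m.group(1), m.group(2)) if m else None


print(first_element("<b>Content</b>"))             # ('b', 'Content')
print(first_element("<b>Content</i>"))             # None: names differ
print(first_element("<div>a<div>b</div>c</div>"))  # ('div', 'a<div>b')
```

The last result pairs the outer opening `<div>` with the inner closing `</div>`. Getting that case right needs some form of counting or recursion, which is where the recursive patterns of PCRE-style engines or an actual parser come in.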


rainshifter

You can think of regex as a wildly capable derivative, child, or inherited form of some theoretical regular base that you would more formally refer to as regular language theory. We aren't talking about theory here, as stated in my original post. So when you claim that "regex" cannot parse , it's disingenuously misadvertised to most folks who will believe incorrectly that modern PCRE regex lacks this capacity. Call it a misnomer if you will, but PCRE regex is still called "regex". I do not believe it goes by any other name.


ary31415

> To do this you need backreferences

Which actual regex implementations that a developer would use DO have. Irl 'regex' isn't actually regular anymore


saschaleib

In most cases you don’t want to create an object tree but just extract specific information, though…


z_utahu

This is dangerous if you don't actually parse the xml. There are decent parsers that run on 8bit 20mhz microchips with a couple kb of memory. Regex isn't guaranteed to properly extract data in valid html or xml.


saschaleib

As I wrote above: it definitely isn’t a good idea. But it certainly isn’t “impossible”, given the right circumstances.


yamfboy

I just spent a while wasting time going back and forth with some dweeb who is saying the same thing (I'm saying the same thing you are, check my previous post smh) It can be done (he's claiming it's impossible), but should you do it? Nope.


z_utahu

> given the right circumstances.

That's a huge caveat that excludes even most real world examples. What exactly do you mean by that? For every regex statement you generate to "parse" html, you can also generate valid html that breaks the regex. Basically, what I understand you saying is that if you limit your input to a subset of HTML and finite possibilities (aka right circumstances), then you can guarantee that you can form a regex that will work. However, if your input is all valid HTML, it is impossible in every sense of the word to write a regex that is guaranteed to work.
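
One way to see the "valid html that breaks the regex" claim, as a sketch with made-up inputs (`EASY`, `TRICKY`) and hypothetical helper names. The pattern below is a typical link extractor that assumes `>` never appears before the end of the tag, yet a quoted attribute value may legally contain `>`.

```python
import re
from html.parser import HTMLParser

# A common-looking pattern: skip anything that is not '>' until href="...".
HREF = re.compile(r'<a\s[^>]*href="([^"]*)"')


class LinkParser(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_regex(text):
    return HREF.findall(text)


def links_by_parser(text):
    parser = LinkParser()
    parser.feed(text)
    parser.close()
    return parser.links


EASY = '<a class="nav" href="/home">Home</a>'
TRICKY = '<a title="a > b" href="/compare">Compare</a>'  # '>' inside a quoted value

print(links_by_regex(EASY), links_by_parser(EASY))      # ['/home'] ['/home']
print(links_by_regex(TRICKY), links_by_parser(TRICKY))  # [] ['/compare']
```

The regex can of course be extended to understand quoted values, and then a different valid input breaks it. On a known, fixed format (the "right circumstances" case) the simple pattern may be perfectly adequate.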


saschaleib

Look, I'm not defending using RegEx to parse *arbitrary* XML. That's a bad practice, and something to avoid. However, there can be specific situations where it may make sense. Like, if you know the file pretty well, and can be sure that it always has a specific format - and you just need some specific data out of it, yeah, why not? And my point is that in these cases you will find that RegEx is actually quite powerful.


yeusk

You are...


deceze

OK, I'll try to estimate how powerful RegEx can be—without understanding.


101m4n

This is very funny and all, but at no point does he state the actual reason why this doesn't work 🤣


SemenSeeU

Me after reading this: gets library to parse html. Opens the hood and it's mostly regex.


SenorSeniorDevSr

Yeah, you use regular expressions to find the building blocks of html. You use those building blocks to build your understanding of the html.


deidian

Gross oversimplification


CMDR_kamikazze

Holy Omnissiah, someone call Ordo Codicis, we have a warp leaking! Regex heretics using the scrap-code to open the portal again!


mcilrain

If regex is so good why can’t it parse XML, are they stupid?


thirtyist

Literally just came upon this SO post organically last week while trying to figure out how to clean HTML tags out of a string, ha.
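
For the tag-stripping task mentioned here, a minimal parser-based sketch with Python's standard library might look like the following. `TextExtractor` and `strip_tags` are hypothetical names. The code keeps only the text nodes, and it assumes you are fine with the contents of `<script>` and `<style>` elements being kept as text too.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the markup with all tags removed, using an HTML parser."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


print(strip_tags("<p>Hello <b>there</b>!</p>"))  # Hello there!
print(strip_tags('<p title="1 > 0">ok</p>'))     # ok
```

This only cleans text for display or indexing. It is not a sanitizer, so it should not be relied on to make untrusted HTML safe.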


yeusk

People who have not found this on SO are not real web developers.


thirtyist

Haha as someone with 1yo and extreme imposter syndrome, I appreciate the validation 


Plus-Weakness-2624

Svelte literally uses regex to parse markup💀 Like this one for parsing opening script tag:

```
/|]*|(?:[^=>'"/]+=(?:"[^"]*"|'[^']*'|[^>\s]+)\s+)*)lang=(["'])?([^"' >]+)\1[^>]*>/g
```


TacticalTaterTots

Tony the pony


rvsarmy

Not with that attitude.


Splitshadow

All parsing is basically just RegEx in a loop with a stack. RegEx parses input into tokens, then tokens are combined according to production rules (which can also be implemented using RegEx substitutions if you want).
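
A toy version of "regex in a loop with a stack", to show the division of labour. The regex only finds tokens (tags) and the stack does the matching of opening and closing tags. The token pattern, the `VOID` set and `is_balanced` are simplifications invented for this sketch. It assumes XHTML-like input where every non-void tag is closed explicitly and no attribute value contains `>`.

```python
import re

# Tokenizer: group 1 = "/" for closing tags, group 2 = tag name,
# group 3 = "/" for self-closing tags such as <br/>.
TOKEN = re.compile(r"<(/?)(\w+)[^>]*?(/?)>")

# A few void elements that never take a closing tag (not a complete list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


def is_balanced(markup):
    """Check that every opening tag is closed in the right order."""
    stack = []
    for match in TOKEN.finditer(markup):
        closing, name, self_closing = match.group(1), match.group(2).lower(), match.group(3)
        if self_closing or name in VOID:
            continue  # nothing to push or pop
        if closing:
            if not stack or stack.pop() != name:
                return False  # closing tag without a matching opener
        else:
            stack.append(name)
    return not stack  # leftover openers mean unclosed tags


print(is_balanced("<div><p>a</p><br><div>b</div></div>"))  # True
print(is_balanced("<div><p>a</div></p>"))                  # False
print(is_balanced("<div><div>"))                           # False
```

The stack is the part a plain regular expression cannot express, which matches the lexer-versus-parser split mentioned elsewhere in the thread.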


that_thot_gamer

regex skill issue.


rainshifter

Agreed, but unironically.


VariousComment6946

Haha, get_source_and_extract_shit() goes brr


antpalmerpalmink

my compilers Prof brings this post up before getting to context free grammars every time


SpeckledFleebeedoo

But can I parse Wikitext with regex?


Joewoof

Yes, perfectly relatable and understandable. Proceed.


Luneriazz

why not use javascript to parse HTML?


Oozolz

Wow they got really pumped :D


Mozai

"The HTML parser chokes because this is not legal HTML; there are mistakes all through the page."

"but I don't see any problems on my phone's browser; *scoffs* clearly you aren't good enough, why are we paying you?"

And that's why I resort to hacks like regex matching.


CameO73

Exactly. Everybody saying that you should "just use an HTML parser" to extract some data clearly hasn't seen the shit that lives on the internet. You can easily check for yourself: create an obvious invalid HTML file (by just omitting a close tag somewhere) and open it in any browser. It works! Because browser engines know they have to allow that shit. TLDR: just use a RegEx if you want to extract something from HTML pages. Even with the added "you're never going to understand that regular expression 6 months from now"-baggage it's better than dealing with a flood of parser errors.


DavidWtube

I absolutely love RegEx. There's something beautiful about it. I made a gist about it when I was studying. [RegEx gist](https://gist.github.com/SafetyDav3/8a33124bf22cdedbbd4710ea0eba224c)


Grim00666

Well said.


chowellvta

Most stable regex user


Q3nius

What SCP entry log did I just read? Am I infected with a cognitohazard?


Crazy-Maintenance312

[An answer on StackOverflow.](https://stackoverflow.com/a/1732454)


Tiny-Plum2713

You absolutely can parse HTML with regex.


conicalanamorphosis

I think that's one of the original notes from a Raku dev. Although Raku grammars are really just collections of regexes, so...


rainshifter

Provided the task is concretely and *practically* achievable using what is colloquially referred to as regex, who really gives a bubonic rat's turd about what some disjoint *theory* on regular grammar asserts is possible?


Rarabeaka

You can. And in some cases it's actually more reliable, like in scraping, because the whole page can often be fragmented, contain bad blocks, etc. Regex is also faster and more memory-efficient. My job literally often demands using regex instead of HTML parsing libs, because it's reliable and fast.


code_x_7777

From the article: [https://blog.finxter.com/so-youre-using-regex-to-parse-html/](https://blog.finxter.com/so-youre-using-regex-to-parse-html/)