Yes, that's what I was thinking, but I didn't put in the work to write it up ;) Have a few sats, sir. Makes me think: could there be a method to detect same title or same content regardless of different url? AI maybe? Content indexing like search engines? Dunno.
The AMP duplicate issue is easy to fix; there's already Issue #138 in SN's GitHub repo for that. The case where a Description post contains a link that was already shared in a prior Link post has been acknowledged as a possible problem; see Issue #150 on that. There are also some other issues on SN's GitHub about improving dup detection (e.g., ignoring upper/lower case in the URL, ignoring the www. subdomain, ignoring the m. subdomain (e.g., for YouTube videos), etc.).
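To make the URL cleanup ideas concrete, here's a rough Python sketch of what normalizing a URL before comparing could look like. This is not SN's code; `normalize_url` and its exact rules are my own assumptions, and it doesn't handle AMP variants at all. It also leaves the query string untouched, since things like YouTube video IDs are case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Return a comparison key for a URL (hypothetical rules, for illustration)."""
    parts = urlsplit(url.strip())
    # .hostname is already lowercased and excludes any port or credentials.
    host = parts.hostname or ""
    # Drop a leading www. or m. subdomain so mobile/desktop links match.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Ignore case and a trailing slash in the path; keep the query as-is.
    path = parts.path.rstrip("/").lower()
    # Treat http and https as the same, and drop the #fragment.
    return urlunsplit(("https", host, path, parts.query, ""))
```

Two posts would then be treated as possible dupes when their normalized URLs are equal. Lowercasing the path can cause false matches on sites with case-sensitive paths, so it should only ever produce a "possible dupe" hint, not a block.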
could there be a method to detect same title or same content regardless of different url?
That could be done ... exact duplicate is easy. Simply store a hash of the response. If the hash for a new post matches the hash for a prior post, show the earlier post as a potential dupe. That solves the case where two URLs respond with identical content. It doesn't solve something like ZeroHedge reprinting a Bitcoin Magazine article. In that instance the two pages have different hashes but are essentially the same content.
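The exact-match idea is small enough to sketch in Python. The `ContentIndex` class and the choice of SHA-256 are just assumptions for illustration; a real version would keep the digests in the database next to each post.

```python
import hashlib

class ContentIndex:
    """In-memory map from response hash to the posts that produced it (hypothetical)."""

    def __init__(self):
        self._by_hash = {}  # hex digest -> list of post ids

    def add(self, post_id, body: bytes):
        """Record a post's fetched response and return earlier posts with identical bytes."""
        digest = hashlib.sha256(body).hexdigest()
        earlier = list(self._by_hash.get(digest, []))
        self._by_hash.setdefault(digest, []).append(post_id)
        return earlier
```

Calling `add` for a new post returns the ids of any prior posts whose response was byte-for-byte identical, which is the list of potential dupes to show. Any change at all in the page (a different ad, a timestamp) changes the hash, so this only catches truly identical responses.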
There are anti-plagiarism services and duplicate content detection tools that could identify reprints like that. But I don't think that a duplicate could be detected in real-time -- SN doesn't know which prior post(s) to compare against when there is a new post. Maybe running the anti-plagiarism search against prior posts with similar words in the title would find duplicates, but that would take longer than is reasonable to expect the user entering a post to wait.
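For what it's worth, here's a rough Python sketch of the "narrow down by title words, then compare content" idea. To be clear, this is not how any anti-plagiarism service works; I'm swapping in a simple Jaccard similarity over 3-word shingles as a stand-in. The function names, the post dict shape (`id`, `title`, `body`), and both thresholds are made up for illustration.

```python
import re

def words(text):
    """Lowercase alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def shingles(text, k=3):
    """Set of overlapping k-word sequences in a text."""
    w = words(text)
    return {tuple(w[i:i + k]) for i in range(len(w) - k + 1)}

def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def likely_dupes(new_post, prior_posts, min_shared_title_words=2, threshold=0.8):
    """Return (post id, score) pairs for prior posts that look like reprints."""
    new_title = set(words(new_post["title"]))
    new_sh = shingles(new_post["body"])
    hits = []
    for post in prior_posts:
        # Cheap filter first: skip posts whose titles barely overlap.
        if len(new_title & set(words(post["title"]))) < min_shared_title_words:
            continue
        score = jaccard(new_sh, shingles(post["body"]))
        if score >= threshold:
            hits.append((post["id"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

This loops over every prior post, so it has the same waiting-time problem mentioned above; it would more likely run in the background after the post is submitted than while the user is waiting.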
Or there could simply be a mechanism for users to flag a post as a duplicate, with some handling for that. Reddit does this, but their approach involves moderators who investigate -- something SN is designed to not need.
Duplicate posts are a minor inconvenience that I suspect mostly bothers only a small number of heavy users of SN. I can imagine there are other things to be done that are a higher priority.