Hacker News Score, Comments and Rank as Time Series
I updated @hn today to not only post when an item hits #1 on HN but also track HN metrics about each item on the front page every minute.
Since I was lazy, this means that @hn now also checks for a new item at the top every minute instead of every hour as before, so there might be more "spam" from @hn for a while.
I decided to track metrics as the foundation for a more sophisticated algorithm that decides if @hn should crosspost something to SN. I have wanted to do this since #192328, where @nout mentioned that @hn basically seeds ~tech but could post fewer, higher-quality items.
But how to decide if something is quality content for SN before it gets posted?
This question is basically what I am trying to answer with the newly collected metrics. With them, I can now run some backtesting to see if there is some pattern on HN that @hn can exploit to find the items that performed well on SN. I don't collect SN metrics yet though (but I could cheat and use my database access).
But even if such patterns are not visible, I have some ideas that should definitely be an improvement since they are way better proxies for "is this item interesting?" than simply looking at the rank at a single moment in time.
For example, I could track how long an item stays at #1 and only post it once it has been there for at least one hour. This should be better since otherwise @hn might post something that falls down the ranks as fast as it reached #1.
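As a rough sketch of the "time at #1" idea, the snippet below measures how long an item has held rank 1 without interruption, given its per-minute samples. The function names, the `(timestamp, rank)` sample shape and the one-hour default are assumptions made for illustration; this is not the code @hn runs.

```python
from datetime import datetime, timedelta


def time_at_top(samples):
    """Return how long the item has held rank 1 as of its latest sample.

    `samples` is a list of (timestamp, rank) tuples sorted by timestamp.
    Only the unbroken run of rank-1 samples at the end of the list counts.
    """
    streak_start = None
    for timestamp, rank in reversed(samples):
        if rank != 1:
            break
        streak_start = timestamp
    if streak_start is None:
        return timedelta(0)
    return samples[-1][0] - streak_start


def should_post(samples, min_duration=timedelta(hours=1)):
    """True once the item has stayed at #1 for at least `min_duration`."""
    return time_at_top(samples) >= min_duration


# Hypothetical per-minute samples: rank 2 at first, then rank 1 for 60 minutes.
start = datetime(2024, 3, 14, 12, 0)
ranks = [2] + [1] * 61
samples = [(start + timedelta(minutes=i), r) for i, r in enumerate(ranks)]
print(time_at_top(samples), should_post(samples))
```

An item that touches #1 for a few minutes and then drops never builds up a long enough streak, so it would be skipped.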
Another idea is to track the "slope" of an item. If it rises very fast (based on rank, score and/or number of comments), @hn could post it even before it hits the top. @hn would basically predict that this item is very interesting and will reach the top soon.
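One possible way to put a number on the "slope" is a least-squares fit over the most recent per-minute samples. The sketch below assumes evenly spaced samples; the `is_rising` helper and its threshold of 2 points per minute are made up for illustration and would need tuning through backtesting.

```python
def slope(values):
    """Least-squares slope of evenly spaced samples (change per sample interval)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_rising(scores, ranks, min_score_slope=2.0):
    """Hypothetical rule: score grows fast enough and rank number is shrinking.

    A falling rank number means the item is climbing towards #1,
    so a "rising" item has a negative rank slope.
    """
    return slope(scores) >= min_score_slope and slope(ranks) < 0


# Hypothetical samples taken one minute apart.
print(slope([10, 20, 30, 40]))
print(is_rising([10, 20, 30, 40], [12, 9, 5, 2]))
```

The same `slope` function works for the comment count; how to weigh the three signals against each other is left open here.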
@hn previously only checked every hour for a related reason: the probability that it catches an item that has only just reached the top (and then quickly falls off) is lower than if it checks every minute.
All of these ideas can now be expressed with a single SQL query. For the more technical stackers, here is the query that currently decides if @hn should post something:
    SELECT t.id, time, title, url, author, score, ndescendants
    FROM (
        SELECT id, MAX(created_at) AS created_at
        FROM hn_items
        WHERE rank = 1 AND id NOT IN (SELECT hn_id FROM sn_items)
        GROUP BY id
    ) t
    JOIN hn_items ON t.id = hn_items.id AND t.created_at = hn_items.created_at;
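To see what the query selects, it can be run against a toy database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table definitions, column types and rows are guesses made up for this example (only the column names come from the query), and the real database may well be a different engine.

```python
import sqlite3

QUERY = """
SELECT t.id, time, title, url, author, score, ndescendants
FROM (
    SELECT id, MAX(created_at) AS created_at
    FROM hn_items
    WHERE rank = 1 AND id NOT IN (SELECT hn_id FROM sn_items)
    GROUP BY id
) t
JOIN hn_items ON t.id = hn_items.id AND t.created_at = hn_items.created_at;
"""


def build_toy_db():
    """Create an in-memory database with an assumed schema and made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hn_items (
            id INTEGER, created_at TEXT, rank INTEGER, time INTEGER,
            title TEXT, url TEXT, author TEXT, score INTEGER, ndescendants INTEGER
        );
        CREATE TABLE sn_items (hn_id INTEGER);
    """)
    rows = [
        # Item 1: at #1 in two consecutive snapshots, not yet crossposted.
        (1, "2024-03-14 12:00:00", 1, 1710410000, "Item A", "https://example.com/a", "alice", 50, 3),
        (1, "2024-03-14 12:01:00", 1, 1710410000, "Item A", "https://example.com/a", "alice", 60, 5),
        # Item 2: reached #1 but was already crossposted (see sn_items below).
        (2, "2024-03-14 12:02:00", 1, 1710411000, "Item B", "https://example.com/b", "bob", 80, 9),
        # Item 3: never reached #1.
        (3, "2024-03-14 12:02:00", 3, 1710412000, "Item C", "https://example.com/c", "carol", 20, 1),
    ]
    db.executemany("INSERT INTO hn_items VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    db.execute("INSERT INTO sn_items VALUES (2)")
    return db


candidates = build_toy_db().execute(QUERY).fetchall()
print(candidates)
```

With this toy data, only item 1 comes back, using the values from its most recent snapshot at rank 1: the subquery picks the latest rank-1 snapshot per item that is not yet in `sn_items`, and the join fetches the remaining columns of exactly that snapshot.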
#464039 (24 sats, 0 comments):
#464032 (0 sats, 0 comments):
#463914 (10 sats, 0 comments):
#463813 (102 sats, 0 comments):
#463756 (41 sats, 0 comments):
#463719 (20 sats, 0 comments):
#463685 (387 sats, 0 comments):
#463921 (10 sats, 0 comments):
#463449 (20 sats, 0 comments):
#463261 (30 sats, 0 comments):
#463222 (20 sats, 2 comments):
wasn't posted for some reason:
#462642 (41 sats, 0 comments):