Over the weekend we decided to build a tool to help us keep track of what’s happening in content strategy, UX and other web strategy topics. And lo, Web Strategy Twitter Trends was Frankensteined together: a kind of TweetMeme designed specifically for our industry.
This is how it works.
It starts with a list of topics and their associated Twitter search queries. It’s currently configured with four topics, but we’re open to expanding the list in the future.
| Topic | Twitter search query |
| --- | --- |
| Content Strategy | “content strategy” OR #contentstrategy |
| User Experience | “user experience” OR #userexperience OR #ux |
| Web Strategy | “web strategy” OR #webstrategy |
| SEO Strategy | “seo strategy” OR #seostrategy |
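In code, that configuration might look something like the sketch below (the structure is our illustration; the real tool may well store these differently, e.g. in a config file or database table):

```python
# Topics mapped to their Twitter search queries. A plain dict is illustrative;
# the queries themselves are the ones from the table above.
TOPICS = {
    "Content Strategy": '"content strategy" OR #contentstrategy',
    "User Experience": '"user experience" OR #userexperience OR #ux',
    "Web Strategy": '"web strategy" OR #webstrategy',
    "SEO Strategy": '"seo strategy" OR #seostrategy',
}
```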
The tool performs an hourly Twitter search for each query and looks for new tweets that include links. The links are recorded in a database with some accompanying information: the author of the tweet, who else was mentioned or re-tweeted in the text, and so on.
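Twitter’s API details have changed over the years, so here is a hedged sketch of just the extraction step, assuming we already have each tweet’s author and text in hand:

```python
import re

# Rough patterns for links, @mentions and old-style "RT @user" retweets.
LINK_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
RT_RE = re.compile(r"\bRT\s+@(\w+)", re.IGNORECASE)

def parse_tweet(author, text):
    """Pull out the fields we record alongside each link."""
    rt = RT_RE.search(text)
    return {
        "author": author,
        "links": LINK_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "retweeted_from": rt.group(1) if rt else None,
    }
```

For example, `parse_tweet("alice", "RT @bob: great piece on UX http://bit.ly/abc123")` records the link, the mention of bob and the retweet in one pass.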
This is where it gets tricky. Those of you who don’t care about the finer points of technical wrangling might want to skip this next part:
First, we need to detect and convert shortened URLs to their full equivalents, so that we can accurately measure totals for each URL. Unfortunately there are hundreds of URL shortening services, but no comprehensive service that converts them all back. Instead we resort to low-level HTTP trickery: we make a short request for just the headers of each shortened URL, then parse the (potentially multiple) redirection Location headers we get back to find the “final resting place” of the URL. All short URLs and their full equivalents are cached locally in the database so that we only have to look up new short URLs.
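A minimal sketch of that expansion step using Python’s requests library (the in-memory dict here stands in for the database cache):

```python
import requests

_url_cache = {}  # short URL -> final URL; the real tool caches these in the database

def expand_url(short_url, timeout=5):
    """Follow a shortened URL's redirect chain to its final destination."""
    if short_url not in _url_cache:
        # A HEAD request fetches headers only; allow_redirects tells requests
        # to walk the chain of Location headers for us.
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        _url_cache[short_url] = resp.url
    return _url_cache[short_url]
```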
Next, we need to normalize the full URL, stripping utm tracking codes and any other query-string parameters that would prevent us from detecting multiple uses of the same base URL.
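One way to do that with the standard library (exactly which parameters get stripped is our guess):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def normalize_url(url):
    """Drop utm_* tracking parameters so identical base URLs compare equal."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not k.lower().startswith("utm_")]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

So `normalize_url("http://example.com/post?utm_source=twitter&page=2")` comes back as `http://example.com/post?page=2`, and the two variants count as one link.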
The title of each link can’t be accurately extracted from tweets, where it may be paraphrased, spelled incorrectly or omitted completely. Instead we make a short, custom HTTP request to the web page, requesting only the first 1kb of data. This normally equates to about the first 1,000 characters of the HTML. Usually this includes the <title> element near the top of the page, from which we can extract the title. If not, we request the next 1kb of data and check again, and so on. By requesting only a small amount of data, we lessen the load and bandwidth demands on the remote server, and speed up the time it takes to find the title.
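A sketch of that incremental fetch, reading the body 1kb at a time over a streaming connection (the real tool may issue separate ranged requests instead):

```python
import re
import requests

TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def fetch_title(url, chunk_size=1024, max_bytes=10240):
    """Read the page in 1kb chunks until a <title> turns up, or give up."""
    buf = b""
    with requests.get(url, stream=True, timeout=5) as resp:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            buf += chunk
            match = TITLE_RE.search(buf)
            if match:
                encoding = resp.encoding or "utf-8"
                return match.group(1).decode(encoding, "replace").strip()
            if len(buf) >= max_bytes:
                break  # title isn't near the top; stop rather than fetch the whole page
    return None
```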
The link data is all stored in a database, and the final page is created with some fairly sophisticated SQL statements that extract the relevant data. The SQL can take a little time to run, so the page is re-created only once an hour and the output placed in a static HTML file. This means less load on our server and a faster response when you view the page.
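The schema and query below are our invention, but the shape of the job is roughly this: an aggregate query per time window, rendered out to a static file:

```python
import sqlite3

conn = sqlite3.connect("trends.db")  # hypothetical schema: links(url, title, tweeted_at)

# Rolling 24-hour window rather than a calendar day; swap in '-7 days' for the weekly list.
rows = conn.execute("""
    SELECT url, title, COUNT(*) AS tweet_count
    FROM links
    WHERE tweeted_at >= datetime('now', '-1 day')
    GROUP BY url
    ORDER BY tweet_count DESC
    LIMIT 20
""").fetchall()

items = "\n".join(
    f'<li><a href="{url}">{title}</a> ({n} tweets)</li>' for url, title, n in rows
)
with open("trends.html", "w") as f:
    f.write(f"<ul>\n{items}\n</ul>\n")
```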
As it currently stands, we start the Twitter queries at 45 minutes past every hour, and re-create the static page on the hour.
The code uses a rolling day/week window of time, rather than fixed calendar days and weeks. Because it has been running for less than a day (at the time of writing), the daily and weekly trend lists currently show similar items. After a few days they should start to diverge and become independently useful.
We’d very much like to hear any suggestions you have for how to improve the tool; it’s really just a starting point at the moment. Drop us a comment below.
P.S. As an aside, this is the start of a bigger set of exciting changes at Contentini. We’ll have more news this week, you lucky buggers.