Platform 2 Platform: the Matching Algorithm

Brief explanation

The Platform 2 Platform prototype is composed of two layers: a database collecting the articles from the connected publishers, and a matching algorithm that suggests connections between them. Editors add a third layer by reviewing the algorithmic suggestions and deciding whether those links are meaningful or not.

The first layer, collecting the articles, is a fairly straightforward scraping process that uses the semantics of how articles are organized on a webpage (e.g. a field that indicates the author's name). The matching algorithm requires more explanation.

The basic diagram is the following:

  • articles are transformed into vectors
  • the algorithm builds a model based on this pool of articles; the model makes it possible to compare articles and retrieve a matching score between them
  • the article requesting matches is used as an external vector and positioned in the model, in order to get a list of similar articles

Let’s unpack this:

A vector is a numerical representation of the content of the article. To produce this vector, each article is first tokenized: it is turned into a list of words by stripping away punctuation and common English words (“the”, “where”, etc., or any others we decide on). The remaining words are then counted to get a vector representation of the article (e.g. how many times each word appears in it).
The model is built from all articles from the publishers, except those from the publisher that is requesting matches. The algorithm tries to match every article with every other one and calculates the relationship between them. Every new attempt builds upon the previous one (training the algorithm), and the result of all of this is the algorithmic model.
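The tokenizing and counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the prototype's code: the stop-word list, the function names `tokenize` and `to_vector`, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real list is whatever the editors decide on.
STOP_WORDS = {"the", "where", "a", "an", "and", "of", "in", "is", "to"}

def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

def to_vector(text):
    """Count how many times each remaining word appears in the text."""
    return Counter(tokenize(text))

# Made-up sentence, used only to show the shape of the result.
vector = to_vector("The archive is where the archive of the press lives.")
print(vector)
```

Each article ends up as a mapping from words to counts, which is the kind of vector the rest of the process works with.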

Finally, an “external” article, which did not contribute to the training of the model, is added to it and then used to request the matches that are most similar. Similarity is measured by “cosine similarity”: each vector has a direction and a length, and the cosine similarity is the cosine of the angle between two vectors. The smaller the angle, the higher the cosine similarity, and the more similar the two articles are according to the model. Because we compare by angle rather than by distance, vectors can be “placed” in different areas of the model and still be compared. Using angles gives a more precise comparison between vectors, as the more they point in the same direction, the more similar they are.
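To make the angle-versus-distance point concrete, here is a small sketch of cosine similarity on word-count vectors. It assumes vectors are stored as `Counter` objects; the function name and the sample vectors are made up for the illustration, and the prototype's own library may compute this differently.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical vectors: same proportions of words, very different lengths.
short_article = Counter({"archive": 1, "press": 1})
long_article = Counter({"archive": 10, "press": 10})
unrelated = Counter({"football": 3})

print(cosine_similarity(short_article, long_article))
print(cosine_similarity(short_article, unrelated))
```

The short and the long article point in the same direction, so their similarity is (up to rounding) 1 even though they are far apart in terms of distance. The unrelated article shares no words with them and scores 0.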

The editor then gets to see the list of most similar articles and decides whether each of them is meaningful or not. The editor rates the matches and this feedback is stored in the database next to the articles. A possible future option would be to use this review process by the editors to influence the matching algorithm model and close the feedback loop.

Conceptual breakdown

The matching algorithm is a system for suggesting articles across a group of publishers. It works by receiving an input article and giving back a list of suggestions. Briefly, it creates an algorithmic model of the matches for each combination of publishers (e.g. given the publishers A, B, and C, the combinations are: A + B, A + C, B + C). Whenever a matching request is made for a specific article, the algorithm picks the right model and compares the input article to all the articles in this model, eventually serving up the matches.
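The combinations in the A, B, C example can be enumerated as in the sketch below. Treating the “right” model as the one that leaves out the requesting publisher is an assumption based on the brief explanation above; the names `model_keys` and `pick_models` are hypothetical.

```python
from itertools import combinations

publishers = ["A", "B", "C"]

# One model per pair of publishers, as in the A + B, A + C, B + C example.
model_keys = [frozenset(pair) for pair in combinations(publishers, 2)]

def pick_models(requesting_publisher):
    """Return the model keys that do not include the requesting publisher.

    Assumption: an article from publisher A is matched against the model
    built from the other publishers (here B + C).
    """
    return [key for key in model_keys if requesting_publisher not in key]

print(pick_models("A"))
```

With three publishers exactly one model is left for each requester. With more publishers this sketch would return several candidates, and how to choose between them is not covered here.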

Given this, we can break down in more detail how the matching works, and what the current limits of the prototype are.

Each article in the algorithmic model is composed of the following fields: title, author, tags, body. Except for the author field, all fields are tokenized. As of now, we use all fields with equal importance when requesting matches from the algorithm, since the pool of articles we run is pretty small (around 600 articles). The matching algorithm in use was built with bigger pools of articles in mind (e.g. 50,000 articles). In our tests, selecting only some of the fields to base the matching on did not influence the results given by the algorithm. Technically, the current code allows for selecting specific fields, and so does the frontend comparison tool. This means that given a larger data set, we could test this feature again and possibly draw different conclusions.
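As a rough illustration of field selection, the sketch below builds the token list for an article from a chosen set of fields, tokenizing everything except the author. The function name `article_tokens`, the dictionary layout, and the sample article are assumptions for this example and do not reflect the prototype's actual data model.

```python
import re

ALL_FIELDS = ("title", "author", "tags", "body")

def article_tokens(article, fields=ALL_FIELDS):
    """Collect tokens from the selected fields with equal importance."""
    tokens = []
    for field in fields:
        value = article.get(field, "")
        if not value:
            continue
        if field == "author":
            # The author field is not tokenized: the name is kept whole.
            tokens.append(value)
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", value.lower()))
    return tokens

# Hypothetical article, for illustration only.
article = {
    "title": "Open Archives",
    "author": "Jane Doe",
    "tags": "archive, publishing",
    "body": "Archives are open.",
}

print(article_tokens(article))                    # all fields
print(article_tokens(article, fields=("title",))) # title only
```

Stop-word removal is left out here to keep the focus on which fields feed into the comparison.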

At the algorithm level, a general rule is that words with very high or very low frequency have less relevance than words in the middle of the spectrum. Unique words, as well as very common words, are considered “noise” when creating the algorithmic model. We already cut some of this noise when preparing the version of the article to use for the comparison process (see the tokenization step above), but keeping this in mind while reviewing the matches is helpful.
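The idea of treating both very rare and very common words as noise can be illustrated with a simple document-frequency filter. This is a sketch of the general principle, not of what the matching algorithm does internally; the thresholds `min_docs` and `max_share` and the sample token lists are arbitrary assumptions.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_articles, min_docs=2, max_share=0.8):
    """Drop words that appear in too few or too many articles."""
    total = len(tokenized_articles)
    doc_freq = Counter()
    for tokens in tokenized_articles:
        doc_freq.update(set(tokens))  # count each word once per article
    keep = {
        word
        for word, count in doc_freq.items()
        if count >= min_docs and count / total <= max_share
    }
    return [[word for word in tokens if word in keep] for tokens in tokenized_articles]

# Made-up token lists standing in for four tokenized articles.
articles = [
    ["publishing", "archive", "zine"],
    ["publishing", "archive", "press"],
    ["publishing", "press", "riso"],
    ["publishing", "library", "zine"],
]

print(filter_by_document_frequency(articles))
```

In this toy pool, “publishing” appears in every article and is dropped as too common, while “riso” and “library” appear only once and are dropped as too rare. The words in the middle of the spectrum are the ones that remain.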

This comparison process happens in real time: the model is built once but can be updated, and the input article is an “external” element outside of the model (see above). Each time matches for an article are requested, the newest version of the article is used. This means that if an article is edited, those changes are detected by the software, which will scrape the article again with the new changes and update the database accordingly. Right now, the scraper does not run as an automatic, continuous process, but it is designed to do so in the future. During this prototyping phase, we triggered the process manually. Of course, how much the edits weigh on the balance of the model depends on their size and type.

Two things not yet implemented in the current system are the way editors can influence the algorithm, and how the algorithm can learn from the editors.

So far, we store each match done by the editors in a database, where we keep track of the two articles that matched, the time it was submitted, and a score from 1 to 5 generated by the algorithm. While the score range can feel a bit arbitrary, documenting this is important to create a bridge between editors’ feedback and the way each article ‘weighs’ in the model representation. Using words like NO MATCH – OK – GOOD – SUPER instead of numbers in the front-facing interface could be a good way forward to make it more intuitive for editors, as long as each word has a certain weight to be used in the matching process.
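A stored match along these lines could look like the sketch below. The field names, the `MatchFeedback` class, and the numeric weights attached to each label are all hypothetical; only the recorded information (the two articles, the submission time, a score from 1 to 5) and the four labels come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical weights: the actual value of each label is still to be decided.
LABEL_WEIGHTS = {"NO MATCH": 0.0, "OK": 0.5, "GOOD": 0.75, "SUPER": 1.0}

@dataclass(frozen=True)
class MatchFeedback:
    input_article_id: str    # the article that requested matches
    matched_article_id: str  # the article it was matched with
    submitted_at: datetime   # when the match was submitted
    score: int               # score from 1 to 5

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")

def label_weight(label):
    """Look up the weight a front-facing label would carry in the matching."""
    return LABEL_WEIGHTS[label]

# Made-up record, for illustration only.
feedback = MatchFeedback("a-1", "b-7", datetime(2020, 1, 1, tzinfo=timezone.utc), 4)
print(feedback, label_weight("GOOD"))
```

Keeping the label-to-weight table in one place would make it easy to adjust once editors' feedback starts to influence the model.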

The next step would be to expand the current software in use, to automatically add ‘editorial’ weights to the algorithmic models, so that when an input article requests matches, the choices made by the editors are taken into account by the algorithm too. This process would close the feedback loop between the algorithmic matching produced by the software, and the editorial process done afterwards by each publisher.

You can find this project on GitHub.