Inter-American Development Bank

Abierto al público


Applying topic modeling to knowledge management online

August 10, 2020 by Kyle Strand - Daniela Collaguazo - Michelle Marshall - Open Knowledge Team


How topic modeling helped us restructure the blog Abierto al Público and increase our search visibility

When it comes to knowledge management, and open knowledge in particular, the main challenge is arguably no longer a lack of information. For those with access to it, the Internet has the potential to connect us with an abundance of knowledge, increasingly in formats that are free to access. That said, the ongoing curation, navigation, and synthesis of so much information is one of the current dilemmas in connecting the most relevant and actionable resources to those who search for them. This is more than a matter of aesthetics, promotion, or marketing. As we have observed recently around the world, the so-called “infodemic” urgently demands new ways of helping people find the knowledge they are looking for, and it asks content creators to take greater responsibility for presenting knowledge and information clearly and comprehensively to their readers.

For these reasons, the IDB is constantly exploring and refining techniques to better connect the Latin American and Caribbean region with quality open knowledge. One example, out of many ongoing efforts, is the work our team has done to improve the curation and organization of the content published here at Abierto al Público. In this article, we share some of what we learned about using techniques like topic modeling and SEO to approach content management more efficiently, with the guiding motivation of helping readers find the content and learning resources that are most meaningful and practical to them. We hope that you can use these techniques to better organize and share your knowledge, too.

A big milestone, new emerging topics – and a lot of content

Abierto al Público is first and foremost an IDB resource to share learnings about open knowledge. Having recently celebrated more than five years online, the blog has published more than 500 articles related to all things open in connection with economic and social development in Latin America and the Caribbean, including open knowledge, open data, open government, open innovation, and more recently, open source technology.

But how to navigate and make sense of it all — especially for our first-time visitors? It was an important time for us to reflect on this question, for various reasons. For one, our coverage related to the open movement continued to evolve beyond the blog’s original categories. We needed a new method for grouping content in a way that would make sense to readers while also offering flexibility to incorporate future content as we continue to grow and follow new lines of conversation. Second, the volume of content discouraged too much manual sorting and rearranging. This is an important consideration because we want to be efficient with our use of time and resources.  

With this in mind, we wanted to see how AI and Natural Language Processing could play a role in complementing our strategy and streamlining the otherwise manual task of sorting and categorizing our content in a balanced and consistent manner. 

Centering on SEO: Mapping content for the benefit of both people and search engines

As with good practices for open data, good knowledge and content management requires that related subject matter can be found and followed by both people and machines.

For this reason, understanding the science behind search engine optimization became an important focal point of our content management strategy. Search engines like Google constantly scan the web, evaluating the sitemaps of different content providers to understand what their content is about and to judge the quality and relevance of that information to a user’s search. Improving how your content appears in search results means making that job easier. This taught us how important it is to maintain consistent categories and tags, as well as relevant links between related content.

When it comes to categories, each article should belong to only one, like the branch of a tree or the hub at the center of a wheel. Categories should be roughly balanced in terms of the amount of content in each, and a clear logic should connect the content to its category while also making it distinct from the other categories.
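The two rules above (one category per article, roughly balanced categories) can be checked mechanically. The following is a minimal sketch, not the process used for the blog: the article titles, the `check_categories` helper, and the `max_ratio` threshold of 3 are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def check_categories(article_categories, max_ratio=3.0):
    """Check a mapping of article title -> list of categories.

    Flags articles that do not have exactly one category, and flags the
    category set as unbalanced when the largest category holds more than
    max_ratio times the articles of the smallest (an assumed threshold).
    """
    problems = []
    for title, cats in article_categories.items():
        if len(cats) != 1:
            problems.append(f"{title!r} has {len(cats)} categories, expected exactly 1")

    # Count only the articles that follow the one-category rule
    counts = Counter(cats[0] for cats in article_categories.values() if len(cats) == 1)
    if counts:
        largest, smallest = max(counts.values()), min(counts.values())
        if largest / smallest > max_ratio:
            problems.append(f"unbalanced: largest category has {largest}, smallest has {smallest}")
    return problems, counts

# Hypothetical sample data
articles = {
    "Intro to open data portals": ["Open Data"],
    "Licensing your datasets": ["Open Data"],
    "Reusing open source code": ["Open Source"],
    "Designing a MOOC": ["Open Learning", "Knowledge Management"],
}

problems, counts = check_categories(articles)
for problem in problems:
    print(problem)
print(dict(counts))
```

A check like this only reports structural issues; whether a clear logic connects an article to its category still needs a human reader.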

Learn more here about categorization and topic clusters.

But how many categories would we need to organize so much content? This was our next question. We needed to compare and evaluate our options without too much manual sorting. This is where topic modeling becomes highly relevant.

How we used Topic Modeling to identify and create categories of content

Topic modeling is one of several Natural Language Processing techniques within the wider field of artificial intelligence. It can be applied to automatically identify underlying, hidden or latent themes, patterns or groupings within large volumes of text, also known as the “corpus”. As we have learned and shared from previous experiences involving Artificial Intelligence, it is key to remember that success depends largely on the quantity and quality of data that will be used.  In the case of Topic Modeling, that same reminder also holds true. 

In the case of Abierto al Público, we first gathered the 500+ articles (the corpus) into a single CSV file for analysis. This can be done with web scraping techniques or by other means, depending on your access to the original source files and their formats.

The next step was to clean the data to maximize the emphasis on the thematic content. For example, we removed punctuation and words that did not provide much comparative information about the contents of the text, such as prepositions and conjunctions. Programming techniques in Python can help facilitate this process.
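A minimal sketch of this cleaning step, using only the Python standard library. The CSV column names (`title`, `body`), the sample rows, and the very short stop-word list are assumptions made for illustration; a real project would likely use a fuller stop-word list for each language in the corpus.

```python
import csv
import io
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "is",
    "are", "or", "but", "that", "this", "it", "as", "by", "at", "from",
}

def clean_text(text):
    """Lowercase the text, keep only runs of letters, and drop stop words."""
    # [^\W\d_]+ matches runs of letters (including accented ones),
    # so punctuation, digits, and underscores are discarded
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Also drop very short fragments such as the "s" left over from "it's"
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

# Stand-in for the corpus CSV file; the column names are hypothetical
SAMPLE_CSV = """title,body
Open data portals,"Open data is published by the city, and it's free to reuse!"
Código abierto,"The code for this tool is open source."
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = {row["title"]: clean_text(row["body"]) for row in rows}

for title, tokens in cleaned.items():
    print(title, tokens)
```

To read an actual file instead of the inline sample, `io.StringIO(SAMPLE_CSV)` could be replaced with `open(path, newline="", encoding="utf-8")`.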

After the data set was cleaned and prepared, we started the iterative process of training the topic modeling algorithm, which meant running the cleaned corpus through a topic modeling engine. In each iteration we set a different number of buckets, or topics, into which the terms found in the corpus would be classified. The output assigned each article to a grouping, along with a probability indicating how well that article matched the rest of the content in the same grouping.
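To make the iteration concrete, here is a small, self-contained sketch that assumes Latent Dirichlet Allocation (LDA) as the algorithm, fitted with collapsed Gibbs sampling in plain Python. This is an assumption: the article does not say which algorithm or engine was used, and in practice a library would do this work. The `fit_lda` function, its hyperparameters (`alpha`, `beta`, `iterations`), and the toy documents are all hypothetical.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns, for each document, a list of
    topic proportions that sums to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k overall

    # Start from a random topic assignment for every token
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    proportions = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        proportions.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return proportions

# Toy corpus of already-cleaned token lists (hypothetical)
docs = [
    ["open", "data", "portal", "dataset", "data"],
    ["dataset", "open", "data", "license"],
    ["course", "learning", "mooc", "students"],
    ["mooc", "course", "learning", "online"],
]

# One run per candidate number of topics
results = {k: fit_lda(docs, num_topics=k) for k in (2, 3)}
for k, proportions in results.items():
    for d, p in enumerate(proportions):
        best = max(range(k), key=lambda t: p[t])
        print(f"k={k} doc={d} topic={best} probability={p[best]:.2f}")
```

Each run prints, for every document, its most probable topic and the associated probability, which is the kind of output described above. With a corpus this small the groupings are not meaningful; the point is only the shape of the loop over candidate topic counts.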

What tools are available to implement topic modeling?

There are multiple tools that can help you run the Topic Modeling exercise, such as:

  • For working in open source, the Gensim library developed for Python or the topicmodels package for R.
  • Even though they are not open, there are several other services that let you perform topic modeling, even with limited coding experience and at a reasonable cost. Two examples are the Amazon Comprehend service on AWS and the Latent Dirichlet Allocation (LDA) module included in Azure Machine Learning Studio.

Interpreting the results

Analyzing the results of a topic modeling exercise can be a very subjective task, so it is important to involve subject matter experts and to cross-validate the patterns the machine has identified with human judgment. We tried configurations ranging from 3 to 10 topics and carefully compared the results of each output, until we finally homed in on the balance offered by the 5-topic results, which we interpreted as these categories:

  • Aprendizaje Abierto (Open Learning)
  • Código Abierto (Open Source)
  • Datos Abiertos (Open Data)
  • Gestión del Conocimiento (Knowledge Management)
  • Sistemas Abiertos (Open Systems)
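One way to support this comparison between runs is to summarize each one numerically before the human review. The sketch below is an illustration under assumptions, not the method used for the blog: the `summarize_run` helper, the two metrics (a balance ratio and a mean confidence), and the topic proportions are all made up for the example.

```python
from collections import Counter

def summarize_run(proportions):
    """Summarize one topic modeling run.

    proportions holds one list of topic probabilities per article. Each
    article is assigned to its most probable topic; the summary reports
    the group sizes, the ratio of smallest to largest group (1.0 means
    perfectly balanced), and the average probability of the chosen topic.
    """
    num_topics = len(proportions[0])
    dominant = [max(range(num_topics), key=lambda t: p[t]) for p in proportions]
    sizes = Counter(dominant)
    smallest = min(sizes.get(t, 0) for t in range(num_topics))
    largest = max(sizes.values())
    mean_confidence = sum(max(p) for p in proportions) / len(proportions)
    return {
        "sizes": dict(sizes),
        "balance": smallest / largest,
        "mean_confidence": mean_confidence,
    }

# Hypothetical outputs of two runs over the same four articles
runs = {
    2: [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]],
    3: [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
}

summaries = {k: summarize_run(p) for k, p in runs.items()}
for k, summary in summaries.items():
    print(k, summary)
```

Numbers like these can narrow down the candidates, but they do not replace the expert review: a balanced run can still produce groupings that make no thematic sense.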

Once we reached that point, we repeated the topic modeling process on the content inside each category to identify more specific sub-themes or clusters. This second round helped us build out new content that could highlight each category and its related subtopics. From there, we could also make the final validations and adjustments regarding specific tags or the incorporation of specific keyphrases for SEO.

Applying and implementing the results into our strategy for improved search visibility

This classification structure has helped us expand our content coverage while maintaining specific points of focus. It has also helped us with common legacy issues: with a clear map of existing content at hand, we can avoid duplicating what has already been published and continue building constructively on the conversations we have already invested in. This helps Abierto al Público respond to users’ interests with content that is better structured and connected. It has also contributed to making the content more visible and attractive to search engines.

As a result of this and a few other editorial changes, Abierto al Público has more than doubled the visibility of its content via organic search over the past year.

And you? How do you think topic modeling can benefit knowledge resources for your work, community or government?


Filed Under: Gestión del conocimiento, Knowledge Management Tagged With: Actionable Resources, Methodologies, Natural Language Processing

Kyle Strand

Kyle Strand is Lead Knowledge Management Specialist and Head of the Felipe Herrera Library in the Knowledge, Innovation and Communication Sector of the Inter-American Development Bank (IDB). For more than a decade, his work has focused on initiatives to improve access to knowledge both at the Bank and in the Latin American and Caribbean region. Kyle designed the first open repository of knowledge products at the IDB and spearheaded the idea of software as a knowledge product to be reused and adapted for development purposes, which led the IDB to become the first multilateral to formally recognize it as such. Currently, Kyle coordinates library services within the organization, supports the open knowledge product lifecycle including publications and open data, and promotes the use of artificial intelligence and natural language processing as a cornerstone of knowledge management in the digital age. Kyle is also executive editor of Abierto al Público, a blog in Spanish that promotes the opening and reuse of knowledge. He has a B.A. from the University of Michigan and an M.A. from the George Washington University.

Daniela Collaguazo

Born in Quito, Ecuador in April 1984, Daniela completed her undergraduate studies at the San Francisco de Quito University. Subsequently, she lived for 3 years in Germany, where she completed her master's degree in Technology Management and Innovation at the Brandenburg Technical University Cottbus-Senftenberg. Upon completing her studies, Daniela taught Web Technologies at the Faculty of Architecture, Design and Arts at the Pontificia Universidad Católica del Ecuador. Currently, she is collaborating with the IDB as a consultant on projects related to machine learning and natural language processing. She is passionate about sports and has participated in several competitions in her native country, including one in open water and the first two medium distance triathlons.

Michelle Marshall

Michelle Marshall was the editor of Abierto al Público from 2018 to 2020. She has worked as a knowledge management consultant at the IDB since 2016 facilitating collaborative knowledge-sharing activities and documenting open innovation techniques. Michelle is specifically interested in the application of systems thinking and human-centered design approaches to the widely shared challenges of international development. She studied International Relations at the George Washington University and Designing for Inclusion at the Copenhagen Institute of Interaction Design.

Open Knowledge Team

The editorial team of Abierto al Público at the IDB.


About this blog

Open knowledge can be described as information that is usable, reusable, and shareable without restrictions due to its legal and technological attributes, enabling access for anyone, anywhere, and at any time worldwide.

In the blog 'Abierto al Público,' we explore a wide range of topics, resources, and initiatives related to open knowledge on a global scale, with a specific focus on its impact on economic and social development in the Latin American and Caribbean region. Additionally, we highlight the Inter-American Development Bank's efforts to consistently disseminate actionable open knowledge generated by the organization.

    Blog posts written by Bank employees:

    Copyright © Inter-American Development Bank ("IDB"). This work is licensed under a Creative Commons IGO 3.0 Attribution-NonCommercial-NoDerivatives (CC-IGO 3.0 BY-NC-ND) license and may be reproduced with attribution to the IDB and for any non-commercial purpose. No derivative work is allowed. Any dispute related to the use of the works of the IDB that cannot be settled amicably shall be submitted to arbitration pursuant to the UNCITRAL rules. The use of the IDB's name for any purpose other than for attribution, and the use of IDB's logo shall be subject to a separate written license agreement between the IDB and the user and is not authorized as part of this CC-IGO license. Note that link provided above includes additional terms and conditions of the license.


    For blogs written by external parties:

    For questions concerning copyright for authors that are not IADB employees please complete the contact form for this blog.

    The opinions expressed in this blog are those of the authors and do not necessarily reflect the views of the IDB, its Board of Directors, or the countries they represent.

    Attribution: in addition to giving attribution to the respective author and copyright owner, as appropriate, we would appreciate if you could include a link that remits back the IDB Blogs website.


