August 4th 2023

information retrieval in nlp :: Article Creator

Overcoming Language Barriers In Global Information Retrieval

Welcome to the age of globalization, where accessing information takes only a few clicks. The digital era has transformed the way we gather knowledge, placing it at our fingertips. However, a significant challenge persists: language barriers. These barriers confine valuable insights and pioneering research within linguistic boundaries, limiting our access to a world of discovery. In this article, we explore ways to surmount language barriers in global information retrieval.

Introduction: Understanding Information Retrieval and Language Barriers

In the quest for global information, language barriers pose a formidable hurdle. Even with translation tools, effectively communicating with individuals who speak different languages can be arduous. This challenge becomes more pronounced when dealing with technical jargon and specialized vocabulary.

One way to overcome these barriers is information retrieval (IR), the process of locating pertinent information within a given text corpus. IR typically relies on keywords, Boolean operators, and other search methods.

Various resources aid in IR, including online dictionaries, machine translation services, and bilingual text collections. Techniques such as stemming and lemmatization, which reduce inflected words to a common base form, can improve IR effectiveness by letting a query match related word forms.
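To make the ideas of keyword search, Boolean operators, and stemming concrete, here is a minimal sketch in Python. It builds a small inverted index and answers AND/OR queries. The suffix-stripping function `toy_stem`, the sample documents, and all function names are illustrative assumptions; a real system would use an established stemmer or lemmatizer.

```python
from collections import defaultdict


def toy_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and stem each one."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return [toy_stem(w) for w in cleaned.split()]


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: documents that contain every query term."""
    postings = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()


def search_or(index, query):
    """Boolean OR: documents that contain at least one query term."""
    postings = [index.get(t, set()) for t in tokenize(query)]
    return set.union(*postings) if postings else set()


# Hypothetical sample corpus.
DOCS = {
    1: "Translating documents for global search",
    2: "Search engines index translated documents",
    3: "Cultural nuances in translation",
}
INDEX = build_index(DOCS)
```

Because "translating" and "translated" reduce to the same stem, `search_and(INDEX, "translated documents")` matches documents 1 and 2. The toy stemmer does not reduce "translation", which is the kind of gap a proper stemmer or lemmatizer is meant to close.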

Despite language barriers, IR stands as a potent tool for accessing global knowledge. Armed with suitable strategies and resources, it's feasible to overcome these barriers and access vital information.

Key Hurdles in Bridging Language Barriers in IR

A primary challenge in overcoming language barriers in information retrieval arises from the absence of standardized terminology and indexing practices. This discrepancy complicates information search and retrieval, leading to inconsistent outcomes across different search engines.

The profusion of online content in languages other than English presents another challenge. While machine translation aids in deciphering this content, its accuracy may be compromised, missing crucial nuances. This undermines the reliability of retrieved information.

Furthermore, IR experts need to comprehend and evaluate foreign-language documents. This demands a comprehensive grasp of the language itself and an awareness of cultural variations influencing information conveyance.

Approaches to Overcoming Language Barriers in IR

Multiple approaches can surmount language barriers in information retrieval. Employing machine translation tools to automatically translate documents is one avenue. Alternatively, human translators can manually undertake translations.

Machine translation may lack precision, especially for intricate or legal documents, potentially impacting decision-making. Human translators yield more accurate translations but at a higher cost and with potential language limitations.

Another approach is using bilingual or multilingual search engines, permitting users to specify languages for search results. This is beneficial when relevant documents exist in other languages.

Another technique is synonym-based query expansion, in which a single-word query also retrieves documents containing synonyms of that word, so that relevant documents are found even when they use different vocabulary.
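The synonym idea can be sketched in a few lines of Python. The thesaurus `SYNONYMS`, the sample documents, and the function names below are hypothetical. A production system would draw synonyms from a curated or learned resource, and for cross-language search from a bilingual dictionary.

```python
# Hypothetical hand-built thesaurus; real systems use much larger resources.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "doctor": {"physician"},
}


def expand_query(term, synonyms=SYNONYMS):
    """Return the query term together with its known synonyms."""
    term = term.lower()
    return {term} | synonyms.get(term, set())


def search_with_synonyms(term, docs, synonyms=SYNONYMS):
    """Return ids of documents containing the term or any of its synonyms."""
    wanted = expand_query(term, synonyms)
    hits = []
    for doc_id, text in docs.items():
        if set(text.lower().split()) & wanted:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample corpus.
DOCS = {
    1: "the automobile industry in europe",
    2: "a physician explains vaccine trials",
    3: "electric vehicle sales rise",
    4: "weather report for tuesday",
}
```

Here a query for "car" returns documents 1 and 3 even though neither contains the word itself.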

Best Practices for Translating Content for IR

Prudent translation for information retrieval involves adhering to specific best practices, ensuring accurate representation and easy accessibility for the target audience.

Tips for translating content for IR:

Employ straightforward, comprehensible language.

Minimize jargon and technical terms.

Accurately translate field names and metadata.

Consider machine translation for extensive content, with subsequent proofreading and editing.

Test translated content within an IR system to confirm proper retrieval and display.

Engage professional translators to ensure precision and consistency.

Establish a uniform style guide encompassing capitalization, hyphenation, and language-specific typographic elements.

Utilizing Natural Language Processing Tools for IR

The burgeoning field of natural language processing (NLP) offers tailored solutions for information retrieval (IR). A variety of tools and technologies automate text processing and analysis.

Prominent NLP tools for IR encompass:

Text classification: Assigning documents to predefined categories.

Text clustering: Grouping similar documents, creating categorical or hierarchical structures.

Topic modeling: Identifying main document topics, facilitating summarization and categorization.

Entity extraction: Automatically recognizing named entities like people, places, organizations.

Sentiment analysis: Evaluating sentiment in text data, useful for opinion mining.

Text summarization: Generating document summaries, aiding in query expansion.

Strategies for Multilingual Search Engine Optimization (SEO)

Multilingual search engine optimization (SEO) involves several strategies. Design websites for easy translation, using internationalized domain names (IDNs) and translated content.

Key strategies entail:

Using country-specific top-level domains (TLDs) to signal the target country to search engines, which may improve rankings in regional search results.

Submitting websites to international directories and search engines.

Leveraging social media to broaden outreach, engaging global audiences.

Crafting localized content to optimize for regional search engines.

Conclusion

Language barriers present formidable challenges to accessing global information. Technological advancements offer solutions to surmount these barriers. Machine translation and natural language processing empower us to transcend language limitations, accessing information that was once unreachable. With continued innovation, unlocking global knowledge will become more attainable, benefiting us collectively.

How To Detect Fake News With Natural Language Processing

The sheer volume of information produced every day makes it difficult to distinguish between real and fake news, but advances in natural language processing (NLP) present a possible solution.

In today's digital era, the spread of information via social media and internet platforms has given people the power to access news from many different sources. The growth of fake news, meanwhile, is a drawback of this independence. Fake news is inaccurate information that has been purposefully spread to confuse the public and undermine confidence in reputable journalism. Maintaining an informed and united global community requires identifying and eliminating fake news.

NLP, a subfield of artificial intelligence, gives computers the capacity to comprehend and interpret human language, making it a crucial tool for identifying deceptive information. This article examines how NLP can be used to identify fake news and gives examples of how it can be used to unearth misleading data.

Sentiment analysis

Sentiment analysis using NLP can be an effective strategy for identifying fake news. By analyzing the emotions displayed in a news story or social media post, NLP algorithms can estimate an author's intent and possible biases. Fake news frequently preys on readers' emotions by using strong language or exaggeration.

For instance, an NLP-based sentiment analysis model can flag a news item covering a political incident as significantly biased in favor of a specific party and as using emotionally charged language to sway public opinion.
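As a rough illustration of how emotionally charged language might be measured, the sketch below computes the share of words in a text that appear in a small lexicon. The lexicon `CHARGED_WORDS`, the 0.15 threshold, and the function names are assumptions chosen for the example. Real sentiment models are trained on labeled data and are considerably more nuanced.

```python
import re

# Hypothetical lexicon of emotionally charged words.
CHARGED_WORDS = {
    "outrageous", "shocking", "disaster", "betrayal",
    "destroy", "evil", "scandal",
}


def charged_ratio(text, lexicon=CHARGED_WORDS):
    """Fraction of tokens in the text that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def looks_emotionally_charged(text, threshold=0.15):
    """Flag text whose share of charged words reaches the (assumed) threshold."""
    return charged_ratio(text) >= threshold
```

A headline such as "Shocking betrayal: officials destroy the city budget" would be flagged, while "The council approved the city budget on Tuesday" would not. A flag of this kind is only a signal for closer review, not proof that a story is false.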

Semantic analysis and fact-checking

To confirm the accuracy of the material, fact-checking tools driven by NLP can analyze the content of a news piece against reliable sources or databases. By highlighting inconsistencies and contradictions that can point to fake news, semantic analysis aids in understanding the meaning and context of the language that is being used.

An NLP-based fact-checking system, for instance, can instantly cross-reference a news article's assertion that a well-known celebrity endorses a contentious product with reliable sources to ascertain its veracity.
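The cross-referencing step can be sketched as a search for the trusted statement that most resembles a claim. The example below uses word-overlap (Jaccard) similarity, which is an assumption made for simplicity. The function names, the 0.5 cutoff, and the sample statements are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_evidence(claim, trusted_statements, min_overlap=0.5):
    """Return (statement, score) for the trusted statement closest to the claim.

    The statement is None when no trusted statement overlaps enough.
    """
    claim_tokens = tokens(claim)
    best, best_score = None, 0.0
    for statement in trusted_statements:
        statement_tokens = tokens(statement)
        union = claim_tokens | statement_tokens
        score = len(claim_tokens & statement_tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = statement, score
    if best_score >= min_overlap:
        return best, best_score
    return None, best_score


# Hypothetical statements from a trusted database.
TRUSTED = [
    "The actor signed an endorsement deal with the shoe brand",
    "The city council approved the budget",
]
```

Note that a claim that the actor endorsed a "supplement brand" still matches the first statement with high overlap. The sketch only retrieves the evidence to compare against. Deciding whether the claim contradicts that evidence requires the deeper semantic analysis described above.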

Named entity recognition (NER)

In NLP, named entity recognition (NER) enables computers to recognize and categorize particular entities referenced in a text, such as individuals, groups, places or dates. Identifying the significant players in a story helps debunk fake news by exposing contradictions or made-up information.

For example, in a news article about a purported environmental disaster, NER algorithms may flag mentions of nonexistent organizations or locations as potential signs of fake news.
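A minimal sketch of this check follows. It uses a regular expression that picks up runs of two or more capitalized words as a crude stand-in for a trained NER model, and it compares the results with a reference list. The list `KNOWN_ENTITIES`, the invented entity names in the sample text, and the function names are all assumptions for illustration.

```python
import re

# Hypothetical reference list of entities known to exist.
KNOWN_ENTITIES = {"World Health Organization", "Pacific Ocean", "Greenpeace"}

# Runs of two or more capitalized words; a crude stand-in for real NER.
ENTITY_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def extract_candidate_entities(text):
    """Return multi-word capitalized phrases found in the text, in order."""
    return ENTITY_PATTERN.findall(text)


def unverified_entities(text, known=KNOWN_ENTITIES):
    """Return candidate entities that are missing from the reference list."""
    return [e for e in extract_candidate_entities(text) if e not in known]


# Sample sentence with invented organization and place names.
TEXT = (
    "According to the Global Reef Council, oil reached the "
    "Pacific Ocean near Port Vellmar."
)
```

For the sample sentence, `unverified_entities(TEXT)` returns the two invented names while leaving "Pacific Ocean" alone. An unverified entity is only a prompt for a human or a downstream system to investigate, since reference lists are never complete.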

Recognizing sensationalism and clickbait

NLP models may be trained to spot sensationalized language and clickbait headlines, both of which are characteristics of fake news. These methods can assist in filtering out false information and ranking trustworthy news sources.

For instance, sensational phrases and inflated claims that frequently accompany clickbait articles can be found by analyzing headlines and content using an NLP-powered algorithm.
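The kind of surface signals such an algorithm might look for can be sketched with a simple rule-based scorer. The phrase list, the scoring rules, and the function name are assumptions for illustration. A trained model would learn such cues from labeled headlines rather than rely on a fixed list.

```python
import re

# Hypothetical list of phrases that often appear in clickbait headlines.
CLICKBAIT_PHRASES = (
    "you won't believe",
    "what happened next",
    "shocking",
    "this one trick",
    "will blow your mind",
)


def clickbait_score(headline):
    """Count simple clickbait signals in a headline (higher is more suspicious)."""
    text = headline.lower()
    score = sum(1 for phrase in CLICKBAIT_PHRASES if phrase in text)
    # Repeated exclamation marks count as one extra signal.
    if headline.count("!") >= 2:
        score += 1
    # Two or more fully capitalized words count as one extra signal.
    words = re.findall(r"[A-Za-z]+", headline)
    shouted = [w for w in words if len(w) > 2 and w.isupper()]
    if len(shouted) >= 2:
        score += 1
    return score
```

With these rules, "You won't believe what happened next!!" scores 3, while "Council approves annual budget" scores 0.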

Assessing the reliability of the source

NLP methods are capable of analyzing historical information on news organizations, such as their standing, reliability and historical reporting accuracy. This data can be used to evaluate the validity of fresh content and spot potential fake news sources.

For instance, an NLP-powered system may evaluate the legitimacy of a less well-known website that published a startling news report before deeming the content reliable.
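One simple way to turn a source's track record into a score is a smoothed accuracy rate, sketched below. The smoothing priors, the 0.8 threshold, the site names, and the counts are assumptions made for the example. The sketch also presumes that past reports have already been labeled as accurate or inaccurate.

```python
def reliability(accurate, inaccurate, prior_accurate=1.0, prior_inaccurate=1.0):
    """Smoothed share of a source's past reports that were judged accurate.

    The priors keep sources with little history close to 0.5.
    """
    total = accurate + inaccurate + prior_accurate + prior_inaccurate
    return (accurate + prior_accurate) / total


def needs_extra_checks(history, source, threshold=0.8):
    """True when a source's smoothed reliability is below the threshold."""
    accurate, inaccurate = history.get(source, (0, 0))
    return reliability(accurate, inaccurate) < threshold


# Hypothetical track records: (accurate reports, inaccurate reports).
HISTORY = {
    "established-daily.example": (98, 2),
    "new-site.example": (1, 0),
}
```

The smoothing means a little-known site with a single accurate report (score about 0.67) is still sent for extra checks, while a source with a long accurate record (score about 0.97) is not.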

A Computational Inflection For Scientific Discovery

By Tom Hope, Doug Downey, Daniel S. Weld, Oren Etzioni, Eric Horvitz
Communications of the ACM, August 2023, Vol. 66 No. 8, Pages 62-73
DOI: 10.1145/3576896

Credit: Peter Crowther Associates

We stand at the foot of a significant inflection in the trajectory of scientific discovery. As society continues its digital transformation, so does humankind's collective scientific knowledge and discourse. The transition has led to the creation of a tremendous amount of information, opening exciting opportunities for computational systems that harness it. In parallel, we are witnessing remarkable advances in artificial intelligence, including large language models capable of learning powerful representations from unstructured text. The confluence of societal and computational trends suggests that computer science is poised to ignite a revolution in the scientific process itself.

At the heart of the scientific process, a basic behavior has remained unchanged for hundreds of years: We build on existing ideas to form new ideas. When faced with a new question or problem, we leverage knowledge from accumulated learnings and from external sources, and we perform synthesis and reasoning to generate insights, answers, and directions. But the last few decades have brought change. The explosion of digital information and steep acceleration in the production of scientific data, results, and publications—with more than one million papers added every year to the PubMed biomedical index alone—stand in stark contrast to the constancy of human cognitive capacity. While scientific knowledge, discourse, and the larger scientific ecosystem are expanding with rapidity, our human minds have remained static, with severe limitations in the capacity to find, assimilate, and manipulate information.

Herbert Simon's reflection that "…a wealth of information creates a poverty of attention" aptly describes the limited attention of researchers in the modern scientific ecosystem. Even within narrow areas of interest, there is a vast space of potential directions to explore, while the keyhole of cognition admits only a tiny fraction of the broad landscape of information and deliberates over small slices of possibility. The way we search through and reflect on information across the vast space (the areas we select to explore and how we explore them) is hindered by cognitive biases [26] and lacks principled and scalable tools for guiding our attention [32]. "Unknowns" are not just holes in science but important gaps in personal knowledge about the broader knowns across the sciences. We thus face an imbalance between the treasure trove of scholarly information and our limited ability to reach into it. Despite technological advances, we require new paradigms and capabilities to address this widening gap.

We see promise in developing new foundational capabilities that address the cognitive bottleneck, aimed at extending human performance on core tasks of research, for example, keeping abreast of developments, forming and prioritizing ideas, conducting experiments, and reading and understanding papers (see Table 1). We focus on a research agenda we call task-guided scientific knowledge retrieval, in which systems counter humans' bounded capacity by ingesting corpora of scientific knowledge and retrieving inspirations, explanations, solutions, and evidence synthesized to directly serve task-specific utility. We present key concepts of task-guided scientific knowledge retrieval, including work on prototypes that highlight the promise of the direction and bring into focus concrete steps forward for novel representations, tools, and services. Then we review systems that help researchers discover novel perspectives and inspirations [8, 9, 11, 29], systems that help guide the attention of researchers toward opportunity areas rife with uncertainties and unknowns [18, 32], and models that leverage retrieval and synthesis of scientific knowledge as part of machine learning and prediction [6, 24]. We conclude with a discussion of opportunities ahead with computational approaches that have the potential to revolutionize science. To set the stage, we begin by discussing some fundamental concepts and background for our research agenda.

Table 1. Research may be decomposed into salient tasks that are prime targets for computational augmentation.

Human-Centric Perspective

Extraordinary developments at the convergence of AI and scientific discovery have emerged in specific areas, including new kinds of analytical tools; the prominent example is AlphaFold, which harnesses deep neural models to dramatically improve the prediction of protein structure from amino-acid sequence information [15]. Large language models (LLMs) have very recently made stellar progress in the ability to reason about complex tasks, including in the medical domain [25]. The most advanced LLM at present, emerging before the ink has dried on this article, is GPT-4, which has exhibited jaw-dropping skill in handling clinical questions, mathematical problems, and computer coding tasks [1].

We view these developments as tremendous research opportunities for building computational approaches that accelerate scientific discovery. We take a human-centered, cognitive perspective: augmenting researchers by considering the diversity of tasks, contexts, and cognitive processes involved in consuming and producing scientific knowledge. Collectively, we refer to these as a researcher's inner cognitive world (see Figure 1). The researcher interacts with the scientific ecosystem (literature, resources, discussions) to inform decisions and actions. Researchers have different uses for scholarly information, depending on the task at hand and the stage of exploration (see Table 1 and discussion in the section, "Task-Guided Retrieval"). We pursue a research agenda around assisting researchers in their tasks, guided by two main desiderata:

Figure 1. Information flows from the outer world into the inner cognitive world of researchers, constrained by cognitive capacity and biases. We see opportunities to support researchers by retrieving knowledge that helps with tasks across multiple phases of the scientific process (see Table 1).

1. Systems for augmenting human capabilities in the sciences need to enhance the effective flow of knowledge from the outer world of scientific information and discourse to the researcher's inner cognitive world, countering humans' bounded capacity by retrieving and synthesizing information targeted to enhance performance on tasks. Achieving this goal requires methods that build and leverage rich representations of scientific content and that can align computational representations with human representations, in the context of specific tasks and backgrounds of researchers.

2. Research on such systems should be rooted in conceptual models of the inner cognitive world of a researcher. Shining a spotlight on this inner world brings numerous factors and questions to the fore. How do researchers form ideas? How do they decide which problems to look into? How do they find and assimilate new information in the process of making decisions? What cognitive representations and bottlenecks are involved? What computing services would best augment these processes?

Background and related themes. We leverage research in natural language processing (NLP), information retrieval, data mining, and human-computer interaction (HCI) and draw concepts from multiple disciplines. For example, efforts in metascience focus on sociological factors that influence the evolution of science [17], such as analyses of information silos that impede mutual understanding and interaction [38], of macro-scale ramifications of the rapid growth in scholarly publications [4], and of current metrics for measuring impact [5]; this work is enabled by the digitization of scholarly corpora. Metascience research makes important observations about human biases (desideratum 2) but generally does not engage in building computational interventions to augment researchers (desideratum 1). Conversely, work in literature-based discovery [33] mines information from literature to generate new predictions (for example, functions of materials or drug targets) but is typically done in isolation from cognitive considerations; however, these techniques have great promise in being used as part of human-augmentation systems. Other work uses machines to automate aspects of science. Pioneering work from Herbert Simon and Pat Langley automated discovery of empirical laws from data, with models inspired by cognitive mechanisms of discovery. More recent work has focused on developing robot scientists [16, 30] that run certain experiments in biology or chemistry (not only formulating hypotheses but "closing the loop" through automated tests in a physical laboratory), where robots may use narrow, curated background knowledge (for example, of a specific gene regulatory network) and machine learning to guide new experiments. Related work explores automating scientific data analysis [6], which we discuss in the "Task-Guided Retrieval" section as a case of retrieval from scientific repositories to augment aspects of experimentation and analysis (see Table 1).

We now turn to a discussion of central concepts: the ecosystem of science and the cognitive world. This presentation lays the foundations for our exposition of task-guided retrieval and research opportunities.

Outer world: Scientific ecosystem. We collectively name the scientific ecosystem and the digital representations of scientific knowledge as the outer world (see Figure 1). The outer world comprises scientific communities: a complex and shifting web of peers, concepts, methodologies, problems, and directions revolving around shared interests, understandings, and paradigms. This ecosystem generates digital information—digital "traces" of scientific thought and behavior—lying at the center of our attention as computer scientists interested in boosting human capacity to "reach into" the pool of scientific knowledge. This knowledge includes scholarly publications that appear in journals, conference proceedings, and online preprint repositories. Online publications are a main case of digital research artifacts; other examples of products of research include software, datasets, and knowledge bases. Research artifacts are also typically associated with signals of quality and interest, such as citations to a specific paper or downloads of a dataset. The specific context for why a paper or resource was cited or used is often reflected in natural language descriptions. Different types of signals include peer review prior to publication (mostly not shared publicly) and social media discussions, such as on Twitter, which has become a major virtual platform for academic dissemination and conversation. Along with the trend in society, private communication channels among researchers are also digital—email, online calls, and messages. Similarly, note taking and writing—important activities across the scientific workflow—are done in digital form. This information is siloed in different platforms under privacy restrictions yet represents a treasure trove for tools for the augmentation of scientific reasoning and exploration.

Inner world: Human cognition in science. The way researchers decide to interact with information in the outer world and the way they process and use this information is governed by a complex array of cognitive processes, personal knowledge and preferences, biases, and limitations, which are only partially understood. We collectively name these the inner world, and briefly discuss several salient aspects.

Early work in AI by Herbert Simon and Allen Newell and later efforts by Pat Langley and Paul Thagard focused on cognitive and computational aspects of problem solving, creativity, decision making, scientific reasoning, and discovery, seeking algorithmic representations to help understand and mimic human intelligence [19, 36]. Cognitive mechanisms that play important roles in scientific discovery include inductive and abductive reasoning, mental modeling of problems and situations, abstraction, decomposition, reformulation, analogical transfer, and recombination. For example, in analogical transfer, given a situation or problem being considered in our working memory, we retrieve prior analogous problems or situations from our long-term memory.

This cognitive machinery powers human ingenuity. However, the human mind also has severe limitations (bounded rationality, in the words of Simon) that impede these powerful mechanisms. Our limitations and capabilities have been studied for more than 100 years in cognitive psychology. Our limitations manifest in bounded cognitive capacity and knowledge, as well as in the biases that govern our behaviors and preferences. These limitations are all tightly interrelated. The ability to generate ideas, for instance, directly relies on prior knowledge, but when a large volume of information from the outer world of science is met by insufficient cognitive capacity for processing and assimilating it, the result is information overload, a ubiquitous hindrance for researchers [29]. Information overload in science strains the attentional resources of researchers, forcing them to allocate attention to increasingly narrow areas. This effect, in turn, amplifies a host of biases from which researchers, just like all humans, suffer [26, 32]. For example, scientists can be limited by confirmation bias, aversion to information from novel domains, homophily, and fixation on specific directions and perspectives without consideration of alternative views [11, 26]. More broadly, selection of directions and areas to work on is a case of decision making; as such, personal preference and subjective utility play fundamental roles. Our research decisions rely on subjective assessment of feasibility, long-term or short-term goals and interests, and even psychological factors such as tendencies toward risk aversion. These factors are also impacted by biases [26]. Clearly, the inner world of researchers is dauntingly complex. However, in the next section, we present encouraging results of applying computational methods to augment cognition in the sciences, helping to mitigate biases and limitations and enabling researchers to make better use of their powerful creative mechanisms.

Task-Guided Retrieval

How might we widen and deepen the connection between the outer world of science and the limited cognitive worlds of researchers? We see a key bridge and research opportunity with developing tools for scientific task-guided knowledge retrieval. Drawing from discussions in literature on the process of scientific discovery, we enumerate in Table 1 salient scientific tasks and activities, such as problem identification, forming directions, learning, literature search and review, and experimentation. These tasks could benefit from augmentation of human capabilities but remain under-explored in computer science. Existing computational technologies to help humans discover scientific knowledge are under-invested in important aspects of the intricate cognitive processes and goal-oriented contexts of scholarly endeavors.

The dominant approach to information-retrieval research and systems can be summarized as "relevance first," focusing on results that answer user queries as accurately as possible. Academic search engines assume users know what queries to explore and how to formulate them. For pinpointed literature search in familiar areas, this assumption may often suffice. But a broad array of other scholarly tasks, such as ideation or learning about a new topic, are very much underserved [9, 10, 11, 18, 29]. At the same time, many voices in the information-retrieval community have discussed a different, broader view of utility-driven search situated in a wider context of information seeking by users with specific intents and tasks [31]. Here, we adapt ideas and principles from this general paradigm.

We envision methods for task-guided scientific knowledge retrieval: systems that retrieve and synthesize outer knowledge in a manner that directly serves a task-guided utility of a researcher, while taking into consideration the researcher's goals, state of inner knowledge, and preferences.

Consider the tasks in Table 1. For researchers engaged in experimentation or analysis, we envision systems that help users identify experiments and analyses in the literature to guide design choices and decisions. For researchers in early stages of selecting problems to work on, we picture systems that support this decision with information from literature and online discussions, synthesized to obtain estimated impact and feasibility. As part of forming directions to address a problem, systems will help users find inspirations for solutions. Researchers who are learning about a new topic will be provided with retrieved texts and discussions that explain the topic in a manner tailored to personal knowledge. Importantly, task-guided knowledge retrieval follows the two desiderata previously introduced; namely, systems should enable users to find knowledge that directly assists them in core research tasks by augmenting their cognitive capacity and mitigating their biases, and computational representations and services should align with salient cognitive aspects of the inner world of researchers.

Prototypes of task-guided retrieval. We present work on initial steps and prototypes, including representative work that we have done and the work of others, framed in alignment with task-guided knowledge retrieval and the tasks enumerated in Table 1. The main aim of this brief review is to stimulate discussion in the computer science community on tools for extending human capabilities in the sciences. Existing methods are far from able to realize our vision. For example, we see major challenges in representation and inferences about the inner world of knowledge and preferences and aligning these with representations and inferences drawn from the outer-world knowledge. Today's prototypes are limited examples of our vision, using very rough proxies of inner knowledge and interest based on papers and documents written or read by the user, or in some cases only a set of keywords.

Forming directions. We have developed methods for helping researchers generate new directions. A fundamental pattern in the cognitive process of creativity involves detecting abstract connections across ideas and transferring ideas from one problem to another [36]. Grounded in this cognitive understanding, we have pursued several approaches for stimulating creativity.

Figure 2. Matching researchers to authors with whom they are unfamiliar, to help generate directions. Author cards show key problems and methods extracted from their papers.

We have also explored retrieving outer knowledge to enhance the human ability to find opportunities for analogical transfer [3, 8]. Extensive work in cognitive studies has highlighted the human knack for "analogical retrieval" as a central function in creativity: bringing together structurally related ideas and adapting them to a task at hand [36]. We developed a search method that enables researchers to search through a database of technological inventions and find mechanisms that can be transferred from distant domains to solve a given problem. Given a textual description of an invention as input from the user, we retrieve ideas (inventions, papers) that have partial structural similarity to the input (for example, inventions with similar mechanisms) to facilitate discovery of analogical transfer opportunities. We found that the method could significantly boost measures of human creativity in ideation experiments, in which users were asked to formulate new ideas after viewing inspirations retrieved with our approach versus baseline information-retrieval methods. For example, a biomechanical engineering lab working on polymer stretching/folding for creating novel structures found useful inspiration in a civil engineering paper on web crippling in steel beams, which is abstractly related to stretching and folding.
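The idea of retrieving by partial structural similarity can be illustrated with a toy sketch. This is not the system described above. It assumes each idea has already been annotated with a domain label and a set of mechanism terms, and it ranks ideas from other domains by the Jaccard overlap of their mechanism terms. The `Idea` class, the function names, and the mechanism annotations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    title: str
    domain: str
    mechanism: frozenset  # assumed to be extracted beforehand


def jaccard(a, b):
    """Jaccard overlap between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def analogical_candidates(query, corpus, top_k=3):
    """Rank ideas from other domains by mechanism overlap with the query."""
    scored = [
        (jaccard(query.mechanism, idea.mechanism), idea.title)
        for idea in corpus
        if idea.domain != query.domain  # keep only distant-domain ideas
    ]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical annotations, loosely modeled on the example in the text.
QUERY = Idea(
    "Polymer folding for novel structures",
    "biomechanics",
    frozenset({"stretching", "folding", "thin sheet"}),
)
CORPUS = [
    Idea("Web crippling in steel beams", "civil engineering",
         frozenset({"folding", "buckling", "thin sheet"})),
    Idea("Polymer film origami", "biomechanics",
         frozenset({"stretching", "folding"})),
    Idea("Database sharding", "computing",
         frozenset({"partitioning", "hashing"})),
]
```

Filtering out the query's own domain is what pushes the results toward distant analogies. The same-domain idea is skipped even though its mechanism overlap is higher.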

Innovation may also involve traversing multiple levels of abstraction around a problem to "break out" of fixation on the details of a specific problem by exploring novel perspectives. Given as input a problem description written by the user (as a proxy summary of the user's inner world of knowledge and purpose), we have pursued mechanisms that can retrieve diverse problem perspectives that are related to the focal problem, with the goal of inspiring new ideas for problem abstraction and reformulation [11] (see Figure 3). Using NLP models to extract mentions of problems, we mine a corpus of technological-invention texts to discover problems that often appear together; we use this information to form a hierarchical problem graph that supports automatic traversal of neighboring problems around a focal problem, surfacing novel inspirations to users. In a study of the efficacy of the methods, more than 60% of "inspirations" retrieved this way were found to be useful and novel, a relative boost of 50%-60% over the best-performing baselines. For example, given an input problem of reminding patients to take medication, our system retrieves related problems, such as in-patient health tracking and alerting devices.
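A simplified version of the co-occurrence step might look like the sketch below. It assumes problem mentions have already been extracted from each document and builds a flat co-occurrence graph; the hierarchy described above is omitted. The function names, the `min_count` cutoff, and the sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_problem_graph(docs_problems, min_count=2):
    """Link problems that are mentioned together in at least min_count documents."""
    pair_counts = Counter()
    for problems in docs_problems:
        for a, b in combinations(sorted(set(problems)), 2):
            pair_counts[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph[a][b] = count
            graph[b][a] = count
    return graph


def neighboring_problems(graph, focal, top_k=3):
    """Return the problems most often co-mentioned with the focal problem."""
    neighbors = graph.get(focal, {})
    ranked = sorted(neighbors.items(), key=lambda item: (-item[1], item[0]))
    return [problem for problem, _ in ranked[:top_k]]


# Hypothetical problem mentions, one list per document.
DOCS_PROBLEMS = [
    ["medication reminders", "patient health tracking"],
    ["medication reminders", "patient health tracking", "alerting devices"],
    ["medication reminders", "alerting devices"],
    ["alerting devices", "battery life"],
    ["medication reminders", "alerting devices"],
]
GRAPH = build_problem_graph(DOCS_PROBLEMS)
```

Calling `neighboring_problems(GRAPH, "medication reminders")` returns "alerting devices" followed by "patient health tracking", mirroring the kind of related problems mentioned in the example above.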

Figure 3. Using an extracted hierarchy of problems to retrieve new perspectives on a focal problem of interest.

Guiding attention and problem identification. We see great opportunity in developing methods for guiding the attention of researchers to important areas in the space of ideas where there exists less knowledge or certainty (Figure 4) [18, 32]. In one direction, we built a search engine that allows users to retrieve outer knowledge in the form of difficulties, uncertainties, and hypotheses in the literature. The key goals of this search mode are to bolster attention to rising and standing challenges of relevance to the user, to help overall with problem identification and selection. We performed experiments with participants from diverse research backgrounds, including medical doctors working in a large hospital. Using query topics as a proxy for the inner world of participants' interests, we found the system could dramatically outperform PubMed search, the go-to biomedical search engine, at discovering important and interesting areas of challenges and directions. For example, while searching PubMed for the ACE2 receptor in the context of COVID-19 returns well-studied results, the prototype system by contrast focuses on finding statements of uncertainty, open questions, and initial hypotheses, such as a paper noting the possibility that ACE2 plays a role in liver damage in COVID-19 patients.
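To give a feel for this search mode, the following sketch filters sentences that mention a topic and also contain an uncertainty cue phrase. It is a keyword-based approximation under stated assumptions, not the prototype described above. The cue list, the function names, and the sample text (invented for illustration) are all hypothetical.

```python
import re

# Hypothetical cue phrases that often signal uncertainty or open questions.
UNCERTAINTY_CUES = (
    "remains unclear",
    "is unknown",
    "open question",
    "may play a role",
    "we hypothesize",
    "further research is needed",
    "possibility that",
)


def split_sentences(text):
    """Split text into sentences at ., ? or ! followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def uncertainty_statements(text, topic):
    """Return sentences that mention the topic and contain an uncertainty cue."""
    topic = topic.lower()
    results = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if topic in lowered and any(cue in lowered for cue in UNCERTAINTY_CUES):
            results.append(sentence)
    return results


# Invented sample text, for illustration only.
SAMPLE = (
    "ACE2 is a widely studied receptor. "
    "Whether ACE2 levels differ between patient groups remains unclear. "
    "The liver was examined in all patients. "
    "One paper notes the possibility that ACE2 plays a role in liver damage."
)
```

For the sample text and the topic "ACE2", the function returns the second and fourth sentences and skips the well-established statement in the first.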

Figure 4. Suggesting research opportunities for query concepts (for example, medical topics) by identifying blind spots, gaps in collective knowledge, and promising areas for exploration.
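
As a rough illustration of searching for statements of uncertainty rather than settled findings, the sketch below keeps sentences that mention a query term and also contain a hedging cue. This keyword heuristic is an assumption chosen for illustration, not the method behind the prototype, and the cue list and sample sentences are hypothetical.

```python
import re

# Hypothetical, deliberately small list of hedging cues.
HEDGE_PATTERN = re.compile(
    r"\b(?:may|might|unclear|unknown|possibly|hypothesi[sz]e[sd]?|remains? to be)\b",
    re.IGNORECASE,
)


def find_uncertain_statements(sentences, query):
    """Return sentences that mention the query and contain a hedging cue."""
    query_lower = query.lower()
    return [
        s for s in sentences
        if query_lower in s.lower() and HEDGE_PATTERN.search(s)
    ]


# Toy sentences written for this example, not quotations from the literature.
sentences = [
    "ACE2 may play a role in liver damage in COVID-19 patients.",
    "ACE2 is the entry receptor for SARS-CoV-2.",
    "The long-term effects remain unclear.",
]
uncertain = find_uncertain_statements(sentences, "ACE2")
```

A cue list like this will produce false positives (for instance, "May" as a month), which is one reason a learned classifier would be preferable in practice.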

Another direction on biases and blind spots considers the long-term effort to identify protein-protein interactions (PPIs). A dataset of the growing graph of confirmed PPIs over decades was constructed and leveraged to identify patterns of scientific attention.32 A temporal analysis revealed a significant "bias of locality," where explorations of PPIs are launched more frequently from those that were most recently studied, rather than following more general prioritization of exploration. While locality reflects an understandable focus on adjacent and connected problems in the biosciences, the pattern of attention leads to systematic blind spots in large, widely used PPI databases that are likely unappreciated—further exacerbating attentional biases. The study further demonstrated mechanisms for reprioritizing candidate PPIs based on properties of proteins and showed how earlier discoveries could be made using debiasing methods. The findings underscore the promise of tools that retrieve existing outer-world knowledge to guide attention to worthwhile directions. In this case, the outer-knowledge source is a PPI database, and a user-selected subgraph provides a proxy for inner-world knowledge and interests.
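
To make the idea of a locality bias concrete, the sketch below computes one possible locality measure: the share of newly studied interactions that involve a protein appearing in the immediately preceding studies. The measure, the window size, and the toy protein labels are assumptions for illustration; this is not the statistic used in the cited study.

```python
def locality_rate(edges, window=2):
    """Fraction of interactions that touch a protein seen in the previous `window` interactions.

    `edges` is a list of (protein_a, protein_b) pairs in chronological order.
    The first interaction has no history and is skipped.
    """
    hits = 0
    considered = 0
    for i in range(1, len(edges)):
        a, b = edges[i]
        recent = {p for pair in edges[max(0, i - window):i] for p in pair}
        considered += 1
        if a in recent or b in recent:
            hits += 1
    return hits / considered if considered else 0.0


# Toy chronology (hypothetical labels): the last study starts from unrelated proteins.
history = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
rate = locality_rate(history, window=2)
```

A rate close to 1 would indicate that exploration mostly continues from recently studied proteins; comparing it against a shuffled chronology would be one way to judge whether the value is unusually high.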

Literature search and review. A great body of work on literature search and review has deep relevance to task-guided retrieval in the sciences. In particular, we see great opportunity to build on recent advances in information retrieval to help biomedical researchers with domain-specific representations and to enhance scientific search by building new neural models. Specialized search systems have been developed for the biomedical domain, with the overall vision of harnessing natural-language-understanding technologies to help researchers discover relevant evidence and expedite the costly process of systematic literature review.27 For example, Nye et al.27 built a search-and-synthesis system based on automated extraction of biomedical treatment-outcome relations from clinical trial reports. The system was found to assist in identifying drug-repurposing opportunities. As another recent example, the SPIKE system enables researchers to extract and retrieve facts from a corpus using an expressive query language with biomedical entity types and new term classes that the user can interactively define.34 Together, this work underscores the importance of extracting a semantically meaningful representation of outer-world knowledge that aligns with core aspects of inner-world reasoning by researchers.

Information overload in science strains the attentional resources of researchers, forcing them to allocate attention to increasingly narrow areas.

In separate work, neural language models built via self-supervision on large corpora of biomedical publications have recently led to performance boosts and new features in literature search systems,39 such as support for natural language queries that provide users with a more natural way to formulate their informational goals. Neural models have also been trained to match abstract discourse aspects of pairs of papers (for example, sentences referring to methodologies) and automatically retrieve documents that are aspectually similar.23 By employing a representation that aligns with scientific reasoning across areas, this method achieves state-of-the-art results across biomedical and computer science literature.
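
The notion of aspect-level similarity can be sketched as follows. The example compares papers only on the text of one chosen aspect (here, a "method" field) and ranks candidates by cosine similarity. It deliberately swaps the neural models discussed above for simple bag-of-words vectors, and the field names and toy papers are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a learned sentence encoder."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_aspect(query_paper, papers, aspect):
    """Rank paper ids by similarity to the query paper on a single aspect."""
    query_vec = bag_of_words(query_paper[aspect])
    scored = [(cosine(query_vec, bag_of_words(p[aspect])), p["id"]) for p in papers]
    return [pid for _, pid in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy papers (hypothetical): p1 shares the method aspect, p2 shares only the topic.
query = {"id": "q", "method": "contrastive learning with transformers",
         "background": "protein function prediction"}
papers = [
    {"id": "p2", "method": "random forest on tabular features",
     "background": "protein function prediction"},
    {"id": "p1", "method": "transformers trained with contrastive learning",
     "background": "legal document search"},
]
by_method = rank_by_aspect(query, papers, "method")
by_background = rank_by_aspect(query, papers, "background")
```

Ranking on different aspects gives different orderings for the same papers, which is the point of keeping aspect representations separate.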

Experimentation, analysis, and action. Beyond helping researchers via awareness and knowledge, we see great opportunities to use scientific corpora to construct task-centric inferential systems with automated models and tools for assisting with analysis, prediction, and decisions. We demonstrate these ideas by casting two different lines of work as cases of task-guided retrieval.

Workflows are multi-step computational pipelines used as part of scientific experimentation for data preparation, analysis, and simulation.6 Technically, this includes executing code scripts, services, and tools; querying databases; and submitting jobs to the cloud. In the life sciences, in areas such as genomics, there are specialized workflow-management systems to help researchers find and use workflows, enabled by a community that creates and publicly shares repositories of workflows with standardized interfaces, metadata, and functional annotations of tools and data. As discussed in Gil,6 machine-learning algorithms can potentially use these resources to automate workflow construction, learning to retrieve and synthesize data-analysis pipelines. In this setting, outer-world knowledge takes the form of workflow repositories, from which systems retrieve and synthesize modular building blocks; the user's inner world is reflected via analysis objectives and constraints.
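
One way to picture retrieving and composing modular building blocks is a search over tool annotations. The sketch below assumes each tool in a repository is annotated with a single input type and a single output type, and uses breadth-first search to find a shortest chain of tools from the data the user has to the result the user wants. The tool names are hypothetical, and real workflow systems carry much richer metadata.

```python
from collections import deque


def compose_pipeline(tools, source_type, target_type):
    """Return a shortest list of tool names turning source_type into target_type.

    `tools` is a list of (name, input_type, output_type) annotations.
    Returns None when no chain of tools connects the two types.
    """
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        data_type, path = queue.popleft()
        if data_type == target_type:
            return path
        for name, input_type, output_type in tools:
            if input_type == data_type and output_type not in seen:
                seen.add(output_type)
                queue.append((output_type, path + [name]))
    return None


# Hypothetical repository annotations using common genomics file types.
repository = [
    ("align_reads", "fastq", "bam"),
    ("call_variants", "bam", "vcf"),
    ("plot_quality", "fastq", "png"),
]
pipeline = compose_pipeline(repository, "fastq", "vcf")
```

Here the user's objective and constraints are reduced to a source and target type; a learned system could instead rank candidate chains by how well they match a free-text description of the analysis goal.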

In our work on clinical predictions,24 the goal is to enhance prediction of medical outcomes of patients hospitalized in the intensive care unit (ICU), such as in-hospital mortality or prolonged length of stay. Our system, Biomedical Evidence Enhanced Prediction (BEEP), learns to make predictions by retrieving medical papers that are relevant to each specific ICU patient and synthesizing this outer knowledge in combination with internal EMR knowledge to form a final prediction. The primary envisaged user is a practice-oriented researcher: a medical doctor whose inner knowledge is approximated by internal clinical notes, from which we extract "queries" issued over medical papers. We find that BEEP provides large improvements over state-of-the-art models that do not use retrieval from the literature. BEEP's output can be aligned with inner-world representations, for example, matches between patient aspects and related cohorts in papers (see Figure 5).

Figure 5. Leveraging medical corpora to enhance the precision of AI models for inference about patient outcomes.
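
The general pattern of combining an internal prediction with retrieved evidence can be sketched as a simple late fusion. This is not BEEP's architecture, which the text above does not specify; the weighting scheme, the `alpha` parameter, and the numbers are assumptions made purely for illustration.

```python
def fuse_prediction(emr_prob, evidence, alpha=0.5):
    """Blend an EMR-only risk estimate with literature-derived estimates.

    `evidence` is a list of (relevance, probability) pairs, one per retrieved
    paper. Literature estimates are averaged with relevance as the weight, then
    mixed with the EMR-only estimate using weight `alpha` on the EMR side.
    """
    total_relevance = sum(relevance for relevance, _ in evidence)
    if total_relevance == 0:
        # Nothing useful was retrieved, so fall back to the EMR-only estimate.
        return emr_prob
    literature_prob = sum(r * p for r, p in evidence) / total_relevance
    return alpha * emr_prob + (1 - alpha) * literature_prob


# Hypothetical numbers: two equally relevant papers suggest a higher risk.
fused = fuse_prediction(0.4, [(1.0, 0.8), (1.0, 0.6)], alpha=0.5)
```

Keeping the per-paper relevance weights visible is also what would let a system show the user which retrieved cohorts influenced the final estimate.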

Learning and understanding. We introduced a system22 for helping users learn about new concepts by showing definitions grounded in familiar concepts—for example, a new algorithm is explained as a variant of an algorithm familiar to the user. Cognitive studies have asserted that effective descriptions of a new concept ground it within the network of known concepts. Our system takes as input a list of source concepts reflecting the user's inner knowledge as obtained from papers that they have written or read. When the user seeks a definition of a new target concept, we retrieve outer knowledge in the form of definitions appearing in scientific papers in which the target concept is explained in terms of the source concepts; a neural text-generation model then rewrites the text in a structured, templated form that relates the target to the source.
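
The retrieval step described above, finding definitions of a target concept that are phrased in terms of concepts the user already knows, can be sketched as a filter and rank. Substring matching stands in for proper concept recognition, the neural rewriting step is omitted, and the function name and toy definitions are hypothetical.

```python
def grounded_definitions(definitions, target, source_concepts):
    """Return definitions of `target` that mention known concepts, best-grounded first."""
    ranked = []
    for text in definitions:
        lowered = text.lower()
        if target.lower() not in lowered:
            continue
        matched = [c for c in source_concepts if c.lower() in lowered]
        if matched:
            ranked.append((len(matched), text))
    # Stable sort: definitions mentioning more known concepts come first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked]


# Toy candidate definitions written for this example.
candidates = [
    "Adam is an optimizer used to train neural networks.",
    "Adam is a variant of stochastic gradient descent with per-parameter learning rates.",
    "Adam extends stochastic gradient descent with momentum and adaptive step sizes.",
    "Dropout randomly zeroes activations during training.",
]
known = ["stochastic gradient descent", "momentum"]
results = grounded_definitions(candidates, "Adam", known)
```

The list of known concepts plays the role of the user's inner knowledge; in the system described above it would be derived from papers the user has written or read.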

Opportunities Ahead

The challenges of task-guided retrieval in support of researchers frame a host of problems and opportunities. We focus on select challenges and directions (see also Table 2). We begin with an illustrative example, imagining a futuristic system to motivate the discussion.

Table 2. Directions with formulating and leveraging computational representations of scientific knowledge.

Aspirations. We envision tools that flow outer-world knowledge to researchers based on inferences about their inner world: users' knowledge, past and present goals and difficulties, and the tasks from Table 1 they are engaged in. The systems would use multiple signals for making inferences, including users' papers, data, experiments, and communication channels; they would also converse with the user to understand needs and suggest solutions, hypotheses, and experiments.

We foresee systems

Envisioned systems would be designed as human-centric, focusing on the individual researcher. The systems would enable users to convey preferences, goals, and interests, and mediate the presentation of suggested directions and problem solutions based on personal prior knowledge—proposing concrete new directions grounded in representations that researchers can follow and assisting users in reading complex retrieved texts by editing their language to conform with concepts that users are familiar with.

Research directions. While we have witnessed remarkable strides in AI, the journey toward actualizing our vision requires further advancement. Envisioning such capabilities, however, can serve as a compass for directing research endeavors. An encouraging development can be seen in the recent progress with LLMs, which have demonstrated surprising capabilities with interpreting and generating complex texts and tackling technical tasks. The demonstrated proficiency of these models instills confidence that many of the possibilities we have discussed are attainable. We now elaborate on challenges and directions ahead, including limitations in representing scientific knowledge and making inferences about the inner worlds of researchers (see Table 2).

Task-aligned representations and scientific NLP. Paul Thagard writes: "Thinking can best be understood in terms of representational structures in the mind and computational procedures that operate on those structures." We seek representations that can be aligned with human thinking—for insight-building, decision making, and communication. Can we go beyond textual representation toward representations that support such cognitive processes?

The quest for a universal schema representing scientific ideas goes back hundreds of years. Gottfried Leibniz and René Descartes were intrigued by the prospects of a universal codification of knowledge. Leibniz proposed the characteristica universalis, a hypothesized formal language of ideas enabling inferences with algebraic operators. While such a representation is not within reach, envisioning its existence—and how to roughly approximate it—points to important research directions. One exciting direction is obtaining representations that support a "computational algebra of ideas"—for example, modeling compositions of concepts and the affordances that would be formed as a result. Early work on learning vector representations of natural language concepts supported rudimentary forms of addition, subtraction, and analogy—for example, the Word2vec model.
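
The rudimentary vector arithmetic mentioned above can be shown with a toy example. The two-dimensional vectors below are hand-made for illustration (one axis loosely standing for royalty, the other for gender) and are not trained Word2vec embeddings; the function name is hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is customary for analogy queries.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: (royalty, gender). Not learned from data.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}
answer = analogy(toy_vectors, "king", "man", "woman")
```

With learned embeddings the same arithmetic works only approximately, which is why the nearest neighbor by cosine similarity is returned rather than an exact match.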

Systems should enable users to find knowledge that directly assists them in core research tasks by augmenting their cognitive capacity and mitigating their biases.

Recently, LLMs28 have made striking progress in generating new content and coherently combining concepts. Emerging evidence on GPT-4's ability to reason not only in unstructured language but also with logical structures grounded in code suggests strong potential for generating novel ideas via compositionality and relational reasoning.1 Our early experiments with GPT-4 have revealed a constellation of promising abilities to assist with the scientific process, such as formulating hypotheses, recommending future research directions, and critiquing studies. Equipped with training and retrieval with access to millions of scientific papers, descendants of today's models may be able to synthesize original scientific concepts with in-depth technical detail at the level reported in high-quality scientific papers. We see great opportunity ahead to leverage LLMs to augment human scientific reasoning along the lines described in this paper.

One limitation with LLMs is that representations learned by these models are currently far from understood and lack "hooks" for control and interpretability, which are important in human-AI collaboration. In line with our focus on grounding representations of outer-world knowledge with inner-world cognitive aspects, we have pursued methods that "reverse-engineer" scientific papers to automatically extract, using NLP, structured representations that balance three desiderata:

Semantically meaningful representations, aligned with a salient task from the tasks in Table 1, grounded in cognitive research to guide us toward useful structures.

Representations with sufficient level of abstraction to generalize across areas and topics.

Representations expressive enough for direct utility in helping researchers as measured in human studies.

For example, we have extracted representations of causal mechanisms and hierarchical graphs of functional relationships. This kind of decomposition of ideas has enabled us to perform basic analogical inference in the space of technological and scientific ideas, helping researchers discover inspirations. However, many richer structures should be explored—for example, of experimentation processes and methodologies to enable tasks in Table 1.

A central challenge is that current models' extraction accuracy is limited, and the diversity of scientific language leads to problems in generalization and normalization of terms and concepts. We have pursued construction of new datasets, models, and evaluations for identifying similarity between concepts and aspects across papers,2,23 with fundamental problems in resolving diversity, ambiguity, and hierarchy of language. As our results have highlighted, models tend to focus on surface-level lexical patterns rather than deeper semantic relationships. Generally, substantial advances are needed to handle challenges posed by scientific documents. We require NLP models with full document understanding, not only of text but of tables, equations, figures, and reference links. Open access corpora, such as S2ORC,20 provide a foundation to address this challenge.

New modes of writing and reading. Perhaps the way we write can be dramatically different, using machine-actionable representations? Beyond reporting and documentation, writing represents a channel between the inner and outer worlds, forcing us to communicate ideas in concrete language. This process often begets new questions and perspectives. Can systems accompany di


This post first appeared on Autonomous AI; please read the original post: here