
Meet ChatGLM: An Open-Source NLP Model Trained on 1T Tokens and Capable of Understanding English/Chinese




5 Natural Language Processing Libraries To Use

Natural language processing (NLP) is important because it enables machines to understand, interpret and generate human language, which is the primary means of communication between people. By using NLP, machines can analyze and make sense of large amounts of unstructured textual data, improving their ability to assist humans in various tasks, such as customer service, content creation and decision-making.

Additionally, NLP can help bridge language barriers, improve accessibility for individuals with disabilities, and support research in various fields, such as linguistics, psychology and social sciences.

Here are five NLP libraries that can be used for various purposes, as discussed below.

NLTK (Natural Language Toolkit)

Python is one of the most widely used programming languages for NLP, thanks to its rich ecosystem of libraries and tools, including NLTK. Python's popularity in the data science and machine learning communities, combined with NLTK's ease of use and extensive documentation, has made it a go-to choice for many NLP projects.

NLTK is a widely used NLP library in Python. It provides tools for tokenization, stemming, part-of-speech tagging and parsing. NLTK is great for beginners and is used in many academic courses on NLP.

Tokenization is the process of dividing a text into more manageable pieces, such as individual words, phrases or sentences. It gives the text a structure that makes programmatic analysis and manipulation easier, and it is a frequent pre-processing step in NLP applications such as text categorization and sentiment analysis.
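For instance, a minimal tokenization sketch with NLTK might look like the following (it assumes the toolkit's tokenizer models have been downloaded; data package names can vary slightly between NLTK releases):

```python
import nltk

nltk.download("punkt")  # one-time download of the tokenizer models
                        # (newer NLTK releases may also require "punkt_tab")

text = "NLP enables machines to understand language. It powers chatbots and search."
print(nltk.sent_tokenize(text))  # ['NLP enables machines to understand language.', 'It powers chatbots and search.']
print(nltk.word_tokenize(text))  # ['NLP', 'enables', 'machines', 'to', 'understand', ...]
```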

Stemming reduces words to their base or root form. For instance, "run" is the root of the terms "running," "runner" and "runs." Tagging involves identifying each word's part of speech (POS) within a document, such as a noun, verb or adjective. In many NLP applications, such as text analysis or machine translation, where knowing the grammatical structure of a phrase is critical, POS tagging is a crucial step.
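A short sketch of stemming and POS tagging with NLTK, again assuming the relevant NLTK data packages are available:

```python
import nltk
from nltk.stem import PorterStemmer

nltk.download("averaged_perceptron_tagger")  # models for the default POS tagger
                                             # (named "averaged_perceptron_tagger_eng" in newer releases)

# Stemming: a heuristic process, so not every word reduces as you might expect
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "runner", "runs"]])  # ['run', 'runner', 'run']

# Part-of-speech tagging on a pre-tokenized sentence
tokens = ["The", "quick", "dog", "runs", "fast"]
print(nltk.pos_tag(tokens))  # e.g. [('The', 'DT'), ('quick', 'JJ'), ('dog', 'NN'), ('runs', 'VBZ'), ...]
```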

Parsing is the process of analyzing the grammatical structure of a sentence to identify the relationships between the words. Parsing involves breaking down a sentence into constituent parts, such as subject, object, verb, etc. Parsing is a crucial step in many NLP tasks, such as machine translation or text-to-speech conversion, where understanding the syntax of a sentence is important.
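As a minimal illustration of parsing with NLTK, the sketch below uses a toy grammar written only for this one sentence; real applications rely on much richer grammars or statistical parsers:

```python
import nltk

# A toy context-free grammar covering a single example sentence
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'ball'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N ball))))
```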

Related: How to improve your coding skills using ChatGPT?

SpaCy

SpaCy is a fast and efficient NLP library for Python. It is designed to be easy to use and provides tools for entity recognition, part-of-speech tagging, dependency parsing and more. SpaCy is widely used in the industry for its speed and accuracy.

Dependency parsing is a natural language processing technique that examines the grammatical structure of a sentence by determining the syntactic relationships between its words and building a parse tree that captures those relationships.
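A brief sketch of these spaCy features, assuming spaCy 3.x and that the small English model has been installed with python -m spacy download en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY

# Part-of-speech tags and dependency relations for each token
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```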

Stanford CoreNLP

Stanford CoreNLP is a Java-based NLP library that provides tools for a variety of NLP tasks, such as sentiment analysis, named entity recognition, dependency parsing and more. It is known for its accuracy and is used by many organizations.

Sentiment analysis is the process of analyzing and determining the subjective tone or attitude of a text, while named entity recognition is the process of identifying and extracting named entities, such as names, locations and organizations, from a text.
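Because CoreNLP is a Java library, a common way to use it from Python is through its built-in HTTP server. The rough sketch below assumes a CoreNLP server is already running locally on port 9000; the exact JSON keys can vary between CoreNLP versions:

```python
import json
import requests

# Assumes a CoreNLP server has been started locally, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,pos,ner,sentiment", "outputFormat": "json"}
text = "Stanford University is located in California. I really enjoy this library."

resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data=text.encode("utf-8"),
)
annotation = resp.json()

for sentence in annotation["sentences"]:
    print(sentence.get("sentiment"))                    # sentence-level sentiment label
    for mention in sentence.get("entitymentions", []):  # named entities found in the sentence
        print(mention["text"], mention["ner"])
```

The Stanford NLP group also maintains Stanza, a Python package that offers its own neural pipeline as well as a client interface to CoreNLP.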

Gensim

Gensim is an open-source library for topic modeling, document similarity analysis and other NLP tasks. It provides implementations of algorithms such as latent Dirichlet allocation (LDA) and word2vec for generating word embeddings.

LDA is a probabilistic model used for topic modeling, where it identifies the underlying topics in a set of documents. Word2vec is a neural network-based model that learns to map words to vectors, enabling semantic analysis and similarity comparisons between words.
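A minimal sketch of both techniques with Gensim (assuming Gensim 4.x); the toy corpus is far too small to produce meaningful topics or embeddings and is only meant to show the API shape:

```python
from gensim import corpora
from gensim.models import LdaModel, Word2Vec

# Toy corpus: each document is a list of pre-tokenized words
docs = [
    ["machine", "learning", "models", "learn", "from", "data"],
    ["natural", "language", "processing", "analyzes", "text", "data"],
    ["topic", "models", "discover", "themes", "in", "text"],
]

# Topic modeling with LDA
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())

# Word embeddings with word2vec
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, epochs=50, seed=0)
print(w2v.wv.most_similar("text", topn=3))
```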

TensorFlow

TensorFlow is a popular machine-learning library that can also be used for NLP tasks. It provides tools for building neural networks for tasks such as text classification, sentiment analysis and machine translation. TensorFlow is widely used in industry and has a large support community.

Classifying text into predetermined groups or classes is known as text classification. Sentiment analysis examines a text's subjective tone to ascertain the author's attitude or feelings. Machine translation converts text from one language into another. While all three use natural language processing techniques, their objectives are distinct.
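A compact text-classification sketch using TensorFlow's Keras API (assuming TensorFlow 2.x); the four-example dataset exists only to make the code runnable:

```python
import tensorflow as tf

# Toy labeled data: 1 = positive sentiment, 0 = negative sentiment
texts = ["great movie", "loved every minute", "terrible film", "a waste of time"]
labels = [1, 1, 0, 0]

# Turn raw strings into integer token sequences
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=8)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=20, verbose=0)

print(model.predict(tf.constant(["what a great film"])))  # probability of positive sentiment
```

In practice, larger corpora and pretrained or transformer-based models would replace this tiny embedding network, but the overall pipeline (vectorize, embed, pool, classify) stays the same.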

Can NLP libraries and blockchain be used together?

NLP libraries and blockchain are two distinct technologies, but they can be used together in various ways. For instance, text-based content on blockchain platforms, such as smart contracts and transaction records, can be analyzed and understood using NLP approaches.

NLP can also be applied to creating natural language interfaces for blockchain applications, allowing users to communicate with the system using everyday language. Blockchain, in turn, can be used to help secure and validate NLP-based applications, such as chatbots or sentiment analysis tools, supporting the integrity and privacy of user data.

Related: Data protection in AI chatting: Does ChatGPT comply with GDPR standards?


A New Era Of Natural Language Search Emerges For The Enterprise


Large language models and their applications are suddenly a hot topic due to the popular release of OpenAI's ChatGPT and the subsequent search engine wars between Google and Microsoft. ChatGPT and similar systems are revitalizing our idea of what the search experience can be. Instead of relying on specific keywords or complex search query syntax, users can now interact with search engines using human-like language.

Question answering (QA) systems are one capability that LLMs enable within natural language processing, but QA has not always been a popular use case. Ryan Welsh, CEO of an NLP search company called Kyndi, recalls facing difficulties when explaining his company's focus on NLP for search: "I remember raising money three years ago and everyone was like, 'Hey, that's cool, you're an NLP, but this search thing is not a use case.'"

Welsh says this reaction has completely changed thanks to how many people are aware of natural language capabilities because of ChatGPT: "I feel like ChatGPT has done a decade's worth of category evangelism in 90-120 days," he said in an interview with Datanami.

Now, billions are being invested in next-generation search tech. There is suddenly a real market demand for QA systems that can give fast and accurate answers to questions asked by stakeholders or external customers visiting a company's website or knowledge portal, as well as for internal employees searching company documents.


However, these current chatbot technologies are not meeting enterprise demands, Welsh says, noting that explainability, which is key for end-user trust, is often lacking. Enterprises require that a large language model system generate answers that are accurate and trustworthy rather than filled with hallucinations, a problem faced by large, mainstream models like ChatGPT that are trained on broad web content. Due to the statistical nature of their underlying technology, chatbots can hallucinate incorrect information: they do not actually understand the language but are simply predicting the next best word. Often, the training data is so broad that explaining how a chatbot arrived at a given answer is nearly impossible.

This "black box" approach to AI with its lack of explainability simply will not fly for many enterprise use cases. Welsh gives the example of a pharmaceutical company that is delivering answers to a healthcare provider or a patient who visits its drug website. The company is required to know and explain each search result that could be given to those asking questions. So, despite the recent spike in demand for systems like ChatGPT, adapting them for these stringent enterprise requirements is not an easy task, and this demand is often unmet, according to Welsh.

Welsh says his company has stayed focused on these enterprise requirements over the years, learning from experience and direct interaction with customers. Kyndi was founded in 2014 by Welsh, AI expert Arun Majumdar, and computer scientist John Sowa, an expert in knowledge graphs who introduced a specific type called conceptual graphs at IBM in 1976.

Kyndi's natural language search application builds on both knowledge graph and LLM breakthroughs by employing neuro-symbolic AI, a semantic approach that complements statistical machine learning techniques. Instead of just predicting the next most-likely word in text, the system creates symbolic representations of language, leveraging vector and knowledge graph technologies to map the relationships between data. This allows the system to understand the true intent behind an end-user question, helping to find context-specific answers while differentiating between common synonyms, semantically equivalent words, acronyms, and misspellings.
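The general neuro-symbolic pattern can be sketched with a deliberately tiny example. To be clear, this is not Kyndi's implementation, and every name in it is hypothetical; it only illustrates the idea that answers come from explicit, traceable facts, with a normalization step (which a real system would back with learned embeddings) handling synonyms and acronyms:

```python
# Purely illustrative: a symbolic fact store plus a stand-in for the "neuro" matching step.
# Facts are keyed by (subject, relation), so every answer is traceable to a stored fact.
knowledge_graph = {
    ("aspirin", "treats"): "headache",
    ("aspirin", "max_daily_dose"): "4 g",
}

# Hand-written synonym/acronym map; a real system would use embedding similarity here
normalize = {
    "asa": "aspirin",
    "acetylsalicylic acid": "aspirin",
    "maximum daily dose": "max_daily_dose",
}

def answer(entity: str, relation: str) -> str:
    """Answer by graph lookup: return a traceable fact or admit there is none."""
    entity = normalize.get(entity.lower(), entity.lower())
    relation = normalize.get(relation.lower(), relation.lower())
    fact = knowledge_graph.get((entity, relation))
    return fact if fact is not None else "no supporting fact found"

print(answer("ASA", "maximum daily dose"))  # -> '4 g', traceable to a single graph entry
```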


This technology needs little training data to be effective, which can alleviate bottlenecks caused by a lack of labeled data and AI expertise. The high costs associated with data labeling make training and fine-tuning LLMs prohibitively expensive for many enterprises. This ease of tuning is another differentiating factor of Kyndi's neuro-symbolic approach. Welsh says many enterprise customers have been burned by slow AI deployments. He describes a large pharmaceutical company that, before working with Kyndi, had been tuning an LLM using six machine learning engineers and data scientists for over six months. Welsh says Kyndi was able to train and tune their model in one day with just the help of a business analyst. In several other cases, Kyndi was able to take AI projects through demo, sandbox validation, and deployment within two weeks.

One Kyndi customer is the U.S. Air Force, which uses the company's answer engine to search flight mission reports and human intelligence data spanning decades. Welsh says the system has been transformational for the military branch, as the self-service Kyndi platform allows intelligence analysts and leaders to ask questions like "Was there an intervention anywhere in the world today?" (meaning a non-conflict interaction with another country's airspace or the like) and get instant answers. Welsh says successful, large-scale use cases like this point to the future of search.

Ryan Welsh, CEO of Kyndi. (Source: Kyndi)

"I think every search bar, and every chat interface, in every enterprise in the world, is going to have an answer engine behind it at some point in the next 10 years. And it's going to be the biggest kind of transition that we've seen in enterprise software," Welsh says, comparing this moment to the transition from on-prem to cloud. "I don't think there's any vendor that is positioned presently to dominate that market."

Welsh predicts the companies that will win during this new era of the enterprise search space are those that had the foresight to have a product on the market now, and though the competition is currently heating up, some of these newer companies are already behind the curve. He estimates they have about 2-3 years and $30 million worth of building to do before they have a product that can be deployed by a military agency or pharma company.

Welsh says Kyndi has found success through its differentiating factors and is checking the boxes for many enterprise customers: "When we come through and we talk to customers, and they say, 'Hey, we need accurate answers from trusted enterprise content, it has to be explainable, and it has to be easy to run in an enterprise environment,' we're going: check . . . check . . . check."

Related Items:

Companions, Not Replacements: Chatbots Will Not Succeed Search Engines Just Yet

Kyndi Bolsters NLP with Search-as-a-Service

Are Neural Nets the Next Big Thing in Search?


How Modern Natural Language Processing Is Improving Healthcare

Every time a physician or a nurse practitioner sees a patient, they create a document. It may be a clinic note or an encounter note. Similarly, every time a diagnostic physician such as a radiologist or a pathologist evaluates a case, they produce a document.

All of these documents are in the form of unstructured text, written out in full sentences or phrases. For example, a primary care provider might write, "Tim came to see me last week with a three-week history of right knee pain."

As caregivers, we need this ability to create prose, to describe symptoms and explain treatment plans in ways that typically require addressing complexities and subtleties. This is medicine, after all, and what we document goes into a patient's record for future reference, so unstructured text is hugely important.

The problem is that much of this unstructured data can only be used if read by another human. A clinician can't say to a smart device, "Tell me about Mrs. Smith." And if you called up a hospital and asked how many of their patients had diffuse large B-cell lymphoma last month, they likely would reply that they don't know because such information "is trapped in our notes."

Fueling burnout

One unfortunate byproduct of the need for medical documentation is clinician burnout. Many physicians today are exiting full-time practice because so much of their time is spent on documentation. We not only have a fiduciary responsibility to our patients to produce documentation; healthcare organizations also have legal requirements to produce significant amounts of documentation with every patient encounter.

While it's clearly essential that clinicians document patient encounters, it is a very time-consuming task. And as the healthcare system gets busier and as more patients with chronic conditions seek treatment, many clinicians feel chained to their electronic health record (EHR) systems and compelled to produce volumes of documentation each day. They're coming in early, working through their lunch hours, and staying late – in large part to create documents.

Technology is needed to help clinicians with documentation in a way that's easiest for them and then to structure the data so it can be used more easily by others. This is where natural language processing (NLP) can be of immense value. NLP software can read and understand unstructured clinical notes, extract data from them in a structured format, store the data, and make it available to use in multiple ways.
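As a toy illustration of what structuring a note can mean, the sketch below pulls the earlier knee-pain complaint into a structured field using a single hand-written rule; it is nowhere near clinical-grade NLP and assumes spaCy 3.x with the en_core_web_sm model installed:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Rule-based pattern: "history of ... pain"; real clinical NLP relies on trained models
pattern = [
    {"LOWER": "history"},
    {"LOWER": "of"},
    {"OP": "+"},          # one or more tokens describing the complaint
    {"LOWER": "pain"},
]
matcher.add("COMPLAINT", [pattern], greedy="LONGEST")

note = "Tim came to see me last week with a three-week history of right knee pain."
doc = nlp(note)
for _, start, end in matcher(doc):
    print({"complaint": doc[start:end].text})
    # {'complaint': 'history of right knee pain'}
```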

Researchers can use NLP to find exactly the cases they're looking for to build a cohort. Hospitals get access to better data for analytics and insights. Physicians can ask questions of smart devices because the contents of a chart are now known and understood. NLP is a foundational technology that, through its ability to structure unstructured text data, can transform how healthcare is practiced and delivered.

Theoretically, at least. In practice, NLP, which has been deployed in healthcare as far back as the 1980s, has generally been disappointing. It has frequently been inaccurate and has dealt poorly with ambiguities in language. Traditional NLP software might even tell clinicians that patients had diseases they don't actually have.

Fortunately, the market is becoming more and more aware of the benefits and limitations of NLP just as a new generation of medical NLP technologies is becoming available.

Clinical-grade NLP leverages AI and deep learning to contextualize language in medical notes and accurately identify common medical terms. Not only does this result in high levels of accuracy, but AI-powered NLP can also process and understand information from unstructured medical data far faster than humans – and can do so at scale.

However, even modern NLP platforms must rely on medical expertise to guide the deep learning models. Infusing these deep learning models with specialized medical knowledge enables modern medical NLP to meet the data optimization needs of providers, payers, pharmaceutical companies, and clinical researchers.

NLP use cases

One thing most physicians wish they could do today is effectively and efficiently search a patient's chart when they're not confident about the information they're getting from that patient. Deep learning NLP models can go beyond a simple keyword search to a smart search that knows whether a disease is currently present or absent.

Automated patient summaries are another excellent use case for medical NLP. Most summaries today are hand-curated, a painstaking process in which information from multiple sources (such as discharge summaries) is distilled by or for the clinician at the point of care. Automated summaries listing patients' diseases, family health history, and other relevant clinical information would allow clinicians to waste less time chasing down data and focus more on their patients.

Medical NLP can also improve the quality, safety, and efficiency of healthcare by creating rich analytics that provide insight into how healthcare is being delivered inside any clinic, hospital, or other healthcare organization. Identifying adverse events that fall below reporting thresholds or otherwise go unreported is very important in uncovering risk factors and inefficiencies, so setting up automated systems to monitor events in a healthcare setting is vital.

Another way medical NLP helps is by identifying cohorts of data for building AI models. Whether the goal is a model that detects patients at risk in a healthcare setting or image analysis for specialties such as pathology or radiology, these models provide clinical and research benefits.

Conclusion

Clinical-grade NLP is in the earlier stages of adoption, but as healthcare organizations continue to struggle with unstructured data and clinician burnout, we will see a gradual increase in its use. Once these tools are in the hands of most caregivers, we will improve patient and population outcomes while removing a major cause of clinician burnout.

The AI and deep learning models being developed today in conjunction with medical NLP will be used in healthcare five, 10 and 20 years from now. These technologies will be so tightly integrated into EHRs that they'll be almost invisible to end users. The ability of these models to predict adverse events will help us prevent those events, which will save lives and reduce healthcare costs.
