
10 Types Of Data That Should Be On Your Keyword Clustering Wish List

Everybody is talking about keyword clusters. At the core, it's pretty simple – group related keywords together. Sounds easy, right?

Some free tools walk you through basic Natural Language Processing (NLP) that helps deduplicate keywords and find semantic similarities between them. There's nothing wrong with starting there, but those tools are inevitably limited. Google, on the other hand, has infinitely more data to feed into its algorithms, including on-page data and links, which provide far more context than basic keyword manipulation.

To truly understand how Google sees the world, you must collect SERP data to see which pages rank for which terms. At scale, by comparing how many URLs overlap in the top 10 results, you get a very clear picture of which SERPs are related. This method has recently been popularized by Keyword Insights and is also available from Nozzle, Cluster AI, and others.
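
To make the idea concrete, here is a minimal sketch of SERP-overlap clustering in plain Python. It assumes you have already scraped the top 10 ranking URLs for each keyword (the serps dictionary below is made-up sample data) and does a simple greedy, single-link grouping; commercial tools are considerably more sophisticated.

# Minimal sketch: cluster keywords by overlapping top-10 ranking URLs.
# `serps` maps each keyword to the set of URLs in its top 10 (sample data is invented).

def cluster_keywords(serps, min_shared=3):
    """Greedy single-link clustering: a keyword joins a cluster when its
    top-10 results share at least `min_shared` URLs with that cluster."""
    clusters = []  # list of (set of keywords, union of their ranking URLs)
    for keyword, urls in serps.items():
        for keywords, cluster_urls in clusters:
            if len(urls & cluster_urls) >= min_shared:
                keywords.add(keyword)
                cluster_urls |= urls
                break
        else:
            clusters.append(({keyword}, set(urls)))
    return [keywords for keywords, _ in clusters]

serps = {
    "seo agencies":  {"a.com", "b.com", "c.com", "d.com", "e.com"},
    "seo companies": {"a.com", "b.com", "c.com", "d.com", "f.com"},
    "seo tips":      {"x.com", "y.com", "z.com", "a.com", "q.com"},
}

print(cluster_keywords(serps, min_shared=3))
# e.g., [{'seo agencies', 'seo companies'}, {'seo tips'}]

Because the overlap threshold is just a parameter, the same cached SERP data can be regrouped at different tightness levels without re-scraping anything.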

I am continually surprised when I find keywords that I would have manually grouped together, but Google shows zero overlapping URLs, and vice versa. Whether or not Google is "right" in these cases is irrelevant – it's Google's world, and we just live in it.

Here are the results when you search for "SEO agencies" and "SEO companies" side by side with ads removed, and you can see that eight of the top 10 are the same!

Manually finding these overlapping pages is nearly impossible to do at scale but trivial for good tooling. For years now, there have been various tools that help curate keyword lists but fail to dig deeper. There's even a big new kid on the block offering basic clustering, but their 2,000 keyword limit is disappointing.

Automatically clustering your keywords is great, but this is where most tools end – a list of keywords, maybe search volume and/or rank. Here is a wishlist of 10 types of data that would be invaluable in the context of keyword clusters, most of which have been unavailable to date.

  • Ranking URLs
  • Refine by
  • PAA
  • FAQ
  • SERP features
  • Search intent
  • Ranking position
  • Share of voice
  • Entities
  • Categories
1. Ranking URLs/pages

Existing tools don't show you exactly which pages are shared between all the keywords in the cluster, which makes it very difficult to learn what Google is rewarding. Additionally, knowing the number of overlapping URLs gives significant insight into the strength/tightness of the cluster. As in the example above, sharing eight out of 10 URLs makes a very tight cluster, whereas only 3-4 overlapping pages is moderately tight.

Most tools also force you to decide, before you get started, how many overlapping URLs to require, which is hard to know before you see the data. You should be able to change this value dynamically as you explore, without having to pay to run the clustering process again.

    If you have any experience with content writing tools, they will often scrape the top results for a single keyword for you to see. It is much more effective to scrape the URLs ranking for ALL THE KEYWORDS in the cluster!

    Seeing detailed information about their headings, schema, and text stats like word count/grade level can be a helpful guide.
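
As a rough illustration of that kind of on-page audit, the sketch below pulls the headings and a word count for each ranking URL using requests and BeautifulSoup. The URLs are placeholders, and the grade-level and schema extraction that dedicated tools offer are left out for brevity.

# Basic on-page stats (headings + word count) for a cluster's shared ranking URLs.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def page_stats(url):
    html = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    return {
        "url": url,
        "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
        "h2": [h.get_text(strip=True) for h in soup.find_all("h2")],
        "word_count": len(text.split()),
    }

# Hypothetical URLs shared across the cluster's top 10s.
for url in ["https://example.com/seo-companies", "https://example.com/best-seo-agencies"]:
    print(page_stats(url))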

    2. Refine by

    One criminally overlooked source for relevant topic information is hidden right at the top of every SERP, helpfully labeled with the hidden H1 tag, "Filters and Topics".

After a few traditional tabs like Images, News, and Maps (which change from search to search), Google links to related topics, typically prepending or appending a topic to the current keyword phrase. These are generally easy to identify manually and can also be picked out by their HTML markup/CSS classes.
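
Because the exact CSS classes change frequently, the sketch below deliberately avoids them and leans on the pattern just described: the chips are links back into Google search whose anchor text is the seed phrase with something prepended or appended. Treat this heuristic, and the saved-HTML input, as assumptions rather than Google's documented markup.

# Heuristic extraction of "Refine by" topic chips from a saved SERP HTML file.
from bs4 import BeautifulSoup

def refine_by_topics(serp_html, seed):
    soup = BeautifulSoup(serp_html, "html.parser")
    topics = set()
    for a in soup.find_all("a", href=True):
        text = a.get_text(" ", strip=True).lower()
        if "/search?" in a["href"] and seed in text and text != seed:
            topics.add(text.replace(seed, "").strip())  # keep only the added word(s)
    return topics

# Usage: refine_by_topics(open("serp.html").read(), "seo companies")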

    3 and 4. People Also Ask (PAA) and frequently asked questions (FAQ)

    People Also Ask questions are a goldmine for content creators, as Google gives you the blueprint for what to answer in your content. PAAs are also much more volatile than traditional search results, so by aggregating them over time rather than a one-time scrape that most tools use, you can identify which questions appear the most often. In our example above, even though the SERPs were nearly identical, there were zero overlapping questions.

    First, we have the top 10 questions for this specific cluster over the last 30 days. SERP count is the total number of SERPs they appeared on, and keyword count is the number of unique keywords that showed the question.
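
A minimal way to build that table yourself, assuming you have been logging PAA questions per keyword on every scrape (the log records below are hypothetical), is to count distinct SERP appearances and distinct keywords per question:

# Aggregate People Also Ask questions across repeated scrapes of a cluster.
# Each record is (scrape_date, keyword, question); the data is invented.
from collections import defaultdict

paa_log = [
    ("2023-06-01", "seo companies", "What does an SEO company do?"),
    ("2023-06-01", "seo agencies",  "What does an SEO company do?"),
    ("2023-06-02", "seo companies", "How much does SEO cost?"),
    ("2023-06-02", "seo companies", "What does an SEO company do?"),
]

serp_count = defaultdict(int)     # SERPs (keyword + date) the question appeared on
keyword_seen = defaultdict(set)   # unique keywords that showed the question

for date, keyword, question in paa_log:
    serp_count[question] += 1
    keyword_seen[question].add(keyword)

for question in sorted(serp_count, key=serp_count.get, reverse=True)[:10]:
    print(question, serp_count[question], len(keyword_seen[question]))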

Very similar to PAA, below are the questions that Google deemed relevant enough to the topic to reward the sites implementing correct schema.org FAQ markup with significantly more visual real estate.

    5. SERP features

The presence or absence of specific SERP features for a cluster will impact your content strategy. PAA and FAQ typically get very high visibility on the SERP – for this cluster, average ranking positions of 2.5 and 4.2, respectively – so adding the correct markup and answering the right questions can drive significant traffic if you can capture it. Maps appears 65% of the time, which signals some fractured intent. Things_to_know is only visible on a single SERP but might represent a growth opportunity if you optimize.
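
Those visibility percentages are a simple tally over the cluster's SERPs, assuming each scrape records which features were present (the sample data is invented):

# Share of a cluster's SERPs on which each SERP feature appears.
from collections import Counter

serp_features = [            # one set of features per keyword's SERP (hypothetical)
    {"paa", "faq", "maps"},
    {"paa", "maps"},
    {"paa", "faq", "things_to_know"},
    {"paa"},
]

counts = Counter(f for features in serp_features for f in features)
for feature, n in counts.most_common():
    print(f"{feature}: {n / len(serp_features):.0%} of SERPs")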

    6. Search intent

    Search intent influences your entire strategy, so knowing the overall cluster intent, including mixed intent, is crucial to a great strategy. Search intent should also be available per result, in addition to an overall aggregate score, to help identify opportunities to rank multiple pages on a single SERP. Having data that informs that intent, like Google Ads metrics, is also useful.
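
Short of a trained model, a very rough modifier-based heuristic looks like the sketch below; the word lists are assumptions, and real tools also weigh SERP features and Google Ads data.

# Naive rule-based intent guess per keyword; modifier lists are illustrative only.
INTENT_MODIFIERS = {
    "transactional": {"buy", "price", "pricing", "cheap", "hire"},
    "commercial":    {"best", "top", "review", "vs", "companies", "agencies"},
    "informational": {"what", "how", "why", "guide", "examples"},
}

def guess_intent(keyword):
    words = set(keyword.lower().split())
    for intent, modifiers in INTENT_MODIFIERS.items():
        if words & modifiers:
            return intent
    return "navigational/unknown"

cluster = ["seo companies", "best seo agencies", "how to hire an seo company"]
print({kw: guess_intent(kw) for kw in cluster})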

    7. Ranking positions

    Reporting your current ranking position is vital. If you don't currently rank at all, there might be some low-hanging fruit where you simply have a content gap, and with sufficient topical authority, you could rank just by publishing. Similarly, if you rank 8-15, you might be able to 10x your traffic with just some extra optimization.
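
Flagging that striking-distance segment takes one pass over your rank data; the numbers below are made up.

# Separate "striking distance" keywords (rank 8-15) from outright content gaps.
ranks = {"seo companies": 12, "seo agencies": 9, "seo consultants": 31, "seo firms": None}

striking_distance = [kw for kw, rank in ranks.items() if rank and 8 <= rank <= 15]
content_gaps = [kw for kw, rank in ranks.items() if rank is None]

print("Optimize existing pages:", striking_distance)  # ['seo companies', 'seo agencies']
print("Create new content:", content_gaps)            # ['seo firms']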

    Bonus points if you can see more than just rank, including newer metrics like pixel depth and above-the-fold %.

    Having the rank available doesn't mean much if you can't meaningfully visualize it to identify opportunities.

    This shows the number of keywords in a cluster against search volume, with CPC as the bubble radius and rank as the bubble color. It's easy to quickly identify clusters matching your criteria to drill into for more detail.
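
A chart like that takes only a few lines of matplotlib; the cluster metrics below are invented for illustration.

# Bubble chart: keyword count vs. search volume, CPC as bubble size, rank as color.
import matplotlib.pyplot as plt

clusters = [  # (name, keyword_count, total_search_volume, avg_cpc, avg_rank) - hypothetical
    ("seo companies", 42, 18000, 14.0, 12),
    ("seo pricing",   17,  6500,  9.5, 28),
    ("what is seo",   63, 40000,  2.1,  4),
]

names, kw_counts, volumes, cpcs, ranks = zip(*clusters)
bubbles = plt.scatter(kw_counts, volumes, s=[c * 40 for c in cpcs], c=ranks, cmap="RdYlGn_r")
plt.xlabel("Keywords in cluster")
plt.ylabel("Total search volume")
plt.colorbar(bubbles, label="Average rank (lower is better)")
for name, x, y in zip(names, kw_counts, volumes):
    plt.annotate(name, (x, y))
plt.show()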

    8. Competitive overview/Share of voice

    Seeing your own rank is great, but it's even better if you can spy on your competitors with the same data. Switching between domains gives you god-like powers to dominate your competition.

    Since every cluster is different, there might be a different set of competitors in each one, so make sure you can report on the share of voice per cluster.
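
One simple way to compute it is to weight every ranking a domain holds in the cluster by its position; the decay function and the result rows below are assumptions, not a standard formula.

# Share of voice per domain within one cluster, weighted by ranking position.
from collections import defaultdict
from urllib.parse import urlparse

results = [  # (keyword, rank, url) rows for the cluster's SERPs - hypothetical data
    ("seo companies", 1, "https://clutch.co/seo-firms"),
    ("seo companies", 2, "https://www.bruceclay.com/"),
    ("seo agencies",  1, "https://clutch.co/agencies/seo"),
    ("seo agencies",  5, "https://www.bruceclay.com/"),
]

def weight(rank):
    return 1 / rank  # crude stand-in for a click-through-rate curve

voice = defaultdict(float)
for _, rank, url in results:
    voice[urlparse(url).netloc] += weight(rank)

total = sum(voice.values())
for domain, score in sorted(voice.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {score / total:.0%} share of voice")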

    9. Entities

    Google has long since stopped seeing the web via exact-match keywords. It's more about semantic similarities, which can be represented by entities extracted from page content using natural language processing (NLP).

    To learn more, I highly recommend reading Timothy Warren's article on SEL, "Entity SEO: The definitive guide." While there are many APIs and open-source tools like spaCy to extract data from text, I prefer to use Google's API, and they have a demo, as shown below, identifying important parts of the text.

  • Salience: the importance of the entity to the text.
  • Sentiment Score: from -1 to 1, with -1 being the most negative, 0 neutral, and 1 being the most positive.
  • Sentiment Magnitude: indicates how much emotional content is present within the document.
As a content writer targeting a keyword cluster, hopefully, you've evolved past keyword density. However, it's still important to ensure you correctly target entities that Google knows and cares about.

    Now let's say that we're writing for Bruce Clay, and we have decided to target this "SEO companies" cluster example. Normally the workflow would involve the writer scanning a few important pages, then updating an existing page. With entities, we have a new way to approach content optimization. We can use the same NLP to extract entities from our mapped page, then compare them with the cluster entities.
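
Here is a minimal sketch of that comparison using the Google Cloud Natural Language API. It assumes the google-cloud-language package is installed and credentials are configured, and the two text variables are placeholders for your page copy and the cluster's combined ranking-page text.

# Compare entities on our mapped page with entities across the cluster's top-ranking pages.
# Requires: pip install google-cloud-language (plus GCP credentials).
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def top_entities(text, min_salience=0.01):
    document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.analyze_entities(request={"document": document})
    return {e.name.lower() for e in response.entities if e.salience >= min_salience}

our_page_text = "..."   # text of the page currently mapped to this cluster (placeholder)
cluster_text = "..."    # combined text of the cluster's ranking pages (placeholder)

missing = top_entities(cluster_text) - top_entities(our_page_text)
print("Entities the ranking pages cover that our page does not:", missing)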

    In this case, it is very clear that there is a total mismatch. Assuming the page you are comparing already ranks for other clusters, this is a strong signal that you need to create new content to target this cluster.

    10. Content categorization

    Similar to our approach to entities, we can also categorize our content, and by comparing that to the cluster, we can identify mismatches.

    Google has a classification API with nearly 1,100 categories that work across multiple languages. Unsurprisingly, SEO is the leading category for this cluster's matching URLs, but category #4 is "Product Reviews & Price Comparisons."

At that confidence level, that doesn't mean you need to run and add a comparison table to your page, but it's worth considering whether that would add value to your audience.
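
The same client exposes the classification endpoint, so a hedged sketch of the category comparison (same package and placeholder assumptions as the entity example above) looks like this:

# Categorize text with classify_text and flag cluster categories our page lacks.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def categories(text):
    document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.classify_text(request={"document": document})
    return {c.name: c.confidence for c in response.categories}

cluster_categories = categories("...")  # combined ranking-page text (placeholder)
page_categories = categories("...")     # our page's text (placeholder)

for name, confidence in cluster_categories.items():
    flag = "" if name in page_categories else "  <-- missing from our page"
    print(f"{name}: {confidence:.2f}{flag}")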

    Conclusion

    Until now, you've had to stitch together different tools to see even a fraction of this data. Today, Nozzle is launching our keyword clustering tool on Product Hunt, with all this data at your fingertips. Come and try it out on your own keywords for free!


    How To Use Entities In Schema To Improve Google's Understanding Of Your Content

    Adding schema markup on your website is a great way to help search engines like Google to understand your content more quickly and accurately.

    One of the lesser-known ways of utilizing schema markup is by including "entities" within it. Adding entities into schema can help Google better understand the key topics of your content.

    In this article, I'll walk you through a step-by-step process of using entities in schema markup.

    Why use entities in schema markup?

    So why go through the trouble of adding entities in your schema markup when Google's natural language processing capabilities (such as BERT and MUM) already help the search engine understand the content of your article? 

    The answer is that both writers and AI sometimes fail to accurately communicate and identify the meaning, context, and importance of the topics within an article.

    Imagine going to your favorite local restaurant and seeing a delicious-looking burrito on the menu, but it doesn't say what kind it is and what's in it. 

So you order it, and when it comes, you have to use your senses to pick up on all the contextual clues about what makes up the dish.

    You'll likely figure out most of the ingredients if you have enough culinary experience, but likely not all of them, especially if it has blended spices!

Using entity schema is like giving Google all the main ingredients of your article, making it inherently easier for the search engine to identify and understand your article's most important topics without any confusion.

Doing that takes the pressure off having to use every word in the article and its sentences perfectly to convey their meaning and importance.

    Adding entities to your article's schema

The following process gives me more control and less reliance on third-party plugins. However, if you want to go the plugin route, check out WordLift.

    Either way, reading this guide will help you better understand how Google and NLP tools see your most important topics.

    Let's say you have an article titled "The 10 Best Toys for Small Adult Dogs."

    Here are the steps for identifying the most relevant entities for this article and adding them to the schema markup. 

    Step 1: Analyze your article using TextRazor

    Start by copying and pasting your article's text into the TextRazor demo and clicking the "Analyze" button. 

    (For this guide, I'm using the article text from DogLab.)

    Step 2: Identify relevant entities

    On the results page, you'll see a list of top entities or topics ranked by relevance in the right sidebar.

    The higher the score for a topic, the more relevant it is to the article.

    The key here is to review this entire list and see how well it's scoring the relevance of the topics.

    If there's a core topic, such as "frisbee," and it doesn't have a high relevance score, then it's even more important to add this to your schema.

    Plus, you may want to consider rewriting sentences containing the word "frisbee" to get a higher salience or relevance score.

    For this example, we'll select the following topics or entities for which you'll then get their schema data.

    Primary entities:

    Secondary entities:

  • Chihuahua
  • Yorkshire Terrier
  • Pomeranian
  • Shih Tzus
  • Pugs
  • Frisbee
  • Chew toy
  • Squeaky toy
  • Tennis ball
Not every topic on the sidebar represents a known entity within Wikipedia, Wikidata, or Google.

    So it's important to review all the bolded and underlined words within each sentence that's broken down on the left side of the page. 

Step 3: Collect each entity's Wikipedia, Google, and Wikidata URLs

    Next, locate a sentence on the left side of the results page that contains your first entity.

    In this example, let's choose "dog" as the entity.

    Next, click on the Entities tab beneath the sentence that contains the word dog. This will display a list of all the entities within that particular sentence.

We'll want to copy all the entity URLs for this entity and temporarily store them in a document or spreadsheet.

Right-click on the first entity in the list and copy its Wikipedia link. In this case, it's http://en.wikipedia.org/wiki/Dog.

Then, locate the corresponding Google entity ID (which should start with "/m/") and copy it. In this case, it's /m/0bt9lr.

Add the Google entity ID to the end of this Google search URL: https://google.com/search?&kgmid=

So it looks like: https://google.com/search?&kgmid=/m/0bt9lr

    Go ahead and click on this to verify that the search result page shows results for the query "dog." Cool, right?

Lastly, find the Wikidata entity (usually starting with the letter Q) and copy its link (e.g., http://wikidata.org/wiki/Q144).

    You'll want to repeat this exact process for each entity on your list. If you find that this is something you want to automate more, TextRazor does have an API you can work with.
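
If you do go the API route, a minimal sketch with the official textrazor Python client might look like the following. The attribute names are from the client's documentation as I recall it, so verify them against the current SDK; the API key and file name are placeholders.

# Pull entity links (Wikipedia, Wikidata, relevance) via the TextRazor API.
# Requires: pip install textrazor
import textrazor

textrazor.api_key = "YOUR_API_KEY"  # placeholder
client = textrazor.TextRazor(extractors=["entities"])

response = client.analyze(open("article.txt").read())
seen = set()
for entity in response.entities():
    if entity.id in seen:
        continue  # the same entity is reported once per mention
    seen.add(entity.id)
    print(entity.id, entity.relevance_score, entity.wikipedia_link, entity.wikidata_id)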

    Step 4: Incorporate entity URLs into schema

    Now that you have collected the Wikipedia, Google, and Wikidata URLs for each entity, you can integrate them into a JSON schema called "about," which should be nested under the main schema, such as "Article."

    Follow this structure for each entity:

    "about": [ { "@type": "Thing", "name": "Dog", "sameAs": "https://google.Com/search?&kgmid=/m/0bt9lr" }, { "@type": "Thing", "name": "Dog", "sameAs": "http://en.Wikipedia.Org/wiki/Dog" }, { "@type": "Thing", "name": "Dog", "sameAs": "http://wikidata.Org/wiki/Q144" } ]

If you validate it with the Schema.org validator, it should look like this:

    Repeat this process for all your entities.
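
If you would rather generate that block than hand-write it, a small helper can emit the JSON from the links gathered in step 3. The entity data below just mirrors the dog example.

# Build the "about" array from collected entity links and print it as JSON.
import json

entities = {  # entity name -> list of sameAs URLs gathered in step 3
    "Dog": [
        "https://google.com/search?&kgmid=/m/0bt9lr",
        "http://en.wikipedia.org/wiki/Dog",
        "http://wikidata.org/wiki/Q144",
    ],
}

about = [
    {"@type": "Thing", "name": name, "sameAs": url}
    for name, urls in entities.items()
    for url in urls
]

print(json.dumps({"about": about}, indent=2))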

    Step 5: Add schema to your WordPress theme

    This is where things can get a little more technical and you may need the help of a programmer or try ChatGPT.

    Next, we'll need to add PHP code that will store all of these entities and their schema markup.

    The good news is that once you generate the schema for an entity, you won't have to do it again.

    The way I've coded it for my WordPress site is to associate a WordPress "tag" to each entity. 

    For example, I have a WordPress tag called "Dog" and any article about a dog gets this tag assigned to it. 

    When that happens, the WordPress code automatically shows the dog entity schema. 

    The cool part is that you can add as many tags as you want to a WordPress post or page, so you can load as many relevant entities as you want to a post with a click of a button. 

    Here's a good ChatGPT prompt to start with for generating this code:

    If you use a plugin like Yoast SEO, you'll want to adjust the prompt to incorporate it in their JSON format. 

    Step 6: Assign tags to your article

    Once you've got your PHP code in place, you can add tags to your articles. 

Head to your WordPress dashboard and ensure that your article (in this case, "Best Toys for Small Adult Dogs") has the appropriate tags (e.g., "dog") assigned to it.

    The cool part in this example is that once I tag any existing article with "dog," all those articles will instantly be updated.

    Step 7: Rinse and repeat

Repeat this process for any additional entity (e.g., "toy," "Chihuahua," "Yorkshire Terrier," etc.) that you'd like to include in your schema markup.

    Incorporating entities in schema markup

    Integrating entities into your schema markup isn't necessary to rank first in organic search. However, it can help you hedge your long-term SEO bets. 

Writers and AI aren't perfect, and the text on a page isn't always written or interpreted flawlessly. This means there's a chance that the relevance and importance of an article's primary topics could be lessened or missed.

    If you're on the fence about it, test it out to see how it works for your site. Find four articles on your site that are topically related and add at least 5 to 10 entities to each. 

    You can probably manually edit the schema just for the test articles. If it works well, you can integrate it more deeply into your site's code or try WordLift.



    How Databricks Is Adding Generative AI Capabilities To Its Delta Lake Lakehouse

    It's been a busy few weeks for Databricks. After releasing a new iteration of its data lakehouse with a universal table format and introducing Lakehouse Apps, the company on Wednesday announced new tools aimed at helping data professionals develop generative AI capabilities. 

The new capabilities — which include a proprietary enterprise knowledge engine dubbed LakehouseIQ, a new vector search capability, a low-code large language model (LLM) tuning tool named AutoML, and open source foundation models — are being added to the company's Delta Lake lakehouse.

The new capabilities draw on technology from the company's recent acquisitions — MosaicML this week, and Okera in May.

    LakehouseIQ to open up enterprise search via NLP

The new LakehouseIQ engine is meant to help enterprise users search for data and insights from its Delta Lake, without the need to seek technical help from data professionals. To simplify data search for nontechnical users, the LakehouseIQ engine uses natural language processing (NLP).

    In order to enable NLP-based enterprise searches, LakehouseIQ uses generative AI to understand jargon, data usage patterns, and concepts like organizational structure.  

    It's a different approach than the common practice of creating knowledge graphs, a method used by companies including Glean and Salesforce.  A knowledge graph is a representation of structured and unstructured data in the form of nodes and edges, where nodes represent entities (such as people, places, or concepts) and edges represent relationships between these entities.  

    In contrast, the LakehouseIQ engine, according to SanjMo principal analyst Sanjeev Mohan, consists of machine learning models that infer the context of the data sources and make them available for searching via natural language queries.

Enterprise users will be able to access the search capabilities of LakehouseIQ via Notebooks and the Assistant in its SQL editor, the company said. The Assistant will be able to carry out various tasks such as writing queries and answering data-related questions.

    Databricks said that it is adding LakehouseIQ to many management features inside its lakehouse, in order to deliver automated suggestions. These could include informing the user about an incomplete data set, or suggestions for debugging jobs and SQL queries.

    Additionally, the company is exposing LakehouseIQ's API, to help enterprises use its abilities in any custom applications they develop, said Joel Minnick, vice president of Marketing at Databricks.

    The LakehouseIQ-powered Assistant is currently in preview.

    Delta Lake gets AI toolbox for developing generative AI use cases

    The addition of the Lakehouse AI toolbox to its lakehouse is meant to support the development of enterprise generative AI applications such as the creation of  intelligent assistants, Databricks said. The toolbox consists of features including vector search, low-code AutoML, a collection of open source models, MLflow 2.5, and Lakehouse Monitoring.

    "With embeddings of files automatically created and managed in Unity Catalog, plus the ability to add query filters for searches, vector search will help developers improve the accuracy of generative AI responses," Minnick said, adding that the embeddings are kept updated using Databricks' Model Serving.

    Embeddings are vectors or arrays that are used to give context to AI  models, a process known as grounding. This process allows enterprises to avoid having to fully train or finetune AI models using the enterprise information corpus.

    Lakehouse AI also comes with a low-code interface to help enterprises tune foundational models.

    "With AutoML, technically skilled developers and non-technical users have a low code way to fine-tune LLMs using their own enterprise data. The end result is a proprietary model with data input from within their organization, not third-party," Minnick said, underlining the company's open source foundation model policy.

As part of Lakehouse AI, Databricks is also providing several foundation models that can be accessed via the Databricks marketplace. Models from Stable Diffusion, Hugging Face and MosaicML, including MPT-7B and Falcon-7B, will be provided, the company said.

The addition of MLflow 2.5 — including new features such as prompt tools and an AI Gateway — is meant to help enterprises manage operations around LLMs.

While AI Gateway will enable enterprises to centrally manage credentials for SaaS models or model APIs and provide access-controlled routes for querying, the prompt tools provide a new no-code interface designed to allow data scientists to compare various models' output based on a set of prompts before deploying them in production via Model Serving.

    "Using AI Gateway, developers can easily swap out the backend model at any time to improve cost and quality, and switch across LLM providers," Minnick said.

    Enterprises will be able to continuously monitor and manage all data and AI assets within the lakehouse with the new Lakehouse Monitoring feature, Databricks said, adding that the feature provides end-to-end visibility into data pipelines.

Databricks already offers an AI governance kit in the form of Unity Catalog.

    Do Databricks' updates leave Snowflake trailing?

    The new updates from Databricks, specifically targeting development of generative AI applications in the enterprise, may leave Snowflake trailing, according to Constellation Research principal analyst Doug Henschen.

    "Both Databricks and Snowflake want their customers to handle all their workloads on their respective platforms, but in my estimation, Databricks is already ready to help them with building custom ML [machine learning], AI and generative AI models and applications," Henschen said, adding that Snowflake's generative AI capabilities, such as the recently announced Snowpark Container Services, is currently in private preview.

Snowflake, according to Amalgam Insights principal analyst Hyoun Park, is just starting to build out language and generative AI capabilities through the NVIDIA NeMo partnership and the Neeva acquisition.

    In contrast, most of Databricks' capabilities are either in general availability or in public preview, analysts said.

    Databricks' new updates may also lead to query performance gains across generative AI use cases, according to Gartner analyst Aaron Rosenbaum, and this may act as a differentiator against rival Snowflake.

    "While Snowflake and Databricks have many common customers, running a wide variety of SQL queries cheaply, quickly and simply is a goal for every one of them," Rosenbaum said.
