The sitemap.xml file is one of the fundamental files of a website: it helps search engine crawlers find and index the site's content.
It therefore supports the discovery phase, in which search engines scan sites for new or updated URLs.
The purpose of this file (the second most important resource on a site, immediately after robots.txt) is to tell crawlers which URLs the site contains, how many there are, and which rules and hints they should follow (or at least consider) while scanning the site, so that the search engine can then populate its content database.
Remember: analyzing the sitemap and its features is one of the SEO consulting services offered to web agencies, companies and individuals.
In this article we cover some theoretical and technical aspects and close with suggestions on how to approach the initial phase of creating the file. That phase, let's remember, is closely tied to how we want to organize the logical structure of the site, its sections, and both its static content (pages, categories, tags, keywords, etc.) and its dynamic content (articles, posts, updates, etc.).
We will also talk about the HTML sitemap, which is the human-readable counterpart of the XML version.
In any case, the dedicated Google Search Console portal contains many useful suggestions for exploring these topics further.
Using the sitemap.xml file
The sitemap.xml file is aimed specifically at search engines and their crawlers, which read it during the crawling phase of the process that ultimately produces a SERP.
This does not mean that end users cannot see it (just as they can see robots.txt), only that serving them is not its real purpose. That purpose, it is worth repeating, is to communicate the list of URLs (and therefore of contents), making it easier for search engines to build a map of the URLs a site contains.
The file is usually placed at the classic path in the site root, which is the most common setup. However, a few important notes apply to this setup:
The sitemap.xml file must always be placed in the main (root) folder and not in a subfolder, because it lists the URLs of the whole site and a sitemap stored in a subfolder can only cover URLs inside that folder.
It is fundamental to segment (and, where needed, hyper-segment) the sitemap.xml file by creating additional files, each containing a specific part of the content that you want scanned and subsequently indexed.
This is because a single file can contain at most 50,000 URLs (and be at most 50 MB uncompressed). Creating several sub-files, divided by type and each with the rules and hints that we will look at more closely later, makes scanning easier and also helps the administrator or site manager track down errors.
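As a minimal sketch of the 50,000-URL limit, the following Python snippet splits a list of URLs into groups that each fit in one sitemap file. The function name chunk_urls and the example.com addresses are assumptions made for illustration only.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit cited in the text


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical site with 120,000 pages
urls = [f"https://www.example.com/page-{i}" for i in range(120_000)]
chunks = list(chunk_urls(urls))
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

In practice the split would normally follow content type (categories, tags, images) rather than position in a list; the size cap is simply the hard limit that any such split has to respect.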
Features of a sitemap.xml
Now let's look at the logical structure of this file and the various tags it uses. In this guide we refer to the raw code and mention only in passing the plugins or modules that can manage these operations.
This is because a first but solid theoretical basis gives a better understanding of the SEO consequences of a correct implementation.
Introduction: the language of the sitemap file is XML (Extensible Markup Language), a markup language that, like HTML, derives from SGML. More than a language in its own right, it is a meta-language meant for defining further languages with their own markers or custom tags.
Let's start with the three mandatory tags:
The first, <urlset> (closed by the </urlset> tag), wraps the whole file and declares the protocol in use. It is preceded by the XML declaration, which states the version and the encoding used: <?xml version="1.0" encoding="UTF-8"?>
Looking more closely: we use XML version 1.0, and the file must be encoded in UTF-8. Any URL that contains non-ASCII characters has to be URL-encoded.
Special characters in URLs must always be replaced with the following classic escape codes.
Characters and corresponding escape codes
|Character|Symbol|Escape code|
|Ampersand|&|&amp;|
|Single quote|'|&apos;|
|Double quote|"|&quot;|
|Greater than|>|&gt;|
|Less than|<|&lt;|
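The escape codes can be applied in code. As a hedged sketch, Python's standard xml.sax.saxutils.escape handles the ampersand and the two angle brackets by default, and the two quote characters can be passed as extra entities. The helper name escape_url and the sample URL are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def escape_url(url: str) -> str:
    """Replace the five special characters with their XML escape codes."""
    # escape() covers &, < and > on its own; the quotes are added explicitly
    return escape(url, {"'": "&apos;", '"': "&quot;"})


print(escape_url("https://www.example.com/search?q=shoes&color=red"))
# https://www.example.com/search?q=shoes&amp;color=red
```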
The second, <url> (closed by the </url> tag), declares and delimits each single URL entry in the file.
The third, <loc> (closed by the </loc> tag), contains the address of the URL (the content) that you want the search engine to index.
The value of this tag:
- must always start with the protocol in use (http or https);
- must be shorter than 2,048 characters;
- must end with a trailing slash (/) if the server configuration requires it.
These three are the tags considered mandatory within the file.
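To make the three mandatory tags concrete, here is a minimal sketch that builds a sitemap with Python's standard xml.etree.ElementTree. The helper names check_loc and build_sitemap and the example.com URLs are assumptions for illustration; the namespace is the one published by sitemaps.org.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_loc(url: str) -> None:
    """Apply the protocol and length rules for <loc> values."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must start with http:// or https://: {url}")
    if len(url) >= 2048:
        raise ValueError("URL must be shorter than 2,048 characters")


def build_sitemap(urls: Iterable[str]) -> str:
    """Return a sitemap document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        check_loc(address)
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes &, < and > in element text automatically
        ET.SubElement(url, "loc").text = address
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/shop?cat=1&page=2",
])
print(sitemap_xml)
```

The second sample URL contains an ampersand, so the printed <loc> value shows it as &amp; without any manual escaping.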
As already mentioned, it is a good idea to create an index sitemap that points to other specific sitemaps, bearing in mind that the index file has its own set of tags. We will see them soon.
Lastmod, changefreq and priority of the sitemap.xml file
Three other tags can be included in the file. They are not mandatory, only recommended: <lastmod>, <changefreq> and <priority>.
Let's look at them in detail, remembering that they matter a great deal (the first in particular, if properly set) for how search engines scan the site and therefore have a direct influence on SEO.
The first, <lastmod> (closed by the </lastmod> tag), gives the crawler a very important piece of data: when the associated URL was last modified.
This is information of great practical value.
Imagine a sitemap containing thousands of URLs with no distinction between types of content, and therefore not appropriately segmented.
Every time the crawler visits the file, it is forced to check it and download it in full.
But thousands of URLs generally amount to several MB, and downloading them takes up bandwidth that the crawler can then no longer devote to other operations or specific scans.
In essence: heavy file → a lot of bandwidth spent on it → less crawling capacity left for the rest of the site
As you can see, this is not the optimal solution for SEO. (It is a check you should make when carrying out an SEO audit.)
This is where <lastmod> helps. By adding it to the declaration of each URL, we can tell the search engine which contents have been modified since its last visit to (or scan of) our site.
It is even more important to add <lastmod> to the entries of the sitemap index file, so that the crawler knows in advance whether anything has changed within a single sub-sitemap (only content, only tags, only categories, only images, only videos, only Custom Post Types, etc.).
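The idea can be illustrated from the consumer's side. The sketch below reads a small sitemap and returns only the URLs whose <lastmod> is later than a given date. It is an illustration of the principle under stated assumptions, not a description of how any real crawler works; the function name modified_since and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-03-10</lastmod></url>
  <url><loc>https://www.example.com/about/</loc><lastmod>2023-11-02</lastmod></url>
</urlset>"""


def modified_since(sitemap_xml: str, last_visit: date) -> List[str]:
    """Return the <loc> values whose <lastmod> is after `last_visit`."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Only the date part (YYYY-MM-DD) is compared in this sketch
        if lastmod and date.fromisoformat(lastmod[:10]) > last_visit:
            changed.append(loc)
    return changed


print(modified_since(SITEMAP, date(2024, 1, 1)))  # ['https://www.example.com/']
```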
The second, <changefreq> (closed by the </changefreq> tag), is less important than the previous one.
It tells the search engine crawler how often the content at that URL is likely to change. The frequency can be set to one of seven values:
- always
- hourly
- daily
- weekly
- monthly
- yearly
- never
A few considerations are in order:
- always can be used for pages whose data changes every time a user accesses them;
- never can be used for content that is considered archived. This does not mean, however, that the crawler will stop scanning URLs marked this way, since it may still check them for unexpected changes.
More importantly, the <changefreq> tag makes sense above all for newly created sites, because by suggesting a more detailed schedule it can push the crawler to come back more frequently.
Once a web project is indexed and consolidated, however, this tag loses some of its importance, because the crawler itself (or rather, the parser that guides it) builds its own schedule of repeated scans.
The third optional tag that can be set within the sitemap.xml file is <priority>, closed by the corresponding </priority> tag.
It conveys a very important piece of information: the relative weight we want to give to the individual contents of the site (pages, categories, articles, products, tags, images, etc.).
Not all contents or sections of a site are of equal interest to us and to the users who visit it. It therefore makes sense to differentiate them in the eyes of search engines too, and we can do so by setting a numerical value that expresses their importance.
The scale runs from 1.0 (most important content) down to 0.0 (no importance).
The home page is generally the most important page and is assigned priority 1.0; values then decrease until they reach the less useful pages, which can be assigned values below 0.5.
Some important notes on this tag:
- The search engine crawler is able, on its own and after several scans, to work out the structure of the site and map it correctly;
- The priority of individual contents may be set incorrectly, which makes the tag useless. It is therefore better not to set it at all than to set it wrongly;
- A logical and coherent site structure, with correct, well-organized internal linking that steers towards the most valuable pages (above all the commercial ones), can replace the priority declaration and makes the site easier to scan.
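Putting the three optional tags together, the following sketch builds a single <url> entry and rejects values outside the ranges described above. The function name url_entry and the sample values are assumptions for illustration; only the seven changefreq values and the 0.0 to 1.0 priority scale come from the text.

```python
import xml.etree.ElementTree as ET
from typing import Optional

VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def url_entry(loc: str,
              lastmod: Optional[str] = None,
              changefreq: Optional[str] = None,
              priority: Optional[float] = None) -> ET.Element:
    """Build one <url> element; the optional tags are added only when given."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod is not None:
        ET.SubElement(url, "lastmod").text = lastmod  # e.g. "2024-03-10"
    if changefreq is not None:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return url


home = url_entry("https://www.example.com/", lastmod="2024-03-10",
                 changefreq="weekly", priority=1.0)
print(ET.tostring(home, encoding="unicode"))
```

Leaving every optional argument as None by default mirrors the advice above: an omitted tag is preferable to one that is set incorrectly.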
Segment and hyper segment the sitemap.xml file
As already mentioned, it is good practice (indeed, it should be the rule) to divide the sitemap.xml file into subsections, each with its own specific sitemap.
These additional sitemaps group individual contents by type, for example only categories, only tags, only images or only videos.
This helps the crawler, and it helps the site administrator look for errors or warnings.
Remember: the sitemap can be segmented further by generating files that contain only contents divided, for example, by month of publication, by single category or by single tag.
However, as we will see in the next section, it is essential to always set the <lastmod> tag for each individual sitemap, to tell the crawler whether anything has changed since its last visit or scan.
Structure of the sitemap index and individual sitemaps
As with the tags of the generic sitemap seen above, the sitemap index has its own specific tags, some mandatory and some recommended.
Specifically, these are <sitemapindex>, <sitemap>, <loc> and <lastmod>.
The first three must be included in the file; the fourth, although not mandatory, is recommended for reasons that we will see shortly.
The first, <sitemapindex> (closed by the </sitemapindex> tag), declares the protocol and therefore which tags will be used in the file that lists the various specific sitemaps: <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
The second, <sitemap> (closed by the </sitemap> tag), indicates and delimits each individual sitemap.
The third, <loc> (closed by the </loc> tag), gives the exact path of the sitemap file that holds the corresponding list of URLs.
The fourth, <lastmod> (closed by the </lastmod> tag), is essential although not mandatory.
Its use is closely linked to the creation of a segmented sitemap system. The reason is that when search engines visit a site through their crawler, they read what is declared in the main sitemap file (the sitemap index).
If they find a <lastmod> declaration, they compare it with the date of their last scan and download only the sitemaps that have been modified since then.
This is a fundamental hint, because it lets crawlers optimize their work by downloading only what has actually changed instead of the entire sitemap, which may weigh several megabytes.
Precisely for this reason it would be good practice, even if not easy to implement, to create many sitemaps that each contain a single logical section of the site: categories, tags and types of images.
They can also be divided by the month in which articles and contents were published.
In this way we obtain many smaller files, each containing fewer URLs and therefore easier to download.
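A sitemap index following this structure can be sketched in a few lines of Python. The function name build_sitemap_index, the file names and the dates are assumptions for illustration; the four tags and the namespace are those of the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps: Dict[str, Optional[str]]) -> str:
    """Map each sub-sitemap URL to its last modification date (or None)."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps.items():
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
        if lastmod:  # recommended, but not mandatory
            ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical segmentation by content type
index_xml = build_sitemap_index({
    "https://www.example.com/sitemap-categories.xml": "2024-03-10",
    "https://www.example.com/sitemap-tags.xml": "2024-02-01",
})
print(index_xml)
```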
So far we have only talked about the XML format, but it is not the only format in which you can create and declare a sitemap.
As you can read in the official guides, both on sitemaps.org and on Google (taken here as an example of a search engine), sitemaps in the following formats can also be processed:
- RSS Feed 2.0 or Atom 0.3 / 1.0
- text file
- Google Sites (only for Google and not included in the official sitemaps.org guidelines)
The guides mentioned above describe all the features of the first two formats.
Briefly, the first is used in particular by news sites and blogs that publish RSS feeds by declaring the relevant fields; its limitation, however, is that it does not cover all the URLs but only the most recently published ones.
The second is a simple list of URLs, one per line, in a file that is then placed in the root of the site and named so that the crawler can find and scan it.
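The text-file format is simple enough to generate by hand. As a rough sketch, the function below keeps only absolute http or https URLs, drops duplicates and writes one URL per line; the name build_text_sitemap and the sample URLs are assumptions for illustration.

```python
from typing import Iterable


def build_text_sitemap(urls: Iterable[str]) -> str:
    """Return the content of a text sitemap: one absolute URL per line."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip relative or non-web addresses and anything already listed
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


print(build_text_sitemap([
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/",   # duplicate, dropped
]))
```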
The html sitemap
This is a really important page that can be added to a website: for the user it is the "visible" version of the sitemap.xml file, reachable during normal browsing.
It contains all the URLs on the site and helps both visitors, who immediately have all the contents at hand, and search engines, which get a further opportunity to map the site precisely.
If you use a CMS such as WordPress, Joomla, Magento or Prestashop, plugins or modules are available which, when properly configured, create and update this page of the site automatically.
Sitemap for mobile version
If the site has a dedicated mobile version (for example one built with AMP technology) whose contents are created specifically for those devices, it is recommended to create specific sitemaps that contain only the mobile URLs and to submit them to the search engine.
If instead you have a responsive site, the normal sitemap is sufficient.
The sitemap path in the robots.txt file
It is good practice to add the path of the sitemap.xml (whether a single file or an index, if one has been created) to the robots.txt file.
robots.txt is the first file that crawlers look for and scan, and it must contain (placed in the root) all the directives and hints that the webmaster wants to give to crawlers, such as the pages or sections to be scanned or blocked, any blocks for specific bots, etc.
It is also essential to enter the sitemap path using this specific syntax: Sitemap: followed by the full URL of the file, for example Sitemap: https://www.example.com/sitemap.xml
Together with submission through the dedicated sections of the search engines and / or via an HTTP request, this is the third official method for submitting the file to search engines.
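To check that a robots.txt file declares its sitemap correctly, Python's standard urllib.robotparser can be used (its site_maps() method requires Python 3.8 or later). The robots.txt content below is a made-up example, parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt placed in the site root
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs declared in the file (None if there are none)
print(parser.site_maps())
# Whether a generic crawler may fetch a blocked and an open path
print(parser.can_fetch("*", "https://www.example.com/private/page"))
print(parser.can_fetch("*", "https://www.example.com/"))
```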
How to generate an online sitemap
You can generate a sitemap.xml in different ways:
- via online tools
- via plugin / modules for CMS (WordPress, Joomla for example)
In the first case, this Google Code site lists a good number of resources that you can test and evaluate individually.
A good sitemap generator is the one you find at this site: https://www.xml-sitemaps.com
The limit of the free version is that it creates files containing at most 500 URLs.
Beyond this number, if you are interested, you need to sign up for a monthly subscription plan that costs about 4 dollars.
This way you can unlock other resources and more advanced options.
Another sitemap generator is the one included in the dedicated section of the Screaming Frog desktop software. Note, however, that it is paid software.
In the second case you can evaluate, depending on the CMS in use, the most appropriate plugins or modules:
- for WordPress you can consult this series of links found in the specific SERP, bearing in mind that the two best-known and most used plugins are Yoast SEO and Google XML Sitemaps;
- for Joomla you can consult this SERP, and likewise for Magento and Prestashop.
Send and verify a Sitemap via the Google Search Console
Having seen how to create a sitemap, let's now look at the procedure for submitting it with the tools that Google makes available.
We chose this search engine because it is the most widely used, but you can easily find the corresponding procedures for Bing, Yandex, Yahoo, etc.
After generating the file and placing it in the root of the site, we can log in to the Google Search Console administration area.
In the old version, still in use, the path is: Property of the site → Scan → Sitemap. This opens the dedicated section in which you can check and send the sitemap and then, even several days later, test it. In the new version, not yet mandatory, the path is: Property name → Sitemap → Add a new sitemap
Open SEO Stats: extension for Chrome and Firefox
Open SEO Stats is a free extension for the main browsers (Google Chrome and Mozilla Firefox) that lets you see immediately whether sitemap.xml is present on the site being analyzed.
It is a convenient extension that is easy to find in the browser's extension store and easy to install. Once loaded in the menu bar, it analyzes some SEO aspects of the site and specifically checks whether the file is present or not.
Example of a site’s logical sitemap
We conclude this long article with a consideration: before actually creating a sitemap file, it is essential to study the site structure very carefully.
The structure must be logical and consistent across the various sections that will be implemented.
This is an exercise that is not only mental and that must come before the practical implementation; every SEO specialist or web agency uses its own method for it.
The sitemap is a consequence of this study and this working method.
I will present mine, without claiming that it is definitive or perfect. It is the method I have devised and refined over the years and that I propose to my clients or during the training courses I offer.
Whether it is a new project or a site that already has an online presence, my first step is a careful study of the customer's commercial sector and of the competitors that have a structured presence.
For sites already online, I also carefully evaluate how the site was built, which sections it has, how the URLs were generated and how the contents are entered.
All this gives me a clearer idea of the sector I will be dealing with.
After finishing the long work of researching, managing and selecting the best specific keywords, I define the structure in an Excel sheet (but you can just as easily use LibreOffice Calc or Google Sheets).
I create several columns, divided as follows:
- Menu
- Levels
- URL
- Keywords (from a minimum of 1 to a maximum of 5)
- Link IN
- Link OUT
- Meta Title
- Length of the Meta Title
- Meta Description
- Length of the Meta Description
Each of these items is set up as a column, while the specific contents are entered in the individual rows.
- With the Menu item I define the specific position on the site in which the content (row) must appear. It can be main menu, top menu, left sidebar, footer, top footer, etc.
- With the Levels item I define the depth of the site, always starting from level 1, which is the home page.
- With the URL item I define the structure and path of the URLs that identify the sections and contents.
- With the Keywords item I enter the specific keywords that must appear in the text of the contents themselves. This is perhaps the longest and most laborious step, because all the keywords found during the long research phase must be studied and categorized.
- With the Link IN item I record the number of incoming links to that specific content.
- With the Link OUT item I record the number of outgoing links from that specific content.
- With the Meta Title item I enter the title that will then be part of the snippet in the SERP.
- With the Length of the Meta Title item I record the number of characters in the Meta Title.
- With the Meta Description item I enter the description of the content, which will also form part of the snippet (or rich snippet, if I also add structured data) in the SERP.
- With the Length of the Meta Description item I record the number of characters in the Meta Description.
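The same planning sheet can also be produced as a CSV file. The sketch below is one possible way to do it, not the method described above: it fills the two length columns automatically and enforces the limit of 1 to 5 keywords. The function name plan_row and the sample row are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List

COLUMNS = [
    "Menu", "Levels", "URL", "Keywords", "Link IN", "Link OUT",
    "Meta Title", "Length of the Meta Title",
    "Meta Description", "Length of the Meta Description",
]


def plan_row(menu: str, level: int, url: str, keywords: List[str],
             link_in: int, link_out: int,
             meta_title: str, meta_description: str) -> Dict[str, object]:
    """Build one row of the sheet; the length columns are computed."""
    if not 1 <= len(keywords) <= 5:
        raise ValueError("use between 1 and 5 keywords per content")
    return {
        "Menu": menu, "Levels": level, "URL": url,
        "Keywords": ", ".join(keywords),
        "Link IN": link_in, "Link OUT": link_out,
        "Meta Title": meta_title,
        "Length of the Meta Title": len(meta_title),
        "Meta Description": meta_description,
        "Length of the Meta Description": len(meta_description),
    }


# Hypothetical home page row
home = plan_row("main menu", 1, "/", ["seo consulting"], 12, 4,
                "SEO Consulting", "Sitemap analysis and SEO audits.")

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(home)
print(buffer.getvalue())
```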
By following this procedure I am confident of creating a good sitemap, and one that can be updated continuously in a simple and intuitive way.