Media index error after upgrading to Sitecore 9.3

Sitecore 9.3 handles the indexing of media files, such as PDFs, differently from older versions. Here is the error you are likely to see after the upgrade and how to fix it.


The update to Sitecore 9.3 changed the way Sitecore extracts text from media files for indexing. If you had an index for media files that includes PDFs in an earlier version of Sitecore, you will probably see the following warnings in the crawling log when you rebuild that index after upgrading to Sitecore 9.3:

# sitecore crawling log
6160 11:10:48 WARN  [Index=custom_content_index] Crawler : AddRecursive DoItemAdd failed - {06A90343-0E30-4DEF-85BF-184D0FABAD76}
Exception: System.ArgumentException
Message: '.', hexadecimal value 0x00, is an invalid character.
Source: System.Xml
   at System.Xml.XmlEncodedRawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlEncodedRawTextWriter.WriteString(String text)
   at System.Xml.XmlWellFormedWriter.WriteString(String text)
   at System.Xml.Linq.ElementWriter.WriteElement(XElement e)
   at System.Xml.Linq.XElement.WriteTo(XmlWriter writer)
   at System.Xml.Linq.XNode.GetXmlString(SaveOptions o)
   at SolrNet.Commands.AddCommand`1.ConvertToXml()
   at SolrNet.Commands.AddCommand`1.Execute(ISolrConnection connection)
   at SolrNet.Impl.LowLevelSolrServer.SendAndParseHeader(ISolrCommand cmd)
   at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddRange(IEnumerable`1 group, Int32 groupSize)
   at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddDocument(Object itemToAdd, IExecutionContext[] executionContexts)
   at Sitecore.ContentSearch.SolrProvider.SolrIndexOperations.ApplyPermissionsThenIndex(IProviderUpdateContext context, IIndexable version)
   at Sitecore.ContentSearch.SitecoreItemCrawler.DoAdd(IProviderUpdateContext context, SitecoreIndexableItem indexable)
   at Sitecore.ContentSearch.HierarchicalDataCrawler`1.CrawlItem(T indexable, IProviderUpdateContext context, CrawlState`1 crawlState)

14548 11:10:54 WARN  [Index=custom_content_index] IntervalAsynchronousUpdateStrategy triggered but muted. Index is being built at the moment.
6160 11:10:55 WARN  [Index=custom_content_index] Crawler : AddRecursive DoItemAdd failed - {414C7629-328C-4316-9599-F227E017612B}
Exception: System.ArgumentException
Message: '', hexadecimal value 0x1F, is an invalid character.
Source: System.Xml
   at System.Xml.XmlEncodedRawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlEncodedRawTextWriter.WriteString(String text)
   at System.Xml.XmlWellFormedWriter.WriteString(String text)
   at System.Xml.Linq.ElementWriter.WriteElement(XElement e)
   at System.Xml.Linq.XElement.WriteTo(XmlWriter writer)
   at System.Xml.Linq.XNode.GetXmlString(SaveOptions o)
   at SolrNet.Commands.AddCommand`1.ConvertToXml()
   at SolrNet.Commands.AddCommand`1.Execute(ISolrConnection connection)
   at SolrNet.Impl.LowLevelSolrServer.SendAndParseHeader(ISolrCommand cmd)
   at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddRange(IEnumerable`1 group, Int32 groupSize)
   at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddDocument(Object itemToAdd, IExecutionContext[] executionContexts)
   at Sitecore.ContentSearch.SolrProvider.SolrIndexOperations.ApplyPermissionsThenIndex(IProviderUpdateContext context, IIndexable version)
   at Sitecore.ContentSearch.SitecoreItemCrawler.DoAdd(IProviderUpdateContext context, SitecoreIndexableItem indexable)
   at Sitecore.ContentSearch.HierarchicalDataCrawler`1.CrawlItem(T indexable, IProviderUpdateContext context, CrawlState`1 crawlState)
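
Both exceptions are thrown while SolrNet serializes the document to XML, and the reported characters (0x00 and 0x1F) are control characters that XML 1.0 does not allow. To find out which media items are affected, a crawling log like the one above can be scanned for these entries. The following is a minimal sketch in Python, independent of Sitecore. The function names `failed_items` and `is_valid_xml10_char` are hypothetical helpers, and the regular expressions assume the log format shown above.

```python
import re

# Assumption: the crawler warning line has the shape shown in the log above
# and carries the item ID in braces.
FAILED_ADD = re.compile(r"DoItemAdd failed - \{([0-9A-Fa-f-]{36})\}")
# Assumption: the exception message reports the offending character as hex.
INVALID_CHAR = re.compile(
    r"hexadecimal value 0x([0-9A-Fa-f]+), is an invalid character"
)


def failed_items(log_text):
    """Return (item_id, code_point) pairs for items rejected for an invalid character."""
    results = []
    current_id = None
    for line in log_text.splitlines():
        match = FAILED_ADD.search(line)
        if match:
            # Remember the item until its exception message shows up.
            current_id = match.group(1)
            continue
        match = INVALID_CHAR.search(line)
        if match and current_id is not None:
            results.append((current_id, int(match.group(1), 16)))
            current_id = None
    return results


def is_valid_xml10_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


if __name__ == "__main__":
    sample = (
        "6160 11:10:48 WARN  [Index=custom_content_index] Crawler : "
        "AddRecursive DoItemAdd failed - {06A90343-0E30-4DEF-85BF-184D0FABAD76}\n"
        "Exception: System.ArgumentException\n"
        "Message: '.', hexadecimal value 0x00, is an invalid character.\n"
    )
    for item_id, code_point in failed_items(sample):
        print(item_id, hex(code_point), is_valid_xml10_char(code_point))
```

The resulting item IDs can then be looked up in the Content Editor to confirm that the failing items are media files such as PDFs.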

To fix this problem, you have to enable the PDF IFilter for extracting media file content by registering the IFilter-based extractor as the media file text extractor. Modify the services section of the config file (App_Config\Sitecore\ContentSearch\Sitecore.ContentSearch.config) like this:

# Sitecore.ContentSearch.config
<services>
  <register serviceType="Sitecore.ContentSearch.ContentExtraction.IMediaFileTextExtractor, Sitecore.ContentSearch.ContentExtraction"
            implementationType="Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IFilterMediaExtractor, Sitecore.ContentSearch" />
</services>

You can find more information about indexing media content on the Sitecore documentation page "Index content of media files".

Stefan Busch