Prevent Search Engine Indexing Using Robots.txt

Example Robots.txt Files

Choose the robots.txt file most appropriate to your situation:

1. To prevent indexing of the entire server use:

User-agent: *
Disallow: /

2. To prevent indexing of only Confluence and JIRA, while still allowing other applications deployed on the same application server to be indexed:

# This needs to be at /robots.txt
# tomcat: put it in webapps/ROOT
# apache and tomcat integrated: put it in the root pages directory
User-agent: *
Disallow: /jira/
Disallow: /confluence/pages
Disallow: /confluence/spaces
Disallow: /confluence/dashboard.action
Disallow: /confluence/administrators.action
Disallow: /confluence/searchsite.action
Disallow: /confluence/display/[space to disallow]
# ignore user pages
Disallow: /confluence/display/~
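
One way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser module. The sketch below is only an illustration and not part of Confluence: the helper name is_allowed and the sample paths are made up for the example, and the rule set is a shortened copy of example 2 with a User-agent line in front of the Disallow lines. The standard-library parser does plain prefix matching only.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of example 2. The parser only applies Disallow lines
# that belong to a group introduced by a User-agent line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /jira/
Disallow: /confluence/pages
Disallow: /confluence/display/~
"""


def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules above.
    for path in ("/jira/browse/TEST-1",
                 "/confluence/pages/viewpage.action",
                 "/otherapp/index.html"):
        print(path, "allowed" if is_allowed(ROBOTS_TXT, path) else "blocked")
```

With this parser, Disallow lines that are not preceded by a User-agent line are ignored, so a file consisting only of "Disallow: /jira/" blocks nothing.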

Instructions For Standalone Editions On Their Own Domain

  1. Open your Confluence install directory and save the appropriate file as \confluence\robots.txt. For example: c:\confluence-2.5.4-std\confluence\robots.txt
  2. Visit your Confluence domain and confirm that robots.txt is now accessible in the root directory. For example, if your domain is http://confluence.atlassian.com, the file should be accessible at http://confluence.atlassian.com/robots.txt

General Instructions

Save the appropriate file as robots.txt in the application server root directory. For Apache Tomcat, place it under ../webapps/ROOT.
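
Crawlers look for robots.txt only at the root of the host, which is why the file has to be saved in the server's root directory rather than under an application's context path. The following sketch, with hypothetical helper names robots_url and robots_is_served, derives that location from any page URL and optionally checks that the server answers with HTTP 200. It assumes plain HTTP(S) access without authentication.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_is_served(page_url: str, timeout: float = 10.0) -> bool:
    """Return True if the host answers the robots.txt request with HTTP 200."""
    try:
        with urlopen(robots_url(page_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False


if __name__ == "__main__":
    print(robots_url("http://confluence.atlassian.com/display/DOC/Home"))
```

Calling robots_is_served makes a real network request, so it is best run from a machine that can reach the server.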

  1. Jul 01, 2007

    Peter R. says:

    Here's one we're using to try to get our Google Enterprise Search Appliance to stop using up the majority of our Confluence server resources. Feedback welcome:

    # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
    # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
    # some links shouldn't show to an anonymous browser such as GSA but are included for completeness
    
    # Updated 2007.06.30.09.44
    
    User-agent: * # match all bots. The Google Search Appliance (GSA) is our primary crawler but logs indicate there may be others on our Intranet
    Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
    Request-rate: 1/5 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, maximum rate is one page every 5 seconds. may not work
    # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
    Disallow: /*?decorator=printable # remove printable version links, non-display URLs
    Disallow: /*javascript* # remove any javascript links, per log analysis
    Disallow: /admin/ # administrator links
    Disallow: /administrators.action? # remove any administrator links
    Disallow: /createrssfeed.action? # remove internal RSS links
    # Disallow: /dashboard.action # primary dashboard link
    Disallow: /dashboard.action? # remove secondary dashboard links, not needed for anonymous crawling
    Allow: /display # ensure primary display pages are allowed
    Disallow: /display/*&tasklist.complete= # remove tasklist links
    Disallow: /display/*&tasklist.uncomplete= # remove tasklist links
    Disallow: /display/*?decorator=normal # remove redundant link for standard display
    Disallow: /display/*?decorator=printable # remove printable version links, display URLs
    Disallow: /display/*?focusedCommentId= # remove page comment focus links
    Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
    Disallow: /display/*?replyToComment= # remove reply to comment links
    Disallow: /display/*?rootCommentId= # remove news comment focus links
    Disallow: /display/*?showChildren= # remove the children view links, not needed, anonymous defaults to showing children
    # Disallow: /display/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
    Disallow: /display/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
    Disallow: /display/*showComments= # remove comment links
    Disallow: /display/WikiDevQA/ # remove the DEV Space from being indexed
    Disallow: /doexportpage.action? # remove pdf export links
    Disallow: /dopeopledirectorysearch.action # people search
    Disallow: /dosearchsite.action # remove generic site searches
    Disallow: /dosearchsite.action? # remove specific site searches
    Disallow: /download/attachments/*?version= # knock out previous versions of attachments
    Disallow: /download/userResources/ # knock out user resource links, per log analysis
    Disallow: /download/resources/ # knock out resource links, per log analysis
    Disallow: /dwr/ # knock out DWR links, per log analysis and http://getahead.org/dwr/
    Disallow: /exportword? # remove word export links
    Disallow: /form-mail-plugin/ # remove form mail links
    Disallow: /label/ # remove all label links, per vendor analysis
    Disallow: /labels/ # remove all label links, per vendor analysis
    Disallow: /labels-javascript # remove label javascript
    Allow: /labels/listlabels-alphaview.action # allow label index page
    Disallow: /login.action # remove the login page
    Disallow: /login.action? # remove the login page derivatives
    # Next line, 35, will be enabled when line after, 36, is removed
    # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
    Disallow: /pages/ # this line to purge GSA of all old page entries, _may_ eventually be removed so that specific /pages/ lines below take effect and non-html compatible titled pages can be crawled
    # DISABLED FOR NOW Disallow: /pages/pageinfo.action? # exclude all the previous versions of pages by excluding Page Info pages
    # Disallow: /pages/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
    Disallow: /pages/*&tasklist.complete= # remove tasklist links
    Disallow: /pages/*&tasklist.uncomplete= # remove tasklist links
    Disallow: /pages/*?decorator=normal # remove redundant link for standard display
    Disallow: /pages/*?decorator=printable # remove printable version links, display URLs
    Disallow: /pages/*?focusedCommentId= # remove page comment focus links
    Disallow: /pages/*?refresh= # prevent crawler from clicking refresh button
    Disallow: /pages/*?replyToComment= # remove reply to comment links
    Disallow: /pages/*?rootCommentId= # remove news comment focus links
    Disallow: /pages/*?showChildren=false # remove the don't show children link, not needed, per log analysis
    Disallow: /pages/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
    Disallow: /pages/*showComments= # remove comment links
    Disallow: /pages/copypage.action? # remove copy page links
    Disallow: /pages/createblogpost.action? # remove add news links
    Disallow: /pages/createpage.action? # remove add page links
    Disallow: /pages/diffpages.action? # remove page comparison pages
    Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
    Disallow: /pages/editblogpost.action? # remove edit news links
    Disallow: /pages/editpage.action? # remove edit page links
    Disallow: /pages/removepage.action? # remove the remove page links
    Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
    Disallow: /pages/templates # remove template pages
    Disallow: /pages/templates/ # block template indexes
    Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
    Disallow: /pages/viewpage.action?*&showComments # remove comments links
    Disallow: /pages/viewpage.action?spaceKey= # remove page view links that are "duplicates" of the Display URL pages
    Disallow: /pages/viewpagesrc.action? # remove view page source links
    Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
    Disallow: /plugins/ # blocks plug-in calls
    Disallow: /rpc/ # remove any RPC links
    Disallow: /s/ # remove any links to label calls down this path, per log analysis
    Disallow: /searchsite.action? # remove the wiki search engine pages
    Disallow: /spaces/*&decorator=printable # remove printable version links
    Disallow: /spaces/blogrss.action? # remove rss links
    Disallow: /spaces/listrssfeeds.action? # remove rss links
    Disallow: /spaces/viewmail.action? # remove view mail links (we don't have email integration enabled anyway)
    Disallow: /spaces/viewmailarchive.action? # remove view mail archive links (we don't have email integration enabled anyway)
    Disallow: /spaces/viewthread.action? # remove view mail thread links (we don't have email integration enabled anyway)
    Disallow: /themes/ # theme links
    Disallow: /users/ # remove user action pages
    Disallow: /x/ # remove tiny link urls
    
    # End file
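
Many of the rules in the file above rely on the "*" wildcard, which is a Google extension rather than part of the original robots.txt standard. As a rough illustration of how such a pattern is matched against a URL path, here is a minimal sketch assuming Google-style semantics: "*" matches any run of characters, a trailing "$" anchors the end of the URL, and everything else is a literal prefix match. The function names are hypothetical, and a real crawler may differ in details such as percent-encoding and Allow/Disallow precedence.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the URL path."""
    # re.match anchors at the start of the path, giving prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    rule = "/display/*?decorator=printable"
    for path in ("/display/DOC/Home?decorator=printable", "/display/DOC/Home"):
        print(path, "matches" if matches(rule, path) else "does not match")
```

Under these semantics, a rule ending in a literal "?" such as "/dashboard.action?" does not match the bare "/dashboard.action", which is consistent with the file listing the two forms separately.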
    
    1. Jun 25, 2007

      Charles Miller says:

      I'm surprised all of this is necessary. Most of the things you list are already protected by robots META headers in the application itself.

      Is your search appliance configured to respect rel="nofollow" on links?

      1. Jun 25, 2007

        Peter R. says:

        In CSP-8619 I'm being told that this is a good robots.txt to help stop our GSA from impacting our performance. At no point has anyone said that there are robots META headers in the application.

        Yes, our search appliance respects nofollow on links. Even so, it's churned through almost 80,000 URLs in our instance and we've only got ~7500 current version pages. Since Confluence generates pages dynamically with no cache, GSA will circle back and hit the same page over and over. Before we implemented robots.txt it was chewing up 85% of the bandwidth of the site! Over 250 requests a second at one point, making it our prime suspect in our performance and application hang issues.

        Matt Ryall even suggested opening a feature request, which I did at CONF-8749.

        If you can see something we're all missing I'll be forever in your debt! We're losing users back to MediaWiki and other "unapproved" applications because of this issue.

        Peter

        1. Jun 28, 2007

          Peter R. says:

          I was hoping to hear back on this...

          1. Jun 28, 2007

            Guy Fraser says:

            It might be worth posting a message on the conf-user and conf-dev mailing lists telling people about this page to get more people looking at it...

            1. Jul 01, 2007

              Peter R. says:

              I was actually in the middle of doing so when Charles commented on the page. Given his credentials, I wanted to see what he had to say before linking to it from the forums / mailing list. I'll give him another day or so here and then link.

        2. Jul 01, 2007

          Charles Miller says:

          Well, looking at this page, for example:

          • The printable and pdf export links are served with rel=nofollow attributes
          • The info page is served with:
              <meta name="robots" content="noindex,nofollow">
              <meta name="robots" content="noarchive">

          It would be really useful if someone with the Google search appliance could:

          • Verify if this is true for their instance
          • Try to get some diagnostic info from the appliance itself to determine why those pages are being indexed anyway
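
For the first check, the robots META directives of a page can be pulled out of its HTML source with a few lines of Python. This is a minimal sketch using the standard html.parser module. The class and function names are hypothetical, and the sample markup is the pair of tags quoted above, not output captured from a live instance.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for directive in (attr.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots META directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = ('<meta name="robots" content="noindex,nofollow">'
              '<meta name="robots" content="noarchive">')
    print(sorted(robots_directives(sample)))
```

An empty result for a page means it carries no robots META tag at all, so a crawler would have no page-level reason to skip it.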
          1. Jul 01, 2007

            Peter R. says:

            I can verify that our PROD GSA has ~102,000 URLs from the Confluence system with "decorator=printable" in the URL. When I visit one of the URLs and look at the page source I do NOT see the noarchive attribute:

            <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
            <!-- main.vmd
              themebuilder : 'com.adaptavist.confluence.sitebuilder.SiteBuilderVelocityHelper@27478ea9'/'$themebuilder.initialise'
              spaceKey : 'QuickStart'
              pageId : '766'
              currentURL : '/pages/viewpage.action?spaceKey=QuickStart&title=Plug-in+Requests&decorator=printable'
              contextPath : ''
              spaceName : 'Wiki Quick Start'
              decorator : 'printable'
              printable : 'true'
              mailId : '$mailId'
              mode : 'view'
              context : 'page'
            -->
            <html>
            <head>
                <title>Page Name Removed for Privacy</title>
                <META HTTP-EQUIV="Pragma" CONTENT="no-cache">
                <META HTTP-EQUIV="Expires" CONTENT="-1">
                <script type="text/javascript" language="JavaScript">var contextPath = '';</script>
                <script language="javascript">
                    var contextPath = '';
                    var i18n = [];
                </script>
                <link rel="shortcut icon" href="/images/icons/favicon.ico">
                <link rel="icon" type="image/png" href="/images/icons/favicon.png">
                <script type="text/javascript" language="JavaScript" src="/decorators/effects.js"></script>
                <script type="text/javascript" language="JavaScript" src="/download/resources/com.adaptavist.confluence.themes.sitebuilder:sitebuilder/icons/visibility.js"></script>
                <style type="text/css">
                .breadcrumb2 {display:none;visibility:false;}
                </style>
                <link rel="stylesheet" type="text/css" href="/plugins/sitebuilder/sitebuilder-resources.action?spaceKey=QuickStart&resource=panelcss&hash=1121034283"/>
                <link rel="stylesheet" type="text/css" href="/plugins/sitebuilder/sitebuilder-resources.action?spaceKey=QuickStart&resource=css&hash=1121034283"/>
                <!--[if gte IE 5.5000]>
                    <link rel="stylesheet" type="text/css" href="/plugins/sitebuilder/sitebuilder-resources.action?spaceKey=QuickStart&resource=iecss&hash=1121034283"/>
                <![endif]-->
                <!--[if gte IE 5.5000]>
                    <script type="text/javascript" language="JavaScript" language="JavaScript" src="/download/resources/com.adaptavist.confluence.themes.sitebuilder:sitebuilder/icons/PieNG.js"></script>
                <![endif]-->
                <script type="text/javascript" src="/download/resources/com.adaptavist.confluence.themes.sitebuilder:sitebuilder/general/browser_detect.js"></script>
            
            </head>
            

            Given that, that's why the GSA is indexing it...

            1. Jul 01, 2007

              Peter R. says:

              I took a look at an Info page for one of our pages and verified that the PDF version does indeed have the "nofollow" attribute.

              I also see in the header:

                  <meta name="robots" content="noindex,nofollow">
                  <meta name="robots" content="noarchive">
              

              So that's telling me that the info page itself isn't being indexed, correct? Shouldn't it also mean that no links on the actual page should be followed? And also make the nofollow attribute on the URLs for PDF, Copy and Export to Word redundant?

              Due to the distributed/outsourced nature of our support, it's going to be difficult for me to get configuration info from the GSA side. However, I can easily do an advanced search through the interface to see what pages it has indexed, which is what I did with the decorator=printable above.

              Given that, how can I help you help me/us on this?

              Thank you.

              Peter

            2. Aug 08, 2007

              Guy Fraser says:

              Peter - Theme Builder 3.0 beta 4 should solve that problem:

              http://jira.adaptavist.com/browse/BUILDER-662

              I also added this feature request (might not make it into the 3.0 release):

              http://jira.adaptavist.com/browse/BUILDER-591