Robots.txt: The "Traffic Rules" Between a Website and Search Engines

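What Is Robots.txt?

Robots.txt is a plain-text file placed in a website's root directory. It works like a traffic rulebook between the site and search engine crawlers. When a search engine bot (such as Googlebot or Bingbot) visits your site, it checks this file first to learn which areas it may crawl and which it should avoid.

Location: https://www.example.com/robots.txt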

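Core Purpose: Access Control and Resource Optimization

1. Shield private areas from crawlers

Keep search engines from crawling sensitive pages such as admin panels and user data: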

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /user/profile/

2. Save crawl budget

Steer crawlers toward important content so they do not waste resources on low-value pages:

User-agent: *
Disallow: /tmp/
Disallow: /logs/
Allow: /news/

3. Avoid duplicate content

Keep crawlers away from duplicate pages such as print versions and URLs that carry session IDs:

Disallow: /*?print=yes
Disallow: /*?sessionid=

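File Syntax in Detail

Basic directives

User-agent: *          # applies to all crawlers
User-agent: Googlebot  # applies to Google only

Disallow: /path/       # block crawling
Allow: /path/file.html # allow crawling (the longer, more specific rule overrides Disallow)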
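
Whether an Allow overrides a Disallow comes down to precedence. Google's documented behavior and RFC 9309 pick the most specific (longest) matching rule, and Allow wins a tie. The sketch below illustrates that idea for plain path prefixes only; `is_allowed` and the rule tuples are hypothetical names made up for this example, not part of any library.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) tuples such as
    ("Disallow", "/path/"). Only plain prefixes are handled here;
    wildcards are out of scope for this sketch.
    """
    best_length = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_won_by_allow = len(prefix) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/path/"), ("Allow", "/path/file.html")]
print(is_allowed(rules, "/path/file.html"))   # longer Allow rule wins
print(is_allowed(rules, "/path/other.html"))  # only the Disallow matches
```

Not every crawler resolves conflicts this way, so it is worth checking the documentation of the crawlers that matter to you.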

Practical example

# Everyone: block the admin area and temporary files
User-agent: *
Disallow: /admin/
Disallow: /tmp/

# Google's image crawler: allow the whole site, so every image is crawlable
User-agent: Googlebot-Image
Allow: /

# A specific crawler: block the entire site
User-agent: BadBot
Disallow: /
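
To see how these groups apply to different crawlers, the rules can be loaded into Python's standard-library `urllib.robotparser`. This is a rough check rather than a faithful model of any search engine: the module does simple prefix matching, and the bot name `SomeBot` and the URLs are placeholders chosen for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/

User-agent: Googlebot-Image
Allow: /

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.example.com"
checks = [
    ("SomeBot", "/admin/"),                 # falls back to the * group
    ("SomeBot", "/blog/"),                  # no rule matches
    ("Googlebot-Image", "/admin/logo.png"), # has its own group
    ("BadBot", "/"),                        # blocked everywhere
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, BASE + path))
```

One thing the output makes visible: a crawler that has its own group follows only that group, so in this file Googlebot-Image is not bound by the `/admin/` rule from the `*` group.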

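🚨 Important Caveats

1. It is not a security tool

Misconception: Robots.txt protects sensitive data
Reality: the file itself is publicly readable, and anyone can still request the URLs it disallows
What to do instead: protect sensitive data with authentication such as a password; a robots meta tag only keeps a page out of search results and does not restrict access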

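2. Compliance is voluntary

Major search engines (Google, Bing) follow the rules
Malicious crawlers and scrapers may ignore them entirely
Do not rely on it as your only safeguard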

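Best Practices for Modern Websites

E-commerce site example

User-agent: *
Allow: /products/      # allow the product catalog
Allow: /categories/    # allow category pages
Disallow: /cart/       # block the shopping cart
Disallow: /checkout/   # block the checkout flow
Disallow: /search?     # block search result pages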
Sitemap: https://www.example.com/sitemap.xml
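
As a quick sanity check, a trimmed copy of the e-commerce rules can be parsed with Python's `urllib.robotparser` to confirm which paths are blocked and which sitemap is advertised. This is a sketch under assumptions: `AnyBot` is a placeholder crawler name, and `site_maps()` needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

SHOP_ROBOTS_TXT = """\
User-agent: *
Allow: /products/
Allow: /categories/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
"""

shop_parser = RobotFileParser()
shop_parser.parse(SHOP_ROBOTS_TXT.splitlines())

site = "https://www.example.com"
print(shop_parser.can_fetch("AnyBot", site + "/products/shoes"))  # product page
print(shop_parser.can_fetch("AnyBot", site + "/cart/"))           # shopping cart
print(shop_parser.site_maps())                                    # declared sitemaps
```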

Content site example

User-agent: *
Allow: /article/       # allow article pages
Disallow: /draft/      # block drafts
Disallow: /api/        # block API endpoints
Disallow: /*.pdf$      # block all PDF files
Sitemap: https://www.example.com/sitemap_index.xml

Advanced Features and Techniques

1. Pattern matching

# Block every URL that ends in .php
Disallow: /*.php$

# Block URLs that contain specific parameters
Disallow: /*?sort=
Disallow: /*&session=

# Allow specific file types
Allow: /*.jpg$
Allow: /*.png$
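
To reason about what these patterns match, it can help to translate them into regular expressions. The sketch below assumes the common wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a literal prefix). The helper names `robots_pattern_to_regex` and `matches` are invented for this example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, url_path):
    """Return True if the path (with query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


for pattern, url_path in [
    ("/*.php$", "/index.php"),
    ("/*.php$", "/index.php?id=1"),
    ("/*?sort=", "/list?sort=price"),
    ("/*?sort=", "/list?page=2&sort=price"),
    ("/*&session=", "/list?page=2&session=abc"),
]:
    print(pattern, url_path, matches(pattern, url_path))
```

The fourth case shows a gap worth knowing about: `/*?sort=` only catches `sort` when it is the first query parameter, so a second rule such as `/*&sort=` would be needed to cover the other position.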

2. Per-crawler control

# Google AdsBot (ad validation)
User-agent: AdsBot-Google
Allow: /

# Bing crawler
User-agent: bingbot
Crawl-delay: 2  # wait 2 seconds between requests

# Block an SEO crawler you do not want (AhrefsBot as the example)
User-agent: AhrefsBot
Disallow: /
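
A crawler written in Python could read these per-bot settings with the standard-library `urllib.robotparser`, whose `crawl_delay()` method returns the delay for a given user agent or `None` when none applies. The sketch below reuses the rules from this section; the example URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

PER_BOT_ROBOTS_TXT = """\
User-agent: AdsBot-Google
Allow: /

User-agent: bingbot
Crawl-delay: 2

User-agent: AhrefsBot
Disallow: /
"""

bot_parser = RobotFileParser()
bot_parser.parse(PER_BOT_ROBOTS_TXT.splitlines())

print(bot_parser.crawl_delay("bingbot"))    # delay declared for bingbot
print(bot_parser.crawl_delay("Googlebot"))  # no group applies to Googlebot
print(bot_parser.can_fetch("AhrefsBot", "https://www.example.com/"))
```

Crawl-delay is a non-standard directive, so how it is interpreted (or whether it is honored at all) varies by crawler.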

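🔧 Creating and Validating the File

Creation steps

Create robots.txt in a text editor
Place it in the site's web root (for example public/ or htdocs/)
Confirm it is reachable at example.com/robots.txt

Validation tools

Google Search Console: the robots.txt report, which replaced the older robots.txt Tester
Third-party validators: TechnicalSEO.com, SEOmatic
Command-line check: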
```
curl https://example.com/robots.txt
```

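Frequently Asked Questions

Q: Why do disallowed pages still show up in search results?

A: Robots.txt blocks crawling, not indexing. If other sites link to the page, a search engine can still learn that it exists and list the URL (without a snippet). To keep a page out of the index entirely, use <meta name="robots" content="noindex"> and leave the page crawlable: a crawler that robots.txt blocks never fetches the page, so it never sees the tag.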
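
When auditing pages, it can be handy to confirm that the noindex tag is really present in the HTML. The following is a minimal sketch using Python's standard `html.parser`; the class `RobotsMetaParser` and the function `has_noindex` are names made up for this example, and the sketch only inspects the meta tag, not HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # A content value may list several comma-separated directives.
        self.directives.extend(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def has_noindex(html):
    """Return True if the document carries a robots noindex directive."""
    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in meta_parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))
```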

Q: Should I block all crawlers?

A: Not unless the site is purely internal. A complete block looks like this:

User-agent: *
Disallow: /

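This stops compliant crawlers from fetching any page, so the site largely disappears from search results, which hurts traffic and visibility.

Q: Do multiple Disallow rules need separate lines?

A: Yes, one path per line:

# Correct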
Disallow: /admin/
Disallow: /tmp/
Disallow: /private/

# Wrong
Disallow: /admin/ /tmp/ /private/
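
This mistake is easy to catch automatically. The sketch below is a tiny, hypothetical lint (the function name `find_multi_path_rules` is invented here) that flags any Allow or Disallow line whose value contains more than one whitespace-separated path.

```python
def find_multi_path_rules(robots_txt):
    """Return (line_number, line) pairs for rules that list several paths."""
    problems = []
    for number, raw_line in enumerate(robots_txt.splitlines(), start=1):
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() in ("allow", "disallow") and len(value.split()) > 1:
            problems.append((number, raw_line.strip()))
    return problems


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /admin/ /tmp/ /private/
"""
print(find_multi_path_rules(sample))
```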

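🌟 Best Practices Summary

Keep the file lean: avoid piling up rules that hurt crawler efficiency
Review regularly: update robots.txt whenever the site structure changes
Combine with other methods: use authentication for anything that truly needs protection
Include a sitemap: help search engines discover content
Test before going live: verify what the rules do with a tool
Monitor logs: check what crawlers actually request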
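
For the log-monitoring point, a first pass can be as simple as counting requests per crawler in the access log. The sketch below assumes a combined-style log where the user agent appears in each line; the sample lines, IP addresses, and the `count_crawler_hits` helper are all made up for illustration. User-agent strings can be spoofed, so treat the counts as a rough signal.

```python
from collections import Counter

CRAWLER_TOKENS = ("Googlebot", "bingbot", "AhrefsBot")


def count_crawler_hits(log_lines, tokens=CRAWLER_TOKENS):
    """Count log lines per crawler token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to one crawler at most
    return hits


SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2024:13:55:36 +0000] "GET /news/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.11 - - [10/Oct/2024:13:55:40 +0000] "GET /admin/ HTTP/1.1" 403 128 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '192.0.2.10 - - [10/Oct/2024:13:56:02 +0000] "GET /article/1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.12 - - [10/Oct/2024:13:56:10 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

hits = count_crawler_hits(SAMPLE_LOG)
print(dict(hits))
```

Comparing the requested paths against the Disallow rules would be a natural next step for spotting crawlers that ignore the file.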

Real-World Examples

GitHub's robots.txt (abridged)

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/advanced
Disallow: /stars
Disallow: /*/stars
Sitemap: https://github.com/sitemap.xml

A news site configuration

User-agent: *
Allow: /article/
Allow: /news/
Disallow: /print-article/  # print versions
Disallow: /amp/            # AMP pages are handled separately
Disallow: /user/*/edit     # user edit pages
Crawl-delay: 1             # ease server load (Googlebot ignores Crawl-delay)
Sitemap: https://news.example.com/sitemap-news.xml

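Robots.txt is a foundational tool for SEO and crawler management. Used correctly, it steers search engines, keeps crawlers out of areas that do not belong in search, and makes crawling more efficient. It is only the first step in how a site interacts with search engines, though, and works best alongside other techniques such as meta tags, HTTP headers, and sitemaps.

Remember: a good robots.txt strategy is a balancing act. Let search engines see what they should see, and keep them away from what was never meant for search.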