[Rust Daily] 2022-05-06 - Building a crawler in Rust: scraping and parsing HTML

Posted by EAHITechnology on 2022-05-06 21:38

Tags: rust, daily

Building a crawler in Rust: scraping and parsing HTML

This article explains how to build a crawler in Rust that scrapes and parses HTML.

https://kerkour.com/rust-crawler-scraping-and-parsing-html

Bugs that the Rust compiler catches for you

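As it turns out, for decades we have been bad at shipping bug-free programs. Attempts to find a "silver bullet" for program logic seem doomed to fail. Code review is a reasonably good remedy, and the practice is spreading, especially as open-source culture becomes dominant, but the situation is still not great, because reviews cost a lot of time and money. What if, instead, we had a partner that is always available, never gets tired, and, as icing on the cake, does not cost a developer's salary, and that helps us catch bugs before the software reaches production? Let's look at how modern compilers and type systems help prevent many bugs, improving safety for everyone and lowering the cost of producing and maintaining software.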

Forgetting to close a file or connection (Go example):

resp, err := http.Get("http://kerkour.com")
if err != nil {
    // ...
}
// defer resp.Body.Close() // DON'T forget this line

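Rust enforces RAII (Resource Acquisition Is Initialization), which makes leaking resources nearly impossible: they are closed automatically when they are dropped.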

use std::fs::File;

let wordlist_file = File::open("wordlist.txt")?;
// do something...

// we don't need to close wordlist_file
// it will be closed when the variable goes out of scope
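
To make the drop order visible, here is a minimal sketch using a hypothetical `Connection` type (not from the article) whose `Drop` implementation appends to a shared log. It only illustrates when cleanup runs; the standard `File` type releases its handle in its own `Drop` implementation in the same spirit.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical resource that records its own cleanup in a shared log.
struct Connection {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Connection {
    // Runs automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("closed {}", self.name));
    }
}

/// Opens two connections in nested scopes and returns the order of events.
fn run_scopes() -> Vec<String> {
    let log: Rc<RefCell<Vec<String>>> = Rc::new(RefCell::new(Vec::new()));
    {
        let _outer = Connection { name: "outer", log: Rc::clone(&log) };
        {
            let _inner = Connection { name: "inner", log: Rc::clone(&log) };
            log.borrow_mut().push("using both".to_string());
        } // _inner is dropped here
        log.borrow_mut().push("using outer".to_string());
    } // _outer is dropped here
    let events = log.borrow().clone();
    events
}
```

Calling `run_scopes()` returns the events in the order "using both", "closed inner", "using outer", "closed outer": each resource is cleaned up at the end of its own scope, with no explicit close call.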

Unreleased mutexes:

type App struct {
  mutex sync.Mutex
  data  map[string]string
}

func (app *App) DoSomething(input string) {
  app.mutex.Lock()
  defer app.mutex.Unlock()
  // do something with data and input
}

So far so good. But when we want to process many items, things can go wrong very quickly:

func (app *App) DoManyThings(input []string) {
  for _, item := range input {
      app.mutex.Lock()
      defer app.mutex.Unlock()
      // do something with data and item
  }
}

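We have just created a deadlock: the deferred Unlock does not run at the end of each iteration as one might expect, but only when the function returns, so the second iteration blocks on a mutex that is still held. Again, RAII in Rust helps prevent unreleased mutexes: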

for item in input {
  let _guard = mutex.lock().expect("locking mutex");
  // do something
  // mutex is released here as _guard is dropped
}
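
A fuller version of that loop might look like the sketch below. The `App` struct, its `data` map of counters, and the method names are assumptions made for illustration, loosely mirroring the Go `App` type; the point is only that the guard returned by `lock()` is dropped at the end of every iteration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical counterpart of the Go `App` struct.
struct App {
    data: Mutex<HashMap<String, usize>>,
}

impl App {
    fn new() -> Self {
        App { data: Mutex::new(HashMap::new()) }
    }

    /// Locks once per item and counts how often each item was seen.
    fn do_many_things(&self, input: &[&str]) {
        for item in input {
            let mut guard = self.data.lock().expect("locking mutex");
            *guard.entry(item.to_string()).or_insert(0) += 1;
            // guard is dropped here, releasing the mutex before the next iteration
        }
    }

    /// Reads a counter; the lock is released when the function returns.
    fn count(&self, key: &str) -> usize {
        let guard = self.data.lock().expect("locking mutex");
        guard.get(key).copied().unwrap_or(0)
    }
}
```

If the guard were still held on the second iteration, the second `lock()` call could never succeed; because it is released each time, `do_many_things(&["a", "b", "a"])` completes and `count("a")` then returns 2.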

Missing switch cases:

Suppose we are tracking the status of products in an online shop:

const (
  StatusUnknown   Status = 0
  StatusDraft     Status = 1
  StatusPublished Status = 2
)

switch status {
    case StatusUnknown:
        // ...
    case StatusDraft:
        // ...
    case StatusPublished:
        // ...
}

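But if we add a StatusArchived Status = 3 constant and forget to update this switch statement, the compiler still happily accepts the program and lets us introduce a bug. In Rust, a non-exhaustive match produces a compile-time error: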

use std::fmt;

#[derive(Debug, Clone, Copy)]
enum Platform {
    Linux,
    MacOS,
    Windows,
    Unknown,
}

impl fmt::Display for Platform {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Platform::Linux => write!(f, "Linux"),
            Platform::MacOS => write!(f, "macOS"),
            // Compile time error! We forgot Windows and Unknown
        }
    }
}
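
For comparison, a version that the compiler accepts could look like the following sketch. The display strings for `Windows` and `Unknown` are assumptions, since the snippet above never reaches them.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
enum Platform {
    Linux,
    MacOS,
    Windows,
    Unknown,
}

impl fmt::Display for Platform {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Every variant is listed and there is no `_` arm, so adding a new
        // variant later turns this match into a compile-time error again.
        let name = match self {
            Platform::Linux => "Linux",
            Platform::MacOS => "macOS",
            Platform::Windows => "Windows",
            Platform::Unknown => "unknown",
        };
        write!(f, "{}", name)
    }
}
```

Leaving out the catch-all `_` arm is a deliberate choice here: a wildcard would silence the compiler when a new variant is added, which is exactly the situation the Go `switch` example runs into.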

Invalid pointer dereference:

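As far as I know, it is not possible to create a reference to an invalid address in safe Rust. In Go, by contrast, a pointer field may be nil, and its intent is ambiguous: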

type User struct {
    // ...
    Foo *Bar // is it intended to be used as a pointer, or as an optional field?
}

Even better, you do not need null pointers to represent the absence of something, because Rust has the Option enum.

struct User {
    // ...
    foo: Option<Bar>, // it's clear that this field is optional
}
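
As a small sketch of what this buys in practice: the compiler will not let code read the inner `Bar` without first handling the `None` case. The `Bar` fields, the `name` field, and the `foo_label` helper below are hypothetical, added only to make the example self-contained.

```rust
/// Hypothetical stand-in for the `Bar` type.
#[derive(Debug, Clone, PartialEq)]
struct Bar {
    label: String,
}

struct User {
    name: String,
    foo: Option<Bar>, // optional by construction
}

/// Describes the user's `foo`, with a fallback when it is absent.
fn foo_label(user: &User) -> String {
    // Both arms are required; omitting `None` is a compile-time error.
    let label = match &user.foo {
        Some(bar) => bar.label.clone(),
        None => String::from("(none)"),
    };
    format!("{}: {}", user.name, label)
}
```

There is no way to reach `bar.label` on a missing value here, so the equivalent of a nil pointer dereference cannot occur in this function.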

Uninitialized variables:

Suppose we are handling user accounts:

type User struct {
  ID          uuid.UUID
  CreatedAt   time.Time
  UpdatedAt   time.Time
  Email       string
}

func (app *App) CreateUser(email string) {
    // ...
    now := time.Now().UTC()

    user := User {
      ID: uuid.New(),
      CreatedAt: now,
      UpdatedAt: now,
      Email: email,
    }
    err = app.repository.CreateUser(app.db, user)
    // ...
}

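Good. But now we need to add an AllowedStorage int64 field to the User struct. If we forget to update the CreateUser function, the compiler still happily accepts the unchanged code and uses the zero value of int64, 0, which is probably not what we want.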

The following Rust code, by contrast, produces a compile-time error:

use chrono::{DateTime, Utc};

struct User {
    id: uuid::Uuid,
    created_at: DateTime<Utc>,
    updated_at: DateTime<Utc>,
    email: String,
    allowed_storage: i64,
}

fn create_user(email: String) {
    let now = Utc::now();
    let user = User {
        id: uuid::Uuid::new_v4(),
        created_at: now,
        updated_at: now,
        email,
        // we forgot to update the function to initialize allowed_storage
    };
}
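
A compiling counterpart might look like this sketch. To keep it free of external crates, it swaps in standard-library stand-ins (`u64` for `uuid::Uuid`, `SystemTime` for `DateTime<Utc>`), and the `DEFAULT_ALLOWED_STORAGE` constant is a hypothetical value chosen for illustration.

```rust
use std::time::SystemTime;

// Standard-library stand-ins: `u64` replaces `uuid::Uuid` and `SystemTime`
// replaces `DateTime<Utc>` so the sketch needs no external crates.
struct User {
    id: u64,
    created_at: SystemTime,
    updated_at: SystemTime,
    email: String,
    allowed_storage: i64,
}

// Hypothetical default quota, in bytes.
const DEFAULT_ALLOWED_STORAGE: i64 = 1_000_000_000;

fn create_user(id: u64, email: String) -> User {
    let now = SystemTime::now();
    User {
        id,
        created_at: now,
        updated_at: now,
        email,
        // Removing this line brings back the missing-field error (E0063).
        allowed_storage: DEFAULT_ALLOWED_STORAGE,
    }
}
```

Because a struct literal must name every field, the value of `allowed_storage` is an explicit decision at each construction site rather than a silent zero.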

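These are only some examples. So are smart compilers the end of bugs and code review? Of course not! But a strong type system and the compiler that enforces it are the weapon of choice for anyone who wants to drastically reduce the number of bugs in their software and keep users and customers happy.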

https://kerkour.com/bugs-rust-compiler-helps-prevent

From the Daily team: 侯盛鑫坏姐姐

Comments

eweca-d 2022-05-06 21:50

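I have written crawlers with the reqwest + scraper combination and found it quite comfortable: it is not hard to write, and the API documentation is fairly complete.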
