nginx upstream读写超时问题定位

vlambda
2022-04-14

nginx upstream读写超时问题定位

前几天在生产环境发现了一个问题，存在负载的情况下，当upstream服务器reload时（upstream也是一个nginx服务），下面都用nginx代指连接过程中的client，upstream代指连接过程中的server。nginx会报一些error日志，主要的日志有如下三种：

[error] 46478#0: *24814949 upstream timed out (110: Connection timed out) while reading response header from upstream

[error] 46478#0: *24814949 connect() failed (110: Connection timed out) while connecting to upstream

[error] 127715#0: *2830183 no live upstreams while connecting to upstream

nginx配置如下：

upstream backend {
  server x.x.x.x weight=1;
  server x.x.x.x weight=1;
  keepalive 256;
}

http {
    default_type  application/octet-stream;
    keepalive_timeout  300s;
    client_header_timeout 5m;
    send_timeout  1m;
    proxy_request_buffering  off;
    proxy_buffering  off;
    proxy_buffer_size  16k;
    proxy_headers_hash_max_size 40000;
    proxy_headers_hash_bucket_size 128;
    proxy_connect_timeout  5s;
    subrequest_output_buffer_size 128k;
......

Upstream读超时

即第一种报错，通过 debug日志以及在nginx中搜索打印错误信息的地方，发现在报错处在 ngx_http_upstream_next函数：

static void
ngx_http_upstream_next(ngx_http_request_t *r, ngx_http_upstream_t *u,
    ngx_uint_t ft_type)
{
    ngx_msec_t  timeout;
    ngx_uint_t  status, state;
  ...
    
  if (ft_type == NGX_HTTP_UPSTREAM_FT_TIMEOUT) {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, NGX_ETIMEDOUT,
                      "upstream timed out");
    }
  ...
}

在此处打断点，用 gdb打印下调用栈发现调用此函数的地方在 ngx_http_upstream_process_header()这是nginx处理上游响应头的地方。

gdb调试

static void
ngx_http_upstream_process_header(ngx_http_request_t *r, ngx_http_upstream_t *u)
{
    ssize_t            n;
    ngx_int_t          rc;
    ngx_connection_t  *c;

    c = u->peer.connection;

    ngx_log_debug0(NGX_LOG_DEBUG_HTTP, c->log, 0,
                   "http upstream process header");

    c->log->action = "reading response header from upstream";
    
    
    if (c->read->timedout) {
        ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_TIMEOUT);
        return;
    }

c->read->timedout为 1代表着读超时，其流程如下：nginx调用 ngx_http_upstream_init_request进行upstream初始化，在其中会调用 ngx_http_upstream_connect对上游服务器进行连接，连接成功后会调用 ngx_http_upstream_send_request发送请求到上游服务器，如果发送成功，会将读事件添加到定时器中，在一定时间内进行读超时事件处理，定时器的超时时间就是 proxy_read_timeout

static void
ngx_http_upstream_send_request(ngx_http_request_t *r, ngx_http_upstream_t *u,
    ngx_uint_t do_write)
{
    ngx_int_t          rc;
    ngx_connection_t  *c;

    c = u->peer.connection;
    ...
    rc = ngx_http_upstream_send_request_body(r, u, do_write);
    
  ...
  if (!u->request_body_sent) {
        u->request_body_sent = 1;

        if (u->header_sent) {
            return;
        }
    
    //这里gdb打印出来的read_timeout值为1s
        ngx_add_timer(c->read, u->conf->read_timeout);
    ...
    }

上面定时器的设置的回调处理c->read是在 ngx_http_upstream_connect函数中设置的，并且调用 ngx_event_connect_peer进行实际的连接操作：

void
ngx_http_upstream_connect(ngx_http_request_t *r, ngx_http_upstream_t *u)
{
    ngx_int_t          rc;
    ngx_connection_t  *c;
  ...
  rc = ngx_event_connect_peer(&u->peer);
  
  ...
  
  //这里设置了连接读、写事件的回调函数
  c->write->handler = ngx_http_upstream_handler;
    c->read->handler = ngx_http_upstream_handler;

    u->write_event_handler = ngx_http_upstream_send_request_handler;
    u->read_event_handler = ngx_http_upstream_process_header;
  ...
}

nginx在 ngx_event_connect_peer中会将建立连接的 socket通过 ngx_epoll_add_connection 加入到 epoll监听中，并注册 EPOLLIN|EPOLLOUT|EPOLLET|EPOLLRDHUP监听事件：

ngx_int_t
ngx_event_connect_peer(ngx_peer_connection_t *pc)
{
  ...
    //这里的的函数为：ngx_epoll_add_connection
  if (ngx_add_conn) {
        if (ngx_add_conn(c) == NGX_ERROR) {
            goto failed;
        }
    }
  ...
}

nginx使用 ngx_event_expire_timers检查定时器中的所有事件，这就意味着，当u->conf->read_timeout时间后，nginx会在 ngx_event_expire_timers 中将 c->read的 timeout置为1，并调用 c->read->handler进行处理，即调用 ngx_http_upstream_handler：

static void
ngx_http_upstream_handler(ngx_event_t *ev)
{
    ngx_connection_t     *c;
    ngx_http_request_t   *r;
    ngx_http_upstream_t  *u;
  ...
  
  //如果是写事件，则调用ngx_http_upstream_send_request_handler
  if (ev->write) {
        u->write_event_handler(r, u);

    } else {
    //如果是读事件，调用ngx_http_upstream_process_header
        u->read_event_handler(r, u);
    }

  ...
}

所以，经过以上流程，读事件触发时，会调用 ngx_http_upstream_process_header处理响应头，当发现已经超时了 epoll监听中还没有对应的读事件到达，则就报一开始的 upstream timed out错误。在上面 ngx_http_upstream_send_request中打印的 u->conf->read_timeout为1s，这个值有些低，而在nginx.conf中没找到显式的 proxy_read_timeout定义，根据nginx文档，这个值默认为60s，有些奇怪：

# http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
**Syntax**:   proxy_read_timeout time;
**Default**: proxy_read_timeout 60s;
**Context**: http, server, location

最后仔细查找，在nginx.conf include的某个配置文件中找到了 proxy_read_timeout定义值为1s

Upstream连接超时

第二种报错是在 ngx_http_upstream_test_connect中报的：

static ngx_int_t
ngx_http_upstream_test_connect(ngx_connection_t *c)
{
    int        err;
    socklen_t  len;
  ...

  /*
     * BSDs and Linux return 0 and set a pending error in err
     * Solaris returns -1 and sets errno
     */
  
    if (getsockopt(c->fd, SOL_SOCKET, SO_ERROR, (void *) &err, &len)
        == -1)
    {
        err = ngx_socket_errno;
    }
  
  **//这里打印发现err为110**
    if (err) {
        c->log->action = "connecting to upstream";
        (void) ngx_connection_error(c, err, "connect() failed");
        return NGX_ERROR;
    }

}

经过 gdb打印调用栈发现其是在 ngx_http_upstream_send_request中调用的，是在向上游服务器发送请求：

static void
ngx_http_upstream_send_request(ngx_http_request_t *r, ngx_http_upstream_t *u,
    ngx_uint_t do_write)
{
    ngx_int_t          rc;
    ngx_connection_t  *c;

    c = u->peer.connection;
  ...

    if (!u->request_sent && ngx_http_upstream_test_connect(c) != NGX_OK) {
        ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_ERROR);
        return;
    }

  ...
}

这里的过程是这样的：在upstream完成和上游的连接后（在 ngx_http_upstream_connect中调用完 ngx_event_connect_peer），会将写事件添加到定时器中，超时时间就是nginx.conf中 proxy_connect_timeout配置项：

void
ngx_http_upstream_connect(ngx_http_request_t *r, ngx_http_upstream_t *u)
{
  ...
  rc = ngx_event_connect_peer(&u->peer);
  ...
  if (rc == NGX_AGAIN) {
        ngx_add_timer(c->write, u->conf->connect_timeout);
        return;
    }

  ...
}

这里值得一提的是，对于非阻塞的 connect()系统调用，返回 -1并不必然代表连接失败，主要看设置的 errno，如果为 EINPROGRESS，则表示连接还在进行中，才有后续添加写超时定时器的操作。

ngx_int_t
ngx_event_connect_peer(ngx_peer_connection_t *pc)
{
  ...
  if (rc == -1) {
        err = ngx_socket_errno;

        if (err != NGX_EINPROGRESS
      ...
      return NGX_DECLINED
    }
  }

  //errno为EINPROGRESS
  if (ngx_add_conn) {
        if (rc == -1) {

            /* NGX_EINPROGRESS */

            return NGX_AGAIN;
        }
    ...
    }
  ...
}

我们回到定位问题的主线，上面在 ngx_http_upstream_test_connect()中打印 socket的 errno为 110，我们在 /usr/include/asm-generic/errno.h中我们可以看到完整的 errno定义：

...
#define ETIMEDOUT   110 /* Connection timed out */
#define ECONNREFUSED    111 /* Connection refused */
...
#define EINPROGRESS 115 /* Operation now in progress */
...

这里可以看到 110代表着连接超时，我这里以为是nginx配置中的 proxy_connect_timeout时间配置的太短了，导致触发了写超时事件，于是调大了这个值，但是发现没有用，依旧会报同样的错误。这里就需要审视一下， getsockopt(c->fd, SOL_SOCKET, SO_ERROR, (void *) &err, &len)这个操作到底是在获取什么错误。看下 connect的 man page这里说了什么：

EINPROGRESS
  The  socket  is  nonblocking  and the connection cannot be completed immediately.  
 It is possible to select(2) or poll(2) for completion by selecting the socket 
 for writing.  After select(2) indicates writability, use getsockopt(2) to read 
 the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully
  (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed
here, explaining the reason for the failure).

这里的意思就是，对于非阻塞的 connect操作，在可写事件触发后，还需要调用 getsockopt检查原本的 socket有没有出现错误，这一点决定是连接是否建立成功。这里的关键点是，这个连接操作是由操作系统内核完成的，不在nginx参数调控范围内，所以调大 proxy_connect_timeout对此无效。

于是想到复现问题并且 tcpdump抓包看下握手期间发生了什么：

upstream抓包

找到一个报错的连接，握手过程如下

nginx向upstream发送一个 SYN，并且等待upstream的 ACK回复（就是调用 connect()）
upstream收到了nginx的 SYN，但是没有回复
1s后，nginx又向upstream发送了一个 SYN，等待回复，这个时候已经触发了tcp的超时重传
upstream收到了重传的 SYN，但是依旧没有回复

连接在此就断开了，nginx没有继续尝试重连，然后打印出来 connect() failed错误日志

这里有两个问题

upstream在reload期间，无法建立新的连接，导致连接阻塞
nginx等待1s后，重传 SYN包进行尝试连接，但只尝试了一次

对于第一个问题，在upstream reload时，立即在nginx上访问upstream服务并记录时间，发现upstream reload成功到能够建立新的连接，大概是6秒左右：

date; curl x.x.x.x; date;

之所以在 reload期间不能建立新的连接，是因为nginx reload的过程是建立新的worker进程，这个时候老的worker进程不再接受新连接，等新的worker建立成功后才能建立新的连接，当处理完老的连接后，老的woker也就退出了。这里upstream reload时间过长，也是一个需要解决的问题，但是是另一类问题了。

对于第二个问题，重传次数其实是由内核参数 net.ipv4.tcp_syn_retries规定的，查询下nginx服务器上这个参数的配置： sysctl -a | grep net.ipv4.tcp_syn_retries，发现值为1，这和抓包看到的现象是一致的。这个值需要改大一些，改成多少合适呢，需要做下实验，找一个访问不通的ip，比如 www.google.com的服务器

# tcp_syn_retries为1时是3s左右
$ date; curl 185.45.7.189; date
 Wed Apr 13 13:44:40 CST 2022
 curl: (7) Failed connect to 185.45.7.189:80; Connection timed out
 Wed Apr 13 13:44:43 CST 2022

# tcp_syn_retries为2时是8s左右
$ date; curl 185.45.7.189; date    
Wed Apr 13 13:45:47 CST 2022
curl: (7) Failed connect to 185.45.7.189:80; Connection timed out
Wed Apr 13 13:45:55 CST 2022

# tcp_syn_retries为3时是15s左右
date; curl 185.45.7.189; date    
Wed Apr 13 13:46:54 CST 2022
curl: (7) Failed connect to 185.45.7.189:80; Connection timed out
Wed Apr 13 13:47:09 CST 2022

......

实验的结果表示， tcp_syn_retries设置成2或3都可以，于是把这个值改成2试试： sysctl net.ipv4.tcp_syn_retries=2，修改后测试发现依旧会出现连接超时的问题，需要注意，这里还需要把 proxy_connect_timeout调大，这样才能避免，在触发内核重传阈值之前触发nginx的写超时事件。

no live upstreams

这个错误是在连接upstream的时候报的：

void
ngx_http_upstream_connect(ngx_http_request_t *r, ngx_http_upstream_t *u)
{
  ...
  rc = ngx_event_connect_peer(&u->peer);
  ...
  if (rc == NGX_BUSY) {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "no live upstreams");
        ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_NOLIVE);
        return;
    }
}

可以看到，当 rc为 NGX_BUSY时，会报这个错误， NGX_BUSY实际上是在 ngx_peer_connection_t *pc尝试选择一个upstream时，发现无有效的upstream。这里调用 pc->get()，这个 get是一个函数指针，其指向的值是在 ngx_http_upstream_init_round_robin_peer中设置的，而后者是在初始化upstream时，在 ngx_http_upstream_init_request中调用的。

ngx_int_t
ngx_event_connect_peer(ngx_peer_connection_t *pc)
{
    ...
  
  //这里的get方法是ngx_http_upstream_get_round_robin_peer，即轮询
    rc = pc->get(pc, pc->data);
    if (rc != NGX_OK) {
        return rc;
    }
  ...
}

ngx_int_t
ngx_http_upstream_init_round_robin_peer(ngx_http_request_t *r,
    ngx_http_upstream_srv_conf_t *us)
{
    ngx_uint_t                         n;
    ngx_http_upstream_rr_peer_data_t  *rrp;
  ...

  r->upstream->peer.get = ngx_http_upstream_get_round_robin_peer;
    r->upstream->peer.free = ngx_http_upstream_free_round_robin_peer;
    r->upstream->peer.tries = ngx_http_upstream_tries(rrp->peers);
  ...
}

ngx_int_t
ngx_http_upstream_get_round_robin_peer(ngx_peer_connection_t *pc, void *data)
{
  ...
  peer = ngx_http_upstream_get_peer(rrp);
  if (peer == NULL) {
        goto failed;
    }
  ...
  failed:
    ...
    return NGX_BUSY;
}

nginx负载均衡的具体逻辑这里我们不深究了，总之原因和上面第二种错误相关，两个upstream会几乎同时reload，这个时候nginx发现无有效可用upstream，就会报 no live upstream那个错误。

总结

upstream timed out while reading...这种错误的直接原因是 proxy_read_timeout配置设置太小
connect() failed (110: Connection timed out) while connecting to upstream这种错误的直接原因是内核参数 net.ipv4.tcp_syn_retries太小、没能在 proxy_connect_timeout时间内建立起有效连接
no live upstream这种错误的直接原因和第二种相关，是负载均衡发现无可用upstream导致的
上面所有错误的更近一层的原因是upstream reload时间过长，导致upstream太长时间不能建立新连接