
Analysis of the epoll Kernel Source

Overview of epoll

epoll is one of the I/O multiplexing mechanisms in Linux. I/O multiplexing lets one process monitor many descriptors through a single mechanism: as soon as any descriptor becomes ready (typically readable or writable), the program is notified so it can perform the corresponding read or write. epoll is not the only multiplexing mechanism in Linux; select and poll exist as well, but what follows focuses on the kernel implementation of epoll.

Introductions to the epoll API abound online and are not my focus here, but a brief review is still worthwhile. The interface is very simple, just three functions; the following summary is adapted from common online references:


  1. int epoll_create(int size); Creates an epoll handle. size tells the kernel roughly how many descriptors will be monitored. This parameter differs from the first parameter of select(), which gives the maximum monitored fd plus one. Note that once the epoll handle is created, it occupies an fd itself; on Linux you can see this fd under /proc/<pid>/fd/. So when you are done with epoll you must call close() on it, or fds may eventually be exhausted.



  2. int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); epoll's event-registration function. Unlike select(), which is told at wait time which event types to monitor, here the events of interest are registered in advance. The first parameter is the return value of epoll_create(); the second is the operation, expressed with one of three macros: EPOLL_CTL_ADD registers a new fd with epfd, EPOLL_CTL_MOD modifies the monitored events of an already-registered fd, and EPOLL_CTL_DEL removes an fd from epfd. The third parameter is the fd to monitor, and the fourth tells the kernel which events to monitor; struct epoll_event is defined as follows:


 
   
   
 
    struct epoll_event {
        __uint32_t events;  /* Epoll events */
        epoll_data_t data;  /* User data variable */
    };

events can be a bitwise OR of the following macros:

  • EPOLLIN: the fd is readable (including an orderly shutdown of the peer socket);
  • EPOLLOUT: the fd is writable;
  • EPOLLPRI: the fd has urgent data to read (i.e. out-of-band data has arrived);
  • EPOLLERR: an error occurred on the fd;
  • EPOLLHUP: the fd was hung up;
  • EPOLLET: use edge-triggered mode for this fd, as opposed to the default level-triggered mode;
  • EPOLLONESHOT: monitor the event only once; to keep monitoring the socket after the event fires, it must be added to the epoll set again.
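The level-triggered versus edge-triggered distinction can be observed with a small experiment. The sketch below is my own illustration (not from the original article), using a pipe rather than a socket: it polls the same fd twice without consuming the data. Under level-trigger the second wait reports the event again; under edge-trigger it does not.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register rfd with the given events, write one byte to wfd, then call
 * epoll_wait twice (non-blocking) without reading the data. Returns the
 * event count of the second wait: 1 under level-trigger (data still
 * unread), 0 under edge-trigger (no new edge). Returns -1 on error. */
int report_count_on_second_wait(int rfd, int wfd, unsigned int events)
{
    struct epoll_event ev = { .events = events, .data.fd = rfd };
    struct epoll_event out;
    int epfd = epoll_create1(0);
    int n = -1;

    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, rfd, &ev) == 0 &&
        write(wfd, "x", 1) == 1 &&
        epoll_wait(epfd, &out, 1, 0) == 1)  /* first wait sees the event */
        n = epoll_wait(epfd, &out, 1, 0);   /* second wait: LT=1, ET=0 */
    close(epfd);
    return n;
}
```

With EPOLLIN alone the unread data keeps the fd ready; with EPOLLIN | EPOLLET nothing is reported until new data arrives.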

  3. int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); Waits for events, similar to a select() call. events receives the set of ready events from the kernel, and maxevents tells the kernel how large that buffer is; maxevents must not exceed the size passed to epoll_create() (note: in the 4.1.2 kernel the size argument of epoll_create is effectively unused). timeout is the timeout in milliseconds: 0 returns immediately, and a negative value blocks indefinitely. The function returns the number of events to handle; 0 means the call timed out.
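Putting the three calls together, here is a minimal helper that waits for a single fd to become readable. This is my own illustrative sketch (the function name and error handling are assumptions, not from the article), but it uses only the three documented calls:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    struct epoll_event out;
    int epfd, n;

    epfd = epoll_create(1);            /* size is ignored but must be > 0 */
    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);                       /* the epoll handle is itself an fd */
    return n;
}
```

In real code the epoll fd would be created once and reused across many epoll_wait calls; creating it per call here just keeps the sketch self-contained.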

Advantages of epoll over select/poll:


  • Every call to select/poll must pass the entire set of monitored fds to the kernel (so the fd list is copied from user space to kernel space on every call, which is inefficient when there are many fds). A call to epoll_wait (the counterpart of a select/poll call) does not need to pass the fd list again, because the fds to monitor were already handed to the kernel via epoll_ctl (and epoll_ctl does not copy all fds each time; it performs incremental updates). In other words, after epoll_create the kernel maintains an in-kernel data structure holding the fds to monitor, and each epoll_ctl call merely performs simple maintenance on that structure.



  • A fatal weakness of select/poll: when you have a large socket set but, due to network latency, only a fraction of the sockets are "active" at any moment, select/poll still scans the entire set linearly on every call, so efficiency degrades linearly. epoll has no such problem; it only operates on the "active" sockets, because the kernel implementation of epoll is driven by a callback registered on each fd.



  • Even if you stuff a million fds in via epoll_ctl, epoll_wait still returns quickly and efficiently hands the fds with pending events to user space. This is because when we call epoll_create, besides creating a file node in the epoll filesystem and a red-black tree in the kernel cache to store the fds later passed in by epoll_ctl, the kernel also sets up a list to hold ready events. When epoll_wait is called, it merely checks whether this ready list contains anything: if it does, it returns; if not, it sleeps, and once the timeout expires it returns even if the list is still empty. So epoll_wait is very efficient. Moreover, even when monitoring on the order of a million fds, a single call usually returns only a small number of ready fds, so epoll_wait only has to copy a few fds from kernel space to user space. How is this ready list maintained? When epoll_ctl runs, besides placing the fd on the red-black tree hanging off the epoll file object, it also registers a callback with the kernel's interrupt handling path, telling the kernel: when this fd's interrupt fires, put it on the ready list. So when data arrives on an fd (say a socket), the kernel copies the data from the device (say the NIC) into the kernel and then inserts that fd (socket) into the ready list.
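The behavior described above (many registered fds, few ready, epoll_wait returning just the ready ones) is easy to observe from user space. The sketch below is my own illustration using pipes, not part of the article:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register nfds pipe read-ends with a fresh epoll instance, make exactly
 * one of them readable, and return the fd that epoll_wait reports
 * (or -1 on error / nothing ready). rfds/wfds are parallel arrays of
 * pipe read/write ends; 'which' selects the pipe to write to. */
int ready_fd_among(int *rfds, int *wfds, int nfds, int which)
{
    struct epoll_event ev, out;
    int epfd = epoll_create(1);
    int i, result = -1;

    if (epfd < 0)
        return -1;
    for (i = 0; i < nfds; i++) {
        ev.events = EPOLLIN;
        ev.data.fd = rfds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, rfds[i], &ev) < 0)
            goto done;
    }
    if (write(wfds[which], "x", 1) != 1)
        goto done;
    /* Only the single ready fd comes back, however many are registered. */
    if (epoll_wait(epfd, &out, 1, 0) == 1)
        result = out.data.fd;
done:
    close(epfd);
    return result;
}
```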


Source Code Analysis

The epoll-related kernel code lives in fs/eventpoll.c. Below we analyze, in turn, the kernel implementations of the epoll_create, epoll_ctl and epoll_wait functions. The Linux kernel source used for this analysis is version 4.1.2.

epoll_create

epoll_create creates an epoll handle; its system-call implementation in the kernel is as follows:

sys_epoll_create:

 
   
   
 
    SYSCALL_DEFINE1(epoll_create, int, size)
    {
        if (size <= 0)
            return -EINVAL;

        return sys_epoll_create1(0);
    }

As we can see, the size argument passed to epoll_create is only checked for being less than or equal to zero and serves no other purpose. The whole function is just three lines; the real work happens in sys_epoll_create1.
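This check is directly observable from user space. A tiny sketch of my own: a non-positive size fails with EINVAL, while any positive size succeeds regardless of magnitude.

```c
#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* Returns 0 if epoll_create(size) succeeds, otherwise the errno it set. */
int try_epoll_create(int size)
{
    int fd = epoll_create(size);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```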

sys_epoll_create -> sys_epoll_create1:

 
   
   
 
    /*
     * Open an eventpoll file descriptor.
     */
    SYSCALL_DEFINE1(epoll_create1, int, flags)
    {
        int error, fd;
        struct eventpoll *ep = NULL;
        struct file *file;

        /* Check the EPOLL_* constant for consistency. */
        BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

        if (flags & ~EPOLL_CLOEXEC)
            return -EINVAL;
        /*
         * Create the internal data structure ("struct eventpoll").
         */
        error = ep_alloc(&ep);
        if (error < 0)
            return error;
        /*
         * Creates all the items needed to setup an eventpoll file. That is,
         * a file structure and a free file descriptor.
         */
        fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
        if (fd < 0) {
            error = fd;
            goto out_free_ep;
        }
        file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                      O_RDWR | (flags & O_CLOEXEC));
        if (IS_ERR(file)) {
            error = PTR_ERR(file);
            goto out_free_fd;
        }
        ep->file = file;
        fd_install(fd, file);
        return fd;

    out_free_fd:
        put_unused_fd(fd);
    out_free_ep:
        ep_free(ep);
        return error;
    }

The flow of sys_epoll_create1 is as follows:

  • First it calls ep_alloc to allocate an eventpoll structure and initialize its members. Nothing noteworthy here; the code follows:

sys_epoll_create -> sys_epoll_create1 -> ep_alloc:

 
   
   
 
    static int ep_alloc(struct eventpoll **pep)
    {
        int error;
        struct user_struct *user;
        struct eventpoll *ep;

        user = get_current_user();
        error = -ENOMEM;
        ep = kzalloc(sizeof(*ep), GFP_KERNEL);
        if (unlikely(!ep))
            goto free_uid;

        spin_lock_init(&ep->lock);
        mutex_init(&ep->mtx);
        init_waitqueue_head(&ep->wq);
        init_waitqueue_head(&ep->poll_wait);
        INIT_LIST_HEAD(&ep->rdllist);
        ep->rbr = RB_ROOT;
        ep->ovflist = EP_UNACTIVE_PTR;
        ep->user = user;

        *pep = ep;

        return 0;

    free_uid:
        free_uid(user);
        return error;
    }

  • Next it calls get_unused_fd_flags to claim an unused file descriptor in the current process.

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags:

 
   
   
 
    int get_unused_fd_flags(unsigned flags)
    {
        return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
    }

In the Linux kernel, current is a macro that evaluates to a task_struct (which we call the process descriptor) for the current process. The files a process has open are kept in the files member of its process descriptor, so current->files is the open-file state of the current process. rlimit(RLIMIT_NOFILE) returns the maximum number of file descriptors the current process may open; this limit is configurable and defaults to 1024.
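The same limit the kernel consults here is visible from user space via getrlimit. This sketch is my own; since the default of 1024 varies with configuration, the test only checks that a positive limit exists.

```c
#include <sys/resource.h>

/* Return the soft RLIMIT_NOFILE limit, i.e. the 'end' that __alloc_fd
 * receives via rlimit(RLIMIT_NOFILE); returns 0 on error. */
unsigned long max_open_fds(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    return (unsigned long)rl.rlim_cur;
}
```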

__alloc_fd allocates an available file descriptor for the process within [start, end) (note: here start is 0 and end is the process's maximum open-fd count). We will not dig any deeper here; the code follows:

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags -> __alloc_fd:

 
   
   
 
    /*
     * allocate a file descriptor, mark it busy.
     */
    int __alloc_fd(struct files_struct *files,
               unsigned start, unsigned end, unsigned flags)
    {
        unsigned int fd;
        int error;
        struct fdtable *fdt;

        spin_lock(&files->file_lock);
    repeat:
        fdt = files_fdtable(files);
        fd = start;
        if (fd < files->next_fd)
            fd = files->next_fd;

        if (fd < fdt->max_fds)
            fd = find_next_fd(fdt, fd);

        /*
         * N.B. For clone tasks sharing a files structure, this test
         * will limit the total number of files that can be opened.
         */
        error = -EMFILE;
        if (fd >= end)
            goto out;

        error = expand_files(files, fd);
        if (error < 0)
            goto out;

        /*
         * If we needed to expand the fs array we
         * might have blocked - try again.
         */
        if (error)
            goto repeat;

        if (start <= files->next_fd)
            files->next_fd = fd + 1;

        __set_open_fd(fd, fdt);
        if (flags & O_CLOEXEC)
            __set_close_on_exec(fd, fdt);
        else
            __clear_close_on_exec(fd, fdt);
        error = fd;
    #if 1
        /* Sanity check */
        if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
            printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
            rcu_assign_pointer(fdt->fd[fd], NULL);
        }
    #endif

    out:
        spin_unlock(&files->file_lock);
        return error;
    }

  • Then epoll_create1 calls anon_inode_getfile to create a file structure:

sys_epoll_create -> sys_epoll_create1 -> anon_inode_getfile:

 
   
   
 
    /**
     * anon_inode_getfile - creates a new file instance by hooking it up to an
     *                      anonymous inode, and a dentry that describe the "class"
     *                      of the file
     *
     * @name:    [in]    name of the "class" of the new file
     * @fops:    [in]    file operations for the new file
     * @priv:    [in]    private data for the new file (will be file's private_data)
     * @flags:   [in]    flags
     *
     * Creates a new file by hooking it on a single inode. This is useful for files
     * that do not need to have a full-fledged inode in order to operate correctly.
     * All the files created with anon_inode_getfile() will share a single inode,
     * hence saving memory and avoiding code duplication for the file/inode/dentry
     * setup. Returns the newly created file* or an error pointer.
     */
    struct file *anon_inode_getfile(const char *name,
                    const struct file_operations *fops,
                    void *priv, int flags)
    {
        struct qstr this;
        struct path path;
        struct file *file;

        if (IS_ERR(anon_inode_inode))
            return ERR_PTR(-ENODEV);

        if (fops->owner && !try_module_get(fops->owner))
            return ERR_PTR(-ENOENT);

        /*
         * Link the inode to a directory entry by creating a unique name
         * using the inode sequence number.
         */
        file = ERR_PTR(-ENOMEM);
        this.name = name;
        this.len = strlen(name);
        this.hash = 0;
        path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
        if (!path.dentry)
            goto err_module;

        path.mnt = mntget(anon_inode_mnt);
        /*
         * We know the anon_inode inode count is always greater than zero,
         * so ihold() is safe.
         */
        ihold(anon_inode_inode);

        d_instantiate(path.dentry, anon_inode_inode);

        file = alloc_file(&path, OPEN_FMODE(flags), fops);
        if (IS_ERR(file))
            goto err_dput;
        file->f_mapping = anon_inode_inode->i_mapping;

        file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
        file->private_data = priv;

        return file;

    err_dput:
        path_put(&path);
    err_module:
        module_put(fops->owner);
        return file;
    }

anon_inode_getfile first allocates a file structure and a dentry structure, then hooks the file structure up to the anonymous inode anon_inode_inode. Note that when anon_inode_getfile is called to allocate the file structure, the eventpoll structure ep allocated earlier is passed in: the new file's private_data is made to point at ep, and once anon_inode_getfile returns, ep->file points back at the file structure the function allocated.

A brief aside on file/dentry/inode. When a process opens a file, the kernel allocates a file structure for it, representing the open file in the context of that process; the application then accesses this structure through an int file descriptor. Internally the kernel keeps, per process, an array of file structures, and the file descriptor is simply the index of the corresponding file structure in that array.

The dentry structure (the "directory entry") records attributes of a file such as its name and access permissions. Each file has just one dentry, while a process can open the same file multiple times and multiple processes can open the same file; in all those cases the kernel allocates multiple file structures, establishing multiple open-file contexts. But however many times a file is opened, the kernel allocates only one dentry for it. So the relationship between file structures and dentry structures is many-to-one.

Besides its dentry, every file also has an inode (index node) structure, which records where and how the file is laid out on the storage medium; each file gets only one inode in the kernel. dentry and inode describe different things: a file may have several names (think of link files), and the permissions for reaching the same file through different names may differ. The dentry represents the file in the logical sense and records its logical attributes, while the inode represents the file in the physical sense and records its physical attributes. The relationship between dentry and inode structures is many-to-one.

  • Finally, epoll_create1 calls fd_install to associate fd with file; afterwards, the kernel can reach the file structure from the fd the application passes in. The code here is simple, so we will not go deeper.

sys_epoll_create -> sys_epoll_create1 -> fd_install:

 
   
   
 
    /*
     * Install a file pointer in the fd array.
     *
     * The VFS is full of places where we drop the files lock between
     * setting the open_fds bitmap and installing the file in the file
     * array. At any such point, we are vulnerable to a dup2() race
     * installing a file in the array before us. We need to detect this and
     * fput() the struct file we are about to overwrite in this case.
     *
     * It should never happen - if we allow dup2() do it, _really_ bad things
     * will follow.
     *
     * NOTE: __fd_install() variant is really, really low-level; don't
     * use it unless you are forced to by truly lousy API shoved down
     * your throat. 'files' *MUST* be either current->files or obtained
     * by get_files_struct(current) done by whoever had given it to you,
     * or really bad things will happen. Normally you want to use
     * fd_install() instead.
     */

    void __fd_install(struct files_struct *files, unsigned int fd,
              struct file *file)
    {
        struct fdtable *fdt;

        might_sleep();
        rcu_read_lock_sched();

        while (unlikely(files->resize_in_progress)) {
            rcu_read_unlock_sched();
            wait_event(files->resize_wait, !files->resize_in_progress);
            rcu_read_lock_sched();
        }
        /* coupled with smp_wmb() in expand_fdtable() */
        smp_rmb();
        fdt = rcu_dereference_sched(files->fdt);
        BUG_ON(fdt->fd[fd] != NULL);
        rcu_assign_pointer(fdt->fd[fd], file);
        rcu_read_unlock_sched();
    }

    void fd_install(unsigned int fd, struct file *file)
    {
        __fd_install(current->files, fd, file);
    }

To summarize what epoll_create does: it allocates, in the kernel, an eventpoll structure and a file structure representing the epoll file, links the two together, and returns an epoll file descriptor fd that is also tied to that file structure. When an application later operates on the epoll instance, it passes in this epoll fd; from the fd the kernel finds the epoll's file structure, and through the file it retrieves the eventpoll structure allocated earlier by epoll_create, which holds all the important epoll state. All subsequent epoll API operations act on this eventpoll structure.

So the role of epoll_create is to establish, for the process, a path in the kernel from an epoll file descriptor to an eventpoll structure.

epoll_ctl

The epoll_ctl interface adds, modifies or removes monitored events for a file. The kernel code:

sys_epoll_ctl:

 
   
   
 
    /*
     * The following function implements the controller interface for
     * the eventpoll file that enables the insertion/removal/change of
     * file descriptors inside the interest set.
     */
    SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
            struct epoll_event __user *, event)
    {
        int error;
        int full_check = 0;
        struct fd f, tf;
        struct eventpoll *ep;
        struct epitem *epi;
        struct epoll_event epds;
        struct eventpoll *tep = NULL;

        error = -EFAULT;
        if (ep_op_has_event(op) &&
            copy_from_user(&epds, event, sizeof(struct epoll_event)))
            goto error_return;

        error = -EBADF;
        f = fdget(epfd);
        if (!f.file)
            goto error_return;

        /* Get the "struct file *" for the target file */
        tf = fdget(fd);
        if (!tf.file)
            goto error_fput;

        /* The target file descriptor must support poll */
        error = -EPERM;
        if (!tf.file->f_op->poll)
            goto error_tgt_fput;

        /* Check if EPOLLWAKEUP is allowed */
        if (ep_op_has_event(op))
            ep_take_care_of_epollwakeup(&epds);

        /*
         * We have to check that the file structure underneath the file descriptor
         * the user passed to us _is_ an eventpoll file. And also we do not permit
         * adding an epoll file descriptor inside itself.
         */
        error = -EINVAL;
        if (f.file == tf.file || !is_file_epoll(f.file))
            goto error_tgt_fput;

        /*
         * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
         * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
         * Also, we do not currently supported nested exclusive wakeups.
         */
        if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
            if (op == EPOLL_CTL_MOD)
                goto error_tgt_fput;
            if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
                    (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
                goto error_tgt_fput;
        }

        /*
         * At this point it is safe to assume that the "private_data" contains
         * our own data structure.
         */
        ep = f.file->private_data;

        /*
         * When we insert an epoll file descriptor, inside another epoll file
         * descriptor, there is the change of creating closed loops, which are
         * better be handled here, than in more critical paths. While we are
         * checking for loops we also determine the list of files reachable
         * and hang them on the tfile_check_list, so we can check that we
         * haven't created too many possible wakeup paths.
         *
         * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
         * the epoll file descriptor is attaching directly to a wakeup source,
         * unless the epoll file descriptor is nested. The purpose of taking the
         * 'epmutex' on add is to prevent complex toplogies such as loops and
         * deep wakeup paths from forming in parallel through multiple
         * EPOLL_CTL_ADD operations.
         */
        mutex_lock_nested(&ep->mtx, 0);
        if (op == EPOLL_CTL_ADD) {
            if (!list_empty(&f.file->f_ep_links) ||
                    is_file_epoll(tf.file)) {
                full_check = 1;
                mutex_unlock(&ep->mtx);
                mutex_lock(&epmutex);
                if (is_file_epoll(tf.file)) {
                    error = -ELOOP;
                    if (ep_loop_check(ep, tf.file) != 0) {
                        clear_tfile_check_list();
                        goto error_tgt_fput;
                    }
                } else
                    list_add(&tf.file->f_tfile_llink,
                            &tfile_check_list);
                mutex_lock_nested(&ep->mtx, 0);
                if (is_file_epoll(tf.file)) {
                    tep = tf.file->private_data;
                    mutex_lock_nested(&tep->mtx, 1);
                }
            }
        }

        /*
         * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
         * above, we can be sure to be able to use the item looked up by
         * ep_find() till we release the mutex.
         */
        epi = ep_find(ep, tf.file, fd);

        error = -EINVAL;
        switch (op) {
        case EPOLL_CTL_ADD:
            if (!epi) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_insert(ep, &epds, tf.file, fd, full_check);
            } else
                error = -EEXIST;
            if (full_check)
                clear_tfile_check_list();
            break;
        case EPOLL_CTL_DEL:
            if (epi)
                error = ep_remove(ep, epi);
            else
                error = -ENOENT;
            break;
        case EPOLL_CTL_MOD:
            if (epi) {
                if (!(epi->event.events & EPOLLEXCLUSIVE)) {
                    epds.events |= POLLERR | POLLHUP;
                    error = ep_modify(ep, epi, &epds);
                }
            } else
                error = -ENOENT;
            break;
        }
        if (tep != NULL)
            mutex_unlock(&tep->mtx);
        mutex_unlock(&ep->mtx);

    error_tgt_fput:
        if (full_check)
            mutex_unlock(&epmutex);

        fdput(tf);
    error_fput:
        fdput(f);
    error_return:

        return error;
    }

As described in the epoll_ctl interface introduction above, op is the action to take on the epoll instance (add/modify/delete an event). ep_op_has_event(op) checks whether the operation is anything other than a delete: if op != EPOLL_CTL_DEL is true, copy_from_user must copy the event passed from user space into the kernel variable epds. Only for a delete does the kernel not need the event supplied by the process.

Next, two consecutive fdget calls fetch the file structures of the epoll file and of the file being monitored (the target file, hereafter). (Note: fdget returns an fd structure, which contains a file structure.)

Then come several parameter checks; any of the following conditions means the arguments are bad, and the call fails immediately:

  1. the target file does not support poll (!tf.file->f_op->poll);

  2. the target file being monitored is the epoll file itself (f.file == tf.file);

  3. the epoll file the user passed in (the file behind epfd) is not actually an epoll file (!is_file_epoll(f.file));

  4. the operation is a modify and the event type includes EPOLLEXCLUSIVE; and so on.

There are some further checks below for the add operation; they are straightforward, so read them on your own.

Inside ep a red-black tree is maintained: each time an event is registered, an epitem structure is allocated to represent the watch, then inserted into ep's red-black tree. In epoll_ctl, ep_find looks up in ep's red-black tree the watch corresponding to the target file; the returned watch may be NULL.

The switch block that follows is the core of the whole epoll_ctl function: op switches among add (EPOLL_CTL_ADD), delete (EPOLL_CTL_DEL) and modify (EPOLL_CTL_MOD). I will walk through the add case; the other two are similar, and once you know how a watch is added, deleting and modifying follow by analogy.

When adding a monitored event for a target file, we first make sure ep is not already watching that file; if a watch exists (epi is non-NULL), -EEXIST is returned. Otherwise the arguments are fine: POLLERR and POLLHUP are added to the watched events by default, and ep_insert is called to insert the watch on the target file into the red-black tree maintained by ep:

sys_epoll_ctl -> ep_insert:

 
   
   
 
    /*
     * Must be called with "mtx" held.
     */
    static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                 struct file *tfile, int fd, int full_check)
    {
        int error, revents, pwake = 0;
        unsigned long flags;
        long user_watches;
        struct epitem *epi;
        struct ep_pqueue epq;

        user_watches = atomic_long_read(&ep->user->epoll_watches);
        if (unlikely(user_watches >= max_user_watches))
            return -ENOSPC;
        if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
            return -ENOMEM;

        /* Item initialization follow here ... */
        INIT_LIST_HEAD(&epi->rdllink);
        INIT_LIST_HEAD(&epi->fllink);
        INIT_LIST_HEAD(&epi->pwqlist);
        epi->ep = ep;
        ep_set_ffd(&epi->ffd, tfile, fd);
        epi->event = *event;
        epi->nwait = 0;
        epi->next = EP_UNACTIVE_PTR;
        if (epi->event.events & EPOLLWAKEUP) {
            error = ep_create_wakeup_source(epi);
            if (error)
                goto error_create_wakeup_source;
        } else {
            RCU_INIT_POINTER(epi->ws, NULL);
        }

        /* Initialize the poll table using the queue callback */
        epq.epi = epi;
        init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

        /*
         * Attach the item to the poll hooks and get current event bits.
         * We can safely use the file* here because its usage count has
         * been increased by the caller of this function. Note that after
         * this operation completes, the poll callback can start hitting
         * the new item.
         */
        revents = ep_item_poll(epi, &epq.pt);

        /*
         * We have to check if something went wrong during the poll wait queue
         * install process. Namely an allocation for a wait queue failed due
         * high memory pressure.
         */
        error = -ENOMEM;
        if (epi->nwait < 0)
            goto error_unregister;

        /* Add the current item to the list of active epoll hook for this file */
        spin_lock(&tfile->f_lock);
        list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
        spin_unlock(&tfile->f_lock);

        /*
         * Add the current item to the RB tree. All RB tree operations are
         * protected by "mtx", and ep_insert() is called with "mtx" held.
         */
        ep_rbtree_insert(ep, epi);

        /* now check if we've created too many backpaths */
        error = -EINVAL;
        if (full_check && reverse_path_check())
            goto error_remove_epi;

        /* We have to drop the new item inside our item list to keep track of it */
        spin_lock_irqsave(&ep->lock, flags);

        /* record NAPI ID of new item if present */
        ep_set_busy_poll_napi_id(epi);

        /* If the file is already "ready" we drop it inside the ready list */
        if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ep_pm_stay_awake(epi);

            /* Notify waiting tasks that events are available */
            if (waitqueue_active(&ep->wq))
                wake_up_locked(&ep->wq);
            if (waitqueue_active(&ep->poll_wait))
                pwake++;
        }

        spin_unlock_irqrestore(&ep->lock, flags);

        atomic_long_inc(&ep->user->epoll_watches);

        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&ep->poll_wait);

        return 0;

    error_remove_epi:
        spin_lock(&tfile->f_lock);
        list_del_rcu(&epi->fllink);
        spin_unlock(&tfile->f_lock);

        rb_erase(&epi->rbn, &ep->rbr);

    error_unregister:
        ep_unregister_pollwait(ep, epi);

        /*
         * We need to do this because an event could have been arrived on some
         * allocated wait queue. Note that we don't care about the ep->ovflist
         * list, since that is used/cleaned only inside a section bound by "mtx".
         * And ep_insert() is called with "mtx" held.
         */
        spin_lock_irqsave(&ep->lock, flags);
        if (ep_is_linked(&epi->rdllink))
            list_del_init(&epi->rdllink);
        spin_unlock_irqrestore(&ep->lock, flags);

        wakeup_source_unregister(ep_wakeup_source(epi));

    error_create_wakeup_source:
        kmem_cache_free(epi_cache, epi);

        return error;
    }

As noted earlier, the watch on a target file is maintained in an epitem structure. So ep_insert first calls kmem_cache_alloc to get an epitem watch from the slab allocator and initializes it; nothing remarkable there. Next, look at the ep_item_poll call:

sys_epoll_ctl -> ep_insert -> ep_item_poll:

 
   
   
 
    static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
    {
        pt->_key = epi->event.events;

        return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
    }

ep_item_poll invokes the target file's poll function. This function differs per target file: for a socket, poll points to sock_poll, and for a TCP socket it ends up being tcp_poll. Whichever function poll points to, its job is the same: fetch the event bits the target file currently has pending, and hook the watch into the target file's poll hook (most importantly, registering the ep_ptable_queue_proc poll callback, which in turn queues epoll's wakeup callback on the target file's wait queue). Once this step completes, whenever the target file produces an event, that callback is invoked.

Next, list_add_tail_rcu adds the current watch to the target file's f_ep_links list, the file's epoll hook list; every watch monitoring this target file is added to this list.

Then ep_rbtree_insert adds the epi watch to the red-black tree maintained by ep; no commentary needed, the code follows:

sys_epoll_ctl -> ep_insert -> ep_rbtree_insert:

 
   
   
 
    static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
    {
        int kcmp;
        struct rb_node **p = &ep->rbr.rb_node, *parent = NULL;
        struct epitem *epic;

        while (*p) {
            parent = *p;
            epic = rb_entry(parent, struct epitem, rbn);
            kcmp = ep_cmp_ffd(&epi->ffd, &epic->ffd);
            if (kcmp > 0)
                p = &parent->rb_right;
            else
                p = &parent->rb_left;
        }
        rb_link_node(&epi->rbn, parent, p);
        rb_insert_color(&epi->rbn, &ep->rbr);
    }

As mentioned above, ep_insert calls ep_item_poll to fetch the event bits the target file has already produced. In the window before epoll_ctl was called, events the process wants to monitor may already have occurred. If so (revents & event->events is true) and the target file's watch is not yet linked into ep's ready list rdllist, the watch is added to rdllist; rdllist links the watches of all target files of this epoll descriptor that are already ready. Furthermore, if any task is waiting for events, wake_up_locked is called to wake all waiting tasks so they can handle the events. A process appears on ep's wq wait queue when it calls epoll_wait; that function is covered next.

epoll_ctl in summary: based on the events to be monitored, it allocates a watch (epitem) for the target file and hangs that watch on the red-black tree of the eventpoll structure.
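The -EEXIST path in the add case is directly visible from user space. A quick sketch of my own, adding the same fd to an epoll instance twice:

```c
#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* Add fd to a fresh epoll instance twice; return the errno of the
 * second EPOLL_CTL_ADD (expected: EEXIST), or -1 on unrelated failure. */
int errno_of_duplicate_add(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int epfd = epoll_create(1);
    int err = -1;

    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        err = errno;    /* second add fails: the epitem already exists */
    close(epfd);
    return err;
}
```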

epoll_wait

epoll_wait waits for events to occur. The kernel code:

sys_epoll_wait:

 
   
   
 
    /*
     * Implement the event wait interface for the eventpoll file. It is the kernel
     * part of the user space epoll_wait(2).
     */
    SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
            int, maxevents, int, timeout)
    {
        int error;
        struct fd f;
        struct eventpoll *ep;

        /* The maximum number of event must be greater than zero */
        if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
            return -EINVAL;

        /* Verify that the area passed by the user is writeable */
        if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
            return -EFAULT;

        /* Get the "struct file *" for the eventpoll file */
        f = fdget(epfd);
        if (!f.file)
            return -EBADF;

        /*
         * We have to check that the file structure underneath the fd
         * the user passed to us _is_ an eventpoll file.
         */
        error = -EINVAL;
        if (!is_file_epoll(f.file))
            goto error_fput;

        /*
         * At this point it is safe to assume that the "private_data" contains
         * our own data structure.
         */
        ep = f.file->private_data;

        /* Time to fish for events ... */
        error = ep_poll(ep, events, maxevents, timeout);

    error_fput:
        fdput(f);
        return error;
    }

First come checks on the parameters the process passed in:

  • maxevents must be greater than 0 and no larger than EP_MAX_EVENTS, otherwise -EINVAL is returned;

  • the kernel must have permission to write to the events buffer, otherwise -EFAULT is returned;

  • the file behind epfd must be a genuine epoll file, otherwise -EBADF is returned.

Once all parameters check out, ep_poll is called to do the real processing:

sys_epoll_wait -> ep_poll:

 
   
   
 
    /**
     * ep_poll - Retrieves ready events, and delivers them to the caller supplied
     *           event buffer.
     *
     * @ep: Pointer to the eventpoll context.
     * @events: Pointer to the userspace buffer where the ready events should be
     *          stored.
     * @maxevents: Size (in terms of number of events) of the caller event buffer.
     * @timeout: Maximum timeout for the ready events fetch operation, in
     *           milliseconds. If the @timeout is zero, the function will not block,
     *           while if the @timeout is less than zero, the function will block
     *           until at least one event has been retrieved (or an error
     *           occurred).
     *
     * Returns: Returns the number of ready events which have been fetched, or an
     *          error code, in case of error.
     */
    static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
               int maxevents, long timeout)
    {
        int res = 0, eavail, timed_out = 0;
        unsigned long flags;
        u64 slack = 0;
        wait_queue_t wait;
        ktime_t expires, *to = NULL;

        if (timeout > 0) {
            struct timespec64 end_time = ep_set_mstimeout(timeout);

            slack = select_estimate_accuracy(&end_time);
            to = &expires;
            *to = timespec64_to_ktime(end_time);
        } else if (timeout == 0) {
            /*
             * Avoid the unnecessary trip to the wait queue loop, if the
             * caller specified a non blocking operation.
             */
            timed_out = 1;
            spin_lock_irqsave(&ep->lock, flags);
            goto check_events;
        }

    fetch_events:

        if (!ep_events_available(ep))
            ep_busy_loop(ep, timed_out);

        spin_lock_irqsave(&ep->lock, flags);

        if (!ep_events_available(ep)) {
            /*
             * Busy poll timed out. Drop NAPI ID for now, we can add
             * it back in when we have moved a socket with a valid NAPI
             * ID onto the ready list.
             */
            ep_reset_busy_poll_napi_id(ep);

            /*
             * We don't have any available event to return to the caller.
             * We need to sleep here, and we will be wake up by
             * ep_poll_callback() when events will become available.
             */
            init_waitqueue_entry(&wait, current);
            __add_wait_queue_exclusive(&ep->wq, &wait);

            for (;;) {
                /*
                 * We don't want to sleep if the ep_poll_callback() sends us
                 * a wakeup in between. That's why we set the task state
                 * to TASK_INTERRUPTIBLE before doing the checks.
                 */
                set_current_state(TASK_INTERRUPTIBLE);
                if (ep_events_available(ep) || timed_out)
                    break;
                if (signal_pending(current)) {
                    res = -EINTR;
                    break;
                }

                spin_unlock_irqrestore(&ep->lock, flags);
                if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                    timed_out = 1;

                spin_lock_irqsave(&ep->lock, flags);
            }

            __remove_wait_queue(&ep->wq, &wait);
            __set_current_state(TASK_RUNNING);
        }
    check_events:
        /* Is it worth to try to dig for events ? */
        eavail = ep_events_available(ep);

        spin_unlock_irqrestore(&ep->lock, flags);

        /*
         * Try to transfer events to user space. In case we get 0 events and
         * there's still timeout left over, we go trying again in search of
         * more luck.
         */
        if (!res && eavail &&
            !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
            goto fetch_events;

        return res;
    }

ep_poll first handles the wait time. The timeout is in milliseconds: a timeout greater than 0 means time out after that long; a timeout of 0 means the function does not block and returns at once; a negative timeout means block forever, returning only once an event occurs.
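The non-blocking case (the timed_out = 1 shortcut above) is easy to check from user space. This sketch is my own: an epoll instance with no ready events, polled with a zero timeout, returns 0 immediately instead of sleeping.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* epoll_wait with timeout 0 on an epoll set with nothing ready:
 * returns the event count (expected 0) without ever blocking. */
int poll_nonblocking(void)
{
    struct epoll_event out;
    int epfd = epoll_create(1);
    int n;

    if (epfd < 0)
        return -1;
    n = epoll_wait(epfd, &out, 1, 0);  /* timeout == 0: never sleeps */
    close(epfd);
    return n;
}
```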

When no events are pending (!ep_events_available(ep) is true), __add_wait_queue_exclusive adds the current process to the ep->wq wait queue. Then, inside an infinite for loop, set_current_state(TASK_INTERRUPTIBLE) first puts the current process into the interruptible sleep state, after which the process yields the CPU and sleeps until another process calls wake_up or an interrupting signal arrives to wake it; only then does it execute the code that follows.

Upon waking, the process first checks whether events have occurred, whether the timeout expired, or whether it was woken by some other signal. In any of these cases it breaks out of the loop, removes itself from the ep->wq wait queue, and sets itself back to the TASK_RUNNING state.

If events really did occur, ep_send_events is called to transfer the events to user space.

sys_epoll_wait -> ep_poll -> ep_send_events:

 
   
   
 
    static int ep_send_events(struct eventpoll *ep,
                  struct epoll_event __user *events, int maxevents)
    {
        struct ep_send_events_data esed;

        esed.maxevents = maxevents;
        esed.events = events;

        return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
    }

ep_send_events itself does little; the real work happens inside ep_scan_ready_list:

sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list:

 
   
   
 
```c
/**
 * ep_scan_ready_list - Scans the ready list in a way that makes possible for
 *                      the scan code, to call f_op->poll(). Also allows for
 *                      O(NumReady) performance.
 *
 * @ep: Pointer to the epoll private data structure.
 * @sproc: Pointer to the scan callback.
 * @priv: Private opaque data passed to the @sproc callback.
 * @depth: The current depth of recursive f_op->poll calls.
 * @ep_locked: caller already holds ep->mtx
 *
 * Returns: The same integer error code returned by the @sproc callback.
 */
static int ep_scan_ready_list(struct eventpoll *ep,
                              int (*sproc)(struct eventpoll *,
                                           struct list_head *, void *),
                              void *priv, int depth, bool ep_locked)
{
    int error, pwake = 0;
    unsigned long flags;
    struct epitem *epi, *nepi;
    LIST_HEAD(txlist);

    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl().
     */

    if (!ep_locked)
        mutex_lock_nested(&ep->mtx, depth);

    /*
     * Steal the ready list, and re-init the original one to the
     * empty list. Also, set ep->ovflist to NULL so that events
     * happening while looping w/out locks, are not lost. We cannot
     * have the poll callback to queue directly on ep->rdllist,
     * because we want the "sproc" callback to be able to do it
     * in a lockless way.
     */
    spin_lock_irqsave(&ep->lock, flags);
    list_splice_init(&ep->rdllist, &txlist);
    ep->ovflist = NULL;
    spin_unlock_irqrestore(&ep->lock, flags);

    /*
     * Now call the callback function.
     */
    error = (*sproc)(ep, &txlist, priv);

    spin_lock_irqsave(&ep->lock, flags);
    /*
     * During the time we spent inside the "sproc" callback, some
     * other events might have been queued by the poll callback.
     * We re-insert them inside the main ready-list here.
     */
    for (nepi = ep->ovflist; (epi = nepi) != NULL;
         nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
        /*
         * We need to check if the item is already in the list.
         * During the "sproc" callback execution time, items are
         * queued into ->ovflist but the "txlist" might already
         * contain them, and the list_splice() below takes care of them.
         */
        if (!ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ep_pm_stay_awake(epi);
        }
    }
    /*
     * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
     * releasing the lock, events will be queued in the normal way inside
     * ep->rdllist.
     */
    ep->ovflist = EP_UNACTIVE_PTR;

    /*
     * Quickly re-inject items left on "txlist".
     */
    list_splice(&txlist, &ep->rdllist);
    __pm_relax(ep->ws);

    if (!list_empty(&ep->rdllist)) {
        /*
         * Wake up (if active) both the eventpoll wait list and
         * the ->poll() wait list (delayed after we release the lock).
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);

    if (!ep_locked)
        mutex_unlock(&ep->mtx);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return error;
}
```

ep_scan_ready_list first splices the entries on ep's ready list onto a local txlist and empties the original ready list, and at the same time sets ep's ovflist to NULL. ovflist is a singly linked backup list for newly ready events: while the kernel is copying events to user space, the watched files may produce new events, and during that window new events are linked onto ovflist instead.

Next, the sproc callback (here, ep_send_events_proc) is invoked to copy the event data from kernel space to user space.

sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list -> ep_send_events_proc:

```c
static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
                               void *priv)
{
    struct ep_send_events_data *esed = priv;
    int eventcnt;
    unsigned int revents;
    struct epitem *epi;
    struct epoll_event __user *uevent;
    struct wakeup_source *ws;
    poll_table pt;

    init_poll_funcptr(&pt, NULL);

    /*
     * We can loop without lock because we are passed a task private list.
     * Items cannot vanish during the loop because ep_scan_ready_list() is
     * holding "mtx" during this call.
     */
    for (eventcnt = 0, uevent = esed->events;
         !list_empty(head) && eventcnt < esed->maxevents;) {
        epi = list_first_entry(head, struct epitem, rdllink);

        /*
         * Activate ep->ws before deactivating epi->ws to prevent
         * triggering auto-suspend here (in case we reactive epi->ws
         * below).
         *
         * This could be rearranged to delay the deactivation of epi->ws
         * instead, but then epi->ws would temporarily be out of sync
         * with ep_is_linked().
         */
        ws = ep_wakeup_source(epi);
        if (ws) {
            if (ws->active)
                __pm_stay_awake(ep->ws);
            __pm_relax(ws);
        }

        list_del_init(&epi->rdllink);

        revents = ep_item_poll(epi, &pt);

        /*
         * If the event mask intersect the caller-requested one,
         * deliver the event to userspace. Again, ep_scan_ready_list()
         * is holding "mtx", so no operations coming from userspace
         * can change the item.
         */
        if (revents) {
            if (__put_user(revents, &uevent->events) ||
                __put_user(epi->event.data, &uevent->data)) {
                list_add(&epi->rdllink, head);
                ep_pm_stay_awake(epi);
                return eventcnt ? eventcnt : -EFAULT;
            }
            eventcnt++;
            uevent++;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            else if (!(epi->event.events & EPOLLET)) {
                /*
                 * If this file has been added with Level
                 * Trigger mode, we need to insert back inside
                 * the ready list, so that the next call to
                 * epoll_wait() will check again the events
                 * availability. At this point, no one can insert
                 * into ep->rdllist besides us. The epoll_ctl()
                 * callers are locked out by
                 * ep_scan_ready_list() holding "mtx" and the
                 * poll callback will queue them in ep->ovflist.
                 */
                list_add_tail(&epi->rdllink, &ep->rdllist);
                ep_pm_stay_awake(epi);
            }
        }
    }

    return eventcnt;
}
```

The ep_send_events_proc callback loops over the ready items; for each one it calls ep_item_poll to fetch the events currently pending on the target file, and if any are set it uses __put_user to copy the data to user space.

Back in ep_scan_ready_list: as noted above, while the sproc callback runs, the target files may produce new events that get linked onto the ovflist, so after the callback returns, the events on ovflist must be moved back onto the rdllist ready list.

Finally, if rdllist is non-empty (meaning there are still ready events) and some process is waiting on them, wake_up_locked is called to wake it so the new arrivals get handled (the flow is the same as before: the events are copied to user space).

That completes the epoll_wait flow, but one question from earlier remains: the process goes to sleep after calling epoll_wait, so when is it woken? When epoll_ctl registers a watch on a target file, it installs the ep_ptable_queue_proc callback for that watch. ep_ptable_queue_proc adds an entry to the target file's wakeup list and registers the ep_poll_callback callback; when the target file produces an event, ep_poll_callback runs and wakes the processes on the wait queue.

To summarize: epoll_wait puts the calling process to sleep (except when timeout is 0); when a watched event occurs, the process is woken and the events are copied from the kernel to user space and returned to it.

References

[1] http://blog.csdn.net/chen19870707/article/details/42525887

[2] http://www.cnblogs.com/apprentice89/p/3234677.html