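TCP Tracing Analysis in the Kernel, Part 16: The TCP (IPv4) Client/Server Socket Connection Process (3)
In Part 15 we traced execution as far as the kernel's tcp_v4_rcv() function, which sits at line 1601 of /net/ipv4/tcp_ipv4.c. Let us read it in pieces.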
int tcp_v4_rcv(struct sk_buff *skb)
{
const struct iphdr *iph;
struct tcphdr *th;
struct sock *sk;
int ret;
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
/* Count it even if it's bad */
TCP_INC_STATS_BH(TCP_MIB_INSEGS);
if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
goto discard_it;
th = tcp_hdr(skb);
if (th->doff < sizeof(struct tcphdr) / 4)
goto bad_packet;
if (!pskb_may_pull(skb, th->doff * 4))
goto discard_it;
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction,
* provided case of th->doff==0 is eliminated.
* So, we defer the checks. */
if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
goto bad_packet;
th = tcp_hdr(skb);
iph = ip_hdr(skb);
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
TCP_SKB_CB(skb)->when = 0;
TCP_SKB_CB(skb)->flags = iph->tos;
TCP_SKB_CB(skb)->sacked = 0;
sk = __inet_lookup(dev_net(skb->dev), &tcp_hashinfo, iph->saddr,
th->source, iph->daddr, th->dest, inet_iif(skb));
if (!sk)
goto no_tcp_socket;
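The function starts by checking whether the packet type is PACKET_HOST, that is, whether the packet is addressed to this host. If it is not, the code jumps to the discard_it label and drops the packet. TCP_INC_STATS_BH is a macro: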
#define TCP_INC_STATS_BH(field) SNMP_INC_STATS_BH(tcp_statistics, field)
#define SNMP_INC_STATS_BH(mib, field) \
(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
Here we meet tcp_statistics, which is defined in /net/ipv4/tcp.c:
DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics) __read_mostly;
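This involves SNMP statistics. SNMP (Simple Network Management Protocol) grew out of the Simple Gateway Monitoring Protocol (SGMP) and is used to monitor and manage network devices and communication links. For a fuller explanation see
http://baike.baidu.com/view/2899.htm
The macro used for the SNMP counters can be found in /include/net/snmp.h: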
#define DEFINE_SNMP_STAT(type, name) \
__typeof__(type) *name[2]
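So the line above actually declares tcp_statistics as an array of two pointers to struct tcp_mib, not as a single structure variable. The storage behind those pointers is per-CPU, which is why the increment macro goes through per_cpu_ptr(), and the _BH variant shown earlier uses element 0. The structure itself is defined in the same snmp.h: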
struct tcp_mib {
unsigned long mibs[TCP_MIB_MAX];
} __SNMP_MIB_ALIGN__;
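To make the per-CPU counting idea concrete, here is a minimal user-space sketch. It is not the kernel implementation: the demo_ names, the fixed CPU count, and the plain array standing in for per_cpu_ptr() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of the demo_ names are kernel APIs. */
enum { DEMO_MIB_INSEGS, DEMO_MIB_INERRS, DEMO_MIB_MAX };
#define DEMO_NR_CPUS 4

struct demo_mib {
    unsigned long mibs[DEMO_MIB_MAX];
};

/* One private copy per CPU, so an increment only touches the local copy. */
static struct demo_mib demo_stats[DEMO_NR_CPUS];

/* Counterpart of the increment macro: bump one field on one CPU's copy. */
void demo_inc_stat(int cpu, int field)
{
    demo_stats[cpu].mibs[field]++;
}

/* A reader folds all per-CPU copies into one total. */
unsigned long demo_fold_stat(int field)
{
    unsigned long sum = 0;
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        sum += demo_stats[cpu].mibs[field];
    return sum;
}
```

Calling demo_inc_stat() from two different CPUs and then demo_fold_stat() for the same field returns the combined count, which is the behaviour a statistics reader would expect to see.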
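For how SNMP is implemented inside the kernel, please consult other references; we will not analyze it here. Read literally, the call increments the TCP_MIB_INSEGS entry of the mibs array in tcp_statistics. Continuing down tcp_v4_rcv(), the code next calls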
static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
{
if (likely(len <= skb_headlen(skb)))
return 1;
if (unlikely(len > skb->len))
return 0;
return __pskb_pull_tail(skb, len-skb_headlen(skb)) != NULL;
}
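This is clearly a length check on the packet header. pskb_may_pull() makes sure that at least sizeof(struct tcphdr) bytes are available in the linear part of the skb, pulling them in from the rest of the buffer if necessary; if the packet is too short for that, it is dropped. Once the length check passes, the TCP header is fetched from the skb. struct tcphdr was listed in Part 11 of this series, so readers can look back at it.
The code then validates the header. doff is the header length expressed in 32-bit words, so sizeof(struct tcphdr) / 4 is the smallest legal value (5 words for the 20-byte base header). If doff is smaller than that, the code jumps to the bad_packet label, increments the TCP_MIB_INERRS error counter in tcp_statistics, and drops the packet. pskb_may_pull() is then called a second time with doff * 4 to make sure the whole header, options included, is within bounds. After the checksum step, the tcp_skb_cb control block is initialized from the IP and TCP headers. The TCP_SKB_CB macro was covered in Part 10 of this series.
We will not explain each field here, since their meaning is tied to the protocol specification. Our focus is the key path of the trace, and the function next enters __inet_lookup():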
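Before following the lookup, the header checks and the end_seq calculation can be sketched in plain C. This is only an illustration under assumptions: the demo_ functions and the DEMO_TCP_MIN_HDR constant are hypothetical, and real kernel code works on an skb rather than on bare integers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TCP_MIN_HDR 20u  /* size of the base TCP header without options */

/* doff counts 32-bit words, so the header occupies doff * 4 bytes.
 * Returns 1 when doff is plausible for a segment of seg_len bytes. */
int demo_doff_ok(unsigned doff, unsigned seg_len)
{
    if (doff < DEMO_TCP_MIN_HDR / 4)  /* fewer than 5 words: bad packet */
        return 0;
    if (doff * 4 > seg_len)           /* header would run past the data */
        return 0;
    return 1;
}

/* end_seq = seq + syn + fin + payload length.
 * SYN and FIN each consume one sequence number; arithmetic wraps mod 2^32. */
uint32_t demo_end_seq(uint32_t seq, unsigned syn, unsigned fin,
                      unsigned seg_len, unsigned doff)
{
    return (uint32_t)(seq + syn + fin + (seg_len - doff * 4));
}
```

For a bare SYN with a 20-byte header, the payload length is zero and end_seq is simply seq + 1, which is why the peer acknowledges the SYN with seq + 1. With the header validated, the kernel's __inet_lookup() reads as follows.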
static inline struct sock *__inet_lookup(struct net *net,
struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
const __be32 daddr, const __be16 dport,
const int dif)
{
u16 hnum = ntohs(dport);
struct sock *sk = __inet_lookup_established(net, hashinfo,
saddr, sport, daddr, hnum, dif);
return sk ? : __inet_lookup_listener(net, hashinfo, daddr, hnum, dif);
}
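Note that besides the information taken from the packet, the function is also passed tcp_hashinfo, a global struct inet_hashinfo that we saw in Part 6.
Inside __inet_lookup(), __inet_lookup_established() is called first to look for a sock that is already connected.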
struct sock * __inet_lookup_established(struct net *net,
struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
const __be32 daddr, const u16 hnum,
const int dif)
{
INET_ADDR_COOKIE(acookie, saddr, daddr)
const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
struct sock *sk;
const struct hlist_node *node;
/* Optimize here for direct hit, only listening connections can
* have wildcards anyways.
*/
unsigned int hash = inet_ehashfn(daddr, hnum, saddr, sport);
struct inet_ehash_bucket *head = inet_ehash_bucket(hashinfo, hash);
rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
prefetch(head->chain.first);
read_lock(lock);
sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, net, hash, acookie,
saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}
/* Must check for a TIME_WAIT'er before going to listener hash. */
sk_for_each(sk, node, &head->twchain) {
if (INET_TW_MATCH(sk, net, hash, acookie,
saddr, daddr, ports, dif))
goto hit;
}
sk = NULL;
out:
read_unlock(lock);
return sk;
hit:
sock_hold(sk);
goto out;
}
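We will not explain this function in detail for now. The hash table and hash buckets were covered in Part 5 on address binding, but to fully understand the code above you also need to know how the hash table operates.
The article referred to here was written for the 2.4 kernel, but its principles still make it a good way to learn about this hash table in depth. Recall that in the listening analysis we entered __inet_hash() at the point where the sock is inserted into the hash table.
__inet_hash() first checks whether the sock's state is TCP_LISTEN. If it is not, it calls __inet_hash_nolisten(), shown here: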
void __inet_hash_nolisten(struct sock *sk)
{
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
struct hlist_head *list;
rwlock_t *lock;
struct inet_ehash_bucket *head;
BUG_TRAP(sk_unhashed(sk));
sk->sk_hash = inet_sk_ehashfn(sk);
head = inet_ehash_bucket(hashinfo, sk->sk_hash);
list = &head->chain;
lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
write_lock(lock);
__sk_add_node(sk, list);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
write_unlock(lock);
}
The code above calls inet_ehash_bucket() to insert the sock into the ehash chain of our global tcp_hashinfo, and __inet_lookup_established() calls inet_ehash_bucket() again to find the head of that same ehash chain:
static inline struct inet_ehash_bucket *inet_ehash_bucket(
struct inet_hashinfo *hashinfo,
unsigned int hash)
{
return &hashinfo->ehash[hash & (hashinfo->ehash_size - 1)];
}
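The index expression hash & (ehash_size - 1) relies on the table size being a power of two, in which case the mask gives the same result as a modulo. A tiny sketch, using a hypothetical demo_bucket_index() helper that is not a kernel function:

```c
#include <assert.h>

/* Map a hash value to a bucket index.
 * Assumption: size is a power of two, so size - 1 is an all-ones mask
 * and the AND is equivalent to hash % size. */
unsigned demo_bucket_index(unsigned hash, unsigned size)
{
    return hash & (size - 1);
}
```

With a table of 256 buckets, a hash of 0x12345 lands in bucket 0x45: only the low eight bits survive the mask.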
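Our server socket, by contrast, was put into the listening state, as described in the listening article, so __inet_hash() inserted it into the listening_hash chain of tcp_hashinfo instead.
In the code above, __inet_lookup_established() uses INET_MATCH to look for a sock whose addresses and ports match those of the packet. The addresses and ports used as the lookup key come from the packet's IP and TCP headers. How the packet was routed to this point was only touched on in earlier parts, and we will fill in the routing details later. Because the listening analysis showed that the server-side sock is in the listening state and was never inserted into the ehash chain, __inet_lookup_established() does not find the sock we want. Control therefore returns to __inet_lookup(), which calls __inet_lookup_listener() to search the listening_hash chain: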
struct sock *__inet_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
const __be32 daddr, const unsigned short hnum,
const int dif)
{
struct sock *sk = NULL;
const struct hlist_head *head;
read_lock(&hashinfo->lhash_lock);
head = &hashinfo->listening_hash[inet_lhashfn(hnum)];
if (!hlist_empty(head)) {
const struct inet_sock *inet = inet_sk((sk = __sk_head(head)));
if (inet->num == hnum && !sk->sk_node.next &&
(!inet->rcv_saddr || inet->rcv_saddr == daddr) &&
(sk->sk_family == PF_INET || !ipv6_only_sock(sk)) &&
!sk->sk_bound_dev_if && net_eq(sock_net(sk), net))
goto sherry_cache;
sk = inet_lookup_listener_slow(net, head, daddr, hnum, dif);
}
if (sk) {
sherry_cache:
sock_hold(sk);
}
read_unlock(&hashinfo->lhash_lock);
return sk;
}
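This function needs little explanation. It finds the sock that was already created and put into the listening state, as described in Part 6. The open questions left at the end of that article are now tied up.
Back in tcp_v4_rcv(), we continue reading the code: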
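The two-stage lookup (established connections first, listeners second) can be modelled in a few lines. This is a simplified sketch under assumptions: linear arrays replace the kernel's hash chains, struct demo_sock and the demo_lookup functions are hypothetical, and device binding, namespaces and TIME_WAIT sockets are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_sock {
    uint32_t laddr;  /* local address; 0 means wildcard (listeners only) */
    uint16_t lport;  /* local port */
    uint32_t raddr;  /* remote address, used by established socks */
    uint16_t rport;  /* remote port, used by established socks */
};

/* Established socks must match the full four-tuple exactly. */
const struct demo_sock *demo_lookup_established(const struct demo_sock *tbl,
        size_t n, uint32_t saddr, uint16_t sport,
        uint32_t daddr, uint16_t dport)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].raddr == saddr && tbl[i].rport == sport &&
            tbl[i].laddr == daddr && tbl[i].lport == dport)
            return &tbl[i];
    return NULL;
}

/* Listeners match on local port, and on local address unless it is a wildcard. */
const struct demo_sock *demo_lookup_listener(const struct demo_sock *tbl,
        size_t n, uint32_t daddr, uint16_t dport)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].lport == dport &&
            (tbl[i].laddr == 0 || tbl[i].laddr == daddr))
            return &tbl[i];
    return NULL;
}

/* Same shape as __inet_lookup(): try established, then fall back to listeners. */
const struct demo_sock *demo_lookup(const struct demo_sock *est, size_t n_est,
        const struct demo_sock *lst, size_t n_lst,
        uint32_t saddr, uint16_t sport, uint32_t daddr, uint16_t dport)
{
    const struct demo_sock *sk =
        demo_lookup_established(est, n_est, saddr, sport, daddr, dport);
    return sk ? sk : demo_lookup_listener(lst, n_lst, daddr, dport);
}
```

A SYN from a new client matches nothing in the established table and ends up at the listener, which is the situation in our trace.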
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
nf_reset(skb);
if (sk_filter(sk, skb))
goto discard_and_relse;
skb->dev = NULL;
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
struct tcp_sock *tp = tcp_sk(sk);
if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
tp->ucopy.dma_chan = get_softnet_dma();
if (tp->ucopy.dma_chan)
ret = tcp_v4_do_rcv(sk, skb);
else
#endif
{
if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
}
} else
sk_add_backlog(sk, skb);
bh_unlock_sock(sk);
sock_put(sk);
return ret;
no_tcp_socket:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard_it;
if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
bad_packet:
TCP_INC_STATS_BH(TCP_MIB_INERRS);
} else {
tcp_v4_send_reset(NULL, skb);
}
discard_it:
/* Discard frame. */
kfree_skb(skb);
return 0;
discard_and_relse:
sock_put(sk);
goto discard_it;
do_time_wait:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
inet_twsk_put(inet_twsk(sk));
goto discard_it;
}
if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
TCP_INC_STATS_BH(TCP_MIB_INERRS);
inet_twsk_put(inet_twsk(sk));
goto discard_it;
}
switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
&tcp_hashinfo,
iph->daddr, th->dest,
inet_iif(skb));
if (sk2) {
inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
inet_twsk_put(inet_twsk(sk));
sk = sk2;
goto process;
}
/* Fall through to ACK */
}
case TCP_TW_ACK:
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
goto no_tcp_socket;
case TCP_TW_SUCCESS:;
}
goto discard_it;
}
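Next the code checks whether the server-side sock is in the TCP_TIME_WAIT state. If it is, the code jumps to the do_time_wait label, where the segment is handled by the TIME_WAIT logic. We will not follow that path, and we also skip xfrm4_policy_check(), the IPsec policy check, as we did in Part 8;
its detailed analysis is left for later work. Next comes the key step: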
if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
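tcp_prequeue() is tried first; it may link the packet onto the prequeue in the tcp_sock structure. When it does not queue the packet and returns 0,
we enter tcp_v4_do_rcv():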
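The choice between backlog, prequeue and direct processing in tcp_v4_rcv() can be summarized as a small decision function. This is a sketch under assumptions: demo_rx_path() and its enum are hypothetical, reader_waiting stands for "tcp_prequeue() accepted the packet" (assumed here to require a task blocked in receive), and the CONFIG_NET_DMA branch is ignored.

```c
#include <assert.h>

enum demo_path {
    DEMO_BACKLOG,   /* a process holds the socket lock: defer via sk_add_backlog() */
    DEMO_PREQUEUE,  /* tcp_prequeue() took the packet for the waiting reader */
    DEMO_DIRECT     /* process right now in tcp_v4_do_rcv() */
};

/* Mirror the if/else ladder: the lock owner check comes first,
 * then the prequeue attempt, then direct processing. */
enum demo_path demo_rx_path(int owned_by_user, int reader_waiting)
{
    if (owned_by_user)
        return DEMO_BACKLOG;
    if (reader_waiting)
        return DEMO_PREQUEUE;
    return DEMO_DIRECT;
}
```

In our trace the listening sock is not locked by a process and the packet is not taken by the prequeue, so the DEMO_DIRECT case applies. The kernel's own tcp_v4_do_rcv() is: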
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
struct sock *rsk;
#ifdef CONFIG_TCP_MD5SIG
/*
* We really want to reject the packet as early as possible
* if:
* o We're expecting an MD5'd packet and this is no MD5 tcp option
* o There is an MD5 option and we're not expecting one
*/
if (tcp_v4_inbound_md5_hash(sk, skb))
goto discard;
#endif
if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
TCP_CHECK_TIMER(sk);
if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
rsk = sk;
goto reset;
}
TCP_CHECK_TIMER(sk);
return 0;
}
if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
goto csum_err;
if (sk->sk_state == TCP_LISTEN) {
struct sock *nsk = tcp_v4_hnd_req(sk, skb);
if (!nsk)
goto discard;
if (nsk != sk) {
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
}
return 0;
}
}
TCP_CHECK_TIMER(sk);
if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
rsk = sk;
goto reset;
}
TCP_CHECK_TIMER(sk);
return 0;
reset:
tcp_v4_send_reset(rsk, skb);
discard:
kfree_skb(skb);
/* Be careful here. If this function gets more complicated and
* gcc suffers from register pressure on the x86, sk (in %ebx)
* might be destroyed here. This current version compiles correctly,
* but you have been warned.
*/
return 0;
csum_err:
TCP_INC_STATS_BH(TCP_MIB_INERRS);
goto discard;
}
Because the sock we found is in the listening state, the part of the code above that matters to us is this block:
if (sk->sk_state == TCP_LISTEN) {
struct sock *nsk = tcp_v4_hnd_req(sk, skb);
if (!nsk)
goto discard;
if (nsk != sk) {
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
}
return 0;
}
}
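tcp_v4_hnd_req() is an important function that we will discuss in the next part. Since we are in the handshake stage, it looks up the sock for this packet again. If it returns the listening sock itself, control falls through to the tcp_rcv_state_process() call further down in tcp_v4_do_rcv(); if the sock it returns differs from the original one, tcp_child_process() is entered: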
int tcp_child_process(struct sock *parent, struct sock *child,
struct sk_buff *skb)
{
int ret = 0;
int state = child->sk_state;
if (!sock_owned_by_user(child)) {
ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
skb->len);
/* Wakeup parent, send SIGIO */
if (state == TCP_SYN_RECV && child->sk_state != state)
parent->sk_data_ready(parent, 0);
} else {
/* Alas, it is possible again, because we do lookup
* in main socket hash table and lock on listening
* socket does not protect us more.
*/
sk_add_backlog(child, skb);
}
bh_unlock_sock(child);
sock_put(child);
return ret;
}
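Note that sock_owned_by_user() checks whether the sock is currently locked by a process in user context. Nothing on our server side holds the lock at this point, so tcp_rcv_state_process() is executed. The function is long, so we only look at the part relevant to our trace: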
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
struct tcphdr *th, unsigned len)
{
......
case TCP_LISTEN:
if (th->ack)
return 1;
if (th->rst)
goto discard;
if (th->syn) {
if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
return 1;
/* Now we have several options: In theory there is
* nothing else in the frame. KA9Q has an option to
* send data with the syn, BSD accepts data with the
* syn up to the [to be] advertised window and
* Solaris 2.1 gives you a protocol error. For now
* we just ignore it, that fits the spec precisely
* and avoids incompatibilities. It would be nice in
* future to drop through and process the data.
*
* Now that TTCP is starting to be used we ought to
* queue this data.
* But, this leaves one open to an easy denial of
* service attack, and SYN cookies can't defend
* against this problem. So, we drop the data
* in the interest of security over speed unless
* it's still in use.
*/
kfree_skb(skb);
return 0;
}
......
}
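The flag tests in the TCP_LISTEN case amount to a small priority ladder. The sketch below is an illustration only: struct demo_flags, the verdict enum and demo_listen_input() are hypothetical, and the final fallback (discard when none of the three flags is set) is an assumption, since that part lies in the elided code.

```c
#include <assert.h>

enum demo_verdict {
    DEMO_DISCARD,       /* drop the segment silently */
    DEMO_SEND_RESET,    /* corresponds to "return 1": the caller sends a RST */
    DEMO_CONN_REQUEST   /* hand the SYN to the conn_request() hook */
};

struct demo_flags {
    unsigned syn : 1;
    unsigned ack : 1;
    unsigned rst : 1;
};

/* Order matters: ACK is tested before RST, and RST before SYN. */
enum demo_verdict demo_listen_input(struct demo_flags f)
{
    if (f.ack)
        return DEMO_SEND_RESET;
    if (f.rst)
        return DEMO_DISCARD;
    if (f.syn)
        return DEMO_CONN_REQUEST;
    return DEMO_DISCARD;  /* assumed fallback, not visible in the excerpt */
}
```

A segment carrying both SYN and ACK is answered with a reset, because a listening sock has sent nothing that could be acknowledged; only a plain SYN reaches conn_request().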
We know the client sent a SYN packet, so icsk->icsk_af_ops->conn_request(sk, skb) is executed. When we analyzed server-side socket creation in
http://blog.chinaunix.net/u2/64681/showart_1360583.html
we saw the following assignment in tcp_v4_init_sock():
icsk->icsk_af_ops = &ipv4_specific;
In other words, the conn_request() hook of the ipv4_specific structure is executed:
struct inet_connection_sock_af_ops ipv4_specific = {
.queue_xmit = ip_queue_xmit,
.send_check = tcp_v4_send_check,
.rebuild_header = inet_sk_rebuild_header,
.conn_request = tcp_v4_conn_request,
.syn_recv_sock = tcp_v4_syn_recv_sock,
.remember_stamp = tcp_v4_remember_stamp,
.net_header_len = sizeof(struct iphdr),
.setsockopt = ip_setsockopt,
.getsockopt = ip_getsockopt,
.addr2sockaddr = inet_csk_addr2sockaddr,
.sockaddr_len = sizeof(struct sockaddr_in),
.bind_conflict = inet_csk_bind_conflict,
#ifdef CONFIG_COMPAT
.compat_setsockopt = compat_ip_setsockopt,
.compat_getsockopt = compat_ip_getsockopt,
#endif
};
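The hook mechanism is ordinary function-pointer dispatch through a table of operations. Here is a minimal sketch, assuming simplified types: the demo_ structures take plain ints where the kernel passes struct sock and struct sk_buff pointers, and only the conn_request member is modelled.

```c
#include <assert.h>

/* Hypothetical stand-in for struct inet_connection_sock_af_ops. */
struct demo_af_ops {
    int (*conn_request)(int sk, int skb);
};

/* Counts calls so the dispatch can be observed from outside. */
int demo_v4_calls;

/* Stand-in for tcp_v4_conn_request(): records the call and reports success. */
int demo_v4_conn_request(int sk, int skb)
{
    (void)sk;
    (void)skb;
    demo_v4_calls++;
    return 0;
}

/* Counterpart of ipv4_specific: the IPv4 table of operations. */
const struct demo_af_ops demo_ipv4_specific = {
    .conn_request = demo_v4_conn_request,
};

/* Stand-in for the connection sock, which only stores a pointer to the table. */
struct demo_icsk {
    const struct demo_af_ops *af_ops;
};

/* Same shape as the LISTEN-state code: a negative result asks for a reset. */
int demo_handle_syn(struct demo_icsk *icsk, int sk, int skb)
{
    if (icsk->af_ops->conn_request(sk, skb) < 0)
        return 1;
    return 0;
}
```

Because the caller only sees the table pointer, the same tcp_rcv_state_process() code can serve another address family simply by pointing the sock at a different table.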
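Clearly this enters tcp_v4_conn_request(). For lack of time, we will continue tracing that function tomorrow.
Source: CU community: TCP Tracing Analysis in the Kernel, Part 16: The TCP (IPv4) Client/Server Socket Connection Process (3)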