NAME
ip - Linux IPv4 protocol implementation
SYNOPSIS
#include <sys/socket.h>
#include <netinet/in.h>
tcp_socket = socket(PF_INET, SOCK_STREAM, 0);
raw_socket = socket(PF_INET, SOCK_RAW, protocol);
udp_socket = socket(PF_INET, SOCK_DGRAM, protocol);
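DESCRIPTION
Linux implements the Internet Protocol, version 4, described in RFC791 and RFC1122. ip contains a level 2 multicasting implementation conforming to RFC1112. It also contains an IP router including a packet filter.
The programmer's interface is compatible with BSD sockets. For more information on sockets, see socket(7).
An IP socket is created by calling the socket(2) function as socket(PF_INET, socket_type, protocol). Valid socket types are SOCK_STREAM to open a tcp(7) socket, SOCK_DGRAM to open a udp(7) socket, or SOCK_RAW to open a raw(7) socket to access the IP protocol directly. protocol is the IP protocol in the IP header to be received or sent. The only valid values for protocol are 0 and IPPROTO_TCP for TCP sockets, and 0 and IPPROTO_UDP for UDP sockets. For SOCK_RAW you may specify a valid IANA IP protocol number as defined in RFC1700.
When a process wants to receive new incoming packets or connections, it should bind a socket to a local interface address using bind(2). Only one IP socket may be bound to any given local (address, port) pair. When INADDR_ANY is specified in the bind call, the socket is bound to all local interfaces. When listen(2) or connect(2) is called on an unbound socket, the socket is automatically bound to a random free port with the local address set to INADDR_ANY.
A TCP local socket address that has been bound is unavailable for some time after closing, unless the SO_REUSEADDR flag has been set. Care should be taken when using this flag, as it makes TCP less reliable.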
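ADDRESS FORMAT
An IP socket address is defined as a combination of an IP interface address and a port number. The basic IP protocol does not supply port numbers; they are implemented by higher level protocols like udp(7) and tcp(7). On raw sockets sin_port is set to the IP protocol.

    struct sockaddr_in {
        sa_family_t    sin_family; /* address family: AF_INET */
        u_int16_t      sin_port;   /* port in network byte order */
        struct in_addr sin_addr;   /* internet address */
    };

    /* Internet address. */
    struct in_addr {
        u_int32_t      s_addr;     /* address in network byte order */
    };

sin_family is always set to AF_INET. This is required; in Linux 2.2 most networking functions return EINVAL when this setting is missing. sin_port contains the port in network byte order. Port numbers below 1024 are called reserved ports. Only processes with effective user ID 0 or the CAP_NET_BIND_SERVICE capability may bind(2) to these ports. Note that the raw IPv4 protocol as such has no concept of a port; ports are only implemented by higher protocols like tcp(7) and udp(7).
sin_addr is the IP host address. The s_addr member of struct in_addr contains the host interface address in network byte order. in_addr should only be accessed using the inet_aton(3), inet_addr(3) and inet_makeaddr(3) library functions, or directly with the name resolver (see gethostbyname(3)). IPv4 addresses are divided into unicast, broadcast and multicast addresses. Unicast addresses specify a single interface of a host, broadcast addresses specify all hosts on a network, and multicast addresses address all hosts in a multicast group. Datagrams to broadcast addresses can only be sent or received when the SO_BROADCAST socket flag is set. In the current implementation, connection oriented sockets are only allowed to use unicast addresses.
Note that the address and the port are always stored in network byte order. In particular, this means that you need to call htons(3) on the number that is assigned to a port. All address/port manipulation functions in the standard library work in network byte order.
There are several special addresses: INADDR_LOOPBACK (127.0.0.1) always refers to the local host via the loopback device; INADDR_ANY (0.0.0.0) means any address for binding; INADDR_BROADCAST (255.255.255.255) means any host and, for historical reasons, has the same effect on bind as INADDR_ANY.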
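SOCKET OPTIONS
IP supports some protocol specific socket options that can be set with setsockopt(2) and read with getsockopt(2). The socket option level for IP is SOL_IP.
- IP_OPTIONS
- Sets or gets the IP options to be sent with every packet from this socket. The arguments are a pointer to a memory buffer containing the options and the option length. The setsockopt(2) call sets the IP options associated with a socket. The maximum option size for IPv4 is 40 bytes. See RFC791 for the allowed options. When the initial connection request packet for a SOCK_STREAM socket contains IP options, the IP options are set automatically to the options from the initial packet with routing headers reversed. Incoming packets are not allowed to change options after the connection is established. The processing of all incoming source routing options is disabled by default and can be enabled with the accept_source_route sysctl. Other options such as timestamps are still handled. For datagram sockets, IP options can only be set by the local user. Calling getsockopt(2) with IP_OPTIONS puts the current IP options used for sending into the supplied buffer.
- IP_PKTINFO
- Pass an IP_PKTINFO ancillary message that contains a pktinfo structure that supplies some information about the incoming packet. This only works for datagram oriented sockets.

    struct in_pktinfo {
        unsigned int   ipi_ifindex;  /* Interface index */
        struct in_addr ipi_spec_dst; /* Routing destination address */
        struct in_addr ipi_addr;     /* Header destination address */
    };

- ipi_ifindex is the unique index of the interface the packet was received on. ipi_spec_dst is the destination address of the routing table entry and ipi_addr is the destination address in the packet header. If IP_PKTINFO is passed to sendmsg(2), the outgoing packet is sent over the interface specified in ipi_ifindex with the destination address set to ipi_spec_dst.
- IP_RECVTOS
- If enabled, the IP_TOS ancillary message is passed with incoming packets. It contains a byte which specifies the Type of Service/Precedence field of the packet header. Expects a boolean integer flag.
- IP_RECVTTL
- When this flag is set, pass an IP_RECVTTL control message with the time to live field of the received packet as a byte. Not supported for SOCK_STREAM sockets.
- IP_RECVOPTS
- Pass all incoming IP options to the user in an IP_OPTIONS control message. The routing header and other options are already filled in for the local host. Not supported for SOCK_STREAM sockets.
- IP_RETOPTS
- Identical to IP_RECVOPTS, but returns raw unprocessed options with timestamp and route record options not filled in for this hop.
- IP_TOS
- Set or receive the Type-Of-Service (TOS) field that is sent with every IP packet originating from this socket. It is used to prioritize packets on the network. TOS is a byte. There are some standard TOS flags defined: IPTOS_LOWDELAY to minimize delays for interactive traffic, IPTOS_THROUGHPUT to optimize throughput, IPTOS_RELIABILITY to optimize for reliability, and IPTOS_MINCOST, which should be used for "filler data" where slow transmission doesn't matter. At most one of these TOS values can be specified. Other bits are invalid and shall be cleared. Linux sends IPTOS_LOWDELAY datagrams first by default, but the exact behaviour depends on the configured queueing discipline. Some high priority levels may require an effective user ID of 0 or the CAP_NET_ADMIN capability. The priority can also be set in a protocol independent way by the (SOL_SOCKET, SO_PRIORITY) socket option (see socket(7)).
- IP_TTL
- Set or retrieve the current time to live field that is sent in every packet sent from this socket.
- IP_HDRINCL
- If enabled, the user supplies an IP header in front of the user data. Only valid for SOCK_RAW sockets. See raw(7) for more information. When this flag is enabled, the values set by IP_OPTIONS and IP_TOS are ignored.
- IP_RECVERR
- Enable extended reliable error message passing. When enabled on a datagram socket, all generated errors are queued in a per-socket error queue. When the user receives an error from a socket operation, the errors can be received by calling recvmsg(2) with the MSG_ERRQUEUE flag set. The sock_extended_err structure describing the error is passed in an ancillary message with the type IP_RECVERR and the level SOL_IP. This is useful for reliable error handling on unconnected sockets. The received data portion of the error queue contains the error packet.
- IP uses the sock_extended_err structure as follows: ee_origin is set to SO_EE_ORIGIN_ICMP for errors received as an ICMP packet, or SO_EE_ORIGIN_LOCAL for locally generated errors. ee_type and ee_code are set from the type and code fields of the ICMP header. ee_info contains the discovered MTU for EMSGSIZE errors. ee_data is currently unused. When the error originated from the network, all IP options (IP_OPTIONS, IP_TTL, etc.) enabled on the socket and contained in the error packet are passed as control messages. The payload of the packet causing the error is returned as normal data.
- On SOCK_STREAM sockets, IP_RECVERR has slightly different semantics. Instead of saving the errors for the next timeout, it passes all incoming errors immediately to the user. This may be useful for very short-lived TCP connections that need fast error handling. Use this option with care: it makes TCP unreliable by not allowing it to recover properly from routing shifts and other normal conditions, and it breaks the protocol specification. Note that TCP has no error queue; MSG_ERRQUEUE is illegal on SOCK_STREAM sockets. Thus all errors are returned by socket function return or SO_ERROR only.
- For raw sockets, IP_RECVERR enables passing of all received ICMP errors to the application; otherwise errors are only reported on connected sockets.
- It sets or retrieves an integer boolean flag. IP_RECVERR defaults to off.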
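- IP_MTU_DISCOVER
- Sets or receives the Path MTU Discovery setting for a socket. When enabled, Linux performs Path MTU Discovery as defined in RFC1191 on this socket. The don't fragment flag is set on all outgoing datagrams. The system-wide default is controlled by the ip_no_pmtu_disc sysctl for SOCK_STREAM sockets, and disabled on all others. For non SOCK_STREAM sockets it is the user's responsibility to packetize the data in MTU sized chunks and to do the retransmits if necessary. The kernel rejects packets that are bigger than the known path MTU (with EMSGSIZE) if this flag is set.

    Path MTU discovery flags    Meaning
    IP_PMTUDISC_WANT            Use per-route settings.
    IP_PMTUDISC_DONT            Never do Path MTU Discovery.
    IP_PMTUDISC_DO              Always do Path MTU Discovery.

When PMTU discovery is enabled, the kernel automatically keeps track of the path MTU per destination host. When the socket is connected to a specific peer with connect(2), the currently known path MTU can be retrieved conveniently using the IP_MTU socket option (e.g. after an EMSGSIZE error occurred). It may change over time. For connectionless sockets with many destinations, the new MTU for a given destination can also be accessed using the error queue (see IP_RECVERR). A new error is queued for every incoming MTU update.
While MTU discovery is in progress, initial packets from datagram sockets may be dropped. Applications using UDP should be aware of this and not take it into account for their packet retransmit strategy.
To bootstrap the path MTU discovery process on unconnected sockets, it is possible to start with a big datagram size (up to 64K bytes minus headers) and let it shrink by updates of the path MTU.
To get an initial estimate of the path MTU, connect a datagram socket to the destination address using connect(2) and retrieve the MTU by calling getsockopt(2) with the IP_MTU option.
- IP_MTU
- Retrieve the currently known path MTU of the current socket. Only valid when the socket has been connected. Returns an integer. Only valid as a getsockopt(2).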
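- IP_ROUTER_ALERT
- Pass all to-be-forwarded packets with the IP Router Alert option set to this socket. Only valid for raw sockets. This is useful, for instance, for user space RSVP daemons. The tapped packets are not forwarded by the kernel; it is the user's responsibility to send them out again. Socket binding is ignored; such packets are only filtered by protocol. Expects an integer flag.
- IP_MULTICAST_TTL
- Sets or reads the time-to-live value of outgoing multicast packets for this socket. It is very important for multicast packets to set the smallest TTL possible. The default is 1, which means that multicast packets don't leave the local network unless the user program explicitly requests it. Argument is an integer.
- IP_MULTICAST_LOOP
- Sets or reads a boolean integer argument that determines whether sent multicast packets should be looped back to the local sockets.
- IP_ADD_MEMBERSHIP
- Join a multicast group. Argument is a struct ip_mreqn structure.

    struct ip_mreqn {
        struct in_addr imr_multiaddr; /* IP multicast group address */
        struct in_addr imr_address;   /* IP address of local interface */
        int            imr_ifindex;   /* interface index */
    };

- imr_multiaddr contains the address of the multicast group the application wants to join or leave. It must be a valid multicast address. imr_address is the address of the local interface with which the system should join the multicast group; if it is equal to INADDR_ANY, an appropriate interface is chosen by the system. imr_ifindex is the interface index of the interface that should join/leave the imr_multiaddr group, or 0 to indicate any interface.
- For compatibility, the old ip_mreq structure is still supported. It differs from ip_mreqn only by not including the imr_ifindex field. Only valid as a setsockopt(2).
- IP_DROP_MEMBERSHIP
- Leave a multicast group. Argument is an ip_mreqn or ip_mreq structure similar to IP_ADD_MEMBERSHIP.
- IP_MULTICAST_IF
- Set the local device for a multicast socket. Argument is an ip_mreqn or ip_mreq structure similar to IP_ADD_MEMBERSHIP.
- When an invalid socket option is passed, ENOPROTOOPT is returned.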
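SYSCTLS
The IP protocol supports the sysctl interface to configure some global options. The sysctls can be accessed by reading or writing the /proc/sys/net/ipv4/* files or by using the sysctl(2) interface.
- ip_default_ttl
- Set the default time-to-live value of outgoing packets. This can be changed per socket with the IP_TTL option.
- ip_forward
- Enable IP forwarding with a boolean flag. IP forwarding can also be set on a per interface basis.
- ip_dynaddr
- Enable dynamic socket address and masquerading entry rewriting on interface address change. This is useful for dialup interfaces with changing IP addresses. 0 means no rewriting, 1 turns it on and 2 enables verbose mode.
- ip_autoconfig
- Not documented.
- ip_local_port_range
- Contains two integers that define the default local port range allocated to sockets. Allocation starts with the first number and ends with the second number. Note that these should not conflict with the ports used by masquerading (although the case is handled). Also, arbitrary choices may cause problems with some firewall packet filters that make assumptions about the local ports in use. The first number should be greater than 1024, preferably greater than 4096, to avoid clashes with well known ports and to minimize firewall problems.
- ip_no_pmtu_disc
- If enabled, don't do Path MTU Discovery for TCP sockets by default. Path MTU Discovery may fail if misconfigured firewalls (that drop all ICMP packets) or misconfigured interfaces (e.g., a point-to-point link where both ends don't agree on the MTU) are on the path. It is better to fix the broken routers on the path than to turn off Path MTU Discovery globally, because not doing it incurs a high cost to the network.
- ipfrag_high_thresh, ipfrag_low_thresh
- If the amount of queued IP fragments reaches ipfrag_high_thresh, the queue is pruned down to ipfrag_low_thresh. Each contains an integer with the number of bytes.
- ip_always_defrag
- [New with kernel 2.2.13; in earlier kernel versions the feature was controlled at compile time by the CONFIG_IP_ALWAYS_DEFRAG option]
When this boolean flag is enabled (not equal to 0), incoming fragments (parts of IP packets that arose when some host between origin and destination decided that the packets were too large and cut them into pieces) are reassembled (defragmented) before being processed, even if they are about to be forwarded.
Only enable this when running either a firewall that is the sole link to your network or a transparent proxy; never turn it on for a normal router or host. Otherwise fragmented communication may be disturbed when the fragments travel over different links. Defragmentation also has a large memory and CPU time cost.
This is automatically turned on when masquerading or transparent proxying is configured.
- neigh/*
- See arp(7).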
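IOCTLS
All ioctls described in socket(7) apply to ip.
The ioctls to configure firewalling are documented in ipfw(4) from the ipchains package.
Ioctls to configure generic device parameters are described in netdevice(7).
NOTES
Be very careful with the SO_BROADCAST option: it is not privileged in Linux. It is easy to overload the network with careless broadcasts. For new application protocols it is better to use a multicast group instead of broadcasting. Broadcasting is discouraged.
Some other BSD sockets implementations provide IP_RCVDSTADDR and IP_RECVIF socket options to get the destination address and the interface of received datagrams. Linux has the more general IP_PKTINFO for the same task.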
ERRORS
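- ENOTCONN
- The operation is only defined on a connected socket, but the socket wasn't connected.
- EINVAL
- Invalid argument passed. For send operations this can be caused by sending to a blackhole route.
- EMSGSIZE
- Datagram is bigger than an MTU on the path and it cannot be fragmented.
- EACCES
- The user tried to execute an operation without the necessary permissions. These include: sending a packet to a broadcast address without having the SO_BROADCAST flag set; sending a packet via a prohibit route; modifying firewall settings without CAP_NET_ADMIN or effective user ID 0; binding to a reserved port without the CAP_NET_BIND_SERVICE capability or effective user ID 0.
- EADDRINUSE
- Tried to bind to an address already in use.
- ENOMEM and ENOBUFS
- Not enough memory available.
- ENOPROTOOPT and EOPNOTSUPP
- Invalid socket option passed.
- EPERM
- User doesn't have permission to set high priority, change configuration, or send signals to the requested process or group.
- EADDRNOTAVAIL
- A non-existent interface was requested or the requested source address was not local.
- EAGAIN
- Operation on a non-blocking socket would block.
- ESOCKTNOSUPPORT
- The socket is not configured or an unknown socket type was requested.
- EISCONN
- connect(2) was called on an already connected socket.
- EALREADY
- A connection operation on a non-blocking socket is already in progress.
- ECONNABORTED
- A connection was closed during an accept(2).
- EPIPE
- The connection was unexpectedly closed or shut down by the other end.
- ENOENT
- SIOCGSTAMP was called on a socket where no packet arrived.
- EHOSTUNREACH
- No valid routing table entry matches the destination address. This error can be caused by an ICMP message from a remote router or by the local routing table.
- ENODEV
- Network device not available or not capable of sending IP.
- ENOPKG
- A kernel subsystem was not configured.
- ENOBUFS, ENOMEM
- Not enough free memory. This often means that the memory allocation is limited by the socket buffer limits, not by the system memory, but this is not 100% consistent.
Other errors may be generated by the overlaying protocols; see tcp(7), raw(7), udp(7) and socket(7).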
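VERSIONS
IP_PKTINFO, IP_MTU, IP_MTU_DISCOVER, IP_RECVERR and IP_ROUTER_ALERT are new options in Linux 2.2.
struct ip_mreqn is new in Linux 2.2. Linux 2.0 only supported ip_mreq.
The sysctls were introduced with Linux 2.2.
COMPATIBILITY
For compatibility with Linux 2.0, the obsolete socket(PF_INET, SOCK_RAW, protocol) syntax is still supported to open a packet(7) socket. This is deprecated and should be replaced by socket(PF_PACKET, SOCK_RAW, protocol). The main difference is the new sockaddr_ll address structure for generic link layer information, which replaces the old sockaddr_pkt.
BUGS
There are too many inconsistent error values.
The ioctls to configure IP-specific interface options and ARP tables are not described.
AUTHORS
This man page was written by Andi Kleen.
SEE ALSO
sendmsg(2), recvmsg(2), socket(7), netlink(7), tcp(7), udp(7), raw(7), ipfw(4).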
NAME
ip - Linux IPv4 protocol implementation
SYNOPSIS
#include <sys/socket.h>
#include <netinet/in.h>
tcp_socket = socket(PF_INET, SOCK_STREAM, 0);
raw_socket = socket(PF_INET, SOCK_RAW, protocol);
udp_socket = socket(PF_INET, SOCK_DGRAM, protocol);
DESCRIPTION
Linux implements the Internet Protocol, version 4, described in RFC791 and RFC1122. ip contains a level 2 multicasting implementation conforming to RFC1112. It also contains an IP router including a packet filter.
The programmer's interface is BSD sockets compatible. For more information on sockets, see socket(7).
An IP socket is created by calling the socket(2) function as socket(PF_INET, socket_type, protocol). Valid socket types are SOCK_STREAM to open a tcp(7) socket, SOCK_DGRAM to open a udp(7) socket, or SOCK_RAW to open a raw(7) socket to access the IP protocol directly. protocol is the IP protocol in the IP header to be received or sent. The only valid values for protocol are 0 and IPPROTO_TCP for TCP sockets and 0 and IPPROTO_UDP for UDP sockets. For SOCK_RAW you may specify a valid IANA IP protocol defined in RFC1700 assigned numbers.
When a process wants to receive new incoming packets or connections, it should bind a socket to a local interface address using bind(2). Only one IP socket may be bound to any given local (address, port) pair. When INADDR_ANY is specified in the bind call, the socket is bound to all local interfaces. When listen(2) or connect(2) is called on an unbound socket, the socket is automatically bound to a random free port with the local address set to INADDR_ANY.
A TCP local socket address that has been bound is unavailable for some time after closing, unless the SO_REUSEADDR flag has been set. Care should be taken when using this flag as it makes TCP less reliable.
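The bind and listen steps described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names open_listener and local_port and the backlog of 16 are assumptions made for the example. Passing port 0 asks the kernel to choose a free port.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP socket listening on `port` on all
 * local interfaces. Returns the descriptor, or -1 on failure. */
int open_listener(unsigned short port)
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Allow the address to be rebound shortly after a previous close. */
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);              /* 0 lets the kernel pick */
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* all local interfaces */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: local port of `fd` in host byte order, 0 on error. */
unsigned short local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

A caller would typically follow open_listener() with accept(2) in a loop; local_port() is only needed when the kernel chose the port.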
ADDRESS FORMAT
An IP socket address is defined as a combination of an IP interface address and a port number. The basic IP protocol does not supply port numbers; they are implemented by higher level protocols like udp(7) and tcp(7). On raw sockets sin_port is set to the IP protocol.

    struct sockaddr_in {
        sa_family_t    sin_family; /* address family: AF_INET */
        u_int16_t      sin_port;   /* port in network byte order */
        struct in_addr sin_addr;   /* internet address */
    };

    /* Internet address. */
    struct in_addr {
        u_int32_t      s_addr;     /* address in network byte order */
    };

sin_family is always set to AF_INET. This is required; in Linux 2.2 most networking functions return EINVAL when this setting is missing. sin_port contains the port in network byte order. Port numbers below 1024 are called reserved ports. Only processes with effective user ID 0 or the CAP_NET_BIND_SERVICE capability may bind(2) to these ports. Note that the raw IPv4 protocol as such has no concept of a port; ports are only implemented by higher protocols like tcp(7) and udp(7).
sin_addr is the IP host address. The s_addr member of struct in_addr contains the host interface address in network byte order. in_addr should only be accessed using the inet_aton(3), inet_addr(3) and inet_makeaddr(3) library functions, or directly with the name resolver (see gethostbyname(3)). IPv4 addresses are divided into unicast, broadcast and multicast addresses. Unicast addresses specify a single interface of a host, broadcast addresses specify all hosts on a network, and multicast addresses address all hosts in a multicast group. Datagrams to broadcast addresses can only be sent or received when the SO_BROADCAST socket flag is set. In the current implementation, connection oriented sockets are only allowed to use unicast addresses.
Note that the address and the port are always stored in network order. In particular, this means that you need to call htons(3) on the number that is assigned to a port. All address/port manipulation functions in the standard library work in network order.
There are several special addresses: INADDR_LOOPBACK (127.0.0.1) always refers to the local host via the loopback device; INADDR_ANY (0.0.0.0) means any address for binding; INADDR_BROADCAST (255.255.255.255) means any host and has the same effect on bind as INADDR_ANY for historical reasons.
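As an illustration of the byte order rules above, the following sketch fills a sockaddr_in from a dotted-quad string and a host-order port. The helper name make_addr is hypothetical, and inet_pton(3) is used here in place of inet_aton(3) purely as a matter of preference.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: build an IPv4 socket address.
 * `port` is given in host byte order and converted with htons().
 * Returns 0 on success, -1 if `dotted` is not a valid IPv4 address. */
int make_addr(struct sockaddr_in *sa, const char *dotted, unsigned short port)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;      /* required by the networking calls */
    sa->sin_port = htons(port);    /* stored in network byte order */

    /* inet_pton() writes the address in network byte order. */
    return inet_pton(AF_INET, dotted, &sa->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure can be passed, cast to struct sockaddr *, to bind(2), connect(2) or sendto(2).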
SOCKET OPTIONS
IP supports some protocol specific socket options that can be set with setsockopt(2) and read with getsockopt(2). The socket option level for IP is SOL_IP. A boolean integer flag is zero when it is false, otherwise true.
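A minimal sketch of setting and reading an integer-valued option at level SOL_IP is shown below. The wrapper names set_ip_option and get_ip_option are assumptions made for the example; they only forward to setsockopt(2) and getsockopt(2).

```c
#define _GNU_SOURCE            /* for SOL_IP in <netinet/in.h> */
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical wrapper: set an integer-valued SOL_IP option.
 * Returns 0 on success, -1 with errno set by setsockopt(2). */
int set_ip_option(int fd, int optname, int value)
{
    return setsockopt(fd, SOL_IP, optname, &value, sizeof(value));
}

/* Hypothetical wrapper: read an integer-valued SOL_IP option.
 * Returns 0 on success, -1 with errno set by getsockopt(2). */
int get_ip_option(int fd, int optname, int *value)
{
    socklen_t len = sizeof(*value);
    return getsockopt(fd, SOL_IP, optname, value, &len);
}
```

For example, set_ip_option(fd, IP_TTL, 42) followed by get_ip_option(fd, IP_TTL, &ttl) should read back the value just written.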
- IP_OPTIONS
- Sets or gets the IP options to be sent with every packet from this socket. The arguments are a pointer to a memory buffer containing the options and the option length. The setsockopt(2) call sets the IP options associated with a socket. The maximum option size for IPv4 is 40 bytes. See RFC791 for the allowed options. When the initial connection request packet for a SOCK_STREAM socket contains IP options, the IP options are set automatically to the options from the initial packet with routing headers reversed. Incoming packets are not allowed to change options after the connection is established. The processing of all incoming source routing options is disabled by default and can be enabled with the accept_source_route sysctl. Other options such as timestamps are still handled. For datagram sockets, IP options can only be set by the local user. Calling getsockopt(2) with IP_OPTIONS puts the current IP options used for sending into the supplied buffer.
- IP_PKTINFO
- Pass an IP_PKTINFO ancillary message that contains a pktinfo structure that supplies some information about the incoming packet. This only works for datagram oriented sockets. The argument is a flag that tells the socket whether the IP_PKTINFO message should be passed or not. The message itself can only be sent/retrieved as control message with a packet using recvmsg(2) or sendmsg(2).

    struct in_pktinfo {
        unsigned int   ipi_ifindex;  /* Interface index */
        struct in_addr ipi_spec_dst; /* Local address */
        struct in_addr ipi_addr;     /* Header Destination address */
    };

- ipi_ifindex is the unique index of the interface the packet was received on. ipi_spec_dst is the local address of the packet and ipi_addr is the destination address in the packet header. If IP_PKTINFO is passed to sendmsg(2) and ipi_spec_dst is not zero, then it is used as the local source address for the routing table lookup and for setting up IP source route options. When ipi_ifindex is not zero the primary local address of the interface specified by the index overwrites ipi_spec_dst for the routing table lookup.
- IP_RECVTOS
- If enabled the IP_TOS ancillary message is passed with incoming packets. It contains a byte which specifies the Type of Service/Precedence field of the packet header. Expects a boolean integer flag.
- IP_RECVTTL
- When this flag is set, pass an IP_RECVTTL control message with the time to live field of the received packet as a byte. Not supported for SOCK_STREAM sockets.
- IP_RECVOPTS
- Pass all incoming IP options to the user in an IP_OPTIONS control message. The routing header and other options are already filled in for the local host. Not supported for SOCK_STREAM sockets.
- IP_RETOPTS
- Identical to IP_RECVOPTS but returns raw unprocessed options with timestamp and route record options not filled in for this hop.
- IP_TOS
- Set or receive the Type-Of-Service (TOS) field that is sent with every IP packet originating from this socket. It is used to prioritize packets on the network. TOS is a byte. There are some standard TOS flags defined: IPTOS_LOWDELAY to minimize delays for interactive traffic, IPTOS_THROUGHPUT to optimize throughput, IPTOS_RELIABILITY to optimize for reliability, IPTOS_MINCOST should be used for "filler data" where slow transmission doesn't matter. At most one of these TOS values can be specified. Other bits are invalid and shall be cleared. Linux sends IPTOS_LOWDELAY datagrams first by default, but the exact behaviour depends on the configured queueing discipline. Some high priority levels may require an effective user id of 0 or the CAP_NET_ADMIN capability. The priority can also be set in a protocol independent way by the (SOL_SOCKET, SO_PRIORITY) socket option (see socket(7)).
- IP_TTL
- Set or retrieve the current time to live field that is sent in every packet sent from this socket.
- IP_HDRINCL
- If enabled the user supplies an ip header in front of the user data. Only valid for SOCK_RAW sockets. See raw(7) for more information. When this flag is enabled the values set by IP_OPTIONS, IP_TTL and IP_TOS are ignored.
- IP_RECVERR (defined in <linux/errqueue.h>)
- Enable extended reliable error message passing. When enabled on a datagram socket, all generated errors are queued in a per-socket error queue. When the user receives an error from a socket operation, the errors can be received by calling recvmsg(2) with the MSG_ERRQUEUE flag set. The sock_extended_err structure describing the error is passed in an ancillary message with the type IP_RECVERR and the level SOL_IP. This is useful for reliable error handling on unconnected sockets. The received data portion of the error queue contains the error packet.
- The IP_RECVERR control message contains a sock_extended_err structure:

    #define SO_EE_ORIGIN_NONE   0
    #define SO_EE_ORIGIN_LOCAL  1
    #define SO_EE_ORIGIN_ICMP   2
    #define SO_EE_ORIGIN_ICMP6  3

    struct sock_extended_err {
        u_int32_t ee_errno;   /* error number */
        u_int8_t  ee_origin;  /* where the error originated */
        u_int8_t  ee_type;    /* type */
        u_int8_t  ee_code;    /* code */
        u_int8_t  ee_pad;
        u_int32_t ee_info;    /* additional information */
        u_int32_t ee_data;    /* other data */
        /* More data may follow */
    };

    struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);

- ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated from given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined.
- IP uses the sock_extended_err structure as follows: ee_origin is set to SO_EE_ORIGIN_ICMP for errors received as an ICMP packet, or SO_EE_ORIGIN_LOCAL for locally generated errors. Unknown values should be ignored. ee_type and ee_code are set from the type and code fields of the ICMP header. ee_info contains the discovered MTU for EMSGSIZE errors. The message also contains the sockaddr_in of the node that caused the error, which can be accessed with the SO_EE_OFFENDER macro. The sin_family field of the SO_EE_OFFENDER address is AF_UNSPEC when the source was unknown. When the error originated from the network, all IP options (IP_OPTIONS, IP_TTL, etc.) enabled on the socket and contained in the error packet are passed as control messages. The payload of the packet causing the error is returned as normal payload. Note that TCP has no error queue; MSG_ERRQUEUE is illegal on SOCK_STREAM sockets. Thus all errors are returned by socket function return or SO_ERROR only.
- For raw sockets, IP_RECVERR enables passing of all received ICMP errors to the application; otherwise errors are only reported on connected sockets.
- It sets or retrieves an integer boolean flag. IP_RECVERR defaults to off.
- IP_MTU_DISCOVER
- Sets or receives the Path MTU Discovery setting for a socket. When enabled, Linux performs Path MTU Discovery as defined in RFC1191 on this socket. The don't fragment flag is set on all outgoing datagrams. The system-wide default is controlled by the ip_no_pmtu_disc sysctl for SOCK_STREAM sockets, and disabled on all others. For non SOCK_STREAM sockets it is the user's responsibility to packetize the data in MTU sized chunks and to do the retransmits if necessary. The kernel rejects packets that are bigger than the known path MTU (with EMSGSIZE) if this flag is set.

    Path MTU discovery flags    Meaning
    IP_PMTUDISC_WANT            Use per-route settings.
    IP_PMTUDISC_DONT            Never do Path MTU Discovery.
    IP_PMTUDISC_DO              Always do Path MTU Discovery.

When PMTU discovery is enabled, the kernel automatically keeps track of the path MTU per destination host. When the socket is connected to a specific peer with connect(2), the currently known path MTU can be retrieved conveniently using the IP_MTU socket option (e.g. after an EMSGSIZE error occurred). It may change over time. For connectionless sockets with many destinations, the new MTU for a given destination can also be accessed using the error queue (see IP_RECVERR). A new error is queued for every incoming MTU update.
While MTU discovery is in progress initial packets from datagram sockets may be dropped. Applications using UDP should be aware of this and not take it into account for their packet retransmit strategy.
To bootstrap the path MTU discovery process on unconnected sockets it is possible to start with a big datagram size (up to 64K-headers bytes long) and let it shrink by updates of the path MTU.
To get an initial estimate of the path MTU connect a datagram socket to the destination address using connect(2) and retrieve the MTU by calling getsockopt(2) with the IP_MTU option.
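The procedure for obtaining an initial estimate can be sketched roughly as follows. The helper name path_mtu_estimate is hypothetical, and choosing IP_PMTUDISC_DO (rather than the per-route default) is an assumption made for the example.

```c
#define _GNU_SOURCE            /* for SOL_IP, IP_MTU, IP_MTU_DISCOVER */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return the kernel's currently known path MTU
 * toward dotted:port, or -1 on failure. No data is sent; connect(2)
 * on a datagram socket only fixes the destination. */
int path_mtu_estimate(const char *dotted, unsigned short port)
{
    int fd = socket(PF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int mode = IP_PMTUDISC_DO;  /* always do Path MTU Discovery */
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);

    int mtu = -1;
    socklen_t len = sizeof(mtu);
    if (inet_pton(AF_INET, dotted, &dst.sin_addr) != 1 ||
        setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)) < 0 ||
        connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        getsockopt(fd, SOL_IP, IP_MTU, &mtu, &len) < 0)
        mtu = -1;               /* IP_MTU is only valid once connected */

    close(fd);
    return mtu;
}
```

The value is only an estimate: as noted above, it may change as MTU updates arrive.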
- IP_MTU
- Retrieve the current known path MTU of the current socket. Only valid when the socket has been connected. Returns an integer. Only valid as a getsockopt(2).
- IP_ROUTER_ALERT
- Pass all to-be-forwarded packets with the IP Router Alert option set to this socket. Only valid for raw sockets. This is useful, for instance, for user space RSVP daemons. The tapped packets are not forwarded by the kernel; it is the user's responsibility to send them out again. Socket binding is ignored; such packets are only filtered by protocol. Expects an integer flag.
- IP_MULTICAST_TTL
- Sets or reads the time-to-live value of outgoing multicast packets for this socket. It is very important for multicast packets to set the smallest TTL possible. The default is 1, which means that multicast packets don't leave the local network unless the user program explicitly requests it. Argument is an integer.
- IP_MULTICAST_LOOP
- Sets or reads a boolean integer argument whether sent multicast packets should be looped back to the local sockets.
- IP_ADD_MEMBERSHIP
- Join a multicast group. Argument is a struct ip_mreqn structure.

    struct ip_mreqn {
        struct in_addr imr_multiaddr; /* IP multicast group address */
        struct in_addr imr_address;   /* IP address of local interface */
        int            imr_ifindex;   /* interface index */
    };

- imr_multiaddr contains the address of the multicast group the application wants to join or leave. It must be a valid multicast address. imr_address is the address of the local interface with which the system should join the multicast group; if it is equal to INADDR_ANY an appropriate interface is chosen by the system. imr_ifindex is the interface index of the interface that should join/leave the imr_multiaddr group, or 0 to indicate any interface.
- For compatibility, the old ip_mreq structure is still supported. It differs from ip_mreqn only by not including the imr_ifindex field. Only valid as a setsockopt(2).
- IP_DROP_MEMBERSHIP
- Leave a multicast group. Argument is an ip_mreqn or ip_mreq structure similar to IP_ADD_MEMBERSHIP.
- IP_MULTICAST_IF
- Set the local device for a multicast socket. Argument is an ip_mreqn or ip_mreq structure similar to IP_ADD_MEMBERSHIP.
- When an invalid socket option is passed, ENOPROTOOPT is returned.
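The multicast options above can be illustrated with a short sketch. The helper names make_mreqn, join_group and multicast_ttl are hypothetical. Whether a join succeeds depends on the interfaces and routes of the host, so join_group may well fail (for example with ENODEV) on a machine without a multicast-capable route.

```c
#define _GNU_SOURCE            /* for SOL_IP and struct ip_mreqn */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill an ip_mreqn for group `dotted` on the
 * interface with index `ifindex` (0 means any interface).
 * Returns 0, or -1 if `dotted` is not a valid multicast address. */
int make_mreqn(struct ip_mreqn *mr, const char *dotted, int ifindex)
{
    memset(mr, 0, sizeof(*mr));
    if (inet_pton(AF_INET, dotted, &mr->imr_multiaddr) != 1)
        return -1;
    /* IN_MULTICAST() expects the address in host byte order. */
    if (!IN_MULTICAST(ntohl(mr->imr_multiaddr.s_addr)))
        return -1;
    mr->imr_address.s_addr = htonl(INADDR_ANY); /* let the system choose */
    mr->imr_ifindex = ifindex;
    return 0;
}

/* Hypothetical helper: join `dotted` on socket `fd`.
 * Returns 0 on success, -1 on failure. */
int join_group(int fd, const char *dotted, int ifindex)
{
    struct ip_mreqn mr;
    if (make_mreqn(&mr, dotted, ifindex) < 0)
        return -1;
    return setsockopt(fd, SOL_IP, IP_ADD_MEMBERSHIP, &mr, sizeof(mr));
}

/* Hypothetical helper: current multicast TTL of `fd`, or -1 on error. */
int multicast_ttl(int fd)
{
    int ttl = -1;
    socklen_t len = sizeof(ttl);
    if (getsockopt(fd, SOL_IP, IP_MULTICAST_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}
```

Leaving the group again uses the same structure with IP_DROP_MEMBERSHIP.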
SYSCTLS
The IP protocol supports the sysctl interface to configure some global options. The sysctls can be accessed by reading or writing the /proc/sys/net/ipv4/* files or using the sysctl(2) interface.
- ip_default_ttl
- Set the default time-to-live value of outgoing packets. This can be changed per socket with the IP_TTL option.
- ip_forward
- Enable IP forwarding with a boolean flag. IP forwarding can be also set on a per interface basis.
- ip_dynaddr
- Enable dynamic socket address and masquerading entry rewriting on interface address change. This is useful for dialup interfaces with changing IP addresses. 0 means no rewriting, 1 turns it on and 2 enables verbose mode.
- ip_autoconfig
- Not documented.
- ip_local_port_range
- Contains two integers that define the default local port range allocated to sockets. Allocation starts with the first number and ends with the second number. Note that these should not conflict with the ports used by masquerading (although the case is handled). Also, arbitrary choices may cause problems with some firewall packet filters that make assumptions about the local ports in use. The first number should be greater than 1024, preferably greater than 4096, to avoid clashes with well known ports and to minimize firewall problems.
- ip_no_pmtu_disc
- If enabled, don't do Path MTU Discovery for TCP sockets by default. Path MTU Discovery may fail if misconfigured firewalls (that drop all ICMP packets) or misconfigured interfaces (e.g., a point-to-point link where both ends don't agree on the MTU) are on the path. It is better to fix the broken routers on the path than to turn off Path MTU Discovery globally, because not doing it incurs a high cost to the network.
- ipfrag_high_thresh, ipfrag_low_thresh
- If the amount of queued IP fragments reaches ipfrag_high_thresh, the queue is pruned down to ipfrag_low_thresh. Contains an integer with the number of bytes.
- ip_always_defrag
- [New with Kernel 2.2.13; in earlier kernel version the feature was controlled at compile time by the CONFIG_IP_ALWAYS_DEFRAG option]
When this boolean flag is enabled (not equal to 0), incoming fragments (parts of IP packets that arose when some host between origin and destination decided that the packets were too large and cut them into pieces) are reassembled (defragmented) before being processed, even if they are about to be forwarded.
Only enable this when running either a firewall that is the sole link to your network or a transparent proxy; never turn it on for a normal router or host. Otherwise fragmented communication may be disturbed when the fragments travel over different links. Defragmentation also has a large memory and CPU time cost.
This is automagically turned on when masquerading or transparent proxying are configured.
- neigh/*
- See arp(7).
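Reading one of the integer sysctls through the /proc/sys/net/ipv4 files mentioned above might look like the sketch below. The helper name read_sysctl_int is hypothetical, and it only handles files that hold a single integer, such as ip_default_ttl.

```c
#include <stdio.h>

/* Hypothetical helper: read the first integer from a sysctl file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_sysctl_int(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A call such as read_sysctl_int("/proc/sys/net/ipv4/ip_default_ttl", &ttl) reads the system-wide default; writing a sysctl generally requires appropriate privileges.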
IOCTLS
All ioctls described in socket(7) apply to ip.
The ioctls to configure firewalling are documented in ipfw(4) from the ipchains package.
Ioctls to configure generic device parameters are described in netdevice(7).
NOTES
Be very careful with the SO_BROADCAST option - it is not privileged in Linux. It is easy to overload the network with careless broadcasts. For new application protocols it is better to use a multicast group instead of broadcasting. Broadcasting is discouraged.
Some other BSD sockets implementations provide IP_RCVDSTADDR and IP_RECVIF socket options to get the destination address and the interface of received datagrams. Linux has the more general IP_PKTINFO for the same task.
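Retrieving the destination address and receiving interface with IP_PKTINFO could be sketched as follows. The helper names enable_pktinfo and recv_with_pktinfo are assumptions made for the example; the structure is left zeroed if the kernel delivers no IP_PKTINFO control message.

```c
#define _GNU_SOURCE            /* for SOL_IP and struct in_pktinfo */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: ask for IP_PKTINFO control messages on `fd`. */
int enable_pktinfo(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Hypothetical helper: receive one datagram and copy its in_pktinfo
 * into `info`. Returns the number of bytes received, or -1 on error. */
ssize_t recv_with_pktinfo(int fd, void *buf, size_t len,
                          struct in_pktinfo *info)
{
    /* The union keeps the control buffer aligned for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;
    } control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;

    memset(info, 0, sizeof(*info));
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_IP && c->cmsg_type == IP_PKTINFO) {
            memcpy(info, CMSG_DATA(c), sizeof(*info));
            break;
        }
    }
    return n;
}
```

After a successful call, info->ipi_addr holds the destination address from the packet header and info->ipi_ifindex the index of the receiving interface.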
ERRORS
- ENOTCONN
- The operation is only defined on a connected socket, but the socket wasn't connected.
- EINVAL
- Invalid argument passed. For send operations this can be caused by sending to a blackhole route.
- EMSGSIZE
- Datagram is bigger than an MTU on the path and it cannot be fragmented.
- EACCES
- The user tried to execute an operation without the necessary permissions. These include: sending a packet to a broadcast address without having the SO_BROADCAST flag set; sending a packet via a prohibit route; modifying firewall settings without CAP_NET_ADMIN or effective user ID 0; binding to a reserved port without the CAP_NET_BIND_SERVICE capability or effective user ID 0.
- EADDRINUSE
- Tried to bind to an address already in use.
- ENOPROTOOPT and EOPNOTSUPP
- Invalid socket option passed.
- EPERM
- User doesn't have permission to set high priority, change configuration, or send signals to the requested process or group.
- EADDRNOTAVAIL
- A non-existent interface was requested or the requested source address was not local.
- EAGAIN
- Operation on a non-blocking socket would block.
- ESOCKTNOSUPPORT
- The socket is not configured or an unknown socket type was requested.
- EISCONN
- connect(2) was called on an already connected socket.
- EALREADY
- A connection operation on a non-blocking socket is already in progress.
- ECONNABORTED
- A connection was closed during an accept(2).
- EPIPE
- The connection was unexpectedly closed or shut down by the other end.
- ENOENT
- SIOCGSTAMP was called on a socket where no packet arrived.
- EHOSTUNREACH
- No valid routing table entry matches the destination address. This error can be caused by an ICMP message from a remote router or by the local routing table.
- ENODEV
- Network device not available or not capable of sending IP.
- ENOPKG
- A kernel subsystem was not configured.
- ENOBUFS, ENOMEM
- Not enough free memory. This often means that the memory allocation is limited by the socket buffer limits, not by the system memory, but this is not 100% consistent.
Other errors may be generated by the overlaying protocols; see tcp(7), raw(7), udp(7) and socket(7).
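As a small illustration of how these errors surface, the sketch below binds a socket to the loopback address and reports the errno value on failure. The helper names bind_loopback and describe_bind_error are hypothetical, and the message strings are paraphrases of the descriptions above.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: bind `fd` to 127.0.0.1:port (host byte order).
 * Returns 0 on success, otherwise the errno value set by bind(2). */
int bind_loopback(int fd, unsigned short port)
{
    struct sockaddr_in a;
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_port = htons(port);
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return bind(fd, (struct sockaddr *)&a, sizeof(a)) == 0 ? 0 : errno;
}

/* Hypothetical helper: map a few of the errors listed above to text. */
const char *describe_bind_error(int err)
{
    switch (err) {
    case 0:             return "success";
    case EADDRINUSE:    return "address already in use";
    case EACCES:        return "reserved port requires privileges";
    case EADDRNOTAVAIL: return "address is not local";
    default:            return "other error";
    }
}
```

Binding a second socket to an (address, port) pair that is already bound, without any reuse option set, is expected to yield EADDRINUSE.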
VERSIONS
IP_PKTINFO, IP_MTU, IP_MTU_DISCOVER, IP_RECVERR and IP_ROUTER_ALERT are new options in Linux 2.2. They are also all Linux specific and should not be used in programs intended to be portable.
struct ip_mreqn is new in Linux 2.2. Linux 2.0 only supported ip_mreq.
The sysctls were introduced with Linux 2.2.
COMPATIBILITY
For compatibility with Linux 2.0, the obsolete socket(PF_INET, SOCK_RAW, protocol) syntax is still supported to open a packet(7) socket. This is deprecated and should be replaced by socket(PF_PACKET, SOCK_RAW, protocol) instead. The main difference is the new sockaddr_ll address structure for generic link layer information instead of the old sockaddr_pkt.
BUGS
There are too many inconsistent error values.
The ioctls to configure IP-specific interface options and ARP tables are not described.
Some versions of glibc forget to declare in_pktinfo. Workaround currently is to copy it into your program from this man page.
Receiving the original destination address with MSG_ERRQUEUE in msg_name by recvmsg(2) does not work in some 2.2 kernels.
SEE ALSO
recvmsg(2), sendmsg(2), ipfw(4), netlink(7), raw(7), socket(7), tcp(7), udp(7)