In the daily grind of a network administrator's job, troubleshooting can eat up as much as 90% of your time. You need good troubleshooting skills to respond quickly and effectively to issues as they come up. This article discusses areas you can check to quickly isolate network-related problems.
Always start troubleshooting with these basic questions:
- What changed?
- Has this issue occurred before? If yes, when?
- Can you replicate the problem?
- Did the user do anything differently? If yes, what?
- Are other users experiencing the same issue?
With each succeeding question, try to isolate the problem by a process of elimination. For example, if a workstation cannot connect to the network, determine whether the problem is network-wide or specific to that workstation. If it affects only the workstation, you have ruled out a large share of the variables and moved closer to isolating the problem. Even if you cannot find a solution yourself, eliminating extraneous factors saves time when you seek outside help.
You will find that network-related challenges usually take one of two forms:
- Slow response times from the remote server, which can be caused by
  - network congestion
  - overloaded server at the remote end of the connection
  - poor routing
  - misconfigured DNS
  - misconfigured NIC duplex and speed
  - bad cabling
- Lost connectivity/disconnection from the network, which can be caused by
  - power failures
  - shutdown of the remote server or of an application on the remote server
  - hardware and software failures (e.g., kernel panic, OOM)
Note that slowness can escalate to the point where connectivity is lost. This means that symptoms for slow response times can be used to gauge lost connectivity.
These problems can originate at any layer of the network stack:
- Application Layer (e.g., Secure Shell (SSH), Telnet, HTTPd)
- Transport Layer (e.g., flow control)
- Network Layer (e.g., addressing, routing)
- Link Layer (e.g., hardware or device drivers)
- Physical Layer (e.g., actual cables and other physical media)
When troubleshooting remote servers, we highly recommend using an Intelligent Platform Management Interface (IPMI) with iKVM since it runs through a dedicated network port and has a separate IP address.
Testing link via cables
Your server is communicating with the other devices on your network when the link light of the network interface controller (NIC; also referred to as the network interface card or network adapter) is on. A lit link light indicates that the connection between your server and the switch/router is functioning correctly. If the light is off and the link has failed, start troubleshooting with these basic checks:
- Are the cables in good condition?
- Are the cables plugged securely and properly?
- Is the switch or the router where the server is connected turned on?
Testing link status via command-line interface
The ethtool command reports the link status and duplex settings for supported NICs. In the example below, the NIC is operating at 100 Mb/s with full duplex, and the link is functioning properly.
# ethtool eth0
Settings for eth0:
       Supported ports: [ TP ]
       Supported link modes:  10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Supported pause frame use: Symmetric
       Supports auto-negotiation: Yes
       Advertised link modes: 10baseT/Half 10baseT/Full
                               100baseT/Half 100baseT/Full
                               1000baseT/Full
       Advertised pause frame use: Symmetric
       Advertised auto-negotiation: Yes
       Speed: 100Mb/s
       Duplex: Full
       Port: Twisted Pair
       PHYAD: 1
       Transceiver: internal
       Auto-negotiation: on
       MDI-X: Unknown
       Supports Wake-on: pumbg
       Wake-on: g
       Current message level: 0x00000007 (7)
                              drv probe link
       Link detected: yes
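When checking many servers, it can help to reduce the ethtool report to the three fields that matter for a link check. The following is a minimal sketch, not a definitive tool: the function name link_summary is hypothetical, and it assumes the report layout shown above.

```shell
# Summarize the fields of `ethtool <iface>` output that matter for a link check.
# Reads the ethtool report on stdin and prints "speed duplex link" on one line.
# link_summary is a hypothetical helper name, not part of ethtool.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/         { speed = $2 }
        /^[ \t]*Duplex:/        { duplex = $2 }
        /^[ \t]*Link detected:/ { link = $2 }
        END { print speed, duplex, link }
    '
}

# Example (assumes ethtool is installed and eth0 exists):
#   ethtool eth0 | link_summary
```

A result other than the speed and duplex you expect from the switch port, or a link value of "no", is a cue to check cabling and duplex settings.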
Checking NIC status via command-line interface
The ifconfig command shows you all the activated NICs in your system, even those that have no link. An interface will not appear if it is turned off.
# ifconfig
The ifconfig -a command shows all NICs, whether they are functional or not. Network interfaces that are shut down or are non-functional do not show an IP address line. The word "UP" also does not appear in the second line of their output, as can be seen in the examples below:
Shut-down Interface
eth1     Link encap:Ethernet HWaddr 0C:C4:7C:06:45:0F
         BROADCAST MULTICAST MTU:1500 Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
         Memory:fb900000-fb920000
Active Interface
eth0     Link encap:Ethernet HWaddr 0C:C4:7C:06:78:4E
         inet addr:X.X.X.X Bcast:Y.Y.Y.X Mask:255.255.255.248
         inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
         TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:348130719 (332.0 MiB) TX bytes:452866425 (431.8 MiB)
         Memory:fb920000-fb940000
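The "UP" check described above can be automated. The sketch below lists interfaces whose flags line lacks the word UP; the function name down_interfaces is hypothetical, and the parsing assumes the legacy net-tools ifconfig layout shown in these examples (newer ifconfig versions format their output differently).

```shell
# List interfaces that are not UP, from `ifconfig -a` output read on stdin.
# down_interfaces is a hypothetical helper; it assumes each interface block
# starts at column one and that active interfaces carry the word UP.
down_interfaces() {
    awk '
        /^[^ \t]/ {                      # first line of a new interface block
            if (name != "" && !up) print name
            name = $1; up = 0
        }
        /(^|[ \t])UP([ \t]|$)/ { up = 1 }  # flags line of an active interface
        END { if (name != "" && !up) print name }
    '
}

# Example:
#   ifconfig -a | down_interfaces
```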
Viewing NIC errors via ifconfig
Slow connectivity can be traced to errors that creep in due to poor configuration or excessive bandwidth utilization. Correct these errors whenever possible, as error rates in excess of 0.5% result in noticeably sluggish performance.
In addition to interface status, the ifconfig command shows the number of overrun, carrier, dropped-packet, and frame errors:
eth0     Link encap:Ethernet HWaddr 0C:C4:7C:06:78:4E
         inet addr:X.X.X.X Bcast:Y.Y.Y.X Mask:255.255.255.248
         inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
         TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:348130719 (332.0 MiB) TX bytes:452866425 (431.8 MiB)
         Memory:fb920000-fb940000
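To compare the counters against the 0.5% guideline, divide errors by packets for each direction. The sketch below does that arithmetic; error_rates is a hypothetical name, and the parsing assumes the legacy "RX packets:N errors:N" layout shown above.

```shell
# Print RX and TX error rates (errors / packets, in percent) from legacy
# net-tools `ifconfig <iface>` output read on stdin, and flag any rate
# above the 0.5% guideline. error_rates is a hypothetical helper name.
error_rates() {
    awk '
        $1 == "RX" || $1 == "TX" {
            if ($2 !~ /^packets:/) next          # skip the "RX bytes" line
            split($2, p, ":"); split($3, e, ":")
            rate = (p[2] + 0 > 0) ? 100 * e[2] / p[2] : 0
            printf "%s %.3f%% %s\n", $1, rate, (rate > 0.5 ? "HIGH" : "ok")
        }
    '
}

# Example:
#   ifconfig eth0 | error_rates
```

For instance, 6 errors in 1000 received packets is a rate of 0.6%, which the sketch flags as HIGH.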
Viewing NIC errors via ethtool
The ethtool command can provide a much more detailed report and show errors when used with the -S switch as shown in the example below:
# ethtool -S eth0
NIC statistics:
    rx_packets: 2148660
    tx_packets: 2652312
    rx_bytes: 356733445
    tx_bytes: 469741533
    rx_broadcast: 197923
    tx_broadcast: 6877
    rx_multicast: 0
    tx_multicast: 6
    multicast: 0
    collisions: 0
    rx_crc_errors: 0
    rx_no_buffer_count: 0
    rx_missed_errors: 0
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_window_errors: 0
    tx_abort_late_coll: 0
    tx_deferred_ok: 0
    tx_single_coll_ok: 0
    tx_multi_coll_ok: 0
    tx_timeout_count: 0
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    rx_align_errors: 0
    tx_tcp_seg_good: 0
    tx_tcp_seg_failed: 0
    rx_flow_control_xon: 0
    rx_flow_control_xoff: 0
    tx_flow_control_xon: 0
    tx_flow_control_xoff: 0
    rx_long_byte_count: 356733445
 ...
    rx_errors: 0
    tx_errors: 0
    tx_dropped: 0
    rx_length_errors: 0
    rx_over_errors: 0
    rx_frame_errors: 0
    rx_fifo_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
 ...
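Because the -S report is long and most counters are normally zero, it is often easier to show only the error-related counters that are non-zero. This is a rough sketch: nonzero_errors is a hypothetical name, counter names vary by driver, and the name pattern is an assumption that will miss some counters.

```shell
# Show only non-zero error-related counters from `ethtool -S <iface>` output
# read on stdin. nonzero_errors is a hypothetical helper; the name pattern
# below is a heuristic, since counter names differ between NIC drivers.
nonzero_errors() {
    awk -F': *' '($1 ~ /(err|drop|coll|miss|fifo)/) && ($2 + 0 != 0) {
        sub(/^[ \t]+/, "", $1)    # strip the leading indentation
        print $1 "=" $2
    }'
}

# Example:
#   ethtool -S eth0 | nonzero_errors
```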
Viewing NIC errors via netstat
The netstat command is useful for systems where ethtool is not available. It provides a limited report when used with the -i switch. See the example below:
# netstat -i
Kernel Interface table
Iface      MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500  0 2148672     0     0     0 2652323     0     0     0 BMRU
lo       16436  0       0     0     0     0       0     0     0     0 LRU
Learning the possible causes of Ethernet errors
The list below is a rundown of what causes Ethernet errors:
- Collisions happen when the NIC detects itself and another server on the LAN attempting data transmission at the same time. They can be expected as a normal part of operation and are typically below 0.1% of all frames sent. Note that faulty NICs or poor cabling may cause higher error rates. There are two kinds of collisions:
  - Single collisions occur when the Ethernet frame went through only one collision.
  - Multiple collisions occur when several collisions caused the NIC to attempt sending a frame multiple times before doing so successfully.
- Cyclic redundancy check (CRC) errors happen when frames were sent but were corrupted in transit. CRC errors without many collisions are indicative of electrical noise. Check that you are using the correct type of cable, that the cabling is undamaged, and that the connectors are plugged in securely.
- Frame errors happen when a frame is received with an incorrect CRC and a non-integer number of bytes. This is usually the result of collisions or a bad Ethernet device.
- FIFO and overrun errors happen when the NIC cannot hand off data to its memory buffers fast enough because of the data-rate limits of the hardware. This kind of error is usually a sign of excessive traffic.
- Length errors happen when the received frame is shorter or longer than the Ethernet standard allows, usually due to incompatible duplex settings.
- Carrier errors happen when the NIC loses its link to the hub or switch. If this occurs, check for faulty cabling or faulty interfaces on the NIC and networking equipment.
Checking ARP values to see MAC addresses
Loss of connectivity to another server on your directly connected network can have any of the following causes:
- Either server might be disconnected from the network
- Bad network cabling
- A NIC might be disabled or the remote server might be turned off
- The remote server might be running a firewall like iptables or the Windows built-in firewall
(Note: In the firewall case, you can typically see the MAC address and confirm that the server is running the correct software, yet communication with the client on the same network still fails.)
The ifconfig -a command shows you both the NIC's MAC address and the associated IP addresses of the server you are currently logged into. In the example below, the eth0 interface has one IP address, X.X.X.X, tied to the NIC hardware MAC address 0C:C4:7C:06:78:4E:
# ifconfig -a
eth0     Link encap:Ethernet HWaddr 0C:C4:7C:06:78:4E
         inet addr:X.X.X.X Bcast:Y.Y.Y.X Mask:255.255.255.248
         inet6 addr: fe80::ec4:7aff:fe06:780e/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:2148573 errors:0 dropped:0 overruns:0 frame:0
         TX packets:2652221 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:348130719 (332.0 MiB) TX bytes:452866425 (431.8 MiB)
         Memory:fb920000-fb940000
The arp -a command shows you the MAC addresses in your server's ARP table, i.e. all other nodes on the directly connected network that have been sending Ethernet frames during the last few minutes. In the example below, we see some form of connectivity with the router at address Z.Z.Z.Z:
# arp -a
test.mydomain.com (Z.Z.Z.Z) at 90:e2:be:39:bb:49 [ether] on eth0
Note: Make sure the IP addresses listed in the ARP table match those of the servers you expect in your network. If they don't, your server might be plugged into the wrong switch or router port. Remember to also check the ARP table of the remote server to see whether it is populated with acceptable values.
Testing network connectivity via arping
An Ethernet network needs ARP to function properly; thus, ARP requests are not usually blocked by a firewall. If they were blocked, no host could find another host on the network and connect to it, which would essentially unplug the system from the network.
To test connectivity using arping, you need to be on the same subnet as the host you are trying to reach. Your default gateway is usually a good target for this kind of testing. By sending an ARP request rather than an ICMP echo, you are virtually guaranteed to get a reply as long as the other host is actually reachable on the same subnet.
The arping utility makes testing hosts easy. It performs an action similar to the ping command, but on the Ethernet layer. You give it an IP address to ping, and arping sends the proper ARP request. Arping then listens for ARP replies and prints them (if any), including the round trip time as shown in the example below:
# arping 192.168.1.100
ARPING 192.168.1.100
60 bytes from 00:40:05:01:fc:1e (192.168.1.100): index=0 time=190.973 usec
You can also use arping to detect whether more than one host is configured to use the same IP address. In the example below, two machines are replying to queries for the same IP address:
# arping -I eth0 -c 2 192.168.1.100
ARPING 192.168.1.100 from 192.168.1.1 eth0
Unicast reply from 192.168.1.100 [0a:00:3e:d1:bf:49]  0.743ms
Unicast reply from 192.168.1.100 [00:02:b3:99:2c:f8]  0.768ms
Sent 2 probes (1 broadcast(s))
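The duplicate-address check can be scripted by counting the distinct MAC addresses among the replies. The sketch below assumes the iputils-style output shown above, with the MAC address in square brackets; distinct_macs is a hypothetical name.

```shell
# Count distinct MAC addresses among the replies in iputils-style `arping`
# output read on stdin. distinct_macs is a hypothetical helper name.
distinct_macs() {
    awk '
        /reply from/ && match($0, /\[[0-9A-Fa-f:]+\]/) {
            seen[substr($0, RSTART, RLENGTH)] = 1   # remember each MAC once
        }
        END { n = 0; for (m in seen) n++; print n }
    '
}

# Example (assumes the iputils arping and an interface named eth0):
#   arping -I eth0 -c 2 192.168.1.100 | distinct_macs
```

A count greater than 1 suggests that more than one machine is answering for the same IP address.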
ARP pinging is a good ICMP ping replacement on Ethernet networks. It enables you to confidently take firewalls out of the equation and know that a failed ARP ping indicates a real problem that should be looked into.
NOTE: There are two (2) popular arping implementations:
- Linux iputils suite – cannot resolve MAC addresses to IP addresses
- Arping implementation written by Thomas Habets – can ping hosts by MAC address and IP address
Configuring Linux iptables firewall
The Linux iptables firewall is a common source of connectivity issues, especially on brand-new servers. It is installed by default on most popular Linux distributions and usually allows only a limited range of traffic. To prevent this issue, read our article on configuring iptables.
Resolving basic IP issues
An ICMP response of "Destination Host Unreachable" often points to a default gateway misconfiguration on the initiating host. A ping response of "Request timed out", on the other hand, could suggest a misconfigured default gateway, lower-layer (L1/L2) issues, or both.
To verify that your IP address, network mask (netmask), and default gateway are correct, check the output of any of the following commands:
# ifconfig ethX
# route -n
OR
# netstat -rn
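One quick sanity check on these values is that the default gateway must lie in the same subnet as the interface address, since the server reaches it directly over the local link. The sketch below performs that comparison with a bitwise AND against the netmask; ip_to_int and same_subnet are hypothetical helper names, and the addresses in the example are placeholders.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# (Hypothetical helper; assumes octets are written without leading zeros.)
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1                     # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit status 0) if the gateway is inside the subnet defined by
# the interface address and netmask.
# Usage: same_subnet <address> <netmask> <gateway>
same_subnet() {
    addr=$(ip_to_int "$1"); mask=$(ip_to_int "$2"); gw=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq $(( gw & mask )) ]
}

# Example with placeholder addresses:
#   same_subnet 192.168.1.10 255.255.255.248 192.168.1.9 && echo "gateway OK"
```

With a 255.255.255.248 mask, 192.168.1.10 and 192.168.1.9 share the network 192.168.1.8, whereas a gateway of 192.168.1.1 would fall outside it.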
The "Destination Host Unreachable" error message is your router or server telling you that the target IP address is part of a valid network but that it is getting no response from the target server. This lack of response may be due to any of the following reasons:
A host on a directly connected network:
- The client or server might be down, or disconnected from the network.
- You may be using an incorrect type of cable. (Note: There are two basic types, straight-through and crossover.)
A host on a remote network:
- The network device does not have a route in its routing table to the destination network and sends an ICMP type 3 (Destination Unreachable) message, which triggers the error message.
Testing connectivity via ping
It is always good practice to force a response from your server to check its connectivity with your local network. The ping command is the most common method to do this across multiple networks. It sends ICMP echo packets that request a corresponding ICMP echo-reply response from the device at the target address. Most servers respond to a ping query and a lack of response should alert you to potential problems that could be caused by any of the following situations:
- A server with that IP address does not exist.
- The server has been configured not to respond to pings.
- A firewall or router along the network path is blocking ICMP traffic.
- You have incorrect routing. In this case, check the routes and subnet masks on both the local and remote servers and all routers in between. A classic symptom of bad routes on a server is the ability to ping servers only on your local network and nowhere else. Use traceroute to ensure you are on the correct path.
- Either the source or the destination device has an incorrect IP address or subnet mask.
Note: There are a variety of ICMP response codes that can help in further troubleshooting.
By default, the Linux ping command sends one ping per second until you stop it with Ctrl-C; the -c option used below limits the number of pings sent. Here is an example of a successful ping to a Google server:
# ping -c 5 google.us
PING google.us (74.125.196.99) 56(84) bytes of data.
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=1 ttl=43 time=7.26 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=2 ttl=43 time=7.37 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=3 ttl=43 time=7.25 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=4 ttl=43 time=7.43 ms
64 bytes from yk-in-f99.1e100.net (74.125.196.99): icmp_seq=5 ttl=43 time=7.30 ms
--- google.us ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4013ms
rtt min/avg/max/mdev = 7.257/7.327/7.432/0.127 ms
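For scripted checks, the packet-loss figure can be pulled out of the statistics line. The sketch below assumes the iputils summary format shown above; packet_loss is a hypothetical name.

```shell
# Extract the packet-loss percentage from the summary line of `ping` output
# read on stdin. packet_loss is a hypothetical helper name; it assumes a
# summary line containing "N% packet loss".
packet_loss() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); print $i }
    }'
}

# Example:
#   ping -c 5 google.us | packet_loss
```

Any value above 0 is worth following up with the checks listed earlier in this section.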