Softpanorama (slightly skeptical) Open Source Software Educational Society
May the source be with you, but remember the KISS principle ;-)
Network Troubleshooting Tools and Strategies
The ping, traceroute, ngrep and other network tools are indispensable
for troubleshooting networking problems. Most of them are preinstalled on both
Solaris and Linux. We will use Solaris as an example below. Network troubleshooting
means recognizing and diagnosing networking problems with the goal of keeping your
network running optimally. As a network administrator, your primary concern is maintaining
connectivity of all devices (a process often called fault management).
You should also continually evaluate and improve your network's performance.
Because serious networking problems can sometimes begin as performance problems,
paying attention to performance can help you address issues before they become serious.
As in any investigation, you need to avoid jumping to conclusions and calmly
collect all relevant facts. You can use the famous "How
to Solve It" approach. Some more network-specific considerations follow.
In general, there is no single correct way to determine the root
cause of a networking problem. Like any troubleshooting of complex systems, this
is more art than science, and success depends both on your IQ and on your level of
experience with the environment. However, there are heuristics that
you can follow:
-
Concentrate on the problem at hand. Networks are a lot like
cars: You can start out investigating one problem and find 10 other things that
may need attention. Make a note of any unrelated problems, but focus on investigating
the primary problem.
-
Focus on one area at a time. If several users are reporting problems
from different areas of the network at the same time, there is a good chance
that they are reporting elements of the same problem. It can be overwhelming
to have 1,000 or more users down at once, but if the same problem is simply
reoccurring in multiple parts of the network, you only have to figure it out
once.
-
If possible, use the lab. Whenever possible, try to duplicate the problem
in a lab and troubleshoot it there. Sometimes changes made
during troubleshooting have a greater negative impact on the end-user population
than the original problem.
-
Use intelligent testing strategies:
-
If any test requires reconfiguring a device, ensure that you can roll
back the change after the test, or you may find that you have backed yourself
into a corner and cannot proceed.
-
Use as few tests as possible to isolate and define the problem, working
either top-down or bottom-up through the TCP/IP stack. Ensure that the results of the tests
are unambiguous before concentrating on the next level.
-
Validate the test results by repeating each one at least twice. Note
that running a command to verify a configuration parameter is not considered
a test in this sense and therefore doesn't need to be performed twice.
-
Document changes as you proceed, not as an afterthought:
-
Document the tests performed and the results in case a bug is found.
-
Document any changes made to the network during the troubleshooting procedure
so that the network can be properly restored to its original condition.
-
Document any workarounds that were left in place so that other support
personnel will be able to understand how and why the network changed.
Troubleshooting Commandments
- Create a backup of the faulty system before fixing anything. You can
back up only the configuration files or the complete system. A complete
backup is important because troubleshooting is a high-stress activity and it is easy
to destroy some files accidentally.
Ghost is a great tool for
performing quick complete backups, and Ghost 2003 works with Linux ext filesystems.
With the current size of USB flash drives, most system partitions
can be backed up to a flash drive. Such a backup can also be indispensable if the
fault disappears on its own: faults that fix themselves often come back on their
own too.
- Before changing any file, always create a baseline. That protects
you from the most typical mistake in troubleshooting: losing the initial configuration.
- Simplify your environment, if possible. Where possible, try to remove
routers and firewalls from the affected network path. Problems are often
introduced by network devices. This is typical, for example, of home environments
with cheap routers like Linksys.
In an enterprise environment the left hand often does not know what the right hand is doing,
and similar effects can occur because someone has upgraded a router's
operating system or a firewall's rule set.
Patches are just a special kind of upgrade and can introduce problems too.
- Have a testing plan. Make sure that you can replicate the reported
fault at will. This is important because you should always attempt to re-create
the reported fault after effecting any changes. You need to be sure that you
are not changing or adding to the problem.
- Document all steps and results. This is important because you could
forget exactly what you did to fix or change the problem. This is especially
true when someone interrupts you as you are about to test a configuration change.
You can always revert the system to the faulty state if you backed it up as
suggested earlier.
- Where possible, make permanent changes to the configuration settings.
Temporary changes may be faster to implement but cause confusion when the
system reboots after a power failure months or even years later and the fault
occurs again. Nobody will remember what was done by whom.
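Creating a baseline can be as simple as copying the file before you touch it. The following is a minimal sketch in Python, not part of the original procedure; the function name create_baseline and the ".baseline-<timestamp>" naming scheme are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


def create_baseline(config_path, suffix=None):
    """Copy config_path to a sibling file named <name>.baseline-<suffix>.

    The naming scheme is an illustrative assumption. The copy is refused
    if a baseline with the same name already exists, so an earlier
    baseline is never overwritten by accident.
    """
    source = Path(config_path)
    if suffix is None:
        # Default suffix: local time, e.g. 20240131-153000
        suffix = time.strftime("%Y%m%d-%H%M%S")
    baseline = source.with_name(f"{source.name}.baseline-{suffix}")
    if baseline.exists():
        raise FileExistsError(f"baseline already exists: {baseline}")
    # copy2 also preserves the modification time of the original file
    shutil.copy2(source, baseline)
    return baseline
```

Calling create_baseline on a configuration file before editing it leaves a timestamped copy next to the original, which also documents when the change was made.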
Using ping as a Troubleshooting Tool
The ping utility sends ICMP echo request packets to the target host or hosts.
When ICMP echo responses are received, ping displays the message "target is alive", where target
is the hostname of the device that received the ICMP echo requests.
# ping problem.host.com
problem.host.com is alive
The -s option is useful when attempting to connect to a remote host that is down
or not available. No output will be produced until an ICMP echo response is received
from the target host. The -R option can be useful if the traceroute utility is not
available.
Statistics are displayed when the ping -s command is terminated.
# ping -s problem.host.com
Another useful troubleshooting technique using ping is to send ICMP echo requests
to the entire network by using the broadcast address as the target host. Using
the -s option with the broadcast address provides good information about which systems
are available on the network:
# ping -s 172.20.4.255
Using ifconfig as a Troubleshooting Tool
The ifconfig utility is useful when troubleshooting
networking problems. You can use it to display an interface's current status including
the settings for the following:
- MTU
- Address family
- IP address
- Netmask
- Broadcast address
- Ethernet address (MAC address)
Be aware that there are two ifconfig commands. The two versions differ in how
they use name services. The /sbin/ifconfig is called by the /etc/rc2.d/S30sysid.net
startup script. This version is not affected by the configuration of the
/etc/nsswitch.conf file.
The /usr/sbin/ifconfig is called by the /etc/rc2.d/S69inet and the
/etc/rc2.d/S72inetsvc startup scripts. This version of the ifconfig command is affected
by the name service settings in the /etc/nsswitch.conf file.
Power user - Use the plumb switch when troubleshooting interfaces that have been
manually added and configured. Often an interface will report that it is up and
running yet a snoop session from another host shows that no traffic is flowing out
of the suspect interface. Using the plumb switch resolves the misconfiguration problem.
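The netmask and broadcast address that ifconfig reports are related: the broadcast address is the network address with all host bits set. When checking whether an interface is configured consistently, it can help to recompute the broadcast address from the IP address and the hexadecimal netmask. The sketch below uses Python's standard ipaddress module; the function names are illustrative assumptions.

```python
import ipaddress


def hex_netmask_to_dotted(hex_mask):
    """Convert a Solaris-style hex netmask such as 'ffffff00' to dotted form."""
    return str(ipaddress.IPv4Address(int(hex_mask, 16)))


def broadcast_address(ip, hex_mask):
    """Return the broadcast address implied by an IP address and hex netmask."""
    dotted = hex_netmask_to_dotted(hex_mask)
    # strict=False lets us pass a host address rather than a network address
    network = ipaddress.IPv4Network(f"{ip}/{dotted}", strict=False)
    return str(network.broadcast_address)
```

If the value computed here differs from the broadcast address shown by ifconfig -a, the netmask or broadcast setting of the interface deserves a closer look.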
Using arp as a Troubleshooting Tool
The arp utility can be useful when attempting to locate network problems relating
to duplicate IP address usage. Determine the Ethernet address of the target host.
You can do this by using the banner utility at the ok prompt, or the ifconfig utility
at a shell prompt on a Sun system. Armed with the Ethernet address (also known as
the MAC address) use the ping utility to determine if the target host can be reached.
Use the arp utility immediately after using the ping utility and verify that
the arp table reflects the expected (correct) Ethernet address.
The following example demonstrates this technique.
Working from system three, use the ping and arp utilities to determine whether
the target system, problem.host.com, is really the one responding to system three.
1. Determine the Ethernet address of the target host.
problem.host.com# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 6
        inet 128.50.2.1 netmask ffffff00 broadcast 128.50.2.255
        ether 8:0:20:76:6:b
problem.host.com#
The ifconfig utility shows that the Ethernet address of the hme0 interface is 8:0:20:76:6:b.
The first half of the address, 08:00:20 shows that the system is a Sun computer.
The last half of the address, 76:06:0b is the unique part of the system's Ethernet
address.
Search the Internet to determine the manufacturer of devices with unknown Ethernet
addresses.
2. Use the ping utility to send ICMP echo requests from system three to system
problem.host.com.
three# ping problem.host.com
problem.host.com is alive
three#
3. View the arp table to determine if the device that sent the ICMP echo response
is the correct system, 76:06:0b.
three# arp -a
Net to Media Table: IPv4
Device   IP Address               Mask      Flags   Phys Addr
------ -------------------- --------------- ----- ---------------
                                                  08:00:20:76:06:0b
                                                  08:00:20:8e:ee:18
                                                  08:00:20:7a:0b:b8
                                                  08:00:20:78:54:90
                                                  00:60:97:7f:4f:dd
                                                  01:00:5e:00:00:00
Output from the arp utility will appear to hang if name resolution fails because
the arp utility attempts to resolve names. Use the netstat
-pn command to obtain similar output without name resolution.
The table displayed in step 3 proved that the correct device responded. If the
wrong system responded, it could have been quickly tracked down by using the Ethernet
address. Once located, it can be configured with the correct IP address.
Many hubs and switches will report the Ethernet address of the attached device,
making it easier to track down incorrectly configured devices.
The first half of the Ethernet address can also be used to refine the search.
The previous example showed a device, presumably a personal computer, that reported
an Ethernet address of 00:60:97:7f:4f:dd. A quick search on the Internet reveals
that the 00:60:97 vendor code is assigned to 3Com Corporation.
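Comparing Ethernet addresses by eye is error-prone because ifconfig prints them without leading zeros (8:0:20:76:6:b) while arp pads each octet (08:00:20:76:06:0b). The sketch below normalizes an address and looks up its vendor prefix. The function names are illustrative assumptions, and the table contains only the two vendor codes mentioned in the text; a real lookup would consult the full IEEE registry.

```python
# Vendor prefixes taken from the examples in the text only.
KNOWN_OUIS = {
    "08:00:20": "Sun Microsystems",
    "00:60:97": "3Com",
}


def normalize_mac(mac):
    """Return the MAC address with every octet as two lowercase hex digits."""
    octets = mac.strip().split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}: {mac!r}")
    values = [int(octet, 16) for octet in octets]
    if any(value < 0 or value > 0xFF for value in values):
        raise ValueError(f"octet out of range in {mac!r}")
    return ":".join(f"{value:02x}" for value in values)


def vendor_of(mac):
    """Return the vendor for the first three octets, or 'unknown'."""
    oui = normalize_mac(mac)[:8]
    return KNOWN_OUIS.get(oui, "unknown")


def same_device(mac_a, mac_b):
    """Compare two MAC addresses regardless of zero padding or letter case."""
    return normalize_mac(mac_a) == normalize_mac(mac_b)
```

With this, the address reported by ifconfig on the target host can be compared directly with the entry in the arp table of the system that sent the ping.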
Using snoop as a Troubleshooting Tool
The snoop utility can be
particularly useful when troubleshooting virtually any networking problem. The
traces that the snoop utility produces are especially helpful for
remote troubleshooting because an end user (with access to the root password) can
capture a snoop trace and email it, or send it using ftp, to a network troubleshooter
for remote diagnosis.
You can use the snoop utility to display packets on the fly or to write to a
file. Writing to a file using the -o switch is preferable because each packet can
be interrogated later.
problem.host.com# snoop -o tracefile
Using device /dev/le (promiscuous mode)
You can view the snoop file by using the -i switch and the filename in any of
the standard modes, namely:
Terse mode - No option switch is required.
Summary verbose mode - Use the -V switch.
Verbose mode - Use the -v switch. Verbose mode is most useful when you are troubleshooting
routing, network booting, Trivial File Transfer Protocol (TFTP), and any network-related
problems that require diagnosis at the packet level. Each layer of the packet is
clearly defined by the specific headers.
View the snoop output file in terse mode and locate a packet or range of packets
of interest. Use the -p switch to view these packets. For example, if packet two
is of interest, type:
problem.host.com# snoop -p2,2 -v -i tracefile
Using ndd as a Troubleshooting Tool
Use extreme caution when using the Solaris ndd
utility because the system could be rendered inoperable if you set parameters incorrectly.
Use an escaped question mark (\?) to determine which parameters a driver supports.
For example, to determine which parameters the 100-Mbit Ethernet (hme) device supports,
type:
# ndd /dev/hme \?
?                     (read only)
transceiver_inuse     (read only)
link_status           (read only)
link_speed            (read only)
...
lance_mode            (read and write)
ipg0                  (read and write)
#
Routing/IP Forwarding
Many systems configured as multi-homed hosts or firewalls may have IP forwarding
disabled. A fast way to determine the state of IP forwarding is to use the ndd utility.
problem.host.com# ndd /dev/ip ip_forwarding
0
This example shows that the system is not forwarding IP packets between its interfaces.
The value for ip_forwarding is 1 when the system is routing or forwarding IP packets.
Interface Speed
The hme (100-Mbit Ethernet) Ethernet card can operate at two speeds, 10 or 100 Mbits
per second. You can use the ndd utility to quickly display the speed at which the
interface is running.
# ndd /dev/hme link_speed
1
A one (1) indicates that the interface is running at 100 MBits per second. A
zero (0) indicates that the interface is running at 10 MBits per second.
Interface Mode
The hme interface can run in either full-duplex or half-duplex mode. Again, the
ndd utility provides a fast way to determine the mode of the interface.
# ndd /dev/hme link_mode
1
#
One (1) indicates that the interface is running in full-duplex mode. A zero (0)
indicates that the interface is running in half-duplex mode.
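When you collect these values from many hosts, it is convenient to turn the raw 0 and 1 outputs into readable text. The sketch below encodes only the meanings stated above for the hme driver; the function name and the idea of passing in the captured ndd output as strings are assumptions for illustration.

```python
# Meanings of the hme ndd values as described in the text.
LINK_SPEED = {0: "10 Mbit/s", 1: "100 Mbit/s"}
LINK_MODE = {0: "half-duplex", 1: "full-duplex"}


def describe_hme_link(link_speed, link_mode):
    """Translate the output of 'ndd /dev/hme link_speed' and
    'ndd /dev/hme link_mode' into a readable description.

    Both arguments may be ints or strings such as '1' or '1\\n'.
    A KeyError is raised for any value other than 0 or 1.
    """
    speed = LINK_SPEED[int(link_speed)]
    mode = LINK_MODE[int(link_mode)]
    return f"{speed}, {mode}"
```

A host that reports "10 Mbit/s, half-duplex" on a port that is expected to run at 100 Mbit/s full-duplex is a good candidate for a speed or duplex negotiation problem.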
Using netstat as a Troubleshooting Tool
You can use the netstat utility to display
the status of the system's network interfaces. Of particular interest when troubleshooting
networks are the routing tables of all the systems in question. You can use the
-r switch to display a system's routing tables.
# netstat -r
Although interesting, the displayed routing table is not of much use unless you
are familiar with the name resolution services, be they /etc/hosts, NIS, or
NIS+. The problem is that it is difficult to concentrate on routing issues
when any doubt can be cast on the name services. For example, someone could have
modified the name service database, and the system msbravo may no longer resolve to the
IP address that you expected. Using the -n switch eliminates this uncertainty.
# netstat -rn
This routing table is much easier to interpret and troubleshoot, especially when
combined with the information from the ifconfig -a command.
# ifconfig -a
lo0: flags=849<UP,LOOPBACK,RUNNING,MULTICAST> mtu 8232
        inet 127.0.0.1 netmask ff000000
hme0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
        inet 129.147.11.59 netmask ffffff00 broadcast 129.147.11.255
hme0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
        inet 172.20.4.110 netmask ffffff00 broadcast 172.20.4.255
#
The verbose mode switch, -v displays additional information, including the MTU size
configured for the interface:
# netstat -rnv
Using traceroute as a Troubleshooting Tool
The traceroute utility is useful when
you perform network troubleshooting. You can quickly determine if the expected route
is being taken when communicating or attempting to communicate with a target network
device. As with most network troubleshooting, it is useful to have a benchmark against
which current traceroute output can be compared. The traceroute output can also be used to
report network problems to other network troubleshooters. For example, you could say, "Our
normal route to a host is from our router called router1-ISP to your routers called
rtr-a1 to rtr-c4. Today, however, users are complaining that performance is very
slow. Screen refreshes are taking more than 40 seconds when they normally take less
than a second. The output from traceroute shows that the route to the host is from
our router router1-ISP to your routers called rtr-a1, rtr-d4, rtr-x5, and then to
rtr-c4. What is going on?"
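Comparing current output against a saved benchmark is easy to automate once each route is reduced to an ordered list of hop names. The following sketch is one simple way to do that; the function name is an assumption, and the hop lists reuse the router names from the example above, reading the normal route as router1-ISP, rtr-a1, rtr-c4.

```python
def unexpected_hops(baseline, current):
    """Return the hops in the current route that are absent from the baseline,
    in the order in which they appear in the current route."""
    known = set(baseline)
    return [hop for hop in current if hop not in known]


def route_changed(baseline, current):
    """True if the hop sequence differs in any way, including order."""
    return list(baseline) != list(current)


# Routes from the example conversation above.
normal_route = ["router1-ISP", "rtr-a1", "rtr-c4"]
todays_route = ["router1-ISP", "rtr-a1", "rtr-d4", "rtr-x5", "rtr-c4"]
```

For the routes in the example, unexpected_hops(normal_route, todays_route) singles out rtr-d4 and rtr-x5 as the detour worth asking about.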
The traceroute utility uses the IP time-to-live (TTL) field to force ICMP TIME_EXCEEDED
responses from all gateways and routers along the path to the target host. It sends
probes with a TTL of 1, then 2, and so on; each router that decrements the TTL to zero
discards the probe and returns a TIME_EXCEEDED message, which reveals that hop. The
traceroute utility also tries to force a PORT_UNREACHABLE message from the target host.
Alternatively, it can attempt to force an ICMP ECHO_REPLY message from the
target host when you use the -I (ICMP ECHO) option.
The traceroute utility will, by default, resolve IP addresses to hostnames, as shown in the following
example:
# traceroute 172.20.4.110
traceroute to 172.20.4.110 (172.20.4.110), 30 hops max, 40 byte packets
1 129.147.11.253 (129.147.11.253) 1.037 ms 0.785 ms 0.702 ms
2 129.147.3.249 (129.147.3.249) 1.452 ms 1.569 ms 0.766 ms
3 * dungeon (129.147.11.59) 1.320 ms *
You can display IP addresses instead of hostnames by using the -n switch as shown
in the following example. In this example, the hostname dungeon for IP address 129.147.11.59
on line 3 is no longer resolved.
# traceroute -n 172.20.4.110
traceroute to 172.20.4.110 (172.20.4.110), 30 hops max, 40 byte packets
 1  129.147.11.253  0.954 ms  0.657 ms  0.695 ms
 2  129.147.3.249  0.844 ms  0.745 ms  0.771 ms
 3  129.147.11.59  0.534 ms *  0.640 ms
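The numeric output is also the easier form to process in a script, for example to record a benchmark or to spot hops that drop probes. The sketch below parses one hop line of traceroute -n output in the format shown above (hop number, address, then either a time in ms or an asterisk per probe). The function name is an assumption, and the parser is not meant for the default output with resolved hostnames.

```python
def parse_traceroute_n_line(line):
    """Parse one hop line of 'traceroute -n' output.

    Returns (hop_number, address, rtts), where rtts holds one entry per
    probe: a float in milliseconds, or None for a probe shown as '*'.
    The address is None if every probe timed out.
    """
    tokens = line.split()
    hop = int(tokens[0])
    address = None
    rtts = []
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token == "*":
            rtts.append(None)  # probe with no reply
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1] == "ms":
            rtts.append(float(token))  # round-trip time followed by its unit
            i += 2
        else:
            address = token  # anything else is the responding address
            i += 1
    return hop, address, rtts
```

A hop whose list contains None values, like hop 3 in the example, lost at least one probe; whether that matters depends on whether later hops and the target still answer.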
Common Network Problems
Following is a list of some common problems that occur:
- Faulty RJ-45 - The network connection fails intermittently.
- Faulty wiring on patch cable - No network communications.
- MDI to MDI (no MDI-X) - Medium dependent interface (MDI) devices, such as hubs, should
not be connected directly to another MDI device without a crossover. Many hubs have a port that can be switched
to become an MDI-X (crossover) port.
- Badly configured encryption - Once encryption is configured, things
are not as they appear. Standard tools such as ifconfig, and netstat will not
locate the problem. Use the snoop utility to view the contents of packets to
determine if all is normal.
- Hub or switch configured to block the MAC - Modern hubs and switches
can be configured to block specific MAC addresses, or any address if the connection
is tampered with. Access to the console of the hub or switch is necessary to
unblock a port.
- Bad routing tables or rogue router - Routing tables can be corrupted.
Sometimes a rogue router can appear on the network due to the installation of a multihomed
host.
- Rogue DHCP server present in a DHCP environment - This often happens when
somebody installs a Windows server on the network without understanding what they
are doing.
- Protocol not being routed - for example if jumpstart or bootp is
being used across routers.
- Interface not plumbed - Additional interfaces, when configured, are
not plumbed. The interface will appear to be functioning, but it will not pass
traffic.
- Connection to the wrong interface on a multihomed host.
- Bad information in the /etc/hosts or NIS database - The IP address
of systems is incorrect or missing.
The user statement, "My application does not work" is just the tip of the iceberg.
The user often does not understand what exactly is not working and jumps to
conclusions that can mislead you in troubleshooting. Never believe the user's story.
You need to ask the user very specific questions to uncover the real story. Among the questions
to consider:
- Is the server up and functioning normally?
- Can other users access the server?
- Is the client system up and functioning normally?
- Has anything changed on the server?
- Has anything changed on the client?
Layers-based troubleshooting
When troubleshooting networks, some people prefer to think in layers, similar to
the TCP/IP Model while others prefer to think in terms of functionality.
Using the TCP/IP Model layered approach, you could start at either the Physical
or Application layer. Start at either end of the model and test, draw conclusions,
move to the next layer and so on.
The Application Layer
A user complains that an application is not functioning. Assuming the application
has everything that it needs, such as disk space, name servers, and the like, determine
if the Application layer is functional by using another system.
Application layer programs often have diagnostic capabilities and may report
that a remote system is not available. Use the snoop command to determine if the
application program is receiving and sending the expected data.
The Transport Layer and the Internet Layer
These two layers can be bundled together for the purposes of troubleshooting. Determine
if the systems can communicate with each other. Look for ICMP messages that can
provide clues as to where the problem lies. Could this be a router or switching
problem? Are the protocols (TFTP, BOOTP) being routed? Are you attempting to use
protocols that cannot be routed? Are the hostnames being translated to the correct
IP addresses? Are the correct netmask and broadcast addresses being used? Tests
between the client and server can include
using ping, traceroute, arp, and snoop.
The Network Interface Layer
Use snoop to determine if the network interface is actually functioning. Use the
arp command to determine if the arp cache has the expected Ethernet or MAC address.
Fourth generation hubs and some switches can be configured to block certain MAC
addresses.
When troubleshooting connectivity problems here are some useful questions:
- Have any changes been made to the network devices?
- Can the client contact the server using ping?
- Can the client contact any system using ping?
- Can the server contact the client using ping?
- Can the client system use ping to contact any other hosts on the local network
segment?
- Can the client use ping to contact the far interface of the router?
- Can the client use ping to contact any hosts on the server's subnet?
- Is the server in the client's arp cache?
- Can snoop be used to determine what happens to the service or arp request?
- Is the client's interface correctly configured? (Has it been plumbed?)
- Has any encryption software installation been attempted?
The Physical Layer
Check that the link status LED is lit. Test the connection with a known working cable, because the link
LED can be lit even if the transmit line is damaged. Verify that an MDI-X connection
or crossover cable is being used if connecting hub to hub.
Selected Troubleshooting Scenarios
Multi-Homed System Acts as Rogue Router
For example, system A can use telnet to contact system B, but system B cannot
use telnet to contact system A. Further questioning of the user reveals that this
problem appeared shortly after a power failure.
To troubleshoot, use the traceroute utility to show the route that network
traffic takes from system B to system A. If the traceroute output reveals a route
that goes via an additional system (let's call it system C), you have a rogue router
problem.
This often happens because system C has been modified by an end user.
For example, an additional interface was added, but the user did not add the
/etc/notrouter file to the system. In this case,
after the system rebooted, it came up as a router and started advertising routes,
which confuses the core routers and disrupts network traffic patterns.
Faulty Cable
For example, users on network A cannot reach hosts on network B even though routers
R1 and R2 appear to be functioning normally.
First, verify that routers R1 and R2 are configured correctly and
that their interfaces are up.
Then verify that systems A and B are up and configured correctly.
Then use the traceroute utility to discover the actual route from
system A to system B.
Suppose the traceroute output shows that the attempted route from system
A on network net-1 goes through router R1 as expected, but the traffic never reaches
router R2.
Investigate the router R2 log files. If they show that the interface
to network net-2 is flapping (going up and down at a very high rate) and that the
routing tables are corrupt, you can suspect that the cable is the problem.
To solve this problem, replace the network net-2 cable to router R2. If that fixes
the problem, the cable was faulty and was causing intermittent connections.
Duplicate IP Address
Reported Problem: Systems on network net-1 could not use ping past router
R1 to a recently configured network, net-2.
You must be root to perform some of the troubleshooting steps below, as
in the previous examples. Suggested steps:
- Verify that the T1 link between the routers R1 and R2 is functioning properly.
- Verify that router R1 can use ping to contact router R2.
- Verify that system A can use ping to reach the close interface of router
R2. System A cannot use ping on the far interface of router R2, though.
- Confirm that systems on network net-1 can use ping to reach router R1.
- Check that systems on network net-2 can use ping to reach router R2.
- Determine that the routers are configured correctly.
- Verify that the systems on network net-1 and network net-2 are configured
correctly.
- Make sure the systems on network net-1 can communicate with each other.
- Verify that systems on network net-2 can communicate with each other.
- Log onto router R1 and use traceroute to display how the data is routed
from router R1 to router R2.
- In this case, traceroute reported that the traffic from router
R1 to router R2 was going out the network net-1 side interface of the router
instead of the network net-2 side as expected. This indicates that the IP address
for router R2 may also exist on network net-1.
- Check the Ethernet address of router R2; compare the actual address with
the contents of router R1's arp cache. In this case, the arp cache revealed that the device
was from a different manufacturer than expected.
To solve the problem, track down the device on network net-1, system C, that has
an illegal IP address (one that is the same as the network-net-1-side interface
of router R2). The duplicate resulted in a routing loop because the routers had multiple best-case
paths to the same location (which were actually at two different sites).
Correct the duplicate IP address on system C and make sure communications
work as expected.
Duplicate MAC Address (Mostly Sun environment problem)
Reported Problem: After adding an additional Ethernet interface to your host,
the system performance is very poor.
Troubleshooting (as user root):
- Use arp -a to view the address table on the host. If the MAC address appears
on more than one host that is on the same physical network, this may be the
problem.
- Use ifconfig -a to check the IP address and MAC address.
- If interfaces with the same MAC address are on the same subnet, you have found the problem.
In the ifconfig output, notice whether all the interfaces have the same
MAC address. An interface that is on a different subnet is not a problem. Two
interfaces on the same subnet, such as qfe0 and qfe1, do cause problems because packets
that leave either qfe0 or qfe1 are not
guaranteed to receive a response, since both interfaces are advertising themselves
as the source for those packets.
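The check described above (same MAC address, same subnet) can be sketched as a small function over the values read from ifconfig -a. The function name, the tuple layout, and the sample addresses are hypothetical and chosen only for illustration; the MAC addresses are assumed to be written in the same notation for every interface.

```python
import ipaddress
from collections import defaultdict


def duplicate_macs_on_same_subnet(interfaces):
    """Group interfaces that share both a subnet and a MAC address.

    interfaces is an iterable of (name, ip, netmask, mac) tuples, with the
    netmask in dotted form. Returns a list of name lists, one per conflict.
    """
    groups = defaultdict(list)
    for name, ip, netmask, mac in interfaces:
        # strict=False accepts a host address and derives its network
        network = ipaddress.IPv4Network(f"{ip}/{netmask}", strict=False)
        groups[(network, mac.lower())].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical host: qfe0 and qfe1 share a subnet, hme0 is on another one.
sample = [
    ("hme0", "10.0.0.5", "255.255.255.0", "8:0:20:aa:bb:cc"),
    ("qfe0", "192.168.1.10", "255.255.255.0", "8:0:20:aa:bb:cc"),
    ("qfe1", "192.168.1.11", "255.255.255.0", "8:0:20:aa:bb:cc"),
]
```

For the sample data only qfe0 and qfe1 are reported, because hme0 shares the MAC address but sits on a different subnet.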
Tcpreplay 3.0.beta13 released
Tcpreplay is a set of Unix tools which allows the editing
and replaying of captured network traffic in pcap (tcpdump)
format. It can be used to test a variety of passive and
inline network devices, including IPS's, UTM's, routers,
firewalls, and NIDS.
Release focus: Major bugfixes
Changes:
This release fixes some serious regression bugs that
prevented tcprewrite from editing most packets on Intel and
other little-endian systems. Some smaller bugfixes and
tweaks to improve replay performance were made.
Author:
Aaron Turner
Contents
- Overview
- Two Tips for Network Performance Checking
- Network Connectivity Troubleshooting
- Checking Network Settings
- Checking Routing Settings
- Changing the IP Address
- About the Author
... .... ...
b. Use ping with small (1 KB) and large (10 KB) packet sizes: Sometimes routers
in the network have issues that depend on the size of the packet, as some
use different queues within the router depending upon packet size.
....
Network Connectivity Troubleshooting
Here is a checklist to help you locate and resolve network connectivity problems.
1. Use ifconfig -a to check that interfaces are
plumbed; that is, that they exist in the output. Also, check the network address
and netmask of the interface.
To plumb an interface, run the command ifconfig <interface><instance>
plumb, for example:
# ifconfig ce1 plumb
Use ifconfig to see if the interface now exists.
# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 444.555.666.7 netmask ffffff00 broadcast 444.555.666.255
        ether 5:3:de:de:de:de
ce1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 6
        inet 0.0.0.0 netmask 0
        ether 3:4:aa:bb:cc:dd
Give the interface its IP address and
netmask.
# ifconfig ce1 555.66.77.88 netmask 255.255.255.0 up
2. Ping the interface address; it should work!
3. Ping your router/switch. If the ping fails,
then check your network settings. (See the Checking Network
Settings section of this article.)
4. Ping a host on another network. If that doesn't work, check the routing
table. (See the Checking Routing Settings section of
this document.)
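In the ifconfig -a output of step 1, the freshly plumbed ce1 interface differs from ce0 in one important way: its flags list does not contain UP. A script that checks many hosts can look for exactly that. The sketch below parses the flags line of an interface as printed above; the function names and the regular expression are assumptions for illustration.

```python
import re

# Matches lines such as:
#   ce1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 6
FLAGS_RE = re.compile(r"^(?P<name>\S+):\s+flags=(?P<value>[0-9a-fA-F]+)<(?P<names>[^>]*)>")


def parse_flags(line):
    """Return (interface_name, set_of_flag_names) for an ifconfig flags line."""
    match = FLAGS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an ifconfig flags line: {line!r}")
    names = {flag.strip() for flag in match.group("names").split(",") if flag.strip()}
    return match.group("name"), names


def is_up(line):
    """True if the interface on this flags line is marked UP."""
    return "UP" in parse_flags(line)[1]
```

An interface that is plumbed but not marked UP shows up in the listing and still passes no traffic, which is why the checklist brings it up explicitly with ifconfig.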
... ... ...
CHAPTER 11
Follow these guidelines while troubleshooting an IP network:
- Always begin at the network interface layer and work up to the application
layer.
- Make sure protocols at each layer of the Internet protocol suite can
communicate with the layer above and below it.
To troubleshoot an IP network
1. Ping successfully.
If you can ping successfully, you have verified IP communications between
the network interface layer and the internet layer. The Ping command
uses the Address Resolution Protocol (ARP) to resolve the IP address to a
hardware address for each echo request and echo reply.
2. Establish a session with a host.
If you can establish a session, you have verified TCP/IP
session communications from the network interface layer through the application
layer.
Note If you are unable to resolve a problem, you may need to use an IP
analyzer (such as Microsoft Network Monitor) to view network activity at each
layer.
The first goal in troubleshooting is to make sure you can successfully ping an
IP address. Ping a host with its host name only after you can successfully ping
the host with its IP address.
To troubleshoot the network interface and internet layers by using the Ping
command
1. Ping the loopback address to verify that TCP/IP was installed
and loaded correctly.
If this step is unsuccessful, verify that the system was restarted after TCP/IP
was installed and configured.
2. Ping your IP address to verify that it was configured correctly.
If this step is unsuccessful, view the configuration by using the Network application
in the Windows NT Control Panel to verify that the address was entered correctly,
and verify that the IP address is valid and that it follows addressing guidelines.
3. Ping the IP address of the default
gateway to verify that the gateway is functioning and configured correctly.
If this step is unsuccessful, verify that you are using the correct
IP address and subnet mask.
4. Ping the IP address of a remote host
to verify the connection to the wide area network.
If this step is unsuccessful:
- Make sure that IP routing is enabled.
- Verify that the IP address of the default gateway is correct.
- Make sure that the remote host is functional.
- Verify that the link between routers is operational.
After you can successfully ping the IP address, ping the host
name to verify that the name is configured correctly in the HOSTS file.
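The ping sequence above is ordered so that the first failing step points at the layer to investigate. The sketch below captures that ordering as data; it does not run ping itself, and the step names, the function name, and the shortened advice strings are assumptions that paraphrase the checklist.

```python
# Steps in the order given by the checklist, with the follow-up for a failure.
PING_STEPS = [
    ("loopback", "verify that TCP/IP was installed and loaded correctly"),
    ("own_ip", "verify that the local IP address was entered correctly and is valid"),
    ("default_gateway", "verify that you are using the correct IP address and subnet mask"),
    ("remote_host", "check IP routing, the default gateway address, "
                    "the remote host, and the link between routers"),
    ("host_name", "check the entry for the name in the HOSTS file"),
]


def first_failure(results):
    """Return (step, advice) for the first step that did not succeed.

    results maps step names to True or False. A step that is missing from
    results is treated as failed, because later steps are only meaningful
    once the earlier ones have succeeded. Returns None if every step passed.
    """
    for step, advice in PING_STEPS:
        if not results.get(step, False):
            return step, advice
    return None
```

If the gateway ping fails, for example, the results for the remote host and the host name are irrelevant until the gateway problem is fixed, which is what the early return expresses.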
The next goal in troubleshooting
is to successfully establish a session. Use one of the following methods to
verify communications between the network interface layer and the application
layer.
To establish a session with a Windows NT–based computer or other RFC-compliant
NetBIOS-based host, make a connection with the Net use or Net view
command. If this step is unsuccessful:
- Verify that the destination (target) host is NetBIOS-based.
- Confirm that the scope ID on the destination host matches that of the
source host.
- Verify that you used the correct NetBIOS name.
- If the destination host is on a remote network, check the LMHOSTS file
for the correct entry.
To establish a session with a non-RFC-compliant NetBIOS-based host, use the
Telnet or FTP utility to make a connection. If this step is unsuccessful:
- Verify that the destination host is configured with the Telnet daemon
or FTP daemon.
- Confirm that you have the correct permissions on the destination host.
- Check the HOSTS file for a valid entry if you are connecting using a
host name.
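A quick way to confirm that a Telnet or FTP daemon is listening at all, before looking at permissions or the HOSTS file, is to attempt a plain TCP connection. The sketch below is a minimal illustration, not part of the original procedure: can_connect is a hypothetical helper, and it only shows that something accepts connections on the port (23 for Telnet, 21 for FTP by convention), not that a login would succeed.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name resolution failures
        return False
```

For example, can_connect("192.168.1.20", 23) checks the Telnet port on a hypothetical host by address; repeating the call with the host name separates a name resolution problem from a daemon problem.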
A home directory already exists for this service. Creating a new home directory
will cause the existing directory to no longer be a home directory. An alias
will be created for the existing home directory. This message is a warning
only. It appears when the new home directory you are trying to add already exists.
The maximum number of home directories allowed is one per virtual root.
Invalid Server Name
While trying to connect to a server, you typed an invalid server name. Try
to connect again and make sure you type the name correctly.
More than 1 home directory was found. An automatic alias will be generated
instead.
When getting the directory entries from the server, Internet Service Manager
has determined that a duplicate exists. This duplicate may have been added by
using the Registry Editor or in some other way.
No administerable services found.
While trying to connect to a server, you typed the name of a server that has
no installed services that Internet Service Manager can administer. That is,
WWW, FTP, and gopher services have not been installed on the computer you connected
to.
The alias you have given is invalid for a non-home directory.
You’re trying to assign the alias ‘/’ to a non-home directory. This alias automatically
means home.
The connection attempt failed because there’s a version conflict between
the server and client software.
This message is an RPC error message. The RPC interface does not match what
is expected. This should happen only if you are running a beta administration
tool or server. The official error is RPC_S_UNKNOWN_IF.
The service configuration DLL ‘filename’ failed to
load correctly.
The named service configuration DLL (for example, W3scfg.dll) failed to load.
The DLL or one of its dependencies could be missing or corrupted. Generally
this is a setup problem. Run the Setup program and select Remove All,
then reinstall Microsoft Internet Information Server.
Unable to connect to target machine.
This message is an RPC error message that appears while executing an API. The
computer could be offline. The system error was EPT_S_NOT_REGISTERED or RPC_S_SERVER_UNAVAILABLE.
Unable to create directory.
The directory name or path you typed in the New Directory Name box
cannot be created. It could be an invalid path, or a file may already exist
that has this name.
15 May 2000 (support.novell.com) This document
addresses communication issues that generate about a third of the support calls
coming into the TCP/IP group at Novell Technical Support. We recommend that
anyone who is implementing TCP/IP in a NetWare 5.x environment read and understand
the information presented here.
This article is divided into two parts: understanding the concepts behind
IP routing, and troubleshooting common TCP/IP problems. A follow-up article
will explain some of the TCP/IP tools that are available for use in troubleshooting
problems in a TCP/IP environment.
Concepts Behind TCP/IP Routing
The majority of connectivity issues involve problems with routing table entries.
Every packet being processed by a TCP/IP host has a source and destination IP
address. Upon receiving each packet, the IP protocol examines the destination
address of the packet, compares it with entries in its local routing table,
and then decides what action to take:
- If the destination IP address is itself (that is, to a local application
such as GroupWise, BorderManager Proxy Server, etc.), the packet is passed
up to a protocol layer above IP.
- If the packet is destined for another known network, the packet is forwarded
through one of the locally-attached network adapters. (This assumes that
the TCP/IP host has multiple interfaces and has routing enabled.)
- If neither of the above apply, the packet is discarded.
The TCP/IP routing table can maintain four different types of routes, listed
below in the order that they are searched for a match:
- Host (a route to a single, specific destination IP address)
- Subnet (a route to a subnet)
- Network (a route to an entire network)
- Default (used when there is no other match)
IP compares the destination IP address of the packet that it is processing
with the entries in the table. If IP finds that a host entry exists and matches
the destination IP address, it will forward the packet to the next hop associated
with that host entry. Host entries are usually found in routing tables when
ICMP (Internet Control Message Protocol) has added the entry because of the
pathMTU algorithm, or from an "ICMP redirect" call. To check this, load the
TCPCON utility at the server console prompt and look at the IP Routing Table
option to verify if the protocol associated with that route is ICMP.
IP has three classes of addresses used for host addressing: Class A, Class B
and Class C. Each class has a default subnet mask (for instance, Class A has
255.0.0.0 as its default mask) that applies until a class of addresses is broken
into extra networks (i.e., subnetted). Once the network is subnetted, the IP
address no longer uses the default subnet mask.
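The default masks follow directly from the first octet of the address. The sketch below illustrates the classful rule only; classful_default_mask is a hypothetical helper, and a subnetted network overrides these defaults with its own mask.

```python
def classful_default_mask(address):
    """Return (class letter, default mask) for a dotted-quad IPv4 address.

    Classes D (multicast) and E (reserved) have no default subnet mask,
    so the mask is None for them.
    """
    first_octet = int(address.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError("not a valid IPv4 address: " + address)
    if first_octet < 128:
        return "A", "255.0.0.0"
    if first_octet < 192:
        return "B", "255.255.0.0"
    if first_octet < 224:
        return "C", "255.255.255.0"
    if first_octet < 240:
        return "D", None
    return "E", None
```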
So if IP doesn't find a host entry, but does find a subnet entry that matches
the packet's destination IP address, IP will forward the packet to the next
hop associated with that subnet entry. Subnet entries exist when RIP2 (Routing
Information Protocol v2), OSPF (Open Shortest Path First), or static entries have
been added to the routing table through a non-default subnet mask.
If IP doesn't find a subnet entry in the TCP/IP routing table but does find
a network entry that matches the destination IP address, IP will forward the
packet to the next hop associated with that network entry. (Customers running
in default NetWare TCP/IP mode will have network entries.)
Finally, if IP doesn't find a network entry, but does find that a default
route entry exists, IP will forward the packet to the next hop associated with
that default entry. The default route is most commonly inserted as a static
route through NetWare's server console INETCFG utility. However, the route may
also be learned via RIP or OSPF. Failure to at least have a default route can
often lead to communication problems on the network.
If an IP packet match has not been found in the TCP/IP routing table
at this stage, the packet is simply dropped and an ICMP "destination unreachable"
message is triggered to notify the sender that the host or network is unreachable.
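The search order described above (host, then subnet, then network, then default) can be sketched in a few lines. This is an illustration of the lookup order only, not NetWare's implementation: lookup_next_hop, the ROUTES table, and all addresses in it are hypothetical.

```python
import ipaddress

# Route types in the order the routing table is searched.
SEARCH_ORDER = ("host", "subnet", "network", "default")

# Hypothetical routing table: (route type, destination prefix, next hop).
ROUTES = [
    ("default", "0.0.0.0/0", "192.168.1.1"),
    ("network", "10.0.0.0/8", "192.168.1.2"),
    ("subnet", "10.1.2.0/24", "192.168.1.3"),
    ("host", "10.1.2.9/32", "192.168.1.4"),
]

def lookup_next_hop(routes, destination):
    """Return the next hop for destination, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    for route_type in SEARCH_ORDER:
        for kind, prefix, next_hop in routes:
            if kind == route_type and dest in ipaddress.ip_network(prefix):
                return next_hop
    # No host, subnet, network, or default route matched: the packet would be
    # dropped and an ICMP "destination unreachable" message sent.
    return None
```

With this table, 10.1.2.9 matches the host route, 10.1.2.50 falls through to the subnet route, 10.9.9.9 to the network route, and anything else to the default route. Removing the default entry makes lookup_next_hop return None for unknown destinations, which is the missing-route situation behind many of the communication problems discussed here.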
When a TCP/IP communication problem occurs, the most common reason is that
a route entry doesn't exist for the network or host with which you are trying
to communicate. When this is the case, you can either add a route entry or try
to figure out why the route is missing.
Troubleshooting Common TCP/IP Problems
When troubleshooting any networking problem, it is helpful to take a logical
approach. Some questions to ask are:
- What does work?
- What doesn't work?
- How are the things that do and don't work related?
- Have the things that don't work ever worked on this computer/network?
- If so, what has changed since the last time it did work?
Troubleshooting a problem "from the bottom up" is often a good way to quickly
isolate what's wrong and come up with a solution. The "bottom up" approach from
an IP routing perspective is to start by verifying that the problem is not related
to the physical layer (cabling, hubs, switches, and so on) or ARP (Address Resolution
Protocol). Next, you ensure that the IP routing table is functioning correctly.
Finally, you check to see whether the problem is at a generic TCP/UDP or application
level.
Two of the fundamental aspects of Linux system security and
troubleshooting are knowing what services are running, and what connections
and services are available. We're all familiar with ps for viewing active
services. netstat goes a couple of steps further, and displays all available
connections, services, and their status. It shows one type of service that
ps does not: services run from inetd or xinetd, because inetd/xinetd
start them up on demand. If the service is available but not active, such as
telnet, all you see in ps is either inetd or xinetd:
$ ps ax | grep -E 'telnet|inetd'
520 ? Ss 0:00 /usr/sbin/inetd
But netstat shows telnet sitting idly, waiting for
a connection:
$ netstat --inet -a | grep telnet
tcp        0      0 *:telnet                *:*                     LISTEN
This netstat invocation shows all activity:
$ netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address          State
tcp        0      0 *:telnet                *:*                      LISTEN
tcp        0      0 *:ipp                   *:*                      LISTEN
tcp        0      0 *:smtp                  *:*                      LISTEN
tcp        0      0 192.168.1.5:32851       nest.anthill.echid:ircd  ESTABLISHED
udp        0      0 *:ipp                   *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     1065   /tmp/ksocket-carla/klaunchertDCh2b.slave-socket
unix  2      [ ACC ]     STREAM     LISTENING     1002   /tmp/ssh-OoMGfFm666/agent.666
unix  2      [ ACC ]     STREAM     LISTENING     819    private/smtp
Your total output will probably run to a couple hundred lines.
(A fun and quick way to count lines of output is netstat -a | wc -l.)
You can ignore everything under "Active UNIX domain sockets." Those are local
inter-process communications, not network connections. To avoid displaying them
at all, do this:
$ netstat --inet -a
This will display only network connections, both listening
and established. Already netstat has earned its keep: both the telnet
and smtp services are running. This is bad, because I don't want to have either
a telnet or smtp server running on this machine. So now I know I need to turn
them off, and re-configure my startup files so they won't start at boot.
How do you know what services you want running? That is a
mondo subject for another day, and an important one. For example, if your system
has been compromised, this is one place to find evidence of a Trojan horse or
other malware phoning home. In this example, ipp is Internet Printing
Protocol, which belongs to CUPS (Common Unix Printing System.) If you want your
printer to work, this needs to be here. The connection on 192.168.1.5:32851
is my active IRC (Internet Relay Chat) connection. Refer to your /etc/services
file to learn more about TCP and UDP ports, and the services assigned to them.
What It Means
"Proto" is short for protocol, which is either TCP or UDP.
"Recv-Q" and "Send-Q" mean receiving queue and sending queue. These should always
be zero; if they're not you might have a problem. Packets should not be piling
up in either queue, except briefly, as this example shows:
tcp        0    593 192.168.1.5:34321       venus.euao.com:smtp     ESTABLISHED
That happened when I hit the "check mail" button in KMail;
a brief queuing of outgoing packets is normal behavior. If the receiving queue
is consistently jamming up, you might be experiencing a denial-of-service attack.
If the sending queue does not clear quickly, you might have an application that
is sending them out too fast, or the receiver cannot accept them quickly enough.
"Local address" is either your IP and port number, or IP and
the name of a service. "Foreign address" is the hostname and service you are
connected to. The asterisk is a placeholder for IP addresses, which of course
cannot be known until a remote host connects. "State" is the current status
of the connection. Any TCP state can be displayed here, but these three are
the ones you want to see:
LISTEN - waiting to receive a connection
ESTABLISHED - a connection is active
TIME_WAIT - a recently terminated connection; this should last
only a minute or two, after which the entry disappears. The socket pair cannot be
re-used as long as the TIME_WAIT state persists.
UDP is stateless, so the "State" column is always blank.
A socket pair is both sides of a TCP/IP connection, like this
example for a locally-attached printer:
localhost:ipp           localhost:34493         ESTABLISHED
Or a telnet connection to a remote server:
192.168.1.5:34437       65.106.57.106.pt:telnet ESTABLISHED
A socket is any hostname-port combination, or IP address-port.
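Once the columns are understood, the output is easy to process mechanically. The sketch below is a hedged illustration: parse_connections and backed_up are hypothetical helpers, and the parser assumes the whitespace-separated layout shown in this article (other netstat versions may format lines differently). It splits each tcp/udp line into fields and picks out connections whose Recv-Q or Send-Q is not zero.

```python
# Sample lines taken from the netstat output shown in this article.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:telnet *:* LISTEN
tcp 0 593 192.168.1.5:34321 venus.euao.com:smtp ESTABLISHED
udp 0 0 *:ipp *:*
"""

def parse_connections(text):
    """Parse netstat connection lines into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a tcp/udp entry.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        rows.append({
            "proto": fields[0],
            "recv_q": int(fields[1]),
            "send_q": int(fields[2]),
            "local": fields[3],
            "foreign": fields[4],
            # UDP lines usually have no State column.
            "state": fields[5] if len(fields) > 5 else "",
        })
    return rows

def backed_up(rows):
    """Return the connections with data waiting in either queue."""
    return [r for r in rows if r["recv_q"] or r["send_q"]]
```

A single non-zero reading is not a problem, as the KMail example shows; it is the same connection appearing in backed_up() across repeated runs that deserves attention.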
Continuous Capture
Because all these things change often, how do you capture
the changes? Run netstat continuously with the -c flag and record
the output:
$ netstat --inet -a -c > netstat.txt
Then check email, start and stop services, surf the web, log
in to a telnet BBS and play Legend of the Red Dragon; then review your capture
file to see what it all looks like.
Borken DNS
If netstat is taking too long, or not resolving a hostname
at all, give it the -n flag to turn off DNS lookups:
$ netstat --inet -an
Checking Interfaces
netstat can help diagnose NIC
problems. Use the -i flag when you're troubleshooting a flakey connection,
and you suspect your NIC:
$ netstat -i
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 28698 0 0 0 33742 0 0 0 BMRU
lo 16436 0 14 0 0 0 14 0 0 0 LRU
You should see large numbers in the RX-OK (received OK) and TX-OK
(transmitted OK) columns, and very low numbers in all the others. If you are
seeing a lot of RX-ERRs or TX-ERRs, suspect the NIC or the patch cable. This
is what the flags mean:
B = broadcast address
L = loopback device
M = multicast is supported (promiscuous mode is shown as P)
R = interface is running
U = interface is up
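Checking the error columns by eye gets tedious on a host with many interfaces. The sketch below is a minimal illustration, assuming the column layout shown above; interfaces_with_errors is a hypothetical helper, and the eth1 line in the sample is invented to show a failing interface (the eth0 and lo lines come from the output above).

```python
SAMPLE = """\
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 28698 0 0 0 33742 0 0 0 BMRU
eth1 1500 0 9120 112 0 0 8841 0 3 0 BMRU
lo 16436 0 14 0 0 0 14 0 0 0 LRU
"""

def interfaces_with_errors(text):
    """Return {interface: {counter: value}} for non-zero ERR/DRP/OVR counters."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = next((f for f in lines if f[0] == "Iface"), None)
    if header is None:
        return {}
    problems = {}
    for fields in lines:
        # Skip the header itself and any line that does not match its layout.
        if fields is header or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        bad = {
            name: int(value)
            for name, value in row.items()
            if name.endswith(("-ERR", "-DRP", "-OVR")) and int(value) > 0
        }
        if bad:
            problems[row["Iface"]] = bad
    return problems
```

Because the counters are cumulative, comparing two runs taken a few minutes apart shows whether errors are still accumulating or are left over from an earlier problem.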
Resources
Linux Network Administrator's Guide,
by Olaf Kirch & Terry Dawson
Sometimes when you talk to a seasoned system
or network administrator, he'll tell you that he knows that something is wrong
when things don't feel right. This isn't an admission of paranormal powers;
it's just a shorthand method for explaining that these experts know how their
system or network is supposed to behave and that it isn't acting like that now.
These administrators have created a baseline for their environment. Not all
of them have done it formally, but the ones who have will have gained significant
added benefits.
A Baseline Defined
Several things make up a baseline, but at its
heart, a baseline is merely a snapshot of your network the way it normally
acts. The least effective form of a baseline is the "sixth sense" that you develop
when you've been around something for a while. It seems to work because you
notice aberrations subconsciously; you're used to the way things ought
to be. Better baselines will be less informal and may include the following
components:
- Network traces
- Summarized network utilization data
- Logs of work done on the network
- Maps of the network
- Records of equipment on the network and
related configuration data
Network Traces
In Chapter 10, "Network Monitoring Tools" we
discussed the ethereal network analyzer. This tool's capability to save capture
files (or traces) enables you to maintain a history of your network. If the
only traces you have saved represent your troubleshooting efforts, you won't
have a very good picture of your network.
You also need to be aware that a lot of things
will influence the contents of the traces you collect. Weekend vs. weekday;
Monday or Friday vs. the rest of the week; and time of day are all examples
of the kinds of factors that will affect your data. Running ethereal (or some
other analyzer) at least three times a day, every day, and saving the capture
file will give you a much clearer idea of how things normally work.
Utilization Data
Several tools can give you a quick look at your
network's behavior: netstat, traceroute, ping, and even the contents of your
system logs are all good sources of information.
The netstat tool can show you several important
bits of information. Running it with the -M, -i, and -a switches is especially
helpful. I typically add the -n switch to netstat as well; this switch turns off
name resolution, which is a real boon if DNS is broken or IP addresses don't
resolve back to names properly.
The -i switch gives you interface-specific information:
[pate@cherry sgml]$ netstat -i
Kernel Interface table
Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0     0      0      0      0    39      0      0      0 BRU
lo     3924   0    36      0      0      0    36      0      0      0 LRU
[pate@cherry sgml]$
The -M switch gives information pertaining to masqueraded connections:
[pate@router pate]$ netstat -Mn
IP masquerading entries
prot expire   source       destination   ports
tcp  59:59.96 192.168.1.10 64.28.67.48   1028 -> 80 (61002)
tcp  58:43.75 192.168.1.10 206.66.240.72 622 -> 22 (61001)
udp  16:37.72 192.168.1.10 209.244.0.3   1025 -> 53 (61000)
[pate@router pate]$
The -a switch gives connection-oriented output (this output has been abbreviated):
[pate@cherry pate]$ netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State
tcp        0      0 0.0.0.0:6000    0.0.0.0:*       LISTEN
tcp        0      0 0.0.0.0:3306    0.0.0.0:*       LISTEN
tcp        0      0 0.0.0.0:80      0.0.0.0:*       LISTEN
udp        0      0 0.0.0.0:111     0.0.0.0:*
raw        0      0 0.0.0.0:1       0.0.0.0:*       7
raw        0      0 0.0.0.0:6       0.0.0.0:*       7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type   State     I-Node Path
unix  1      [ ]   STREAM CONNECTED 1332   /tmp/.X11-unix/X0
unix  1      [ ]   STREAM CONNECTED 1330   /tmp/.X11-unix/X0
unix  0      [ ]   DGRAM            440
[pate@cherry pate]$
The traceroute tool is especially important for
servers that handle connections from disparate parts of the Internet. Setting
up several traceroutes to different remote hosts can give you an indication
of remote users' connection speeds to your server.
The ping tool can help you watch the performance
of a local or remote network in much the same way that traceroute does. It does
not give as much detail, but it requires less overhead.
When users connect to services on your hosts,
they leave a trail through your log files. If you use a central logging host
and a log reader to grab important entries, you can build a history of how often
services are used and when they are most heavily utilized.
Work/Problem Logs
You will likely find yourself touching a lot
of the equipment on your network, so it is important that you keep good records
of what you do. Even seemingly blind trails in troubleshooting may lead you
to discover information about your network. In addition, you will find that
your documentation will be an invaluable aid the next time you need to troubleshoot
a similar problem.
Some people like to carry around a paper notebook
to keep their records in; others prefer to keep things online. Both camps have
good points, many related to information access. If you keep everything in a
notebook but don't have it handy, it does you no good. Similarly, if everything
is online and the network is down, you're in bad shape.
My preference is to keep things online, but in
a cvs repository. Then you can keep it on a central server or two while also
keeping a copy on your laptop/PC/palmtop. If you like, you can even grab printouts.
A nice benefit to this is that several people can make updates to documentation
and then commit their changes back to the cvs repository when they've finished.
I won't get into the Web vs. flatfile vs. database
vs. XML vs. whatever conflict. They all have benefits. Choose the right option
for your organization, and stick to it. The important bit is that you have the
data, right?
Network Maps
A roundly ignored set of baseline information
is the network map. If you have more than two systems in your network and don't
have a map, set down this book for 20 minutes and sketch something out. It doesn't
have to be pretty, just reasonably accurate. Are you back? Good. Now that you
have a map showing what is where, we can get back to work.
Most people want to deal with two kinds of maps.
The first is a topological/physical map, which shows what equipment is where
and how it is connected. The second is a logical map. This shows what services
are provided and what user communities are supported by which servers. If you
can combine these two maps, so much the better; color coding, numeric coding,
and outlined boxes are all mechanisms that can help with this. A sample map
is shown in Figure 1.
Figure 1 A sample network map
As with the work logs discussed above, it is best to keep your
maps online and in a couple of places. (cvs can be a good solution here as well.)
Nicely done maps also look good on your wall, not to mention that this is a
convenient place to find them when a problem breaks out and you need to start
troubleshooting.
Equipment Records
You should also have accurate records of the
hardware and software in your network. At a minimum, you should have a hardware
listing of each box on the network, a list of system and application levels
(showing currently installed versions and patches), and configurations of the
same. If you keep this in cvs, you'll also have a nice mechanism for looking
at your history.
If you decide to keep these records, it is vital
that they be kept up-to-date. Every time you make a change, you should edit
the appropriate file and commit it to cvs. If you fall behind, you'll miss something,
and then you'll really be stuck.
This article is excerpted from
Networking Linux: A Practical Guide to TCP/IP by Pat Eyler (New
Riders Publishing, 2000, ISBN 0735710317). Refer
to Chapter 7 of this book for more detailed information on the material covered
in this article.
...