UNIX administrator blog tips: June 2011

Monday, 20 June 2011

Error logging in AIX

The error logging process begins when an operating system module detects an error. The error-detecting segment of code then sends error information either to the errsave and errlast kernel services (for a pending system crash) or to the errlog subroutine (to log an application error), where the information is, in turn, written to the /dev/error special file.

This process then adds a time stamp to the collected data. The errdemon daemon constantly checks the /dev/error file for new entries, and when new data is written, the daemon conducts a series of operations.

Before an entry is written to the error log, the errdemon daemon compares the label sent by the kernel or application code to the contents of the error record template repository. If the label matches an item in the repository, the daemon collects additional data from other parts of the system.

To create an entry in the error log, the errdemon daemon retrieves the appropriate template from the repository, the resource name of the unit that detected the error, and detailed data. Also, if the error signifies a hardware-related problem and hardware Vital Product Data (VPD) exists, the daemon retrieves the VPD from the Object Data Manager (ODM). When you access the error log, either through SMIT or with the errpt command, the error log is formatted according to the error template in the error template repository and presented in either a summary or detailed report.

The system administrator can look at the error log to determine what caused a failure, or to periodically check the health of the system when it is running.

The software components that allow the AIX kernel and commands to log errors to the error log are contained in the fileset bos.rte.serv_aid. This fileset is automatically installed as part of the AIX installation process.

The commands that allow you to view and manipulate the error log, such as the errpt and errclear commands, are contained in the fileset called bos.sysmgt.serv_aid.

Error log file processing

The error log is used by system administrators. The error log contains error IDs, time stamp, error type, error class, and resource names associated with each error.
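
As a minimal sketch of working with these fields, the function below filters an errpt summary report down to permanent hardware errors. It assumes the usual summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION); the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: read an errpt summary report on standard input and print the
# identifier, time stamp and resource name of permanent hardware errors.
# Assumed column layout: $1=IDENTIFIER $2=TIMESTAMP $3=T (type) $4=C (class)
# $5=RESOURCE_NAME, remaining columns = DESCRIPTION.
permanent_hw_errors() {
    awk 'NR > 1 && $3 == "P" && $4 == "H" { print $1, $2, $5 }'
}
```

On an AIX host this would typically be used as `errpt | permanent_hw_errors`; the entries it prints are the ones worth examining with errpt -a.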

Error templates

The error template contains numbers that correspond to error messages in the codepoint catalogue.

System health check in AIX

- Use the errpt command to look at a summary error log report. Be on the lookout for recent additions to the log. Use the errpt -a command to examine any suspicious entries in detail.

Check disk space availability on the system with the df -k command. A full file system can cause a number of problems, so it is best to avoid the situation if at all possible. The two solutions to a full file system are to delete some files to free up space, or to use the LVM to add resources to the file system. The option you take will depend on the nature of the data in the file system, and whether there is any available space in the volume group.

Check volume groups for stale partitions with the lsvg command. If stale partitions, logical volumes, or physical volumes are reported, try to repair the situation with the syncvg command.

Check system paging space with the lsps -s command. A system will not perform very well if it is low on paging space. In extreme circumstances, the system can terminate processes in order to solve the problem. Obviously, it is better that the system administrator ensures that there is enough paging space. You can either increase the size of existing paging space volumes, or add a new paging space volume. Again, the option you take will depend on the available space in the volume groups on the system.

Check if all expected subsystems are running with the lssrc -a command.

Check the networking by trying to ping a well-known address. If you cannot
ping the address,
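
The disk space check in the list above can be scripted. The sketch below reads df -k output and prints every file system at or above a threshold. It assumes the AIX column order (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on); the function name and the default threshold of 90 are illustrative choices, not part of the original tips.

```shell
#!/bin/sh
# Sketch: read `df -k` output on standard input and print the mount point
# and %Used value of every file system at or above the given threshold.
# Assumes the AIX layout, where %Used is column 4 and the mount point is
# column 7. Pseudo file systems such as /proc show "-" and count as 0.
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        used = $4
        sub(/%/, "", used)
        if (used + 0 >= limit) print $7, $4
    }'
}
```

A typical call would be `df -k | full_filesystems 85`.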

Reducing high paging space usage without rebooting the server (AIX 5.0 and above)

1) Create a new paging space of the same size as the old one. When using smitty pgsp, select:
Start using this paging space NOW? yes
Use this paging space each time the system is RESTARTED? yes
2) Then deactivate the original paging space. It is better to do this from the command line. For example, if /dev/hd6 is the old paging space:
# nohup swapoff /dev/hd6 > /tmp/swapoff.log 2>&1 &
3) While step 2 is still running, check paging space usage with:
# lsps -a
You will see usage of the new paging space grow gradually while usage of the old one falls.
4) Without waiting for step 2 to finish, change the original paging space (hd6) to:
Use this paging space each time the system is RESTARTED? no
5) After the swapoff finishes, activate the original paging space (hd6).
6) (Optional) Remove the original paging space. Leave it in place if you want to swap paging spaces again next time.
OR
1) Create a new paging space of the same size as the old one. When using smitty pgsp, select:
Start using this paging space NOW? yes
Use this paging space each time the system is RESTARTED? yes
2) Deactivate the original paging space using smitty and wait for an hour.
3) Reactivate the original paging space.
4) Remove the paging space you created.
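
To watch the usage move between paging spaces while swapoff runs, the lsps -a output can be filtered. This is a small sketch that assumes the usual lsps -a columns (Page Space, Physical Volume, Volume Group, Size, %Used, Active, Auto, Type); the function name and default threshold are hypothetical.

```shell
#!/bin/sh
# Sketch: read `lsps -a` output on standard input and print the name and
# %Used value of each paging space at or above the given percentage.
# Assumes %Used is the fifth column and the first line is the header.
busy_paging_spaces() {
    limit=${1:-70}
    awk -v limit="$limit" 'NR > 1 && $5 + 0 >= limit { print $1, $5 }'
}
```

For example, `lsps -a | busy_paging_spaces 50` would list only the paging spaces that are at least half full.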

Problems with file sizes over 1GB

If a user is having a problem creating a file over 1GB, check the limits set for that user ID in /etc/security/limits.
smitty users
Change / Show Characteristics of a User
type in the user name (e.g., neo)
* User NAME [neo] +
scroll down to
Soft FILE size
and set a higher limit, or set it to unlimited (-1)
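
To see what is currently set without going through smitty, the stanza file can be read directly. The sketch below assumes the usual /etc/security/limits layout (a "user:" line followed by indented "attribute = value" lines, with comment lines starting with an asterisk); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the fsize value from one user's stanza in a limits-style
# stanza file. Prints nothing if the stanza has no fsize line, in which
# case the value from the "default" stanza applies.
# $1 = user name, $2 = path to the stanza file
user_fsize() {
    awk -v user="$1" '
        # A stanza header is a single word ending in ":" (skip "*" comments)
        NF == 1 && $1 ~ /:$/ && $1 !~ /^\*/ {
            stanza = substr($1, 1, length($1) - 1)
            next
        }
        stanza == user && $1 == "fsize" && $2 == "=" { print $3 }
    ' "$2"
}
```

For example, `user_fsize neo /etc/security/limits` prints -1 once the limit has been removed. The value is counted in 512-byte blocks, so a setting of 2097151 corresponds to roughly 1GB.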

If the percentage used still does not go down after you have removed files, a running process may be holding deleted files open. Run the following command to list such processes, for example in the /var filesystem:

d02http003:/# fuser -dV /var
/var:
inode=4123 size=9567989 fd=6 13938
inode=6175 size=19190691 fd=4 29172
Then identify the process behind each PID by running:
# ps -deaf | grep 29172
root 29172 8780 0 Dec 06 - 144:09 /usr/lpp/ssp/bin/pmand
You can kill that process to release the space. In this example it is restarted automatically, but make sure you verify that it has restarted.

If the wtmp file is very large and you want to compress it, follow the instructions below (wtmp is in the /var/adm directory):

1. run: stopsrc -s syslogd
2. run: compress -v wtmp
3. you will see that the wtmp.Z file has been created
4. rename this file by running: mv wtmp.Z /tmp/wtmp_mm_dd_yyyy.Z
5. run: touch wtmp (to create a new wtmp file)
5a. chown adm wtmp
chgrp adm wtmp
chmod 664 wtmp
6. run: startsrc -s syslogd

After you replace the hardware, if you see the same error messages still coming to GWA root, do the following:

1. cd /usr/bin
2. run: errpt -a > /tmp/errpt_mm_dd_yyyy (to save a copy of the current error log)
3. run: errclear 0
4. run: /usr/lib/errstop
5. run: rm /var/adm/ras/errlog
6. run: /usr/lib/errdemon
7. Perform the log repair action for the hardware that has been replaced by running diag --> task selection --> log repair action --> choose the device that has been replaced, press F7 to commit, and then exit

Sendmail problem: Mail is getting stuck in /var/spool/mqueue

1. Find out which version of sendmail is running by typing sendmail -d0.1
2. Check the version in the output of that command. If the sendmail version is 8.9.x, edit /etc/sendmail.cf; if it is 8.11.x, edit /etc/mail/sendmail.cf
3. Search for DS in sendmail.cf and put the host name in brackets [ ], for example: DS[d02smtp01.southbury.ibm.com]. The brackets make sendmail skip the MX record lookup and deliver directly to the named host
4. Then kill the sendmail daemon: kill -9 sendmail_pid (you can find sendmail_pid by typing ps -ef | grep sendmail)
5. Then start the sendmail daemon: /usr/lib/sendmail -q15m (if it is a sendmail client) or /usr/lib/sendmail -bd -q15m (if it is a sendmail server)
6. After a few minutes you will see the mail queue being cleared from the /var/spool/mqueue directory. You can check the queue by running the command: # mailq

Saturday, 18 June 2011

Restore a file from mksysb tape in AIX

cd /tmp
tctl -f /dev/rmt0 rewind
restore -s4 -xvqf /dev/rmt0.1 ./var/spool/cron/crontabs/sksupp1

The file will be restored to /tmp/var/spool/cron/crontabs/sksupp1.

Unable to increase the size of the filesystem

chfs -a size=+10G /dev/coc01dblv
0516-404 allocp: This system cannot fulfill the allocation request.
        There are not enough free partitions or not enough physical volumes
        to keep strictness and satisfy allocation requests. The command
        should be retried with different allocation characteristics.

The upper bound of the logical volume is 1, so it cannot be allocated on more than one disk. Raise the upper bound with chlv:
chlv -u 4 <lvname>

        -u upperbound
            Sets the maximum number of physical volumes for new allocation. The value of the upperbound variable
            should be between one and the total number of physical volumes. When using super strictness, the upper
            bound indicates the maximum number of physical volumes allowed for each mirror copy. When using striped
            logical volumes, the upper bound must be multiple of stripewidth. If upperbound is not specified it is
            assumed to be stripewidth for striped logical volumes.

After this change, chfs is able to increase the size of the file system.
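
Before changing the upper bound it helps to confirm its current value. The sketch below pulls it out of lslv output, assuming the report contains a field labelled "UPPER BOUND:" followed by a number; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: read `lslv <lvname>` output on standard input and print the
# number that follows the "UPPER BOUND:" label (assumed field name).
lv_upper_bound() {
    sed -n 's/.*UPPER BOUND: *\([0-9][0-9]*\).*/\1/p'
}
```

With the logical volume from the example above, this would be run as `lslv coc01dblv | lv_upper_bound`. A result of 1 means every new partition must come from a single disk.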

Checking root and /usr file systems in AIX

To run the fsck command on / or /usr file system, you must shut down the system and reboot it from removable media because the / (root) and /usr file systems cannot be unmounted from a running system.

The following procedure describes how to run fsck on the / and /usr file systems from the maintenance shell.

With root authority, shut down your system.
Boot from your installation media.

From the Welcome menu, choose the Maintenance option.
From the Maintenance menu, choose the option to access a volume group.
Choose the rootvg volume group. A list of logical volumes that belong to the volume group you selected is displayed.
Choose 2 to access the volume group and to start a shell before mounting file systems. In the following steps, you will run the fsck command using the appropriate options and file system device names. The fsck command checks the file system consistency and interactively repairs the file system. The / (root) file system device is /dev/hd4 and the /usr file system device is /dev/hd2.
To check / file system, type the following:
$ fsck -y /dev/hd4
The -y flag is recommended for less experienced users (see the fsck command).

To check the /usr file system, type the following:
$ fsck -y /dev/hd2
To check other file systems in the rootvg, type the fsck command with the appropriate device names. The device for /tmp is /dev/hd3, and the device for /var is /dev/hd9var.
When you have completed checking the file systems, reboot the system.

How to change the status of a disk from 'removed' to 'active'

After an I/O failure to a disk, caused by a path failure or a system crash, a volume group may show one or more of its disks in a removed state. File systems on such a disk will fail to mount, and other failures related to the disk will follow.

The status of a Volume Group disk may be seen by:

$ lsvg -p <VolumeGroupName>

where the VolumeGroupName is the Volume Group in question.

Example of viewing the disks in the uservg volume group:
# lsvg -p uservg
uservg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
hdisk6 removed 639 238 128..00..00..00..110
hdisk7 active 639 639 128..128..127..128..128

The status of a disk is shown in the 'PV STATE' column.

If a disk has a status of 'removed', you may not be able to mount file systems that exist on the disk in question:
# mount /home/user1/userfs
mount: 0506-324 Cannot mount /dev/userlv1 on /home/user1/userfs: There is an input or output error.

Changing the status of a disk to active:

chpv -va <hdisk#>

where hdisk# is the disk in question.

Example of changing hdisk6 to active state in the uservg volume group:
# chpv -va hdisk6
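
To spot affected disks quickly, the lsvg -p report can be filtered for any state other than active. This is a minimal sketch based on the output layout shown above (volume group name on the first line, column headings on the second); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read `lsvg -p <vgname>` output on standard input and print the
# name and state of every physical volume that is not "active".
# The first two lines (VG name and column headings) are skipped.
inactive_pvs() {
    awk 'NR > 2 && $2 != "active" { print $1, $2 }'
}
```

For the uservg example, `lsvg -p uservg | inactive_pvs` prints "hdisk6 removed" before the chpv command is run and nothing afterwards.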

Cannot telnet to Server After Changing Soft Filesize Limit

After changing the soft filesize limit to "0", the customer is unable to telnet into the server.

Users see the error message:

/dev/pts/0: 3004-004 You must "exec" login from the lowest login shell.

Environment

AIX Version 5.1

To remove the soft filesize (fsize) limit, change the value to "-1" in the /etc/security/limits file.

How to check the network adapters for errors and network performance

entstat -d ent1 --> 10/100 PCI 23100020

0 CRC
0 alignment
media speed 100 Full selected and running
0 No Receive Pool Buffer errors
The adapter looks good.

ftp to another machine

login

bin

put "|dd if=/dev/zero bs=32k count=100" /dev/null

The transfer rate is between 7 and 8 MB/sec, which is good.

lscfg -vl ent0 --> dv210 microcode --> latest microcode

lslpp -l |grep 14108902 --> 5.3.0.50 --> latest device driver

Friday, 17 June 2011

AIX Interview question : PART 2

How do I set the tty name associated with a physical port?

Let's say you wanted to make a tty on the s1 port and call it rs0000 and a tty on the s2 port and call it rs0001.

You could run:

mkdev -c tty -s rs232 -t tty -l rs0000 -p sa0 -w s1   # creates rs0000
and
mkdev -c tty -s rs232 -t tty -l rs0001 -p sa1 -w s2   # creates rs0001

How do I use mksysb to clone a system?

I use the following steps on the master machine to clone an AIX system:
1) Remove the password from root.
2) Remove the NIS line from the end of the /etc/group file (the last
line with the +: )
3) Change most of the level '2' designations in /etc/inittab to level
'3' to prevent them from being started up when the new system is
booted (the minimum ones to change are rc.nfs and rc.tcpip)
4) Boot in service mode and change the name and ip address to a "spare"
set to avoid address collision.
5) Clear /tmp, /usr/tmp and /usr/spool/lpd/stat.
6) Run mkszfile and edit it to be sure /usr is as small as possible;
then mksysb from the command line.

The above changes allow me to boot in normal mode the first time, get in
as root, change the above files back and do the other things necessary
to configure the new system.

Then, of course, I go back and clean up and reboot my master machine.

Note: Steps 1 and 2 let you log in even if you can't get on the network.
They prevent the login process from trying to reach an NIS server.
Step 2 is needed only if you use NIS.

How do I remove a non-existent physical volume?

To delete a phantom disk from the ODM, use reducevg with the PVID instead of the disk name. You are running some command such as lsvg or varyonvg and it is griping about a disk that can no longer be found, right? That warning message should give you a PVID. Try one of the following (note: reducevg updates the VGDA but not the ODM).

reducevg -f <vgname> <pvid>

ldeletepv -g VGid -p PVid
-g Required, specify the VGid of the volume group you are
removing the physical volume from
-p Required, specify the PVid of the PV to be removed

How do I kill a process that ignores kill -QUIT -KILL -STOP?
If there is i/o pending in a device driver, and the driver does not catch the signal, you can't kill it - a reboot is the only way to clear it.

Furthermore, if the process stays hung for more than a few minutes, you can find out what device is wedged by doing this --

% echo trace -k $(expr <pid> / 256) | crash | tee stack

How can I see "console" messages?

Use the swcons command to redirect the console to a file, or use chcons to do it permanently.

How do I merge my /etc/passwd and /etc/security/passwd for Crack?

Use /usr/sbin/mrgpwd. You must have permission to read /etc/security/passwd.

I lost the root password, what should I do?

Boot from boot diskettes, bootable tape, or bootable CD.
At the Installation/Maint menu select item 4, "Start a limited function
maintenance shell".
At the subsequent "#" prompt enter the command:
getrootfs hdiskN
(where "N" is replaced by the number of a disk on your system
that is in rootvg.)
That will run for about a minute or so and you get a # prompt back. At this
point you are logged in as root in single user mode.
Change to /etc/security and edit the passwd file. Delete the three lines
under root: password, lastupdate, and
flags. Save the file.
Then at the prompt, give root a new password.
Shutdown/reboot in normal mode. Log in with new password.

AIX Interview question : PART 1

IBM has announced "AIX5L". It's essentially AIX Version 5. The 'L' stands
for "Linux Affinity", a statement that AIX is going to support some of the
Linux APIs and interfaces (for instance, the /proc filesystem).

Other changes cover the filesystem limits, virtual IPs, and dynamic deallocation
of swap spaces.

Using SMIT is probably very different from your normal way of doing
system administration, but could prove very useful in the long run. In
some areas, in particular TCP/IP, NFS, etc., you can also do things the
normal way, but it is unfortunately difficult to know exactly when the
normal way works. Again, always using SMIT is probably your best way
to go, even when you have to learn a new tool. 

What SMIT actually does is build up commands with all required options
to perform the functions requested and execute them. The commands
called and the output they produce are stored in the files smit.script
and smit.log in your home directory. Looking in smit.script may teach
you more about system administration.

How do I import an /etc/passwd or /etc/group file
from another box?

If the other box is non-AIX, copy the password and group entries for
the non-system users into AIX's /etc/passwd and /etc/group files.
Then run /bin/pwdck -t ALL. This will create the proper entries in
the shadow password file (/etc/security/passwd). You should also run
usrck and grpck.

To duplicate the password and group entries from another AIX box,
copy /etc/passwd, /etc/group, /etc/security/passwd, /etc/security/group,
/etc/security/user, /etc/security/limits, /etc/security/environ. The
last three are optional unless you modified them. If you modified
/etc/security/login.cfg, you should also copy that file.

How to fsck the root filesystem

You can run fsck either in maintenance mode or on mounted filesystems.
Try this:

1. boot from diskette (AIX 3 only --- AIX 4 boot from CD or tape)
2. select maintenance mode
3. type /etc/continue hdisk0 exit (replace hdisk0 with boot disk if
not hdisk0)
4. fsck /dev/hd4

How can I unmount /usr to run fsck on it?

In order to fsck /usr, it has to be unmounted. But /usr cannot be
unmounted because /bin is symbolically linked to /usr/bin. Also
/etc/fsck is symbolically linked to /usr/sbin/fsck.

To work around this, when you boot from the boot/maintenance diskettes
and enter maintenance mode, enter "getrootfs hdisk0 sh" instead of
"getrootfs hdisk0" where hdisk0 is the name of the boot disk. Then run
"fsck /dev/hd2".

How do I see/change parameters like number of
processes per user?

You can use SMIT as described below or simply use lsattr/chdev.
lsattr lists the current setting, as in:

# lsattr -E -l sys0 -a maxuproc
maxuproc 40 Maximum # of processes allowed per user True

and you can then increase the maxuproc parameter:

# chdev -l sys0 -a maxuproc=200
sys0 changed

If you just type 'lsattr -E -l sys0' you will get a list of all
parameters, some of which can be changed but not others.
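
When the full attribute list is long, a single value can be picked out by name. A small sketch, assuming the lsattr -E layout shown above (attribute name in the first column, value in the second); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read `lsattr -E -l <device>` output on standard input and print
# the value (second column) of the attribute named in $1.
attr_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, `lsattr -E -l sys0 | attr_value maxuproc` would print 40 for the setting shown above.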

If you want to use smit, do as follows:

smit
System Environments and Processes
Change / Show Operating System Parameters
- on this screen you can change by overtyping the following fields:
- Maximum number of PROCESSES allowed per user
- Maximum number of pages in block I/O BUFFER CACHE
- Maximum Kbytes of real memory allowed for MBUFS
- toggle fields exist for:
- Automatically REBOOT system after a crash (false/true)
- Continuously maintain DISK I/O history (true/false)

How do I shrink the default paging space on hd6?

1. Create a paging space to use temporarily:
mkps -s 20 -a rootvg
2. Change the default paging space hd6 so that it is not used at the next reboot:
chps -a n hd6
3. Activate the temporary paging space and make it the primary dump device:
swapon /dev/paging00
sysdumpdev -p /dev/paging00
4. Update the information in the boot logical volume, then reboot:
bosboot -ad /dev/hdisk0
reboot
5. Remove the current hd6 and create a new one of smaller size:
rmps hd6
mklv -y hd6 -t paging rootvg <size of PS in 4 Meg blocks>
swapon /dev/hd6
6. Change the dump device back to hd6:
sysdumpdev -p /dev/hd6
7. Update the information in the boot logical volume:
bosboot -ad /dev/hdisk0
8. Change the temporary paging space (paging00) so that it is inactive at the next boot:
chps -a n paging00
9. Shut down, reboot, and remove paging00 using the command:
rmps paging00

You can check your paging space with `lsps -a`.

The swapper seems to use enormous amounts of paging space, why?

When you run ps, you may see a line like:

USER PID %CPU %MEM SZ RSS TT STAT TIME CMD
root 0 0.0% 14% 386528 8688 - S 17:06 swapper

This is normal behavior: to ps, the swapper looks as if it has the entire
paging space plus real memory allocated.

How do I remove a committed lpp?

installp has a new option, uninstall (-u), which can be used to remove lpps. BEWARE of pre-requisite chains.

How can I recover space after installing updates?

Note: If you are a /usr server, do not use this because the files
mentioned below are needed by /usr clients and cannot be deleted.

Installp creates numerous files in /usr to clean up after failed/rejected installs and also for de-installing uncommitted lpps. Once you have COMMITted packages you can remove these files safely. Depending on your installation activity the numbers can be significant: hundreds-to-thousands of files, megabytes of data.

Files eligible for removal are associated with each "product" you have installed; the largest collection being due to bos. After COMMITting bos lpps, you may safely remove all files of the form:
/usr/lpp/bos/deinstl*
/usr/lpp/bos/inst_U4*
/usr/lpp/bosadt/deinstl*
and /usr/lpp/bosadt/inst_U4*
You may repeat this for all additional COMMITted products (e.g.,bostext1, bosnet, xlc) you have on your system.
This problem of lingering install files is a known defect in installp. If you have installed PTF U411711 (or any superseder of it: U412397, U413366, U413425) the deadwood in /usr will not be quite as prevalent. No single PTF currently available completely corrects this problem.
On my own 320, the following freed up 12.4M in /usr:

# rm -R /usr/lpp/bos/deinstl*
# rm -R /usr/lpp/bos/inst_U4*

Where are the AIX log files kept?

AIX logs messages as specified in /etc/syslog.conf. Here's an
example:

#
*.err;kern.debug;auth.notice;user.none /dev/console
*.err;kern.debug;daemon,auth.notice;mail.crit;user.none /var/adm/messages
lpr.debug /var/adm/lpd-errs

*.alert;kern.err;daemon.err;user.none operator
*.alert;user.none root
*.emerg;user.none *

# for loghost machines, to have authentication messages (su, login, etc.)
# logged to a file, un-comment out the following line and adjust the
# file name as appropriate.
#
# if a non-loghost machine chooses to have such messages
# sent to the loghost machine, un-comment out the following line.
#
auth.notice /var/log/authlog
mail.debug /var/log/syslog
# following line for compatibility with old sendmails. they will send
# messages with no facility code, which will be turned into "user" messages
# by the local syslog daemon. only the "loghost" machine needs the following
# line, to cause these old sendmail log messages to be logged in the
# mail syslog file.
#
user.alert /var/log/syslog
#
# non-loghost machines will use the following lines to cause "user"
# log messages to be logged locally.
#
user.err /dev/console
user.err /var/adm/messages
user.alert `root, operator'
user.emerg *

How can I log information about ftp accesses to a file?

1) In /etc/syslog.conf, add the line:
daemon.debug /tmp/daemon.log

2) # touch /tmp/daemon.log
# refresh -s syslogd

3) Modify your inetd.conf so that ftpd is called with the "-l" flag.
You may also want the "-d" flag. This can be done with 'smit inetdconf'.

All the syslog messages from various system daemons should now appear in
the file "/tmp/daemon.log".

How do I find a file name from the inode number?

ncheck -i nnnn /mntpoint
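
ncheck is the tool named above. As a portable alternative sketch (a swapped-in technique, not the method described in the answer), find can do the same job by walking the directory tree and matching on inode number; it is slower on large file systems. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the path(s) under a directory whose inode number matches.
# -xdev keeps find on the same file system, since inode numbers are only
# unique within one file system.
# $1 = inode number, $2 = mount point (or any directory) to search
path_for_inode() {
    find "$2" -xdev -inum "$1" -print
}
```

For example, `path_for_inode 4123 /var` would look up the first inode reported by the fuser output earlier on this page, provided the file still has a directory entry.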

Thursday, 16 June 2011

Network problem determination: AIX tools for a system administrator

As an IBM AIX® systems administrator, it's inevitable that at some point in your role you will encounter a problem that's linked to or directly caused by an issue within the LAN or WAN. In such instances, it's good practice to make an initial diagnosis of the problem before engaging a suitable network administrator to help identify the root cause, or at least to give the administrator a general direction in which to start his or her investigation.

Frequently used acronyms

BIND: Berkeley Internet Name Domain
DNS: Domain Name System
LAN: Local area network
RFC: Request for Comments
TCP/IP: Transmission Control Protocol/Internet Protocol
WAN: Wide area network

Once engaged, you may be required to assist in the analysis, so it's essential that you come armed with a relevant diagnostic toolkit. This article provides you with a set of commands available on AIX, many of which are also available on other flavors of UNIX®, which can help you to troubleshoot TCP/IP network-related issues.

For the purposes of this article, the target host system used in all sample commands and output is called testhost.

Is anybody there?

The first step in diagnosing any network-related issue is to verify whether the target host is running. You can use ping to test whether a host is reachable across a network (see Listing 1). This command sends an Internet Control Message Protocol (ICMP) echo request packet to the host and waits for an echo reply.

A successful ping means that:

Your host has an active network adapter that can be used to send out the request.
The target host is running and has an active network adapter configured with the IP address that you used.
Name resolution is working for that host if a name was used rather than an IP address.
There is a route from your host to the target and back.
No firewalls on the route between hosts or running on either host are blocking ICMP traffic.

The output from a successful ping can also be useful in helping to determine network latency, as it reports on the time taken to receive the echo reply. Long response times are likely to mean poor performance for any applications that exchange data with the target host.

Listing 1. Pinging a responsive host

# ping testhost

PING testhost: (10.217.1.206): 56 data bytes

64 bytes from 10.217.1.206: icmp_seq=0 ttl=253 time=0 ms

64 bytes from 10.217.1.206: icmp_seq=1 ttl=253 time=0 ms

64 bytes from 10.217.1.206: icmp_seq=2 ttl=253 time=0 ms

64 bytes from 10.217.1.206: icmp_seq=3 ttl=253 time=0 ms

----testhost PING Statistics----

4 packets transmitted, 4 packets received, 0% packet loss

round-trip min/avg/max = 0/0/0 ms

If no echo reply is received, then one or more of the conditions described above hasn't been met and the ping fails (see Listing 2). A ping fails when the number of packets received is less than the number sent and packet loss is greater than 0 percent.

Listing 2. Pinging an unresponsive host

# ping testhost

PING testhost.testdomain.com: (10.216.122.12): 56 data bytes

----testhost.testdomain.com PING Statistics----

5 packets transmitted, 0 packets received, 100% packet loss
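
The pass/fail rule described above can be checked in a script by reading the statistics line. This is a minimal sketch that assumes the summary line format shown in Listings 1 and 2; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: read ping output on standard input and print the packet loss
# percentage taken from the "... N% packet loss" statistics line.
packet_loss() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Succeeds only when a statistics line was found and it reports 0% loss.
host_reachable() {
    loss=$(packet_loss)
    [ -n "$loss" ] && [ "$loss" -eq 0 ]
}
```

A typical use would be `ping -c 4 testhost | host_reachable && echo "testhost is up"`, where -c limits the number of echo requests sent.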

If the ping was unsuccessful, you can check whether the adapter used to send the request is up by using ifconfig.

You can use the ifconfig command to display the status of an individual adapter (for example, en1 shown in Listing 3) or all adapters using the -a switch, also shown in Listing 3. You should ensure that the adapter used to send packets out to your host is showing as UP and RUNNING. If it's not, then you need to investigate further.

Listing 3. Displaying network adapter status

# ifconfig en1

en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,

CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>

inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255

tcp_sendspace 131072 tcp_recvspace 65536

# ifconfig -a

en2: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,

CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>

inet 10.203.35.14 netmask 0xffffff80 broadcast 10.203.35.127

en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,

CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>

inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255

tcp_sendspace 131072 tcp_recvspace 65536

en0: flags=7e080822,10<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST,GROUPRT,64BIT,

CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>

lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>

inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

inet6 ::1/0

tcp_sendspace 65536 tcp_recvspace 65536

You can display Ethernet statistics on an adapter with entstat (see Listing 4). The example shown uses the -d switch to display all statistics, including device-specific statistics, for the adapter en2. This command can also be useful for telling you the link status (up or down) and media speed (for example, 100Mbps Full Duplex). The media speed is useful if you need to verify the setting based on the link partner and network that the adapter is connected to, as a speed or duplex mismatch can cause problems.

Listing 4. Displaying Ethernet statistics for a network adapter

# entstat -d en2

-------------------------------------------------------------

ETHERNET STATISTICS (en2) :

Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)

Hardware Address: 00:02:55:d3:37:be

Elapsed Time: 114 days 22 hours 48 minutes 20 seconds

Transmit Statistics: Receive Statistics:

-------------------- -------------------

Packets: 490645639 Packets: 3225432063

Bytes: 9251643184881 Bytes: 215598601362

Interrupts: 0 Interrupts: 3144149248

Transmit Errors: 0 Receive Errors: 0

Packets Dropped: 0 Packets Dropped: 0

Bad Packets: 0

Max Packets on S/W Transmit Queue: 109

S/W Transmit Queue Overflow: 0

Current S/W+H/W Transmit Queue Length: 0

Broadcast Packets: 442 Broadcast Packets: 10394992

Multicast Packets: 0 Multicast Packets: 349

No Carrier Sense: 0 CRC Errors: 0

DMA Underrun: 0 DMA Overrun: 0

Lost CTS Errors: 0 Alignment Errors: 0

Max Collision Errors: 0 No Resource Errors: 0

Late Collision Errors: 0 Receive Collision Errors: 0

Deferred: 0 Packet Too Short Errors: 0

SQE Test: 0 Packet Too Long Errors: 0

Timeout Errors: 0 Packets Discarded by Adapter: 0

Single Collision Count: 0 Receiver Start Count: 0

Multiple Collision Count: 0

Current HW Transmit Queue Length: 0

General Statistics:

-------------------

No mbuf Errors: 0

Adapter Reset Count: 0

Adapter Data Rate: 200

Driver Flags: Up Broadcast Running

Simplex 64BitSupport ChecksumOffload

PrivateSegment DataRateSet

10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:

--------------------------------------------------------------------

Link Status: Up

Media Speed Selected: 100 Mbps Full Duplex

Media Speed Running: 100 Mbps Full Duplex

PCI Mode: PCI-X (100-133)

PCI Bus Width: 64-bit

Jumbo Frames: Disabled

TCP Segmentation Offload: Enabled

TCP Segmentation Offload Packets Transmitted: 260772859

TCP Segmentation Offload Packet Errors: 0

Transmit and Receive Flow Control Status: Disabled

Transmit and Receive Flow Control Threshold (High): 32768

Transmit and Receive Flow Control Threshold (Low): 24576

Transmit and Receive Storage Allocation (TX/RX): 16/48
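
Since a speed or duplex mismatch is one of the things this output is used to catch, the comparison between the selected and running media speed can be scripted. The sketch below relies on the two "Media Speed" labels shown in Listing 4; the function name and exit codes are illustrative, and treating a selected value beginning with "Auto" as acceptable is an assumption about how auto-negotiation is reported.

```shell
#!/bin/sh
# Sketch: read `entstat -d <adapter>` output on standard input and compare
# "Media Speed Selected" with "Media Speed Running".
# Exit status: 0 = match (or selected value starts with "Auto"),
#              1 = mismatch (details printed), 2 = fields not found.
check_media_speed() {
    awk -F': ' '
        /Media Speed Selected:/ { selected = $2; sub(/[ \t]+$/, "", selected) }
        /Media Speed Running:/  { running = $2;  sub(/[ \t]+$/, "", running) }
        END {
            if (selected == "" || running == "") exit 2
            if (selected ~ /^Auto/) exit 0
            if (selected != running) {
                print "selected=" selected " running=" running
                exit 1
            }
        }'
}
```

For example: `entstat -d en2 | check_media_speed || echo "check the switch port settings"`.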

If the adapter is up, you can establish whether the route from your host to the target is correct by using route get (see Listing 5). If there's no route at all, then ping will inform you, but if there is, you will need to establish what it is to verify with the network administrator that it's correct. Based on the information in the routing table that your host uses, route get will tell you the gateway the packets will be routed to when leaving your host on the way to the target.

Listing 5. Getting routing table information for a host

# route get testhost

route to: testhost

destination: 10.203.35.128

mask: 255.255.255.128

gateway: 10.203.35.1

interface: en2

interf addr: myhost

flags: <UP,GATEWAY,DONE,PRCLONING>

recvpipe sendpipe ssthresh rtt,msec rttvar hopcount mtu expire

0 0 0 0 0 0 0 -9751026
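
When you check several targets, it can be convenient to reduce the route get output to the two values you normally pass on to the network administrator. This is a minimal sketch that assumes the label: value layout shown in Listing 5; route_summary is a hypothetical helper name.

```shell
# route_summary (hypothetical helper): read "route get" output on stdin and
# print "<gateway> via <interface>".
route_summary() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }          # labels may be indented
        $1 == "gateway"   { gw  = $2 }
        $1 == "interface" { ifc = $2 }
        END {
            # No gateway line is assumed to mean no gateway is used.
            if (gw == "")  gw  = "(no gateway)"
            if (ifc == "") ifc = "(unknown interface)"
            print gw " via " ifc
        }'
}
```

For example, route get testhost | route_summary would print 10.203.35.1 via en2 for the output in Listing 5.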

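Once you know the gateway, you can use traceroute to follow the path the packets take, hop by hop, to the target (see Listing 6).

Listing 6. Tracing a successful route to a host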

# traceroute testhost

trying to get source for testhost

source should be 10.216.163.37

traceroute to testhost (10.217.1.206) from 10.216.163.37 (10.216.163.37), 30 hops max

outgoing MTU = 1500

1 10.216.163.2 (10.216.163.2) 1 ms 0 ms 0 ms

2 10.217.189.6 (10.217.189.6) 0 ms 0 ms 0 ms

3 testhost (10.217.1.206) 1 ms 1 ms 1 ms

An unsuccessful traceroute (see Listing 7) has asterisks (*) in the time fields, as they cannot be determined because the probe to the next router timed out. The example also shows the use of the -n switch, which prints numeric host addresses, thereby avoiding name lookup and resolution and speeding up the trace.

Listing 7. Tracing an unsuccessful route to a host

# traceroute -n testhost

traceroute testhost

trying to get source for testhost

source should be 10.216.163.37

traceroute to 10.216.122.12 from 10.216.163.37, 30 hops max

outgoing MTU = 1500

1 10.216.163.2 1 ms 0 ms 0 ms

2 10.216.191.238 1 ms 1 ms 1 ms

3 10.216.143.10 2 ms 2 ms 2 ms

4 * * *

5 * * *

6 * * *
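
The last hop that answered is usually the most useful detail to report, because the problem lies at or beyond it. The sketch below picks that hop out of traceroute output in the layout shown in Listing 7; last_hop is a hypothetical helper name.

```shell
# last_hop (hypothetical helper): read traceroute output on stdin and print
# the last hop that answered, as "<hop number> <address>".
last_hop() {
    awk '
        # Hop lines start with a hop number. A hop answered if any field
        # after the number is something other than an asterisk.
        $1 ~ /^[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if ($i != "*") { hop = $1; addr = $i; break }
        }
        END {
            if (hop != "") print hop, addr
            else print "no hop responded"
        }'
}
```

For the trace in Listing 7, traceroute -n testhost | last_hop would report hop 3 at 10.216.143.10.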

Services running at the application layer of a TCP/IP network listen on one or more ports that are used to exchange data between clients and the host server as managed by the transport layer. If a valid route exists to the host, and it's responding to pings but the application service is failing to respond, then you can check connectivity to the relevant ports using telnet.

The telnet command, used in its basic form, establishes a terminal connection to a host. However, you can also use it to establish a connection to a specific port on the host (the default being 23, the telnet service). For a list of standard ports, look in /etc/services.

If the connection is successful, a message indicating the telnet escape sequence is shown (see Listing 8). You need to enter this key sequence (typically, Control-]) to escape back to a telnet> prompt and enter quit to return to a shell prompt.

Listing 8. Testing port 80 (HTTP) on a host (successful)

# telnet testhost 80

Trying...

Connected to testhost.

Escape character is '^]'.

telnet> quit

Connection closed.

Depending on the type of connection you're making, the remote service you're connecting to may generate a message similar to Listing 9.

Listing 9. Testing port 25 (SMTP) on a host (successful)

# telnet testhost 25

Trying...

Connected to testhost.

Escape character is '^]'.

220 testhost.testdomain.com ESMTP Sendmail Wed, 10 Feb 2010 15:52:28 GMT

telnet> quit

Connection closed.

If the connection fails, then either a connection timeout or a connection refused message will be displayed (see Listing 10). This message can mean that the service on the target host isn't running (and therefore nothing is listening on the port), or that a firewall running on the host (or somewhere en route) is blocking connections to the port.

Listing 10. Testing port 515 (remote printing) on a host (unsuccessful)

# telnet testhost 515

Trying...

telnet: Unable to connect to remote host: Connection timed out
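
The two failure messages usually point in different directions: a refused connection generally means the host was reached but nothing accepted the connection on that port, whereas a timeout generally means the packets were dropped somewhere or the host is unreachable. A minimal sketch for classifying saved telnet output follows; classify_telnet is a hypothetical helper name, and the message wording is assumed to match the listings above.

```shell
# classify_telnet (hypothetical helper): read the text of a telnet port test
# on stdin and print one word: open, refused, timeout, or unknown.
classify_telnet() {
    awk '
        /Connected to/       { result = "open" }
        /Connection refused/ { result = "refused" }
        /timed out/          { result = "timeout" }
        END {
            if (result == "") result = "unknown"
            print result
        }'
}
```

Because telnet writes its error message to standard error, the text needs to be captured with 2>&1 before it is passed to the function.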

Do I know you?

When using a host name in an application or any of the diagnostic commands covered in this article, it's imperative that the host name can be resolved to an IP address. An IP address is what the Internet layer of a TCP/IP network uses when handling data packets.

A host name must resolve through one of the name-resolution services specified in /etc/irs.conf and /etc/netsvc.conf. The hosts record determines the order in which name resolution is performed. Only local and BIND/DNS resolution are covered here; the remaining options are outside the scope of this article.
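
To see the resolution order at a glance, you can extract the hosts record from the configuration file. The sketch below assumes the hosts = value, value layout used by /etc/netsvc.conf and reads the file contents from standard input; hosts_order is a hypothetical helper name.

```shell
# hosts_order (hypothetical helper): read netsvc.conf-style text on stdin
# and print the services on the "hosts" line, one per line, in the order
# they are tried.
hosts_order() {
    awk '
        { sub(/#.*/, "") }                      # drop comments
        /^[ \t]*hosts[ \t]*=/ {
            sub(/^[ \t]*hosts[ \t]*=/, "")
            n = split($0, part, ",")
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", part[i])
                if (part[i] != "") print part[i]
            }
        }'
}
```

For example, hosts_order < /etc/netsvc.conf prints local followed by bind when the file contains hosts = local, bind.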

When local is specified, the /etc/hosts file is used to resolve host names. So, check to see whether there's an entry for the target host (see Listing 11).

Listing 11. Looking for a host in /etc/hosts

# grep testhost /etc/hosts

10.217.1.206 testhost testhost.testdomain.com aixserver

If you specify bind or dns, then DNS is used to resolve host names, and you can use nslookup to check whether the host name resolves (see Listing 12).

Listing 12. Resolving a host name via DNS

# nslookup testhost

Server: testdns.testdomain.com

Address: 158.177.79.90

Name: testhost.testdomain.com

Address: 10.217.1.206

A more powerful DNS interrogation tool is dig, which offers a much richer set of command-line options and arguments than nslookup (nslookup provides its additional functionality through an interactive mode instead). So, for more complex queries, particularly where the output will be parsed by a script, dig is preferred (see Listing 13).

Listing 13. Reverse lookup of an IP address in DNS

# dig -x 10.217.1.206

; <<>> DiG 9.2.0 <<>> -x 10.217.1.206

;; global options: printcmd

;; Got answer:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21351

;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:

;206.1.217.10.in-addr.arpa. IN PTR

;; ANSWER SECTION:

206.1.217.10.in-addr.arpa. 3600 IN PTR testhost.testdomain.com.

;; Query time: 11 msec

;; SERVER: 10.217.1.206#53(10.217.1.206)

;; WHEN: Fri Feb 12 13:28:16 2010

;; MSG SIZE rcvd: 82
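
As an illustration of parsing dig output in a script, the sketch below pulls the host name out of the answer section of a reverse lookup. It assumes the record layout shown in Listing 13 (name, TTL, class, type, target); ptr_names is a hypothetical helper name.

```shell
# ptr_names (hypothetical helper): read "dig -x" output on stdin and print
# each PTR target without its trailing dot.
ptr_names() {
    awk '
        /^;/ || NF == 0 { next }        # skip comment and blank lines
        $4 == "PTR" {
            name = $5
            sub(/\.$/, "", name)
            print name
        }'
}
```

With the output in Listing 13, dig -x 10.217.1.206 | ptr_names prints testhost.testdomain.com, which you can then compare with the name you expected.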

A host uses the Address Resolution Protocol (ARP) table, or ARP cache, to keep track of the media access control (MAC) addresses of other network devices alongside their IP addresses. The link layer of a TCP/IP network uses a device's MAC address, so the ARP table is used to translate an IP address to a MAC address and back. If your host has communicated successfully with another host on the same network segment, it's likely that there is an entry for it in the ARP table. You can use arp to display the entry for a particular host if one exists (see Listing 14).

Listing 14. Displaying a host entry in the ARP table

# arp testhost

testhost (10.217.1.206) at 0:c:29:44:90:28 [ethernet] stored in bucket 0

You can also display the entire table using the -a switch (see Listing 15). The -n switch specifies that host name-to-IP address resolution shouldn't be performed.

Listing 15. Displaying the contents of the ARP table

# arp -an

? (10.217.1.206) at 0:c:29:44:90:28 [ethernet] stored in bucket 0

? (10.203.35.1) at 0:10:db:27:d9:8 [ethernet] stored in bucket 4

? (10.216.163.40) at 0:1b:78:59:88:d8 [ethernet] stored in bucket 4

? (10.216.163.250) at 0:11:25:a6:20:78 [ethernet] stored in bucket 14

? (10.216.163.25) at 0:1b:78:57:a:d0 [ethernet] stored in bucket 14

? (10.216.163.1) at 0:0:c:7:ac:0 [ethernet] stored in bucket 15

? (10.216.163.4) at 0:d:65:e2:4c:c2 [ethernet] stored in bucket 18

? (10.216.163.60) at 0:11:25:a6:d7:9a [ethernet] stored in bucket 24

bucket: 0 contains: 1 entries

bucket: 1 contains: 0 entries

bucket: 2 contains: 0 entries

bucket: 3 contains: 0 entries

bucket: 4 contains: 2 entries

There are 8 entries in the arp table.
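
The MAC addresses in these listings are printed without leading zeros (0:c:29 rather than 00:0c:29), which makes them awkward to compare with addresses taken from switch tables or other systems. The sketch below reduces arp -an output to IP and MAC pairs and pads each octet to two digits; arp_pairs is a hypothetical helper name, and the line layout is assumed to match Listing 15.

```shell
# arp_pairs (hypothetical helper): read "arp -an" output on stdin and print
# "<ip> <mac>" pairs, padding every MAC octet to two hex digits.
arp_pairs() {
    awk '
        # Entry lines look like: ? (10.217.1.206) at 0:c:29:44:90:28 ...
        $3 == "at" && $2 ~ /^\(.*\)$/ && $4 ~ /:/ {
            ip = substr($2, 2, length($2) - 2)   # strip the parentheses
            n = split($4, octet, ":")
            mac = ""
            for (i = 1; i <= n; i++) {
                if (length(octet[i]) == 1) octet[i] = "0" octet[i]
                mac = mac (i > 1 ? ":" : "") octet[i]
            }
            print ip, mac
        }'
}
```

For example, arp -an | arp_pairs turns the first entry above into 10.217.1.206 00:0c:29:44:90:28 and skips the bucket summary lines.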

Can you hear me?

To establish a connection, TCP uses a three-way handshake. A client initiates a connection to a host (and a specific port) by sending a SYN synchronize packet. After successfully receiving it, the host responds with a SYN-ACK acknowledgement. If the client successfully receives this acknowledgement, the client completes the handshake with an ACK acknowledgement. All of this assumes that the host server is listening on the specified port, that a route exists from the client to the host and back again, and that no firewalls are blocking this kind of traffic.

You can use netstat to display existing connections from your host to other hosts and the current state of each. Using the command with the -a switch (show the state of all sockets) and the -n switch (show addresses numerically, avoiding lookup), you can pipe the output to a suitable grep to look for connections in a particular state (for example, ESTABLISHED for post-handshake, active connections) or connections to a particular host or port.

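Listing 16 shows, in order: all connections to a particular IP address and their state (two connections to 10.217.1.206, both fully established); connections to a particular port at that address (10.217.1.206, port 22); and all fully established connections to any host. Bear in mind that grep treats each dot in the pattern as a wildcard and matches substrings, so a pattern such as 10.217.1.206.22 also matches port 220 or 2200 at the same address.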

Listing 16. Displaying the status of connections to hosts

# netstat -an | grep 10.217.1.206

tcp4 0 0 10.203.35.14.22 10.217.1.206.1023 ESTABLISHED

tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED

# netstat -an | grep 10.217.1.206.22

tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED

# netstat -an | grep ESTABLISHED

tcp4 0 0 10.203.35.14.22 10.217.1.206.1023 ESTABLISHED

tcp4 0 0 10.203.35.14.46183 10.217.1.206.22 ESTABLISHED

tcp4 0 0 10.216.163.37.1521 10.216.163.37.44122 ESTABLISHED

tcp4 0 0 10.216.163.37.44122 10.216.163.37.1521 ESTABLISHED

tcp4 0 0 127.0.0.1.199 127.0.0.1.32769 ESTABLISHED

tcp4 0 0 127.0.0.1.32769 127.0.0.1.199 ESTABLISHED

tcp4 0 0 10.203.35.14.46183 10.203.35.170.22 ESTABLISHED

tcp4 0 0 10.216.163.37.32770 10.216.163.37.32771 ESTABLISHED
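
Rather than running a separate grep for each state, you can summarize the netstat output in one pass and match the foreign address exactly instead of as a pattern. The sketch below assumes the six-column TCP line layout shown in Listing 16; tcp_states and conns_to are hypothetical helper names.

```shell
# tcp_states (hypothetical helper): read "netstat -an" output on stdin and
# print a count of TCP sockets per state.
tcp_states() {
    awk '$1 ~ /^tcp/ && NF >= 6 { count[$NF]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# conns_to (hypothetical helper): print TCP sockets whose foreign address
# (field 5) is exactly the argument, for example 10.217.1.206.22.
conns_to() {
    awk -v peer="$1" '$1 ~ /^tcp/ && $5 == peer'
}
```

For example, netstat -an | tcp_states gives a quick picture of how many sockets are in ESTABLISHED, LISTEN, or TIME_WAIT, and netstat -an | conns_to 10.217.1.206.22 lists only connections to port 22 on that host.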

You can monitor the traffic passing through a particular adapter using tcpdump, which displays a summary of each packet as it is sent or received. The command takes various options that let you display more or less of the packet, in either descriptive or raw form, and it accepts Boolean filter expressions that select the type of data you want to see. For example, monitoring packets on adapter en2, you can show only data being sent to a specific host (see Listing 17).

Listing 17. Display packets destined for a specific host

# tcpdump -i en2 dst host testhost

tcpdump: listening on en2

10:08:24.912057892 myhost.46183 > testhost.22: P 1299060979:1299061027(48)

ack 3373421618 win 17520 (DF) [tos 0x10]

10:08:25.009291439 myhost.46183 > testhost.22: P 1:49(48) ack 48 win 17520 (DF)

[tos 0x10]

10:08:25.093832676 myhost.46183 > testhost.22: . ack 96 win 17520 (DF)

[tos 0x10]

10:08:25.249319253 myhost.46183 > testhost.22: P 1299061075:1299061123(48) ack 3373421714

win 17520 (DF) [tos 0x10]

53 packets received by filter

0 packets dropped by kernel

You can show only packets coming from a specific host (see Listing 18).

Listing 18. Display packets sent by a specific host

# tcpdump -i en2 src host testhost

tcpdump: listening on en2

10:10:38.505848354 testhost.22 > myhost.46183: . ack 130 win 24820 (DF) [tos 0x10]

10:10:38.505916972 testhost.22 > myhost.46183: F 529:529(0) ack 225 win 24820 (DF)

[tos 0x10]

10:10:43.855153846 testhost > myhost: icmp: echo reply

10:10:44.855224394 testhost > myhost: icmp: echo reply

102 packets received by filter

0 packets dropped by kernel

You can show only packets exchanged with a specific host on a specific port (see Listing 19). Filter primitives are combined with and.

Listing 19. Display packets destined for or sent by a specific host on a specific port

# tcpdump -i en2 host testhost and port 22

12:15:38.033833162 myhost.47216 > testhost.22: . ack 610148954 win 17520 (DF) [tos 0x10]

12:15:38.113807903 myhost.47216 > testhost.22: P 145:193(48) ack 192 win 17520 (DF)

[tos 0x10]

12:15:38.114291921 testhost.22 > myhost.47216: P 192:240(48) ack 193 win 24820 (DF)

[tos 0x10]

12:15:38.241718122 myhost.47216 > testhost.22: P 193:241(48) ack 240 win 17520 (DF)

[tos 0x10]

12:15:38.242344703 testhost.22 > myhost.47216: P 240:288(48) ack 241 win 24820 (DF)

[tos 0x10]

12:15:38.243844593 myhost.47216 > testhost.22: . ack 288 win 17520 (DF) [tos 0x10]

12:15:38.497817604 myhost.47216 > testhost.22: P 241:289(48) ack 288 win 17520 (DF)

[tos 0x10]

12:15:38.503088328 testhost.22 > myhost.47216: P 288:336(48) ack 289 win 24820 (DF)

[tos 0x10]

12:15:38.503154802 testhost.22 > myhost.47216: P 336:432(96) ack 289 win 24820 (DF)

[tos 0x10]

145 packets received by filter

0 packets dropped by kernel

You can stop the trace by pressing Control-C. The tcpdump command is much more feature-rich than the simple examples shown here, so I recommend that you familiarize yourself with its man pages.

As you can see from the output in these three examples, traffic is shown with:

A timestamp
Source Host.Source Port
Destination Host.Destination Port
Packet flags
Other packet information

You can use the command to establish whether traffic is leaving your host destined for the target host and whether traffic is making its way back. If no inbound traffic appears, it may be that the host isn't responding, that there's no valid route from your host to the target or vice versa, or that a firewall is silently dropping the packets. If nothing is listening on the target TCP port, or a firewall actively rejects the type of packet you are sending, you will typically see an R in the packet flags field, indicating that the connection has been reset. For more information on the exact layout and format of a TCP packet, refer to RFC 793: Transmission Control Protocol (see Resources for a link).
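
To spot resets in a long trace, you can tally the flags field instead of reading every line. The sketch below assumes the older tcpdump text layout used in the listings, where the flags are the fifth field (for example P, F, R, or a dot for no flags); newer tcpdump versions print the flags differently, so the field test would need adjusting. flag_counts and reset_lines are hypothetical helper names.

```shell
# flag_counts (hypothetical helper): read tcpdump text output on stdin and
# count TCP packets per flags field. Non-TCP lines (icmp, udp, arp) and
# wrapped continuation lines do not match and are ignored.
flag_counts() {
    awk '$3 == ">" && $5 ~ /^[SFPRWE.]+$/ { count[$5]++ }
         END { for (f in count) print f, count[f] }' | LC_ALL=C sort
}

# reset_lines (hypothetical helper): print only TCP packets whose flags
# field contains R (reset).
reset_lines() {
    awk '$3 == ">" && $5 ~ /^[SFPRWE.]+$/ && $5 ~ /R/'
}
```

For example, tcpdump -i en2 host testhost | reset_lines shows only the resets exchanged with testhost, which is often enough to tell a refused connection from one that is simply never answered.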

Depending on the nature of the problem, it is sometimes good practice to run a tcpdump for a period of time while capturing packet information to a file using the -w switch. Once you feel you have captured enough data, press Control-C to stop the trace. At this point, you can process the file using the -r option to read the packet data captured. You can then use the vast array of switches, options, and Boolean arguments to analyze the data. Listing 20 shows an example of this process.

Listing 20. Capture packet data to a file and analyze it

# tcpdump -w /var/tmp/tcpdump.out -i en1

tcpdump: listening on en1

305 packets received by filter

0 packets dropped by kernel

# tcpdump -r /var/tmp/tcpdump.out host testhost

13:10:12.017777365 testhost.22 > myhost.47216: P 790304:790352(48) ack 1110769 win 24820

(DF) [tos 0x10]

13:10:12.129146164 myhost.47216 > testhost.22: P 135249:135297(48) ack 126560 win 17520

(DF) [tos 0x10]

13:10:12.129992465 testhost.22 > myhost.47216: P 790352:790416(64) ack 1110817 win 24820

(DF) [tos 0x10]

13:10:12.203827965 myhost.47216 > testhost.22: . ack 790416 win 17520 (DF) [tos 0x10]

13:11:35.707809458 myhost > testhost: icmp: echo request (DF)

13:11:35.709883978 testhost > myhost: icmp: echo reply (DF)

# tcpdump -r /var/tmp/tcpdump.out not port 22

13:11:35.707809458 myhost > testhost: icmp: echo request (DF)

13:11:35.709883978 testhost > myhost: icmp: echo reply (DF)

13:11:36.579874114 arp who-has 10.203.35.59 tell 10.203.35.57

13:11:37.077504208 0:2:16:9e:20:a 1:80:c2:0:0:0 0026 38:

4242 0300 0000 0000 8000 0002 1695 aecb

0000 0026 8000 0002 169e 2008 8017 0200

1400 0200 0f00

13:11:38.065119802 oraclehost.testdomain.com.2175 > myhost.tnslsnr: P 502:591(89) ack 421

win 64056

13:11:38.071526597 oraclehost.testdomain.com.2175 > myhost.tnslsnr: P 591:606(15) ack 548

win 63929

13:11:38.896664820 10.203.35.37.netbios-ns > 10.203.35.127.netbios-ns: udp 50

13:11:39.071526597 10.203.35.20.netbios-ns > 10.203.35.127.netbios-ns: udp 50
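
When a capture file is large, a summary of who is talking to whom is a useful starting point before you apply narrower filters. The sketch below counts packets per sender and receiver pair in tcpdump text output, assuming the src > dst: layout shown in Listing 20; talkers is a hypothetical helper name.

```shell
# talkers (hypothetical helper): read tcpdump text output on stdin and
# print "<count> <source> > <destination>" for each pair, busiest first.
# Lines without a ">" in the third field (arp, hex dumps, wrapped
# continuation lines) are ignored.
talkers() {
    awk '$3 == ">" {
             dst = $4
             sub(/:$/, "", dst)              # drop the trailing colon
             pair[$2 " > " dst]++
         }
         END { for (p in pair) print pair[p], p }' | LC_ALL=C sort -rn
}
```

For example, tcpdump -r /var/tmp/tcpdump.out | talkers lists the busiest conversations in the capture first.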

Conclusion

This article covered some of the AIX tools you can use to test connectivity to a host, extract useful network-related information about a host, and analyze data sent to and from a host. In the next article, you'll get under the covers to see what is really going on when your host has problems communicating with another. The article will conclude with a step-by-step guide to logical problem diagnosis when encountering network-related issues.