Monday, 20 June 2011

Error logging in AIX


The error logging process begins when an operating system module detects an error. The error-detecting segment of code then sends the error information either to a kernel service (errsave, or errlast when a system crash is pending) or to the errlog subroutine, which logs an application error. In each case the information is, in turn, written to the /dev/error special file.


This process then adds a time stamp to the collected data. The errdemon daemon constantly checks the /dev/error file for new entries, and when new data is written, the daemon conducts a series of operations.


Before an entry is written to the error log, the errdemon daemon compares the label sent by the kernel or application code to the contents of the error record template repository. If the label matches an item in the repository, the daemon collects additional data from other parts of the system.


To create an entry in the error log, the errdemon daemon retrieves the appropriate template from the repository, the resource name of the unit that detected the error, and detailed data. Also, if the error signifies a hardware-related problem and hardware Vital Product Data (VPD) exists, the daemon retrieves the VPD from the Object Data Manager (ODM). When you access the error log, either through SMIT or with the errpt command, the error log is formatted according to the error template in the error template repository and presented in either a summary or detailed report.


The system administrator can look at the error log to determine what caused a failure, or to periodically check the health of the system when it is running.
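
As a rough sketch of such a periodic check, the summary report from errpt can be tallied per resource to see which device is logging the most errors. The function name errors_by_resource is hypothetical, and the sketch assumes the default summary layout, with the resource name in the fifth column.

```shell
# Count error log entries per resource from errpt summary output on stdin.
# Assumes the default summary columns:
#   IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
errors_by_resource() {
    awk 'NR > 1 { n[$5]++ } END { for (r in n) print r, n[r] }' | sort
}

# On an AIX host:
#   errpt | errors_by_resource
```

Each output line is a resource name followed by its number of entries. A resource that stands out can then be examined in detail with errpt -a.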


The software components that allow the AIX kernel and commands to log errors to the error log are contained in the fileset bos.rte.serv_aid. This fileset is automatically installed as part of the AIX installation process.


The commands that allow you to view and manipulate the error log, such as the errpt and errclear commands, are contained in the fileset called bos.sysmgt.serv_aid.


Error log file processing


The error log is used by system administrators. Each entry contains an error ID, time stamp, error type, error class, and the resource name associated with the error.


Error templates

The error template contains numbers that correspond to error messages in the codepoint catalogue.















System health check in AIX


Use the errpt command to look at a summary error log report. Be on the lookout for recent additions to the log. Use the errpt -a command to examine any suspicious entries in detail.


Check disk space availability on the system with the df -k command. A full file system can cause a number of problems, so it is best to avoid the situation if at all possible. The two solutions to a full file system are either to delete some files to free up space, or to use the LVM to add space to the file system. The option you take will depend on the nature of the data in the file system, and whether there is any available space in the volume group.
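
The df -k check can be scripted along these lines. This is only a sketch: the function name full_filesystems and the 90% default threshold are assumptions, and it relies on the AIX df -k layout, where %Used is the fourth column and the mount point the seventh.

```shell
# List file systems whose %Used is at or above a threshold (default 90).
# Reads df -k output on stdin; assumes the AIX column layout:
#   Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        used = $4
        sub(/%/, "", used)
        # Skip pseudo file systems such as /proc, which report "-".
        if (used ~ /^[0-9]+$/ && used + 0 >= limit) print $7, $4
    }'
}

# On an AIX host:
#   df -k | full_filesystems 85
```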


Check volume groups for stale partitions with the lsvg command. If stale partitions, logical volumes, or physical volumes are reported, try to repair the situation with the syncvg command.
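
A minimal sketch of the stale-partition check follows. It assumes that lsvg prints a line containing "STALE PPs:" followed by a count for the volume group; the function name stale_pps is hypothetical, and rootvg is only an example volume group.

```shell
# Print the STALE PPs count found in "lsvg <vgname>" output on stdin.
# Assumes the count follows the words "STALE PPs:" on the same line.
stale_pps() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "STALE" && $(i + 1) == "PPs:") print $(i + 2)
    }'
}

# On an AIX host:
#   lsvg rootvg | stale_pps
#   syncvg -v rootvg      # resynchronize if the count is not 0
```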


Check system paging space with the lsps -s command. A system will not perform very well if it is low on paging space. In extreme circumstances, the system can terminate processes in order to solve the problem. Obviously, it is better that the system administrator ensures that there is enough paging space. You can either increase the size of existing paging space volumes, or add a new paging space volume. Again, the option you take will depend on the available space in the volume groups on the system.
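
The paging space check can be sketched as follows. It assumes that lsps -s prints one header line followed by one line holding the total size and the percentage used; the function names and the 70% default threshold are assumptions.

```shell
# Extract the "Percent Used" figure from lsps -s output on stdin.
# Assumes a header line followed by one line such as "512MB  72%".
paging_pct_used() {
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# Succeed when usage ($1) is at or above the threshold ($2, default 70).
paging_is_high() {
    [ "$1" -ge "${2:-70}" ]
}

# On an AIX host:
#   pct=$(lsps -s | paging_pct_used)
#   paging_is_high "$pct" && echo "paging space usage is ${pct}%"
```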

Check if all expected subsystems are running with the lssrc -a command.



Check the networking by trying to ping a well-known address. If you cannot
ping the address,












Reducing high paging space usage without rebooting the server (AIX 5.0 and above)


1) Create a new paging space of the same size as the old one. When using smitty pgsp, select:
Start using this paging space NOW? yes
Use this paging space each time the system is RESTARTED? yes
2) Deactivate the original paging space. It is better to do this from the command line. For example, if /dev/hd6 is the old paging space:
# nohup swapoff /dev/hd6 > /tmp/swapoff.log 2>&1 &
3) While step 2 is still running, check paging space usage:
# lsps -a
The usage of the new paging space grows gradually while that of the old one decreases.
4) Without waiting for step 2 to finish, change the original paging space (hd6) to:
Use this paging space each time the system is RESTARTED? no
5) After step 2 finishes, activate the original paging space (hd6).
6) (Optional) Remove the original paging space. (Leave it if you want to swap paging spaces again next time.)
OR
1) Create a new paging space of the same size as the old one. When using smitty pgsp, select:
Start using this paging space NOW? yes
Use this paging space each time the system is RESTARTED? yes
2) Deactivate the original paging space using smitty and wait for an hour.
3) Reactivate the original paging space.
4) Remove the paging space you created.
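
For readers who prefer the command line to smitty, the first procedure maps roughly onto the commands printed by the sketch below. The function only prints the commands and does not run them. Its name plan_paging_swap and its arguments (old paging space, volume group, size in logical partitions) are hypothetical. mkps chooses and prints the name of the new paging space itself.

```shell
# Print, without running them, the commands behind steps 1 to 5.
# $1 = old paging space (e.g. hd6), $2 = volume group,
# $3 = size of the new paging space in logical partitions.
plan_paging_swap() {
    old_ps=$1
    vg=$2
    lps=$3
    cat <<EOF
mkps -a -n -s $lps $vg
nohup swapoff /dev/$old_ps > /tmp/swapoff.log 2>&1 &
lsps -a
chps -a n $old_ps
swapon /dev/$old_ps
EOF
}

# Example: plan_paging_swap hd6 rootvg 8
```

Here mkps -a -n corresponds to answering yes to both smitty questions, and chps -a n to answering no to the restart question. Run swapon only after swapoff has finished. An unwanted paging space can be removed afterwards with rmps once it is inactive.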

Problems with file sizes over 1GB


If a user has a problem creating a file over 1GB, check the limits set for that user ID in /etc/security/limits.
smitty users
Change / Show Characteristics of a User
Type in the user name (e.g. neo)
* User NAME [neo] +
Scroll down to
Soft FILE size
and set a higher limit, or set it to unlimited (-1).
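
The soft file size is the fsize attribute in /etc/security/limits, expressed in 512-byte blocks. The usual default of 2097151 blocks is just under 1GB, which is where the limit comes from. The small helper below (a sketch, with the hypothetical name fsize_to_mb) converts an fsize value to megabytes. The commented lines show the command-line equivalent of the smitty steps, using the example user neo.

```shell
# Convert an fsize value (512-byte blocks) to whole megabytes.
# A value of -1 means unlimited.
fsize_to_mb() {
    if [ "$1" -eq -1 ]; then
        echo unlimited
    else
        echo $(( $1 * 512 / 1024 / 1024 ))
    fi
}

# fsize_to_mb 2097151   prints 1023
#
# On an AIX host:
#   lsuser -a fsize neo     # show the current soft limit
#   chuser fsize=-1 neo     # set it to unlimited
```

The user has to log in again before a changed limit takes effect.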

If the %Used figure for a file system does not go down after you delete files, the deleted files may still be held open by running processes. Run the following command to list them, for example for the /var file system:


d02http003:/# fuser -dV /var
/var:
inode=4123 size=9567989 fd=6 13938
inode=6175 size=19190691 fd=4 29172
Then identify the process behind each ID by running:
# ps -deaf | grep 29172
root 29172 8780 0 Dec 06 - 144:09 /usr/lpp/ssp/bin/pmand
You can kill that process to release the space. In this example it is restarted automatically, but verify that it has in fact restarted.

If the wtmp file (/var/adm/wtmp) is very large and you want to compress it, follow the instructions below.


1. run: stopsrc -s syslogd
2. run: compress -v wtmp
3. you will see that the wtmp.Z file has been created
4. rename this file by running: mv wtmp.Z /tmp/wtmp_mm_dd_yyyy.Z
5. run: touch wtmp (to create a new wtmp file)
5a. chown adm wtmp
chgrp adm wtmp
chmod 664 wtmp
6. run: startsrc -s syslogd

If the same error messages are still being sent to GWA root after you replace the hardware, do the following:



1. cd /usr/bin
2. run: errpt -a > /tmp/errpt_mm_dd_yyyy (to save a copy of the current error log)
3. run: errclear 0
4. run: /usr/lib/errstop
5. run: rm /var/adm/ras/errlog
6. run: /usr/lib/errdemon
7. Perform the log repair action for the hardware that has been replaced by running diag --> task selection --> log repair action --> choose the device that has been replaced, press F7 to commit, and then exit