Maximize your productivity as an SRE with these 100 Linux commands 💪
A Site Reliability Engineer (SRE) is responsible for the reliability, performance, and uptime of an organization’s systems and applications. SREs work closely with development teams to design, build, and maintain scalable systems, combining software engineering and operations skills to keep them reliable and fast.
Here are some tips for debugging as an SRE:
Gather as much information as possible: Before starting to debug, collect everything you can about the problem. This might include logs, error messages, system metrics, and any relevant configuration files.
Break down the problem: Identify the specific component or subsystem that is causing the problem and focus on that. This narrows the scope and makes troubleshooting easier.
Use tools and resources: Many tools can help with debugging, such as log analysis tools, monitoring systems, and system profilers. Use them to identify the cause of the problem.
Create test cases: Create test cases that isolate the problem and verify that it has been fixed. This confirms that the problem is resolved and helps ensure that it does not recur.
Collaborate with others: Don’t be afraid to ask for help or collaborate with others. Other people often bring fresh perspectives and ideas, which can help you find a solution more quickly.
Document your work: Document the steps you took to debug the problem and the solution you implemented. This helps you understand the problem better and makes similar issues easier to troubleshoot in the future.
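As a rough illustration of the "gather information" and "document your work" tips above, the sketch below saves a snapshot of system state into a directory before anything is changed. The function name `collect_snapshot` and the output file names are hypothetical, and some of the tools it calls (such as `free` or `dmesg`) may be missing or need extra privileges on a given host, so each step is allowed to fail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save a quick snapshot of system state into a directory
# so the evidence is preserved before anything is restarted or changed.
collect_snapshot() {
  local out_dir="$1"
  mkdir -p "$out_dir" || return 1
  # Each command may fail (missing tool, no permission) without aborting
  # the rest of the collection; errors end up in the output file.
  date   > "$out_dir/date.txt"   2>&1 || true
  uptime > "$out_dir/uptime.txt" 2>&1 || true
  df -h  > "$out_dir/df.txt"     2>&1 || true
  free -m > "$out_dir/free.txt"  2>&1 || true
  ps aux > "$out_dir/ps.txt"     2>&1 || true
  { dmesg | tail -n 100; } > "$out_dir/dmesg.txt" 2>&1 || true
}
```

One way to call it during an incident is `collect_snapshot "/tmp/incident-$(date +%Y%m%d-%H%M%S)"`, which keeps each snapshot in its own timestamped directory.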
Best Practices for SRE
Monitor and alert on key performance indicators (KPIs): Track KPIs for your systems and applications, and set up alerts that notify you when something goes wrong. This helps you identify and resolve problems before they become critical.
Automate repeatable tasks: Automate as many repeatable tasks as possible to reduce the risk of human error and improve efficiency. This might include provisioning new servers, deploying code updates, or creating backups.
Use version control: Use version control to manage and track changes to your systems and applications. This makes it easy to roll back changes if something goes wrong and to collaborate with other team members.
Implement a change management process: Ensure that all changes to your systems and applications are documented, tested, and reviewed before being deployed. This minimizes the risk of errors and downtime.
Use testing and staging environments: Validate changes in testing and staging environments before deploying them to production. This helps you identify and fix problems before they impact your users.
Practice disaster recovery: Regularly rehearse disaster recovery scenarios so that you are prepared to handle unexpected events. This might include restoring from backups, testing failover processes, and simulating outages.
Document processes and procedures: Document processes and procedures for common tasks, such as troubleshooting, deployment, and maintenance. This helps you and other team members understand how your systems work and how to fix problems when they arise.
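To make the monitoring and automation practices above a little more concrete, here is a minimal sketch of a disk-usage check that could be run on a schedule. The function name `check_disk_usage`, its exit codes, and the idea of using disk usage as the KPI are assumptions for illustration; a real setup would more likely rely on a dedicated monitoring system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the filesystem holding PATH is at or below
# THRESHOLD percent used; otherwise print a warning and return 1.
check_disk_usage() {
  local threshold="$1" path="${2:-/}" used
  # df -P prints POSIX-format output; field 5 of line 2 is the usage, e.g. "42%".
  used="$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')"
  case "$used" in
    ''|*[!0-9]*) echo "could not read usage for $path" >&2; return 2 ;;
  esac
  if [ "$used" -gt "$threshold" ]; then
    echo "WARNING: $path is ${used}% full (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "OK: $path is ${used}% full"
}
```

Calling `check_disk_usage 90 /var` from a cron job and acting on a non-zero exit status would be one simple way to turn the check into an alert.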
Here is a list of 100 Linux commands that an SRE (Site Reliability Engineer) might find useful, along with brief descriptions and code examples for each:
- ls - List directory contents
ls
- cd - Change directory
cd /path/to/directory
- pwd - Print working directory
pwd
- mkdir - Create a new directory
mkdir new_directory
- rmdir - Remove an empty directory
rmdir directory_to_remove
- cp - Copy files and directories
cp file1 file2
cp -r directory1 directory2
- mv - Move or rename files and directories
mv file1 file2
mv directory1 directory2
- rm - Remove files and directories
rm file1 file2
rm -r directory1 directory2
- touch - Create a new file or update the timestamp of an existing file
touch new_file
- echo - Print a message or the value of a variable to the terminal
echo "Hello, world!"
- cat - Display the contents of a file
cat file1 file2
- less - View the contents of a file one page at a time
less file
- head - Display the first few lines of a file
head file
- tail - Display the last few lines of a file
tail file
- grep - Search for a pattern in a file or stream
grep "pattern" file
- find - Search for files and directories
find /path/to/search -name "pattern"
- sort - Sort the lines of a file or stream
sort file
- uniq - Remove adjacent duplicate lines from a file or stream (sort the input first to remove all duplicates)
sort file | uniq
- wc - Count the number of lines, words, and bytes in a file or stream
wc file
- chmod - Change the permissions of a file or directory
chmod u+x file
- chown - Change the owner of a file or directory
chown owner:group file
- diff - Compare the contents of two files
diff file1 file2
- patch - Apply a patch file to modify the contents of a file
patch file < patch_file
- tar - Create or extract a tar archive
tar -cvf archive.tar file1 file2
tar -xvf archive.tar
- gzip - Compress or decompress a file using gzip
gzip file
gunzip file.gz
- curl - Transfer data using various network protocols
curl https://www.example.com
- wget - Download a file from the web
wget https://www.example.com/file.txt
- scp - Securely copy files between hosts
scp file user@remote:/path/to/destination
- rsync - Synchronize files and directories between hosts
rsync -avz source/ user@remote:/path/to/destination/
- ssh - Connect to a remote host using a secure shell
ssh user@remote
- ping - Test the reachability of a host
ping www.example.com
- traceroute - Trace the route packets take to a destination host
traceroute www.example.com
- nslookup - Query DNS to obtain information about a host
nslookup www.example.com
- dig - Query DNS to obtain detailed information about a host
dig www.example.com
- host - Query DNS to obtain information about a host
host www.example.com
- whois - Look up registration information about a domain name or IP address
whois example.com
- nmap - Scan networks for hosts and services
nmap -sS 192.168.0.0/24
- tcpdump - Capture and analyze network traffic
tcpdump -i eth0
- nc - Connect to or listen for network connections
nc -l 1234
nc www.example.com 80
- telnet - Connect to a remote host using the telnet protocol
telnet www.example.com 80
- ftp - Transfer files using the FTP protocol
ftp ftp.example.com
- sftp - Transfer files securely using the SFTP protocol
sftp user@remote
- rlogin - Connect to a remote host using the rlogin protocol (legacy and unencrypted; prefer ssh)
rlogin user@remote
- rsh - Connect to a remote host using the rsh protocol (legacy and unencrypted; prefer ssh)
rsh user@remote
- top - Display real-time information about running processes
top
- ps - Display information about running processes
ps aux
- kill - Send a signal to a process (SIGTERM by default; -9 sends SIGKILL, which the process cannot catch)
kill -9 12345
- killall - Terminate all processes with a specific name
killall process_name
- nice - Run a program with a modified scheduling priority
nice -n 19 command
- cron - Schedule tasks to be run automatically
crontab -e
- at - Schedule a command to be run at a specific time
at now +1 hour
- screen - Create and manage multiple terminal sessions
screen -S session_name
- tmux - Create and manage multiple terminal sessions
tmux new -s session_name
- htop - Display real-time information about running processes with an interactive interface
htop
- iotop - Display real-time information about I/O usage by processes
iotop
- lsof - List open files and the processes that have them open
lsof
- df - Display information about available disk space
df -h
- du - Estimate the space used by a file or directory
du -sh /path/to/directory
- fuser - Identify processes using a specific file or filesystem
fuser /path/to/file
- chroot - Run a command or shell with a different root directory
chroot /new/root command
- chkconfig - Manage system service startup links
chroot /new/root command- chkconfig - Manage system service startup links
chkconfig --list
chkconfig service_name on- systemctl - Manage system services and daemons
systemctl list-units
systemctl start service_name- service - Manage system services
service --status-all
service service_name start- init - Manage system initialization and runlevel changes
init 3- reboot - Reboot the system
reboot- shutdown - Shut down the system
shutdown -h now- date - Display or set the system date and time
date
date -s "2 OCT 2006 18:00:00"- timedatectl - Manage the system time and timezones
timedatectl
timedatectl set-timezone America/New_York- hwclock - Manage the system hardware clock
hwclock
hwclock --systohc- ntpdate - Set the system clock using NTP
ntpdate pool.ntp.org- ntpq - Query NTP servers
ntpq -p- ntpd - Synchronize the system clock using NTP
ntpd -q- syslogd - System logging daemon
syslogd -f /etc/syslog.conf- rsyslogd - Enhanced system logging daemon
rsyslogd -f /etc/rsyslog.conf- journalctl - Query and display the system journal
journalctl
journalctl -u service_name- dmesg - Display kernel ring buffer messages
dmesg- ulimit - Control process resource limits
ulimit -n 1024
- free - Display information about memory usage
free -m
- vmstat - Display information about virtual memory usage
vmstat
- iostat - Display information about I/O usage
iostat
- mpstat - Display information about CPU usage
mpstat
- sar - Collect and report system performance statistics
sar
- uptime - Display system uptime and load average
uptime
- last - Display information about previous logins
last
- w - Display logged-in users and what they are running
w
- who - Display information about logged-in users
who
- finger - Display information about users
finger user
- id - Display the user and group IDs of a user
id user
- groups - Display the groups a user is a member of
groups user
- passwd - Modify a user’s password
passwd
- adduser - Add a new user to the system
adduser new_user
- useradd - Add a new user to the system
useradd new_user
- deluser - Remove a user from the system
deluser user
- userdel - Remove a user from the system
userdel user
- groupadd - Add a new group to the system
groupadd new_group
- groupdel - Remove a group from the system
groupdel group
- visudo - Edit the sudoers file
visudo
- sudo - Execute a command as another user (root by default)
sudo command
- renice - Modify the scheduling priority of a running process
renice -n 19 -p 12345
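Many of the commands above are most useful when chained together. The sketch below combines grep, sort, uniq, and head to summarize the most frequent error lines in a log file. The function name `top_errors`, the literal "ERROR" marker, and the log format are assumptions for illustration, so adjust the pattern to whatever your logs actually contain.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N most frequent lines containing "ERROR",
# each prefixed with the number of times it occurs.
top_errors() {
  local log_file="$1" count="${2:-5}"
  # grep selects matching lines, sort groups identical lines together,
  # uniq -c counts each group, sort -rn orders by count (highest first).
  grep "ERROR" "$log_file" | sort | uniq -c | sort -rn | head -n "$count"
}
```

For example, `top_errors /var/log/app.log 10` would list the ten most common error lines, assuming an application log exists at that (hypothetical) path.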