Senior Software Engineer - Site Reliability
Posted on: May 3, 2021
As a member of the SRE team you will
work with other SRE and DevOps practitioners to produce
mission-critical infrastructure, tools, and processes that will
ensure the highest levels of availability and reliability of all our
websites, systems and services. As a senior member of the team you
will be expected to work with management, peers, and customers to
define and implement the technical vision of the team.
You're right for the job if you're
comfortable with deep technical Linux and networking topics and
distributed architectures. You will work cross-functionally amongst
a variety of teams and be a core contributor in every significant
engineering service or solution that we deliver to our
stakeholders. You'll excel if you have enthusiasm for digging deep,
and a flair for sharp technical communication, prioritization and
organization. You will work directly with our Software Engineering
teams to build our next-generation “always up” cloud-based
e-commerce/Retail and Enterprise platform.
Site Reliability Engineers are hybrid
systems and software engineers who are responsible for, and take
ownership of, reliability, scalability, automation, and other
issues related to uptime and availability of Walmart’s
e-commerce/Retail and Enterprise platform. Our goal is to build,
scale and guard the systems that delight our customers. To do so,
you will need strong skills in the following areas:
- Design, write and build tools to
improve the reliability, latency, availability and scalability of
Walmart e-commerce/Retail and Enterprise products.
- Engender reliability and availability
starting with metrics and measurements
- Enable scaling by providing tools,
developing training and/or augmenting processes
- Build tools/automate to prevent
re-occurrence of problem to mission critical
- Augment existing instrumentation to
build a cohesive picture of the characteristics of our systems with
special attention to points of failure.
- Participate in capacity planning,
demand forecasting, software performance analysis and system
- Develop a deep understanding of the
various services and applications that come together to deliver
Walmart e-commerce/Retail and Enterprise products.
- Design new monitoring tools and smart
alerts that help discover failures/issues in a timely fashion, and
work with engineers to identify root cause and fix
- Influence, design and create new
architectures, standards and methods for large-scale enterprise
- Perform root-cause analysis of complex problems
involving multiple parties, networks, hardware and software that
relate to scaling and performance
- Participate in on-call
- Secure the system from issues, be they
real, perceived or notional
- High focus on collecting and inferring
metric documentation to be used by others to build and maintain
- Scripting and Development
- Experience with configuration
management tools such as Ansible, Saltstack, Chef and
- Build and drive the automation systems
that maintain system health
- Eliminate single points of failure and
test disaster recovery and HA regularly.
Additional responsibilities may include:
- Drives standardization and service-focused
instrumentation. Provides subject matter expertise.
Resolves break/fix scenarios, engaging broader teams as necessary,
and partners/leads to achieve continuous improvement. Contributes
to command-and-control activities focused on the rapid restoration
of service during complex outages. Participates in a 24/7 on-call
rotation. May work independently or as part of a team on more
complex projects. Provides mentoring and guidance to more junior
- Creates systems engineering and
architectural ... Develops software in several modern languages. Develops
large/complex database-backed systems and has an understanding of
DB schema and query performance. Utilizes professional best
practices in day-to-day work, such as revision control and unit
testing. Applies statistical data analysis techniques.
- Networking responsibilities:
Understands and performs packet captures with tcpdump, snoop, and other network
sniffers. Understands and applies knowledge of most protocols
(TCP/IP, HTTP, UDP, etc.)
- Application Technologies: Provides
recommendations and advice to the team and/or department in the
areas of web services, OS, and storage, including being an active
liaison to Development, QA and the Business.
- Analyzes systems and makes
recommendations to prevent possible problems. Takes lead on issue
resolution activities using knowledge of complex and company-wide
- Lead end-to-end audit of monitors and
alarms based on subsystem knowledge.
- Utilizes time management and project
management skills to lead the resolution of issues in a timely and
organized manner, effectively communicating necessary information.
May consult directly with developers or third party vendors;
provides subject matter expertise.
- Consistent exercise of independent
judgment and discretion in matters of significance.
- Other duties and responsibilities as
- 6+ years in a software development,
DevOps, or SRE role.
- Experience in designing,
investigating, analyzing and troubleshooting large-scale enterprise
- Methodical and systematic
problem-solving approach, combined with a solid sense of ownership,
initiative and drive.
- Fluency with running services at
scale; in-depth understanding of Unix systems internals and
- Networking knowledge and in-depth
understanding of network concepts, such as different protocols
(TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI
layers, and load balancing.
- Understanding of Unix/Linux systems
from kernel to shell and beyond, taking in system libraries, file
systems, and client-server protocols along the way. Experience
administering Linux systems in a production environment
- Programming experience in one or more
of the following languages: Go, Java, Python, Ruby,
- Bachelor's Degree in Computer Science
or a related field, or relevant work experience
- Experience with distributed version
control like Git or similar
- Experience with IaaS and PaaS
providers such as AWS, Azure, OpenStack, GCP
- Experience with containerization and
container platforms (e.g. Docker, Kubernetes, Docker EE,
- Experience with enterprise monitoring
solutions like Dynatrace, AppDynamics, New Relic, Prometheus,
Graphite, Grafana, Nagios, Sensu and Splunk
- Familiarity with continuous
integration/deployment processes and tools such as Jenkins, Maven,
Below are the required minimum qualifications for this position. If
none are listed, there are no minimum qualifications.
Bachelor’s degree in Computer Science and 3 years’ experience in
software engineering or related field OR 5 years’ experience in
engineering or related field.
Below are the optional preferred qualifications for this position.
If none are listed, there are no preferred qualifications.
Master’s degree in Computer Science or related field and 2 years'
experience in software engineering or related field
805 SE MOBERLY LN, BENTONVILLE, AR 72712,
United States of America