This job can be performed remotely, anywhere in the US.
MediaMath’s Platform Operations team handles all the challenges of a real-time advertising stack. We run a hybrid environment, with multiple globally placed on-prem datacenters and the AWS cloud, supporting a broad range of services - from low latency bidding processes handling millions of transactions per second, through to big data storage and analytics, and client-facing UI and reporting solutions. Each has its own unique operational challenges, and our team is a key partner in ensuring these workloads are managed in stable, scalable ways.
As part of the Platform Operations team, the Systems Engineer (data management) will be responsible for a range of globally distributed production pipelines and datastores, handling our most business-critical datasets. This includes a network of Kafka clusters and multiple large NoSQL database clusters, as well as the support of other data solutions in partnership with existing SMEs. This engineer will be hands-on at all levels, from day-to-day care and feeding, to fortifying and modernizing existing offerings, building solutions for management and monitoring, and helping guide overall organizational strategy.
ESSENTIAL DUTIES AND RESPONSIBILITIES
- Build, maintenance and administration of production Kafka clusters and related services (AWS and on-prem)
- Monitor and regularly assess capacity needs for the global Kafka footprint, with consideration of upcoming roadmap items.
- Monitor health of Kafka clusters and associated services, and implement appropriate alerting.
- Migration of existing tooling for Kafka management into company standard toolsets (Chef, Ansible, CircleCI)
- Support application teams with Kafka development efforts.
- Back-up support of production Hadoop clusters, as requested by SME.
- Build, maintenance & administration of business critical database clusters (Scylla, Cassandra, Aerospike)
- Work with development teams to ensure cluster architectures support business needs, including replication strategy, hardware/site redundancy, failover testing, etc.
- Monitor health and regularly assess capacity needs for production databases
- Monitor database cluster health and implement appropriate alerting
- Support hardware strategy for database services, including speccing, testing, maintenance and tuning.
- Consult on the design of new database offerings, and provide recommendations on strategy.
- Back-up support of production PostgreSQL clusters, as requested by SME
- Conduct training sessions to share knowledge with peers and development groups
- Act as in-team SME for data pipelines & storage – providing guidance and oversight to others in-team, and across the development community.
- Communicate current status of all projects, problems, and issues to the department management team
- Support audit and compliance efforts, and initiate corrective action when appropriate for remediation
- Participate in on-call rota as part of Platform Operations team.
Qualifications
Ops basics:
- Proficiency with Linux system administration (Debian, Ubuntu, CentOS)
- Proficiency with basic AWS administration (IAM, EC2, Networking, cost analysis)
- Proficiency with scripting & basic coding (Python, Ruby, Go)
- Understanding of networking fundamentals, including application layer protocols (HTTP, SSH, SSL), load balancing solutions (lvs, nginx), and DNS
Role specific experience:
- Extensive production experience with at least one distributed data processing platform (Kafka strongly preferred, Hadoop valuable)
- Extensive production experience with at least one NoSQL database (Scylla preferred)
- Experience with relational database software (PostgreSQL preferred)
- Experience supporting low-latency, globally distributed services at scale.
- Experience working with private datacenter infrastructure (“on-prem” servers)
Other nice-to-have:
- Experience leveraging config management toolsets (Chef, Salt, Ansible)
- Experience leveraging common deployment toolsets (CircleCI, Jenkins, Artifactory)
- Experience collecting and analyzing metrics for service level monitoring using Prometheus and Grafana
- Working knowledge of Kubernetes-based deployments
SKILLS
- Practical approach to real world problems, with “hands-on” approach to solutions.
- Ability to think strategically, understand business context, and make collaborative decisions
- Willing to gather information & recommend courses of action clearly and confidently.
- Fosters open communication, speaks with impact, listens to others, and writes effectively
- Comfortable partnering with internal clients in development teams to address issues, consult on solutions and plan for future needs.
- Ability to communicate sometimes complex ideas to non-technical stakeholders, including product and support teams.
- Desire to mentor and provide guidance to junior engineers, both technically and professionally.
- Willingness to adhere to, streamline and help improve team processes for work tracking, knowledge sharing, incident response, and cross-org communication.